Re: [zfs-discuss] ZFS on Fit-PC Slim?

2008-11-06 Thread Jonathan Hogg
On 6 Nov 2008, at 04:09, Vincent Fox wrote:

> According to the slides I have seen, a ZFS filesystem even on a  
> single disk can handle massive amounts of sector failure before it  
> becomes unusable.   I seem to recall it said 1/8th of the disk?  So  
> even on a single disk the redundancy in the metadata is valuable.   
> And if I don't have really very much data I can set copies=2 so I  
> have better protection for the data as well.
>
> My goal is a compact low-powered and low-maintenance widget.   
> Eliminating the chance of fsck is always a good thing now that I  
> have tasted ZFS.

In my personal experience, disks are more likely to fail completely  
than suffer from small sector failures. But don't get me wrong,  
provided you have a good backup strategy and can afford the downtime  
of replacing the disk and restoring, then ZFS is still a great  
filesystem to use for a single disk.

Don't be put off. Many of the people on this list are running
multi-terabyte enterprise solutions and are unable to think in terms of
non-redundant, small numbers of gigabytes :-)

> I'm going to try and see if Nevada will even install when it  
> arrives, and report back.  Perhaps BSD is another option.  If not I  
> will fall back to Ubuntu.

I have FreeBSD and ZFS working fine(*) on a 1.8GHz VIA C7 (32bit)  
processor. Admittedly this is with 2GB of RAM, but I set aside 1GB for  
ARC and the machine is still showing 750MB free at the moment, so I'm  
sure it could run with 256MB of ARC in under 512MB. 1.8GHz is a fair  
bit faster than the Geode in the Fit-PC, but the C7 scales back to  
900MHz and my machine still runs acceptably at that speed (although I  
wouldn't want to buildworld with it).

I say, give it a go and see what happens. I'm sure I can still dimly  
recall a time when 500MHz/512MB was a kick-ass system...

Jonathan


(*) This machine can sustain 110MB/s off of the 4-disk RAIDZ1 set,  
which is substantially more than I can get over my 100Mb network.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs-auto-snapshot default schedules

2008-09-25 Thread Jonathan Hogg
On 25 Sep 2008, at 17:14, Darren J Moffat wrote:

> Chris Gerhard has a zfs_versions script that might help: 
> http://blogs.sun.com/chrisg/entry/that_there_is

Ah. Cool. I will have to try this out.

Jonathan



Re: [zfs-discuss] zfs-auto-snapshot default schedules

2008-09-25 Thread Jonathan Hogg
On 25 Sep 2008, at 14:40, Ross wrote:

> For a default setup, I would have thought a years worth of data  
> would be enough, something like:

Given that this can presumably be configured to suit everyone's  
particular data retention plan, for a default setup, what was  
originally proposed seems obvious and sensible to me.

Going slightly off-topic:

All this auto-snapshot stuff is ace, but what's really missing, in my  
view, is some easy way to actually determine where the version of the  
file you want is. I typically find myself futzing about with diff  
across a dozen mounted snapshots trying to figure out when the last  
good version is.

It would be great if there was some way to know if a snapshot contains  
blocks for a particular file, i.e., that snapshot contains an earlier  
version of the file than the next snapshot / now. If you could do that  
and make ls support it with an additional flag/column, it'd be a real  
time-saver.

The current mechanism is especially hard as the auto-mount dirs can  
only be found at the top of the filesystem so you have to work with  
long path names. An fs trick to make .snapshot dirs of symbolic links  
appear automagically would rock, i.e.,

% cd /foo/bar/baz
% ls -l .snapshot
[...] nightly.0 -> /foo/.zfs/snapshot/nightly.0/bar/baz
% diff {,.snapshot/nightly.0/}importantfile

Yes, I know this last command can just be written as:

% diff /foo/{,.zfs/snapshot/nightly.0/}bar/baz/importantfile

but this requires me to a) type more; and b) remember where the top of  
the filesystem is in order to split the path. This is obviously more  
of a pain if the path is 7 items deep, and the split means you can't  
just use $PWD.
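
The path split described above could be automated with a small helper. This is only a minimal sketch in Python, assuming the top of the filesystem is already known and passed in as fs_root; the function name snapshot_path is hypothetical and not part of any ZFS tool.

```python
from pathlib import PurePosixPath

def snapshot_path(path, fs_root, snapshot):
    """Map a live path to the same file inside a named snapshot.

    fs_root is assumed to be the mount point of the ZFS filesystem
    that contains path.
    """
    path = PurePosixPath(path)
    fs_root = PurePosixPath(fs_root)
    # relative_to raises ValueError if path is not under fs_root
    relative = path.relative_to(fs_root)
    return str(fs_root / ".zfs" / "snapshot" / snapshot / relative)

print(snapshot_path("/foo/bar/baz/importantfile", "/foo", "nightly.0"))
```

The result can be handed straight to diff alongside the live path, which removes the need to split the path by hand.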

[My choice of .snapshot/nightly.0 is a deliberate nod to the  
competition ;-)]

Jonathan



Re: [zfs-discuss] Announcement: The Unofficial Unsupported Python ZFS API

2008-07-14 Thread Jonathan Hogg
On 14 Jul 2008, at 16:07, Will Murnane wrote:

> As long as I'm composing an email, I might as well mention that I had
> forgotten to mention Swig as a dependency (d'oh!).  I now have a
> mention of it on the page, and a spec file that can be built using
> pkgtool.  If you tried this before and gave up because of a missing
> package, please give it another shot.

Not related to the actual API itself, but just thought I'd note that  
all the cool kids are using ctypes these days to bind Python to  
foreign libraries.

http://docs.python.org/lib/module-ctypes.html

This has the advantage of requiring no other libraries and no compile  
phase at all.
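
As a rough illustration of the ctypes approach, the sketch below calls strlen from the C library directly from Python. The C library is used purely as a stand-in here; this does not bind any ZFS library, and it assumes a POSIX system.

```python
import ctypes
import ctypes.util

# Locate and load the C library. find_library may return None on some
# systems; on POSIX, CDLL(None) then falls back to the running process.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s)
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

length = libc.strlen(b"zfs-discuss")
print(length)
```

The binding is described entirely at run time, so there is no wrapper to generate or compile.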

Jonathan



Re: [zfs-discuss] SATA controller suggestion

2008-06-09 Thread Jonathan Hogg
On 9 Jun 2008, at 14:59, Thomas Maier-Komor wrote:

>> time gdd if=/dev/zero bs=1048576 count=10240 of=/data/video/x
>>
>> real 0m13.503s
>> user 0m0.016s
>> sys  0m8.981s
>>
>>
>
> Are you sure gdd doesn't create a sparse file?

One would presumably expect it to be instantaneous if it were creating
a sparse file. It's not a compressed filesystem though, is it?
/dev/zero tends to be fairly compressible ;-)

I think, as someone else pointed out, running zpool iostat at the same  
time might be the best way to see what's really happening.
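
Another check, after the fact, is to compare the file's logical size with the space it actually occupies. This is a rough sketch, assuming a POSIX system where os.stat reports st_blocks; the helper name allocation is made up for illustration. An allocated figure far below the logical size would be consistent with either a sparse or a compressed file, so it cannot tell the two apart on its own.

```python
import os
import tempfile

def allocation(path):
    """Return (logical size, bytes allocated on disk) for path."""
    info = os.stat(path)
    # st_blocks counts 512-byte units on POSIX systems
    return info.st_size, info.st_blocks * 512

# Write 1MB of zeros to a scratch file, much as dd from /dev/zero would
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"\0" * (1024 * 1024))

logical, allocated = allocation(handle.name)
print(f"logical={logical} allocated={allocated}")
os.unlink(handle.name)
```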

Jonathan



Re: [zfs-discuss] zfs equivalent of ufsdump and ufsrestore

2008-05-30 Thread Jonathan Hogg
On 30 May 2008, at 15:49, J.P. King wrote:

> For _my_ purposes I'd be happy with zfs send/receive, if only it was
> guaranteed to be compatible between versions.  I agree that the inability
> to extract single files is an irritation - I am not sure why this is
> anything more than an implementation detail, but I haven't gone into it in
> depth.

I would presume it is because zfs send/receive works at the block  
level, below the ZFS POSIX layer - i.e., below the filesystem level. I  
would guess that a stream is simply a list of the blocks that were  
modified between the two snapshots, suitable for "re-playing" on  
another pool. This means that the stream may not contain your entire  
file.

An interesting point regarding this is that send/receive will be  
optimal in the case of small modifications to very large files, such  
as database files or large log files. The actual modified/appended  
blocks would be sent rather than the whole changed file. This may be  
an important point depending on your file modification patterns.
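
To make the guess above concrete, here is a toy model of sending only the changed blocks. It compares two versions of a byte string in fixed-size blocks; the function name and the tiny block size are illustrative assumptions, and it says nothing about how ZFS actually identifies modified blocks.

```python
def changed_blocks(old, new, block_size=4):
    """Return indices of fixed-size blocks that differ between old and new."""
    count = (max(len(old), len(new)) + block_size - 1) // block_size
    return [
        i for i in range(count)
        if old[i * block_size:(i + 1) * block_size]
        != new[i * block_size:(i + 1) * block_size]
    ]

old = b"AAAABBBBCCCCDDDD"
new = b"AAAABxBBCCCCDDDDEEEE"   # one block modified, one block appended
print(changed_blocks(old, new))
```

Only two of the five blocks would need to travel in this model, however large the untouched remainder of the file is.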

Jonathan



Re: [zfs-discuss] zfs equivalent of ufsdump and ufsrestore

2008-05-29 Thread Jonathan Hogg
On 29 May 2008, at 17:52, Chris Siebenmann wrote:

> The first issue alone makes 'zfs send' completely unsuitable for the
> purposes for which we currently use ufsdump. I don't believe that we've
> lost a complete filesystem in years, but we restore accidentally deleted
> files all the time. (And snapshots are not the answer, as it is common
> that a user doesn't notice the problem until well after the fact.)
>
> ('zfs send' to live disks is not the answer, because we cannot afford
> the space, heat, power, disks, enclosures, and servers to spin as many
> disks as we have tape space, especially if we want the fault isolation
> that separate tapes give us. Most especially if we have to build a
> second, physically separate machine room in another building to put the
> backups in.)

However, the original poster did say they were wanting to back up to
another disk and said they wanted something lightweight/cheap/easy.  
zfs send/receive would seem to fit the bill in that case. Let's answer  
the question rather than getting into an argument about whether zfs  
send/receive is suitable for an enterprise archival solution.

Using snapshots is a useful practice as it costs fairly little in  
terms of disk space and provides immediate access to fairly recent,  
accidentally deleted files. If one is using snapshots, sending the  
streams to the backup pool is a simple procedure. One can then keep as  
many snapshots on the backup pool as necessary to provide the amount  
of history required. All of the files are kept in identical form on  
the backup pool for easy browsing when something needs to be restored.  
In the event of catastrophic failure of the primary pool, one can quickly
move the backup disk to the primary system and import it as the new  
primary pool.

It's a bit-perfect incremental backup strategy that requires no  
additional tools.
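
One cycle of that procedure might look like the sketch below, which only builds the command lines and does not run them. The dataset names (tank/home, backup/home) and the date-based snapshot names are hypothetical, and the helper backup_commands is invented for illustration.

```python
def backup_commands(dataset, backup, previous, current):
    """Build the commands for one incremental backup cycle (not executed)."""
    snapshot = ["zfs", "snapshot", f"{dataset}@{current}"]
    # -i sends only the changes between the previous and current snapshots
    send = ["zfs", "send", "-i", f"{dataset}@{previous}", f"{dataset}@{current}"]
    receive = ["zfs", "receive", backup]
    return snapshot, send, receive

snapshot, send, receive = backup_commands(
    "tank/home", "backup/home", "2008-05-28", "2008-05-29")
print(" ".join(snapshot))
print(" ".join(send), "|", " ".join(receive))
```

In real use the output of the send command would be piped into the receive command, and the previous snapshot must already exist on the backup pool.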

Jonathan



Re: [zfs-discuss] zfs equivalent of ufsdump and ufsrestore

2008-05-29 Thread Jonathan Hogg
On 29 May 2008, at 15:51, Thomas Maier-Komor wrote:

>> I very strongly disagree.  The closest ZFS equivalent to ufsdump is
>> 'zfs send'.  'zfs send', like ufsdump, has intimate awareness of the
>> actual on-disk layout and is an integrated part of the filesystem
>> implementation.
>>
>> star is a userland archiver.
>>
>
> The man page for zfs states the following for send:
>
>  The format of the stream is evolving. No backwards compatibility
>  is guaranteed. You may not be able to receive your streams on
>  future versions of ZFS.
>
> I think this should be taken into account when considering 'zfs send'
> for backup purposes...

Presumably, if one is backing up to another disk, one could zfs  
receive to a pool on that disk. That way you get simple file-based  
access, full history (although it could be collapsed by deleting older  
snapshots as necessary), and no worries about stream format changes.

Jonathan


[zfs-discuss] Video streaming and prefetch

2008-05-06 Thread Jonathan Hogg
Hi all,

I'm new to this list and ZFS, so forgive me if I'm re-hashing an old  
topic. I'm also using ZFS on FreeBSD not Solaris, so forgive me for  
being a heretic ;-)

I recently set up a home NAS box and decided that ZFS is the only
sensible way to manage 4TB of disks. The primary use of the box is to
serve my telly (actually a Mac mini). This is using AFP (via netatalk)
to serve space to the telly for storing and retrieving video. The
video tends to be 2-4GB files that are read/written sequentially at a  
rate in the region of 800KB/s.

Unfortunately, the performance has been very choppy. The video  
software assumes it's talking to fast local storage and thus makes  
little attempt to buffer. I spent a long time trying to figure out the  
network problem before determining that the problem is actually in  
reading from the FS. This is a pretty cheap box, but it can still
sustain 110MB/s off the array with low-millisecond access times. So
there really is no excuse for not being able to serve up 800KB/s in an
even fashion.

After some experimentation I have determined that the problem is  
prefetching. Given this thing is mostly serving sequentially at a low,  
even rate it ought to be perfect territory for prefetching. I spent  
the weekend reading the ZFS code (bank holiday fun eh?) and running  
some experiments and think the problem is in the interaction between  
the prefetching code and the running processes.

(Warning: some of the following is speculation on observed behaviour  
and may be rubbish.)

The behaviour I see is the file streaming stalling whenever the  
prefetch code decides to read some more blocks. The dmu_zfetch code is  
all run as part of the read() operation. When this finds itself  
getting close to running out of prefetched blocks it queues up  
requests for more blocks - 256 of them. At 128KB per block, that's  
32MB of data it requests. At this point it should be asynchronous and  
the caller should get back control and be able to process the data it  
just read. However, my NAS box is a uniprocessor and the issue thread  
is higher priority than user processes. So, in fact, it immediately  
begins issuing the physical reads to the disks.

Given that modern disks tend to prefetch into their own caches anyway,  
some of these reads are likely to be served up instantly. This causes  
interrupts back into the kernel to deal with the data. This queues up  
the interrupt threads, which are also higher priority than user  
processes. These consume a not-insubstantial amount of CPU time to  
gather, checksum and load the blocks into the ARC. During which time,  
the disks have located the other blocks and started serving them up.

So what I seem to get is a small "perfect storm" of interrupt  
processing. This delays the user process for a few hundred  
milliseconds. Even though the originally requested block was *in* the  
cache! To add insult to injury, the user process in this case, when it
finally regains the CPU and returns the data to the caller, then
sleeps for a couple of hundred milliseconds. So prefetching, instead
of evening-out reading and reducing jitter, has produced the worst-case
performance of compressing all of the jitter into one massive
lump every 40 seconds (32MB / 800KB/s).

I get reasonably even performance if I disable prefetching or if I  
reduce the zfetch_block_cap to 16-32 blocks instead of 256.
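
The arithmetic behind those figures can be worked through as follows. This is a simplified model that assumes every prefetch burst requests the full block cap and that playback runs at a steady 800KB/s; the function name prefetch_burst is made up for illustration.

```python
BLOCK_SIZE = 128 * 1024       # 128KB per block, as above
STREAM_RATE = 800 * 1024      # 800KB/s playback rate, as above

def prefetch_burst(block_cap):
    """Return (burst size in MB, seconds between bursts) for a block cap."""
    burst_bytes = block_cap * BLOCK_SIZE
    return burst_bytes / (1024 * 1024), burst_bytes / STREAM_RATE

for cap in (256, 32, 16):
    megabytes, interval = prefetch_burst(cap)
    print(f"cap={cap:3d}: {megabytes:4.0f}MB burst every {interval:5.2f}s")
```

Under this model a cap of 256 gives a 32MB burst roughly every 41 seconds, while caps of 32 and 16 give 4MB and 2MB bursts every 5 and 2.5 seconds or so, which spreads the same work far more evenly.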

Other than just taking this opportunity to rant, I'm wondering if  
anyone else has seen similar problems and found a way around them?  
Also, to any ZFS developers: why does the prefetching logic follow the  
same path as a regular async read? Surely these ought to be way down  
the priority list? My immediate thought after a weekend of reading the  
code was to re-write it to use a low priority prefetch thread and have  
all of the dmu_zfetch() logic in that instead of in-line with the  
original dbuf_read().

Jonathan


PS: Hi Darren!
