Hi all,

I'm new to this list and ZFS, so forgive me if I'm re-hashing an old  
topic. I'm also using ZFS on FreeBSD not Solaris, so forgive me for  
being a heretic ;-)

I recently set up a home NAS box and decided that ZFS is the only
sensible way to manage 4TB of disks. The primary use of the box is to
serve my telly (actually a Mac mini), using AFP (via netatalk) to
serve space to the telly for storing and retrieving video. The
video tends to be 2-4GB files that are read/written sequentially at a
rate in the region of 800KB/s.

Unfortunately, the performance has been very choppy. The video  
software assumes it's talking to fast local storage and thus makes  
little attempt to buffer. I spent a long time trying to figure out the  
network problem before determining that the problem is actually in  
reading from the FS. This is a pretty cheap box, but it can still
sustain 110MB/s off the array, with access times in the low
milliseconds. So there really is no excuse for not being able to
serve up 800KB/s in an
even fashion.

After some experimentation I have determined that the problem is  
prefetching. Given this thing is mostly serving sequentially at a low,
even rate, it ought to be perfect territory for prefetching. I spent
the weekend reading the ZFS code (bank holiday fun eh?) and running  
some experiments and think the problem is in the interaction between  
the prefetching code and the running processes.

(Warning: some of the following is speculation on observed behaviour  
and may be rubbish.)

The behaviour I see is that the file stream stalls whenever the
prefetch code decides to read some more blocks. The dmu_zfetch code is
all run as part of the read() operation. When this finds itself  
getting close to running out of prefetched blocks it queues up  
requests for more blocks - 256 of them. At 128KB per block, that's  
32MB of data it requests. At this point it should be asynchronous and  
the caller should get back control and be able to process the data it  
just read. However, my NAS box is a uniprocessor and the issue thread  
is higher priority than user processes. So, in fact, it immediately  
begins issuing the physical reads to the disks.
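
To illustrate, here's a toy model of what I think is going on. All
the names and the trigger threshold are mine (the real code is rather
more involved), so treat it as a sketch of the behaviour, not the
actual dmu_zfetch implementation:

    #include <stdio.h>
    #include <stdint.h>

    #define ZFETCH_BLOCK_CAP 256            /* blocks per refill */
    #define RECORDSIZE       (128 * 1024)   /* 128KB records */

    static int prefetched_remaining = 0;

    static void
    issue_async_read(uint64_t blkid)
    {
        /*
         * Stands in for handing the I/O to the higher-priority
         * issue thread; on my UP box that work effectively happens
         * right here, before the caller gets the CPU back.
         */
        (void)blkid;
        prefetched_remaining++;
    }

    static void
    dbuf_read_sketch(uint64_t blkid)
    {
        if (prefetched_remaining < 8) {     /* "getting close" */
            for (int i = 0; i < ZFETCH_BLOCK_CAP; i++)
                issue_async_read(blkid + 1 + i);
            printf("block %llu: queued %d blocks = %d MB of prefetch\n",
                (unsigned long long)blkid, ZFETCH_BLOCK_CAP,
                ZFETCH_BLOCK_CAP * RECORDSIZE / (1024 * 1024));
            printf("(at 800 KB/s that is one lump every ~%.0f s)\n",
                (double)(ZFETCH_BLOCK_CAP * RECORDSIZE) / 1024.0 / 800.0);
        }
        prefetched_remaining--;             /* consume this block */
    }

    int
    main(void)
    {
        for (uint64_t b = 0; b < 600; b++)
            dbuf_read_sketch(b);
        return (0);
    }

Run it and the reads tick along quietly for ~256 blocks, then
everything happens at once - which is exactly the rhythm of the
stalls I see.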

Given that modern disks tend to prefetch into their own caches anyway,  
some of these reads are likely to be served up instantly. This causes  
interrupts back into the kernel to deal with the data. This queues up  
the interrupt threads, which are also higher priority than user  
processes. These consume a not-insubstantial amount of CPU time
gathering, checksumming and loading the blocks into the ARC, during
which time the disks have located the other blocks and started
serving them up.

So what I seem to get is a small "perfect storm" of interrupt
processing. This delays the user process for a few hundred
milliseconds. Even though the originally requested block was *in* the
cache! To add insult to injury, the user process in this case, when
it finally regains the CPU and returns the data to the caller, then
sleeps for a couple of hundred milliseconds. So prefetching, instead
of evening out reading and reducing jitter, has produced the worst
case performance of compressing all of the jitter into one massive
lump every 40 seconds (32MB / 800KB/s).

I get reasonably even performance if I disable prefetching or if I  
reduce the zfetch_block_cap to 16-32 blocks instead of 256.
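
For anyone wanting to try the same, these are the knobs I was
poking. The names below are how they show up on my FreeBSD build
(Solaris spells them differently), so take them as an example rather
than gospel:

    # /boot/loader.conf
    vfs.zfs.prefetch_disable=1       # turn prefetching off entirely

    # or keep prefetching but shrink the refill batch
    vfs.zfs.zfetch.block_cap=16      # blocks per refill (default 256)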

Other than just taking this opportunity to rant, I'm wondering if  
anyone else has seen similar problems and found a way around them?  
Also, to any ZFS developers: why does the prefetching logic follow the  
same path as a regular async read? Surely these ought to be way down  
the priority list? My immediate thought after a weekend of reading the
code was to rewrite it to use a low-priority prefetch thread and have
all of the dmu_zfetch() logic in that instead of in-line with the
original dbuf_read().
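
Something like this shape, with userland pthreads standing in for
kernel threads. All the names are mine, and a real version would need
the stream-detection state and an actual low thread priority, so this
is purely a sketch:

    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  q_cv   = PTHREAD_COND_INITIALIZER;
    static uint64_t        hint_blkid;
    static int             hint_pending;

    /* Called from the read() path: cheap, never touches the disks. */
    static void
    zfetch_hint(uint64_t blkid)
    {
        pthread_mutex_lock(&q_lock);
        hint_blkid = blkid;     /* latest hint wins; fine for a sketch */
        hint_pending = 1;
        pthread_cond_signal(&q_cv);
        pthread_mutex_unlock(&q_lock);
    }

    /*
     * The prefetch worker: all the dmu_zfetch logic would live here.
     * In the kernel this thread would be created at minimum priority
     * so it loses the CPU to the consuming process, not vice versa.
     */
    static void *
    zfetch_thread(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&q_lock);
            while (!hint_pending)
                pthread_cond_wait(&q_cv, &q_lock);
            uint64_t blkid = hint_blkid;
            hint_pending = 0;
            pthread_mutex_unlock(&q_lock);
            /* detect streams, then issue a *small* batch of reads */
            printf("prefetching ahead of block %llu\n",
                (unsigned long long)blkid);
        }
        return (NULL);
    }

    int
    main(void)
    {
        pthread_t tid;
        pthread_create(&tid, NULL, zfetch_thread, NULL);
        for (uint64_t b = 0; b < 4; b++) {
            zfetch_hint(b);
            usleep(10000);      /* let the worker drain */
        }
        return (0);             /* toy: exiting kills the worker */
    }

The point being that read() only ever drops a cheap, non-blocking
hint, and all of the expensive issue work happens in a thread that
can be scheduled *below* the consumer instead of above it.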

Jonathan


PS: Hi Darren!
