Hi Nathan,
comments below...
On Feb 13, 2011, at 8:28 PM, Nathan Kroenert wrote:
On 14/02/2011 4:31 AM, Richard Elling wrote:
On Feb 13, 2011, at 12:56 AM, Nathan Kroenert <nat...@tuneunix.com> wrote:
Hi all,
Exec summary: I have a situation where I'm seeing lots of large reads starving
writes from being able to get through to disk.
<snip>
What is the average service time of each disk? Multiply that by the average
active queue depth. If that number is greater than, say, 100ms, then the ZFS
I/O scheduler is not able to be very effective because the disks are too slow.
Reducing the active queue depth can help, see zfs_vdev_max_pending in the
ZFS Evil Tuning Guide. Faster disks help, too.
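As a rough worked example of that rule of thumb (a sketch only, using the sd6 numbers from the iostat samples later in this thread):

```shell
# Backlog estimate from one 'iostat -x' sample:
#   backlog (ms) = actv (avg active queue depth) * svc_t (avg service time, ms)
# Values below are the sd6 figures from the samples in this thread.
actv=10.0
svc_t=26.7
backlog=$(awk -v a="$actv" -v s="$svc_t" 'BEGIN { printf "%.0f", a * s }')
echo "estimated queue backlog: ${backlog} ms"   # 267 ms, well above ~100 ms
```

Anything much over 100 ms here means the ZFS scheduler's decisions are being swamped by the queue already sitting at the disk.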
NexentaStor fans, note that you can do this easily, on the fly, via the Settings -> Preferences -> System web GUI.
-- richard
Hi Richard,
Long time no speak! Anyhoo - See below.
I'm unconvinced that faster disks would help. I think faster disks, at least in
what I'm observing, would make it suck just as bad, just reading faster... ;)
Maybe I'm missing something.
Faster disks always help :-)
Queue depth is around 10 (default and unchanged since install), and average
service time is about 25ms... Below are 1 second samples with iostat - while I
have included only about 10 seconds, it's representative of what I'm seeing all
the time.
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd6 360.9 13.0 46190.5 351.4 0.0 10.0 26.7 1 100
sd7 342.9 12.0 43887.3 329.9 0.0 10.0 28.1 1 100
ok, we'll take sd6 as an example (the math is easy :-) ...
actv = 10
svc_t = 26.7
actv * svc_t = 267 milliseconds
This is the queue at the disk. ZFS manages its own queue for the disk,
but once it leaves ZFS, there is no way for ZFS to manage it. In the
case of the active queue, the I/Os have left the OS, so even the OS
is unable to change what is in the queue or directly influence when
the I/Os will be finished.
In ZFS, the queue has a priority scheduler, and since b130 or so it places a
higher priority on async writes than on async reads. But the async writes
arrive intermittently, so each burst still gets stuck behind the 267
milliseconds of reads already queued at the disk.
If the workload sends reads continuously and writes only occasionally, reads
will appear to dominate completely. In older releases, where reads and writes
had the same priority, this looked even worse.
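A back-of-the-envelope way to see why a shorter device queue helps the occasional write (illustrative numbers only; real per-I/O service times vary with seek pattern):

```shell
# If the device queue is full of reads when a write arrives, the write
# waits roughly max_pending * per-I/O service time before completing.
# 25 ms approximates the per-I/O service time in the samples above.
svc_ms=25
for depth in 1 2 5 10; do
  awk -v d="$depth" -v s="$svc_ms" \
    'BEGIN { printf "zfs_vdev_max_pending=%2d -> write waits ~%3d ms\n", d, d * s }'
done
```

At the default depth of 10, each write burst can sit behind a quarter second of queued reads; at 1 or 2, it gets to the head of the line almost immediately.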
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd6 422.1 0.0 54025.0 0.0 0.0 10.0 23.6 1 100
sd7 422.1 0.0 54025.0 0.0 0.0 10.0 23.6 1 100
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd6 370.0 11.0 47360.4 342.0 0.0 10.0 26.2 1 100
sd7 327.0 16.0 41856.4 632.0 0.0 9.6 28.0 1 100
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd6 388.0 7.0 49406.4 290.0 0.0 9.8 24.8 1 100
sd7 409.0 1.0 52350.3 2.0 0.0 9.5 23.2 1 99
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd6 423.0 0.0 54148.6 0.0 0.0 10.0 23.6 1 100
sd7 413.0 0.0 52868.5 0.0 0.0 10.0 24.2 1 100
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd6 400.0 2.0 51081.2 2.0 0.0 10.0 24.8 1 100
sd7 384.0 4.0 49153.2 4.0 0.0 10.0 25.7 1 100
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd6 401.9 1.0 51448.9 8.0 0.0 10.0 24.8 1 100
sd7 424.9 0.0 54392.4 0.0 0.0 10.0 23.5 1 100
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd6 215.1 208.1 26751.9 25433.5 0.0 9.3 22.1 1 100
sd7 189.1 216.1 24199.1 26833.9 0.0 8.9 22.1 1 91
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd6 295.0 162.0 37756.8 20610.2 0.0 10.0 21.8 1 100
sd7 307.0 150.0 39292.6 19198.4 0.0 10.0 21.8 1 100
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd6 405.0 2.0 51843.8 6.0 0.0 10.0 24.5 1 100
sd7 408.0 3.0 52227.8 10.0 0.0 10.0 24.3 1 100
Bottom line is that ZFS does not seem to care about getting my writes to disk
when there is a heavy read workload.
I have also confirmed that it's not the RAID controller either - behaviour is
identical with direct attach SATA.
But - to your excellent theory: Setting zfs_vdev_max_pending to 1 causes things
to swing dramatically!
- At 1, writes proceed much faster than reads - about 20 MB/s read per spindle
vs. 35 MB/s write per spindle.
- At 2, writes still outstrip reads - 15 MB/s read vs. 44 MB/s write per
spindle.
Though the NexentaStor docs recommend "1" for SATA disks, I find that "2" works
better.
- At 3, it's starting to lean more heavily to reads again, but writes at least
get a whack - 35 MB/s read vs. 15-20 MB/s write per spindle.
- At 4, we are closer to 35-40 MB/s read, 15 MB/s write.
Isn't queuing theory fun! :-)
By the time we get back to the default of 0xa, writes drop off almost
completely.
The crossover (on the box with no RAID controller) seems to be 5. Anything more
than that, and writes get shouldered out the way almost completely.
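For anyone wanting to repeat the experiment, the tunable can be changed on a live system with mdb, as described in the ZFS Evil Tuning Guide (the value 2 below is just the setting being tried here; root privileges assumed):

```shell
# Read the current value of zfs_vdev_max_pending (decimal):
echo zfs_vdev_max_pending/D | mdb -k

# Change it on the fly; 0t prefix means decimal. Takes effect immediately:
echo zfs_vdev_max_pending/W0t2 | mdb -kw

# To make the setting survive a reboot, add to /etc/system:
#   set zfs:zfs_vdev_max_pending = 2
```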
So - aside from the obvious - manually setting zfs_vdev_max_pending - do you
have any thoughts on ZFS being able to make this sort of determination by
itself? It would be somewhat of a shame to bust out such 'whacky knobs' for
plain old direct attach SATA disks to get balance...
Also - can I set this property per-vdev? (just in case I have sata and, say, a
USP-V connected to the same box)?
Today, there is not a per-vdev setting. There are several changes in the works
for this and other scheduling.
Incidentally, you can change the priorities on the fly, so you could experiment
with different settings for mixed workloads. Obviously, non-mixed workloads
won't be very interesting.
Also FWIW, for SATA disks in particular, it is not unusual for us to recommend
dropping zfs_vdev_max_pending to 2. It can make a big difference for some
workloads.
-- richard