Hi Nathan, comments below...

On Feb 13, 2011, at 8:28 PM, Nathan Kroenert wrote:
> On 14/02/2011 4:31 AM, Richard Elling wrote:
>> On Feb 13, 2011, at 12:56 AM, Nathan Kroenert <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> Exec summary: I have a situation where I'm seeing lots of large reads
>>> starving writes from being able to get through to disk.
>>>
>>> <snip>
>>
>> What is the average service time of each disk? Multiply that by the average
>> active queue depth. If that number is greater than, say, 100ms, then the
>> ZFS I/O scheduler is not able to be very effective because the disks are
>> too slow. Reducing the active queue depth can help; see
>> zfs_vdev_max_pending in the ZFS Evil Tuning Guide. Faster disks help, too.
>>
>> NexentaStor fans, note that you can do this easily, on the fly, via the
>> Settings -> Preferences -> System web GUI.
>>  -- richard
>
> Hi Richard,
>
> Long time no speak! Anyhoo - see below.
>
> I'm unconvinced that faster disks would help. I think faster disks, at
> least in what I'm observing, would make it suck just as bad, just reading
> faster... ;) Maybe I'm missing something.

Faster disks always help :-)

> Queue depth is around 10 (default and unchanged since install), and average
> service time is about 25ms. Below are 1-second samples from iostat - while
> I have included only about 10 seconds, it's representative of what I'm
> seeing all the time.
>
>                     extended device statistics
> device    r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
> sd6     360.9   13.0  46190.5    351.4   0.0  10.0   26.7   1 100
> sd7     342.9   12.0  43887.3    329.9   0.0  10.0   28.1   1 100

OK, we'll take sd6 as an example (the math is easy :-) ...

    actv  = 10
    svc_t = 26.7 ms
    actv * svc_t = 267 milliseconds

This is the queue at the disk. ZFS manages its own queue for the disk, but
once an I/O leaves ZFS, there is no way for ZFS to manage it. In the case of
the active queue, the I/Os have left the OS, so even the OS is unable to
change what is in the queue or directly influence when the I/Os will be
finished.
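As a sanity check, the rule of thumb (average active queue depth times average
service time) can be computed straight from the iostat fields. This is just a
sketch with the sd6 sample hard-coded; the function name is mine, not anything
from ZFS or iostat:

```python
# Rough queue-residency estimate from iostat fields: an I/O arriving at
# the back of the active queue waits roughly actv * svc_t to complete.

def queue_residency_ms(actv, svc_t_ms):
    """Approximate time (ms) an I/O spends behind the active queue."""
    return actv * svc_t_ms

# sd6 sample from the iostat output above
actv = 10.0    # average active queue depth
svc_t = 26.7   # average service time, milliseconds

# ~267 ms -- well above the ~100 ms point where the ZFS I/O
# scheduler loses its ability to be effective
print(queue_residency_ms(actv, svc_t))
```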
In ZFS, the queue has a priority scheduler, and it does place a higher
priority on async writes than on async reads (since b130 or so). But the
intermittent async writes still get stuck behind roughly 267 milliseconds of
queued reads while the disk drains its active queue. If ZFS sends reads
continuously and writes only occasionally, reads will appear to dominate
completely. In older releases, where reads and writes had the same priority,
this looked even worse.

>                     extended device statistics
> device    r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
> sd6     422.1    0.0  54025.0      0.0   0.0  10.0   23.6   1 100
> sd7     422.1    0.0  54025.0      0.0   0.0  10.0   23.6   1 100
>
>                     extended device statistics
> device    r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
> sd6     370.0   11.0  47360.4    342.0   0.0  10.0   26.2   1 100
> sd7     327.0   16.0  41856.4    632.0   0.0   9.6   28.0   1 100
>
>                     extended device statistics
> device    r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
> sd6     388.0    7.0  49406.4    290.0   0.0   9.8   24.8   1 100
> sd7     409.0    1.0  52350.3      2.0   0.0   9.5   23.2   1  99
>
>                     extended device statistics
> device    r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
> sd6     423.0    0.0  54148.6      0.0   0.0  10.0   23.6   1 100
> sd7     413.0    0.0  52868.5      0.0   0.0  10.0   24.2   1 100
>
>                     extended device statistics
> device    r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
> sd6     400.0    2.0  51081.2      2.0   0.0  10.0   24.8   1 100
> sd7     384.0    4.0  49153.2      4.0   0.0  10.0   25.7   1 100
>
>                     extended device statistics
> device    r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
> sd6     401.9    1.0  51448.9      8.0   0.0  10.0   24.8   1 100
> sd7     424.9    0.0  54392.4      0.0   0.0  10.0   23.5   1 100
>
>                     extended device statistics
> device    r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
> sd6     215.1  208.1  26751.9  25433.5   0.0   9.3   22.1   1 100
> sd7     189.1  216.1  24199.1  26833.9   0.0   8.9   22.1   1  91
>
>                     extended device statistics
> device    r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
> sd6     295.0  162.0  37756.8  20610.2   0.0  10.0   21.8   1 100
> sd7     307.0  150.0  39292.6  19198.4   0.0  10.0   21.8   1 100
>
>                     extended device statistics
> device    r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
> sd6     405.0    2.0  51843.8      6.0   0.0  10.0   24.5   1 100
> sd7     408.0    3.0  52227.8     10.0   0.0  10.0   24.3   1 100
>
> Bottom line is that ZFS does not seem to care about getting my writes to
> disk when there is a heavy read workload.
>
> I have also confirmed that it's not the RAID controller either - behaviour
> is identical with direct-attach SATA.
>
> But - to your excellent theory: setting zfs_vdev_max_pending to 1 causes
> things to swing dramatically!
> - At 1, writes proceed much more than reads - 20MB/s read per spindle :
>   35MB/s write per spindle.
> - At 2, writes still outstrip reads - 15MB/s read per spindle : 44MB/s
>   write.

Though the NexentaStor docs recommend "1" for SATA disks, I find that "2"
works better.

> - At 3, it's starting to lean more heavily to reads again, but writes at
>   least get a whack - 35MB/s per spindle read : 15-20MB/s write.
> - At 4, we are closer to 35-40MB/s read, 15MB/s write.

Isn't queueing theory fun! :-)

> By the time we get back to the default of 0xa, writes drop off almost
> completely.
>
> The crossover (on the box with no RAID controller) seems to be 5. Anything
> more than that, and writes get shouldered out of the way almost completely.
>
> So - aside from the obvious - manually setting zfs_vdev_max_pending - do
> you have any thoughts on ZFS being able to make this sort of determination
> by itself? It would be somewhat of a shame to bust out such 'whacky knobs'
> for plain old direct-attach SATA disks just to get balance...
>
> Also - can I set this property per-vdev? (just in case I have SATA and,
> say, a USP-V connected to the same box)

Today, there is not a per-vdev setting. There are several changes in the
works for this and other scheduling. Incidentally, you can change the
priorities on the fly, so you could experiment with different settings for
mixed workloads. Obviously, non-mixed workloads won't be very interesting.
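The crossover behaviour Nathan measured can be illustrated with a toy model;
this is emphatically not ZFS code, just a back-of-the-envelope sketch under
simplifying assumptions (fixed per-I/O service time taken from the iostat
samples, a saturating read stream, FIFO service at the disk). Once I/Os are
at the disk, the host cannot reorder them, so a freshly issued write waits
behind up to max_pending reads:

```python
# Toy model of the vdev active queue: the host keeps up to max_pending
# I/Os outstanding at a disk with fixed service time. Reads are always
# available; the host prefers a pending write, but the write still
# drains FIFO behind everything already queued at the disk.

SVC_T_MS = 25.0  # assumed per-I/O service time, as in the iostat samples

def write_latency_ms(max_pending, svc_t_ms=SVC_T_MS):
    """Approximate latency of a write arriving at a read-saturated disk:
    wait for a slot to free (~one service time), then for the
    (max_pending - 1) reads still ahead of it, then its own service."""
    return svc_t_ms + (max_pending - 1) * svc_t_ms + svc_t_ms

for depth in (1, 2, 5, 10):
    print(f"max_pending={depth:2d}: ~{write_latency_ms(depth):5.0f} ms write latency")
```

At depth 10 the model gives ~275 ms, close to the 267 ms computed from the
sd6 iostat sample, while at depth 1-2 a write sneaks in within 50-75 ms --
consistent with the dramatic swing Nathan saw when lowering the tunable.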
Also, FWIW, for SATA disks in particular, it is not unusual for us to
recommend dropping zfs_vdev_max_pending to 2. It can make a big difference
for some workloads.
 -- richard

_______________________________________________
zfs-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
