Hi Nathan,
comments below...
On Feb 13, 2011, at 8:28 PM, Nathan Kroenert wrote:
On 14/02/2011 4:31 AM, Richard Elling wrote:
On Feb 13, 2011, at 12:56 AM, Nathan Kroenert <nat...@tuneunix.com> wrote:
Hi all,
Exec summary: I have a situation where I'm seeing lots of large reads starving
writes from being able to get through to disk.
<snip>
What is the average service time of each disk? Multiply that by the average
active queue depth. If that number is greater than, say, 100ms, then the ZFS
I/O scheduler is not able to be very effective because the disks are too slow.
Reducing the active queue depth can help, see zfs_vdev_max_pending in the
ZFS Evil Tuning Guide. Faster disks help, too.
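As a rough worked example of that rule of thumb (a sketch only, using the sd6 numbers from the iostat samples later in this thread):

```shell
# Backlog estimate from one 'iostat -x' sample:
#   backlog (ms) = actv (avg active queue depth) * svc_t (avg service time, ms)
# Values below are the sd6 figures from the samples in this thread.
actv=10.0
svc_t=26.7
backlog=$(awk -v a="$actv" -v s="$svc_t" 'BEGIN { printf "%.0f", a * s }')
echo "estimated queue backlog: ${backlog} ms"   # 267 ms, well above ~100 ms
```

Anything much over 100 ms here means the ZFS scheduler's decisions are being swamped by the queue already sitting at the disk.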
NexentaStor fans, note that you can do this easily, on the fly, via the Settings -> Preferences -> System web GUI.
-- richard
Hi Richard,
Long time no speak! Anyhoo - See below.
I'm unconvinced that faster disks would help. I think faster disks, at least in
what I'm observing, would make it suck just as bad, just reading faster... ;)
Maybe I'm missing something.
Faster disks always help :-)
Queue depth is around 10 (default and unchanged since install), and average
service time is about 25ms... Below are 1 second samples with iostat - while I
have included only about 10 seconds, it's representative of what I'm seeing all
the time.
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd6 360.9 13.0 46190.5 351.4 0.0 10.0 26.7 1 100
sd7 342.9 12.0 43887.3 329.9 0.0 10.0 28.1 1 100
ok, we'll take sd6 as an example (the math is easy :-) ...
actv = 10
svc_t = 26.7
actv * svc_t = 267 milliseconds
This is the queue at the disk. ZFS manages its own queue for the disk,
but once it leaves ZFS, there is no way for ZFS to manage it. In the
case of the active queue, the I/Os have left the OS, so even the OS
is unable to change what is in the queue or directly influence when
the I/Os will be finished.
In ZFS, the queue has a priority scheduler, and since b130 or so it places a
higher priority on async writes than on async reads. But the async writes
arrive intermittently, so each burst still gets stuck behind the 267
milliseconds of reads already queued at the disk.
If the workload sends reads continuously and writes only occasionally, reads
will appear to dominate completely. In older releases, where reads and writes
had the same priority, this looked even worse.
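A back-of-the-envelope way to see why a shorter device queue helps the occasional write (illustrative numbers only; real per-I/O service times vary with seek pattern):

```shell
# If the device queue is full of reads when a write arrives, the write
# waits roughly max_pending * per-I/O service time before completing.
# 25 ms approximates the per-I/O service time in the samples above.
svc_ms=25
for depth in 1 2 5 10; do
  awk -v d="$depth" -v s="$svc_ms" \
    'BEGIN { printf "zfs_vdev_max_pending=%2d -> write waits ~%3d ms\n", d, d * s }'
done
```

At the default depth of 10, each write burst can sit behind a quarter second of queued reads; at 1 or 2, it gets to the head of the line almost immediately.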
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd6 422.1 0.0 54025.0 0.0 0.0 10.0 23.6 1 100
sd7 422.1 0.0 54025.0 0.0 0.0 10.0 23.6 1 100
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd6 370.0 11.0 47360.4 342.0 0.0 10.0 26.2 1 100
sd7 327.0 16.0 41856.4 632.0 0.0 9.6 28.0 1 100
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd6 388.0 7.0 49406.4 290.0 0.0 9.8 24.8 1 100
sd7 409.0 1.0 52350.3 2.0 0.0 9.5 23.2 1 99
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd6 423.0 0.0 54148.6 0.0 0.0 10.0 23.6 1 100
sd7 413.0 0.0 52868.5 0.0 0.0 10.0 24.2 1 100
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd6 400.0 2.0 51081.2 2.0 0.0 10.0 24.8 1 100
sd7 384.0 4.0 49153.2 4.0 0.0 10.0 25.7 1 100
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd6 401.9 1.0 51448.9 8.0 0.0 10.0 24.8 1 100
sd7 424.9 0.0 54392.4 0.0 0.0 10.0 23.5 1 100
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd6 215.1 208.1 26751.9 25433.5 0.0 9.3 22.1 1 100
sd7 189.1 216.1 24199.1 26833.9 0.0 8.9 22.1 1 91
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd6 295.0 162.0 37756.8 20610.2 0.0 10.0 21.8 1 100
sd7 307.0 150.0 39292.6 19198.4 0.0 10.0 21.8 1 100
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd6 405.0 2.0 51843.8 6.0 0.0 10.0 24.5 1 100
sd7 408.0 3.0 52227.8 10.0 0.0 10.0 24.3 1 100
Bottom line is that ZFS does not seem to care about getting my writes to disk
when there is a heavy read workload.
I have also confirmed that it's not the RAID controller either - behaviour is
identical with direct attach SATA.
But - to your excellent theory: Setting zfs_vdev_max_pending to 1 causes things
to swing dramatically!
- At 1, writes proceed much faster than reads - about 20 MB/s read per spindle
vs. 35 MB/s write per spindle.
- At 2, writes still outstrip reads - 15 MB/s read vs. 44 MB/s write per
spindle.
Though the NexentaStor docs recommend "1" for SATA disks, I find that "2" works
better.
- At 3, it's starting to lean more heavily to reads again, but writes at least
get a whack - 35 MB/s read vs. 15-20 MB/s write per spindle.
- At 4, we are closer to 35-40 MB/s read, 15 MB/s write.
Isn't queuing theory fun! :-)
By the time we get back to the default of 0xa, writes drop off almost
completely.
The crossover (on the box with no RAID controller) seems to be 5. Anything more
than that, and writes get shouldered out the way almost completely.
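For anyone wanting to repeat the experiment, the tunable can be changed on a live system with mdb, as described in the ZFS Evil Tuning Guide (the value 2 below is just the setting being tried here; root privileges assumed):

```shell
# Read the current value of zfs_vdev_max_pending (decimal):
echo zfs_vdev_max_pending/D | mdb -k

# Change it on the fly; 0t prefix means decimal. Takes effect immediately:
echo zfs_vdev_max_pending/W0t2 | mdb -kw

# To make the setting survive a reboot, add to /etc/system:
#   set zfs:zfs_vdev_max_pending = 2
```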
So - aside from the obvious - manually setting zfs_vdev_max_pending - do you
have any thoughts on ZFS being able to make this sort of determination by
itself? It would be somewhat of a shame to bust out such 'whacky knobs' for
plain old direct attach SATA disks to get balance...
Also - can I set this property per-vdev? (just in case I have sata and, say, a
USP-V connected to the same box)?
Today, there is not a per-vdev setting. There are several changes in the works
for this and other scheduling.
Incidentally, you can change the priorities on the fly, so you could experiment
with different settings for mixed workloads. Obviously, non-mixed workloads
won't be very interesting.
Also FWIW, for SATA disks in particular, it is not unusual for us to recommend
dropping zfs_vdev_max_pending to 2. It can make a big difference for some
workloads.
-- richard