Hi Nathan, comments below...

On Feb 13, 2011, at 8:28 PM, Nathan Kroenert wrote:
> On 14/02/2011 4:31 AM, Richard Elling wrote:
>> On Feb 13, 2011, at 12:56 AM, Nathan Kroenert <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> Exec summary: I have a situation where I'm seeing lots of large reads
>>> starving writes from being able to get through to disk.
>>>
>>> <snip>
>>
>> What is the average service time of each disk? Multiply that by the average
>> active queue depth. If that number is greater than, say, 100ms, then the
>> ZFS I/O scheduler is not able to be very effective because the disks are
>> too slow. Reducing the active queue depth can help; see
>> zfs_vdev_max_pending in the ZFS Evil Tuning Guide. Faster disks help, too.
>>
>> NexentaStor fans, note that you can do this easily, on the fly, via the
>> Settings -> Preferences -> System web GUI.
>>  -- richard
>
> Hi Richard,
>
> Long time no speak! Anyhoo - see below.
>
> I'm unconvinced that faster disks would help. I think faster disks, at
> least in what I'm observing, would make it suck just as bad, just reading
> faster... ;) Maybe I'm missing something.

Faster disks always help :-)

> Queue depth is around 10 (default and unchanged since install), and average
> service time is about 25ms. Below are 1-second samples from iostat - while
> I have included only about 10 seconds, it's representative of what I'm
> seeing all the time.
>
>                     extended device statistics
> device    r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
> sd6     360.9   13.0  46190.5    351.4   0.0  10.0   26.7   1 100
> sd7     342.9   12.0  43887.3    329.9   0.0  10.0   28.1   1 100

OK, we'll take sd6 as an example (the math is easy :-) ...

    actv  = 10
    svc_t = 26.7 ms
    actv * svc_t = 267 milliseconds

This is the queue at the disk. ZFS manages its own queue for the disk, but
once an I/O leaves ZFS, there is no way for ZFS to manage it. In the case of
the active queue, the I/Os have left the OS, so even the OS is unable to
change what is in the queue or directly influence when the I/Os will be
finished.
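As a sanity check, the rule of thumb (average active queue depth times average
service time) can be computed straight from the iostat fields. This is just a
sketch with the sd6 sample hard-coded; the function name is mine, not anything
from ZFS or iostat:

```python
# Rough queue-residency estimate from iostat fields: an I/O arriving at
# the back of the active queue waits roughly actv * svc_t to complete.

def queue_residency_ms(actv, svc_t_ms):
    """Approximate time (ms) an I/O spends behind the active queue."""
    return actv * svc_t_ms

# sd6 sample from the iostat output above
actv = 10.0    # average active queue depth
svc_t = 26.7   # average service time, milliseconds

# ~267 ms -- well above the ~100 ms point where the ZFS I/O
# scheduler loses its ability to be effective
print(queue_residency_ms(actv, svc_t))
```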
In ZFS, the queue has a priority scheduler, and it does place a higher
priority on async writes than on async reads (since b130 or so). But the
intermittent async writes still get stuck behind roughly 267 milliseconds of
queued reads while the disk drains its active queue. If ZFS sends reads
continuously and writes only occasionally, reads will appear to dominate
completely. In older releases, where reads and writes had the same priority,
this looked even worse.

>                     extended device statistics
> device    r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
> sd6     422.1    0.0  54025.0      0.0   0.0  10.0   23.6   1 100
> sd7     422.1    0.0  54025.0      0.0   0.0  10.0   23.6   1 100
>
>                     extended device statistics
> device    r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
> sd6     370.0   11.0  47360.4    342.0   0.0  10.0   26.2   1 100
> sd7     327.0   16.0  41856.4    632.0   0.0   9.6   28.0   1 100
>
>                     extended device statistics
> device    r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
> sd6     388.0    7.0  49406.4    290.0   0.0   9.8   24.8   1 100
> sd7     409.0    1.0  52350.3      2.0   0.0   9.5   23.2   1  99
>
>                     extended device statistics
> device    r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
> sd6     423.0    0.0  54148.6      0.0   0.0  10.0   23.6   1 100
> sd7     413.0    0.0  52868.5      0.0   0.0  10.0   24.2   1 100
>
>                     extended device statistics
> device    r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
> sd6     400.0    2.0  51081.2      2.0   0.0  10.0   24.8   1 100
> sd7     384.0    4.0  49153.2      4.0   0.0  10.0   25.7   1 100
>
>                     extended device statistics
> device    r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
> sd6     401.9    1.0  51448.9      8.0   0.0  10.0   24.8   1 100
> sd7     424.9    0.0  54392.4      0.0   0.0  10.0   23.5   1 100
>
>                     extended device statistics
> device    r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
> sd6     215.1  208.1  26751.9  25433.5   0.0   9.3   22.1   1 100
> sd7     189.1  216.1  24199.1  26833.9   0.0   8.9   22.1   1  91
>
>                     extended device statistics
> device    r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
> sd6     295.0  162.0  37756.8  20610.2   0.0  10.0   21.8   1 100
> sd7     307.0  150.0  39292.6  19198.4   0.0  10.0   21.8   1 100
>
>                     extended device statistics
> device    r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
> sd6     405.0    2.0  51843.8      6.0   0.0  10.0   24.5   1 100
> sd7     408.0    3.0  52227.8     10.0   0.0  10.0   24.3   1 100
>
> Bottom line is that ZFS does not seem to care about getting my writes to
> disk when there is a heavy read workload.
>
> I have also confirmed that it's not the RAID controller either - behaviour
> is identical with direct-attach SATA.
>
> But - to your excellent theory: setting zfs_vdev_max_pending to 1 causes
> things to swing dramatically!
> - At 1, writes proceed much more than reads - 20MB/s read per spindle :
>   35MB/s write per spindle.
> - At 2, writes still outstrip reads - 15MB/s read per spindle : 44MB/s
>   write.

Though the NexentaStor docs recommend "1" for SATA disks, I find that "2"
works better.

> - At 3, it's starting to lean more heavily to reads again, but writes at
>   least get a whack - 35MB/s per spindle read : 15-20MB/s write.
> - At 4, we are closer to 35-40MB/s read, 15MB/s write.

Isn't queueing theory fun! :-)

> By the time we get back to the default of 0xa, writes drop off almost
> completely.
>
> The crossover (on the box with no RAID controller) seems to be 5. Anything
> more than that, and writes get shouldered out of the way almost completely.
>
> So - aside from the obvious - manually setting zfs_vdev_max_pending - do
> you have any thoughts on ZFS being able to make this sort of determination
> by itself? It would be somewhat of a shame to bust out such 'whacky knobs'
> for plain old direct-attach SATA disks just to get balance...
>
> Also - can I set this property per-vdev? (just in case I have SATA and,
> say, a USP-V connected to the same box)

Today, there is not a per-vdev setting. There are several changes in the
works for this and other scheduling. Incidentally, you can change the
priorities on the fly, so you could experiment with different settings for
mixed workloads. Obviously, non-mixed workloads won't be very interesting.
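The crossover behaviour Nathan measured can be illustrated with a toy model;
this is emphatically not ZFS code, just a back-of-the-envelope sketch under
simplifying assumptions (fixed per-I/O service time taken from the iostat
samples, a saturating read stream, FIFO service at the disk). Once I/Os are
at the disk, the host cannot reorder them, so a freshly issued write waits
behind up to max_pending reads:

```python
# Toy model of the vdev active queue: the host keeps up to max_pending
# I/Os outstanding at a disk with fixed service time. Reads are always
# available; the host prefers a pending write, but the write still
# drains FIFO behind everything already queued at the disk.

SVC_T_MS = 25.0  # assumed per-I/O service time, as in the iostat samples

def write_latency_ms(max_pending, svc_t_ms=SVC_T_MS):
    """Approximate latency of a write arriving at a read-saturated disk:
    wait for a slot to free (~one service time), then for the
    (max_pending - 1) reads still ahead of it, then its own service."""
    return svc_t_ms + (max_pending - 1) * svc_t_ms + svc_t_ms

for depth in (1, 2, 5, 10):
    print(f"max_pending={depth:2d}: ~{write_latency_ms(depth):5.0f} ms write latency")
```

At depth 10 the model gives ~275 ms, close to the 267 ms computed from the
sd6 iostat sample, while at depth 1-2 a write sneaks in within 50-75 ms --
consistent with the dramatic swing Nathan saw when lowering the tunable.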
Also, FWIW, for SATA disks in particular, it is not unusual for us to
recommend dropping zfs_vdev_max_pending to 2. It can make a big difference
for some workloads.
 -- richard

_______________________________________________
zfs-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
