On 14/02/2011 4:31 AM, Richard Elling wrote:
On Feb 13, 2011, at 12:56 AM, Nathan Kroenert <nat...@tuneunix.com> wrote:
Hi all,
Exec summary: I have a situation where lots of large reads are starving
writes, stopping them from getting through to disk.
<snip>
What is the average service time of each disk? Multiply that by the average
active queue depth. If that number is greater than, say, 100ms, then the ZFS
I/O scheduler is not able to be very effective because the disks are too slow.
Reducing the active queue depth can help, see zfs_vdev_max_pending in the
ZFS Evil Tuning Guide. Faster disks helps, too.
NexentaStor fans, note that you can do this easily, on the fly, via the
Settings -> Preferences -> System web GUI.
-- richard
Hi Richard,
Long time no speak! Anyhoo - See below.
I'm unconvinced that faster disks would help. I think faster disks, at
least in what I'm observing, would make it suck just as bad, just
reading faster... ;) Maybe I'm missing something.
Queue depth is around 10 (the default, unchanged since install), and
average service time is about 25ms... Below are 1-second iostat samples -
while I've only included about 10 seconds' worth, it's representative of
what I'm seeing all the time.
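(Doing your sum on those numbers: ~10 outstanding I/Os x ~25ms average
service time = ~250ms of queued work per disk - well over your 100ms
threshold, so by that measure the ZFS scheduler indeed can't be effective
here.)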
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd6 360.9 13.0 46190.5 351.4 0.0 10.0 26.7 1 100
sd7 342.9 12.0 43887.3 329.9 0.0 10.0 28.1 1 100
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd6 422.1 0.0 54025.0 0.0 0.0 10.0 23.6 1 100
sd7 422.1 0.0 54025.0 0.0 0.0 10.0 23.6 1 100
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd6 370.0 11.0 47360.4 342.0 0.0 10.0 26.2 1 100
sd7 327.0 16.0 41856.4 632.0 0.0 9.6 28.0 1 100
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd6 388.0 7.0 49406.4 290.0 0.0 9.8 24.8 1 100
sd7 409.0 1.0 52350.3 2.0 0.0 9.5 23.2 1 99
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd6 423.0 0.0 54148.6 0.0 0.0 10.0 23.6 1 100
sd7 413.0 0.0 52868.5 0.0 0.0 10.0 24.2 1 100
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd6 400.0 2.0 51081.2 2.0 0.0 10.0 24.8 1 100
sd7 384.0 4.0 49153.2 4.0 0.0 10.0 25.7 1 100
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd6 401.9 1.0 51448.9 8.0 0.0 10.0 24.8 1 100
sd7 424.9 0.0 54392.4 0.0 0.0 10.0 23.5 1 100
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd6 215.1 208.1 26751.9 25433.5 0.0 9.3 22.1 1 100
sd7 189.1 216.1 24199.1 26833.9 0.0 8.9 22.1 1 91
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd6 295.0 162.0 37756.8 20610.2 0.0 10.0 21.8 1 100
sd7 307.0 150.0 39292.6 19198.4 0.0 10.0 21.8 1 100
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd6 405.0 2.0 51843.8 6.0 0.0 10.0 24.5 1 100
sd7 408.0 3.0 52227.8 10.0 0.0 10.0 24.3 1 100
Bottom line is that ZFS does not seem to care about getting my writes to
disk when there is a heavy read workload.
I have also confirmed that it's not the RAID controller - the behaviour is
identical with direct-attach SATA.
But - to your excellent theory: setting zfs_vdev_max_pending to 1 causes
things to swing dramatically!
- At 1, writes proceed well ahead of reads - ~20MB/s read per spindle vs
~35MB/s write per spindle.
- At 2, writes still outstrip reads - ~15MB/s read vs ~44MB/s write per
spindle.
- At 3, it's starting to lean more heavily toward reads again, but writes
at least get a whack - ~35MB/s read vs 15-20MB/s write per spindle.
- At 4, we're closer to 35-40MB/s read and ~15MB/s write.
By the time we get back to the default of 0xa (10), writes drop off almost
completely.
The crossover (on the box with no RAID controller) seems to be 5. Anything
more than that and writes get shouldered out of the way almost completely.
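For anyone wanting to poke the same knob, this is the usual approach from
the Evil Tuning Guide (5 here only because that's the crossover I'm seeing
- adjust to taste): a live change with mdb, plus the /etc/system line to
make it stick across reboots.

  # live change, takes effect immediately (0t = decimal)
  echo zfs_vdev_max_pending/W0t5 | mdb -kw

  # persistent across reboots - add this line to /etc/system
  set zfs:zfs_vdev_max_pending = 5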
So - aside from the obvious (manually setting zfs_vdev_max_pending) - do
you have any thoughts on ZFS being able to make this sort of determination
by itself? It would be a shame to have to bust out such wacky knobs just to
get balance on plain old direct-attach SATA disks...
Also - can I set this tunable per vdev (just in case I have SATA and, say,
a USP-V connected to the same box)?
Thanks again, and good to see you are still playing close by!
Cheers!
Nathan.
pci bus 0x0002 cardnum 0x00 function 0x00: vendor 0x103c device 0x3230
Hewlett-Packard Company Smart Array Controller
And it's off this HP controller that I'm hanging my data zpool.
config:
NAME STATE READ WRITE CKSUM
data ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
c0t0d0 ONLINE 0 0 0
c0t1d0 ONLINE 0 0 0
CPU is an AMD Phenom II X6 1075T (six cores), for what it's worth.
I guess my problem is more one that the ZFS folks should be aware of than
something directly impacting me, as the workload I created is not something
I typically see - but it is something I can easily see impacting customers,
and in a nasty way, should they encounter it. It *is* also a case I'll
create from time to time - when I'm moving DVD images backwards and
forwards...
I was stress testing the box, giving the new kit's legs a stretch, and
kicked off the following:
- Create a test file to use as the source for my 'full speed streaming
write' (the lazy way):
    dd if=/dev/urandom > /tmp/1
  (and let that run for a few seconds, creating about 100MB of random junk)
- Start the write workload, which is fine and dandy by itself:
    while :; do cat /tmp/1 >> /data/delete.me/2; done &
- Start the read workload:
    while :; do dd if=/data/delete.me/2 of=/dev/null bs=65536; done &
Before I kicked off the read workload, everything looked as expected: I was
getting between 40 and 60MB/s to each of the disks, and all was good. BUT -
as soon as I introduced the read workload, my write throughput dropped to
virtually zero and remained there until the read workload was killed.
The starvation is immediate. I can 100% reproducibly go from many MB/s of
write throughput with no read workload to virtually 0MB/s of write
throughput simply by kicking off that reading dd. Write performance picks
up again as soon as I kill the read workload. It also behaves the same way
if the file I'm reading is NOT the same one I'm writing to (e.g. cat >>
file3 while the dd reads file 2).
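For anyone who wants to reproduce this in one hit, here's the above rolled
into a sketch of a script (paths and timings are just the ones I used;
point it at your own pool):

  #!/bin/sh
  # build ~100MB of random junk to stream from (default 512-byte dd
  # reads - just let it run for a few seconds, as above)
  dd if=/dev/urandom > /tmp/1 &
  DD=$!
  sleep 10
  kill $DD

  mkdir -p /data/delete.me

  # streaming write: append the junk to a file in the pool, forever
  while :; do cat /tmp/1 >> /data/delete.me/2; done &
  WRITER=$!

  # sequential read of the same file - this is what starves the writer
  while :; do dd if=/data/delete.me/2 of=/dev/null bs=65536; done &
  READER=$!

  # watch the disks: kw/s collapses within seconds of the reader starting
  iostat -x 5
  # when finished: kill $WRITER $READER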
Other things to know about the system:
- Disks are Seagate 2TB, 512-byte sector SATA disks
- OS is Solaris 11 Express (build 151a)
- zpool version is old - I'm still hedging my bets on having to go back to
Nevada (SXCE build 124 or so, which is what I was running before installing
S11 Express).
Cached configuration:
version: 19
- Plenty of space remains in the pool -
bash-4.0$ zpool list
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
data 1.81T 1.34T 480G 74% 1.00x ONLINE -
- The box has 8GB of memory - and ZFS is getting a fair whack at it.
::memstat
Page Summary Pages MB %Tot
------------ ---------------- ---------------- ----
Kernel 211843 827 11%
ZFS File Data 1426054 5570 73%
Anon 106814 417 5%
Exec and libs 9364 36 0%
Page cache 47192 184 2%
Free (cachelist) 31448 122 2%
Free (freelist) 130431 509 7%
Total 1963146 7668
Physical 1963145 7668
- The rest of the ZFS dataset properties:
# zfs get all data
NAME PROPERTY VALUE SOURCE
data type filesystem -
data creation Mon May 24 10:46 2010 -
data used 1.34T -
data available 451G -
data referenced 500G -
data compressratio 1.02x -
data mounted yes -
data quota none default
data reservation none default
data recordsize 128K default
data mountpoint /data default
data sharenfs ro,anon=0 local
data checksum on default
data compression off local
data atime off local
data devices on default
data exec on default
data setuid on default
data readonly off default
data zoned off default
data snapdir hidden default
data aclinherit restricted default
data canmount on default
data xattr on default
data copies 1 default
data version 3 -
data utf8only off -
data normalization none -
data casesensitivity sensitive -
data vscan off default
data nbmand off default
data sharesmb off default
data refquota none default
data refreservation none local
data primarycache all default
data secondarycache all default
data usedbysnapshots 12.2G -
data usedbydataset 500G -
data usedbychildren 864G -
data usedbyrefreservation 0 -
data logbias latency default
data dedup off default
data mlslabel none default
data sync standard default
data encryption off -
data keysource none default
data keystatus none -
data rekeydate - default
data rstchown on default
data com.sun:auto-snapshot true local
Obviously, the potential for performance issues is considerable - and should it
be required, I can provide some other detail, but given that this is so easy to
reproduce, I thought I'd get it out there, just in case.
It is also worth noting that commands like 'zfs list' take anywhere from 20
to 40 seconds to run when I have that sort of workload running - which also
seems less than optimal.
I tried to recreate this issue on the boot pool (rpool), which is a single
2.5" 7200rpm disk (to take the cache controller out of the configuration) -
but this seemed to hard-hang the system (yep - even Caps Lock / Num Lock
were unresponsive). As I did not have any watchdog/snooping set up and had
run out of steam myself, I just hit the big button.
When I get the chance, I'll give the rpool thing a crack again, but
overall, it seems to me that the behaviour I'm observing is not great...
I'm also happy to supply lockstats / dtrace output etc if it'll help.
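For instance, something along these lines (stable io provider only, so it
should be safe to run) would show the per-second byte counts actually
reaching the disks, split into reads vs writes, while the starvation is
happening:

  # bytes issued to disk each second, reads vs writes
  dtrace -n 'io:::start { @[args[0]->b_flags & B_READ ? "read bytes" : "write bytes"] = sum(args[0]->b_bcount); }
             tick-1sec { printa(@); trunc(@); }'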
Thoughts?
Cheers!
Nathan.
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss