On Feb 13, 2011, at 12:56 AM, Nathan Kroenert <nat...@tuneunix.com> wrote:
> Hi all,
>
> Exec summary: I have a situation where I'm seeing lots of large reads
> starving writes from being able to get through to disk.
>
> Some detail:
> I have a newly constructed box (was an old box, but blew the mobo - different
> story - sigh).
>
> Anyhoo - it's a Gigabyte 890GPA-UD3H - with lots of onboard SATA - and an HP
> P400 RAID controller (PCI-E, 512MB, battery-backed, presenting 2 spindles as
> single-member stripes, so, yeah, the nearest thing to JBOD that this
> controller gets to).

What is the average service time of each disk? Multiply that by the average
active queue depth. If that number is greater than, say, 100ms, then the ZFS
I/O scheduler cannot be very effective because the disks are too slow.
Reducing the active queue depth can help; see zfs_vdev_max_pending in the
ZFS Evil Tuning Guide. Faster disks help, too.

NexentaStor fans, note that you can do this easily, on the fly, via the
Settings -> Preferences -> System web GUI.
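
For a quick back-of-the-envelope check, the actv (average active queue depth)
and asvc_t (average service time, in milliseconds) columns of iostat give you
both numbers per device, so their product approximates the latency the
scheduler is up against. Something along these lines, sampled while both
workloads are running, would show it for the two pool disks:

    # watch the pool disks once a second (keep the header lines for readability)
    iostat -xn 1 | egrep 'device|c0t0d0|c0t1d0'

    # rule of thumb: if actv * asvc_t stays well above ~100ms, the disks
    # are queueing far more work than they can retire promptly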
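
And if you want to experiment with lowering zfs_vdev_max_pending, a minimal
sketch - the value 4 below is purely an example, not a recommendation:

    # on the fly, on the running kernel
    echo zfs_vdev_max_pending/W0t4 | mdb -kw

    # or persistently, by adding this line to /etc/system and rebooting
    set zfs:zfs_vdev_max_pending = 4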

 -- richard

> pci bus 0x0002 cardnum 0x00 function 0x00: vendor 0x103c device 0x3230
>  Hewlett-Packard Company Smart Array Controller
>
> And it's off this HP controller I'm hanging my data zpool.
>
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         data        ONLINE       0     0     0
>           mirror-0  ONLINE       0     0     0
>             c0t0d0  ONLINE       0     0     0
>             c0t1d0  ONLINE       0     0     0
>
> CPU is an AMD Phenom II 6-core 1075T, for what it's worth.
>
> I guess my problem is more one that the ZFS folks should be aware of rather
> than something directly impacting me, as the workload I have created is not
> something I typically see - but it is something I see easily impacting
> customers - and in a nasty way should they encounter it. It *is* also a case
> I'll create from time to time - when I'm moving DVD images backwards and
> forwards...
>
> I was stress testing the box, giving the new kit's legs a stretch, and kicked
> off the following:
> - create a test file to use as source for my 'full speed streaming write'
>   (lazy way)
> - dd if=/dev/urandom > /tmp/1
>   (and let that run for a few seconds, creating about 100MB of random junk)
> - start some jobs
> - while :; do cat /tmp/1 >> /data/delete.me/2; done &
>   (the write workload, which is fine and dandy by itself)
> - while :; do dd if=/data/delete.me/2 of=/dev/null bs=65536; done &
>
> Before I kicked off the read workload, everything looked as expected. I was
> getting between 40 and 60MB/s to each of the disks and all was good. BUT -
> as soon as I introduced the read workload, my write throughput dropped to
> virtually zero, and remained there until the read workload was killed.
>
> The starvation is immediate. I can 100% reproducibly go from many MB/s of
> write throughput with no read workload to virtually 0MB/s write throughput,
> simply by kicking off that reading dd. Write performance picks up again as
> soon as I kill the read workload. It also behaves the same way if the file
> I'm reading is NOT the same one I'm writing to (e.g. cat >> file3 and the dd
> reading file 2).
>
> Other things to know about the system:
> - Disks are Seagate 2TB, 512-byte-sector SATA disks
> - OS is Solaris 11 Express (build 151a)
> - zpool version is old. I'm still hedging my bets on having to go back to
>   Nevada (sxce, build 124 or so, which is what I was at before installing
>   s11express)
>   Cached configuration:
>           version: 19
> - Plenty of space remains in the pool:
>   bash-4.0$ zpool list
>   NAME   SIZE  ALLOC  FREE  CAP  DEDUP  HEALTH  ALTROOT
>   data  1.81T  1.34T  480G  74%  1.00x  ONLINE  -
> - The box has 8GB of memory - and ZFS is getting a fair whack at it.
>
> > ::memstat
> Page Summary                Pages                MB  %Tot
> ------------     ----------------  ----------------  ----
> Kernel                     211843               827   11%
> ZFS File Data             1426054              5570   73%
> Anon                       106814               417    5%
> Exec and libs                9364                36    0%
> Page cache                  47192               184    2%
> Free (cachelist)            31448               122    2%
> Free (freelist)            130431               509    7%
>
> Total                     1963146              7668
> Physical                  1963145              7668
>
> - Rest of the zfs dataset properties:
>
> # zfs get all data
> NAME  PROPERTY              VALUE                  SOURCE
> data  type                  filesystem             -
> data  creation              Mon May 24 10:46 2010  -
> data  used                  1.34T                  -
> data  available             451G                   -
> data  referenced            500G                   -
> data  compressratio         1.02x                  -
> data  mounted               yes                    -
> data  quota                 none                   default
> data  reservation           none                   default
> data  recordsize            128K                   default
> data  mountpoint            /data                  default
> data  sharenfs              ro,anon=0              local
> data  checksum              on                     default
> data  compression           off                    local
> data  atime                 off                    local
> data  devices               on                     default
> data  exec                  on                     default
> data  setuid                on                     default
> data  readonly              off                    default
> data  zoned                 off                    default
> data  snapdir               hidden                 default
> data  aclinherit            restricted             default
> data  canmount              on                     default
> data  xattr                 on                     default
> data  copies                1                      default
> data  version               3                      -
> data  utf8only              off                    -
> data  normalization         none                   -
> data  casesensitivity       sensitive              -
> data  vscan                 off                    default
> data  nbmand                off                    default
> data  sharesmb              off                    default
> data  refquota              none                   default
> data  refreservation        none                   local
> data  primarycache          all                    default
> data  secondarycache        all                    default
> data  usedbysnapshots       12.2G                  -
> data  usedbydataset         500G                   -
> data  usedbychildren        864G                   -
> data  usedbyrefreservation  0                      -
> data  logbias               latency                default
> data  dedup                 off                    default
> data  mlslabel              none                   default
> data  sync                  standard               default
> data  encryption            off                    -
> data  keysource             none                   default
> data  keystatus             none                   -
> data  rekeydate             -                      default
> data  rstchown              on                     default
> data  com.sun:auto-snapshot true                   local
>
> Obviously, the potential for performance issues is considerable - and should
> it be required, I can provide some other detail, but given that this is so
> easy to reproduce, I thought I'd get it out there, just in case.
>
> It is also worthy of note that commands like 'zfs list' take anywhere from 20
> to 40 seconds to run when I have that sort of workload running - which also
> seems less than optimal.
>
> I tried to recreate this issue on the boot pool (rpool), which is a single
> 2.5" 7200rpm disk (to take the cache controller out of the configuration) -
> but this seemed to hard-hang the system (yep - even caps lock / num-lock were
> non-responsive) - and I did not have any watchdog/snooping set up and ran out
> of steam myself, so I just hit the big button.
>
> When I get the chance, I'll give the rpool thing a crack again, but overall,
> it seems to me that the behavior I'm observing is not great...
>
> I'm also happy to supply lockstat / dtrace output etc. if it'll help.
>
> Thoughts?
>
> Cheers!
>
> Nathan.