Hi all,
Exec summary: I have a situation where lots of large reads are starving
writes, stopping them from getting through to disk at all.
Some detail:
I have a newly constructed box (was an old box, but blew the mobo -
different story - sigh).
Anyhoo - it's a Gigabyte 890GPA-UD3H - with lots of onboard SATA - and
an HP P400 RAID controller (PCIe, 512MB cache, battery backed),
presenting 2 spindles as single-member stripes - so, yeah, the nearest
thing to JBOD that this controller gets to.
pci bus 0x0002 cardnum 0x00 function 0x00: vendor 0x103c device 0x3230
Hewlett-Packard Company Smart Array Controller
And it's off this HP controller that I'm hanging my data zpool.
config:
        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c0t0d0  ONLINE       0     0     0
            c0t1d0  ONLINE       0     0     0
CPU is an AMD Phenom II X6 1075T (6 cores), for what it's worth.
I guess my problem is more one that the ZFS folks should be aware of
than something directly impacting me, as the workload I have created is
not something I typically see - but it is something I can easily see
impacting customers, and in a nasty way, should they encounter it. It
*is* also a case I'll create from time to time myself - when I'm moving
DVD images backwards and forwards...
I was stress testing the box, giving the new kit's legs a stretch, and
kicked off the following:
- create a test file to use as the source for my 'full speed streaming
write' (the lazy way)
- dd if=/dev/urandom > /tmp/1
(and let that run for a few seconds, creating about 100MB of random
junk)
- start some jobs (a consolidated sketch of all of this follows just below)
- while :; do cat /tmp/1 >> /data/delete.me/2; done &
(the write workload, which is fine and dandy by itself)
- while :; do dd if=/data/delete.me/2 of=/dev/null bs=65536; done &
(the read workload)
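For convenience, here's all of that rolled into one rough script. It's
only a sketch: the /tmp/1 and /data/delete.me paths, the dd sizes and
the sleep intervals are just stand-ins for what I did by hand above -
adjust to taste.

#!/bin/sh
# Rough repro sketch - assumes a pool mounted at /data with a
# delete.me directory on it.

# ~100MB of random junk to use as the write source
dd if=/dev/urandom of=/tmp/1 bs=8k count=12800

# The streaming write workload (fine and dandy on its own)
while :; do cat /tmp/1 >> /data/delete.me/2; done &
writer=$!

# Let the writes run alone for a bit - each disk should sit at 40-60MB/s
sleep 60

# The streaming read workload - write throughput collapses as soon as this starts
while :; do dd if=/data/delete.me/2 of=/dev/null bs=65536; done &
reader=$!

# Kill the reader after a minute and watch the writes pick up again
sleep 60
kill $reader

# Tidy up the writer too (the in-flight cat/dd may take a moment to exit)
sleep 60
kill $writer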
Before I kicked off the read workload, everything looked as expected. I
was getting between 40 and 60MB/s to each of the disks and all was good.
BUT - as soon as I introduced the read workload, my write throughput
dropped to virtually zero, and remained there until the read workload
was killed.
The starvation is immediate. I can 100% reproducibly go from many MB/s
of write throughput with no read workload to virtually 0MB/s of write
throughput, simply by kicking off that reading dd. Write performance
picks up again as soon as I kill the read workload. It also behaves the
same way if the file I'm reading is NOT the same one I'm writing to
(e.g. cat >> file3 while the dd reads file2).
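For what it's worth, the collapse is easy to watch from another terminal
while the two loops are running - nothing more clever than the usual
suspects (assuming the pool is called 'data', as above):

# Per-second pool-level throughput: write bandwidth on 'data' drops to
# near zero the moment the reading dd starts, and recovers when it's killed
zpool iostat -v data 1

# Or the per-device view from the OS side
iostat -xn 1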
Other things to know about the system:
- Disks are Seagate 2TB, 512-byte sector SATA disks
- OS is Solaris 11 Express (build 151a)
- zpool version is old. I'm still hedging my bets on having to go back
to Nevada (SXCE, build 124 or so, which is what I was on before
installing S11 Express)
Cached configuration:
version: 19
- Plenty of space remains in the pool -
bash-4.0$ zpool list
NAME   SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
data  1.81T  1.34T   480G  74%  1.00x  ONLINE  -
- The box has 8GB of memory - and ZFS is getting a fair whack at it.
> ::memstat
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                     211843               827   11%
ZFS File Data             1426054              5570   73%
Anon                       106814               417    5%
Exec and libs                9364                36    0%
Page cache                  47192               184    2%
Free (cachelist)            31448               122    2%
Free (freelist)            130431               509    7%
Total                     1963146              7668
Physical                  1963145              7668
- Rest of the zfs dataset properties:
# zfs get all data
NAME  PROPERTY               VALUE                  SOURCE
data  type                   filesystem             -
data  creation               Mon May 24 10:46 2010  -
data  used                   1.34T                  -
data  available              451G                   -
data  referenced             500G                   -
data  compressratio          1.02x                  -
data  mounted                yes                    -
data  quota                  none                   default
data  reservation            none                   default
data  recordsize             128K                   default
data  mountpoint             /data                  default
data  sharenfs               ro,anon=0              local
data  checksum               on                     default
data  compression            off                    local
data  atime                  off                    local
data  devices                on                     default
data  exec                   on                     default
data  setuid                 on                     default
data  readonly               off                    default
data  zoned                  off                    default
data  snapdir                hidden                 default
data  aclinherit             restricted             default
data  canmount               on                     default
data  xattr                  on                     default
data  copies                 1                      default
data  version                3                      -
data  utf8only               off                    -
data  normalization          none                   -
data  casesensitivity        sensitive              -
data  vscan                  off                    default
data  nbmand                 off                    default
data  sharesmb               off                    default
data  refquota               none                   default
data  refreservation         none                   local
data  primarycache           all                    default
data  secondarycache         all                    default
data  usedbysnapshots        12.2G                  -
data  usedbydataset          500G                   -
data  usedbychildren         864G                   -
data  usedbyrefreservation   0                      -
data  logbias                latency                default
data  dedup                  off                    default
data  mlslabel               none                   default
data  sync                   standard               default
data  encryption             off                    -
data  keysource              none                   default
data  keystatus              none                   -
data  rekeydate              -                      default
data  rstchown               on                     default
data  com.sun:auto-snapshot  true                   local
Obviously, the potential for performance issues is considerable - and
should it be required, I can provide some other detail, but given that
this is so easy to reproduce, I thought I'd get it out there, just in case.
It is also worth noting that commands like 'zfs list' take anywhere
from 20 to 40 seconds to run when I have that sort of workload running -
which also seems less than optimal.
I tried to recreate this issue on the boot pool (rpool), which is a
single 2.5" 7200rpm disk (to take the cache controller out of the
configuration) - but this seemed to hard-hang the system (yep - even
caps-lock / num-lock were non-responsive). I did not have any
watchdog/snooping set up, and ran out of steam myself, so I just hit
the big button.
When I get the chance, I'll give the rpool thing a crack again, but
overall, it seems to me that the behavior I'm observing is not great...
I'm also happy to supply lockstat / dtrace output etc. if it'll help.
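In the meantime, here's the sort of thing I had in mind - just a sketch
using the stock io provider to show the read/write mix actually reaching
the disks each second (nothing assumed beyond the workloads above):

#!/usr/sbin/dtrace -s
/*
 * Sketch: count physical I/Os and bytes per second, split by direction,
 * to show how completely the reads crowd out the writes.
 */
io:::start
{
        @ops[args[0]->b_flags & B_READ ? "read" : "write"] = count();
        @bytes[args[0]->b_flags & B_READ ? "read" : "write"] =
            sum(args[0]->b_bcount);
}

tick-1sec
{
        printa("%-6s %@8d ops %@12d bytes\n", @ops, @bytes);
        trunc(@ops);
        trunc(@bytes);
}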
Thoughts?
Cheers!
Nathan.
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss