Hi all,

Exec summary: I have a situation where a large streaming read workload is starving writes from getting through to disk.

Some detail:
I have a newly constructed box (was an old box, but blew the mobo - different story - sigh).

Anyhoo - it's a Gigabyte 890GPA-UD3H - with lots of onboard SATA - and an HP P400 RAID controller (PCI-E, 512MB, battery-backed, presenting 2 spindles as single-member stripes - so, yeah, the nearest thing to JBOD this controller gets to)

pci bus 0x0002 cardnum 0x00 function 0x00: vendor 0x103c device 0x3230
 Hewlett-Packard Company Smart Array Controller

And it's off this HP controller that I'm hanging my data zpool.

config:

    NAME        STATE     READ WRITE CKSUM
    data        ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        c0t0d0  ONLINE       0     0     0
        c0t1d0  ONLINE       0     0     0

CPU is an AMD Phenom II X6 1075T (6 cores), for what it's worth.

I guess my problem is more one that the ZFS folks should be aware of than something directly impacting me, as the workload I created is not something I typically run - but it is something I can easily see impacting customers, and in a nasty way, should they encounter it. It *is* also a case I'll create from time to time - when I'm moving DVD images backwards and forwards...

I was stress testing the box, giving the new kit's legs a stretch, and kicked off the following:
 - create a test file to use as the source for my 'full speed streaming write' (lazy way)
 - dd if=/dev/urandom > /tmp/1
(and let that run for a few seconds, creating about 100MB of random junk.)
 - start some jobs
    - while :; do cat /tmp/1 >> /data/delete.me/2; done &
        (The write workload, which is fine and dandy by itself)
    - while :; do dd if=/data/delete.me/2 of=/dev/null bs=65536; done &

Before I kicked off the read workload, everything looked as expected. I was getting between 40 and 60MB/s to each of the disks and all was good. BUT - as soon as I introduced the read workload, my write throughput dropped to virtually zero, and remained there until the read workload was killed.

The starvation is immediate. I can 100% reproducibly go from many MB/s of write throughput with no read workload to virtually 0MB/s of write throughput, simply by kicking off that reading dd. Write performance picks up again as soon as I kill the read workload. It also behaves the same way if the file I'm reading is NOT the same one I'm writing to (e.g. cat >> file3 while the dd reads file 2).
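
For anyone wanting to watch this while reproducing it, something as simple as the following (5-second samples) is enough to see the write numbers collapse the moment the reading dd starts:

    # per-vdev view of the pool
    zpool iostat -v data 5

    # raw device view, non-zero devices only
    iostat -xnz 5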

Other things to know about the system:
 - Disks are Seagate 2TB, 512-byte sector SATA disks
 - OS is Solaris 11 Express (build 151a)
- zpool version is old. I'm still hedging my bets on having to go back to Nevada (sxce, build 124 or so, which is what I was at before installing s11express)
    Cached configuration:
            version: 19
 - Plenty of space remains in the pool -
    bash-4.0$ zpool list
    NAME    SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
    data   1.81T  1.34T   480G    74%  1.00x  ONLINE  -
 - The box has 8GB of memory - and ZFS is getting a fair whack at it.
> ::memstat
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                     211843               827   11%
ZFS File Data             1426054              5570   73%
Anon                       106814               417    5%
Exec and libs                9364                36    0%
Page cache                  47192               184    2%
Free (cachelist)            31448               122    2%
Free (freelist)            130431               509    7%

Total                     1963146              7668
Physical                  1963145              7668
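
If anyone suspects the ARC chewing ~5.5GB is part of the picture, I'm happy to retest with it capped - my understanding (untested on this box) is that it's just an /etc/system entry and a reboot, something like:

    * cap the ARC at 2GB (value in bytes) - untested here
    set zfs:zfs_arc_max = 0x80000000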

 - Rest of the zfs dataset properties:
        # zfs get all data
        NAME  PROPERTY               VALUE                  SOURCE
        data  type                   filesystem             -
        data  creation               Mon May 24 10:46 2010  -
        data  used                   1.34T                  -
        data  available              451G                   -
        data  referenced             500G                   -
        data  compressratio          1.02x                  -
        data  mounted                yes                    -
        data  quota                  none                   default
        data  reservation            none                   default
        data  recordsize             128K                   default
        data  mountpoint             /data                  default
        data  sharenfs               ro,anon=0              local
        data  checksum               on                     default
        data  compression            off                    local
        data  atime                  off                    local
        data  devices                on                     default
        data  exec                   on                     default
        data  setuid                 on                     default
        data  readonly               off                    default
        data  zoned                  off                    default
        data  snapdir                hidden                 default
        data  aclinherit             restricted             default
        data  canmount               on                     default
        data  xattr                  on                     default
        data  copies                 1                      default
        data  version                3                      -
        data  utf8only               off                    -
        data  normalization          none                   -
        data  casesensitivity        sensitive              -
        data  vscan                  off                    default
        data  nbmand                 off                    default
        data  sharesmb               off                    default
        data  refquota               none                   default
        data  refreservation         none                   local
        data  primarycache           all                    default
        data  secondarycache         all                    default
        data  usedbysnapshots        12.2G                  -
        data  usedbydataset          500G                   -
        data  usedbychildren         864G                   -
        data  usedbyrefreservation   0                      -
        data  logbias                latency                default
        data  dedup                  off                    default
        data  mlslabel               none                   default
        data  sync                   standard               default
        data  encryption             off                    -
        data  keysource              none                   default
        data  keystatus              none                   -
        data  rekeydate              -                      default
        data  rstchown               on                     default
        data  com.sun:auto-snapshot  true                   local
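
Given primarycache=all above, one experiment I can run is stopping the streaming read's data from being cached in the ARC for that dataset, to see whether it's the caching of the big read that's pushing the writes out. Assuming delete.me is its own dataset (otherwise this would go on data itself):

    # cache only metadata in the ARC for this dataset
    zfs set primarycache=metadata data/delete.me

    # put it back afterwards
    zfs set primarycache=all data/delete.me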


Obviously, the potential for performance issues is considerable - and should it be required, I can provide some other detail, but given that this is so easy to reproduce, I thought I'd get it out there, just in case.

It is also worthy of note that commands like 'zfs list' take anywhere from 20 to 40 seconds to run when I have that sort of workload running - which also seems less than optimal.
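
If numbers would help, this is the sort of thing I'd capture to back that up (truss -D prints per-call time deltas, so it should show where it's sitting):

    time zfs list > /dev/null
    truss -D -o /tmp/zfslist.truss zfs list > /dev/null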

I tried to recreate this issue on the boot pool (rpool), which is a single 2.5" 7200rpm disk, to take the caching RAID controller out of the configuration - but this seemed to hard-hang the system (yep - even caps-lock / num-lock were non-responsive). I did not have any watchdog/snooping set, and I ran out of steam myself, so I just hit the big button.

When I get the chance, I'll give the rpool thing a crack again, but overall, it seems to me that the behavior I'm observing is not great...
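
Before I have another crack at the rpool test, I'll probably arm the deadman timer so a hard hang at least gets me a panic and a dump rather than a wedged box - as I understand it that's just another /etc/system entry plus a reboot:

    * arm the deadman timer - a hard hang should panic instead of wedging
    set snooping=1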

I'm also happy to supply lockstats / dtrace output etc if it'll help.
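
By way of example, this is the sort of thing I'd plan to capture while the starvation is in progress - shout if different options would be more useful:

    # kernel profiling + lock statistics for 30 seconds, top 20 entries
    lockstat -kIW -D 20 sleep 30

    # count reads vs writes actually reaching the devices (Ctrl-C to print)
    dtrace -n 'io:::start { @[args[0]->b_flags & B_READ ? "read" : "write"] = count(); }'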

Thoughts?

Cheers!

Nathan.



