Hi all,

Exec summary: I have a situation where a large streaming read workload is starving writes from getting through to disk.

Some detail:
I have a newly constructed box (was an old box, but blew the mobo - different story - sigh).

Anyhoo - it's a Gigabyte 890GPA-UD3H - with lots of onboard SATA - and an HP P400 RAID controller (PCI-E, 512MB, battery-backed, presenting 2 spindles as single-member stripes - so, yeah, the nearest thing to JBOD this controller gets to)

pci bus 0x0002 cardnum 0x00 function 0x00: vendor 0x103c device 0x3230
 Hewlett-Packard Company Smart Array Controller

And it's off this HP controller that I'm hanging my data zpool.

config:

    NAME        STATE     READ WRITE CKSUM
    data        ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        c0t0d0  ONLINE       0     0     0
        c0t1d0  ONLINE       0     0     0

CPU is an AMD Phenom II X6 1075T (6 cores), for what it's worth.

I guess my problem is more one that the ZFS folks should be aware of than something directly impacting me, as the workload I created is not something I typically run - but it is something I can easily see impacting customers, and in a nasty way, should they encounter it. It *is* also a case I'll create from time to time - when I'm moving DVD images backwards and forwards...

I was stress testing the box, giving the new kit's legs a stretch, and kicked off the following:
 - create a test file to use as the source for my 'full speed streaming write' (lazy way)
 - dd if=/dev/urandom > /tmp/1
(and let that run for a few seconds, creating about 100MB of random junk.)
 - start some jobs
    - while :; do cat /tmp/1 >> /data/delete.me/2; done &
        (The write workload, which is fine and dandy by itself)
    - while :; do dd if=/data/delete.me/2 of=/dev/null bs=65536; done &

Before I kicked off the read workload, everything looked as expected. I was getting between 40 and 60MB/s to each of the disks and all was good. BUT - as soon as I introduced the read workload, my write throughput dropped to virtually zero, and remained there until the read workload was killed.

The starvation is immediate. I can 100% reproducibly go from many MB/s of write throughput with no read workload to virtually 0MB/s of write throughput, simply by kicking off that reading dd. Write performance picks up again as soon as I kill the read workload. It also behaves the same way if the file I'm reading is NOT the same one I'm writing to (e.g. cat >> file3 while the dd reads file 2).
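
For anyone wanting to watch this while reproducing it, something as simple as the following (5-second samples) is enough to see the write numbers collapse the moment the reading dd starts:

    # per-vdev view of the pool
    zpool iostat -v data 5

    # raw device view, non-zero devices only
    iostat -xnz 5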

Other things to know about the system:
 - Disks are Seagate 2TB, 512-byte sector SATA disks
 - OS is Solaris 11 Express (build 151a)
- zpool version is old. I'm still hedging my bets on having to go back to Nevada (sxce, build 124 or so, which is what I was at before installing s11express)
    Cached configuration:
            version: 19
 - Plenty of space remains in the pool -
    bash-4.0$ zpool list
    NAME    SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
    data   1.81T  1.34T   480G    74%  1.00x  ONLINE  -
 - The box has 8GB of memory - and ZFS is getting a fair whack at it.
> ::memstat
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                     211843               827   11%
ZFS File Data             1426054              5570   73%
Anon                       106814               417    5%
Exec and libs                9364                36    0%
Page cache                  47192               184    2%
Free (cachelist)            31448               122    2%
Free (freelist)            130431               509    7%

Total                     1963146              7668
Physical                  1963145              7668
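
If anyone suspects the ARC chewing ~5.5GB is part of the picture, I'm happy to retest with it capped - my understanding (untested on this box) is that it's just an /etc/system entry and a reboot, something like:

    * cap the ARC at 2GB (value in bytes) - untested here
    set zfs:zfs_arc_max = 0x80000000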

 - Rest of the zfs dataset properties:
        # zfs get all data
        NAME  PROPERTY               VALUE                  SOURCE
        data  type                   filesystem             -
        data  creation               Mon May 24 10:46 2010  -
        data  used                   1.34T                  -
        data  available              451G                   -
        data  referenced             500G                   -
        data  compressratio          1.02x                  -
        data  mounted                yes                    -
        data  quota                  none                   default
        data  reservation            none                   default
        data  recordsize             128K                   default
        data  mountpoint             /data                  default
        data  sharenfs               ro,anon=0              local
        data  checksum               on                     default
        data  compression            off                    local
        data  atime                  off                    local
        data  devices                on                     default
        data  exec                   on                     default
        data  setuid                 on                     default
        data  readonly               off                    default
        data  zoned                  off                    default
        data  snapdir                hidden                 default
        data  aclinherit             restricted             default
        data  canmount               on                     default
        data  xattr                  on                     default
        data  copies                 1                      default
        data  version                3                      -
        data  utf8only               off                    -
        data  normalization          none                   -
        data  casesensitivity        sensitive              -
        data  vscan                  off                    default
        data  nbmand                 off                    default
        data  sharesmb               off                    default
        data  refquota               none                   default
        data  refreservation         none                   local
        data  primarycache           all                    default
        data  secondarycache         all                    default
        data  usedbysnapshots        12.2G                  -
        data  usedbydataset          500G                   -
        data  usedbychildren         864G                   -
        data  usedbyrefreservation   0                      -
        data  logbias                latency                default
        data  dedup                  off                    default
        data  mlslabel               none                   default
        data  sync                   standard               default
        data  encryption             off                    -
        data  keysource              none                   default
        data  keystatus              none                   -
        data  rekeydate              -                      default
        data  rstchown               on                     default
        data  com.sun:auto-snapshot  true                   local
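
Given primarycache=all above, one experiment I can run is stopping the streaming read's data from being cached in the ARC for that dataset, to see whether it's the caching of the big read that's pushing the writes out. Assuming delete.me is its own dataset (otherwise this would go on data itself):

    # cache only metadata in the ARC for this dataset
    zfs set primarycache=metadata data/delete.me

    # put it back afterwards
    zfs set primarycache=all data/delete.me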


Obviously, the potential for performance issues is considerable - and should it be required, I can provide some other detail, but given that this is so easy to reproduce, I thought I'd get it out there, just in case.

It is also worthy of note that commands like 'zfs list' take anywhere from 20 to 40 seconds to run when I have that sort of workload running - which also seems less than optimal.
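
If numbers would help, this is the sort of thing I'd capture to back that up (truss -D prints per-call time deltas, so it should show where it's sitting):

    time zfs list > /dev/null
    truss -D -o /tmp/zfslist.truss zfs list > /dev/null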

I tried to recreate this issue on the boot pool (rpool), which is a single 2.5" 7200rpm disk, to take the caching RAID controller out of the configuration - but this seemed to hard-hang the system (yep - even caps-lock / num-lock were non-responsive). I did not have any watchdog/snooping set, and I ran out of steam myself, so I just hit the big button.

When I get the chance, I'll give the rpool thing a crack again, but overall, it seems to me that the behavior I'm observing is not great...
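
Before I have another crack at the rpool test, I'll probably arm the deadman timer so a hard hang at least gets me a panic and a dump rather than a wedged box - as I understand it that's just another /etc/system entry plus a reboot:

    * arm the deadman timer - a hard hang should panic instead of wedging
    set snooping=1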

I'm also happy to supply lockstats / dtrace output etc if it'll help.
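
By way of example, this is the sort of thing I'd plan to capture while the starvation is in progress - shout if different options would be more useful:

    # kernel profiling + lock statistics for 30 seconds, top 20 entries
    lockstat -kIW -D 20 sleep 30

    # count reads vs writes actually reaching the devices (Ctrl-C to print)
    dtrace -n 'io:::start { @[args[0]->b_flags & B_READ ? "read" : "write"] = count(); }'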

Thoughts?

Cheers!

Nathan.



