On Feb 13, 2011, at 12:56 AM, Nathan Kroenert <nat...@tuneunix.com> wrote:

> Hi all,
> 
> Exec summary: I have a situation where lots of large reads are starving 
> writes, preventing them from getting through to disk.
> 
> Some detail:
> I have a newly constructed box (was an old box, but blew the mobo - different 
> story - sigh).
> 
> Anyhoo - it's a Gigabyte 890GPA-UD3H with lots of onboard SATA, plus an HP 
> P400 RAID controller (PCI-E, 512MB, battery backed, presenting 2 spindles as 
> single-member stripes - so, yeah, the nearest thing to JBOD that this 
> controller gets to).

What is the average service time of each disk? Multiply that by the average
active queue depth. If that number is greater than, say, 100ms, then the ZFS
I/O scheduler cannot be very effective because the disks are too slow.
Reducing the active queue depth can help; see zfs_vdev_max_pending in the
ZFS Evil Tuning Guide. Faster disks help, too.
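
For example (the numbers and the queue depth below are illustrative only, not
a recipe):

  # asvc_t = avg active service time (ms), actv = avg active queue depth;
  # if asvc_t * actv is well over ~100ms, the scheduler has little room to work
  iostat -xn 5

  # try a smaller per-vdev queue on the fly (reverts at reboot):
  echo zfs_vdev_max_pending/W0t10 | mdb -kw

  # or make it persistent with this line in /etc/system:
  set zfs:zfs_vdev_max_pending = 10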

NexentaStor fans, note that you can do this easily, on the fly, via the
Settings -> Preferences -> System web GUI.
  -- richard

> 
> pci bus 0x0002 cardnum 0x00 function 0x00: vendor 0x103c device 0x3230
> Hewlett-Packard Company Smart Array Controller
> 
> And it's off this HP controller I'm handing my data zpool.
> 
> config:
> 
>    NAME        STATE     READ WRITE CKSUM
>    data        ONLINE       0     0     0
>      mirror-0  ONLINE       0     0     0
>        c0t0d0  ONLINE       0     0     0
>        c0t1d0  ONLINE       0     0     0
> 
> CPU is an AMD Phenom II 1075T (6 cores), for what it's worth.
> 
> I guess my problem is more one that the ZFS folks should be aware of than 
> something directly impacting me, as the workload I have created is not 
> something I typically see - but it is something I could easily see impacting 
> customers, and in a nasty way should they encounter it. It *is* also a case 
> I'll create from time to time, when I'm moving DVD images backwards and 
> forwards...
> 
> I was stress testing the box, giving the new kit's legs a stretch, and kicked 
> off the following:
> - create a test file to use as source for my 'full speed streaming write' 
> (lazy way)
> - dd if=/dev/urandom > /tmp/1
>    (and let that run for a few seconds, creating about 100MB of random junk.)
> - start some jobs
>    - while :; do cat /tmp/1 >> /data/delete.me/2; done &
>        (The write workload, which is fine and dandy by itself)
>    - while :; do dd if=/data/delete.me/2 of=/dev/null bs=65536; done &
> 
> Before I kicked off the read workload, everything looked as expected. I was 
> getting between 40 and 60MB/s to each of the disks and all was good. BUT - as 
> soon as I introduced the read workload, my write throughput dropped to 
> virtually zero, and remained there until the read workload was killed.
> 
> The starvation is immediate. I can 100% reproducibly go from many MB/s of 
> write throughput with no read workload to virtually 0MB/s write throughput, 
> simply through kicking off that reading dd. Write performance picks up again 
> as soon as I kill the read workload. It also behaves the same way if the file 
> I'm reading is NOT the same one I'm writing to (e.g. cat >> file3 and the dd 
> reading file 2).
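> 
> (For anyone reproducing this: something like 'zpool iostat -v data 5' or 
> 'iostat -xnz 5' running alongside the loops should make the read/write split 
> hitting the disks easy to see.)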
> 
> Other things to know about the system:
> - Disks are Seagate 2TB, 512 byte sector SATA disks
> - OS is Solaris 11 Express (build 151a)
> - zpool version is old. I'm still hedging my bets on having to go back to 
> Nevada (sxce, build 124 or so, which is what I was at before installing 
> s11express)
>    Cached configuration:
>            version: 19
> - Plenty of space remains in the pool -
>    bash-4.0$ zpool list
>    NAME    SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
>    data   1.81T  1.34T   480G    74%  1.00x  ONLINE  -
> - The box has 8GB of memory - and ZFS is getting a fair whack at it.
> > ::memstat
> Page Summary                Pages                MB  %Tot
> ------------     ----------------  ----------------  ----
> Kernel                     211843               827   11%
> ZFS File Data             1426054              5570   73%
> Anon                       106814               417    5%
> Exec and libs                9364                36    0%
> Page cache                  47192               184    2%
> Free (cachelist)            31448               122    2%
> Free (freelist)            130431               509    7%
> 
> Total                     1963146              7668
> Physical                  1963145              7668
> 
> - Rest of the zfs dataset properties:
>        # zfs get all data
>        NAME  PROPERTY               VALUE                  SOURCE
>        data  type                   filesystem             -
>        data  creation               Mon May 24 10:46 2010  -
>        data  used                   1.34T                  -
>        data  available              451G                   -
>        data  referenced             500G                   -
>        data  compressratio          1.02x                  -
>        data  mounted                yes                    -
>        data  quota                  none                   default
>        data  reservation            none                   default
>        data  recordsize             128K                   default
>        data  mountpoint             /data                  default
>        data  sharenfs               ro,anon=0              local
>        data  checksum               on                     default
>        data  compression            off                    local
>        data  atime                  off                    local
>        data  devices                on                     default
>        data  exec                   on                     default
>        data  setuid                 on                     default
>        data  readonly               off                    default
>        data  zoned                  off                    default
>        data  snapdir                hidden                 default
>        data  aclinherit             restricted             default
>        data  canmount               on                     default
>        data  xattr                  on                     default
>        data  copies                 1                      default
>        data  version                3                      -
>        data  utf8only               off                    -
>        data  normalization          none                   -
>        data  casesensitivity        sensitive              -
>        data  vscan                  off                    default
>        data  nbmand                 off                    default
>        data  sharesmb               off                    default
>        data  refquota               none                   default
>        data  refreservation         none                   local
>        data  primarycache           all                    default
>        data  secondarycache         all                    default
>        data  usedbysnapshots        12.2G                  -
>        data  usedbydataset          500G                   -
>        data  usedbychildren         864G                   -
>        data  usedbyrefreservation   0                      -
>        data  logbias                latency                default
>        data  dedup                  off                    default
>        data  mlslabel               none                   default
>        data  sync                   standard               default
>        data  encryption             off                    -
>        data  keysource              none                   default
>        data  keystatus              none                   -
>        data  rekeydate              -                      default
>        data  rstchown               on                     default
>        data  com.sun:auto-snapshot  true                   local
> 
> 
> Obviously, the potential for performance issues is considerable - and should 
> it be required, I can provide some other detail, but given that this is so 
> easy to reproduce, I thought I'd get it out there, just in case.
> 
> It is also worth noting that commands like 'zfs list' take anywhere from 20 
> to 40 seconds to run when I have that sort of workload running - which also 
> seems less than optimal.
> 
> I tried to recreate this issue on the boot pool (rpool), which is a single 
> 2.5" 7200rpm disk (to take the cache controller out of the configuration), 
> but this seemed to hard-hang the system (yep - even caps lock / num-lock were 
> non-responsive). Since I did not have any watchdog/snooping set, and had run 
> out of steam myself, I just hit the big button.
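> 
> (Note for next time: if I remember right, 'set snooping=1' in /etc/system 
> arms the kernel deadman timer, so a hang like that should panic and leave a 
> dump rather than just wedging.)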
> 
> When I get the chance, I'll give the rpool thing a crack again, but overall, 
> it seems to me that the behavior I'm observing is not great...
> 
> I'm also happy to supply lockstats / dtrace output etc if it'll help.
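> 
> (A rough starting point might be the usual io-provider one-liner to count 
> reads vs writes every few seconds - something like:
> 
>   dtrace -n 'io:::start { @[args[0]->b_flags & B_READ ? "read" : "write"] = count(); } tick-5sec { printa(@); trunc(@); }'
> 
> but happy to run whatever would be more useful.)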
> 
> Thoughts?
> 
> Cheers!
> 
> Nathan.
> 
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
