On 14/02/2011 4:31 AM, Richard Elling wrote:
On Feb 13, 2011, at 12:56 AM, Nathan Kroenert <nat...@tuneunix.com> wrote:
Hi all,
Exec summary: I have a situation where lots of large reads are starving
writes, stopping them from getting through to disk.
<snip>
What is the average service time of each disk? Multiply that by the average
active queue depth. If that number is greater than, say, 100ms, then the ZFS
I/O scheduler is not able to be very effective because the disks are too slow.
Reducing the active queue depth can help, see zfs_vdev_max_pending in the
ZFS Evil Tuning Guide. Faster disks helps, too.
NexentaStor fans, note that you can do this easily, on the fly, via the
Settings -> Preferences -> System web GUI.
-- richard
Hi Richard,
Long time no speak! Anyhoo - See below.
I'm unconvinced that faster disks would help. I think faster disks, at
least in what I'm observing, would make it suck just as bad, just
reading faster... ;) Maybe I'm missing something.
Queue depth is around 10 (the default, unchanged since install), and
average service time is about 25ms... Below are 1-second iostat samples -
while I've only included about 10 seconds' worth, it's representative of
what I'm seeing all the time.
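(Doing your sum on those numbers: ~10 outstanding I/Os x ~25ms average
service time = ~250ms of queued work per disk - well over your 100ms
threshold, so by that measure the ZFS scheduler indeed can't be effective
here.)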
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd6 360.9 13.0 46190.5 351.4 0.0 10.0 26.7 1 100
sd7 342.9 12.0 43887.3 329.9 0.0 10.0 28.1 1 100
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd6 422.1 0.0 54025.0 0.0 0.0 10.0 23.6 1 100
sd7 422.1 0.0 54025.0 0.0 0.0 10.0 23.6 1 100
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd6 370.0 11.0 47360.4 342.0 0.0 10.0 26.2 1 100
sd7 327.0 16.0 41856.4 632.0 0.0 9.6 28.0 1 100
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd6 388.0 7.0 49406.4 290.0 0.0 9.8 24.8 1 100
sd7 409.0 1.0 52350.3 2.0 0.0 9.5 23.2 1 99
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd6 423.0 0.0 54148.6 0.0 0.0 10.0 23.6 1 100
sd7 413.0 0.0 52868.5 0.0 0.0 10.0 24.2 1 100
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd6 400.0 2.0 51081.2 2.0 0.0 10.0 24.8 1 100
sd7 384.0 4.0 49153.2 4.0 0.0 10.0 25.7 1 100
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd6 401.9 1.0 51448.9 8.0 0.0 10.0 24.8 1 100
sd7 424.9 0.0 54392.4 0.0 0.0 10.0 23.5 1 100
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd6 215.1 208.1 26751.9 25433.5 0.0 9.3 22.1 1 100
sd7 189.1 216.1 24199.1 26833.9 0.0 8.9 22.1 1 91
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd6 295.0 162.0 37756.8 20610.2 0.0 10.0 21.8 1 100
sd7 307.0 150.0 39292.6 19198.4 0.0 10.0 21.8 1 100
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd6 405.0 2.0 51843.8 6.0 0.0 10.0 24.5 1 100
sd7 408.0 3.0 52227.8 10.0 0.0 10.0 24.3 1 100
Bottom line is that ZFS does not seem to care about getting my writes to
disk when there is a heavy read workload.
I have also confirmed that it's not the RAID controller - the behaviour is
identical with direct-attach SATA.
But - to your excellent theory: setting zfs_vdev_max_pending to 1 causes
things to swing dramatically!
- At 1, writes proceed well ahead of reads - ~20MB/s read per spindle vs
~35MB/s write per spindle.
- At 2, writes still outstrip reads - ~15MB/s read vs ~44MB/s write per
spindle.
- At 3, it's starting to lean more heavily toward reads again, but writes
at least get a whack - ~35MB/s read vs 15-20MB/s write per spindle.
- At 4, we're closer to 35-40MB/s read and ~15MB/s write.
By the time we get back to the default of 0xa (10), writes drop off almost
completely.
The crossover (on the box with no RAID controller) seems to be 5. Anything
more than that and writes get shouldered out of the way almost completely.
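For anyone wanting to poke the same knob, this is the usual approach from
the Evil Tuning Guide (5 here only because that's the crossover I'm seeing
- adjust to taste): a live change with mdb, plus the /etc/system line to
make it stick across reboots.

  # live change, takes effect immediately (0t = decimal)
  echo zfs_vdev_max_pending/W0t5 | mdb -kw

  # persistent across reboots - add this line to /etc/system
  set zfs:zfs_vdev_max_pending = 5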
So - aside from the obvious (manually setting zfs_vdev_max_pending) - do
you have any thoughts on ZFS being able to make this sort of determination
by itself? It would be a shame to have to bust out such wacky knobs just to
get balance on plain old direct-attach SATA disks...
Also - can I set this tunable per vdev (just in case I have SATA and, say,
a USP-V connected to the same box)?
Thanks again, and good to see you are still playing close by!
Cheers!
Nathan.
pci bus 0x0002 cardnum 0x00 function 0x00: vendor 0x103c device 0x3230
Hewlett-Packard Company Smart Array Controller
And it's off this HP controller that I'm hanging my data zpool.
config:
NAME STATE READ WRITE CKSUM
data ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
c0t0d0 ONLINE 0 0 0
c0t1d0 ONLINE 0 0 0
CPU is an AMD Phenom II X6 1075T (six cores), for what it's worth.
I guess my problem is more one that the ZFS folks should be aware of than
something directly impacting me, as the workload I created is not something
I typically see - but it is something I can easily see impacting customers,
and in a nasty way, should they encounter it. It *is* also a case I'll
create from time to time - when I'm moving DVD images backwards and
forwards...
I was stress testing the box, giving the new kit's legs a stretch, and
kicked off the following:
- Create a test file to use as the source for my 'full speed streaming
write' (the lazy way):
    dd if=/dev/urandom > /tmp/1
  (and let that run for a few seconds, creating about 100MB of random junk)
- Start the write workload, which is fine and dandy by itself:
    while :; do cat /tmp/1 >> /data/delete.me/2; done &
- Start the read workload:
    while :; do dd if=/data/delete.me/2 of=/dev/null bs=65536; done &
Before I kicked off the read workload, everything looked as expected: I was
getting between 40 and 60MB/s to each of the disks, and all was good. BUT -
as soon as I introduced the read workload, my write throughput dropped to
virtually zero and remained there until the read workload was killed.
The starvation is immediate. I can 100% reproducibly go from many MB/s of
write throughput with no read workload to virtually 0MB/s of write
throughput simply by kicking off that reading dd. Write performance picks
up again as soon as I kill the read workload. It also behaves the same way
if the file I'm reading is NOT the same one I'm writing to (e.g. cat >>
file3 while the dd reads file 2).
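For anyone who wants to reproduce this in one hit, here's the above rolled
into a sketch of a script (paths and timings are just the ones I used;
point it at your own pool):

  #!/bin/sh
  # build ~100MB of random junk to stream from (default 512-byte dd
  # reads - just let it run for a few seconds, as above)
  dd if=/dev/urandom > /tmp/1 &
  DD=$!
  sleep 10
  kill $DD

  mkdir -p /data/delete.me

  # streaming write: append the junk to a file in the pool, forever
  while :; do cat /tmp/1 >> /data/delete.me/2; done &
  WRITER=$!

  # sequential read of the same file - this is what starves the writer
  while :; do dd if=/data/delete.me/2 of=/dev/null bs=65536; done &
  READER=$!

  # watch the disks: kw/s collapses within seconds of the reader starting
  iostat -x 5
  # when finished: kill $WRITER $READER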
Other things to know about the system:
- Disks are Seagate 2TB, 512-byte sector SATA disks
- OS is Solaris 11 Express (build 151a)
- zpool version is old - I'm still hedging my bets on having to go back to
Nevada (SXCE build 124 or so, which is what I was running before installing
S11 Express).
Cached configuration:
version: 19
- Plenty of space remains in the pool -
bash-4.0$ zpool list
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
data 1.81T 1.34T 480G 74% 1.00x ONLINE -
- The box has 8GB of memory - and ZFS is getting a fair whack at it.
::memstat
Page Summary Pages MB %Tot
------------ ---------------- ---------------- ----
Kernel 211843 827 11%
ZFS File Data 1426054 5570 73%
Anon 106814 417 5%
Exec and libs 9364 36 0%
Page cache 47192 184 2%
Free (cachelist) 31448 122 2%
Free (freelist) 130431 509 7%
Total 1963146 7668
Physical 1963145 7668
- The rest of the ZFS dataset properties:
# zfs get all data
NAME PROPERTY VALUE SOURCE
data type filesystem -
data creation Mon May 24 10:46 2010 -
data used 1.34T -
data available 451G -
data referenced 500G -
data compressratio 1.02x -
data mounted yes -
data quota none default
data reservation none default
data recordsize 128K default
data mountpoint /data default
data sharenfs ro,anon=0 local
data checksum on default
data compression off local
data atime off local
data devices on default
data exec on default
data setuid on default
data readonly off default
data zoned off default
data snapdir hidden default
data aclinherit restricted default
data canmount on default
data xattr on default
data copies 1 default
data version 3 -
data utf8only off -
data normalization none -
data casesensitivity sensitive -
data vscan off default
data nbmand off default
data sharesmb off default
data refquota none default
data refreservation none local
data primarycache all default
data secondarycache all default
data usedbysnapshots 12.2G -
data usedbydataset 500G -
data usedbychildren 864G -
data usedbyrefreservation 0 -
data logbias latency default
data dedup off default
data mlslabel none default
data sync standard default
data encryption off -
data keysource none default
data keystatus none -
data rekeydate - default
data rstchown on default
data com.sun:auto-snapshot true local
Obviously, the potential for performance issues is considerable - and should it
be required, I can provide some other detail, but given that this is so easy to
reproduce, I thought I'd get it out there, just in case.
It is also worth noting that commands like 'zfs list' take anywhere from 20
to 40 seconds to run when I have that sort of workload running - which also
seems less than optimal.
I tried to recreate this issue on the boot pool (rpool), which is a single
2.5" 7200rpm disk (to take the cache controller out of the configuration) -
but this seemed to hard-hang the system (yep - even Caps Lock / Num Lock
were unresponsive). As I did not have any watchdog/snooping set up and had
run out of steam myself, I just hit the big button.
When I get the chance, I'll give the rpool thing a crack again, but
overall, it seems to me that the behaviour I'm observing is not great...
I'm also happy to supply lockstats / dtrace output etc if it'll help.
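For instance, something along these lines (stable io provider only, so it
should be safe to run) would show the per-second byte counts actually
reaching the disks, split into reads vs writes, while the starvation is
happening:

  # bytes issued to disk each second, reads vs writes
  dtrace -n 'io:::start { @[args[0]->b_flags & B_READ ? "read bytes" : "write bytes"] = sum(args[0]->b_bcount); }
             tick-1sec { printa(@); trunc(@); }'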
Thoughts?
Cheers!
Nathan.
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss