This discussion is an old one: writing many small files to an NFS-mounted ZFS 
filesystem is slow without an SSD ZIL, due to the synchronous nature of the NFS 
protocol itself. But there is something I don't really understand. My tests on an 
old Opteron box with two small U160 SCSI arrays and a zpool with 4 mirrored vdevs 
built from 146 GB disks show mostly idle disks when untarring an archive with 
many small files over NFS. Any source package can be used for this test. I'm on 
zpool version 22 (still SXCE b130; the client is OpenSolaris b130), the NFS mount 
options are all defaults, and NFSD_SERVERS=128.
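
For reference, the test itself is nothing more than something like this (server 
name, mount point and archive are just placeholders):

# on the OpenSolaris b130 client, default NFS mount options
mount -F nfs server:/ib1/test /mnt/test
cd /mnt/test
time tar xf /var/tmp/some-source-package.tar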

Configuration of the pool is like this:
zpool status ib1
  pool: ib1
 state: ONLINE
 scrub: scrub completed after 0h52m with 0 errors on Sat Jan 15 14:19:02 2011
config:

        NAME        STATE     READ WRITE CKSUM
        ib1         ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c1t4d0  ONLINE       0     0     0
            c3t0d0  ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            c1t6d0  ONLINE       0     0     0
            c4t0d0  ONLINE       0     0     0
          mirror-2  ONLINE       0     0     0
            c3t3d0  ONLINE       0     0     0
            c4t3d0  ONLINE       0     0     0
          mirror-3  ONLINE       0     0     0
            c3t4d0  ONLINE       0     0     0
            c4t4d0  ONLINE       0     0     0

zpool iostat -v shows

               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
ib1          268G   276G      0    180      0   723K
  mirror    95.4G  40.6G      0     44      0   180K
    c1t4d0      -      -      0     44      0   180K
    c3t0d0      -      -      0     44      0   180K
  mirror    95.2G  40.8G      0     44      0   180K
    c1t6d0      -      -      0     44      0   180K
    c4t0d0      -      -      0     44      0   180K
  mirror    39.0G  97.0G      0     45      0   184K
    c3t3d0      -      -      0     45      0   184K
    c4t3d0      -      -      0     45      0   184K
  mirror    38.5G  97.5G      0     44      0   180K
    c3t4d0      -      -      0     44      0   180K
    c4t4d0      -      -      0     44      0   180K
----------  -----  -----  -----  -----  -----  -----

So each disk gets 40-50 IOPS, about 180 write ops on the whole pool (each op 
landing on both halves of a mirror). Note that these U320 SCSI disks should be 
able to handle about 150 IOPS each, so there is no aggregation of IOPS going on 
here. The strange thing is the following iostat -MindexC output:

                            extended device statistics       ---- errors ---
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   0  14   0  14 c0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   0  14   0  14 c0t0d0
    0.0  186.0    0.0    0.4  0.0  0.0    0.0    0.1   0   2   0   0   0   0 c1
    0.0   93.0    0.0    0.2  0.0  0.0    0.0    0.1   0   1   0   0   0   0 c1t4d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c1t5d0
    0.0   93.0    0.0    0.2  0.0  0.0    0.0    0.1   0   1   0   0   0   0 c1t6d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c2
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c2t0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c2t1d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c2t2d0
    0.0  279.5    0.0    0.5  0.0  0.0    0.0    0.1   0   3   0   0   0   0 c3
    0.0   93.0    0.0    0.2  0.0  0.0    0.0    0.1   0   1   0   0   0   0 c3t0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c3t1d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c3t2d0
    0.0   93.0    0.0    0.2  0.0  0.0    0.0    0.1   0   1   0   0   0   0 c3t3d0
    0.0   93.5    0.0    0.2  0.0  0.0    0.0    0.1   0   1   0   0   0   0 c3t4d0
    0.0  279.0    0.0    0.5  0.0  0.0    0.0    0.2   0   5   0   0   0   0 c4
    0.0   93.0    0.0    0.2  0.0  0.0    0.0    0.3   0   3   0   0   0   0 c4t0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c4t2d0
    0.0   93.0    0.0    0.2  0.0  0.0    0.0    0.1   0   1   0   0   0   0 c4t4d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c4t1d0
    0.0   93.0    0.0    0.2  0.0  0.0    0.0    0.1   0   1   0   0   0   0 c4t3d0

Service times for the disks involved are around 0.1-0.3 ms; I think this is down 
to the sequential write nature of ZFS. The disks are at most 3% busy. With 
synchronous writes I'd expect the disks to be 100% busy. And when reading or 
writing locally the disks really do get busy: about 50 MB/s per disk, up against 
the 160 MB/s SCSI bus limit per channel (there are two U160 channels with three 
disks each and one channel with two disks).
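
The local comparison is just streaming I/O directly on the server, something 
along these lines (path and size are placeholders):

# plain streaming write and read on the server, no NFS involved
dd if=/dev/zero of=/ib1/test/bigfile bs=1024k count=16384
dd if=/ib1/test/bigfile of=/dev/null bs=1024k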

Richard Elling's zilstat gives

   N-Bytes  N-Bytes/s N-Max-Rate    B-Bytes  B-Bytes/s B-Max-Rate    ops  <=4kB 4-32kB >=32kB
      9552       9552       9552     671744     671744     671744    164    164      0      0
     10192      10192      10192     724992     724992     724992    177    177      0      0
      9568       9568       9568     679936     679936     679936    166    166      0      0
     11712      11712      11712     823296     823296     823296    201    201      0      0
     10784      10784      10784     765952     765952     765952    187    187      0      0
     10024      10024      10024     708608     708608     708608    173    173      0      0

So at the peak there are about 200 ZIL ops per second, all of them under 4 KB. 
As said, the disks aren't busy during this test.
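
Doing the arithmetic on the numbers above:

  671744 B-Bytes / 164 ops  = 4096 bytes, which looks like one 4 KB log block per commit
  ~175 ops/s x 4 KB         ~ 700 KB/s, i.e. essentially the whole ~720 KB/s write
                              bandwidth that zpool iostat reports for the pool

So practically all of the write traffic hitting the disks appears to be ZIL traffic.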

The test ZFS dataset is configured with atime=off. logbias makes almost no 
difference; with logbias=latency the IOPS rate is a little lower.
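
The settings involved are just the standard dataset properties, set with 
something like this (the dataset name is a placeholder):

zfs set atime=off ib1/test
zfs set logbias=throughput ib1/test    # compared against the default logbias=latency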

Attached are some bonnie++ results to show that all the disks and the whole pool 
are quite healthy. I get > 1000 random reads/sec locally and still nearly 900 
reads/sec via NFS. For large files I easily get Gbit wire speed (105 MB/s reads) 
over NFS. And for random reads in a bonnie++ or iozone test the disks really are 
80-100% busy. It's only with small files that the array sits almost idle, even 
though it could clearly do much more. I have seen this on different Solaris 
versions, not only on this test system. Is there any explanation for this 
behaviour?
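
For the record, something like this DTrace one-liner can be used to watch the 
NFS op mix on the server while the untar runs (this assumes the nfsv3 provider 
is available on b130; probe names from the DTrace documentation):

# count NFSv3 write/commit/create calls per second on the server
dtrace -n '
  nfsv3:::op-write-start, nfsv3:::op-commit-start, nfsv3:::op-create-start
  { @ops[probename] = count(); }
  tick-1sec { printa(@ops); trunc(@ops); }'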

Thanks,
Michael
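
The bonnie++ runs below were started roughly like this (exact invocation from 
memory; -n 16 matches the "files 16" column, -s 16384 is the 16 GB working set):

bonnie++ -d /ib1/test -s 16384 -n 16
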
local

Version 1.03c       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
ibmr10          16G           108972  25 89923  21           263540  26  1074   3
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 30359  99 +++++ +++ +++++ +++ 24836  99 +++++ +++ +++++ +++
ibmr10,16G,,,108972,25,89923,21,,,263540,26,1073.5,3,16,30359,99,+++++,+++,+++++,+++,24836,99,+++++,+++,+++++,+++
NFS

Version 1.03d       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
nfsibmr10       16G           50022  11 42524  14           105335  18 884.8  20
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16   152   3 +++++ +++   182   1   151   3 +++++ +++   183   1
nfsibmr10,16G,,,50022,11,42524,14,,,105335,18,884.8,20,16,152,3,+++++,+++,182,1,151,3,+++++,+++,183,1