I am very grateful to everyone who took the time to run a few tests to help me figure out what is going on. As per j's suggestions, I tried some simultaneous reads, and a few other things, and I am getting interesting and confusing results.
All tests were done using two Seagate 320G drives on a sil3114. In each test I ran dd if=... of=/dev/null bs=128k count=10000. Each drive was freshly formatted with one 2G file copied to it, so that dd from the raw disk and dd from the file use roughly the same area of the disk. I tried raw, ZFS and UFS, on single drives and on two drives simultaneously (just executing the dd commands in separate terminal windows). Below are snapshots of iostat -xnczpm 3 captured somewhere in the middle of each run. I am not reporting CPU% as it never rose over 50% and was uniformly proportional to the reported throughput.

single drive, raw:

    r/s    w/s    kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 1378.4    0.0 77190.7    0.0  0.0  1.7    0.0    1.2   0  98 c0d1

single drive, UFS file:

    r/s    w/s    kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 1255.1    0.0 69949.6    0.0  0.0  1.8    0.0    1.4   0 100 c0d0

A small slowdown, but pretty good.

single drive, ZFS file:

    r/s    w/s    kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  258.3    0.0 33066.6    0.0 33.0  2.0  127.7    7.7 100 100 c0d1

Now that is odd. Why so much waiting? Also, unlike with raw or UFS, kr/s divided by r/s gives 128K here, as I would imagine it should.

simultaneous raw:

    r/s    w/s    kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  797.0    0.0 44632.0    0.0  0.0  1.8    0.0    2.3   0 100 c0d0
  795.7    0.0 44557.4    0.0  0.0  1.8    0.0    2.3   0 100 c0d1

The PCI interface seems to be saturated at about 90MB/s. Adequate if the goal is to serve files on a gigabit SOHO network.

simultaneous raw on c0d1 and UFS on c0d0:

    r/s    w/s    kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  722.4    0.0 40246.8    0.0  0.0  1.8    0.0    2.5   0 100 c0d0
  717.1    0.0 40156.2    0.0  0.0  1.8    0.0    2.5   0  99 c0d1

Hmm, I can no longer get the 90MB/s.

simultaneous ZFS on c0d1 and raw on c0d0:

    r/s    w/s    kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.7     0.0    1.8  0.0  0.0    0.0    0.1   0   0 c1d0
  334.9    0.0 18756.0    0.0  0.0  1.9    0.0    5.5   0  97 c0d0
  172.5    0.0 22074.6    0.0 33.0  2.0  191.3   11.6 100 100 c0d1

Everything is slow.
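The average request size the driver actually saw can be read straight off these numbers: kr/s divided by r/s. A quick check against the two single-drive samples above (raw vs. ZFS):

```shell
# Average read size = kr/s / r/s, using the single-drive iostat samples above.
awk 'BEGIN { printf "raw: %.0fK  zfs: %.0fK\n", 77190.7/1378.4, 33066.6/258.3 }'
# prints: raw: 56K  zfs: 128K
```

So the raw reads are being issued as ~56K transfers, while the ZFS reads match the 128K default recordsize.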
What happens if we throw the onboard IDE interface into the mix?

simultaneous raw SATA and raw PATA:

    r/s    w/s    kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 1036.3    0.3 58033.9    0.3  0.0  1.6    0.0    1.6   0  99 c1d0
 1422.6    0.0 79668.3    0.0  0.0  1.6    0.0    1.1   1  98 c0d0

Both at maximum throughput.

ZFS file on the SATA drive and raw disk on the PATA interface:

    r/s    w/s    kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 1018.9    0.3 57056.1    4.0  0.0  1.7    0.0    1.7   0  99 c1d0
  268.4    0.0 34353.1    0.0 33.0  2.0  122.9    7.5 100 100 c0d0

SATA is slower with ZFS, as expected by now, but PATA remains at full speed. So they are operating quite independently. Except... what if we read a UFS file from the PATA disk and a ZFS file from SATA:

    r/s    w/s    kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  792.8    0.0 44092.9    0.0  0.0  1.8    0.0    2.2   1  98 c1d0
  224.0    0.0 28675.2    0.0 33.0  2.0  147.3    8.9 100 100 c0d0

Now that is confusing! Why did SATA/ZFS slow down too? I've retried this a number of times; it is not a fluke.

Finally, after reviewing all this, I noticed another interesting bit: whenever I read from raw disks or UFS files, SATA or PATA, kr/s over r/s comes to 56K, suggesting that the underlying I/O system is using that as some kind of native block size (even though dd is requesting 128k). But when reading ZFS files, it always comes to 128k, which is expected, since that is the ZFS default (and the same thing happens regardless of bs= in dd). On the theory that my system just doesn't like 128k reads (I'm desperate!), and that this would explain the whole slowdown and the wait/wsvc_t columns, I tried changing recordsize to 32k and rewriting the test file. However, accessing the ZFS file continues to show 128k reads, and it is just as slow. Is there a way either to confirm that the ZFS file in question is indeed written with 32k records or, even better, to force ZFS to use 56k when accessing the disk? Or perhaps I just misunderstand the implications of the iostat output.
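For the "is it really 32k records" question, a sketch of how one could check (the dataset name tank/test is a placeholder, and the zdb output format shown in the comment is from memory, so treat it as an assumption): recordsize only applies to blocks written after the property is changed, so the file has to be rewritten, and then zdb can dump the file's block pointers to show the sizes actually on disk.

```shell
# 'tank/test' is a hypothetical dataset name; substitute your own.
zfs set recordsize=32k tank/test
zfs get recordsize tank/test                    # confirm the property took
cp /tank/test/testfile /tank/test/testfile.32k  # only blocks written from now on use 32k
ls -i /tank/test/testfile.32k                   # inode number = zdb object number
zdb -ddddd tank/test <object>                   # block pointers should read 8000L/8000P (32k)
```

If the rewritten file still shows 128K (20000L) blocks, the recordsize change did not take effect for it.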
I've repeated each of these tests a few times and double-checked; the numbers, although snapshots of a point in time, fairly represent the averages. I have no idea what to make of all this, except that ZFS has a problem with this hardware/drivers that UFS and other traditional file systems don't. Is it a bug in the driver that ZFS is inadvertently exposing? A specific feature that ZFS assumes the hardware has, but it doesn't? Who knows! I will have to give up on Solaris/ZFS on this hardware for now, but I hope to try it again sometime in the future. I'll give FreeBSD/ZFS a spin to see if it fares better (although at this point in its development it is probably more risky than just sticking with Linux and missing out on ZFS).

(Another contributor suggested turning checksumming off; it made no difference. Same for atime. Compression was always off.)

On 5/14/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
Marko,

I tried this experiment again using 1 disk and got nearly identical times:

# /usr/bin/time dd if=/dev/dsk/c0t0d0 of=/dev/null bs=128k count=10000
10000+0 records in
10000+0 records out

real       21.4
user        0.0
sys         2.4

$ /usr/bin/time dd if=/test/filebench/testfile of=/dev/null bs=128k count=10000
10000+0 records in
10000+0 records out

real       21.0
user        0.0
sys         0.7

> [I]t is not possible for dd to meaningfully access multiple-disk
> configurations without going through the file system. I find it
> curious that there is such a large slowdown by going through file
> system (with single drive configuration), especially compared to UFS
> or ext3.

Comparing a filesystem to raw dd access isn't a completely fair comparison either. Few filesystems actually lay out all of their data and metadata so that every read is a completely sequential read.

> I simply have a small SOHO server and I am trying to evaluate which OS to
> use to keep a redundant disk array. With unreliable consumer-level hardware,
> ZFS and the checksum feature are very interesting and the primary selling
> point compared to a Linux setup, for as long as ZFS can generate enough
> bandwidth from the drive array to saturate single gigabit ethernet.

I would take Bart's recommendation and go with Solaris on something like a dual-core box with 4 disks.

> My hardware at the moment is the "wrong" choice for Solaris/ZFS - PCI 3114
> SATA controller on a 32-bit AthlonXP, according to many posts I found.

Bill Moore lists some controller recommendations here:

http://mail.opensolaris.org/pipermail/zfs-discuss/2006-March/016874.html

> However, since dd over raw disk is capable of extracting 75+MB/s from this
> setup, I keep feeling that surely I must be able to get at least that much
> from reading a pair of striped or mirrored ZFS drives. But I can't - single
> drive or 2-drive stripes or mirrors, I only get around 34MB/s going through
> ZFS. (I made sure mirror was rebuilt and I resilvered the stripes.)
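For reference, those two times work out to essentially identical throughput; dd moved 10000 reads of 128K = 1250 MB in each case:

```shell
# 10000 x 128K = 1,280,000 KB = 1250 MB; divide by the wall-clock times above.
awk 'BEGIN { mb = 10000 * 128 / 1024
             printf "raw: %.1f MB/s  zfs: %.1f MB/s\n", mb/21.4, mb/21.0 }'
# prints: raw: 58.4 MB/s  zfs: 59.5 MB/s
```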
Maybe this is a problem with your controller? What happens when you have two simultaneous dd's from different disks running? This would simulate the case where you're reading from the two disks at the same time.

-j
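The suggested test is just two backgrounded dd's plus a wait. A sketch, with scratch files standing in for the raw devices so the commands can be tried anywhere; point the reads at /dev/dsk/c0d0 and /dev/dsk/c0d1 (or your own device paths) to reproduce the real experiment, and watch iostat -xnczpm 3 in another terminal while it runs:

```shell
# Scratch files stand in for the two disks in this demo.
f1=$(mktemp); f2=$(mktemp)
dd if=/dev/zero of="$f1" bs=128k count=8 2>/dev/null
dd if=/dev/zero of="$f2" bs=128k count=8 2>/dev/null

# Two concurrent sequential readers, one per "disk".
dd if="$f1" of=/dev/null bs=128k 2>/dev/null &
dd if="$f2" of=/dev/null bs=128k 2>/dev/null &
wait
echo "both reads finished"
rm -f "$f1" "$f2"
```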
_______________________________________________ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss