Re: [zfs-discuss] Very slow ZFS write speed to raw zvol

2009-07-10 Thread Eric C. Taylor
Writes using the character interface (/dev/zvol/rdsk) are synchronous.
If you want caching, you can go through the block interface
(/dev/zvol/dsk) instead.

-  Eric
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Very slow ZFS write speed to raw zvol

2009-07-09 Thread Jim Klimov
Do you have any older benchmarks on these cards and arrays (in their pre-ZFS 
life?) Perhaps this is not a ZFS regression but a hardware config issue?

Perhaps there's some caching (like per-disk write-through) not enabled on the 
arrays? As you may know, the ability (and reliability) of such cache often 
depends 
on presence of battery back-up (BBU) modules on an array controller.

It is also often pointed out that ZFS prefers raw disks instead of hardware 
vendors'
RAID arrays (with possible errors crawling into data bits and undetected by the
firmware's algorithms unless the whole disk breaks, or errors related to bugs 
in 
the firmware itself - in more elaborate conspiracy theories). 

Due to this it is often recommended to use external RAID implementations as
vdev's to a redundant ZFS pool (such as a mirror of two equivalent arrays).

//Jim
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Very slow ZFS write speed to raw zvol

2009-07-09 Thread Jim Klimov
Hmm, scratch that. Maybe.

I did not first get the point that your writes to a filesystem dataset work 
quickly.
Perhaps filesystem is (better) cached indeed, i.e. *maybe* zvol writes are 
synchronous and zfs writes may be cached and thus async? Try playing around 
with relevant dataset attributes...

I'm running a test on my system (a snv_114 Thumper, 16Gb RAM, used for other 
purposes as well), the CPU is mostly idle now (2.5-3.2% kernel time, that's 
about 
it). Seems I have results not unlike yours. Not cool because I wanted to play 
with
COMSTAR iSCSI - and I'm not sure it will perform well ;)

I'm dd'ing 30Gb to an uncompressed test zvol with same 64kb block sizes (maybe 
they are too small?), and zpool iostat goes like this - a hundred IOs at 7Mbps 
for a
minute, then a burst of 100-170Mbps and 20-25K IOps for a second:

pond5.79T  4.41T  0106  0  7.09M
pond5.79T  4.41T  0  1.93K  0  20.7M
pond5.79T  4.41T  0  13.3K  0   106M
pond5.79T  4.41T  0116  0  7.76M
pond5.79T  4.41T  0108  0  7.23M
pond5.79T  4.41T  0107  0  7.16M
pond5.79T  4.41T  0107  0  7.16M

or

pond5.79T  4.41T  0117  0  7.83M
pond5.79T  4.41T  0  5.61K  0  49.7M
pond5.79T  4.41T  0  19.0K504   149M
pond5.79T  4.41T  0104  0  6.96M

Weird indeed.

It wrote 10Gb (according to zfs get usedbydataset pond/test) taking roughly 
30 
minutes after which I killed it.

Now, writing to an uncompressed filesystem dataset (although very far from 
what's trumpeted as Thumper performance) yields quite different numbers:

pond5.80T  4.40T  1  3.64K   1022   457M
pond5.80T  4.40T  0866967  75.7M
pond5.80T  4.40T  0  4.65K  0   586M
pond5.80T  4.40T  6802  33.4K  69.2M
pond5.80T  4.40T 29  2.44K  1.10M   301M
pond5.80T  4.40T 32691   735K  25.0M
pond5.80T  4.40T 56  1.59K  2.29M   184M
pond5.80T  4.40T150768  4.61M  10.5M
pond5.80T  4.40T  2  0  25.5K  0
pond5.80T  4.40T  0  2.75K  0   341M
pond5.80T  4.40T  7  3.96K   339K   497M
pond5.80T  4.39T 85740  3.57M  59.0M
pond5.80T  4.39T 67  0  2.22M  0
pond5.80T  4.39T  9  4.67K   292K   581M
pond5.80T  4.39T  4  1.07K   126K   137M
pond5.80T  4.39T 27333   338K  9.15M
pond5.80T  4.39T  5  0  28.0K  3.99K
pond5.82T  4.37T  1  5.42K  1.67K   677M
pond5.83T  4.37T  3  1.69K  8.36K   173M
pond5.83T  4.37T  2  0  5.49K  0
pond5.83T  4.37T  0  6.32K  0   790M
pond5.83T  4.37T  2290  7.95K  27.8M
pond5.83T  4.37T  0  9.64K  1.23K  1.18G

The numbers are jumpy (maybe due to fragmentation, other processes, etc.) but
there are often spikes in excess of 500MBps.

The whole test took a relatively little time:

# time dd if=/dev/zero of=/pond/tmpnocompress/test30g bs=65536 count=50
50+0 records in
50+0 records out

real1m27.657s
user0m0.302s
sys 0m46.976s

# du -hs /pond/tmpnocompress/test30g 
  30G   /pond/tmpnocompress/test30g

To detail about the pool: 

The pool is on a Sun X4500 with 48 250Gb SATA drives. It was created as a 9x5 
set (9 stripes made of 5-disk raidz1 vdevs) spread across different 
controllers, 
with the command:

# zpool create -f pond \
raidz1 c0t0d0 c1t0d0 c4t0d0 c6t0d0 c7t0d0 \
raidz1 c0t1d0 c1t2d0 c4t3d0 c6t5d0 c7t6d0 \
raidz1 c1t1d0 c4t1d0 c5t1d0 c6t1d0 c7t1d0 \
raidz1 c0t2d0 c4t2d0 c5t2d0 c6t2d0 c7t2d0 \
raidz1 c0t3d0 c1t3d0 c5t3d0 c6t3d0 c7t3d0 \
raidz1 c0t4d0 c1t4d0 c4t4d0 c6t4d0 c7t4d0 \
raidz1 c0t5d0 c1t5d0 c4t5d0 c5t5d0 c7t5d0 \
raidz1 c0t6d0 c1t6d0 c4t6d0 c5t6d0 c6t6d0 \
raidz1 c1t7d0 c4t7d0 c5t7d0 c6t7d0 c7t7d0 \
spare c0t7d0

Alas, while there were many blogs, I couldn't find a definitive answer last 
year as
to which Thumper layout is optimal in performance and/or reliability (in regard 
to 
6 controllers of 8 disks each, with 2 disks on one of the controllers reserved 
for 
booting). 

As a result, we spread each raidz1 across 5 controllers, so the loss of one
controller should have minimal impact on data loss on the average. Since the 
system layout is not symmetrical, some controllers are more important than 
others (say, the boot one).

//Jim
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Very slow ZFS write speed to raw zvol

2009-07-09 Thread Ross Walker

On Jul 9, 2009, at 4:22 AM, Jim Klimov no-re...@opensolaris.org wrote:

To tell the truth, I expected zvols to be faster than filesystem  
datasets. They seem
to have less overhead without inodes, posix, acls and so on. So I'm  
puzzled by test

results.

I'm now considering the dd i/o block size, and it means a lot  
indeed, especially if

compared to zvol results with small blocks like 64k.

I ran a number of tests with a zvol recreated by commands before  
each run (this
may however cause varying fragmentation impacting results of  
different runs):


# zfs destroy -r pond/test; zfs create -V 30G pond/test; zfs set  
compression=off pond/test; sync; dd if=/dev/zero of=/dev/zvol/rdsk/ 
pond/test count=1000 bs=512; sync


and tests going like

# time dd if=/dev/zero of=/dev/zvol/rdsk/pond/test count=1024 bs=1048576
1024+0 records in
1024+0 records out

real0m42.442s
user0m0.006s
sys 0m4.292s

The test progresses were quite jumpy (with zpool iostat pond 1  
values varying

from 30 to 70 MBps, reads coming in sometimes).

So I'd stick to overall result - the rounded wallclock time it takes  
to write 1024
records of varying size and resulting average end-user MBps. I also  
write sys
time since that's what is consumed by the kernel and the disk  
subsystem, after all.
I don't write zpool iostat speeds, since they vary too much and I  
don't bother
with a spreadsheen right now. But the reported values stay about  
halfway between
wallclock MBps ans sys MBps calculations, on the perceived  
average, peaking

at about 350MBps for large block sizes (4MB).

1 MB (bs=1048576): 42s (4s), 24MBps
4 MB (bs=4194304): 42s (15s), 96MBps
16MB (bs=16777216): 129s-148s (62-64s), 127-110MBps
32MB (bs=33554432, 40Gb zvol): 303s (127s), 108MBps

Similar results for writing a file to a filesystem; zpool iostat  
values again
jumped anywhere between single MBps to GBps. Simple cleanups used  
like:


# rm /pool/test30g; sync; time dd if=/dev/zero of=/pool/test30g
count=1024 bs=33554432

Values remain somewhat consistent (in the same league, at least):
1 MB (bs=1048576, 10240 blocks): 20-21s (7-8s), 512-487MBps

1 MB (bs=1048576): 2.3s (0.6s), 445MBps
4 MB (bs=4194304): 8s (3s), 512MBps
16MB (bs=16777216): 37s (15s), 442MBps
32MB (bs=33554432): 74-103s (32-42s), 442-318MBps

64Kb (bs=65536, 545664 blocks): 94s (47s), 362MBps

All in all, to make more precise results these tests should be made  
in greater

numbers and averaged. But here we got some figures to think about...

On a side note, now I'll pay more attention to tuning suggestions  
which involve
multi-megabyte buffers for network sockets, etc. They can actually  
cause an

impact to performance many times over!

On another note,

For some reason I occasionally got results like this:
write: File too large
1+0 records in
1+0 records out

I think the zvol was not considered created by that time. In about  
10-15 sec I was
able to commence the test run. Perhaps it helped that I  
initialized the zvol by a

small write after creation, then:
# dd if=/dev/zero of=/dev/zvol/rdsk/pond/test count=1000 bs=512
Strange...


When running throughput tests the block sizes to be concerned about  
are: 4k, 8k, 16k and 64k. These are the sizes that most file systems  
an databases use.


If you get 4k to perform well chances are the others will fall into  
line.


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Very slow ZFS write speed to raw zvol

2009-07-09 Thread Jim Klimov
After reading many-many threads on ZFS performance today (top of the list in 
the 
forum, and some chains of references), I applied a bit of tuning to the server.

In particular, I've set the zfs_write_limit_override to 384Mb so my cache is 
spooled 
to disks more frequently (if streaming lots of writes) and in smaller 
increments.

* echo zfs_write_limit_override/W0t402653184 | mdb -kw
set zfs:zfs_write_limit_override = 0x1800

The system seems to be working more smoothly (vs. jerky), and zpool iostat 
values are not quite as jumpy (i.e. 320MBps to 360MBps for a certain test). 
The results also seem faster and more consistent.

With this tuning applied, I'm writing to a 40G zvol, 1M records (count=1048576) 
of:
4k (bs=4096): 17s (12s), 241MBps
8k (bs=8192): 29s (18s), 282MBps
16k (bs=16384): 54s (30s), 303MBps
32k (bs=32768): 113s (56s), 290MBps
64k (bs=65536): 269s (104s), 243MBps

And 10240 larger records of:
1 MB (bs=1048576): 33s (8s), 310MBps
2 MB (bs=2097152): 74s (23s), 276MBps

And 1024 yet larger records:
1 MB (bs=1048576): 4s (1s), 256MBps
4 MB (bs=4194304): 12s (5s), 341MBps
16MB (bs=16777216): 71s (18s), 230MBps
32MB (bs=33554432): 150s (36s), 218MBps

So the zvol picture is quite better now (albeit not perfect - i.e. no values 
are near 
the 1GBps noted previously in zpool iostat), for both small and large blocks.

For filesystem dataset the new values are very similar (like, to tenths of a 
second 
on smaller blocksizes!) but as the blocksize grows, filesystems start losing to 
the
zvols. Overall the result seems lower than achieved before I tried tuning.

1M records (count=1048576) of:
4k (bs=4096): 17s (12s), 241MBps
8k (bs=8192): 29s (18s), 282MBps
16k (bs=16384): 67s (30s), 245MBps
32k (bs=32768): 144s (55s), 228MBps
64k (bs=65536): 275s (98s), 238MBps

And 10240 larger records go better:
1 MB (bs=1048576): 33s (9s), 310MBps
2 MB (bs=2097152): 70s (21s), 292MBps

And 1024 yet larger records:
1 MB (bs=1048576): 2.8s (0.8s), 366MBps
4 MB (bs=4194304): 12s (4s), 341MBps
16MB (bs=16777216): 55s (17s), 298MBps
32MB (bs=33554432): 140s (36s), 234MBps

Occasionally I did reruns; user time for the same setups can vary significantly
(like 65s vs 84s) while the system time stays pretty much the same.

zpool iostat shows larger values (like 320MBps typically) but I think that 
can be 
attributed to writing parity stripes on raidz vdevs.

//Jim

PS: for completeness, I'll try smaller blocks without tuning in a future post.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss