Re: [zfs-discuss] surprisingly poor performance

2009-08-12 Thread Roch

roland writes:
 > >SSDs with capacitor-backed write caches
 > >seem to be fastest.
 > 
 > how do you distinguish them from SSDs without one?
 > I never saw this explicitly mentioned in the specs.


They probably don't have one then (or they should fire their
entire marketing dept).

Capacitors allow the device to function safely with the
write cache enabled even while ignoring the cache flushes
sent by ZFS. If the device firmware is not set up to ignore
the flushes, make sure that sd.conf is set up not to send
them; otherwise one loses the benefit.

Setting up sd.conf is described in the ZFS Evil Tuning Guide:

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#How_to_Tune_Cache_Sync_Handling_Per_Storage_Device
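
A rough sketch of what that per-device entry looks like (the vendor/product
string below is hypothetical; it must match your device's SCSI inquiry data,
with the vendor ID padded to 8 characters, and the property name should be
double-checked against the guide and sd(7D) for your build):

# /kernel/drv/sd.conf - declare the device's write cache as non-volatile
# so sd stops passing SYNCHRONIZE CACHE requests to it (reboot to apply)
sd-config-list = "ATA     Example SSD", "cache-nonvolatile:true";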

-r




Re: [zfs-discuss] surprisingly poor performance

2009-08-11 Thread roland
>SSDs with capacitor-backed write caches
>seem to be fastest.

How do you distinguish them from SSDs without one?
I never saw this explicitly mentioned in the specs.


Re: [zfs-discuss] surprisingly poor performance

2009-07-31 Thread Roch Bourbonnais


The things I'd pay most attention to would be single-threaded 4K,
32K, and 128K writes to the raw device.
Make sure the SSD has a capacitor and enable the write cache on the
device.
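
A minimal sketch of such a test (single-threaded writes straight to the
character device, so it destroys whatever is on that slice; the device name
below is just an example): run

pfexec dd if=/dev/zero of=/dev/rdsk/c7t2d0s0 bs=4k count=100000

and repeat with bs=32k and bs=128k, while watching w/s and asvc_t in
iostat -xn c7t2d0 1 from another terminal.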


-r

Le 5 juil. 09 à 12:06, James Lever a écrit :



On 04/07/2009, at 3:08 AM, Bob Friesenhahn wrote:

It seems like you may have selected the wrong SSD product to use.  
There seems to be a huge variation in performance (and cost) with  
so-called "enterprise" SSDs.  SSDs with capacitor-backed write  
caches seem to be fastest.


Do you have any methods to "correctly" measure the performance of an  
SSD for the purpose of a slog and any information on others (other  
than anecdotal evidence)?


cheers,
James







Re: [zfs-discuss] surprisingly poor performance

2009-07-08 Thread Miles Nordin
> "pe" == Peter Eriksson  writes:

pe> With c1t15d0s0 added as log it takes 1:04.2, but with the same
pe> c1t15d0s0 added, but wrapped inside a SVM metadevice the same
pe> operation takes 10.4 seconds...

so now SVM discards cache flushes, too?  great.




Re: [zfs-discuss] surprisingly poor performance

2009-07-08 Thread Peter Eriksson
Oh, and for completeness: if I wrap 'c1t12d0s0' inside an SVM metadevice and
use that to create the "TEST" zpool (without a log), the same test command runs
in 36.3 seconds... I.e.:

# metadb -f -a -c3 c1t13d0s0
# metainit d0 1 1 c1t13d0s0
# metainit d2 1 1 c1t12d0s0
# zpool create TEST /dev/md/dsk/d2

If I then add a log to that device:

# zpool add TEST log /dev/md/dsk/d0

the same test (gtar zxf emacs-22.3.tar.gz) runs in 10.1 seconds...
(I.e., not much better than just using a raw disk + SVM-encapsulated log.)


Re: [zfs-discuss] surprisingly poor performance

2009-07-08 Thread Peter Eriksson
You might want to try one thing I just noticed - wrap the log device inside an SVM
(DiskSuite) metadevice - it works wonders for the performance on my test server
(Sun Fire X4240)... I do wonder what the downsides might be (except for having
to fiddle with DiskSuite again). I.e.:

# zpool create TEST c1t12d0
# format c1t13d0
(Create a 4GB partition 0)
# metadb -f -a -c 3 c1t13d0s0
# metainit d0 1 1 c1t13d0s0
# zpool add TEST log /dev/md/dsk/d0

In my case the disks involved above are:
c1t12d0 146GB 10krpm SAS disk
c1t13d0 32GB Intel X25-E SLC SSD SATA disk
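
For what it's worth, after the commands above something like

# metastat d0
# zpool status TEST

should confirm the d0 metadevice layout and show it attached as the pool's
log device (a hypothetical check; output layout varies by release).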

Without the log added, running 'gtar zxf emacs-22.3.tar.gz' over NFS from
another server
takes 1:39.2 (almost 2 minutes). With c1t15d0s0 added as a log it takes 1:04.2,
but with the same c1t15d0s0 added wrapped inside an SVM metadevice, the same
operation takes 10.4 seconds...

1:39 vs 0:10 is a pretty good speedup I think...


Re: [zfs-discuss] surprisingly poor performance

2009-07-07 Thread James Andrewartha
James Lever wrote:
> 
> On 07/07/2009, at 8:20 PM, James Andrewartha wrote:
> 
>> Have you tried putting the slog on this controller, either as an SSD or
>> regular disk? It's supported by the mega_sas driver, x86 and amd64 only.
> 
> What exactly are you suggesting here?  Configure one disk on this array
> as a dedicated ZIL?  Would that improve performance any over using all
> disks with an internal ZIL?

I was mainly thinking about using the battery-backed write cache to
eliminate the NFS latency. There's not much difference between an internal and a
dedicated ZIL if the disks are the same and on the same controller -
dedicated ZIL wins come from using SSDs and battery-backed cache.
http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Separate_Log_Devices

> Is there a way to disable the write barrier in ZFS in the way you can
> with Linux filesystems (-o barrier=0)?  Would this make any difference?

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Cache_Flushes
might help if the RAID card is still flushing to disk when ZFS asks it to
even though it's safe in the battery-backed cache.
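
If the controller really does ignore flushes safely, the knob that page
describes is the global zfs_nocacheflush tunable. A sketch (it applies to
every device in every pool, so only use it when all of them sit behind
non-volatile caches):

# /etc/system - takes effect at the next boot
set zfs:zfs_nocacheflush = 1

# or temporarily on a live system
echo zfs_nocacheflush/W0t1 | mdb -kw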

-- 
James Andrewartha | Sysadmin
Data Analysis Australia Pty Ltd


Re: [zfs-discuss] surprisingly poor performance

2009-07-07 Thread James Lever


On 07/07/2009, at 8:20 PM, James Andrewartha wrote:

Have you tried putting the slog on this controller, either as an SSD or
regular disk? It's supported by the mega_sas driver, x86 and amd64 only.


What exactly are you suggesting here?  Configure one disk on this  
array as a dedicated ZIL?  Would that improve performance any over  
using all disks with an internal ZIL?


I have now done some tests with the PERC6/E in both RAID10 (all  
devices RAID0 LUNs, ZFS mirror/striped config) and also as a hardware  
RAID5 both with an internal ZIL.


RAID10 (10 disks, 5 mirror vdevs)
create 2m14.448s
unlink  0m54.503s

RAID5 (9 disks, 1 hot spare)
create 1m58.819s
unlink 0m48.509s

Unfortunately, Linux on the same RAID5 array using XFS still seems
significantly faster.


Linux RAID5 (9 disks, 1 hot spare), XFS
create 1m30.911s
unlink 0m38.953s

Is there a way to disable the write barrier in ZFS in the way you can  
with Linux filesystems (-o barrier=0)?  Would this make any difference?


After much consideration, the lack of barrier capability makes no  
difference to filesystem stability in the scenario where you have a  
battery backed write cache.
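
For reference, ZFS has no barrier switch as such; the closest things in the
2009-era bits were the cache-flush tunable mentioned elsewhere in the thread
and disabling the ZIL outright, which gives up the synchronous semantics NFS
clients depend on, so it is only useful as a diagnostic. A sketch:

# /etc/system - diagnostic only; acknowledged NFS writes can be lost on a crash
set zfs:zil_disable = 1

# or on a live system (takes effect for datasets mounted afterwards)
echo zil_disable/W0t1 | mdb -kw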


Due to using identical hardware and configurations, I think this is a  
fair apples to apples test now.  I'm now wondering if XFS is just the  
faster filesystem... (not the most practical management solution, just  
speed).


cheers,
James



Re: [zfs-discuss] surprisingly poor performance

2009-07-07 Thread James Andrewartha
James Lever wrote:
> We also have a PERC 6/E w/512MB BBWC to test with or fall back to if we
> go with a Linux solution.

Have you tried putting the slog on this controller, either as an SSD or
regular disk? It's supported by the mega_sas driver, x86 and amd64 only.

-- 
James Andrewartha | Sysadmin
Data Analysis Australia Pty Ltd


Re: [zfs-discuss] surprisingly poor performance

2009-07-05 Thread Ross Walker
On Jul 5, 2009, at 9:20 PM, Richard Elling   
wrote:



Ross Walker wrote:


Thanks for the info. SSD is still very much a moving target.

I worry about SSD drives long term reliability. If I mirror two of  
the same drives what do you think the probability of a double  
failure will be in 3, 4, 5 years?


Assuming there are no common cause faults (eg firmware), you should
expect an MTBF of 2-4M hours.  But I can't answer the question without
knowing more info.  It seems to me that you are really asking for the
MTTDL, which is a representation of probability of data loss.  I describe
these algorithms here:
http://blogs.sun.com/relling/entry/a_story_of_two_mttdl

Since the vendors do not report UER rates, which makes sense for
flash devices, the MTTDL[1] model applies.  You can do the math
yourself, once you figure out what your MTTR might be.  For enterprise
systems, we usually default to 8 hour response, but for home users you
might plan on a few days, so you can take a vacation every once in a
while.  For 48 hours MTTR:
2M hours MTBF -> MTTDL[1] = 4,756,469 years
4M hours MTBF -> MTTDL[1] = 19,025,875 years

Most folks find it more intuitive to look at probability per year in the
form of a percent, so
2M hours MTBF -> Annual DL rate = 0.21%
4M hours MTBF -> Annual DL rate = 0.05%

If you want to more accurately model based on endurance, then you'll
need to know the expected write rate and the nature of the wear  
leveling
mechanism. It can be done, but the probability is really, really  
small.


Wow, detailed, interested in a career in actuarial analysis? :-)

Thanks, I'll try to wrap my mind around this during daylight hours  
after my caffeine fix.


What I would really like to see is zpool's ability to fail-back to  
an inline zil in the event an external one fails or is missing.  
Then one can remove an slog from a pool and add a different one if  
necessary or just remove it altogether.


It already does this, with caveats.  What you might also want is
CR 6574286, removing a slog doesn't work.
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6574286


Well I'll keep an eye on when the fix gets out and then for it to get  
into Solaris 10.


-Ross



Re: [zfs-discuss] surprisingly poor performance

2009-07-05 Thread Richard Elling

Ross Walker wrote:
On Jul 5, 2009, at 7:47 PM, Richard Elling  
wrote:



Ross Walker wrote:


On Jul 5, 2009, at 6:06 AM, James Lever  wrote:



On 04/07/2009, at 3:08 AM, Bob Friesenhahn wrote:

It seems like you may have selected the wrong SSD product to use. 
There seems to be a huge variation in performance (and cost) with 
so-called "enterprise" SSDs.  SSDs with capacitor-backed write 
caches seem to be fastest.


Do you have any methods to "correctly" measure the performance of 
an SSD for the purpose of a slog and any information on others 
(other than anecdotal evidence)?


There are two types of SSD drives on the market, the fast write SLC 
(single level cell) and the slow write MLC (multi level cell). MLC 
is usually used in laptops as SLC drives over 16GB usually go for 
$1000+ which isn't cost effective in a laptop. MLC is good for read 
caching though and most use it for L2ARC.


Please don't classify them as MLC vs SLC or you'll find yourself totally
confused by the modern MLC designs which use SLC as a cache.  Be
happy with specs: random write iops: slow or fast.


Thanks for the info. SSD is still very much a moving target.

 I worry about SSD drives long term reliability. If I mirror two of 
the same drives what do you think the probability of a double failure 
will be in 3, 4, 5 years?


Assuming there are no common cause faults (eg firmware), you should
expect an MTBF of 2-4M hours.  But I can't answer the question without
knowing more info.  It seems to me that you are really asking for the
MTTDL, which is a representation  of probability of data loss.  I describe
these algorithms here:
http://blogs.sun.com/relling/entry/a_story_of_two_mttdl

Since the vendors do not report UER rates, which makes sense for
flash devices, the MTTDL[1] model applies.  You can do the math
yourself, once you figure out what your MTTR might be.  For enterprise
systems, we usually default to 8 hour response, but for home users you
might plan on a few days, so you can take a vacation every once in a
while.  For 48 hours MTTR:
2M hours MTBF -> MTTDL[1] = 4,756,469 years
4M hours MTBF -> MTTDL[1] = 19,025,875 years

Most folks find it more intuitive to look at probability per year in the
form of a percent, so
2M hours MTBF -> Annual DL rate = 0.21%
4M hours MTBF -> Annual DL rate = 0.05%

If you want to more accurately model based on endurance, then you'll
need to know the expected write rate and the nature of the wear leveling
mechanism. It can be done, but the probability is really, really small.
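
For a concrete sense of the first figure: with the MTTDL[1] mirror model
(MTTDL[1] = MTBF^2 / (N * (N-1) * MTTR), which reproduces the number above),
N = 2, MTBF = 2,000,000 hours and MTTR = 48 hours give

  (2,000,000)^2 / (2 * 1 * 48) = ~4.17e10 hours = ~4.76 million years

(using 8760 hours per year), i.e. the 4,756,469 years quoted.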



What I would really like to see is zpool's ability to fail-back to an 
inline zil in the event an external one fails or is missing. Then one 
can remove an slog from a pool and add a different one if necessary or 
just remove it altogether.


It already does this, with caveats.  What you might also want is
CR 6574286, removing a slog doesn't work.
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6574286
-- richard



Re: [zfs-discuss] surprisingly poor performance

2009-07-05 Thread Ross Walker
On Jul 5, 2009, at 7:47 PM, Richard Elling   
wrote:



Ross Walker wrote:


On Jul 5, 2009, at 6:06 AM, James Lever  wrote:



On 04/07/2009, at 3:08 AM, Bob Friesenhahn wrote:

It seems like you may have selected the wrong SSD product to use.  
There seems to be a huge variation in performance (and cost) with  
so-called "enterprise" SSDs.  SSDs with capacitor-backed write  
caches seem to be fastest.


Do you have any methods to "correctly" measure the performance of  
an SSD for the purpose of a slog and any information on others  
(other than anecdotal evidence)?


There are two types of SSD drives on the market, the fast write SLC  
(single level cell) and the slow write MLC (multi level cell). MLC  
is usually used in laptops as SLC drives over 16GB usually go for  
$1000+ which isn't cost effective in a laptop. MLC is good for read  
caching though and most use it for L2ARC.


Please don't classify them as MLC vs SLC or you'll find yourself totally
confused by the modern MLC designs which use SLC as a cache.  Be
happy with specs: random write iops: slow or fast.


Thanks for the info. SSD is still very much a moving target.

 I worry about SSD drives long term reliability. If I mirror two of  
the same drives what do you think the probability of a double failure  
will be in 3, 4, 5 years?


What I would really like to see is zpool's ability to fail-back to an  
inline zil in the event an external one fails or is missing. Then one  
can remove an slog from a pool and add a different one if necessary or  
just remove it altogether.


-Ross



Re: [zfs-discuss] surprisingly poor performance

2009-07-05 Thread James Lever


On 06/07/2009, at 9:31 AM, Ross Walker wrote:

There are two types of SSD drives on the market, the fast write SLC  
(single level cell) and the slow write MLC (multi level cell). MLC  
is usually used in laptops as SLC drives over 16GB usually go for  
$1000+ which isn't cost effective in a laptop. MLC is good for read  
caching though and most use it for L2ARC.


I just ordered a bunch of 16GB Imation Pro 7500's (formerly Mtron)
from CDW recently for $290 a pop. They are supposed to be fast
sequential-write SLC drives and so-so at random writes. We'll see.


That will be interesting to see.

The Samsung drives we have are 50GB (64GB) SLC and apparently 2nd  
generation.


For a slog, is random write even an issue?  Or is it just the  
mechanism used to measure the IOPS performance of a typical device?


AFAIUI, the ZIL is used as a ring buffer.  How does that work with an  
SSD?  All this pain really makes me think the only sane slog is one  
that is RAM based and has enough capacitance to either make itself  
permanent or move the data to something permanent before failing  
(FusionIO, DDRdrive, for example).



Re: [zfs-discuss] surprisingly poor performance

2009-07-05 Thread Richard Elling

Ross Walker wrote:


On Jul 5, 2009, at 6:06 AM, James Lever  wrote:



On 04/07/2009, at 3:08 AM, Bob Friesenhahn wrote:

It seems like you may have selected the wrong SSD product to use. 
There seems to be a huge variation in performance (and cost) with 
so-called "enterprise" SSDs.  SSDs with capacitor-backed write 
caches seem to be fastest.


Do you have any methods to "correctly" measure the performance of an 
SSD for the purpose of a slog and any information on others (other 
than anecdotal evidence)?


There are two types of SSD drives on the market, the fast write SLC 
(single level cell) and the slow write MLC (multi level cell). MLC is 
usually used in laptops as SLC drives over 16GB usually go for $1000+ 
which isn't cost effective in a laptop. MLC is good for read caching 
though and most use it for L2ARC.


Please don't classify them as MLC vs SLC or you'll find yourself totally
confused by the modern MLC designs which use SLC as a cache.  Be
happy with specs: random write iops: slow or fast.
-- richard



Re: [zfs-discuss] surprisingly poor performance

2009-07-05 Thread Ross Walker


On Jul 5, 2009, at 6:06 AM, James Lever  wrote:



On 04/07/2009, at 3:08 AM, Bob Friesenhahn wrote:

It seems like you may have selected the wrong SSD product to use.  
There seems to be a huge variation in performance (and cost) with  
so-called "enterprise" SSDs.  SSDs with capacitor-backed write  
caches seem to be fastest.


Do you have any methods to "correctly" measure the performance of an  
SSD for the purpose of a slog and any information on others (other  
than anecdotal evidence)?


There are two types of SSD drives on the market, the fast write SLC  
(single level cell) and the slow write MLC (multi level cell). MLC is  
usually used in laptops as SLC drives over 16GB usually go for $1000+  
which isn't cost effective in a laptop. MLC is good for read caching  
though and most use it for L2ARC.


I just ordered a bunch of 16GB Imation Pro 7500's (formerly Mtron)
from CDW recently for $290 a pop. They are supposed to be fast sequential-write
SLC drives and so-so at random writes. We'll see.


-Ross
 


Re: [zfs-discuss] surprisingly poor performance

2009-07-05 Thread Richard Elling

James Lever wrote:


On 04/07/2009, at 3:08 AM, Bob Friesenhahn wrote:

It seems like you may have selected the wrong SSD product to use. 
There seems to be a huge variation in performance (and cost) with 
so-called "enterprise" SSDs.  SSDs with capacitor-backed write caches 
seem to be fastest.


Do you have any methods to "correctly" measure the performance of an 
SSD for the purpose of a slog and any information on others (other 
than anecdotal evidence)?


First, determine the ZIL pattern for your workload using zilstat.
Then buy an SSD which efficiently handles a sequential workload
which is similar to your workload.  For example, if your workload
creates a lot of small ZIL iops, then you'll want to favor SSDs which
have high small-write IOPS performance.
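
A minimal sketch of doing that (assuming Richard's zilstat DTrace script is
installed and executable; it needs DTrace privileges, and option syntax can
differ between versions):

# sample ZIL activity every 10 seconds, 6 samples, while the workload runs
pfexec ./zilstat 10 6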
-- richard



Re: [zfs-discuss] surprisingly poor performance

2009-07-05 Thread James Lever


On 04/07/2009, at 3:08 AM, Bob Friesenhahn wrote:

It seems like you may have selected the wrong SSD product to use.  
There seems to be a huge variation in performance (and cost) with so- 
called "enterprise" SSDs.  SSDs with capacitor-backed write caches  
seem to be fastest.


Do you have any methods to "correctly" measure the performance of an  
SSD for the purpose of a slog and any information on others (other  
than anecdotal evidence)?


cheers,
James



Re: [zfs-discuss] surprisingly poor performance

2009-07-04 Thread Bob Friesenhahn

On Sat, 4 Jul 2009, James Lever wrote:


Any insightful observations?


Probably multiple slog devices are used to expand slog size and not 
used in parallel since that would require somehow knowing the order. 
The principal bottleneck is likely the update rate of the first device 
in the chain, followed by the update rate of the underlying disks. 
If you put the ramdisk first in the slog chain, the performance is 
likely to jump.


Note that using the non-volatile log device is just a way to defer the 
writes to the underlying device, and the writes need to occur 
eventually or else the slog will fill up.  Ideally the writes to the 
underlying devices can be ordered more sequentially for better 
throughput or else the gain will be short-lived since the slog will 
fill up.


If you do a search, you will find that others have reported less than 
hoped for performance with these Samsung SSDs.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] surprisingly poor performance

2009-07-03 Thread James Lever


On 04/07/2009, at 2:08 PM, Miles Nordin wrote:


iostat -xcnXTdz c3t31d0 1


on that device being used as a slog, a higher range of output looks  
like:


extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0 1477.8    0.0 2955.4  0.0  0.0    0.0    0.0   0   5 c7t2d0
Saturday, July  4, 2009  2:18:48 PM EST
 cpu
 us sy wt id
  0  1  0 99

I started a second task from the first server while using only a  
single slog and the performance of the SSD got up to 1900 w/s


extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0 1945.8    0.0 3891.7  0.0  0.1    0.0    0.0   0   6 c7t2d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c7t3d0
Saturday, July  4, 2009  2:23:11 PM EST
 cpu
 us sy wt id
  0  1  0 99

Interestingly, adding a second SSD into the mix and a 3rd writer (on a  
second client system) showed no further increases:


extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0  942.3    0.0 1884.4  0.0  0.0    0.0    0.0   0   3 c7t2d0
    0.0  942.4    0.0 1884.4  0.0  0.0    0.0    0.0   0   3 c7t3d0

Adding the ramdisk as a 3rd slog with 3 writers gave only an increase in
the speed of the slowest device:


extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0  453.6    0.0 1814.4  0.0  0.0    0.0    0.0   0   1 ramdisk1
    0.0  907.2    0.0 1814.4  0.0  0.0    0.0    0.0   0   3 c7t2d0
    0.0  907.2    0.0 1814.4  0.0  0.0    0.0    0.0   0   3 c7t3d0
Saturday, July  4, 2009  2:29:08 PM EST
 cpu
 us sy wt id
  0  2  0 98

When only the ramdisk is used as a slog, it gives the following results:

extended device statistics
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0 3999.4    0.0 15997.8  0.0  0.0    0.0    0.0   0   2 ramdisk1
Saturday, July  4, 2009  2:36:58 PM EST
 cpu
 us sy wt id
  0  3  0 96

Any insightful observations?

cheers,
James



Re: [zfs-discuss] surprisingly poor performance

2009-07-03 Thread Miles Nordin
> "jl" == James Lever  writes:

jl> if I had disabled the ZIL, writes would have to go direct to
jl> disk (not ZIL) before returning, which would potentially be
jl> even slower than ZIL on zpool.

no, I'm all but certain you are confused.

jl> Has anybody been measuring the IOPS and latency of their SSDs

you might try:

 iostat -xcnXTdz c3t31d0 1 

I haven't done this before though.

jl> One of the developers here had explicitly performed tests to
jl> check these similar assumptions and found no evidence that the
jl> Linux/XFS sync implementation to be lacking even though there
jl> were previous issues with it in one kernel revision.

Did he perform the same test on the one kernel revision with
``issues'', and if so what ``issues'' did the test find?

Also note that it's not only Linux/XFS which must be tested but knfs
and LVM2 as well.

I'm not saying it's broken, only that I've yet to hear of someone
using a decisive test and getting conclusive results---there are only
anecdotal war stories and speculations about how one might test.




Re: [zfs-discuss] surprisingly poor performance

2009-07-03 Thread James Lever


On 03/07/2009, at 10:37 PM, Victor Latushkin wrote:

Slog in ramdisk is analogous to no slog at all and disabled ZIL
(well, it may actually be a bit worse). If you say that your old
system is 5 years old, the difference in the above numbers may be due to
differences in CPU and memory speed, and so it suggests that your
Linux NFS server appears to be working at memory speed, hence
the question. Because if it does not honor sync semantics you are
really comparing apples with oranges here.


The slog in ramdisk is in no way similar to disabling the ZIL.  This  
is an NFS test, so if I had disabled the ZIL, writes would have to go  
direct to disk (not ZIL) before returning, which would potentially be  
even slower than ZIL on zpool.


The appearance of the Linux NFS server performing at memory
speed may just be the BBWC in the LSI MegaRAID SCSI card.  One of the
developers here had explicitly performed tests to check these
assumptions and found no evidence that the Linux/XFS sync
implementation is lacking, even though there were previous issues
with it in one kernel revision.


cheers,
James



Re: [zfs-discuss] surprisingly poor performance

2009-07-03 Thread Ross Walker
On Fri, Jul 3, 2009 at 7:34 AM, James Lever wrote:
> Hi Mertol,
>
> On 03/07/2009, at 6:49 PM, Mertol Ozyoney wrote:
>
>> ZFS SSD usage behaviour heavly depends on access pattern and for asynch
>> ops ZFS will not use SSD's.   I'd suggest you to disable SSD's , create a
>> ram disk and use it as SLOG device to compare the performance. If
>> performance doesnt change, it means that the measurement method have some
>> flaws or you havent configured Slog correctly.
>
> I did some tests with a ramdisk slog and the the write IOPS seemed to run
> about the 4k/s mark vs about 800/s when using the SSD as slog and 200/s
> without a slog.
>
> # osol b117 RAID10+ramdisk slog
> #
> bash-3.2# time tar xf zeroes.tar; rm -rf zeroes/; | tee
> /root/zeroes-test-scalzi-dell-ramdisk_slog.txt
> # tar
> real    1m32.343s
> # rm
> real    0m44.418s
>
> # linux+XFS on Hardware RAID
> bash-3.2# time tar xf zeroes.tar; time rm -rf zeroes/; | tee
> /root/zeroes-test-linux-lsimegaraid_bbwc.txt
> #tar
> real    2m27.791s
> #rm
> real    0m46.112s
>
>> Please note that SSD's are way slower then DRAM based write cache's. SSD's
>> will show performance increase when you create load from multiple clients at
>> the same time, as ZFS will be flushing the dirty cache sequantialy.  So I'd
>> suggest running the test from a lot of clients simultaneously
>
> I'm sure that it will be a more performant system in general, however, it is
> this explicit set of tests that I need to maintain or improve performance
> on.

In my experience with the same setup as yours, but with iSCSI, I find
the built-in write-back cache on the PERC 6/E controllers doesn't
perform so well when spread out over so many logical drives. It's
better than none, for sure, but not as good as an SSD ZIL, I believe.

-Ross


Re: [zfs-discuss] surprisingly poor performance

2009-07-03 Thread erik.ableson
This is something that I've run into as well across various installs  
very similar to the one described (PE2950 backed by an MD1000).  I  
find that overall the write performance across NFS is absolutely  
horrible on 2008.11 and 2009.06.  Worse, I use iSCSI under 2008.11 and  
it's just fine with near wire speeds in most cases, but under 2009.06  
I can't even format a VMFS volume from ESX without hitting a timeout.   
Throughput over the iSCSI connection is mostly around 64K/s with 1  
operation per second.


I'm downgrading my new server back to 2008.11 until I can find a way  
to ensure decent performance since this is really a showstopper. But  
in the meantime I've completely given up on NFS as a primary data  
store - strictly used for templates and iso images and stuff which I  
copy up via scp since it's literally 10 times faster than over NFS.


I have a 2008.11 OpenSolaris server with an MD1000 using 7 mirror  
vdevs. The networking is 4 GbE split into two trunked connections.


Locally, I get 460 MB/s write and 1 GB/s read so raw disk performance  
is not a problem. When I use iSCSI I get wire speed in both directions  
on the GbE from ESX and other clients. However when I use NFS, write  
performance is limited to about 2 MB/s. Read performance is close to  
wire speed.


I'm using a pretty vanilla configuration, using only atime=off and  
sharenfs=anon=0.
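
For reference, that amounts to nothing more than (dataset name hypothetical):

zfs set atime=off tank/nfs
zfs set sharenfs=anon=0 tank/nfs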


I've looked at various tuning guides for NFS with and without ZFS but  
I haven't found anything that seems to address this type of issue.


Anyone have some tuning tips for this issue? Other than adding an SSD  
as a write log or disabling the ZIL.. (although from James' experience  
this too seems to have a limited impact).


Cheers,

Erik
On 3 juil. 09, at 08:39, James Lever wrote:

While this was running, I was looking at the output of zpool iostat  
fastdata 10 to see how it was going and was surprised to see the  
seemingly low IOPS.


jam...@scalzi:~$ zpool iostat fastdata 10
              capacity     operations    bandwidth
pool        used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
fastdata   10.0G  2.02T      0    312    268  3.89M
fastdata   10.0G  2.02T      0    818      0  3.20M
fastdata   10.0G  2.02T      0    811      0  3.17M
fastdata   10.0G  2.02T      0    860      0  3.27M

Strangely, when I added a second SSD as a second slog, it made no  
difference to the write operations.


I'm not sure where to go from here, these results are appalling  
(about 3x the time of the old system with 8x 10kRPM spindles) even  
with two Enterprise SSDs as separate log devices.




Re: [zfs-discuss] surprisingly poor performance

2009-07-03 Thread Miles Nordin
> "vl" == Victor Latushkin  writes:

vl> Above results make me question whether your Linux NFS server
vl> is really honoring synchronous semantics or not...

Any idea how to test it?




Re: [zfs-discuss] surprisingly poor performance

2009-07-03 Thread Bob Friesenhahn

On Fri, 3 Jul 2009, James Lever wrote:


I did some tests with a ramdisk slog and the write IOPS seemed to run 
about the 4k/s mark vs about 800/s when using the SSD as slog and 200/s 
without a slog.


It seems like you may have selected the wrong SSD product to use. 
There seems to be a huge variation in performance (and cost) with 
so-called "enterprise" SSDs.  SSDs with capacitor-backed write caches 
seem to be fastest.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] surprisingly poor performance

2009-07-03 Thread Victor Latushkin

On 03.07.09 15:34, James Lever wrote:

Hi Mertol,

On 03/07/2009, at 6:49 PM, Mertol Ozyoney wrote:

ZFS SSD usage behaviour heavily depends on access pattern, and for
async ops ZFS will not use SSDs.  I'd suggest you disable the SSDs,
create a RAM disk and use it as the slog device to compare the
performance. If performance doesn't change, it means that the
measurement method has some flaws or you haven't configured the slog
correctly.


I did some tests with a ramdisk slog and the write IOPS seemed to 
run about the 4k/s mark vs about 800/s when using the SSD as slog and 
200/s without a slog.


# osol b117 RAID10+ramdisk slog
#
bash-3.2# time tar xf zeroes.tar; rm -rf zeroes/; | tee 
/root/zeroes-test-scalzi-dell-ramdisk_slog.txt

# tar
real    1m32.343s
# rm
real    0m44.418s

# linux+XFS on Hardware RAID
bash-3.2# time tar xf zeroes.tar; time rm -rf zeroes/; | tee 
/root/zeroes-test-linux-lsimegaraid_bbwc.txt

#tar
real    2m27.791s
#rm
real    0m46.112s


The above results make me question whether your Linux NFS server is really 
honoring synchronous semantics or not...


Slog in ramdisk is analogous to no slog at all and disabled ZIL (well, it
may actually be a bit worse). If you say that your old system is 5 years
old, the difference in the above numbers may be due to differences in CPU and
memory speed, and so it suggests that your Linux NFS server appears to
be working at memory speed, hence the question. Because if it does
not honor sync semantics you are really comparing apples with oranges here.


victor


Re: [zfs-discuss] surprisingly poor performance

2009-07-03 Thread James Lever

Hi Mertol,

On 03/07/2009, at 6:49 PM, Mertol Ozyoney wrote:

ZFS SSD usage behaviour heavily depends on access pattern, and for
async ops ZFS will not use SSDs.  I'd suggest you disable the
SSDs, create a RAM disk and use it as the slog device to compare the
performance. If performance doesn't change, it means that the
measurement method has some flaws or you haven't configured the slog
correctly.


I did some tests with a ramdisk slog and the write IOPS seemed to 
run about the 4k/s mark vs about 800/s when using the SSD as slog and 
200/s without a slog.


# osol b117 RAID10+ramdisk slog
#
bash-3.2# time tar xf zeroes.tar; rm -rf zeroes/; | tee /root/zeroes- 
test-scalzi-dell-ramdisk_slog.txt

# tar
real    1m32.343s
# rm
real    0m44.418s

# linux+XFS on Hardware RAID
bash-3.2# time tar xf zeroes.tar; time rm -rf zeroes/; | tee /root/ 
zeroes-test-linux-lsimegaraid_bbwc.txt

#tar
real    2m27.791s
#rm
real    0m46.112s

Please note that SSDs are way slower than DRAM-based write caches.
SSDs will show a performance increase when you create load from
multiple clients at the same time, as ZFS will be flushing the dirty
cache sequentially.  So I'd suggest running the test from a lot of
clients simultaneously.


I'm sure that it will be a more performant system in general, however,  
it is this explicit set of tests that I need to maintain or improve  
performance on.


cheers,
James



Re: [zfs-discuss] surprisingly poor performance

2009-07-03 Thread James Lever

Hej Henrik,

On 03/07/2009, at 8:57 PM, Henrik Johansen wrote:


Have you tried running this locally on your OpenSolaris box - just to
get an idea of what it could deliver in terms of speed ? Which NFS
version are you using ?


Most of the tests shown in my original message are local, except the
explicitly NFS-based metadata test shown at the very end (100k 0-byte
files).  Locally the 100k/0-byte test is effectively atomic, due to caching
semantics and the lack of 100k explicit SYNC requests, so the
transactions can be bundled together and written in one block.


I've just been using NFSv3 so far for these tests, as it is widely 
regarded as faster, even though less functional.


cheers,
James




Re: [zfs-discuss] surprisingly poor performance

2009-07-03 Thread Henrik Johansen

Hi,

James Lever wrote:

Hi All,

We have recently acquired hardware for a new fileserver and my task,  
if I want to use OpenSolaris (osol or sxce) on it is for it to perform  
at least as well as Linux (and our 5 year old fileserver) in our  
environment.


Our current file server is a whitebox Debian server with 8x 10,000 RPM  
SCSI drives behind an LSI MegaRaid controller with a BBU.  The  
filesystem in use is XFS.


The raw performance tests that I have to use to compare them are as  
follows:


 * Create 100,000 0 byte files over NFS
 * Delete 100,000 0 byte files over NFS
 * Repeat the previous 2 tasks with 1k files
 * Untar a copy of our product with object files (quite a nasty test)
 * Rebuild the product "make -j"
 * Delete the build directory

The reason for the 100k files tests is that this has been proven to be  
a significant indicator of desktop performance on the desktop systems  
of the developers.


Within the budget we had, we have purchased the following system to  
meet our goals - if the OpenSolaris tests do not meet our  
requirements, it is certain that the equivalent tests under Linux  
will.  I'm the only person here who wants OpenSolaris specifically so  
it is in my interest to try to get it working at least on par if not  
better than our current system.  So here I am begging for further help.


Dell R710
2x 2.40 Ghz Xeon 5330 CPU
16GB RAM (4x 4GB)

mpt0 SAS 6/i (LSI 1068E)
2x 1TB SATA-II drives (rpool)
2x 50GB Enterprise SSD (slog) - Samsung MCCOE50G5MPQ-0VAD3

mpt1 SAS 5/E (LSI 1068E)
Dell MD1000 15-bay External storage chassis with 2 heads
10x 450GB Seagate Cheetah 15,000 RPM SAS

We also have a PERC 6/E w/512MB BBWC to test with or fall back to if  
we go with a Linux solution.


I have installed OpenSolaris 2009.06 and updated to b117 and used mdb  
to modify the kernel to work around a current bug in b117 with the  
newer Dell systems.  http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6850943


Keeping in mind that with these tests, the external MD1000 chassis is  
connected with a single 4 lane SAS cable which should give 12Gbps or  
1.2GBps of throughput.


Individually, each disk exhibits about 170MB/s raw write performance.   
e.g.


jam...@scalzi:~$ pfexec dd if=/dev/zero of=/dev/rdsk/c8t5d0 bs=65536  
count=32768

2147483648 bytes (2.1 GB) copied, 12.4934 s, 172 MB/s

A single spindle zpool seems to perform OK.

jam...@scalzi:~$ pfexec zpool create single c8t20d0
jam...@scalzi:~$ pfexec dd if=/dev/zero of=/single/foo bs=65536  
count=327680

21474836480 bytes (21 GB) copied, 127.201 s, 169 MB/s

RAID10 tests seem to be quite slow (about half the speed I would have  
expected - 170*5 = 850, I would have expected to see around 800MB/s)


jam...@scalzi:~$ pfexec zpool create fastdata mirror c8t9d0 c8t10d0  
mirror c8t11d0 c8t15d0 mirror c8t16d0 c8t17d0 mirror c8t18d0 c8t19d0  
mirror c8t20d0 c8t21d0


jam...@scalzi:~$ pfexec dd if=/dev/zero of=/fastdata/foo bs=131072  
count=163840

21474836480 bytes (21 GB) copied, 50.3066 s, 427 MB/s

a 5 disk stripe seemed to perform as expected

jam...@scalzi:~$ pfexec zpool create fastdata c8t10d0 c8t15d0 c8t17d0  
c8t19d0 c8t21d0


jam...@scalzi:~$ pfexec dd if=/dev/zero of=/fastdata/foo bs=131072  
count=163840

21474836480 bytes (21 GB) copied, 27.7972 s, 773 MB/s

but a 10 disk stripe did not increase significantly

jam...@scalzi:~$ pfexec zpool create fastdata c8t10d0 c8t15d0 c8t17d0  
c8t19d0 c8t21d0 c8t20d0 c8t18d0 c8t16d0 c8t11d0 c8t9d0


jam...@scalzi:~$ pfexec dd if=/dev/zero of=/fastdata/foo bs=131072  
count=163840

21474836480 bytes (21 GB) copied, 26.1189 s, 822 MB/s

The best sequential write test I could elicit with redundancy was a  
pool with 2x 5 disk RAIDZ's striped


jam...@scalzi:~$ pfexec zpool create fastdata raidz c8t10d0 c8t15d0  
c8t16d0 c8t11d0 c8t9d0 raidz c8t17d0 c8t19d0 c8t21d0 c8t20d0 c8t18d0


jam...@scalzi:~$ pfexec dd if=/dev/zero of=/fastdata/foo bs=131072  
count=163840

21474836480 bytes (21 GB) copied, 31.3934 s, 684 MB/s

Moving onto testing NFS and trying to perform the create 100,000 0  
byte files (aka, the metadata and NFS sync test).  The test seemed to  
be likely to take about half an hour without a slog as I worked out  
when I killed it.  Painfully slow.  So I added one of the SSDs to the  
system as a slog which improved matters.  The client is a Red Hat  
Enterprise Linux server on modern hardware and has been used for all  
tests against our old fileserver.


The time to beat: RHEL5 client to Debian4+XFS server:

bash-3.2# time tar xf zeroes.tar

real    2m41.979s
user    0m0.420s
sys     0m5.255s

And on the currently configured system:

jam...@scalzi:~$ pfexec zpool create fastdata mirror c8t9d0 c8t10d0  
mirror c8t11d0 c8t15d0 mirror c8t16d0 c8t17d0 mirror c8t18d0 c8t19d0  
mirror c8t20d0 c8t21d0 log c7t2d0


jam...@scalzi:~$ pfexec zfs set sharenfs='rw,ro...@10.1.0/23' fastdata

bash-3.2# time tar xf zeroes.tar

real    8m7.176s

Re: [zfs-discuss] surprisingly poor performance

2009-07-03 Thread Mertol Ozyoney
Hi James,

ZFS SSD usage behaviour heavily depends on access pattern, and for async ops
ZFS will not use SSDs.
I'd suggest you disable the SSDs, create a RAM disk and use it as the slog
device to compare the performance. If performance doesn't change, it means
that the measurement method has some flaws or you haven't configured the slog
correctly.
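
A sketch of that comparison (ramdisk name and size arbitrary; note that at the
time a log device could not be removed from a pool again, per CR 6574286, so
do this on a scratch or test pool):

# create a 2 GB ramdisk and attach it as the slog
pfexec ramdiskadm -a slogtest 2g
pfexec zpool add fastdata log /dev/ramdisk/slogtest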

Please note that SSDs are way slower than DRAM-based write caches. SSDs
will show a performance increase when you create load from multiple clients at
the same time, as ZFS will be flushing the dirty cache sequentially.  So I'd
suggest running the test from a lot of clients simultaneously.

Best regards
Mertol 

Mertol Ozyoney 
Storage Practice - Sales Manager

Sun Microsystems, TR
Istanbul TR
Phone +902123352200
Mobile +905339310752
Fax +90212335
Email mertol.ozyo...@sun.com


-Original Message-
From: zfs-discuss-boun...@opensolaris.org
[mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of James Lever
Sent: Friday, July 03, 2009 10:09 AM
To: Brent Jones
Cc: zfs-discuss; storage-disc...@opensolaris.org
Subject: Re: [zfs-discuss] surprisingly poor performance


On 03/07/2009, at 5:03 PM, Brent Jones wrote:

> Are you sure the slog is working right? Try disabling the ZIL to see
> if that helps with your NFS performance.
> If your performance increases a hundred fold, I'm suspecting the slog
> isn't perming well, or even doing its job at all.

The slog appears to be working fine - at ~800 IOPS it wasn't lighting  
up the light significantly and when a second was added both activity  
lights were even more dim.  Without the slog, the pool was only  
providing ~200 IOPS for the NFS metadata test.

Speaking of which, can anybody point me at a good, valid test to  
measure the IOPS of these SSDs?

cheers,
James



Re: [zfs-discuss] surprisingly poor performance

2009-07-03 Thread James Lever


On 03/07/2009, at 5:03 PM, Brent Jones wrote:


Are you sure the slog is working right? Try disabling the ZIL to see
if that helps with your NFS performance.
If your performance increases a hundred fold, I'm suspecting the slog
isn't performing well, or even doing its job at all.


The slog appears to be working fine - at ~800 IOPS it wasn't lighting  
up its activity light significantly, and when a second was added both activity  
lights were even dimmer.  Without the slog, the pool was only  
providing ~200 IOPS for the NFS metadata test.


Speaking of which, can anybody point me at a good, valid test to  
measure the IOPS of these SSDs?


cheers,
James



Re: [zfs-discuss] surprisingly poor performance

2009-07-03 Thread Brent Jones
On Thu, Jul 2, 2009 at 11:39 PM, James Lever wrote:
> Hi All,
>
> We have recently acquired hardware for a new fileserver and my task, if I
> want to use OpenSolaris (osol or sxce) on it is for it to perform at least
> as well as Linux (and our 5 year old fileserver) in our environment.
>
> Our current file server is a whitebox Debian server with 8x 10,000 RPM SCSI
> drives behind an LSI MegaRaid controller with a BBU.  The filesystem in use
> is XFS.
>
> The raw performance tests that I have to use to compare them are as follows:
>
>  * Create 100,000 0 byte files over NFS
>  * Delete 100,000 0 byte files over NFS
>  * Repeat the previous 2 tasks with 1k files
>  * Untar a copy of our product with object files (quite a nasty test)
>  * Rebuild the product "make -j"
>  * Delete the build directory
>
> The reason for the 100k files tests is that this has been proven to be a
> significant indicator of desktop performance on the desktop systems of the
> developers.
>
> Within the budget we had, we have purchased the following system to meet our
> goals - if the OpenSolaris tests do not meet our requirements, it is certain
> that the equivalent tests under Linux will.  I'm the only person here who
> wants OpenSolaris specificially so it is in my interest to try to get it
> working at least on par if not better than our current system.  So here I am
> begging for further help.
>
> Dell R710
> 2x 2.40 Ghz Xeon 5330 CPU
> 16GB RAM (4x 4GB)
>
> mpt0 SAS 6/i (LSI 1068E)
> 2x 1TB SATA-II drives (rpool)
> 2x 50GB Enterprise SSD (slog) - Samsung MCCOE50G5MPQ-0VAD3
>
> mpt1 SAS 5/E (LSI 1068E)
> Dell MD1000 15-bay External storage chassis with 2 heads
> 10x 450GB Seagate Cheetah 15,000 RPM SAS
>
> We also have a PERC 6/E w/512MB BBWC to test with or fall back to if we go
> with a Linux solution.
>
> I have installed OpenSolaris 2009.06 and updated to b117 and used mdb to
> modify the kernel to work around a current bug in b117 with the newer Dell
> systems.
>  http://bugs.opensolaris.org/bugdatabase/view_bug.do%3Bjsessionid=76a34f41df5bbbfc2578934eeff8?bug_id=6850943
>
> Keeping in mind that with these tests, the external MD1000 chassis is
> connected with a single 4 lane SAS cable which should give 12Gbps or 1.2GBps
> of throughput.
>
> Individually, each disk exhibits about 170MB/s raw write performance.  e.g.
>
> jam...@scalzi:~$ pfexec dd if=/dev/zero of=/dev/rdsk/c8t5d0 bs=65536
> count=32768
> 2147483648 bytes (2.1 GB) copied, 12.4934 s, 172 MB/s
>
> A single spindle zpool seems to perform OK.
>
> jam...@scalzi:~$ pfexec zpool create single c8t20d0
> jam...@scalzi:~$ pfexec dd if=/dev/zero of=/single/foo bs=65536 count=327680
> 21474836480 bytes (21 GB) copied, 127.201 s, 169 MB/s
>
> RAID10 tests seem to be quite slow (about half the speed I would have
> expected - 170*5 = 850, I would have expected to see around 800MB/s)
>
> jam...@scalzi:~$ pfexec zpool create fastdata mirror c8t9d0 c8t10d0 mirror
> c8t11d0 c8t15d0 mirror c8t16d0 c8t17d0 mirror c8t18d0 c8t19d0 mirror c8t20d0
> c8t21d0
>
> jam...@scalzi:~$ pfexec dd if=/dev/zero of=/fastdata/foo bs=131072
> count=163840
> 21474836480 bytes (21 GB) copied, 50.3066 s, 427 MB/s
>
> a 5 disk stripe seemed to perform as expected
>
> jam...@scalzi:~$ pfexec zpool create fastdata c8t10d0 c8t15d0 c8t17d0
> c8t19d0 c8t21d0
>
> jam...@scalzi:~$ pfexec dd if=/dev/zero of=/fastdata/foo bs=131072
> count=163840
> 21474836480 bytes (21 GB) copied, 27.7972 s, 773 MB/s
>
> but a 10 disk stripe did not increase significantly
>
> jam...@scalzi:~$ pfexec zpool create fastdata c8t10d0 c8t15d0 c8t17d0
> c8t19d0 c8t21d0 c8t20d0 c8t18d0 c8t16d0 c8t11d0 c8t9d0
>
> jam...@scalzi:~$ pfexec dd if=/dev/zero of=/fastdata/foo bs=131072
> count=163840
> 21474836480 bytes (21 GB) copied, 26.1189 s, 822 MB/s
>
> The best sequential write test I could elicit with redundancy was a pool
> with 2x 5 disk RAIDZ's striped
>
> jam...@scalzi:~$ pfexec zpool create fastdata raidz c8t10d0 c8t15d0 c8t16d0
> c8t11d0 c8t9d0 raidz c8t17d0 c8t19d0 c8t21d0 c8t20d0 c8t18d0
>
> jam...@scalzi:~$ pfexec dd if=/dev/zero of=/fastdata/foo bs=131072
> count=163840
> 21474836480 bytes (21 GB) copied, 31.3934 s, 684 MB/s
>
> Moving onto testing NFS and trying to perform the create 100,000 0 byte
> files (aka, the metadata and NFS sync test).  The test seemed to be likely
> to take about half an hour without a slog as I worked out when I killed it.
>  Painfully slow.  So I added one of the SSDs to the system as a slog which
> improved matters.  The client is a Red Hat Enterprise Linux server on modern
> hardware and has been used for all tests against our old fileserver.
>
> The time to beat: RHEL5 client to Debian4+XFS server:
>
> bash-3.2# time tar xf zeroes.tar
>
> real    2m41.979s
> user    0m0.420s
> sys     0m5.255s
>
> And on the currently configured system:
>
> jam...@scalzi:~$ pfexec zpool create fastdata mirror c8t9d0 c8t10d0 mirror
> c8t11d0 c8t15d0 mirror c8t16d0 c8t17d0 mirror c8t18d0 c8t19d0 mirror c8t20d

[zfs-discuss] surprisingly poor performance

2009-07-02 Thread James Lever

Hi All,

We have recently acquired hardware for a new fileserver, and my task,  
if I want to use OpenSolaris (osol or sxce) on it, is for it to perform  
at least as well as Linux (and our 5-year-old fileserver) in our  
environment.


Our current file server is a whitebox Debian server with 8x 10,000 RPM  
SCSI drives behind an LSI MegaRaid controller with a BBU.  The  
filesystem in use is XFS.


The raw performance tests that I have to use to compare them are as  
follows:


 * Create 100,000 0 byte files over NFS
 * Delete 100,000 0 byte files over NFS
 * Repeat the previous 2 tasks with 1k files
 * Untar a copy of our product with object files (quite a nasty test)
 * Rebuild the product "make -j"
 * Delete the build directory

The reason for the 100k files tests is that this has been proven to be  
a significant indicator of desktop performance on the desktop systems  
of the developers.


Within the budget we had, we have purchased the following system to  
meet our goals - if the OpenSolaris tests do not meet our  
requirements, it is certain that the equivalent tests under Linux  
will.  I'm the only person here who wants OpenSolaris specifically so  
it is in my interest to try to get it working at least on par if not  
better than our current system.  So here I am begging for further help.


Dell R710
2x 2.40 Ghz Xeon 5330 CPU
16GB RAM (4x 4GB)

mpt0 SAS 6/i (LSI 1068E)
2x 1TB SATA-II drives (rpool)
2x 50GB Enterprise SSD (slog) - Samsung MCCOE50G5MPQ-0VAD3

mpt1 SAS 5/E (LSI 1068E)
Dell MD1000 15-bay External storage chassis with 2 heads
10x 450GB Seagate Cheetah 15,000 RPM SAS

We also have a PERC 6/E w/512MB BBWC to test with or fall back to if  
we go with a Linux solution.


I have installed OpenSolaris 2009.06 and updated to b117 and used mdb  
to modify the kernel to work around a current bug in b117 with the  
newer Dell systems.  http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6850943


Keeping in mind that with these tests, the external MD1000 chassis is  
connected with a single 4 lane SAS cable which should give 12Gbps or  
1.2GBps of throughput.


Individually, each disk exhibits about 170MB/s raw write performance.   
e.g.


jam...@scalzi:~$ pfexec dd if=/dev/zero of=/dev/rdsk/c8t5d0 bs=65536  
count=32768

2147483648 bytes (2.1 GB) copied, 12.4934 s, 172 MB/s

A single spindle zpool seems to perform OK.

jam...@scalzi:~$ pfexec zpool create single c8t20d0
jam...@scalzi:~$ pfexec dd if=/dev/zero of=/single/foo bs=65536  
count=327680

21474836480 bytes (21 GB) copied, 127.201 s, 169 MB/s

RAID10 tests seem to be quite slow (about half the speed I would have  
expected - 170*5 = 850, I would have expected to see around 800MB/s)


jam...@scalzi:~$ pfexec zpool create fastdata mirror c8t9d0 c8t10d0  
mirror c8t11d0 c8t15d0 mirror c8t16d0 c8t17d0 mirror c8t18d0 c8t19d0  
mirror c8t20d0 c8t21d0


jam...@scalzi:~$ pfexec dd if=/dev/zero of=/fastdata/foo bs=131072  
count=163840

21474836480 bytes (21 GB) copied, 50.3066 s, 427 MB/s

a 5 disk stripe seemed to perform as expected

jam...@scalzi:~$ pfexec zpool create fastdata c8t10d0 c8t15d0 c8t17d0  
c8t19d0 c8t21d0


jam...@scalzi:~$ pfexec dd if=/dev/zero of=/fastdata/foo bs=131072  
count=163840

21474836480 bytes (21 GB) copied, 27.7972 s, 773 MB/s

but a 10 disk stripe did not increase significantly

jam...@scalzi:~$ pfexec zpool create fastdata c8t10d0 c8t15d0 c8t17d0  
c8t19d0 c8t21d0 c8t20d0 c8t18d0 c8t16d0 c8t11d0 c8t9d0


jam...@scalzi:~$ pfexec dd if=/dev/zero of=/fastdata/foo bs=131072  
count=163840

21474836480 bytes (21 GB) copied, 26.1189 s, 822 MB/s

The best sequential write test I could elicit with redundancy was a  
pool with 2x 5 disk RAIDZ's striped


jam...@scalzi:~$ pfexec zpool create fastdata raidz c8t10d0 c8t15d0  
c8t16d0 c8t11d0 c8t9d0 raidz c8t17d0 c8t19d0 c8t21d0 c8t20d0 c8t18d0


jam...@scalzi:~$ pfexec dd if=/dev/zero of=/fastdata/foo bs=131072  
count=163840

21474836480 bytes (21 GB) copied, 31.3934 s, 684 MB/s

Moving on to testing NFS and trying to perform the create 100,000  
0-byte files test (aka the metadata and NFS sync test).  The test seemed  
likely to take about half an hour without a slog, as I worked out  
when I killed it.  Painfully slow.  So I added one of the SSDs to the  
system as a slog which improved matters.  The client is a Red Hat  
Enterprise Linux server on modern hardware and has been used for all  
tests against our old fileserver.


The time to beat: RHEL5 client to Debian4+XFS server:

bash-3.2# time tar xf zeroes.tar

real    2m41.979s
user    0m0.420s
sys     0m5.255s

And on the currently configured system:

jam...@scalzi:~$ pfexec zpool create fastdata mirror c8t9d0 c8t10d0  
mirror c8t11d0 c8t15d0 mirror c8t16d0 c8t17d0 mirror c8t18d0 c8t19d0  
mirror c8t20d0 c8t21d0 log c7t2d0


jam...@scalzi:~$ pfexec zfs set sharenfs='rw,ro...@10.1.0/23' fastdata

bash-3.2# time tar xf zeroes.tar

real    8m7.176s
user    0m0.438s