Re: [zfs-discuss] Re: 3510 HW RAID vs 3510 JBOD ZFS SOFTWARE RAID
Robert Milkowski wrote:
> ps. However, I'm really concerned about ZFS behavior when a pool is almost full, there are a lot of write transactions to that pool, and the server is restarted forcibly or panics. I observed that the file systems on that pool each take 10-30 minutes to mount during zfs mount -a, with one CPU completely consumed. It happens during system start-up, so the whole boot waits for it - an additional hour of downtime. This was really unexpected for me, and unfortunately no one was really interested in my report - I know people are busy. But if it hits other users once their ZFS pools are already populated, people won't be happy. For more details see my post here with the subject "zfs mount stuck in zil_replay".

That problem must have fallen through the cracks. Yes, we are busy, but we really do care about your experiences and bugs. I have just raised a bug to cover this issue:

    6460107 Extremely slow mounts after panic - searching space maps during replay

Thanks for reporting this and helping make ZFS better.

Neil
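For anyone who hits this before the fix, one way to confirm that the mounts are burning CPU in ZIL replay / space-map code is to profile kernel stacks while zfs mount -a is stuck. This is only a sketch using the standard DTrace profile provider; the assumption is that the hot stacks will show the replay and space-map functions described in 6460107.

    # sample on-CPU kernel stacks for 30 seconds while the mount hangs
    dtrace -n 'profile-997 /arg0/ { @[stack()] = count(); }
               tick-30s { trunc(@, 20); exit(0); }'

The 20 most frequent kernel stacks are printed when the script exits.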
Re: [zfs-discuss] Re: 3510 HW RAID vs 3510 JBOD ZFS SOFTWARE RAID
> The test case was build 38, Solaris 11, a 2 GB file, initially created with 1 MB SW, and a recsize of 8 KB, on a pool with two raid-z 5+1 vdevs, accessed with 24 threads of 8 KB RW, for 500,000 ops or 40 seconds, whichever came first. The result at the pool level was that 78% of the operations were RR - all overhead.

Hi David,

Could this bug (now fixed) have hit you?

    6424554 full block re-writes need not read data in
Re: [zfs-discuss] Re: 3510 HW RAID vs 3510 JBOD ZFS SOFTWARE RAID
Hello Dave,

Thursday, August 10, 2006, 12:29:05 AM, you wrote:

DF> Hi,
DF> Note that these are page cache rates and that if the application
DF> pushes harder and exposes the supporting device rates there is
DF> another world of performance to be observed. This is where ZFS
DF> gets to be a challenge, as the relationship between the application-
DF> level I/O and the pool level is very hard to predict. For example,
DF> the COW may or may not have to read old data for a small I/O
DF> update operation, and a large portion of the pool vdev capability
DF> can be spent on this kind of overhead. Also, on read, if the
DF> pattern is random, you may or may not receive any benefit from the
DF> 32 KB to 128 KB reads on each disk of the pool vdev on behalf of a
DF> small read, say 8 KB by the application - again, lots of overhead
DF> potential. I am not complaining, ZFS is great, I'm a fan, but you
DF> definitely have your work cut out for you if you want to predict
DF> its ability to scale for any given workload.

I know, you have valid concerns. However, in the tests I performed ZFS behaved better than UFS, and that was most important for me. Does that mean it will perform better than UFS in production? Well, I don't know - but thanks to these tests (and some others I haven't posted) I'm more confident that it's unlikely to behave worse. And that is only the performance point of view; there are other considerations which are also important.

ps. However, I'm really concerned about ZFS behavior when a pool is almost full, there are a lot of write transactions to that pool, and the server is restarted forcibly or panics. I observed that the file systems on that pool each take 10-30 minutes to mount during zfs mount -a, with one CPU completely consumed. It happens during system start-up, so the whole boot waits for it - an additional hour of downtime. This was really unexpected for me, and unfortunately no one was really interested in my report - I know people are busy. But if it hits other users once their ZFS pools are already populated, people won't be happy. For more details see my post here with the subject "zfs mount stuck in zil_replay".

--
Best regards,
 Robert                            mailto:[EMAIL PROTECTED]
                                   http://milek.blogspot.com
Re: [zfs-discuss] Re: 3510 HW RAID vs 3510 JBOD ZFS SOFTWARE RAID
Hi Matthew,

In the case of the 8 KB Random Write to the 128 KB recsize filesystem, the I/Os were not full-block re-writes, yet the expected COW Random Read (RR) at the pool level was somehow avoided. I suspect it was able to coalesce enough I/O in the 5-second transaction window to construct full 128 KB blocks. This was, after all, 24 threads of I/O to a 2 GB file at a rate of 140,000 IOPS. However, when using the 8 KB recsize it was not able to do this.

I will check to see if it's fixed in b45. Thanks!

Dave

The 8 KB update to a 128 KB block, however, did not have much Random Read (RR) at the pool level. The 8 KB RW to the 8 KB recsize filesystem is where I generally observed RR at the pool level. RR is Random Read, RW is Random Write...

Dave

Matthew Ahrens wrote:
> On Wed, Aug 09, 2006 at 04:24:55PM -0700, Dave C. Fisk wrote:
> > Hi Eric,
> >
> > Thanks for the information.
> >
> > I am aware of the recsize option and its intended use. However, when I
> > was exploring it to confirm the expected behavior, what I found was the
> > opposite!
> >
> > The test case was build 38, Solaris 11, a 2 GB file, initially created
> > with 1 MB SW, and a recsize of 8 KB, on a pool with two raid-z 5+1,
> > accessed with 24 threads of 8 KB RW, for 500,000 ops or 40 seconds
> > whichever came first. The result at the pool level was 78% of the
> > operations were RR, all overhead. For the same test, with a 128 KB
> > recsize (the default), the pool access was pure SW, beautiful.
>
> I'm not sure what RR means, but you should re-try your tests on build 42
> or later. Earlier builds have bug 6424554 "full block re-writes need not
> read data in", which will cause a lot more data to be read than is
> necessary when overwriting entire blocks.
>
> --matt

--
Dave Fisk, ORtera Inc.
Phone (562) 433-7078
[EMAIL PROTECTED]
http://www.ORtera.com
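A rough back-of-envelope check of the coalescing explanation, using only the figures quoted above (an illustration, not a new measurement):

    140,000 ops/s x 8 KB x ~5 s txg window  ~= 5.5 GB dirtied per transaction group
    file size                                =  2 GB

So within one transaction group essentially every 128 KB record of the file is overwritten in full, and ZFS can write the new block without reading the old data back in - which would explain the absence of COW reads in the 128 KB recsize case.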
Re: [zfs-discuss] Re: 3510 HW RAID vs 3510 JBOD ZFS SOFTWARE RAID
On Wed, Aug 09, 2006 at 04:24:55PM -0700, Dave C. Fisk wrote:
> Hi Eric,
>
> Thanks for the information.
>
> I am aware of the recsize option and its intended use. However, when I
> was exploring it to confirm the expected behavior, what I found was the
> opposite!
>
> The test case was build 38, Solaris 11, a 2 GB file, initially created
> with 1 MB SW, and a recsize of 8 KB, on a pool with two raid-z 5+1,
> accessed with 24 threads of 8 KB RW, for 500,000 ops or 40 seconds which
> ever came first. The result at the pool level was 78% of the operations
> were RR, all overhead. For the same test, with a 128 KB recsize (the
> default), the pool access was pure SW, beautiful.

I'm not sure what RR means, but you should re-try your tests on build 42 or later. Earlier builds have bug 6424554 "full block re-writes need not read data in" which will cause a lot more data to be read than is necessary, when overwriting entire blocks.

--matt
Re: [zfs-discuss] Re: 3510 HW RAID vs 3510 JBOD ZFS SOFTWARE RAID
Hi Eric,

Thanks for the information.

I am aware of the recsize option and its intended use. However, when I was exploring it to confirm the expected behavior, what I found was the opposite!

The test case was build 38, Solaris 11, a 2 GB file, initially created with 1 MB SW, and a recsize of 8 KB, on a pool with two raid-z 5+1 vdevs, accessed with 24 threads of 8 KB RW, for 500,000 ops or 40 seconds, whichever came first. The result at the pool level was that 78% of the operations were RR - all overhead. For the same test with a 128 KB recsize (the default), the pool access was pure SW, beautiful.

I ran this test 5 times. The results with an 8 KB recsize were consistent; however, ONE of the 128 KB recsize tests did have 62% RR at the pool level - not exactly a confidence builder for predictability.

As I understand it, the striping logic is separate from the on-disk format and can be changed in the future, so I would suggest a variant of raid-z (raid-z+) that would have a variable stripe width instead of a variable stripe unit. The worst case would be 1+1, but you would generally do better than mirroring in terms of the number of drives used for protection, and you could avoid dividing an 8 KB I/O over say 5, 10 or (god forbid) 47 drives. It would be much less overhead - something like 200 to 1 in one analysis, if I recall correctly - and hence much better performance.

I will be happy to post ORtera summary reports for a pair of these tests if you would like to see the numbers. However, the forum would be the better place to post the reports.

Regards,
Dave

Eric Schrock wrote:
> On Wed, Aug 09, 2006 at 03:29:05PM -0700, Dave Fisk wrote:
> > For example the COW may or may not have to read old data for a small
> > I/O update operation, and a large portion of the pool vdev capability
> > can be spent on this kind of overhead.
>
> This is what the 'recordsize' property is for. If you have a workload
> that works on large files in very small sized chunks, setting the
> recordsize before creating the files will result in a big improvement.
>
> > Also, on read, if the pattern is random, you may or may not receive
> > any benefit from the 32 KB to 128 KB reads on each disk of the pool
> > vdev on behalf of a small read, say 8 KB by the application, again
> > lots of overhead potential.
>
> We're evaluating the tradeoffs on this one. The original vdev cache has
> been around forever, and hasn't really been reevaluated in the context
> of the latest improvements. See:
>
>   6437054 vdev_cache: wise up or die
>
> The DMU-level prefetch code had to undergo a similar overhaul, and was
> fixed up in build 45.
>
> - Eric
>
> --
> Eric Schrock, Solaris Kernel Development
> http://blogs.sun.com/eschrock

--
Dave Fisk, ORtera Inc.
Phone (562) 433-7078
[EMAIL PROTECTED]
http://www.ORtera.com
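To put a rough number on the read amplification behind the raid-z+ suggestion, here is a sketch of how a small block lands on a 5+1 raid-z vdev (this assumes the usual raid-z layout, where each block is split across the data columns; sector padding is ignored):

    8 KB block / 5 data disks  ~= 1.6 KB per data disk (+ one parity chunk on write)

    => a random 8 KB application read keeps ~5 spindles busy for ~1.6 KB each,
       instead of 1 spindle for 8 KB, so per-disk IOPS are consumed roughly
       5x faster than the application-level read rate.

A variable stripe width (in the worst case 1 data + 1 parity column, as Dave notes) would trade some capacity efficiency for far fewer spindles touched per small I/O.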
Re: [zfs-discuss] Re: 3510 HW RAID vs 3510 JBOD ZFS SOFTWARE RAID
On Wed, Aug 09, 2006 at 03:29:05PM -0700, Dave Fisk wrote:
>
> For example the COW may or may not have to read old data for a small
> I/O update operation, and a large portion of the pool vdev capability
> can be spent on this kind of overhead.

This is what the 'recordsize' property is for. If you have a workload that works on large files in very small sized chunks, setting the recordsize before creating the files will result in a big improvement.

> Also, on read, if the pattern is random, you may or may not
> receive any benefit from the 32 KB to 128 KB reads on each disk of the
> pool vdev on behalf of a small read, say 8 KB by the application,
> again lots of overhead potential.

We're evaluating the tradeoffs on this one. The original vdev cache has been around forever, and hasn't really been reevaluated in the context of the latest improvements. See:

  6437054 vdev_cache: wise up or die

The DMU-level prefetch code had to undergo a similar overhaul, and was fixed up in build 45.

- Eric

--
Eric Schrock, Solaris Kernel Development
http://blogs.sun.com/eschrock
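As a concrete illustration of the recordsize advice (the dataset names here are made up; the key point is that recordsize only affects files created after the property is set):

    # match the record size to the application's 8 KB I/O size
    # *before* the data files are created
    zfs create tank/smallio
    zfs set recordsize=8K tank/smallio
    zfs get recordsize tank/smallio

Files that already exist keep the block size they were written with, so for an existing data set the files need to be recreated (e.g. copied) after the change.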
Re: [zfs-discuss] Re: 3510 HW RAID vs 3510 JBOD ZFS SOFTWARE RAID
On Tue, Aug 08, 2006 at 06:11:09PM +0200, Robert Milkowski wrote:
> filebench/singlestreamread v440
>
> 1. UFS, noatime, HW RAID5 6 disks, S10U2
>    70MB/s
>
> 2. ZFS, atime=off, HW RAID5 6 disks, S10U2 (the same lun as in #1)
>    87MB/s
>
> 3. ZFS, atime=off, SW RAID-Z 6 disks, S10U2
>    130MB/s
>
> 4. ZFS, atime=off, SW RAID-Z 6 disks, snv_44
>    133MB/s

FYI, streaming read performance is improved considerably by Mark's prefetch fixes, which are in build 45. (However, as mentioned, you will soon run into the bandwidth of a single fibre channel connection.)

--matt
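For a sense of where that fibre-channel ceiling sits (assuming the 3510's host ports are 2 Gb/s FC, which is what that array shipped with):

    2 Gbit/s x 8/10 (8b/10b encoding) / 8 bits per byte  ~= 200 MB/s per link

so the 130-133 MB/s raid-z numbers above are already using roughly two-thirds of a single link.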
Re: [zfs-discuss] Re: 3510 HW RAID vs 3510 JBOD ZFS SOFTWARE RAID
Luke Lonergan wrote:
> Robert,
>
> On 8/8/06 9:11 AM, "Robert Milkowski" <[EMAIL PROTECTED]> wrote:
>
> > 1. UFS, noatime, HW RAID5 6 disks, S10U2
> >    70MB/s
> >
> > 2. ZFS, atime=off, HW RAID5 6 disks, S10U2 (the same lun as in #1)
> >    87MB/s
> >
> > 3. ZFS, atime=off, SW RAID-Z 6 disks, S10U2
> >    130MB/s
> >
> > 4. ZFS, atime=off, SW RAID-Z 6 disks, snv_44
> >    133MB/s
>
> Well, the UFS results are miserable, but the ZFS results aren't good -
> I'd expect between 250-350MB/s from a 6-disk RAID5 with read() blocksize
> from 8kb to 32kb.
>
> Most of my ZFS experiments have been with RAID10, but there were some
> massive improvements to seq I/O with the fixes I mentioned - I'd expect
> that this shows that they aren't in snv44.

Those fixes went into snv_45.

-Mark
RE: [zfs-discuss] Re: 3510 HW RAID vs 3510 JBOD ZFS SOFTWARE RAID
Does snv44 have the ZFS fixes to the I/O scheduler, the ARC and the prefetch logic?

These are great results for random I/O; I wonder how the sequential I/O looks. Of course you'll not get great results for sequential I/O on the 3510 :-)

- Luke

Sent from my GoodLink synchronized handheld (www.good.com)

-----Original Message-----
From: Robert Milkowski [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 08, 2006 10:15 AM Eastern Standard Time
To: zfs-discuss@opensolaris.org
Subject: [zfs-discuss] Re: 3510 HW RAID vs 3510 JBOD ZFS SOFTWARE RAID

Hi.

This time some RAID5/RAID-Z benchmarks. I connected the 3510 head unit with one link to the same server the 3510 JBODs are connected to (using a second link). snv_44 is used; the server is a v440.

I also tried changing the maximum number of pending IO requests for the HW RAID-5 LUN and checked with DTrace that the larger value is really used - it is, but it doesn't change the benchmark numbers.

1. ZFS on HW RAID5 with 6 disks, atime=off

   IO Summary: 444386 ops 7341.7 ops/s, (1129/1130 r/w) 36.1mb/s, 297us cpu/op, 6.6ms latency
   IO Summary: 438649 ops 7247.0 ops/s, (1115/1115 r/w) 35.5mb/s, 293us cpu/op, 6.7ms latency

2. ZFS with software RAID-Z with 6 disks, atime=off

   IO Summary: 457505 ops 7567.3 ops/s, (1164/1164 r/w) 37.2mb/s, 340us cpu/op, 6.4ms latency
   IO Summary: 457767 ops 7567.8 ops/s, (1164/1165 r/w) 36.9mb/s, 340us cpu/op, 6.4ms latency

3. UFS on HW RAID5 with 6 disks, noatime

   IO Summary: 62776 ops 1037.3 ops/s, (160/160 r/w) 5.5mb/s, 481us cpu/op, 49.7ms latency
   IO Summary: 63661 ops 1051.6 ops/s, (162/162 r/w) 5.4mb/s, 477us cpu/op, 49.1ms latency

4. UFS on HW RAID5 with 6 disks, noatime, S10U2 + patches (the same filesystem mounted as in #3)

   IO Summary: 393167 ops 6503.1 ops/s, (1000/1001 r/w) 32.4mb/s, 405us cpu/op, 7.5ms latency
   IO Summary: 394525 ops 6521.2 ops/s, (1003/1003 r/w) 32.0mb/s, 407us cpu/op, 7.7ms latency

5. ZFS with software RAID-Z with 6 disks, atime=off, S10U2 + patches (the same disks as in test #2)

   IO Summary: 461708 ops 7635.5 ops/s, (1175/1175 r/w) 37.4mb/s, 330us cpu/op, 6.4ms latency
   IO Summary: 457649 ops 7562.1 ops/s, (1163/1164 r/w) 37.0mb/s, 328us cpu/op, 6.5ms latency

In this benchmark, software RAID-5 with ZFS (raid-z, to be precise) gives slightly better performance than hardware RAID-5. ZFS is also faster than UFS on HW raid in both cases (HW and SW raid). Something is wrong with UFS on snv_44 - the same UFS filesystem on S10U2 works as expected. ZFS on S10U2 gives the same results in this benchmark as on snv_44.

Details:

// c2t43d0 is a HW raid5 made of 6 disks
// array is configured for random IO's
# zpool create HW_RAID5_6disks c2t43d0
#
# zpool create -f zfs_raid5_6disks raidz c3t16d0 c3t17d0 c3t18d0 c3t19d0 c3t20d0 c3t21d0
#
# zfs set atime=off zfs_raid5_6disks HW_RAID5_6disks
#
# zfs create HW_RAID5_6disks/t1
# zfs create zfs_raid5_6disks/t1
#
# /opt/filebench/bin/sparcv9/filebench
filebench> load varmail
450: 3.175: Varmail Version 1.24 2005/06/22 08:08:30 personality successfully loaded
450: 3.199: Usage: set $dir=
450: 3.199:        set $filesize=      defaults to 16384
450: 3.199:        set $nfiles=        defaults to 1000
450: 3.199:        set $nthreads=      defaults to 16
450: 3.199:        set $meaniosize=    defaults to 16384
450: 3.199:        set $meandirwidth=  defaults to 100
450: 3.199: (sets mean dir width and dir depth is calculated as log (width, nfiles)
450: 3.199:  dirdepth therefore defaults to dir depth of 1 as in postmark
450: 3.199:  set $meandir lower to increase depth beyond 1 if desired)
450: 3.199:
450: 3.199:        run runtime (e.g. run 60)
450: 3.199: syntax error, token expected on line 51
filebench> set $dir=/HW_RAID5_6disks/t1
filebench> run 60
450: 13.320: Fileset bigfileset: 1000 files, avg dir = 100.0, avg depth = 0.5, mbytes=15
450: 13.321: Creating fileset bigfileset...
450: 15.514: Preallocated 812 of 1000 of fileset bigfileset in 3 seconds
450: 15.515: Creating/pre-allocating files
450: 15.515: Starting 1 filereader instances
451: 16.525: Starting 16 filereaderthread threads
450: 19.535: Running...
450: 80.065: Run took 60 seconds...
450: 80.079: Per-Operation Breakdown
closefile4             565 ops/s   0.0 mb/s    0.0 ms/op      8 us/op-cpu
readfile4              565 ops/s   9.2 mb/s    0.1 ms/op     60 us/op-cpu
openfile4              565 ops/s   0.0 mb/s    0.1 ms/op     64 us/op-cpu
closefile3             565 ops/s   0.0 mb/s    0.0 ms/op     11 us/op-cpu
fsyncfile3             565 ops/s   0.0 mb/s   12.9 ms/op    147 us/op-cpu
appendfilerand3        565 ops/s   8.8 mb/s    0.1 ms/op    126 us/op-cpu
readfile3              565 ops/s   9.2 mb/s    0.1 ms/op     60 us/op-cpu
openfile3              565 ops/s   0.0 mb/s    0.1 ms/op     63 us/op-cpu
closefile2
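Robert mentions using DTrace to verify that the larger max-pending setting actually took effect for the HW RAID-5 LUN. He doesn't say which probes he used, so the following io-provider sketch is only an assumption about how such a check could be done - it tracks the peak number of outstanding I/Os per device:

    # report the maximum observed queue depth per device over one minute
    dtrace -n '
    io:::start { pend[args[1]->dev_statname]++;
                 @maxq[args[1]->dev_statname] = max(pend[args[1]->dev_statname]); }
    io:::done  { pend[args[1]->dev_statname]--; }
    tick-60s   { exit(0); }'

If the LUN's entry never rises above the old limit under load, the new setting is not being honored.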