Re: [zfs-discuss] Re: 3510 HW RAID vs 3510 JBOD ZFS SOFTWARE RAID
Robert Milkowski wrote:
> ps. However, I'm really concerned about ZFS behavior when a pool is almost full, there are a lot of write transactions to that pool, and the server is restarted forcibly or panics. I observed that the file systems on that pool each take 10-30 minutes to mount during zfs mount -a, with one CPU completely consumed. It happens during system start-up, so the whole boot waits for it - an additional hour of downtime. This was really unexpected for me, and unfortunately no one was really interested in my report - I know people are busy. But if it hits other users once their ZFS pools are already populated, people won't be happy. For more details see my post here with the subject "zfs mount stuck in zil_replay".

That problem must have fallen through the cracks. Yes, we are busy, but we really do care about your experiences and bugs. I have just raised a bug to cover this issue:

    6460107 Extremely slow mounts after panic - searching space maps during replay

Thanks for reporting this and helping make ZFS better.

Neil
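For anyone who hits this before the fix, one way to confirm that the mounts are burning CPU in ZIL replay / space-map code is to profile kernel stacks while zfs mount -a is stuck. This is only a sketch using the standard DTrace profile provider; the assumption is that the hot stacks will show the replay and space-map functions described in 6460107.

    # sample on-CPU kernel stacks for 30 seconds while the mount hangs
    dtrace -n 'profile-997 /arg0/ { @[stack()] = count(); }
               tick-30s { trunc(@, 20); exit(0); }'

The 20 most frequent kernel stacks are printed when the script exits.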
Re: [zfs-discuss] Re: 3510 HW RAID vs 3510 JBOD ZFS SOFTWARE RAID
> The test case was build 38, Solaris 11, a 2 GB file, initially created with 1 MB SW, and a recsize of 8 KB, on a pool with two raid-z 5+1 vdevs, accessed with 24 threads of 8 KB RW, for 500,000 ops or 40 seconds, whichever came first. The result at the pool level was that 78% of the operations were RR - all overhead.

Hi David,

Could this bug (now fixed) have hit you?

    6424554 full block re-writes need not read data in
Re: [zfs-discuss] Re: 3510 HW RAID vs 3510 JBOD ZFS SOFTWARE RAID
Hello Dave,

Thursday, August 10, 2006, 12:29:05 AM, you wrote:

DF> Hi,
DF> Note that these are page cache rates and that if the application
DF> pushes harder and exposes the supporting device rates there is
DF> another world of performance to be observed. This is where ZFS
DF> gets to be a challenge, as the relationship between the application-
DF> level I/O and the pool level is very hard to predict. For example,
DF> the COW may or may not have to read old data for a small I/O
DF> update operation, and a large portion of the pool vdev capability
DF> can be spent on this kind of overhead. Also, on read, if the
DF> pattern is random, you may or may not receive any benefit from the
DF> 32 KB to 128 KB reads on each disk of the pool vdev on behalf of a
DF> small read, say 8 KB by the application - again, lots of overhead
DF> potential. I am not complaining, ZFS is great, I'm a fan, but you
DF> definitely have your work cut out for you if you want to predict
DF> its ability to scale for any given workload.

I know, you have valid concerns. However, in the tests I performed ZFS behaved better than UFS, and that was most important for me. Does that mean it will perform better than UFS in production? Well, I don't know - but thanks to these tests (and some others I haven't posted) I'm more confident that it's unlikely to behave worse. And that is only the performance point of view; there are other considerations which are also important.

ps. However, I'm really concerned about ZFS behavior when a pool is almost full, there are a lot of write transactions to that pool, and the server is restarted forcibly or panics. I observed that the file systems on that pool each take 10-30 minutes to mount during zfs mount -a, with one CPU completely consumed. It happens during system start-up, so the whole boot waits for it - an additional hour of downtime. This was really unexpected for me, and unfortunately no one was really interested in my report - I know people are busy. But if it hits other users once their ZFS pools are already populated, people won't be happy. For more details see my post here with the subject "zfs mount stuck in zil_replay".

--
Best regards,
 Robert                            mailto:[EMAIL PROTECTED]
                                   http://milek.blogspot.com
Re: [zfs-discuss] Re: 3510 HW RAID vs 3510 JBOD ZFS SOFTWARE RAID
Hi Matthew,

In the case of the 8 KB Random Write to the 128 KB recsize filesystem, the I/Os were not full-block re-writes, yet the expected COW Random Read (RR) at the pool level was somehow avoided. I suspect it was able to coalesce enough I/O in the 5-second transaction window to construct full 128 KB blocks. This was, after all, 24 threads of I/O to a 2 GB file at a rate of 140,000 IOPS. However, when using the 8 KB recsize it was not able to do this.

I will check to see if it's fixed in b45. Thanks!

Dave

The 8 KB update to a 128 KB block, however, did not have much Random Read (RR) at the pool level. The 8 KB RW to the 8 KB recsize filesystem is where I generally observed RR at the pool level. RR is Random Read, RW is Random Write...

Dave

Matthew Ahrens wrote:
> On Wed, Aug 09, 2006 at 04:24:55PM -0700, Dave C. Fisk wrote:
> > Hi Eric,
> >
> > Thanks for the information.
> >
> > I am aware of the recsize option and its intended use. However, when I
> > was exploring it to confirm the expected behavior, what I found was the
> > opposite!
> >
> > The test case was build 38, Solaris 11, a 2 GB file, initially created
> > with 1 MB SW, and a recsize of 8 KB, on a pool with two raid-z 5+1,
> > accessed with 24 threads of 8 KB RW, for 500,000 ops or 40 seconds
> > whichever came first. The result at the pool level was 78% of the
> > operations were RR, all overhead. For the same test, with a 128 KB
> > recsize (the default), the pool access was pure SW, beautiful.
>
> I'm not sure what RR means, but you should re-try your tests on build 42
> or later. Earlier builds have bug 6424554 "full block re-writes need not
> read data in", which will cause a lot more data to be read than is
> necessary when overwriting entire blocks.
>
> --matt

--
Dave Fisk, ORtera Inc.
Phone (562) 433-7078
[EMAIL PROTECTED]
http://www.ORtera.com
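A rough back-of-envelope check of the coalescing explanation, using only the figures quoted above (an illustration, not a new measurement):

    140,000 ops/s x 8 KB x ~5 s txg window  ~= 5.5 GB dirtied per transaction group
    file size                                =  2 GB

So within one transaction group essentially every 128 KB record of the file is overwritten in full, and ZFS can write the new block without reading the old data back in - which would explain the absence of COW reads in the 128 KB recsize case.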
Re: [zfs-discuss] Re: 3510 HW RAID vs 3510 JBOD ZFS SOFTWARE RAID
On Wed, Aug 09, 2006 at 04:24:55PM -0700, Dave C. Fisk wrote:
> Hi Eric,
>
> Thanks for the information.
>
> I am aware of the recsize option and its intended use. However, when I
> was exploring it to confirm the expected behavior, what I found was the
> opposite!
>
> The test case was build 38, Solaris 11, a 2 GB file, initially created
> with 1 MB SW, and a recsize of 8 KB, on a pool with two raid-z 5+1,
> accessed with 24 threads of 8 KB RW, for 500,000 ops or 40 seconds which
> ever came first. The result at the pool level was 78% of the operations
> were RR, all overhead. For the same test, with a 128 KB recsize (the
> default), the pool access was pure SW, beautiful.

I'm not sure what RR means, but you should re-try your tests on build 42 or later. Earlier builds have bug 6424554 "full block re-writes need not read data in" which will cause a lot more data to be read than is necessary, when overwriting entire blocks.

--matt
Re: [zfs-discuss] Re: 3510 HW RAID vs 3510 JBOD ZFS SOFTWARE RAID
Hi Eric,

Thanks for the information.

I am aware of the recsize option and its intended use. However, when I was exploring it to confirm the expected behavior, what I found was the opposite!

The test case was build 38, Solaris 11, a 2 GB file, initially created with 1 MB SW, and a recsize of 8 KB, on a pool with two raid-z 5+1 vdevs, accessed with 24 threads of 8 KB RW, for 500,000 ops or 40 seconds, whichever came first. The result at the pool level was that 78% of the operations were RR - all overhead. For the same test with a 128 KB recsize (the default), the pool access was pure SW, beautiful.

I ran this test 5 times. The results with an 8 KB recsize were consistent; however, ONE of the 128 KB recsize tests did have 62% RR at the pool level - not exactly a confidence builder for predictability.

As I understand it, the striping logic is separate from the on-disk format and can be changed in the future, so I would suggest a variant of raid-z (raid-z+) that would have a variable stripe width instead of a variable stripe unit. The worst case would be 1+1, but you would generally do better than mirroring in terms of the number of drives used for protection, and you could avoid dividing an 8 KB I/O over say 5, 10 or (god forbid) 47 drives. It would be much less overhead - something like 200 to 1 in one analysis, if I recall correctly - and hence much better performance.

I will be happy to post ORtera summary reports for a pair of these tests if you would like to see the numbers. However, the forum would be the better place to post the reports.

Regards,
Dave

Eric Schrock wrote:
> On Wed, Aug 09, 2006 at 03:29:05PM -0700, Dave Fisk wrote:
> > For example the COW may or may not have to read old data for a small
> > I/O update operation, and a large portion of the pool vdev capability
> > can be spent on this kind of overhead.
>
> This is what the 'recordsize' property is for. If you have a workload
> that works on large files in very small sized chunks, setting the
> recordsize before creating the files will result in a big improvement.
>
> > Also, on read, if the pattern is random, you may or may not receive
> > any benefit from the 32 KB to 128 KB reads on each disk of the pool
> > vdev on behalf of a small read, say 8 KB by the application, again
> > lots of overhead potential.
>
> We're evaluating the tradeoffs on this one. The original vdev cache has
> been around forever, and hasn't really been reevaluated in the context
> of the latest improvements. See:
>
>   6437054 vdev_cache: wise up or die
>
> The DMU-level prefetch code had to undergo a similar overhaul, and was
> fixed up in build 45.
>
> - Eric
>
> --
> Eric Schrock, Solaris Kernel Development
> http://blogs.sun.com/eschrock

--
Dave Fisk, ORtera Inc.
Phone (562) 433-7078
[EMAIL PROTECTED]
http://www.ORtera.com
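To put a rough number on the read amplification behind the raid-z+ suggestion, here is a sketch of how a small block lands on a 5+1 raid-z vdev (this assumes the usual raid-z layout, where each block is split across the data columns; sector padding is ignored):

    8 KB block / 5 data disks  ~= 1.6 KB per data disk (+ one parity chunk on write)

    => a random 8 KB application read keeps ~5 spindles busy for ~1.6 KB each,
       instead of 1 spindle for 8 KB, so per-disk IOPS are consumed roughly
       5x faster than the application-level read rate.

A variable stripe width (in the worst case 1 data + 1 parity column, as Dave notes) would trade some capacity efficiency for far fewer spindles touched per small I/O.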
Re: [zfs-discuss] Re: 3510 HW RAID vs 3510 JBOD ZFS SOFTWARE RAID
On Wed, Aug 09, 2006 at 03:29:05PM -0700, Dave Fisk wrote:
>
> For example the COW may or may not have to read old data for a small
> I/O update operation, and a large portion of the pool vdev capability
> can be spent on this kind of overhead.

This is what the 'recordsize' property is for. If you have a workload that works on large files in very small sized chunks, setting the recordsize before creating the files will result in a big improvement.

> Also, on read, if the pattern is random, you may or may not
> receive any benefit from the 32 KB to 128 KB reads on each disk of the
> pool vdev on behalf of a small read, say 8 KB by the application,
> again lots of overhead potential.

We're evaluating the tradeoffs on this one. The original vdev cache has been around forever, and hasn't really been reevaluated in the context of the latest improvements. See:

  6437054 vdev_cache: wise up or die

The DMU-level prefetch code had to undergo a similar overhaul, and was fixed up in build 45.

- Eric

--
Eric Schrock, Solaris Kernel Development
http://blogs.sun.com/eschrock
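As a concrete illustration of the recordsize advice (the dataset names here are made up; the key point is that recordsize only affects files created after the property is set):

    # match the record size to the application's 8 KB I/O size
    # *before* the data files are created
    zfs create tank/smallio
    zfs set recordsize=8K tank/smallio
    zfs get recordsize tank/smallio

Files that already exist keep the block size they were written with, so for an existing data set the files need to be recreated (e.g. copied) after the change.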
Re: [zfs-discuss] Re: 3510 HW RAID vs 3510 JBOD ZFS SOFTWARE RAID
On Tue, Aug 08, 2006 at 06:11:09PM +0200, Robert Milkowski wrote:
> filebench/singlestreamread v440
>
> 1. UFS, noatime, HW RAID5 6 disks, S10U2
>    70MB/s
>
> 2. ZFS, atime=off, HW RAID5 6 disks, S10U2 (the same lun as in #1)
>    87MB/s
>
> 3. ZFS, atime=off, SW RAID-Z 6 disks, S10U2
>    130MB/s
>
> 4. ZFS, atime=off, SW RAID-Z 6 disks, snv_44
>    133MB/s

FYI, streaming read performance is improved considerably by Mark's prefetch fixes, which are in build 45. (However, as mentioned, you will soon run into the bandwidth of a single fibre channel connection.)

--matt
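For a sense of where that fibre-channel ceiling sits (assuming the 3510's host ports are 2 Gb/s FC, which is what that array shipped with):

    2 Gbit/s x 8/10 (8b/10b encoding) / 8 bits per byte  ~= 200 MB/s per link

so the 130-133 MB/s raid-z numbers above are already using roughly two-thirds of a single link.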
Re: [zfs-discuss] Re: 3510 HW RAID vs 3510 JBOD ZFS SOFTWARE RAID
Luke Lonergan wrote:
> Robert,
>
> On 8/8/06 9:11 AM, "Robert Milkowski" <[EMAIL PROTECTED]> wrote:
>
> > 1. UFS, noatime, HW RAID5 6 disks, S10U2
> >    70MB/s
> >
> > 2. ZFS, atime=off, HW RAID5 6 disks, S10U2 (the same lun as in #1)
> >    87MB/s
> >
> > 3. ZFS, atime=off, SW RAID-Z 6 disks, S10U2
> >    130MB/s
> >
> > 4. ZFS, atime=off, SW RAID-Z 6 disks, snv_44
> >    133MB/s
>
> Well, the UFS results are miserable, but the ZFS results aren't good -
> I'd expect between 250-350MB/s from a 6-disk RAID5 with read() blocksize
> from 8kb to 32kb.
>
> Most of my ZFS experiments have been with RAID10, but there were some
> massive improvements to seq I/O with the fixes I mentioned - I'd expect
> that this shows that they aren't in snv44.

Those fixes went into snv_45.

-Mark
RE: [zfs-discuss] Re: 3510 HW RAID vs 3510 JBOD ZFS SOFTWARE RAID
Does snv44 have the ZFS fixes to the I/O scheduler, the ARC and the prefetch logic?

These are great results for random I/O; I wonder how the sequential I/O looks. Of course you'll not get great results for sequential I/O on the 3510 :-)

- Luke

Sent from my GoodLink synchronized handheld (www.good.com)

-----Original Message-----
From: Robert Milkowski [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 08, 2006 10:15 AM Eastern Standard Time
To: zfs-discuss@opensolaris.org
Subject: [zfs-discuss] Re: 3510 HW RAID vs 3510 JBOD ZFS SOFTWARE RAID

Hi.

This time some RAID5/RAID-Z benchmarks. I connected the 3510 head unit with one link to the same server the 3510 JBODs are connected to (using a second link). snv_44 is used; the server is a v440.

I also tried changing the maximum number of pending IO requests for the HW RAID-5 LUN and checked with DTrace that the larger value is really used - it is, but it doesn't change the benchmark numbers.

1. ZFS on HW RAID5 with 6 disks, atime=off

   IO Summary: 444386 ops 7341.7 ops/s, (1129/1130 r/w) 36.1mb/s, 297us cpu/op, 6.6ms latency
   IO Summary: 438649 ops 7247.0 ops/s, (1115/1115 r/w) 35.5mb/s, 293us cpu/op, 6.7ms latency

2. ZFS with software RAID-Z with 6 disks, atime=off

   IO Summary: 457505 ops 7567.3 ops/s, (1164/1164 r/w) 37.2mb/s, 340us cpu/op, 6.4ms latency
   IO Summary: 457767 ops 7567.8 ops/s, (1164/1165 r/w) 36.9mb/s, 340us cpu/op, 6.4ms latency

3. UFS on HW RAID5 with 6 disks, noatime

   IO Summary: 62776 ops 1037.3 ops/s, (160/160 r/w) 5.5mb/s, 481us cpu/op, 49.7ms latency
   IO Summary: 63661 ops 1051.6 ops/s, (162/162 r/w) 5.4mb/s, 477us cpu/op, 49.1ms latency

4. UFS on HW RAID5 with 6 disks, noatime, S10U2 + patches (the same filesystem mounted as in #3)

   IO Summary: 393167 ops 6503.1 ops/s, (1000/1001 r/w) 32.4mb/s, 405us cpu/op, 7.5ms latency
   IO Summary: 394525 ops 6521.2 ops/s, (1003/1003 r/w) 32.0mb/s, 407us cpu/op, 7.7ms latency

5. ZFS with software RAID-Z with 6 disks, atime=off, S10U2 + patches (the same disks as in test #2)

   IO Summary: 461708 ops 7635.5 ops/s, (1175/1175 r/w) 37.4mb/s, 330us cpu/op, 6.4ms latency
   IO Summary: 457649 ops 7562.1 ops/s, (1163/1164 r/w) 37.0mb/s, 328us cpu/op, 6.5ms latency

In this benchmark, software RAID-5 with ZFS (raid-z, to be precise) gives slightly better performance than hardware RAID-5. ZFS is also faster than UFS on HW raid in both cases (HW and SW raid). Something is wrong with UFS on snv_44 - the same UFS filesystem on S10U2 works as expected. ZFS on S10U2 gives the same results in this benchmark as on snv_44.

Details:

// c2t43d0 is a HW raid5 made of 6 disks
// array is configured for random IO's
# zpool create HW_RAID5_6disks c2t43d0
#
# zpool create -f zfs_raid5_6disks raidz c3t16d0 c3t17d0 c3t18d0 c3t19d0 c3t20d0 c3t21d0
#
# zfs set atime=off zfs_raid5_6disks HW_RAID5_6disks
#
# zfs create HW_RAID5_6disks/t1
# zfs create zfs_raid5_6disks/t1
#
# /opt/filebench/bin/sparcv9/filebench
filebench> load varmail
450: 3.175: Varmail Version 1.24 2005/06/22 08:08:30 personality successfully loaded
450: 3.199: Usage: set $dir=
450: 3.199:        set $filesize=      defaults to 16384
450: 3.199:        set $nfiles=        defaults to 1000
450: 3.199:        set $nthreads=      defaults to 16
450: 3.199:        set $meaniosize=    defaults to 16384
450: 3.199:        set $meandirwidth=  defaults to 100
450: 3.199: (sets mean dir width and dir depth is calculated as log (width, nfiles)
450: 3.199:  dirdepth therefore defaults to dir depth of 1 as in postmark
450: 3.199:  set $meandir lower to increase depth beyond 1 if desired)
450: 3.199:
450: 3.199:        run runtime (e.g. run 60)
450: 3.199: syntax error, token expected on line 51
filebench> set $dir=/HW_RAID5_6disks/t1
filebench> run 60
450: 13.320: Fileset bigfileset: 1000 files, avg dir = 100.0, avg depth = 0.5, mbytes=15
450: 13.321: Creating fileset bigfileset...
450: 15.514: Preallocated 812 of 1000 of fileset bigfileset in 3 seconds
450: 15.515: Creating/pre-allocating files
450: 15.515: Starting 1 filereader instances
451: 16.525: Starting 16 filereaderthread threads
450: 19.535: Running...
450: 80.065: Run took 60 seconds...
450: 80.079: Per-Operation Breakdown
closefile4             565 ops/s   0.0 mb/s    0.0 ms/op      8 us/op-cpu
readfile4              565 ops/s   9.2 mb/s    0.1 ms/op     60 us/op-cpu
openfile4              565 ops/s   0.0 mb/s    0.1 ms/op     64 us/op-cpu
closefile3             565 ops/s   0.0 mb/s    0.0 ms/op     11 us/op-cpu
fsyncfile3             565 ops/s   0.0 mb/s   12.9 ms/op    147 us/op-cpu
appendfilerand3        565 ops/s   8.8 mb/s    0.1 ms/op    126 us/op-cpu
readfile3              565 ops/s   9.2 mb/s    0.1 ms/op     60 us/op-cpu
openfile3              565 ops/s   0.0 mb/s    0.1 ms/op     63 us/op-cpu
closefile2
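Robert mentions using DTrace to verify that the larger max-pending setting actually took effect for the HW RAID-5 LUN. He doesn't say which probes he used, so the following io-provider sketch is only an assumption about how such a check could be done - it tracks the peak number of outstanding I/Os per device:

    # report the maximum observed queue depth per device over one minute
    dtrace -n '
    io:::start { pend[args[1]->dev_statname]++;
                 @maxq[args[1]->dev_statname] = max(pend[args[1]->dev_statname]); }
    io:::done  { pend[args[1]->dev_statname]--; }
    tick-60s   { exit(0); }'

If the LUN's entry never rises above the old limit under load, the new setting is not being honored.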