Hi All,

I've just built an 8 disk zfs storage box, and I'm in the testing phase before 
I put it into production. I've run into some unusual results, and I was hoping 
the community could offer some suggestions. I've basically made the switch to 
Solaris on the promise of ZFS alone (yes, I'm that excited about it!), so
naturally I'm looking forward to some great performance - but it appears I'm 
going to need some help finding all of it.

I was having even lower numbers with filebench, so I decided to dial back to a 
really simple app for testing - bonnie.

The system is a nevada_41 EM64T 3GHz Xeon with 1GB of RAM, 8x Seagate SATA II 
300GB disks, and a Supermicro SAT2-MV8 8-port SATA controller running on a 
133MHz 64-bit PCI-X bus.
The bottleneck here, by my thinking, should be the disks themselves.
It's not the disk interfaces (300MB/sec), the disk bus (300MB/sec each), or the 
PCI-X bus (1.1GB/sec), and I'd hope a 64-bit 3GHz CPU would be sufficient.

Tests were run on a fresh, clean zpool on an idle system. Rogue results were 
dropped, and as you can see below, all tests were run more than once. The 8GB 
working set should be far more than the 1GB of RAM that the system has, 
eliminating caching issues.

If I've still managed to overlook something in my testing setup, please let me 
know - I sure did try!
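
For reference, the test procedure for each layout was roughly the following 
(the pool name and device names are just examples, not the exact targets on my 
controller):

  # create the pool under test, e.g. the 8 disk raidz
  zpool create tank raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 \
      c2t4d0 c2t5d0 c2t6d0 c2t7d0
  # 8GB working set, well over the 1GB of RAM in the box
  bonnie -d /tank -s 8196
  # blow the pool away and repeat for the next layout
  zpool destroy tank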

Sorry about the formatting - this is bound to end up ugly

Bonnie
              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
raid0    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
8 disk   8196 78636 93.0 261804 64.2 125585 25.6 72160 95.3 246172 19.1 286.0  2.0
8 disk   8196 79452 93.9 286292 70.2 129163 26.0 72422 95.5 243628 18.9 302.9  2.1

so ~270MB/sec writes - awesome! ~240MB/sec reads though - why would this be 
LOWER than writes??

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
mirror   MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
8 disk   8196 33285 38.6 46033  9.9 33077  6.8 67934 90.4  93445  7.7 230.5  1.3
8 disk   8196 34821 41.4 46136  9.0 32445  6.6 67120 89.1  94403  6.9 210.4  1.8

46MB/sec writes - each disk individually can do better, but I guess keeping 8 
disks in sync is hurting performance. The 94MB/sec reads are interesting. On 
the one hand, that's greater than 1 disk's worth, so I'm getting striping 
performance out of a mirror - GO ZFS. On the other, if I can get striping 
performance from mirrored reads, why is it only 94MB/sec? Seemingly it's not 
CPU bound.


Now for the important test, raid-z

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
raidz      MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
8 disk   8196 61785 70.9 142797 29.3  89342 19.9 64197 85.7 320554 32.6 131.3  1.0
8 disk   8196 62869 72.4 131801 26.7  90692 20.7 63986 85.7 306152 33.4 127.3  1.0
8 disk   8196 63103 72.9 128164 25.9  86175 19.4 64126 85.7 320410 32.7 124.5  0.9
7 disk   8196 51103 58.8  93815 19.1  74093 16.1 64705 86.5 331865 32.8 124.9  1.0
7 disk   8196 49446 56.8  93946 18.7  73092 15.8 64708 86.7 331458 32.7 127.1  1.0
7 disk   8196 49831 57.1  81305 16.2  78101 16.9 64698 86.4 331577 32.7 132.4  1.0
6 disk   8196 62360 72.3 157280 33.4  99511 21.9 65360 87.3 288159 27.1 132.7  0.9
6 disk   8196 63291 72.8 152598 29.1  97085 21.4 65546 87.2 292923 26.7 133.4  0.8
4 disk   8196 57965 67.9 123268 27.6  78712 17.1 66635 89.3 189482 15.9 134.1  0.9

I'm getting distinctly non-linear scaling here.
Writes: 4 disks gives me 123MB/sec. Raid0 was giving me 270/8 = ~33MB/sec per 
disk with CPU to spare (roughly half of what each individual disk should be 
capable of). Here I'm getting 123/4 = ~30MB/sec per disk, or should that be 
123/3 = ~41MB/sec per data disk?
Using 30MB/sec per disk as a baseline, I'd expect to see roughly twice the 
4 disk figure with 8 disks (240ish?). What I end up with is ~135 - clearly not 
good scaling at all.
The really interesting numbers happen at 7 disks - it's slower than with 4, in 
all tests. I ran it 3x to be sure.
Note this was a native 7 disk raid-z, not 8 disks running in degraded mode 
with 7.
Something is really wrong with my write performance here across the board.

Reads: 4 disks gives me 190MB/sec. WOAH! I'm very happy with that. 8 disks 
should scale to 380 then; well, 320 isn't all that far off - no biggie.
Looking at the 6 disk raidz is interesting though: 290MB/sec. The disks are 
good for 60+MB/sec individually, and 290 is only 48/disk - note also that this 
is better than my raid0 performance?!
Adding another 2 disks to my raidz gives me a mere 30MB/sec extra performance? 
Something is going very wrong here too.

The 7 disk raidz read test is about what I'd expect (330/7 = 47/disk), but it 
shows that the 8 disk pool is actually going backwards.

hmm...


I understand that going for an 8 disk wide raidz isn't optimal in terms of 
redundancy and IOPS - but my workload shouldn't involve large amounts of
sustained random IO, so I'm happy to take the loss in favour of absolute 
capacity.
My issue here is the scaling on sequential block transfers, not optimal design.

All three raid levels have had unexpected results, and I'd really appreciate 
some suggestions on how I can troubleshoot this. I know how to run iostat while 
bonnie is running, but that's about it. Incidentally, iostat is telling me that 
the disks are at best hitting around 70% busy (%b). With the 8 disk tests, it 
was often below 50%....
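
In case it helps anyone spot something, my monitoring amounts to no more than 
running something like this in a second terminal while bonnie works (the pool 
name is again just an example):

  # per-device view; %b is the busy figure I quoted above
  iostat -xnz 5
  # per-vdev view from the ZFS side
  zpool iostat -v tank 5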

Is my issue perhaps with the sata card that I'm using? Maybe it's just not 
able to handle that much throughput, despite being advertised to do so. With 
raid0 (aka dynamic stripes), I know that each disk can read at 60-70MB/sec, so 
why am I not getting 65*8 (500MB/sec+) performance? Maybe it's the marvell 
driver at fault here?

My thinking is that I need to get raid0 performing as expected before looking 
at raidz, but I'm afraid I really don't know where to begin.
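
About the only idea I have so far is to take ZFS out of the picture and test 
the controller/driver path on its own, by reading from all 8 raw devices in 
parallel and adding up the throughput - something along these lines (device 
names are placeholders again):

  # sequential reads straight off each raw device, all 8 at once
  for d in c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0 c2t7d0; do
      dd if=/dev/rdsk/${d}s0 of=/dev/null bs=1024k count=2048 &
  done
  wait

If that comes in well short of 500MB/sec in aggregate, I'd point the finger at 
the card or the marvell driver rather than ZFS.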

All thoughts & suggestions welcome. I'm not using the disks yet, so I can blow 
the zpool away as needed.

Many thanks,
Jonathan Wheeler
 
 