Re: [zfs-discuss] Performance issues with iSCSI under Linux
Ian,

It would help to have some config detail (e.g. what options are you using? zpool status output; property lists for specific filesystems and zvols; etc.). Some basic Solaris stats can be very helpful too (e.g. peak flow samples of vmstat 1, mpstat 1, iostat -xnz 1, etc.). It would also be great to know how you are running your tests. I'd also like to know which NFS version and mount options you're using. A network trace down to the NFS RPC or iSCSI operation level, with timings, would be great too.

I'm wondering whether your HBA has a write-through or write-back cache enabled? The latter might make things very fast, but could put data at risk if not sufficiently non-volatile.

Cheers, Phil

On 14 Oct 2010, at 22:02, Ian D rewar...@hotmail.com wrote:

> > Our next test is to try with a different kind of HBA, we have a Dell H800 lying around.
>
> ok... we're making progress. After swapping the LSI HBA for a Dell H800, the issue disappeared. Now, I'd rather not use those controllers because they don't have a JBOD mode. We have no choice but to make individual RAID0 volumes for each disk, which means we need to reboot the server every time we replace a failed drive. That's not good...
>
> What can we do with the LSI HBA? Would you call LSI's support? Is there anything we should try besides the obvious (using the latest firmware/driver)?
>
> To sum up the issue: when we copy files to/from the JBODs connected to that HBA using NFS/iSCSI, we get slow transfer rates (~20MB/s) and a 1-2 second pause between each file. When we do the same experiment locally, using the external drives as a local volume (no NFS/iSCSI involved), it goes upward of 350MB/s with no delay between files.
>
> Ian
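For reference, a minimal way to capture the kind of snapshot Phil is asking for while a test runs (the pool name tank and the 30-second sample windows are illustrative, not from the thread):

    # config detail
    zpool status -v
    zfs get all tank | egrep 'compression|recordsize|sync|logbias'
    # basic system stats, sampled once per second during the test
    vmstat 1 30 > /tmp/vmstat.out &
    mpstat 1 30 > /tmp/mpstat.out &
    iostat -xnz 1 30 > /tmp/iostat.out &
    wait

Capturing these while the slow NFS/iSCSI copy is actually running (rather than at idle) is what makes the peak-flow samples meaningful.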
Re: [zfs-discuss] Performance issues with iSCSI under Linux
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Phil Harman
>
> I'm wondering whether your HBA has a write-through or write-back cache enabled? The latter might make things very fast, but could put data at risk if not sufficiently non-volatile.

He already said he has SSDs for a dedicated log. This means the best solution is to disable WriteBack and just use WriteThrough. Not only is it more reliable than WriteBack, it's faster. And I know I've said this many times before, but I don't mind repeating: if you have slog devices, then, surprisingly, it actually hurts performance to enable WriteBack on the HBA. Think of it like this:

    Speed of a naked disk:               1.0
    Speed of a disk with WriteBack:      2.2
    Speed of a disk with slog and WB:    2.8
    Speed of a disk with slog and no WB: 3.0

Of course those are really rough numbers that vary by architecture and usage patterns. But you get the idea. The consistent result is that a disk with slog is fastest, with WB disabled.
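For anyone wanting to act on this advice, a sketch of the two steps involved (the controller command assumes an LSI MegaRAID-style card with the MegaCli tool, and the pool/device names are placeholders; check your HBA's documentation for the exact syntax):

    # force all logical drives on the controller to write-through
    MegaCli -LDSetProp WT -LALL -aALL
    # give the pool a dedicated slog on an SSD
    zpool add tank log c4t0d0

Note that Ian's 9200-16e is a plain SAS HBA with no cache at all (as he points out later in the thread), so the WT/WB setting only applies to caching cards like the Dell H800.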
Re: [zfs-discuss] Finding corrupted files
On 14.10.10 17:48, Edward Ned Harvey wrote:
> > From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Toby Thain
> > > I don't want to heat up the discussion about ZFS managed discs vs. HW raids, but if RAID5/6 would be that bad, no one would use it anymore.
> >
> > It is. And there's no reason not to point it out. The world has [...]
>
> Well, neither one of the above statements is really fair. The truth is: raid5/6 are generally not that bad. Data integrity failures are not terribly common (maybe one bit per year out of 20 large disks, or something like that). And in order to reach the conclusion that nobody would use it, the people using it would have to first *notice* the failure. Which they don't. That's kind of the point. Since I started using ZFS in production, about a year ago, on three servers totaling approx 1.5TB used, I have had precisely one checksum error, which ZFS corrected. I have every reason to believe that if that were on a raid5/6, the error would have gone undetected and nobody would have noticed.

Point taken! So, what would you suggest if I wanted to create really big pools? Say, in the 100 TB range? That would be quite a number of single drives, especially when you want to go with mirrors (raid-1).

Cheers, budy

--
Stephan Budach
Jung von Matt/it-services GmbH
Glashüttenstraße 79
20357 Hamburg
Tel: +49 40-4321-1353
Fax: +49 40-4321-1114
E-Mail: stephan.bud...@jvm.de
Internet: http://www.jvm.com
Geschäftsführer: Ulrich Pallas, Frank Wilhelm
AG HH HRB 98380
[zfs-discuss] ZFS cache inconsistencies with Oracle
A customer is running ZFS version 15 on Solaris 10 SPARC (10/08) supporting Oracle 10.2.0.3 databases in a dev and production test environment. We have come across some cache inconsistencies with one of the Oracle databases, where fetching a record displays a 'historical' value (one that has been changed and committed many times). This is an isolated occurrence and is not always consistent; I can't replicate it on other tables. I'll also be posting a note to the ZFS discussion list.

Is it possible for a read to bypass the write cache and fetch from disk before the flush of the cache to disk occurs?

This is a large system that is infrequently busy. The Oracle SGA size is minimized to 1GB per instance and we rely more on the ZFS cache, allowing us to fit 'more instances' (many of which are cloned snapshots). We've been running this setup for 2 years. The filesystems are set with compression on, blocksize 8k for Oracle datafiles, 128k for redo logs.

Here are the details of the scenario:

1. Update statement re-setting an existing value. The value was actually -643 prior to the update; it was originally set to 3 before today's session:

    SQL> update [name deleted] set status_cd = 1 where id = 65;
    1 row updated.
    SQL> commit;
    Commit complete.
    SQL> select rowid, id, status_cd from [table name deleted]
    SQL> where id = 65;

    ROWID           ID  STATUS_CD
    --------------  --  ---------
    AAAq/DAAERlAAM  65          3

Note that when retrieved, the status_cd reverts to the old original value of 3, not the previous value of -643.

2. The Oracle trace file proves that the update was issued and committed:

    =
    PARSING IN CURSOR #1 len=70 dep=0 uid=110 oct=6 lid=110 tim=17554807027344 hv=3512595279 ad='fd211878'
    update [table deleted] set status_cd = 1 where id = 65
    END OF STMT
    PARSE #1:c=0,e=54,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=2,tim=17554807027340
    BINDS #1:
    EXEC #1:c=0,e=257,p=0,cr=3,cu=3,mis=0,r=1,dep=0,og=2,tim=17554807027737
    WAIT #1: nam='SQL*Net message to client' ela= 2 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=17554807027803
    WAIT #1: nam='SQL*Net message from client' ela= 2999139 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=17554810026992
    STAT #1 id=1 cnt=1 pid=0 pos=1 obj=0 op='UPDATE [TABLE DELETED] (cr=3 pr=0 pw=0 time=144 us)'
    STAT #1 id=2 cnt=1 pid=1 pos=1 obj=177738 op='INDEX UNIQUE SCAN [TABLE_DELETED]_XPK (cr=3 pr=0 pw=0 time=19 us)'
    PARSE #2:c=0,e=9,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=0,tim=17554810027367
    XCTEND rlbk=0, rd_only=0
    EXEC #2:c=0,e=226,p=0,cr=0,cu=1,mis=0,r=0,dep=0,og=0,tim=17554810027630
    WAIT #2: nam='log file sync' ela= 833 buffer#=9408 p2=0 p3=0 obj#=-1 tim=17554810028507
    WAIT #2: nam='SQL*Net message to client' ela= 2 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=17554810028578
    WAIT #2: nam='SQL*Net message from client' ela= 1825185 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=17554811853812
    =
    PARSING IN CURSOR #1 len=67 dep=0 uid=110 oct=3 lid=110 tim=17554811854015 hv=1593702413 ad='fd713640'
    select status_cd from [table_deleted] where id = 65
    END OF STMT
    PARSE #1:c=0,e=41,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=2,tim=17554811854010
    BINDS #1:
    EXEC #1:c=0,e=91,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=2,tim=17554811854273
    WAIT #1: nam='SQL*Net message to client' ela= 1 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=17554811854327
    FETCH #1:c=0,e=64,p=0,cr=4,cu=0,mis=0,r=1,dep=0,og=2,tim=17554811854436
    WAIT #1: nam='SQL*Net message from client' ela= 780 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=17554811855291
    FETCH #1:c=0,e=0,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=0,tim=17554811855331
    WAIT #1: nam='SQL*Net message to client' ela= 0 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=17554811855366

There are no Oracle or Solaris error messages indicating any issue with this update. Has anyone seen this behavior? The features of ZFS (snapshots/clones/compression) save us a ton of time on this platform and we have certainly benefited from them. I just want to understand how something like this could occur and determine how we can prevent it in the future.

==
Gerry Bragg
Sr. Developer
Altarum Institute
(734) 516-0825
gerry.br...@altarum.org
www.altarum.org
Systems Research For Better Health
Re: [zfs-discuss] ZFS cache inconsistencies with Oracle
Hi,

So, to be absolutely clear: in the same session, you ran an update, a commit and a select, and the select returned an earlier value than the committed update?

Things like ALTER SESSION SET ISOLATION_LEVEL = SERIALIZABLE; will cause a session to NOT see commits from other sessions, but in Oracle one always sees one's own updates within one's own transaction (assuming no other session makes a change, of course). So are you sure that:

1. some other session hasn't mucked with the value between the commit and the select in your session?
2. some DB trigger isn't doing this, i.e. setting some default value?

In my experience with DBs, triggers are the root of all evil.

Enda

On 15/10/2010 14:36, Gerry Bragg wrote:
> A customer is running ZFS version 15 on Solaris 10 SPARC (10/08) supporting Oracle 10.2.0.3 databases in a dev and production test environment. We have come across some cache inconsistencies with one of the Oracle databases, where fetching a record displays a 'historical' value (one that has been changed and committed many times). [...]
> [... rest of the original message and trace snipped; quoted in full above ...]
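A quick way to test Enda's second theory (the owner and table names are placeholders, and this assumes access to the DBA_TRIGGERS dictionary view):

    sqlplus -s "/ as sysdba" <<'EOF'
    SELECT trigger_name, status, triggering_event
      FROM dba_triggers
     WHERE table_owner = 'APP_OWNER'
       AND table_name  = 'TABLE_IN_QUESTION';
    EOF

An empty result rules out triggers on the table; otherwise the trigger body (the TRIGGER_BODY column of DBA_TRIGGERS) would be the next place to look.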
Re: [zfs-discuss] Performance issues with iSCSI under Linux
As I have mentioned already, we have the same performance issues whether we READ from or WRITE to the array; shouldn't that rule out caching issues? Also, we can get great performance with the LSI HBA if we use the JBODs as a local file system. The issues only arise when it is done through iSCSI or NFS. I'm opening tickets with LSI to see if they can help. Thanks all! Ian
Re: [zfs-discuss] Performance issues with iSCSI under Linux
> He already said he has SSDs for a dedicated log. This means the best solution is to disable WriteBack and just use WriteThrough. Not only is it more reliable than WriteBack, it's faster. And I know I've said this many times before, but I don't mind repeating: if you have slog devices, then, surprisingly, it actually hurts performance to enable WriteBack on the HBA.

The HBA that gives us problems is an LSI 9200-16e, which has no cache whatsoever. We do get great performance with a Dell H800 that has cache. We'll use H800s if we have to, but I really would like to find a way to make the LSIs work. Thanks!
[zfs-discuss] Available Space Discrepancy
I'm using snv_111b, and yesterday both the Mac OS X Finder and the Solaris File Browser started reporting that I had 0 space available on the SMB shares. Earlier in the day I had copied some files from the Mac to the SMB shares with no problems reported by the Mac (Automator will report errors if the destination is full and it is unable to copy the remaining files). Later I tried to move a folder from one share to another share, and the Mac Finder crashed and restarted. I tried it again, and after the Finder counted the number of files it was going to move, it reported that there wasn't enough space available when there should have been. Now, I know I did at least one thing I had not intended: dragging from one share to another will not MOVE, but will instead COPY. That was not my intention.

I have 5 shares on the pool (data, movies, music, photos, scans) and zfs list reports:

    NAME      USED   AVAIL
    mediaz1   4.00T      0
    data       760K      0
    movies    2.57T      0
    music      874G      0
    photos     360G      0
    scans      235G      0

zpool list reports:

    NAME      SIZE   USED   AVAIL
    mediaz1   5.44T  5.35T  86.7G

and zpool iostat reports:

                  capacity     operations    bandwidth
    pool        used  avail   read  write   read  write
    mediaz1    5.35T  86.7G    248      2  30.1M  10.4K

There should be about 86G free, and that sounds about right, but I don't understand why the GUI Finder and File Browser report 0, as does zfs list. And how do I correct this (or myself)?

David

BTW, I DID search the forums and Google and did not find a solution.
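One plausible explanation, offered here as an assumption rather than a diagnosis from the thread: if mediaz1 is a raidz pool, zpool list counts raw space including parity while zfs list shows usable space after parity and reserved slop, so a nearly-full pool can show tens of GB "free" at the zpool level and 0 available at the zfs level. The space accounting can be broken out per dataset with something like:

    # where the space went: snapshots, the datasets themselves, reservations
    zfs get -r usedbysnapshots,usedbydataset,usedbyrefreservation mediaz1
    # the vdev layout, which explains raw-vs-usable discrepancies
    zpool status mediaz1

(The usedby* properties require a reasonably recent pool version.)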
Re: [zfs-discuss] Performance issues with iSCSI under Linux
> I've had a few people sending emails directly suggesting it might have something to do with the ZIL/slog. I guess I should have said that the issue happens both ways, whether we copy TO or FROM the Nexenta box.

You mentioned a second Nexenta box earlier. To rule out client-side issues, have you considered testing with Nexenta as the iSCSI/NFS client?
Re: [zfs-discuss] Performance issues with iSCSI under Linux
As I have mentioned already, it would be useful to know more about the config, how the tests are being done, and to see some basic system performance stats.

On 15/10/2010 15:58, Ian D wrote:
> As I have mentioned already, we have the same performance issues whether we READ from or WRITE to the array; shouldn't that rule out caching issues? Also, we can get great performance with the LSI HBA if we use the JBODs as a local file system. The issues only arise when it is done through iSCSI or NFS. I'm opening tickets with LSI to see if they can help. Thanks all! Ian
Re: [zfs-discuss] adding new disks and setting up a raidz2
Derek,

The c0t5000C500268CFA6Bd0 disk has some kind of label problem. You might compare the label of this disk to the labels of the other disks. I agree with Richard that using whole disks (use the d0 device) is best. You could also relabel it manually by running format, selecting the fdisk option, deleting the current partition, creating a new partition using the EFI option, and saving the configuration.

Thanks, Cindy

On 10/14/10 21:21, Derek G Nokes wrote:
> Thank you both. I did try without specifying the 's0' portion before posting and got the following error:
>
>     r...@dnokes.homeip.net:~# zpool create marketData raidz2 c0t5000C5001A6B9C5Ed0 c0t5000C5001A81E100d0 c0t5000C500268C0576d0 c0t5000C500268C5414d0 c0t5000C500268CFA6Bd0 c0t5000C500268D0821d0
>     cannot label 'c0t5000C500268CFA6Bd0': try using fdisk(1M) and then provide a specific slice
>
> Any idea what this means? Thanks again.
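A quick way to do the label comparison Cindy suggests (the device names come from Derek's pool; which slice prtvtoc wants can vary with the label type, so treat the s0 suffix as an assumption):

    # dump the label of the problem disk and a known-good one, then eyeball the difference
    prtvtoc /dev/rdsk/c0t5000C500268CFA6Bd0s0
    prtvtoc /dev/rdsk/c0t5000C5001A6B9C5Ed0s0

    # interactive relabel; expert mode allows choosing the label type
    format -e
    # -> select the disk, run fdisk to delete and recreate the partition,
    #    then label it, choosing the EFI option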
Re: [zfs-discuss] Performance issues with iSCSI under Linux
> You mentioned a second Nexenta box earlier. To rule out client-side issues, have you considered testing with Nexenta as the iSCSI/NFS client?

If you mean running the NFS client AND server on the same box, then yes, and it doesn't show the same performance issues. It's only when a Linux box sends/receives data to/from the NFS/iSCSI shares that we have problems. But if the Linux box sends/receives files through scp to the external disks mounted by the Nexenta box as a local filesystem, then there is no problem. Ian
Re: [zfs-discuss] Performance issues with iSCSI under Linux
> As I have mentioned already, it would be useful to know more about the config, how the tests are being done, and to see some basic system performance stats.

I will shortly. Thanks!
Re: [zfs-discuss] Performance issues with iSCSI under Linux
On 15/10/2010 19:09, Ian D wrote:
> It's only when a Linux box sends/receives data to/from the NFS/iSCSI shares that we have problems. But if the Linux box sends/receives files through scp to the external disks mounted by the Nexenta box as a local filesystem, then there is no problem.

Does the Linux box have the same issue with any other server? What if the client box isn't Linux, but Solaris, Windows or Mac OS X?

-- Darren J Moffat
Re: [zfs-discuss] Performance issues with iSCSI under Linux
> Does the Linux box have the same issue with any other server? What if the client box isn't Linux, but Solaris, Windows or Mac OS X?

That would be a good test. We'll try that.
Re: [zfs-discuss] Performance issues with iSCSI under Linux
After contacting LSI, they say that the 9200-16e HBA is not supported on OpenSolaris, just Solaris. Aren't Solaris drivers the same as OpenSolaris ones? Is there anyone here using 9200-16e HBAs? What about the 9200-8e? We have a couple lying around and we'll test one shortly. Ian
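For what it's worth, a quick way to check whether such a card is claimed by the stock mpt_sas driver on an OpenSolaris box (this is a generic check, not LSI's supported procedure):

    # is the card visible, and which driver attached to it?
    prtconf -D | grep -i mpt
    # which PCI IDs the mpt_sas driver claims on this system
    grep mpt_sas /etc/driver_aliases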
Re: [zfs-discuss] Supermicro AOC-USAS2-L8i
> The mpt_sas driver supports it.
>
> We've had LSI 2004 and 2008 controllers hang for quite some time when used with SuperMicro chassis and Intel X25-E SSDs (OSOL b134 and b147). It seems to be a firmware issue that isn't fixed by the latest update.

Do you mean to include all the PCIe cards, not just the AOC-USAS2-L8i, and when directly connected rather than through the backplane? Prior reports here seemed to implicate the card only when it was connected to the backplane.

--
Maurice Volaski, maurice.vola...@einstein.yu.edu
Computing Support, Rose F. Kennedy Center
Albert Einstein College of Medicine of Yeshiva University
Re: [zfs-discuss] Performance issues with iSCSI under Linux
A little setback... We found out that we also have the issue with the Dell H800 controllers, not just the LSI 9200-16e. With the Dell it's initially faster, as we benefit from the cache, but after a little while it goes sour - from 350MB/s down to less than 40MB/s. We've also tried with an LSI 9200-8e, with the same results.

So to recap... No matter what HBA we use, copying through the network to/from the external drives is painfully slow when access is done through either NFS or iSCSI. HOWEVER, it is plenty fast when we do an scp where the data is written to the external drives (or internal ones, for that matter) when they are seen by the Nexenta box as local drives - i.e. when neither NFS nor iSCSI is involved.

What now? :)
Re: [zfs-discuss] Performance issues with iSCSI under Linux
On 15 Oct 2010, at 22:19, Ian D wrote:
> A little setback... We found out that we also have the issue with the Dell H800 controllers, not just the LSI 9200-16e. With the Dell it's initially faster, as we benefit from the cache, but after a little while it goes sour - from 350MB/s down to less than 40MB/s. We've also tried with an LSI 9200-8e, with the same results. So to recap... No matter what HBA we use, copying through the network to/from the external drives is painfully slow when access is done through either NFS or iSCSI. HOWEVER, it is plenty fast when we do an scp where the data is written to the external drives (or internal ones, for that matter) when they are seen by the Nexenta box as local drives - i.e. when neither NFS nor iSCSI is involved.

Sounds an awful lot like client-side issues, possibly coupled with networking problems. Have you looked into disabling the Nagle algorithm on the client side? That's something that can impact both iSCSI and NFS badly, but ssh is usually not as affected... I vaguely remember that being a real performance killer on some Linux versions. Another thing to check would be to ensure that noatime is set, so that your reads aren't triggering writes across the network as well.

Cheers, Erik
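A sketch of where the second of Erik's knobs lives on a Linux client (the server, export and mount-point names are invented for illustration):

    # mount the NFS share without access-time updates
    mount -t nfs -o vers=3,noatime nexenta:/tank/share /mnt/share

For iSCSI, Nagle is a per-socket setting (TCP_NODELAY); whether it can be set from a config file depends on the initiator version, so that one is best confirmed with a packet trace rather than assumed.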
Re: [zfs-discuss] Performance issues with iSCSI under Linux
-----Original Message-----
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Ian D
Sent: Friday, October 15, 2010 4:19 PM
To: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] Performance issues with iSCSI under Linux

> A little setback... We found out that we also have the issue with the Dell H800 controllers, not just the LSI 9200-16e. [...] HOWEVER, it is plenty fast when we do an scp [...] when neither NFS nor iSCSI is involved.

Has anyone suggested either removing the L2ARC/slog entirely, or relocating them so that all devices are coming off the same controller? You've swapped the external controller, but the H700 with the internal drives could be the real culprit. Could there be issues with cross-controller IO in this case? Does the H700 use the same chipset/driver as the other controllers you've tried?

I don't have a good understanding of where the various software components fit together here, but it seems like the problem is not with the controller(s) but with whatever is queueing network IO requests to the storage subsystem (or controlling queues/buffers/etc. for this). Do NFS and iSCSI share a code path for this?

-Will
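If Ian wants to try that, a sketch of the commands involved (the pool and device names are placeholders; data on a cache device is safe to drop at any time, while removing a log device requires a pool version recent enough to support log removal):

    # see which devices are cache (L2ARC) and log (slog)
    zpool status tank
    # drop the L2ARC device
    zpool remove tank c3t0d0
    # drop the slog
    zpool remove tank c3t1d0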
Re: [zfs-discuss] Optimal raidz3 configuration
On Wed, 13 Oct 2010, Edward Ned Harvey wrote:
> raidzN takes a really long time to resilver (code written inefficiently, it's a known problem). If you had a huge raidz3, it would literally never finish, because it couldn't resilver as fast as new data appears. A week [...]

In what way is the code written inefficiently?

Bob

--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] Performance issues with iSCSI under Linux
> Has anyone suggested either removing the L2ARC/slog entirely, or relocating them so that all devices are coming off the same controller? You've swapped the external controller, but the H700 with the internal drives could be the real culprit. Could there be issues with cross-controller IO in this case? Does the H700 use the same chipset/driver as the other controllers you've tried?

We'll try that. We have a couple of other devices we could use for the slog, like a DDRDrive X1 and an OCZ Z-Drive, which are both PCIe cards and don't use the local controller. Thanks
Re: [zfs-discuss] Optimal raidz3 configuration
Sorry, I can't not respond...

Edward Ned Harvey wrote:
> whatever you do, *don't* configure one huge raidz3.

Peter, whatever you do, *don't* make a decision based on blanket generalizations.

> If you can afford mirrors, your risk is much lower. Because although it's physically possible for 2 disks to fail simultaneously and ruin the pool, the probability of that happening is smaller than the probability of 3 simultaneous disk failures on the raidz3.

Edward, I normally agree with most of what you have to say, but this has gone off the deep end. I can think of counter-use-cases far faster than I can type.

> Due to smaller resilver window. Coupled with a smaller MTTDL, smaller cabinet space yield, smaller $/GB ratio, etc. I highly endorse mirrors for nearly all purposes.

Clearly.

Peter, go straight to the source: http://blogs.sun.com/roch/entry/when_to_and_not_to

In short:
1. vdev_count = spindle_count / (stripe_width + parity_count)
2. IO/s is proportional to vdev_count
3. Usable capacity is proportional to stripe_width * vdev_count
4. A mirror can be approximated by a stripe of width one
5. Mean time to data loss increases exponentially with parity_count
6. Resilver time increases (super)linearly with stripe_width

Balance capacity available, storage needed, performance needed and your own level of paranoia regarding data loss. My home server's main storage is a 22-disk (19 + 3) RAIDZ3 pool, backed up hourly to a 14-disk (11 + 3) RAIDZ3 backup pool. Clearly this is not a production Oracle server. Equally clear is that my paranoia index is rather high. ZFS will let you choose the combination of stripe width and parity count which works for you. There is no one size fits all.
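To make Roch's relationships concrete, a worked example (the numbers are purely illustrative): 60 spindles arranged as 6-disk raidz2 vdevs (stripe_width 4, parity_count 2) gives vdev_count = 60 / (4 + 2) = 10, so random IO/s scales like 10 vdevs and usable capacity like 4 * 10 = 40 disks. The same 60 spindles as mirror pairs (stripe_width 1, parity_count 1) give vdev_count = 60 / (1 + 1) = 30: three times the random IO/s, but usable capacity of only 1 * 30 = 30 disks. That trade, repeated for whatever width and parity you pick, is the whole decision.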
Re: [zfs-discuss] Optimal raidz3 configuration
On Fri, Oct 15, 2010 at 3:16 PM, Marty Scholes martyscho...@yahoo.com wrote:
> My home server's main storage is a 22-disk (19 + 3) RAIDZ3 pool, backed up hourly to a 14-disk (11 + 3) RAIDZ3 backup pool.

How long does it take to resilver a disk in that pool? And how long does it take to run a scrub?

When I initially set up a 24-disk raidz2 vdev, it died trying to resilver a single 500 GB SATA disk. I/O was under 1 MB/s, all 24 drives were thrashing like crazy, and I could barely even log in to the system and type onscreen. It was a nightmare. That, and normal (no scrub, no resilver) disk I/O was abysmal.

Since then, I've avoided any vdev with more than 8 drives in it.

--
Freddie Cash
fjwc...@gmail.com
Re: [zfs-discuss] Finding corrupted files
On Oct 15, 2010, at 9:18 AM, Stephan Budach stephan.bud...@jvm.de wrote:
> Am 14.10.10 17:48, schrieb Edward Ned Harvey:
> [... earlier raid5/6 discussion snipped; quoted in full above ...]
>
> Point taken! So, what would you suggest if I wanted to create really big pools? Say, in the 100 TB range? That would be quite a number of single drives, especially when you want to go with mirrors.

A pool consisting of 4-disk raidz vdevs (25% overhead) or 6-disk raidz2 vdevs (33% overhead) should deliver the storage and performance for a pool that size, versus a pool of mirrors (50% overhead). You need a lot of spindles to reach 100TB.

-Ross
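Ross's overhead figures translate into drive counts like this (assuming 2TB drives and ignoring formatting overhead; both are assumptions, not figures from the thread): a 6-disk raidz2 vdev yields 4 x 2TB = 8TB usable, so about 13 such vdevs (78 drives) reach 100TB; 4-disk raidz1 vdevs yield 6TB each, needing about 17 vdevs (68 drives); mirror pairs yield 2TB each, needing 50 pairs (100 drives).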
[zfs-discuss] how to replace failed vdev on non redundant pool?
Hello,

I would like to know how to replace a failed vdev in a non-redundant pool. I am using fibre-attached disks, and cannot simply place the disk back into the machine, since it is virtual. I have the latest kernel from Sept 2010, which includes all of the new ZFS upgrades. Please, can you help me?

-
Cassandra
(609) 243-2413
Unix Administrator

"From a little spark may burst a mighty flame." -Dante Alighieri
Re: [zfs-discuss] Performance issues with iSCSI under Linux
On Oct 15, 2010, at 5:34 PM, Ian D rewar...@hotmail.com wrote:
> > Has anyone suggested either removing the L2ARC/slog entirely, or relocating them so that all devices are coming off the same controller? [...]
>
> We'll try that. We have a couple of other devices we could use for the slog, like a DDRDrive X1 and an OCZ Z-Drive, which are both PCIe cards and don't use the local controller.

What mount options are you using on the Linux client for the NFS share?

-Ross
Re: [zfs-discuss] how to replace failed vdev on non redundant pool?
If the pool is non-redundant and your vdev has failed, you have lost your data. Just rebuild the pool, but consider a redundant configuration.

On Oct 15, 2010, at 3:26 PM, Cassandra Pugh wrote:
> I would like to know how to replace a failed vdev in a non-redundant pool. I am using fibre-attached disks, and cannot simply place the disk back into the machine, since it is virtual. [...]

Scott Meilicke
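In concrete terms, "rebuild" means destroying and recreating the pool, then restoring from backup; a sketch with placeholder names, this time using mirrored pairs so a single device failure is survivable:

    zpool destroy tank
    zpool create tank mirror c1t0d0 c2t0d0 mirror c1t1d0 c2t1d0
    # restore the data from backup, e.g. from a saved send stream:
    zfs receive -F tank/data < /backup/tank_data.zfs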
Re: [zfs-discuss] Optimal raidz3 configuration
> On Fri, Oct 15, 2010 at 3:16 PM, Marty Scholes martyscho...@yahoo.com wrote:
> > My home server's main storage is a 22-disk (19 + 3) RAIDZ3 pool, backed up hourly to a 14-disk (11 + 3) RAIDZ3 backup pool.
>
> How long does it take to resilver a disk in that pool? And how long does it take to run a scrub? When I initially set up a 24-disk raidz2 vdev, it died trying to resilver a single 500 GB SATA disk. [...] Since then, I've avoided any vdev with more than 8 drives in it.

My situation is kind of unique. I picked up 120 15K 73GB FC disks early this year for $2 per. As such, spindle count is a non-issue. As a home server, it has very little need for write IOPS, and I have 8 disks for L2ARC on the main pool. Main pool is at 40% capacity and backup pool is at 65% capacity. Both take about 70 minutes to scrub. The last time I tested a resilver it took about 3 hours.

The difference is that these are low-capacity 15K FC spindles and the pool has very little sustained I/O; it only bursts now and again. Resilvers would go mostly uncontested, and with RAIDZ3 + autoreplace=off, I can actually schedule a resilver.
Re: [zfs-discuss] ZPool creation brings down the host
Thanks, James, for the response. Please find attached the crash dump that we got from the admin. Regards, Anand

From: James C. McPherson j...@opensolaris.org
To: Ramesh Babu rama.b...@gmail.com
Cc: zfs-discuss@opensolaris.org; anand_...@yahoo.com
Sent: Thu, 7 October, 2010 11:56:36 AM
Subject: Re: [zfs-discuss] ZPool creation brings down the host

On 7/10/10 03:46 PM, Ramesh Babu wrote:
> I am trying to create a ZPool using a single Veritas volume. The host goes down as soon as I issue the zpool create command. It looks like the command is crashing and bringing the host down. Please let me know what the issue might be. Below is the command used; textvol is the Veritas volume and testpool is the name of the pool which I am trying to create.
>
>     zpool create testpool /dev/vx/dsk/dom/textvol

That's not a configuration that I'd recommend - you're layering one volume management system on top of another. It seems that it's getting rather messy inside the kernel. Do you have the panic stack trace we can look at, and/or a crash dump?

James C. McPherson
--
Oracle
http://www.jmcp.homeunix.com/blog
Re: [zfs-discuss] ZPool creation brings down the host
Thank you very much, Victor, for the update. Regards, Anand

From: Victor Latushkin victor.latush...@oracle.com
To: j...@opensolaris.org
Cc: Anand Bhakthavatsala anand_...@yahoo.com; zfs-discuss discuss zfs-discuss@opensolaris.org
Sent: Fri, 8 October, 2010 1:33:57 PM
Subject: Re: [zfs-discuss] ZPool creation brings down the host

On Oct 8, 2010, at 10:25 AM, James C. McPherson wrote:
> On 8/10/10 03:28 PM, Anand Bhakthavatsala wrote:
> [... earlier exchange snipped; quoted in full above ...]
>
> The panic stack trace:
>
>     vxioioctl+0x4c0(1357918, 42a, 0, ff0, 10, 0)
>     vdev_disk_open+0x4c4(300036fd9c0, 7c00, 2a100fe3440, 18dbc00, 3000cf04900, 18c0268)
>     vdev_open+0x9c(300036fd9c0, 1, 1274400, 0, 3000e647800, 6)
>     vdev_root_open+0x48(30004036080, 2a100fe35b8, 2a100fe35b0, 0, 7c00, 138)
>     vdev_open+0x9c(30004036080, 1c, 0, 0, 3000e647800, 6)
>     vdev_create+4(30004036080, 4, 0, 130e3c8, 0, 130e000)
>     spa_create+0x1a4(0, 30011ffb500, 0, 300124cc040, 0, 3000e647800)
>     zfs_ioc_pool_create+0x18c(30008524000, 0, 0, 74, 0, 300124cc040)
>     zfsdev_ioctl+0x184(0, 18dbff0, ffbfa728, 0, 0, 1000)
>     fop_ioctl+0x20(60015662e40, 5a00, ffbfa728, 13, 3000a4407a0, 127aa58)
>     ioctl+0x184(3, 3000cb5fd28, ffbfa728, 0, 0, 5a00)
>     syscall_trap32+0xcc(3, 5a00, ffbfa728, 0, 0, ffbfa270)
>
> Looks like you need to ask Symantec what's going on in their vxioioctl function.

This is most likely:

    6940833 vxio`vxioioctl() panics when zfs passes it a NULL rvalp via ldi_ioctl()
    http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6940833

victor
Re: [zfs-discuss] Finding corrupted files
On 12.10.10 14:21, Edward Ned Harvey wrote:
> > From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Stephan Budach
> >
> >     c3t211378AC0253d0  ONLINE  0  0  0
>
> How many disks are there inside of c3t211378AC0253d0? How are they configured? Hardware raid5? A mirror of two hardware raid5s? The point is: this device, as seen by ZFS, is not a pure storage device. It is a high-level device representing some LUN or something, which is configured and controlled by hardware raid. If there's zero redundancy in that device, then scrub would probably find the checksum errors consistently and repeatably. If there's some redundancy in that device, then all bets are off. Sometimes scrub might read the good half of the data, and other times, the bad half. But then again, the error might not be in the physical disks themselves. The error might be somewhere in the raid controller(s) or the interconnect. Or even some weird unsupported driver or something.

Both raid boxes run raid6 with 16 drives each. This is the reason I was running a non-mirrored pool in the first place. I fully understand that ZFS's power comes into play when you're running with multiple independent drives, but that was what I had at hand. I now also get what you meant by "good half", but I don't dare to say whether or not this also applies to a raid6 setup.

Regards

--
Stephan Budach
Jung von Matt/it-services GmbH
Glashüttenstraße 79
20357 Hamburg
Tel: +49 40-4321-1353
Fax: +49 40-4321-1114
E-Mail: stephan.bud...@jvm.de
Internet: http://www.jvm.com
Geschäftsführer: Ulrich Pallas, Frank Wilhelm
AG HH HRB 98380
[zfs-discuss] ZPOOL_CONFIG_IS_HOLE
Hi,

Can someone shed some light on what this ZPOOL_CONFIG is, exactly? At a guess, is it a bad sector of the disk, non-writable, and thus ZFS marks it as a hole?

cheers, Matt
[zfs-discuss] New STEP pkgs built via autoTSI
The following new test versions have had STEP pkgs built for them. [You are receiving this email because you are listed as the owner of the testsuite in the STC.INFO file, or you are on the s...@sun.com alias]

    tcp v2.7.10      STEP pkg built for Solaris Snv
    zfstest v1.23    STEP pkg built for Solaris Snv
    tcp v2.6.11      STEP pkg built for Solaris S10
    zfstest v1.23    STEP pkg built for Solaris S10
Re: [zfs-discuss] ZPOOL_CONFIG_IS_HOLE
You should only see a HOLE in your config if you removed a slog after having added more stripes. Nothing to do with bad sectors.

On 14 Oct 2010, at 06:27, Matt Keenan wrote:
> Can someone shed some light on what this ZPOOL_CONFIG is, exactly? At a guess, is it a bad sector of the disk, non-writable, and thus ZFS marks it as a hole?
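A way to reproduce the sequence Phil describes on a scratch pool (device names are placeholders, and the exact zdb output format varies by build, so treat this as a sketch):

    zpool create testhole c1t0d0
    zpool add testhole log c1t1d0
    zpool add testhole c1t2d0        # add another stripe after the slog
    zpool remove testhole c1t1d0     # removing the slog leaves a gap in its slot
    # the cached config should now show a vdev of type 'hole'
    zdb -C testhole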
Re: [zfs-discuss] Finding corrupted files
> From: Stephan Budach [mailto:stephan.bud...@jvm.de]
>
> Point taken! So, what would you suggest if I wanted to create really big pools? Say, in the 100 TB range? That would be quite a number of single drives, especially when you want to go with mirrors.

You have a lot of disks. You either tell the hardware to manage a lot of disks and then tell ZFS to manage a single device, taking unnecessary risk and performance degradation for no apparent reason... or you tell ZFS to manage a lot of disks. Either way, you have a lot of disks that need to be managed by something. Why would you want that something to be hardware instead of ZFS?

For 100TB... I suppose you have 2TB disks. I suppose you have 12 buses. I would make a raidz1 using 1 disk from each of bus0, bus1, ... bus5. I would make another raidz1 vdev using a disk from each of bus6, bus7, ... bus11. And so forth. Then, even if you lose a whole bus, you still haven't lost your pool. Each raidz1 vdev would be 6 disks with a capacity of 5, so you would have a total of 10 vdevs, and that means 5 disks on each bus.

Or do whatever you want. The point is: yes, give all the individual disks to ZFS.
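A sketch of that layout as commands (the disk names are invented; the real ones would come from format or cfgadm):

    # first two of the ten 6-disk raidz1 vdevs: one per bus group (c0..c5, c6..c11)
    zpool create bigtank \
      raidz1 c0t0d0 c1t0d0 c2t0d0 c3t0d0 c4t0d0 c5t0d0 \
      raidz1 c6t0d0 c7t0d0 c8t0d0 c9t0d0 c10t0d0 c11t0d0
    # ...the remaining eight vdevs follow the same pattern, e.g.:
    zpool add bigtank raidz1 c0t1d0 c1t1d0 c2t1d0 c3t1d0 c4t1d0 c5t1d0

Since each vdev takes one disk per bus, a whole-bus failure costs each affected raidz1 vdev exactly one disk, which is what makes the pool survive it.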
Re: [zfs-discuss] how to replace failed vdev on non redundant pool?
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Cassandra Pugh
>
> I would like to know how to replace a failed vdev in a non-redundant pool?

Non-redundant... failed... what do you expect? This seems like a really simple answer... You can't. Unless perhaps I've misunderstood the question, or the question wasn't asked right or something...
Re: [zfs-discuss] Optimal raidz3 configuration
On 10/16/10 12:29 PM, Marty Scholes wrote:
> [... Freddie's resilver question snipped; quoted in full above ...]
>
> My situation is kind of unique. I picked up 120 15K 73GB FC disks early this year for $2 per. As such, spindle count is a non-issue. As a home server, it has very little need for write IOPS, and I have 8 disks for L2ARC on the main pool.

I'd hate to be paying your power bill!

> Main pool is at 40% capacity and backup pool is at 65% capacity. Both take about 70 minutes to scrub. The last time I tested a resilver it took about 3 hours.

So a tiny, fast drive takes three hours; consider how long a 30x bigger, much slower drive will take.

--
Ian.