[zfs-discuss] Re: zfs send -i A B with B older than A
Matthew Ahrens Matthew.Ahrens@sun.com writes: True, but presumably restoring the snapshots is a rare event. You are right, this would only happen in case of disaster and total loss of the backup server. I thought that your onsite and offsite pools were the same size? If so then you should be able to fit the entire contents of the onsite pool in one of the offsite ones. Well, I simplified the example. In reality, the offsite pool is slightly smaller due to a different number of disks and sizes. Also, if you can afford to waste some space, you could do something like: zfs send onsite@T-100 | ... zfs send -i T-100 onsite@T-0 | ... zfs send -i T-100 onsite@T-99 | ... zfs send -i T-99 onsite@T-98 | ... [...] Yes, I thought about it. I might do this if the delta between T-100 and T-0 is reasonable. Oh, and while I am thinking about it, besides zfs send | gzip | gpg, and zfs-crypto, a third option would be to use ZFS on top of a lofi device with compression and cryptography. I went to the project page, only to realize that they haven't shipped anything yet. Do you know how hard it would be to implement zfs send -i A B with B older than A? Or why hasn't this been done in the first place? I am just being curious here; I can't wait for this feature anyway (even though it would make my life so much simpler). -marc
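A rough sketch of the send-plus-encryption pipeline being discussed, using the snapshot names from above; the offsite host name, gpg recipient and file paths are purely placeholders:

  # one-time full send of the oldest snapshot kept offsite
  zfs send onsite@T-100 | gzip -c | gpg -e -r backup@example.com | \
      ssh offsite 'cat > /backup/onsite_T-100.zfs.gz.gpg'

  # incremental from T-100 straight to the newest snapshot
  zfs send -i T-100 onsite@T-0 | gzip -c | gpg -e -r backup@example.com | \
      ssh offsite 'cat > /backup/onsite_T-100_to_T-0.zfs.gz.gpg'

  # restore by reversing the pipeline into a scratch filesystem
  ssh offsite 'cat /backup/onsite_T-100.zfs.gz.gpg' | gpg -d | gunzip -c | zfs recv offsitepool/onsite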
[zfs-discuss] zfs seems slow deleting files
My shop recently switched our mail fileserver from an old Network Appliance to a Solaris box running ZFS. Customers are mostly indifferent to the change, except for one or two use cases which are dramatically slower. The most noticeable problem is that deleting email messages is much slower. Each customer has a Courier-style maildir, so messages are stored as individual files. Typically, a user with Mutt or Pine, accessing the maildir via NFSv3, will mark a dozen or more messages as deleted, then exit the mail client. Only at exit will Mutt/Pine actually delete the files - when that happens, the delay can be as long as 30 seconds for 15 files, or 90 seconds for 150 files (these are estimates, I haven't been timing things yet). The NFS clients are NetBSD; we've started to run ktrace (the closest thing to truss on BSD) and initial indications are that the unlink() call (one for each deleted mail message) is taking a long time to complete. Any suggestions as to what might be going on? We have 9 snapshots online, taken every 4 hours or so. The ZFS server is using a 12-disk array, one spare, in a raidz2 configuration. There are scads of available CPU (Intel Core 2 Quad, CPU idle time is generally 90%), and we're running a local IMAP/POP server for some clients (who also complain about occasional slowness with some operations). Also, any pointers to troubleshooting performance issues with Solaris and ZFS would be appreciated. The last time I was heavily using Solaris was 2.6, and I see a lot of good toys have been added to the system since then. -- Ed
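One quick way to see where the time goes on the server side is a DTrace aggregation; note this sketch only observes unlink(2) issued locally (a test rm on the server itself), since unlinks arriving over NFS are handled in the kernel and never cross the syscall layer:

  dtrace -n '
  syscall::unlink:entry { self->t = timestamp; }
  syscall::unlink:return /self->t/ {
      @["unlink latency (ns)"] = quantize(timestamp - self->t);
      self->t = 0;
  }'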
[zfs-discuss] ZFS - SAN and Raid
Hi All, We have come across a problem at a client where ZFS brought the system down with a write error on an EMC device, because the mirroring is done at the EMC level and not in ZFS. The client is totally committed to EMC and not too happy to use ZFS for mirroring/RAID-Z. I have seen the notes below about ZFS and SAN-attached devices and understand the ZFS behaviour. Can someone help me with the following questions: Is this the way ZFS will work in the future? Is there going to be any compromise to allow SAN RAID and ZFS to do the rest? If so, when, and if possible, details of it? Many Thanks Rgds Roshan Does ZFS work with SAN-attached devices? Yes, ZFS works with either direct-attached devices or SAN-attached devices. However, if your storage pool contains no mirror or RAID-Z top-level devices, ZFS can only report checksum errors but cannot correct them. If your storage pool consists of mirror or RAID-Z devices built using storage from SAN-attached devices, ZFS can report and correct checksum errors. This says that if we are not using ZFS raid or mirror then the expected event would be for ZFS to report but not fix the error. In our case the system kernel panicked, which is something different. Is the FAQ wrong or is there a bug in ZFS?
Re: [zfs-discuss] ZFS - SAN and Raid
Roshan, Could you provide more detail please. The host and zfs should be unaware of any EMC array side replication so this sounds more like an EMC misconfiguration than a ZFS problem. Did you look in the messages file to see if anything happened to the devices that were in your zpools? If so then that wouldn't be a zfs error. If your EMC devices fall offline because of something happening on the array or fabric then zfs is not to blame. The same thing would have happened with any other filesystem built on those devices. What kind of pools were in use, raidz, mirror or simple stripe? Regards, Vic On 6/19/07, Roshan Perera [EMAIL PROTECTED] wrote: [...]
[zfs-discuss] New german white paper on ZFS
Hi, if you understand german or want to brush it up a little, I've a new ZFS white paper in german for you: http://blogs.sun.com/constantin/entry/new_zfs_white_paper_in Since there's already so much collateral on ZFS in english, I thought it's time for some localized stuff for my country. There are also some new ZFS slides that go with it, also in german. Let me know if you have any suggestions. Hope this helps, Constantin -- Constantin GonzalezSun Microsystems GmbH, Germany Platform Technology Group, Global Systems Engineering http://www.sun.de/ Tel.: +49 89/4 60 08-25 91 http://blogs.sun.com/constantin/ Sitz d. Ges.: Sun Microsystems GmbH, Sonnenallee 1, 85551 Kirchheim-Heimstetten Amtsgericht Muenchen: HRB 161028 Geschaeftsfuehrer: Marcel Schneider, Wolfgang Engels, Dr. Roland Boemer Vorsitzender des Aufsichtsrates: Martin Haering ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: ZFS - SAN and Raid
Victor, Thanks for your comments, but I believe it contradicts the ZFS information given below and now Bruce's mail. After some digging around I found that the messages file has thrown out some powerpath errors to one of the devices that may have caused the problem; the errors are attached below. But the question still remains: is ZFS only happy with JBOD disks and not SAN storage with hardware RAID? Thanks Roshan Jun 4 16:30:09 su621dwdb ltid[23093]: [ID 815759 daemon.error] Cannot start rdevmi process for remote shared drive operations on host su621dh01, cannot connect to vmd Jun 4 16:30:12 su621dwdb emcp: [ID 801593 kern.notice] Info: Assigned volume Symm 000290100491 vol 0ffe to Jun 4 16:30:12 su621dwdb last message repeated 1 time Jun 4 16:30:12 su621dwdb emcp: [ID 801593 kern.notice] Info: Assigned volume Symm 000290100491 vol 0fee to Jun 4 16:30:12 su621dwdb unix: [ID 836849 kern.notice] Jun 4 16:30:12 su621dwdb ^Mpanic[cpu550]/thread=2a101dd9cc0: Jun 4 16:30:12 su621dwdb unix: [ID 809409 kern.notice] ZFS: I/O failure (write on unknown off 0: zio 600574e7500 [L0 unallocated] 4000L/400P DVA[0]=5:55c00:400 DVA[1]=6:2b800:400 fletcher4 lzjb BE contiguous birth=107027 fill=0 cksum=673200f97f:34804a0e20dc:102879bdcf1d13:3ce1b8dac7357de): error 5 Jun 4 16:30:12 su621dwdb unix: [ID 10 kern.notice] Jun 4 16:30:12 su621dwdb genunix: [ID 723222 kern.notice] 02a101dd9740 zfs:zio_done+284 (600574e7500, 0, a8, 708fdca0, 0, 6000f26cdc0) Jun 4 16:30:12 su621dwdb genunix: [ID 179002 kern.notice] %l0-3: 060015beaf00 0 000708fdc00 0005 0005 We have the same problem and I have just moved back to UFS because of this issue. According to the engineer at Sun that I spoke with, he implied that there is an RFE out internally that is to address this problem. The issue is this: when configuring a zpool with 1 vdev in it and zfs times out a write operation to the pool/filesystem for whatever reason, possibly just a hold back or retryable error, the zfs module will cause a system panic because it thinks there are no other mirrors in the pool to write to and forces a kernel panic. The way around this is to configure the zpools with mirrors, which negates the use of a hardware raid array, and sends twice the amount of data down to the RAID cache than is actually required (because of the mirroring at the ZFS layer). In our case it was a little old Sun StorEdge 3511 FC SATA Array, but the principle applies to any RAID array that is not configured as a JBOD. Victor Engle wrote: [...]
[zfs-discuss] Re: ZFS - SAN and Raid
Roshan, As far as I know, there is no problem at all with using SAN storage with ZFS and it does look like you were having an underlying problem with either powerpath or the array. The best practices guide on opensolaris does recommend replicated pools even if your backend storage is redundant. There are at least 2 good reasons for that. ZFS needs a replica for the self healing feature to work. Also there is no fsck like tool for ZFS so it is a good idea to make sure self healing can work. I think first I would track down the cause of the messages just prior to the zfs write error because even with replicated pools if several devices error at once then the pool could be lost. Regards, Vic On 6/19/07, Roshan Perera [EMAIL PROTECTED] wrote: [...]
Re: [zfs-discuss] Re: ZFS - SAN and Raid
Victor Engle wrote: Roshan, As far as I know, there is no problem at all with using SAN storage with ZFS and it does look like you were having an underlying problem with either powerpath or the array. Correct. A write failed. The best practices guide on opensolaris does recommend replicated pools even if your backend storage is redundant. There are at least 2 good reasons for that. ZFS needs a replica for the self healing feature to work. Also there is no fsck like tool for ZFS so it is a good idea to make sure self healing can work. Yes, currently ZFS on Solaris will panic if a non-redundant write fails. This is known and being worked on, but there really isn't a good solution if a write fails, unless you have some ZFS-level redundancy. NB. fsck is not needed for ZFS because the on-disk format is always consistent. This is orthogonal to hardware faults. I think first I would track down the cause of the messages just prior to the zfs write error because even with replicated pools if several devices error at once then the pool could be lost. Yes, multiple failures can cause data loss. No magic here. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] changing mdb memory values
I want to set the values of arc c and arc p (C_max and P_addr) to different memory values. What would be the hexadecimal value for 256MB and for 128MB? I'm trying to use mdb -k to limit the amount of memory ZFS uses.
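The arithmetic is the easy part: 256 MB is 268435456 bytes, or 0x10000000, and 128 MB is 134217728 bytes, or 0x8000000. A quick way to compute the hex values, plus a sketch of the mdb write itself; the arc symbol and field names below are an example of what builds of this era expose, so verify the actual addresses with ::print before writing anything:

  # hex values for the two sizes
  printf '0x%x\n' $((256 * 1024 * 1024))   # 0x10000000
  printf '0x%x\n' $((128 * 1024 * 1024))   # 0x8000000

  # sketch: find the addresses of c_max, c and p, then write 64-bit values with mdb -kw
  echo "arc::print -a c_max c p" | mdb -k
  # echo "<addr_of_c_max>/Z 0x10000000" | mdb -kw   # cap the ARC at 256 MB
  # echo "<addr_of_p>/Z 0x8000000"      | mdb -kw   # set p to 128 MB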
Re: [zfs-discuss] zfs seems slow deleting files
On 19 June, 2007 - Ed Ravin sent me these 1,7K bytes: Also, any pointers to troubleshooting performance issues with Solaris and ZFS would be appreciated. The last time I was heavily using Solaris was 2.6, and I see a lot of good toys have been added to the system since then. Does it only happen with NetBSD as the client? Try Linux/Solaris/something and see what happens. Is it slow when doing a local rm? Try a regular 'rm -rf somedirectorywiththecorrectamountoffilesinit' in various test cases too. How many files are there in those directories? Almost empty? A million files? Is it still slow on filesystems which don't have a bunch of snapshots? ...etc /Tomas -- Tomas Ögren, [EMAIL PROTECTED], http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se
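A quick way to build a comparable local test case on the server (assuming a bash or ksh shell and a ZFS filesystem mounted at /tank, both just examples):

  mkdir /tank/unlinktest && cd /tank/unlinktest
  i=0; while [ $i -lt 150 ]; do touch msg.$i; i=$((i + 1)); done
  sync
  time rm msg.*    # repeat over NFS and on a filesystem with no snapshots for comparison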
[zfs-discuss] Minimum number of Disks
Hi, I'm planning to deploy a small file server based on ZFS, but I want to know how many disks do I need for raidz and for raidz2, I mean, which are the minimum disks required. Thank you in advance. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: ZFS - SAN and Raid
The best practices guide on opensolaris does recommend replicated pools even if your backend storage is redundant. There are at least 2 good reasons for that. ZFS needs a replica for the self healing feature to work. Also there is no fsck like tool for ZFS so it is a good idea to make sure self healing can work. NB. fsck is not needed for ZFS because the on-disk format is always consistent. This is orthogonal to hardware faults. I understand that the on disk state is always consistent but the self healing feature can correct blocks that have bad checksums if zfs is able to retrieve the block from a good replica. So even though the filesystem is consistent, the data can be corrupt in non-redundant pools. I am unsure of what happens with a non-redundant pool when a block has a bad checksum and perhaps you could clear that up. Does this cause a problem for the pool or is it limited to the file or files affected by the bad block and otherwise the pool is online and healthy. Thanks, Vic ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Minimum number of Disks
Huitzi wrote: Hi, I'm planning to deploy a small file server based on ZFS, but I want to know how many disks do I need for raidz and for raidz2, I mean, which are the minimum disks required. If you have 2 disks, use mirroring (raidz would be no better). If you have 3 disks, use a 3-way mirror or single-parity raidz (raidz2 would be no better than a 3-way mirror). If you have 4 disks, use 2 2-way mirrors, single-parity raidz, or double-parity raidz2. --matt
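For reference, the corresponding zpool create lines look roughly like this (device names are placeholders):

  zpool create tank mirror c0t0d0 c0t1d0                        # 2 disks: 2-way mirror
  zpool create tank raidz c0t0d0 c0t1d0 c0t2d0                  # 3 disks: single-parity raidz
  zpool create tank raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0          # 4 disks: double-parity raidz2
  zpool create tank mirror c0t0d0 c0t1d0 mirror c0t2d0 c0t3d0   # 4 disks: two 2-way mirrors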
Re: [zfs-discuss] Minimum number of Disks
Hi The minimum number of disks for raidz is 3 (you can fool it, but it won't protect your data), and the minimum for raidz2 is 4. James Dickens uadmin.blogspot.com On 6/19/07, Huitzi [EMAIL PROTECTED] wrote: Hi, I'm planning to deploy a small file server based on ZFS, but I want to know how many disks do I need for raidz and for raidz2, I mean, which are the minimum disks required. Thank you in advance.
Re: [zfs-discuss] Re: ZFS - SAN and Raid
[EMAIL PROTECTED] said: attached below the errors. But the question still remains is ZFS only happy with JBOD disks and not SAN storage with hardware raid. Thanks ZFS works fine on our SAN here. You do get a kernel panic (Solaris-10U3) if a LUN disappears for some reason (without ZFS-level redundancy), but I understand that bug is fixed in a Nevada build; I'm hoping to see the fix in Solaris-10U4. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: ZFS - SAN and Raid
We have the same problem and I have just moved back to UFS because of this issue. According to the engineer at Sun that I spoke with, he implied that there is an RFE out internally that is to address this problem. The issue is this: when configuring a zpool with 1 vdev in it and zfs times out a write operation to the pool/filesystem for whatever reason, possibly just a hold back or retryable error, the zfs module will cause a system panic because it thinks there are no other mirrors in the pool to write to and forces a kernel panic. The way around this is to configure the zpools with mirrors, which negates the use of a hardware raid array, and sends twice the amount of data down to the RAID cache than is actually required (because of the mirroring at the ZFS layer). In our case it was a little old Sun StorEdge 3511 FC SATA Array, but the principle applies to any RAID array that is not configured as a JBOD. Victor Engle wrote: [...]
Re: [zfs-discuss] Is this storage model correct?
On Jun 19, 2007, at 11:23 AM, Huitzi wrote: Hi once again and thank you very much for your reply. Here is another thread. I'm planning to deploy a small file server based on ZFS. I want to know if I can start with 2 RAIDs, and add more RAIDs in the future (like the gray RAID in the attached picture) to increase the space in the storage pool. Yep, that's exactly how you do it. I think the ZFS documentation is not very clear yet (although ZFS is very simple, I get confused) , that is why I'm asking for help in this forum. Thank you in advance and best regards. If you have suggestions to improve the documentation, please let us know. That, by the way, is a very nice picture! eric Huitzi This message posted from opensolaris.org Storage.jpeg ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Is this storage model correct?
Huitzi, Yes, you are correct. You can add more raidz devices in the future as your excellent graphic suggests. A similar zpool add example is described here: http://docs.sun.com/app/docs/doc/817-2271/6mhupg6fu?a=view This new section describes what operations are supported for both raidz and mirrored configurations: http://docs.sun.com/app/docs/doc/817-2271/6mhupg6ft?a=view#gaynr If you have any suggestions for making these sections clearer, please drop me a note. Thanks, Cindy Huitzi wrote: Hi once again and thank you very much for your reply. Here is another thread. I'm planning to deploy a small file server based on ZFS. I want to know if I can start with 2 RAIDs, and add more RAIDs in the future (like the gray RAID in the attached picture) to increase the space in the storage pool. I think the ZFS documentation is not very clear yet (although ZFS is very simple, I get confused) , that is why I'm asking for help in this forum. Thank you in advance and best regards. Huitzi This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: [storage-discuss] Performance expectations of iscsi targets?
Paul, While testing iscsi targets exported from thumpers via 10GbE and imported via 10GbE on T2000s, I am not seeing the throughput I expect, and more importantly there is a tremendous amount of read IO happening on a purely sequential write workload. (Note all systems have Sun 10GbE cards and are running Nevada b65.) The read IO activity you are seeing is a direct result of re-writes on the ZFS storage pool. If you were to recreate the test from scratch, you would notice that on the very first pass of write I/Os from 'dd', there would be no reads. This is an artifact of using zvols as backing store for iSCSI Targets. The iSCSI Target software supports raw SCSI disks, Solaris raw devices (/dev/rdsk/), Solaris block devices (/dev/dsk/...), zvols, SVM volumes, and files in file systems, including temps. Simple write workload (from T2000): # time dd if=/dev/zero of=/dev/rdsk/c6t01144F210ECC2A004675E957d0 bs=64k count=100 A couple of things, maybe missing here, or the commands are not a true cut-n-paste of what is being tested. 1) From the iSCSI initiator, there is no device at /dev/rdsk/c6t01144F210ECC2A004675E957d0; note the missing slice (s0, s1, s2, etc). 2) Even if one was to specify a slice, as in /dev/rdsk/c6t01144F210ECC2A004675E957d0s2, it is unlikely that the LUN has been formatted. When I run format the first time, I get the error message Please run fdisk first. Of course this does not have to be the case, because if the ZFS storage pool that backed this LUN had previously been formatted with either a Solaris VTOC or Intel EFI label, then the disk would show up correctly. Performance of iscsi target pool on new blocks: bash-3.00# zpool iostat thumper1-vdev0 1 thumper1-vdev0 17.4G 2.70T 0 526 0 63.6M thumper1-vdev0 17.5G 2.70T 0 564 0 60.5M thumper1-vdev0 17.5G 2.70T 0 0 0 0 thumper1-vdev0 17.5G 2.70T 0 0 0 0 thumper1-vdev0 17.5G 2.70T 0 0 0 0 Configuration of zpool/iscsi target: # zpool status thumper1-vdev0 pool: thumper1-vdev0 state: ONLINE scrub: none requested config: NAME STATE READ WRITE CKSUM thumper1-vdev0 ONLINE 0 0 0 c0t7d0 ONLINE 0 0 0 c1t7d0 ONLINE 0 0 0 c5t7d0 ONLINE 0 0 0 c6t7d0 ONLINE 0 0 0 c7t7d0 ONLINE 0 0 0 c8t7d0 ONLINE 0 0 0 errors: No known data errors The first thing is that for this pool I was expecting 200-300MB/s throughput, since it is a simple stripe across 6 500G disks. In fact, a direct local workload (directly on thumper1) of the same type confirms what I expected: bash-3.00# dd if=/dev/zero of=/dev/zvol/rdsk/thumper1-vdev0/iscsi bs=64k count=100 bash-3.00# zpool iostat thumper1-vdev0 1 thumper1-vdev0 20.4G 2.70T 0 2.71K 0 335M thumper1-vdev0 20.4G 2.70T 0 2.92K 0 374M thumper1-vdev0 20.4G 2.70T 0 2.88K 0 368M thumper1-vdev0 20.4G 2.70T 0 2.84K 0 363M thumper1-vdev0 20.4G 2.70T 0 2.57K 0 327M The second thing is that when overwriting already-written blocks via the iscsi target (from the T2000) I see a lot of read bandwidth for blocks that are being completely overwritten. This does not seem to slow down the write performance, but 1) it is not seen in the direct case; and 2) it consumes channel bandwidth unnecessarily.
bash-3.00# zpool iostat thumper1-vdev0 1 thumper1-vdev0 8.90G 2.71T 279 783 31.7M 95.9M thumper1-vdev0 8.90G 2.71T 281 318 31.7M 29.1M thumper1-vdev0 8.90G 2.71T 139 0 15.8M 0 thumper1-vdev0 8.90G 2.71T 279 0 31.7M 0 thumper1-vdev0 8.90G 2.71T 139 0 15.8M 0 Can anyone help to explain what I am seeing, or give me some guidance on diagnosing the cause of the following: - The bottleneck in accessing the iscsi target from the T2000 From the iSCSI Initiator's point of view, there are various (negotiated) login parameters, which may have a direct effect on performance. Take a look at iscsiadm list target --verbose, then consult the iSCSI man pages, or the documentation online at docs.sun.com. Remember to keep track of what you change on a per-target basis, change only one parameter at a time, and measure your results. - The cause of the extra read bandwidth when overwriting blocks on the iscsi target from the T2000. ZFS, as the backing store, uses COW (copy-on-write) in maintaining the zvols within the storage pool. Any help is much appreciated, paul Jim Dunham Solaris, Storage Software
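For reference, a sketch of how a zvol-backed target like the one above is typically set up; the pool and zvol names match the output above, but the size and IP address are invented, and shareiscsi needs a build that supports it:

  # on the target (thumper) side: carve out a zvol and export it
  zfs create -V 100g thumper1-vdev0/iscsi
  zfs set shareiscsi=on thumper1-vdev0/iscsi
  iscsitadm list target                        # confirm the target is advertised

  # on the initiator (T2000) side: discover the target and build device nodes
  iscsiadm add discovery-address 10.0.0.1:3260
  iscsiadm modify discovery --sendtargets enable
  devfsadm -i iscsi
  iscsiadm list target --verbose               # the negotiated login parameters mentioned above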
[zfs-discuss] Re: Minimum number of Disks
I also have two trivial questions (just to be sure). Do the disks have to be equal in size for RAID-Z? In a three-disk RAID-Z, can I specify which disk to use for parity?
Re: [zfs-discuss] Is this storage model correct?
I had the same question last week and decided to take a similar approach. Instead of a giant raidz of 6 disks, I created 2 raidz vdevs of 3 disks each. So when I want to add more storage, I just add 3 more disks. On 6/19/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: [...]
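The grow-by-another-set step itself is just a zpool add, something like this with placeholder devices:

  zpool add tank raidz c2t0d0 c2t1d0 c2t2d0   # appends another 3-disk raidz top-level vdev
  zpool status tank                           # the new vdev shows up alongside the existing ones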
[zfs-discuss] Slow write speed to ZFS pool (via NFS)
I have a couple of performance questions. Right now, I am transferring about 200GB of data via NFS to my new Solaris server. I started this YESTERDAY. When writing to my ZFS pool via NFS, I notice what I believe to be slow write speeds. My client hosts vary between a MacBook Pro running Tiger to a FreeBSD 6.2 Intel server. All clients are connected to the a 10/100/1000 switch. * Is there anything I can tune on my server? * Is the problem with NFS? * Do I need to provide any other information? PERFORMANCE NUMBERS: (The file transfer is still going on) bash-3.00# zpool iostat 5 capacity operationsbandwidth pool used avail read write read write -- - - - - - - tank 140G 1.50T 13 91 1.45M 2.60M tank 140G 1.50T 0 89 0 1.42M tank 140G 1.50T 0 89 1.40K 1.40M tank 140G 1.50T 0 94 0 1.46M tank 140G 1.50T 0 85 1.50K 1.35M tank 140G 1.50T 0101 0 1.47M tank 140G 1.50T 0 90 0 1.35M tank 140G 1.50T 0 84 0 1.37M tank 140G 1.50T 0 90 0 1.39M tank 140G 1.50T 0 90 0 1.43M tank 140G 1.50T 0 91 0 1.40M tank 140G 1.50T 0 91 0 1.43M tank 140G 1.50T 0 90 1.60K 1.39M bash-3.00# zpool iostat -v capacity operationsbandwidth pool used avail read write read write -- - - - - - - tank 141G 1.50T 13 91 1.45M 2.59M raidz170.3G 768G 6 45 793K 1.30M c3d0- - 3 43 357K 721K c4d0- - 3 42 404K 665K c6d0- - 3 43 404K 665K raidz170.2G 768G 6 45 692K 1.30M c3d1- - 3 42 354K 665K c4d1- - 3 42 354K 665K c5d0- - 3 43 354K 665K -- - - - - - - I also decided to time a local filesystem write test: bash-3.00# time dd if=/dev/zero of=/data/testfile bs=1024k count=1000 1000+0 records in 1000+0 records out real0m16.490s user0m0.012s sys 0m2.547s SERVER INFORMATION: Solaris 10 U3 Intel Pentium 4 3.0GHz 2GB RAM Intel NIC (e1000g0) 1x 80 GB ATA drive for OS - 6x 300GB SATA drives for /data c3d0 - Sil3112 PCI SATA card port 1 c3d1 - Sil3112 PCI SATA card port 2 c4d0 - Sil3112 PCI SATA card port 3 c4d1 - Sil3112 PCI SATA card port 4 c5d0 - Onboard Intel SATA c6d0 - Onboard Intel SATA DISK INFORMATION: bash-3.00# format Searching for disks...done AVAILABLE DISK SELECTIONS: 0. c1d0 DEFAULT cyl 9961 alt 2 hd 255 sec 63 /[EMAIL PROTECTED],0/[EMAIL PROTECTED],1/[EMAIL PROTECTED]/[EMAIL PROTECTED],0 1. c3d0 Maxtor 6-XXX-0001-279.48GB /[EMAIL PROTECTED],0/pci8086,[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED],0 2. c3d1 Maxtor 6-XXX-0001-279.48GB /[EMAIL PROTECTED],0/pci8086,[EMAIL PROTECTED]/[EMAIL PROTECTED] /[EMAIL PROTECTED]/[EMAIL PROTECTED],0 3. c4d0 Maxtor 6-XXX-0001-279.48GB /[EMAIL PROTECTED],0/pci8086,[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED],0 4. c4d1 Maxtor 6-XXX-0001-279.48GB /[EMAIL PROTECTED],0/pci8086, [EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED],0 5. c5d0 Maxtor 6-XXX-0001-279.48GB /[EMAIL PROTECTED],0/[EMAIL PROTECTED],2/[EMAIL PROTECTED]/[EMAIL PROTECTED],0 6. 
c6d0 Maxtor 6-XXX-0001-279.48GB /[EMAIL PROTECTED],0/[EMAIL PROTECTED] ,2/[EMAIL PROTECTED]/[EMAIL PROTECTED],0 Specify disk (enter its number): ^C (XXX = drive serial number) ZPOOL CONFIGURATION: bash-3.00# zpool list NAMESIZEUSED AVAILCAP HEALTH ALTROOT tank 1.64T140G 1.50T 8% ONLINE - bash-3.00# zpool status pool: tank state: ONLINE scrub: scrub completed with 0 errors on Tue Jun 19 07:33:05 2007 config: NAMESTATE READ WRITE CKSUM tankONLINE 0 0 0 raidz1ONLINE 0 0 0 c3d0ONLINE 0 0 0 c4d0ONLINE 0 0 0 c6d0ONLINE 0 0 0 raidz1ONLINE 0 0 0 c3d1ONLINE 0 0 0 c4d1ONLINE 0 0 0 c5d0ONLINE 0 0 0 errors: No known data errors ZFS Configuration: bash-3.00# zfs list NAME USED AVAIL REFER MOUNTPOINT tank 93.3G 1006G 32.6K /tank tank/data 93.3G 1006G 93.3G /data ___ zfs-discuss mailing list
Re: [zfs-discuss] Re: Minimum number of Disks
I also have two trivial questions (just to be sure). Do the disks have to be equal in size for RAID-Z? Not really. But just like most raid5 implementations, only the amount of space on the smallest disk (or other storage object) can be used on all the components. The extra space on the other objects will not be used. So if one disk is much smaller than the others, you lose a lot of space. 400, 400, 450 = 3x400 = 1200 raw before overhead (lose 50 of 1250 avail) 100, 400, 450 = 3x100 = 300 raw before overhead (lose 650 of 950 avail) In a three disks RAID-Z, can I specify which disk to use for parity? No. Even in raid-5, there is no single parity disk. The parity is spread throughout the disks. In raidz, there is also no single disk dedicated to parity. As each write occurs, it will place data and parity blocks on disk, but it will not try to place parity on any disk in particular (it will just make sure that it is different from the location holding the data). -- Darren Dunham [EMAIL PROTECTED] Senior Technical Consultant TAOShttp://www.taos.com/ Got some Dr Pepper? San Francisco, CA bay area This line left intentionally blank to confuse you. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: ZFS - SAN and Raid
Victor Engle wrote: The best practices guide on opensolaris does recommend replicated pools even if your backend storage is redundant. There are at least 2 good reasons for that. ZFS needs a replica for the self healing feature to work. Also there is no fsck like tool for ZFS so it is a good idea to make sure self healing can work. NB. fsck is not needed for ZFS because the on-disk format is always consistent. This is orthogonal to hardware faults. I understand that the on disk state is always consistent but the self healing feature can correct blocks that have bad checksums if zfs is able to retrieve the block from a good replica. Yes. That is how it works. By default, metadata is replicated. For real data, you can use copies, mirroring, or raidz[12]. So even though the filesystem is consistent, the data can be corrupt in non-redundant pools. No. If the data is corrupt and cannot be reconstructed, it is lost. Recall that UFS's fsck only corrects file system metadata, not real data. Most file systems which have any kind of performance work this way. ZFS is safer: because of COW, ZFS won't overwrite existing data leading to corruption -- but other file systems can (e.g. UFS). I am unsure of what happens with a non-redundant pool when a block has a bad checksum and perhaps you could clear that up. Does this cause a problem for the pool or is it limited to the file or files affected by the bad block and otherwise the pool is online and healthy. It depends on where the bad block is. If it isn't being used, no foul[1]. If it is metadata, then we recover because of redundant metadata. If it is in a file with no redundancy (copies=1, by default) then an error will be logged to FMA and the file name is visible to zpool status. You can decide if that file is important to you. This is an area where there is continuing development, far beyond what ZFS alone can do. The ultimate goal is that we get to the point where most faults can be tolerated. No rest for the weary :-) [1] this is different than software RAID systems which don't know if a block is being used or not. In ZFS, we only care about faults in blocks which are being used, for the most part. -- richard
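For a pool with no redundant vdevs, the copies property mentioned above is the lightweight way to give file data a second copy for self-healing to work from; a sketch with an example filesystem name, keeping in mind it only affects data written after the property is set:

  zfs set copies=2 tank/important
  zfs get copies tank/important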
[zfs-discuss] Re: Slow write speed to ZFS pool (via NFS)
I have a very similar setup on opensolaris b62 - 5 disks on raidz on one onboard sata port and four 3112-based ports. I have noticed that although this card seems like a nice cheap one, it is only two channels, so therein lies a huge performance decrease. I have thought about getting another card so that there is no contention on the sata channels. -o This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: Slow write speed to ZFS pool (via NFS)
Correction: the SATA controller is a Silicon Image 3114, not a 3112. On 6/19/07, Joe S [EMAIL PROTECTED] wrote: [...]
Re: [zfs-discuss] Slow write speed to ZFS pool (via NFS)
Joe S wrote: I have a couple of performance questions. Right now, I am transferring about 200GB of data via NFS to my new Solaris server. I started this YESTERDAY. When writing to my ZFS pool via NFS, I notice what I believe to be slow write speeds. My client hosts vary between a MacBook Pro running Tiger to a FreeBSD 6.2 Intel server. All clients are connected to the a 10/100/1000 switch. * Is there anything I can tune on my server? * Is the problem with NFS? * Do I need to provide any other information? If you have a lot of small files, doing this sort of thing over NFS can be pretty painful... for a speedup, consider: (cd oldroot on client; tar cf - .) | ssh [EMAIL PROTECTED] '(cd newroot on server; tar xf -)' - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: ZFS - SAN and Raid
Thanks for all your replies. Lots of info to take back. In this case it seems like emcp carried out a repair to a path to the LUN, followed by a panic. Jun 4 16:30:12 su621dwdb emcp: [ID 801593 kern.notice] Info: Assigned volume Symm 000290100491 vol 0ffe to I don't think a panic should be the answer in this type of scenario, as there is a redundant path to the LUN and hardware RAID is in place inside the SAN. From what I gather there is work being carried out to find a better solution. What is the proposed solution, or when it will be available, is the question. Thanks again. Roshan - Original Message - From: Richard Elling [EMAIL PROTECTED] Date: Tuesday, June 19, 2007 6:28 pm Subject: Re: [zfs-discuss] Re: ZFS - SAN and Raid To: Victor Engle [EMAIL PROTECTED] Cc: Bruce McAlister [EMAIL PROTECTED], zfs-discuss@opensolaris.org, Roshan Perera [EMAIL PROTECTED] [...]
[zfs-discuss] Best practice for moving FS between pool on same machine?
What is the best (meaning fastest) way to move a large file system from one pool to another pool on the same machine? I have a machine with two pools. One pool currently has all my data (4 filesystems), but it's misconfigured. Another pool is configured correctly, and I want to move the file systems to the new pool. Should I use 'rsync' or 'zfs send'? What happened is I forgot I couldn't incrementally add raid devices. I want to end up with two raidz(x4) vdevs in the same pool. Here's what I have now: B# zpool status pool: dbxpool state: ONLINE scrub: none requested config: NAME STATE READ WRITE CKSUM dbxpool ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c1t4d0 ONLINE 0 0 0 c2t6d0 ONLINE 0 0 0 c2t1d0 ONLINE 0 0 0 c2t4d0 ONLINE 0 0 0 errors: No known data errors pool: dbxpool2 state: ONLINE scrub: resilver completed with 0 errors on Tue Jun 19 15:16:19 2007 config: NAME STATE READ WRITE CKSUM dbxpool2 ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c1t2d0 ONLINE 0 0 0 c1t5d0 ONLINE 0 0 0 c2t5d0 ONLINE 0 0 0 c1t1d0 ONLINE 0 0 0 errors: No known data errors --- 'dbxpool' has all my data today. Here are my steps: 1. move data to dbxpool2 2. remount using dbxpool2 3. destroy dbxpool 4. create a new proper raidz vdev inside dbxpool2 using the devices from dbxpool Any advice? I'm constrained by trying to minimize the downtime for the group of people using this as their file server. So I ended up with an ad-hoc assignment of devices. I'm not worried about optimizing my controller traffic at the moment.
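zfs send is usually the faster and more faithful route for whole filesystems; a sketch of the whole move, where the pool names come from the output above but the filesystem and snapshot names are made up:

  zfs snapshot -r dbxpool@move                           # consistent point-in-time across all 4 filesystems
  zfs send dbxpool/fs1@move | zfs recv dbxpool2/fs1      # repeat (or script) for each filesystem
  # cut the users over to dbxpool2, then reuse the old disks:
  zpool destroy dbxpool
  zpool add dbxpool2 raidz c1t4d0 c2t6d0 c2t1d0 c2t4d0   # second raidz(x4) vdev from the freed devices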
[zfs-discuss] ZFS Scalability/performance
Hello, I'm quite interested in ZFS, like everybody else I suppose, and am about to install FBSD with ZFS. On that note, I have a different first question to start with. I personally am a Linux fanboy, and would love to see/use ZFS on Linux. I assume that I can use those ZFS disks later with any OS that recognizes/works with ZFS, correct? E.g. I can install/set up ZFS in FBSD, and later use it in OpenSolaris/Linux FUSE (native)? Anyway, back to business :) I have a whole bunch of disks of different sizes/speeds. E.g. 3 300GB disks @ 40MB/s, a 320GB disk @ 60MB/s, 3 120GB disks @ 50MB/s and so on. RAID-Z and ZFS claim to be uber scalable and all that, but would it 'just work' with a setup like that too? I used to match up partition sizes in Linux, so I'd make the 320GB disk into 2 partitions of 300GB and 20GB, then use the 4 300GB partitions as a raid5, same with the 120GB disks, and use the scrap on those as well, finally stitching everything together with LVM2. I can't easily find how this would work with RAID-Z/ZFS, e.g. can I really just put all these disks in 1 big pool and remove/add to it at will? And do I really not need software raid, yet still have the same reliability with raid-z as I had with raid-5? What about hardware raid controllers: just use them as JBOD devices, or would I use them to match up disk sizes in raid0 stripes (e.g. the 3x 120GB to make a 360GB raid0)? Or would you recommend to just stick with raid/lvm/reiserfs and use that? thanks, Oliver
[zfs-discuss] ZFS version 5 to version 6 fails to import or upgrade
How do you upgrade from version 5 to version 6, I had created this under snv_62 and it zpool called zones worked with snv_b63 and 10u4beta now under snv_b66 I get error and the upgrade option does not work: any ideas? bash-3.00# df / (/dev/dsk/c0d0s0 ): 6819012 blocks 765336 files /devices (/devices ): 0 blocks0 files /dev (/dev ): 0 blocks0 files /system/contract (ctfs ): 0 blocks 2147483618 files /proc (proc ): 0 blocks29920 files /etc/mnttab(mnttab): 0 blocks0 files /etc/svc/volatile (swap ):18815312 blocks 283706 files /system/object (objfs ): 0 blocks 2147483442 files /etc/dfs/sharetab (sharefs ): 0 blocks 2147483646 files /lib/libc.so.1 (/usr/lib/libc/libc_hwcap2.so.1): 6819012 blocks 765336 files /dev/fd(fd): 0 blocks0 files /tmp (swap ):18815312 blocks 283706 files /var/run (swap ):18815312 blocks 283706 files /export(/dev/dsk/c0d1s4 ): 7231850 blocks 990233 files /media/DAYS_OF_HEAVEN(/dev/dsk/c1t0d0s2 ): 0 blocks0 files bash-3.00# spool import bash: spool: command not found bash-3.00# zpool import pool: zones id: 4567711835620380868 state: ONLINE status: The pool is formatted using an older on-disk version. action: The pool can be imported using its name or numeric identifier, though some features will not be available without an explicit 'zpool upgrade'. config: zones ONLINE c0d1s5ONLINE bash-3.00# df -k /zones Filesystemkbytesused avail capacity Mounted on /dev/dsk/c0d0s0 8068883 4659811 332838459%/ bash-3.00# zpool upgrade This system is currently running ZFS version 6. All pools are formatted using this version. bash-3.00# zpool upgrade -a zones -a option is incompatible with other arguments usage: upgrade upgrade -v upgrade -a | pool bash-3.00# zpool upgrade -s zones invalid option 's' usage: upgrade upgrade -v upgrade -a | pool bash-3.00# zpool upgrade zones This system is currently running ZFS version 6. cannot open 'zones': no such pool bash-3.00# This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS version 5 to version 6 fails to import or upgrade
On Tue, Jun 19, 2007 at 07:16:06PM -0700, John Brewer wrote: bash-3.00# zpool import pool: zones id: 4567711835620380868 state: ONLINE status: The pool is formatted using an older on-disk version. action: The pool can be imported using its name or numeric identifier, though some features will not be available without an explicit 'zpool upgrade'. config: zones ONLINE c0d1s5ONLINE zpool import lists the pools available for import. Maybe you need to actually _import_ the pool first before you can upgrade. -- albert chin ([EMAIL PROTECTED]) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
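In other words, something like this (pool name taken from the output above):

  zpool import zones    # attach the pool to the running system first
  zpool upgrade zones   # then move it from on-disk version 5 to 6
  zpool upgrade -v      # lists the versions this build understands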
Re: [zfs-discuss] Z-Raid performance with Random reads/writes
michael T sedwick wrote: Given a 1.6TB ZFS RAID-Z consisting of 6 disks, and a system that does an extreme amount of small (20K) random reads (more than twice as many reads as writes): 1) What performance gains, if any, does RAID-Z offer over other RAID or large filesystem configurations? For magnetic disk drives, RAID-Z performance for small, random reads will approximate the performance of a single disk, regardless of the number of disks in the set. The writes will not be random, so it should perform decently for writes. 2) What hindrance, if any, is RAID-Z to this configuration, given the complete randomness and size of these accesses? ZFS must read the entire RAID-Z stripe to verify the checksum. Would there be a better means of configuring a ZFS environment for this type of activity? In general, mirrors with dynamic stripes will offer better performance and RAS than RAID-Z. -- richard
Re: [zfs-discuss] Z-Raid performance with Random reads/writes
michael T sedwick wrote: Given a 1.6TB ZFS Z-Raid consisting of 6 disks, and a system that does an extreme amount of small (20K) random reads (more than twice as many reads as writes): 1) What performance gains, if any, does Z-Raid offer over other RAID or large filesystem configurations? 2) What hindrance, if any, is Z-Raid to this configuration, given the complete randomness and size of these accesses? Would there be a better means of configuring a ZFS environment for this type of activity? thanks; A 6 disk raidz set is not optimal for random reads, since each disk in the raidz set needs to be accessed to retrieve each item. Note that if the reads are single threaded, this doesn't apply. However, if multiple reads are extant at the same time, configuring the disks as 2 sets of 3-disk raidz vdevs or 3 pairs of mirrored disks will result in 2x and 3x (approx) total parallel random read throughput. - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
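For illustration, a sketch of the two alternative layouts described above; the pool name "tank" and the device names c0t0d0 through c0t5d0 are placeholders:

    # two 3-disk raidz vdevs in one pool
    zpool create tank raidz c0t0d0 c0t1d0 c0t2d0 raidz c0t3d0 c0t4d0 c0t5d0

    # or, alternatively, three mirrored pairs in one pool
    zpool create tank mirror c0t0d0 c0t1d0 mirror c0t2d0 c0t3d0 mirror c0t4d0 c0t5d0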
Re: [zfs-discuss] Re: ZFS - SAN and Raid
On Wed, Jun 20, 2007 at 11:16:39AM +1000, James C. McPherson wrote: Roshan Perera wrote: I don't think panic should be the answer in this type of scenario, as there is a redundant path to the LUN and hardware RAID is in place inside the SAN. From what I gather there is work being carried out to find a better solution. What the proposed solution is, or when it will be available, is the question. But Roshan, if your pool is not replicated from ZFS' point of view, then all the multipathing and raid controller backup in the world will not make a difference. If the multipathing is working correctly, and one path to the data remains intact, the SCSI level should retry the failed write successfully. This certainly happens with UFS on our fibre-channel SAN. There's usually a SCSI bus reset message along with a message about the failover to the other path. Of course, once the SCSI level exhausts its retries, something else has to happen, just as it would with a physical disk. This must be when ZFS causes a panic. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
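One way to avoid that failure mode, following the point that the pool must be replicated from ZFS' point of view, is to give ZFS its own redundancy on top of the array, for example by mirroring across two LUNs presented over different paths or from different array controllers. A sketch only; the pool name and LUN device names are placeholders:

    zpool create tank mirror c2t0d0 c3t0d0   # ZFS-level mirror across two SAN LUNs
    zpool status -x tank                     # with redundancy, ZFS can repair (not just report) checksum errors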
Re: [zfs-discuss] ZFS Scalability/performance
Oliver Schinagl wrote: Hello, I'm quite interested in ZFS, like everybody else I suppose, and am about to install FBSD with ZFS. cool. On that note, I have a different first question to start with. I personally am a Linux fanboy, and would love to see/use ZFS on Linux. I assume that I can use those ZFS disks later with any OS that recognizes ZFS, correct? e.g. I can install/set up ZFS in FBSD, and later use it in OpenSolaris or Linux FUSE (native)? The on-disk format is an available specification and is designed to be platform neutral. We certainly hope you will be able to access the zpools from different OSes (one at a time). Anyway, back to business :) I have a whole bunch of disks of different sizes and speeds. E.g. 3 300GB disks @ 40MB/s, a 320GB disk @ 60MB/s, 3 120GB disks @ 50MB/s, and so on. Raid-Z and ZFS claim to be uber-scalable and all that, but would it 'just work' with a setup like that too? Yes, for most definitions of 'just work.' I used to match up partition sizes in Linux, so make the 320GB disk into 2 partitions of 300 and 20GB, then use the 4 300GB partitions as a RAID-5, same with the 120s, and use the scraps on those as well, finally stitching everything together with LVM2. I can't easily find how this would work with raid-Z/ZFS, e.g. can I really just put all these disks in 1 big pool and remove/add to it at will? Yes is the simple answer. But we generally recommend planning. To begin your plan, decide your priority: space, performance, data protection. ZFS is very dynamic, which has the property that for redundancy schemes (mirror, raidz[12]) it will use as much space as possible. For example, if you mirror a 1 GByte drive with a 2 GByte drive, then you will have available space of 1 GByte. If you later replace the 1 GByte drive with a 4 GByte drive, then you will instantly have the available space of 2 GBytes. If you replace the 2 GByte drive with an 8 GByte drive, you will instantly have access to 4 GBytes of mirrored data. And I really don't need to use software RAID, yet still have the same reliability with raid-z as I had with RAID-5? raidz is more reliable than software raid-5. What about hardware RAID controllers: just use them as JBOD devices, or would I use them to match up disk sizes in RAID-0 stripes (e.g. the 3x 120GB to make a 360GB raid0)? ZFS is dynamic. Or would you recommend just sticking with raid/lvm/reiserfs and using that? ZFS rocks! -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
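A sketch of the mirror-growing behaviour described above, with placeholder pool and device names:

    zpool create tank mirror c1t0d0 c1t1d0   # e.g. a 1 GByte disk mirrored with a 2 GByte disk: 1 GByte usable
    zpool replace tank c1t0d0 c1t2d0         # swap the 1 GByte disk for a 4 GByte one; ZFS resilvers onto it
    zpool list tank                          # usable space tracks the smaller side of the mirror: now 2 GBytes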
Re: [zfs-discuss] Z-Raid performance with Random reads/writes
Bart Smaalders wrote: michael T sedwick wrote: Given a 1.6TB ZFS Z-Raid consisting of 6 disks, and a system that does an extreme amount of small (20K) random reads (more than twice as many reads as writes): 1) What performance gains, if any, does Z-Raid offer over other RAID or large filesystem configurations? 2) What hindrance, if any, is Z-Raid to this configuration, given the complete randomness and size of these accesses? Would there be a better means of configuring a ZFS environment for this type of activity? thanks; A 6 disk raidz set is not optimal for random reads, since each disk in the raidz set needs to be accessed to retrieve each item. Note that if the reads are single threaded, this doesn't apply. However, if multiple reads are extant at the same time, configuring the disks as 2 sets of 3-disk raidz vdevs or 3 pairs of mirrored disks will result in 2x and 3x (approx) total parallel random read throughput. I'm not sure why, but when I was testing various configurations with bonnie++, 3 pairs of mirrors did give about 3x the random read performance of a 6 disk raidz, but with 4 pairs, the random read performance dropped by 50%:

    3x2  Block read: 220464   Random read: 1520.1
    4x2  Block read: 295747   Random read: 765.3

Ian ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
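For reference, a comparison along these lines could be run with bonnie++ roughly as follows. This is a sketch only; the pool name, file size, and user are placeholders, and -n 0 skips the small-file creation tests:

    zfs create tank/bench                            # scratch filesystem on the pool under test
    chown nobody /tank/bench                         # bonnie++ will not run as root unless a user is given
    bonnie++ -d /tank/bench -s 8192 -n 0 -u nobody   # 8 GB of sequential and random I/O
    zfs destroy tank/bench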
Re: [zfs-discuss] Z-Raid performance with Random reads/writes
Ian Collins wrote: Bart Smaalders wrote: michael T sedwick wrote: Given a 1.6TB ZFS Z-Raid consisting of 6 disks, and a system that does an extreme amount of small (20K) random reads (more than twice as many reads as writes): 1) What performance gains, if any, does Z-Raid offer over other RAID or large filesystem configurations? 2) What hindrance, if any, is Z-Raid to this configuration, given the complete randomness and size of these accesses? Would there be a better means of configuring a ZFS environment for this type of activity? thanks; A 6 disk raidz set is not optimal for random reads, since each disk in the raidz set needs to be accessed to retrieve each item. Note that if the reads are single threaded, this doesn't apply. However, if multiple reads are extant at the same time, configuring the disks as 2 sets of 3-disk raidz vdevs or 3 pairs of mirrored disks will result in 2x and 3x (approx) total parallel random read throughput. I'm not sure why, but when I was testing various configurations with bonnie++, 3 pairs of mirrors did give about 3x the random read performance of a 6 disk raidz, but with 4 pairs, the random read performance dropped by 50%:

    3x2  Block read: 220464   Random read: 1520.1
    4x2  Block read: 295747   Random read: 765.3

Ian

Interesting. I wonder if the blocks being read were striped across two mirror pairs; this would result in having to read 2 sets of mirror pairs, which would produce the reported results... - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] marvell88sx error in command 0x2f: status 0x51
With no observed side effects, `dmesg` reports lots of:

    kern.warning] WARNING: marvell88sx1: port 3: error in command 0x2f: status 0x51

Seen in snv_62 and opensol-b66. Perhaps this is http://bugs.opensolaris.org/view_bug.do?bug_id=6539787 ? Can someone post part of the headers, even if the code is closed? Rob ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Z-Raid performance with Random reads/writes
Bart Smaalders wrote: Ian Collins wrote: Bart Smaalders wrote: A 6 disk raidz set is not optimal for random reads, since each disk in the raidz set needs to be accessed to retrieve each item. Note that if the reads are single threaded, this doesn't apply. However, if multiple reads are extant at the same time, configuring the disks as 2 sets of 3-disk raidz vdevs or 3 pairs of mirrored disks will result in 2x and 3x (approx) total parallel random read throughput. Actually, with 6 disks as 3 mirrored pairs, you should get around 6x the random read iops of a 6-disk raidz[2], because each side of the mirror can fulfill different read requests. We use the checksum to verify correctness, so we don't need to read the same data from both sides of the mirror. I'm not sure why, but when I was testing various configurations with bonnie++, 3 pairs of mirrors did give about 3x the random read performance of a 6 disk raidz, but with 4 pairs, the random read performance dropped by 50%: Interesting. I wonder if the blocks being read were striped across two mirror pairs; this would result in having to read 2 sets of mirror pairs, which would produce the reported results... Each block is entirely[*] on one top-level vdev (ie, mirrored pair in this case), so that would not happen. The observed performance degradation remains a mystery. --matt [*] assuming you have enough contiguous free space. On nearly-full pools, performance can suffer due to (among other things) gang blocks, which essentially break large blocks into several smaller blocks if there isn't enough contiguous free space for the large block. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
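One way to see whether reads really do spread across the top-level vdevs during such a test is to watch per-vdev statistics while the benchmark runs; the pool name here is a placeholder:

    zpool iostat -v tank 5    # per-vdev read ops/sec and bandwidth, sampled every 5 seconds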
Re: [zfs-discuss] Z-Raid performance with Random reads/writes
OK... Is all this 3x / 6x potential performance boost still going to hold true in a single-controller scenario? Hardware is X4100s (Solaris 10) with a 6-disk raidz on external 3320s. I seem to remember (wait... checking notes...) correct: the ZFS filesystem is at 50% capacity. This info could also help me understand why I was seeing 'sched' running a lot as well... ---michael === )_( Matthew Ahrens wrote: Bart Smaalders wrote: Ian Collins wrote: Bart Smaalders wrote: A 6 disk raidz set is not optimal for random reads, since each disk in the raidz set needs to be accessed to retrieve each item. Note that if the reads are single threaded, this doesn't apply. However, if multiple reads are extant at the same time, configuring the disks as 2 sets of 3-disk raidz vdevs or 3 pairs of mirrored disks will result in 2x and 3x (approx) total parallel random read throughput. Actually, with 6 disks as 3 mirrored pairs, you should get around 6x the random read iops of a 6-disk raidz[2], because each side of the mirror can fulfill different read requests. We use the checksum to verify correctness, so we don't need to read the same data from both sides of the mirror. I'm not sure why, but when I was testing various configurations with bonnie++, 3 pairs of mirrors did give about 3x the random read performance of a 6 disk raidz, but with 4 pairs, the random read performance dropped by 50%: Interesting. I wonder if the blocks being read were striped across two mirror pairs; this would result in having to read 2 sets of mirror pairs, which would produce the reported results... Each block is entirely[*] on one top-level vdev (ie, mirrored pair in this case), so that would not happen. The observed performance degradation remains a mystery. --matt [*] assuming you have enough contiguous free space. On nearly-full pools, performance can suffer due to (among other things) gang blocks, which essentially break large blocks into several smaller blocks if there isn't enough contiguous free space for the large block. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
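A rough way to check whether a single controller is the limiting factor is to watch per-device statistics with the standard Solaris tools while the workload runs; nothing here is specific to the setup above:

    iostat -xn 5    # per-device %b and asvc_t; if every disk behind the one controller is busy
                    # while the CPUs stay mostly idle, the controller or bus is the likely bottleneck

High CPU time attributed to 'sched' in such a situation usually just reflects kernel threads doing the pool I/O rather than a runaway user process.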