Re: [zfs-discuss] zfs iscsi storage for virtual machines
> I have hundreds of Xen-based virtual machines running off a ZFS/iSCSI service; yes, it's viable. I can't speak for CentOS specifically; our infrastructure is using Debian Etch with our own build of Xen.

How does ZFS handle snapshots of large files like VM images? Is replication done at the bit/block level or by file? In other words, does a snapshot of a changed VM image take up the same amount of space as the image, or only the space of the bits that have changed within the image?

I'm also strongly considering going with NFS or AFS instead of iSCSI so I don't have to deal with managing an extra filesystem layer. The VMware community is split on which is faster. Are there any significant benefits to either one on the ZFS side?
Re: [zfs-discuss] zfs iscsi storage for virtual machines
> How does ZFS handle snapshots of large files like VM images? Is replication done at the bit/block level or by file? In other words, does a snapshot of a changed VM image take up the same amount of space as the image, or only the space of the bits that have changed within the image?

ZFS uses copy-on-write to implement snapshots, so no replication is done. When changes are made, only the changed blocks take up new space (the original blocks are kept by the snapshot).

Neil.
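A minimal sketch of what Neil describes, using a hypothetical pool/dataset named tank/vmimages (the names and sizes are illustrative, not from the thread):

    # snapshot the dataset holding the VM images
    zfs snapshot tank/vmimages@before-patch

    # a snapshot initially consumes almost no space; its USED column
    # grows only as blocks inside the images are overwritten
    zfs list -t snapshot -r tank/vmimages

So a 20 GB image whose guest rewrites 500 MB after the snapshot costs roughly 500 MB of additional pool space, not another 20 GB.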
Re: [zfs-discuss] zfs iscsi storage for virtual machines
[EMAIL PROTECTED] wrote on 17/07/2007 05:12:49 AM:

> I'm going to be setting up about 6 virtual machines (Windows & Linux) in either VMware Server or Xen on a CentOS 5 box. I'd like to connect to a ZFS iSCSI target to store the VM images and be able to use ZFS snapshots for backup. I have no experience with ZFS, so I have a couple of questions before I move forward.
> 1. Is this a feasible setup? If not, is there any way to make something like this work reliably?
> 2. Since I'd most likely want to restore single machines at a time, is it best to have a zpool for each machine?
> Any insight is appreciated.

I have hundreds of Xen-based virtual machines running off a ZFS/iSCSI service; yes, it's viable. I can't speak for CentOS specifically; our infrastructure is using Debian Etch with our own build of Xen.

Your success criteria are:

1. Staggering virtual machine cron entries to avoid high I/O contention.
2. Reliable gigabit/10-gigabit switching infrastructure. Skimp on this and your project is sunk. Use a high-quality managed switch, good NICs, and well-manufactured cables.
3. If using a smart, battery-backed, order-preserving storage array, append "set zfs:zfs_nocacheflush = 1" (sans quotes) to /etc/system.

Splitting your storage into multiple zpools will just cause wastage and administrative complexity. I strongly advise using a single zpool.

JG
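A rough sketch of the single-pool layout JG recommends, plus the /etc/system tuning he mentions. The pool, disk, and filesystem names are hypothetical, and zfs_nocacheflush should only be set when the array really has a battery-backed, order-preserving write cache:

    # one pool for all guests, built from mirrored LUNs
    zpool create tank mirror c2t0d0 c3t0d0 mirror c2t1d0 c3t1d0

    # one dataset (or zvol) per guest, so each VM can be snapshotted
    # and rolled back independently within the same pool
    zfs create tank/vm-web01
    zfs create tank/vm-db01

    # /etc/system (takes effect after a reboot)
    set zfs:zfs_nocacheflush = 1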
[zfs-discuss] zfs iscsi storage for virtual machines
I'm going to be setting up about 6 virtual machines (Windows & Linux) in either VMware Server or Xen on a CentOS 5 box. I'd like to connect to a ZFS iSCSI target to store the VM images and be able to use ZFS snapshots for backup. I have no experience with ZFS, so I have a couple of questions before I move forward.

1. Is this a feasible setup? If not, is there any way to make something like this work reliably?
2. Since I'd most likely want to restore single machines at a time, is it best to have a zpool for each machine?

Any insight is appreciated.

-- Pete
Re: [zfs-discuss] zfs iscsi storage for virtual machines
I had originally considered something similar, but... for ZFS snapshot abilities, I am leaning more towards ZFS-hosted NFS... Most of the other VMs (FreeBSD, for example) can install onto NFS, it wouldn't actually be going over the network, and it would allow file-level restore instead of drive-level restore.

Just my untested 2 cents,
Malachi

On 7/16/07, Peter Baumgartner [EMAIL PROTECTED] wrote:
> I'm going to be setting up about 6 virtual machines (Windows & Linux) in either VMware Server or Xen on a CentOS 5 box. I'd like to connect to a ZFS iSCSI target to store the VM images and be able to use ZFS snapshots for backup. I have no experience with ZFS, so I have a couple of questions before I move forward.
> 1. Is this a feasible setup? If not, is there any way to make something like this work reliably?
> 2. Since I'd most likely want to restore single machines at a time, is it best to have a zpool for each machine?
> Any insight is appreciated.
> -- Pete
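Malachi's NFS approach can be sketched in a few commands (a hypothetical example; the pool, dataset, and share options are not from the thread):

    # one filesystem per guest, exported over NFS
    zfs create tank/vm
    zfs create tank/vm/freebsd01
    zfs set sharenfs=rw tank/vm/freebsd01

    # file-level recovery: snapshot the dataset and copy individual
    # files back out of the snapshot directory if needed
    zfs snapshot tank/vm/freebsd01@nightly
    ls /tank/vm/freebsd01/.zfs/snapshot/nightly/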
Re: [zfs-discuss] zfs iscsi storage for virtual machines
Peter Baumgartner wrote:
> I'm going to be setting up about 6 virtual machines (Windows & Linux) in either VMware Server or Xen on a CentOS 5 box. I'd like to connect to a ZFS iSCSI target to store the VM images and be able to use ZFS snapshots for backup. I have no experience with ZFS, so I have a couple of questions before I move forward.
> 1. Is this a feasible setup? If not, is there any way to make something like this work reliably?

Use some sort of data redundancy on the data. The simplest is a mirrored zpool.

> 2. Since I'd most likely want to restore single machines at a time, is it best to have a zpool for each machine?

I'd recommend one zpool, multiple file systems. That way you can manage each file system (iSCSI target) separately, but still have the flexibility of a large zpool.

-- richard
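On builds of that era the iSCSI targets would typically be ZFS volumes (zvols) inside that single pool. A hedged sketch with hypothetical names and sizes, assuming the old shareiscsi property is available on the release in use:

    zpool create tank mirror c1t0d0 c1t1d0

    # one zvol per guest; each is exported as its own iSCSI target
    zfs create -V 20g tank/vm-win01
    zfs set shareiscsi=on tank/vm-win01

    # per-machine backup/restore then works at the zvol level
    zfs snapshot tank/vm-win01@backup
    zfs rollback tank/vm-win01@backup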
Re: [zfs-discuss] ZFS and Storage
On Wed, Jun 28, 2006 at 03:30:28PM +0200, Robert Milkowski wrote:
> > What I wanted to point out is Al's example: he wrote about damaged data. The data were damaged by the firmware, _not_ the disk surface! In such a case ZFS doesn't help. ZFS can detect (and repair) errors on the disk surface, bad cables, etc., but it cannot detect and repair errors in its own (ZFS) code.
>
> Not in its own code, but definitely in the firmware code in a controller.

As Jeff pointed out: if you mirror across two different storage arrays.

przemol
Re: [zfs-discuss] ZFS and Storage
On Thu, Jun 29, 2006 at 10:01:15AM +0200, Robert Milkowski wrote:
> Hello przemolicc,
>
> Thursday, June 29, 2006, 8:01:26 AM, you wrote:
> > As Jeff pointed out: if you mirror across two different storage arrays.
>
> Not only, I believe. There are some classes of problems where, even within one array, ZFS could help with firmware problems (with many controllers in an active-active config, like Symmetrix).

Any real example?

przemol
Re[2]: [zfs-discuss] ZFS and Storage
Hello przemolicc,

Thursday, June 29, 2006, 10:08:23 AM, you wrote:
> > Not only, I believe. There are some classes of problems where, even within one array, ZFS could help with firmware problems (with many controllers in an active-active config, like Symmetrix).
>
> Any real example?

I wouldn't say such problems are common. The issue is that we don't know. From time to time some files are bad, and sometimes fsck is needed for no apparent reason. I think only the future will tell how and when ZFS will protect us. All I can say is that there's big potential in ZFS.

--
Best regards,
Robert    mailto:[EMAIL PROTECTED]
          http://milek.blogspot.com
Re[2]: [zfs-discuss] ZFS and Storage
Hello przemolicc,

Wednesday, June 28, 2006, 10:57:17 AM, you wrote:
> On Tue, Jun 27, 2006 at 04:16:13PM -0500, Al Hopper wrote:
> > Case in point, there was a gentleman who posted on the Yahoo Groups solx86 list and described how faulty firmware on a Hitachi HDS system damaged a bunch of data. The HDS system moves disk blocks around, between one disk and another, in the background, to optimize the filesystem layout. Long after he had written data, blocks from one data set were intermingled with blocks from other data sets/files, causing extensive data corruption.
>
> Al, the problem you described probably comes from failures in the firmware code, not a failure of the disk surface. Sun's engineers can also make mistakes in the ZFS code, right?

But the point is that ZFS should also detect such errors and take proper action. Other filesystems can't. And of course there are bugs in ZFS :P

--
Best regards,
Robert    mailto:[EMAIL PROTECTED]
          http://milek.blogspot.com
Re: [zfs-discuss] ZFS and Storage
On Wed, Jun 28, 2006 at 02:23:32PM +0200, Robert Milkowski wrote:
> Hello przemolicc,
>
> Wednesday, June 28, 2006, 10:57:17 AM, you wrote:
> > Al, the problem you described probably comes from failures in the firmware code, not a failure of the disk surface. Sun's engineers can also make mistakes in the ZFS code, right?
>
> But the point is that ZFS should also detect such errors and take proper action. Other filesystems can't.

Does that mean ZFS can detect errors in ZFS's own code? ;-)

What I wanted to point out is Al's example: he wrote about damaged data. The data were damaged by the firmware, _not_ the disk surface! In such a case ZFS doesn't help. ZFS can detect (and repair) errors on the disk surface, bad cables, etc., but it cannot detect and repair errors in its own (ZFS) code. I am comparing firmware code to ZFS code.

przemol
Re: [zfs-discuss] ZFS and Storage
Hello,

> What I wanted to point out is Al's example: he wrote about damaged data. The data were damaged by the firmware, _not_ the disk surface! In such a case ZFS doesn't help. ZFS can detect (and repair) errors on the disk surface, bad cables, etc., but it cannot detect and repair errors in its own (ZFS) code. I am comparing firmware code to ZFS code.

Firmware doesn't do end-to-end checksumming. If the ZFS code is buggy, the checksums won't match up anyway, so you still detect errors. Plus, it is a lot easier to debug ZFS code than firmware.

--
Regards,
Jeremy
Re: [zfs-discuss] ZFS and Storage
[EMAIL PROTECTED] wrote:
> On Wed, Jun 28, 2006 at 02:23:32PM +0200, Robert Milkowski wrote:
> What I wanted to point out is Al's example: he wrote about damaged data. The data were damaged by the firmware, _not_ the disk surface! In such a case ZFS doesn't help. ZFS can detect (and repair) errors on the disk surface, bad cables, etc., but it cannot detect and repair errors in its own (ZFS) code.

If you mean that ZFS doesn't help with firmware problems, that is not true. For example, if ZFS is mirroring a pool across two different storage arrays, a firmware error in one of them will cause problems that ZFS will detect when it tries to read the data. Further, ZFS would be able to correct the error by reading from the other mirror, unless the second array also suffered from a firmware error.

There are categories of problems that ZFS cannot handle, mostly regarding data availability after catastrophes (as Richard E described), but ZFS can help with many firmware problems.

--
Jeff VICTOR              Sun Microsystems            jeff.victor @ sun.com
OS Ambassador            Sr. Technical Specialist
Solaris 10 Zones FAQ: http://www.opensolaris.org/os/community/zones/faq
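A hedged illustration of the mirrored-across-arrays setup Jeff describes (the device names are placeholders for one LUN from each array):

    # c4t0d0 is a LUN from array A, c5t0d0 a same-sized LUN from array B
    zpool create tank mirror c4t0d0 c5t0d0

    # a firmware fault confined to one array surfaces as checksum errors
    # on that side of the mirror, which ZFS repairs from the healthy copy
    zpool status -x tank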
Re: [zfs-discuss] ZFS and Storage
> Depends on your definition of firmware. In higher-end arrays the data is checksummed when it comes in, and a hash is written when it gets to disk. Of course this is nowhere near end-to-end, but it is better than nothing.

The checksum is often stored with the data (so if the data is not written, or is written to the wrong location, the checksum is still valid). ZFS stores the checksum with the data pointer, so it knows more about the data and whether it was correct. ZFS also checksums before the data travels over the fabric.

> ... and code is code. Easier to debug is a context-sensitive term.

Uhm, well, firmware, in production systems?

Casper
Re: [zfs-discuss] ZFS and Storage
> Depends on your definition of firmware. In higher-end arrays the data is checksummed when it comes in, and a hash is written when it gets to disk. Of course this is nowhere near end-to-end, but it is better than nothing.
>
> ... and code is code. Easier to debug is a context-sensitive term.

It's unfortunate that so many posts got hung up on the code. It's the design that protects your data, and with ZFS you have a better design for data integrity. If the code is faulty, then that's a bug, and the design should still protect you unless your error detection and correction logic is itself faulty (I mean, this is like the anti-corruption bureau being corrupt :-)). There is a huge difference between being able to detect corruption and not knowing that the data is corrupted at all. Whether the code lives up to the design is what real-world testing will show; in most cases ZFS should help.

Kiran
Re: [zfs-discuss] ZFS and Storage
> The vdev can handle dynamic LUN growth, but the underlying VTOC or EFI label may need to be zero'd and reapplied if you set up the initial vdev on a slice. If you introduced the entire disk to the pool you should be fine, but I believe you'll still need to offline/online the pool.

Fine, at least the vdev can handle this... I asked about this feature in October and hoped that it would be implemented when integrating ZFS into Sol10U2:

http://www.opensolaris.org/jive/thread.jspa?messageID=11646

Does anybody know when this feature is finally coming? It would keep the number of LUNs low on the host, especially as device names can be really ugly (long!).

//Mika

# mv Disclaimer.txt /dev/null
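A hedged sketch of the grow-a-LUN workflow described above, for a pool built on whole disks. The pool and device names are hypothetical, and this assumes a release with no automatic expansion, so the pool is bounced to pick up the new size:

    # after the array has grown the LUN behind c4t0d0
    zpool export tank
    format -e c4t0d0      # verify/rewrite the label so it reflects the new size
    zpool import tank
    zpool list tank       # check whether the added capacity is now visible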
Re: [zfs-discuss] ZFS and Storage
> > but there may not be filesystem space for double the data. Sounds like there is a need for a zfs-defragment-file utility, perhaps? Or if you want to be politically cagey about the naming choice, perhaps zfs-seq-read-optimize-file? :-)

For data warehouse and streaming applications a sequential-read optimization could bring additional performance. For normal databases this should be benchmarked...

This brings me back to another question. We have a production database that is cloned at the end of every month for end-of-month processing (currently with a feature on our storage array). I'm thinking about a ZFS version of this task. Requirement: the production database should not suffer performance degradation whilst the clone runs in parallel. As ZFS does not copy all the blocks when cloning, I wonder how much the production database will suffer from sharing most of the data with the clone (concurrent access vs. caching).

Maybe we need a feature in ZFS to do a full clone (that is, copy all blocks) inside the pool if performance is an issue, just like the Quick Copy vs. Shadow Image features on HDS arrays...
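Both variants can be expressed with today's commands. A hedged sketch with hypothetical dataset names; the send/receive copy is only a stand-in for the "full clone" feature being asked about, not an in-pool block copy:

    # space-efficient clone: shares all unmodified blocks with production
    zfs snapshot tank/proddb@eom
    zfs clone tank/proddb@eom tank/eomdb

    # brute-force "full clone": materialize an independent copy of every block
    zfs send tank/proddb@eom | zfs recv tank/eomdb-full

The send/receive copy doubles the space used, but it removes any block sharing, so month-end reads against the copy do not hit the same on-disk blocks as production (though they still share the pool's spindles).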
Re: [zfs-discuss] ZFS and Storage
Philip Brown writes:
> Roch wrote:
> > And, if the load can accommodate a reorder, to get top per-spindle read-streaming performance, a cp(1) of the file should do wonders on the layout.
>
> but there may not be filesystem space for double the data. Sounds like there is a need for a zfs-defragment-file utility, perhaps? Or if you want to be politically cagey about the naming choice, perhaps zfs-seq-read-optimize-file? :-)

Possibly, or maybe using fcntl()? Now, the goal is to take a file with scattered blocks and order them into contiguous chunks, so this is contingent on the existence of regions of free contiguous disk space. It will get more difficult as the storage gets close to full.

-r
Re: [zfs-discuss] ZFS and Storage
Most controllers support a background scrub that will read a volume and repair any bad stripes. This addresses the bad-block issue in most cases.

It still doesn't help when a double failure occurs. Luckily, that's very rare. Usually, in that case, you need to evacuate the volume and try to restore what was damaged.

On Jun 26, 2006, at 6:40 PM, Eric Schrock wrote:
> On Mon, Jun 26, 2006 at 05:26:24PM -0600, Gregory Shaw wrote:
> > You're using hardware raid. The hardware raid controller will rebuild the volume in the event of a single drive failure. You'd need to keep on top of it, but that's a given in the case of either hardware or software raid.
>
> True for total drive failure, but there are more failure modes than that. With hardware RAID, there is no way for the RAID controller to know which block was bad, and therefore it cannot repair the block. With RAID-Z, we have the integrated checksum and can do combinatorial analysis to know not only which drive was bad, but what the data _should_ be, and can repair it to prevent more corruption in the future.
>
> - Eric
>
> --
> Eric Schrock, Solaris Kernel Development    http://blogs.sun.com/eschrock

-
Gregory Shaw, IT Architect, Sun Microsystems Inc.
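ZFS has the equivalent of that controller background scrub built in, with the difference Eric notes: repairs are driven by block checksums, so it knows which copy was bad. A quick illustration against a hypothetical pool named tank:

    # walk every allocated block, verify checksums, repair from redundancy
    zpool scrub tank

    # per-device read/write/checksum error counts and any repaired data
    zpool status -v tank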
Re: [zfs-discuss] ZFS and Storage
Bart Smaalders wrote:
> Gregory Shaw wrote:
> > On Tue, 2006-06-27 at 09:09 +1000, Nathan Kroenert wrote:
> > > How would ZFS self heal in this case?
> >
> > You're using hardware raid. The hardware raid controller will rebuild the volume in the event of a single drive failure. You'd need to keep on top of it, but that's a given in the case of either hardware or software raid. If you've got requirements for surviving an array failure, the recommended solution in that case is to mirror between volumes on multiple arrays. I've always liked software raid (mirroring) in that case, as no manual intervention is needed in the event of an array failure. Mirroring between discrete arrays is usually reserved for mission-critical applications that cost thousands of dollars per hour in downtime.
>
> In other words, it won't. You've spent the disk space, but because you're mirroring in the wrong place (the raid array) all ZFS can do is tell you that your data is gone. With luck, subsequent reads _might_ get the right data, but maybe not.

Careful here when you say "wrong place." There are many scenarios where mirroring in the hardware is the correct way to go, even when running ZFS on top of it.
Re: [zfs-discuss] ZFS and Storage
Unfortunately, a storage-based RAID controller cannot detect errors which occurred between the filesystem layer and the RAID controller, in either direction - in or out. ZFS will detect them through its use of checksums, but ZFS can only fix them if it can access redundant bits. It can't tell a storage device to provide the redundant bits, so it must use its own data protection scheme (RAID-Z or RAID-1) in order to correct the errors it detects.

Gregory Shaw wrote:
> Most controllers support a background scrub that will read a volume and repair any bad stripes. This addresses the bad-block issue in most cases. It still doesn't help when a double failure occurs. Luckily, that's very rare. Usually, in that case, you need to evacuate the volume and try to restore what was damaged.

--
Jeff VICTOR, Sun Microsystems
Re: [zfs-discuss] ZFS and Storage
Not at all. ZFS is a quantum leap in Solaris filesystem/VM functionality. However, I don't see a lot of use for RAID-Z (or Z2) in large enterprise customer situations. For instance, does ZFS enable Sun to walk into an account and say "You can now replace all of your high-end (EMC) disk with JBOD"? I don't think many customers would bite on that.

RAID-Z is an excellent feature; however, it doesn't address many of the reasons for using high-end arrays:
- Exporting snapshots to alternate systems (for live database or backup purposes)
- Remote replication
- Sharing of storage among multiple systems (LUN masking and equivalent)
- Storage management (migration between tiers of storage)
- No-downtime failure replacement (the system doesn't even know)
- Clustering

I know that ZFS is still a work in progress, so some of the above may arrive in future versions of the product. I see the RAID-Z[2] value in small-to-mid size systems where the storage is relatively small and you don't have high availability requirements.

On Jun 27, 2006, at 8:48 AM, Darren J Moffat wrote:
> So everything you are saying seems to suggest you think ZFS was a waste of engineering time since hardware raid solves all the problems? I don't believe it does, but I'm no storage expert and maybe I've drunk too much Kool-Aid. I'm a software person, and for me ZFS is brilliant: it is so much easier than managing any of the hardware raid systems I've dealt with.
>
> --
> Darren J Moffat

-
Gregory Shaw, IT Architect, Sun Microsystems Inc.
Re: [zfs-discuss] ZFS and Storage
This is getting pretty picky. You're saying that ZFS will detect any errors introduced after ZFS has gotten the data. However, as stated in a previous post, that doesn't guarantee that the data given to ZFS wasn't already corrupted.

If you don't trust your storage subsystem, you're going to encounter issues regardless of the software used to store the data. We'll have to see if ZFS can 'save' customers in this situation. I've found that regardless of the storage solution in question, you can't anticipate all issues, and when a brownout or other ugly loss of service occurs, you may or may not be intact, ZFS or no. I've never seen a product that can deal with all possible situations.

On Jun 27, 2006, at 9:01 AM, Jeff Victor wrote:
> Unfortunately, a storage-based RAID controller cannot detect errors which occurred between the filesystem layer and the RAID controller, in either direction - in or out. ZFS will detect them through its use of checksums, but ZFS can only fix them if it can access redundant bits. It can't tell a storage device to provide the redundant bits, so it must use its own data protection scheme (RAID-Z or RAID-1) in order to correct the errors it detects.

-
Gregory Shaw, IT Architect, Sun Microsystems Inc.
Re: [zfs-discuss] ZFS and Storage
> This is getting pretty picky. You're saying that ZFS will detect any errors introduced after ZFS has gotten the data. However, as stated in a previous post, that doesn't guarantee that the data given to ZFS wasn't already corrupted.

But there's a big difference between the time ZFS gets the data and the time your typical storage system gets it. And your typical storage system does not store any information which allows it to detect all but the most simple errors.

Storage systems are complicated and have many failure modes at many different levels:
- disks not writing data, or writing data to an incorrect location
- disks not reporting failures when they occur
- bit errors in disk write buffers causing data corruption
- storage array software with bugs
- storage arrays with undetected hardware errors
- data corruption in the path (such as switches that mangle packets but keep the TCP checksum valid)

> If you don't trust your storage subsystem, you're going to encounter issues regardless of the software used to store the data. We'll have to see if ZFS can 'save' customers in this situation. I've found that regardless of the storage solution in question, you can't anticipate all issues, and when a brownout or other ugly loss of service occurs, you may or may not be intact, ZFS or no. I've never seen a product that can deal with all possible situations.

ZFS attempts to deal with more problems than any of the currently existing solutions by giving end-to-end verification of the data. One of the reasons why ZFS was created was a particular large customer who had data corruption which occurred two years (!) before it was detected. The bad data had migrated and propagated; the good data was no longer available on backups (which weren't very relevant anyway after such a long time).

ZFS tries to give one important guarantee: if the data is bad, we will not return it.

One case in point is the person in MPK with a SATA controller which corrupts memory; he didn't discover this using UFS (except for perhaps a few strange events he noticed). After switching to ZFS he started to find corruption, so now he uses a self-healing ZFS mirror (or RAID-Z). ZFS helps at the low end as much as it does at the high end.

I'll bet that ZFS will generate more calls about broken hardware, and fingers will be pointed at ZFS at first because it's the new kid; it will be some time before people realize that the data was rotting all along.

Casper
Re: [zfs-discuss] ZFS and Storage
On Tue, Jun 27, 2006 at 09:41:10AM -0600, Gregory Shaw wrote:
> This is getting pretty picky. You're saying that ZFS will detect any errors introduced after ZFS has gotten the data. However, as stated in a previous post, that doesn't guarantee that the data given to ZFS wasn't already corrupted.

There will always be some place where errors can be introduced and go undetected. But some parts of the system are more error prone than others, and ZFS targets the most error prone of them: rotating rust.

For the rest, make sure you have ECC memory and that you're using secure NFS (with krb5i or krb5p), and the probability of undetectable data corruption errors should be much closer to zero than what you'd get with other systems.

That said, there's a proposal to add end-to-end data checksumming to NFSv4 (see the IETF NFSv4 WG list archives). That proposal can't protect metadata, and it doesn't remove any one type of data corruption error on the client side, but it does on the server side.

Nico
--
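For the NFS part of Nico's advice, the client-side knob is the security flavor on the mount. A hedged example for a Solaris client, assuming Kerberos is already configured on both ends and the export allows a matching flavor (the server and path names are made up):

    # krb5i = Kerberos authentication plus per-request integrity checksums
    mount -F nfs -o sec=krb5i server:/export/data /mnt/data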
Re: [zfs-discuss] ZFS and Storage
Torrey McMahon wrote:
> ZFS is great for the systems that can run it. However, any enterprise datacenter is going to be made up of many, many hosts running many, many OSes. In that world you're going to consolidate on large arrays and use the features of those arrays where they cover the most ground. For example, if I've got 100 hosts all running different OSes and apps, and I can perform my data replication and redundancy algorithms, in most cases RAID, in one spot, then it will be much more cost efficient to do it there.

Exactly what I'm pondering. In the near to mid term, Solaris with ZFS can be seen as sort of a storage virtualizer: it takes disks into ZFS pools and volumes and then presents them to other hosts and OSes via iSCSI, NFS, SMB and so on. At that point, those other OSes can enjoy the benefits of ZFS.

In the long term, it would be nice to see ZFS (or its concepts) integrated as the LUN provisioning and backing-store mechanism on hardware RAID arrays themselves, supplanting the traditional RAID paradigms that have been in use for years.

/dale
Re: [zfs-discuss] ZFS and Storage
Jason Schroeder wrote:
> Torrey McMahon wrote:
> > [EMAIL PROTECTED] wrote:
> > > I'll bet that ZFS will generate more calls about broken hardware, and fingers will be pointed at ZFS at first because it's the new kid; it will be some time before people realize that the data was rotting all along.
> >
> > Ehhh... I don't think so. Most of our customers have HW arrays that have been scrubbing data for years and years, as well as apps on top that have been verifying the data (Oracle, for example). Not to mention there will be a bit of time before people move over to ZFS in the high end.
>
> Ahh... but there is the rub. Today you/we don't *really* know, do we? Maybe there are bad-juju blocks, maybe not. Running ZFS, whether in a redundant vdev or not, will certainly turn the big spotlight on and give us the data that checksums matched, or they didn't.

A spotlight on what? How is that data going to get into ZFS? The more I think about this, the more I realize it's going to do little for existing data sets. You're going to have to migrate that data from filesystem X into ZFS first, and from that point on ZFS has no idea if the data was bad to begin with. If you can do an in-place migration then you might be able to weed out some bad physical blocks/drives over time, but I assert that the current disk-scrubbing methodologies catch most of those. Yes, it's great for new data sets where you started with ZFS.

Sorry if I sound like I'm raining on the parade here, folks. That's not the case, really, and I'm all for the great new features and EAU ZFS gives where applicable.
Re: [zfs-discuss] ZFS and Storage
Nicolas Williams wrote:
> For the rest, make sure you have ECC memory and that you're using secure NFS (with krb5i or krb5p), and the probability of undetectable data corruption errors should be much closer to zero than what you'd get with other systems.

Another alternative is using IPsec with just AH.

For the benefit of those outside of Sun MPK17: both krb5i and IPsec AH were used to diagnose and prove that we had a faulty router in a lab that was causing very strange build errors. TCP/IP alone didn't catch the problems; sometimes they showed up as simple SCCS checksum mismatches, and sometimes we had compile errors.

--
Darren J Moffat
Re: [zfs-discuss] ZFS and Storage
Torrey McMahon wrote:
> Darren J Moffat wrote:
> > So everything you are saying seems to suggest you think ZFS was a waste of engineering time since hardware raid solves all the problems? I don't believe it does, but I'm no storage expert and maybe I've drunk too much Kool-Aid. I'm a software person, and for me ZFS is brilliant: it is so much easier than managing any of the hardware raid systems I've dealt with.
>
> ZFS is great for the systems that can run it. However, any enterprise datacenter is going to be made up of many, many hosts running many, many OSes. In that world you're going to consolidate on large arrays and use the features of those arrays where they cover the most ground. For example, if I've got 100 hosts all running different OSes and apps, and I can perform my data replication and redundancy algorithms, in most cases RAID, in one spot, then it will be much more cost efficient to do it there.

But you still need a local file system on those systems in many cases. So back to where we started, I guess: how to effectively use ZFS to benefit Solaris (and the other platforms it gets ported to) while still using hardware RAID, because you have no choice but to use it.

--
Darren J Moffat
[zfs-discuss] ZFS and Storage
Hi

Now that Solaris 10 06/06 is finally downloadable, I have some questions about ZFS.

- We have a big storage system supporting RAID-5 and RAID-1. At the moment we only use RAID-5 (for non-Solaris systems as well). We are thinking about using ZFS on those LUNs instead of UFS. As ZFS on hardware RAID-5 seems like overkill, an option would be to use RAID-1 with RAID-Z. Then again, this is a waste of space, as it needs more disks due to the mirroring. Later on we might be using asynchronous replication to another storage system using SAN: even more waste of space. Somehow storage virtualization and ZFS, as of today, just don't work nicely together. What we would need is the ability to use JBODs.

- Does ZFS in the current version support LUN extension? With UFS, we have to zero the VTOC and then adjust the new disk geometry. How does it look with ZFS?

- I've read the threads about ZFS and databases. Still, I'm not 100% convinced about read performance. Doesn't the fragmentation of large database files (because of the copy-on-write design) impact read performance?

- Does anybody have any experience with database cloning using the ZFS mechanism? What factors influence performance when running the cloned database in parallel?

- I really like the idea of keeping all the needed database files together, to allow fast and consistent cloning.

Thanks
Mika

# mv Disclaimer.txt /dev/null
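For the first question, the two arrangements being weighed can be sketched as follows (a hedged example with made-up device names; only one of the two pool layouts would actually be created):

    # option 1: plain (non-redundant) pool on two hardware RAID-5 LUNs;
    # ZFS can detect corruption here but cannot repair it
    zpool create tank c4t0d0 c4t1d0

    # option 2: ZFS-level mirroring of LUNs from two arrays (or two RAID
    # groups), trading capacity for self-healing reads
    # zpool create tank mirror c4t0d0 c5t0d0

    zfs create tank/oradata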
Re: [zfs-discuss] ZFS and Storage
About:

> - I've read the threads about ZFS and databases. Still, I'm not 100% convinced about read performance. Doesn't the fragmentation of large database files (because of the copy-on-write design) impact read performance?

I do need to get back to this thread. The way I am currently looking at it is this:

ZFS will perform great at the transaction component (say, the small (8K) O_DSYNC writes) because the ZIL will aggregate them into fewer, larger I/Os and the block allocation will stream them to the surface. On the other hand, read streaming will require good prefetch code (under review) to get the read performance we want.

If the requirements balance random writes and read streaming, then ZFS should be right there with the best filesystems. If the critical requirement focuses exclusively on streaming reads of a file that was written randomly and, in addition, the number of spindles is limited, then that is not the sweet spot of ZFS. Read performance should still scale with the number of spindles. And, if the load can accommodate a reorder, to get top per-spindle read-streaming performance, a cp(1) of the file should do wonders on the layout.

-r
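Roch's cp(1) trick is just a sequential rewrite of the file so that ZFS allocates fresh, mostly contiguous blocks. A hedged sketch with hypothetical paths; the file must be quiesced and the pool needs enough free space for a second copy:

    # rewrite the fragmented file sequentially, then swap it into place
    cp /tank/db/bigtable.dbf /tank/db/bigtable.dbf.new
    mv /tank/db/bigtable.dbf.new /tank/db/bigtable.dbf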
Re: [zfs-discuss] ZFS and Storage
On Jun 26, 2006, at 1:15 AM, Mika Borner wrote:
> - We have a big storage system supporting RAID-5 and RAID-1. At the moment we only use RAID-5 (for non-Solaris systems as well). We are thinking about using ZFS on those LUNs instead of UFS. As ZFS on hardware RAID-5 seems like overkill, an option would be to use RAID-1 with RAID-Z. Then again, this is a waste of space, as it needs more disks due to the mirroring. Later on we might be using asynchronous replication to another storage system using SAN: even more waste of space. Somehow storage virtualization and ZFS, as of today, just don't work nicely together. What we would need is the ability to use JBODs.

If you've got hardware RAID-5, why not just run regular (non-raid) pools on top of the RAID-5?

I wouldn't go back to JBOD. Hardware arrays offer a number of advantages over JBOD:
- disk microcode management
- optimized access to storage
- large write caches
- RAID computation can be done in specialized hardware
- SAN-based hardware products allow sharing of storage among multiple hosts, which allows storage to be utilized more effectively

> - Does ZFS in the current version support LUN extension? With UFS, we have to zero the VTOC and then adjust the new disk geometry. How does it look with ZFS?

I don't understand what you're asking. What problem is solved by zeroing the VTOC?

> - I've read the threads about ZFS and databases. Still, I'm not 100% convinced about read performance. Doesn't the fragmentation of large database files (because of the copy-on-write design) impact read performance?

This is discussed elsewhere in the zfs-discuss group.

-
Gregory Shaw, IT Architect, Sun Microsystems Inc.
Re: [zfs-discuss] ZFS and Storage
Roch wrote:
> And, if the load can accommodate a reorder, to get top per-spindle read-streaming performance, a cp(1) of the file should do wonders on the layout.

but there may not be filesystem space for double the data.

Sounds like there is a need for a zfs-defragment-file utility, perhaps? Or if you want to be politically cagey about the naming choice, perhaps zfs-seq-read-optimize-file? :-)
Re: [zfs-discuss] ZFS and Storage
Eric Schrock wrote:
> On Mon, Jun 26, 2006 at 05:26:24PM -0600, Gregory Shaw wrote:
> > You're using hardware raid. The hardware raid controller will rebuild the volume in the event of a single drive failure. You'd need to keep on top of it, but that's a given in the case of either hardware or software raid.
>
> True for total drive failure, but there are more failure modes than that. With hardware RAID, there is no way for the RAID controller to know which block was bad, and therefore it cannot repair the block. With RAID-Z, we have the integrated checksum and can do combinatorial analysis to know not only which drive was bad, but what the data _should_ be, and can repair it to prevent more corruption in the future.

Keep in mind that each disk data block is accompanied by a pretty long error correction code (ECC), which allows for (a) verification of data integrity and (b) repair of lost/misread bits (typically up to about 10% of the block data). Therefore, in the case of single-block errors there are several possible situations:

- non-recoverable errors - the number of correct bits in the combined data + ECC is insufficient. Such errors are visible to the RAID controller; the controller can use a redundant copy of the data and can perform the repair.

- recoverable errors - some bits can't be read correctly but they can be reconstructed using ECC. These errors are not directly visible to either the RAID controller or ZFS. However, the disks keep a count of recoverable errors, so disk scrubbers can identify disk areas with rotten blocks and force block relocation.

- silent data corruption - it can happen in memory before the data is written to disk, it can occur in the disk cache, or it can be caused by a bug in the disk firmware. Here the disk controller can't do anything, and the end-to-end checksums which ZFS offers are the only solution.

--
Olaf
Re: [zfs-discuss] ZFS and Storage
Gregory Shaw wrote:
> On Tue, 2006-06-27 at 09:09 +1000, Nathan Kroenert wrote:
> > How would ZFS self heal in this case?
>
> You're using hardware raid. The hardware raid controller will rebuild the volume in the event of a single drive failure. You'd need to keep on top of it, but that's a given in the case of either hardware or software raid.
>
> If you've got requirements for surviving an array failure, the recommended solution in that case is to mirror between volumes on multiple arrays. I've always liked software raid (mirroring) in that case, as no manual intervention is needed in the event of an array failure. Mirroring between discrete arrays is usually reserved for mission-critical applications that cost thousands of dollars per hour in downtime.

In other words, it won't. You've spent the disk space, but because you're mirroring in the wrong place (the raid array) all ZFS can do is tell you that your data is gone. With luck, subsequent reads _might_ get the right data, but maybe not.

- Bart

--
Bart Smaalders            Solaris Kernel Performance
[EMAIL PROTECTED]         http://blogs.sun.com/barts
Re: [zfs-discuss] ZFS and Storage
Olaf Manczak wrote:
> Keep in mind that each disk data block is accompanied by a pretty long error correction code (ECC), which allows for (a) verification of data integrity and (b) repair of lost/misread bits (typically up to about 10% of the block data).

AFAIK, typical disk ECC will correct 8 bytes. I'd love for it to be 10% (51 bytes). Do you have a pointer to such information?

> - silent data corruption - it can happen in memory before the data is written to disk, it can occur in the disk cache, or it can be caused by a bug in the disk firmware. Here the disk controller can't do anything, and the end-to-end checksums which ZFS offers are the only solution.

Another mode occurs when you use a format(1m)-like utility to scan and repair disks. For such utilities, if the data cannot be reconstructed it is zero-filled. If there was real data stored there, then ZFS will detect it, while the majority of other file systems will not. For an array, one should not be able to readily access such utilities and cause such corrective actions, but I would not bet the farm on it -- end-to-end error detection will always prevail.

-- richard