Re: [zfs-discuss] How Virtual Box handles the IO
From what I understand, and from everything I've read following threads here, there are ways to do it, but there is no standardized tool yet; it's complicated and handled on a per-case basis. People who pay for support have, however, recovered pools. I'm sure they are working on it, and I would imagine it is a major goal.

On Wed, Aug 5, 2009 at 1:23 AM, James Hess wrote:
> So much for the "it's a consumer hardware problem" argument.
> I for one have to count it as a major drawback of ZFS that it doesn't
> provide you a mechanism to get something of your pool back, by
> reconstruction or reversion, when a failure leaves the metadata
> inconsistent.
>
> A policy of data integrity taken to the extreme of blocking access to
> good data is not something OS users want.
>
> Users don't put up with this sort of thing from other filesystems...
> some sort of improvement here is sorely needed.
>
> ZFS ought to retain enough information, and make an effort to bring
> pool metadata back to a consistent state, even if that means some loss
> of data: a file may have to revert to an older state, or a file that
> was undergoing changes may end up unreadable because the log was
> inconsistent, even if the user has to run zpool import with a
> recovery-mode option or something of that nature.
>
> It beats losing a TB of data on a pool that should be otherwise intact.
Re: [zfs-discuss] How Virtual Box handles the IO
So much for the "it's a consumer hardware problem" argument.

I for one have to count it as a major drawback of ZFS that it doesn't provide you a mechanism to get something of your pool back, by reconstruction or reversion, when a failure leaves the metadata inconsistent.

A policy of data integrity taken to the extreme of blocking access to good data is not something OS users want. Users don't put up with this sort of thing from other filesystems... some sort of improvement here is sorely needed.

ZFS ought to retain enough information, and make an effort to bring pool metadata back to a consistent state, even if that means some loss of data: a file may have to revert to an older state, or a file that was undergoing changes may end up unreadable because the log was inconsistent, even if the user has to run zpool import with a recovery-mode option or something of that nature.

It beats losing a TB of data on a pool that should be otherwise intact.
Re: [zfs-discuss] How Virtual Box handles the IO
On Fri, Jul 31, 2009 at 7:58 PM, Frank Middleton wrote:
> Has anyone ever actually lost a pool on Sun hardware other than
> by losing too many replicas or operator error? As you have so
> eloquently pointed out, building a reliable storage system is an
> engineering problem.

Yes, I have lost a pool when running on Sun hardware.

http://mail.opensolaris.org/pipermail/zfs-discuss/2007-September/013233.html

Quite likely related to:

http://bugs.opensolaris.org/view_bug.do?bug_id=6684721

In other words, it was a buggy Sun component that didn't do the right thing with cache flushes.

--
Mike Gerdts
http://mgerdts.blogspot.com/
Re: [zfs-discuss] How Virtual Box handles the IO
> I understand that the ZILs are allocated out of the general pool.

There is one intent log chain per dataset (file system or zvol). The head of each log is kept in the main pool. Without slog(s) we allocate (and chain) blocks from the main pool. If separate intent log(s) exist then blocks are allocated and chained there. If we fail to allocate from the slog(s) then we revert to allocating from the main pool.

> Is there a ZIL for the ZILs, or does this make no sense?

There is no ZIL for the ZILs. Note that the ZIL is not a journal (like ext3 or ufs logging). It simply contains records of system calls (including data) that need to be replayed if the system crashes before those records have been committed in a transaction group.

Hope that helps,
Neil.
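To make that allocation policy concrete, here is a minimal C sketch of the fallback described above: take intent-log blocks from the separate log device when one exists, and revert to the main pool when there is no slog or the slog allocation fails. It is purely illustrative and not the actual OpenSolaris zil.c code; every type and helper in it is invented for the example.

/*
 * Illustrative sketch only, not the actual zil.c code.  All names here
 * (vdev_class_t, alloc_block, zil_next_block) are invented for the example.
 */
#include <stdio.h>
#include <stdlib.h>

typedef struct { const char *name; int full; } vdev_class_t;

/* Stand-in allocator: fails when the block source is absent or "full". */
static void *alloc_block(vdev_class_t *src, size_t size)
{
    if (src == NULL || src->full)
        return NULL;
    printf("allocated %zu bytes from %s\n", size, src->name);
    return malloc(size);
}

/* Allocate the next block in a dataset's intent-log chain. */
static void *zil_next_block(vdev_class_t *main_pool, vdev_class_t *slog, size_t size)
{
    void *bp = alloc_block(slog, size);    /* prefer the slog, if any */
    if (bp == NULL)
        bp = alloc_block(main_pool, size); /* revert to the main pool */
    return bp;
}

int main(void)
{
    vdev_class_t pool = { "main pool", 0 };
    vdev_class_t slog = { "slog", 1 };     /* pretend the slog is full */

    free(zil_next_block(&pool, &slog, 4096)); /* falls back to the main pool */
    return 0;
}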
Re: [zfs-discuss] How Virtual Box handles the IO
Great to hear a few success stories! We have been experimentally running ZFS on really crappy hardware and it has never lost a pool. Running on VB with ZFS/iscsi raw disks we have yet to see any errors at all. On sun4u with lsi sas/sata it is really rock solid. And we've been going out of our way to break it, because of bad experiences with ntfs, ext2 and UFS as well as many disk failures (ever had fsck run amok?).

On 07/31/09 12:11 PM, Richard Elling wrote:
> Making flush be a nop destroys the ability to check for errors thus
> breaking the trust between ZFS and the data on medium.
> -- richard

Can you comment on the issue that the underlying disks were, as far as we know, never powered down? My understanding is that disks usually try to flush their caches as quickly as possible to make room for more data, so in this scenario things were probably quiet after the guest crash, and whatever was in the cache would likely have been flushed anyway, certainly by the time the OP restarted VB and the guest.

Could you also comment on CR 6667683, which I believe is proposed as a solution for recovery in this very rare case?

I understand that the ZILs are allocated out of the general pool. Is there a ZIL for the ZILs, or does this make no sense?

As the one who started the whole ECC discussion, I don't think anyone has ever claimed that lack of ECC caused this loss of a pool, or that it could. AFAIK lack of ECC can't be a problem at all on RAIDZ vdevs, only with single drives or plain mirrors. I've suggested an RFE for the mirrored case to double buffer the writes, but disabling checksums pretty much fixes the problem if you don't have ECC, so it isn't worth pursuing. You can disable checksums per file system, so this is an elegant solution if you don't have ECC memory but you do mirror. Running with no mirror is, IMO, suicidal with any file system.

Has anyone ever actually lost a pool on Sun hardware other than by losing too many replicas or operator error? As you have so eloquently pointed out, building a reliable storage system is an engineering problem. There are a lot of folks out there who are very happy with ZFS on decent hardware. On crappy hardware you get what you pay for...

Cheers -- Frank (happy ZFS evangelist)
Re: [zfs-discuss] How Virtual Box handles the IO
Thanks for following up with this, Russel.

On Jul 31, 2009, at 7:11 AM, Russel wrote:
> After all the discussion here about VB, and all the finger pointing, I
> raised a bug on VB about flushing. Remember I am using RAW disks via the
> SATA emulation in VB; the disks are WD 2TB drives. Also remember the HOST
> machine NEVER crashed or stopped, BUT the guest OS (OpenSolaris) was hung
> and so I powered off the VIRTUAL host.
>
> OK, this is what the VB engineer had to say after reading this and
> another thread I had pointed him to. (He missed the fact I was using RAW,
> not surprising as it's a rather long thread now!)
>
> ===
> Just looked at those two threads, and from what I saw all vital
> information is missing - no hint whatsoever on how the user set up his
> disks, nothing about what errors should be dealt with and so on. So it is
> hard to say anything sensible, especially as people seem most interested
> in assigning blame to some product. ZFS doesn't deserve this, and
> VirtualBox doesn't deserve this either.
>
> In the first place, there is absolutely no difference in how the IDE and
> SATA devices handle the flush command. The documentation just wasn't
> updated to talk about the SATA controller. Thanks for pointing this out,
> it will be fixed in the next major release. If you want the information
> straight away: just replace "piix3ide" with "ahci", and all the other
> flushing behavior settings apply as well. See a bit further below for
> what it buys you (or not).
>
> What I haven't mentioned is the rationale behind the current behavior.
> The reason for ignoring flushes is simple: the biggest competitor does it
> by default as well, and one gets beaten up by every reviewer if
> VirtualBox is just a few percent slower than you know what. Forget about
> arguing with reviewers.
>
> That said, a bit about what flushing can achieve - or not. Just keep in
> mind that VirtualBox doesn't really buffer anything. In the IDE case
> every read and write request gets handed more or less straight (depending
> on the image format complexity) to the host OS. So there is absolutely
> nothing which can be lost if one assumes the host OS doesn't crash.
>
> In the SATA case things are slightly more complicated. If you're using
> anything but raw disks or flat file VMDKs, the behavior is 100% identical
> to IDE. If you use raw disks or flat file VMDKs, we activate NCQ support
> in the SATA device code, which means that the guest can push through a
> number of commands at once, and they get handled on the host via async
> I/O. Again - if the host OS works reliably there is nothing to lose.

The problem with this thought process is that since the data is not on the medium, a fault that occurs between the flush request and the bogus ack goes undetected. The OS trusts that when the disk said "the data is on the medium", the data is on the medium with no errors.

This problem also affects "hardware" RAID arrays which provide nonvolatile caches. If the array acks a write and flush, but the data is not yet committed to the medium, then if a disk fails, the data must remain in nonvolatile cache until it can be committed to the medium. A use case may help: suppose the power goes out. Most arrays have enough battery to last for some time, but if power isn't restored before the batteries discharge, there is a risk of data loss.

For ZFS, cache flush requests are not gratuitous. One critical case is the uberblock or label update. ZFS does:
1. update labels 0 and 2
2. flush
3. check for errors
4. update labels 1 and 3
5. flush
6. check for errors

Making flush be a nop destroys the ability to check for errors, thus breaking the trust between ZFS and the data on the medium.
 -- richard
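To illustrate why an ignored flush is harmful in that sequence, here is a minimal C sketch of the two-phase label update listed above. It is purely illustrative, not the actual ZFS uberblock/label code; write_labels() and flush_cache() are stand-ins invented for the example.

/*
 * Illustrative sketch only, not the actual ZFS label code.  It mirrors
 * the sequence above: write labels 0 and 2, flush, check for errors,
 * then write labels 1 and 3, flush, check again.
 */
#include <stdio.h>

/* Stand-in: pretend to write two of the four pool labels. */
static int write_labels(int a, int b)
{
    printf("write labels %d and %d\n", a, b);
    return 0;                       /* 0 == success */
}

/*
 * Stand-in for the SYNCHRONIZE CACHE request.  If a layer below turns
 * this into a no-op that always reports success, the error checks in
 * update_uberblock() can never fire, even when nothing reached the medium.
 */
static int flush_cache(void)
{
    printf("flush\n");
    return 0;
}

static int update_uberblock(void)
{
    /* Phase 1: labels 0 and 2.  On failure, labels 1 and 3 still hold the old state. */
    if (write_labels(0, 2) != 0 || flush_cache() != 0)
        return -1;

    /* Phase 2: labels 1 and 3.  On failure, labels 0 and 2 already hold the new state. */
    if (write_labels(1, 3) != 0 || flush_cache() != 0)
        return -1;

    return 0;
}

int main(void)
{
    return (update_uberblock() == 0) ? 0 : 1;
}

The point of splitting the update across the two label pairs is that a failure detected at either flush leaves at least one pair holding a consistent previous state; if the flush is silently acknowledged without reaching the medium, both checks pass no matter what actually hit the disk.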
> The only thing flushing can potentially improve is the behavior when the
> host OS crashes. But that depends on many assumptions about what the
> respective OS does, what the filesystems do, etc.
>
> Hope those facts can be the basis of a real discussion. Feel free to
> raise any issue you have in this context, as long as it's not purely
> hypothetical.
> ===
[zfs-discuss] How Virtual Box handles the IO
After all the discussion here about VB, and all the finger pointing, I raised a bug on VB about flushing. Remember I am using RAW disks via the SATA emulation in VB; the disks are WD 2TB drives. Also remember the HOST machine NEVER crashed or stopped, BUT the guest OS (OpenSolaris) was hung and so I powered off the VIRTUAL host.

OK, this is what the VB engineer had to say after reading this and another thread I had pointed him to. (He missed the fact I was using RAW, not surprising as it's a rather long thread now!)

===
Just looked at those two threads, and from what I saw all vital information is missing - no hint whatsoever on how the user set up his disks, nothing about what errors should be dealt with and so on. So it is hard to say anything sensible, especially as people seem most interested in assigning blame to some product. ZFS doesn't deserve this, and VirtualBox doesn't deserve this either.

In the first place, there is absolutely no difference in how the IDE and SATA devices handle the flush command. The documentation just wasn't updated to talk about the SATA controller. Thanks for pointing this out, it will be fixed in the next major release. If you want the information straight away: just replace "piix3ide" with "ahci", and all the other flushing behavior settings apply as well. See a bit further below for what it buys you (or not).

What I haven't mentioned is the rationale behind the current behavior. The reason for ignoring flushes is simple: the biggest competitor does it by default as well, and one gets beaten up by every reviewer if VirtualBox is just a few percent slower than you know what. Forget about arguing with reviewers.

That said, a bit about what flushing can achieve - or not. Just keep in mind that VirtualBox doesn't really buffer anything. In the IDE case every read and write request gets handed more or less straight (depending on the image format complexity) to the host OS. So there is absolutely nothing which can be lost if one assumes the host OS doesn't crash.

In the SATA case things are slightly more complicated. If you're using anything but raw disks or flat file VMDKs, the behavior is 100% identical to IDE. If you use raw disks or flat file VMDKs, we activate NCQ support in the SATA device code, which means that the guest can push through a number of commands at once, and they get handled on the host via async I/O. Again - if the host OS works reliably there is nothing to lose.

The only thing flushing can potentially improve is the behavior when the host OS crashes. But that depends on many assumptions about what the respective OS does, what the filesystems do, etc.

Hope those facts can be the basis of a real discussion. Feel free to raise any issue you have in this context, as long as it's not purely hypothetical.
===