Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Yes, if you value your data you should change from USB drives to normal drives. I have heard that USB can do some strange things; a normal connection such as SATA is more reliable. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Hello, I'm now thinking there is some _real_ bug in the way zfs handles file systems created with the pool itself (i.e. the tank filesystem when the zpool is tank, usually mounted as /tank). My own experience shows that zfs is unable to send/receive recursively (snapshots, child filesystems) properly when the destination is such a "level 0" file system, i.e. othertank, though everything works as expected when I send to othertank/tank (see my posts). I think you might also be seeing some aspects of this problem. Bruno -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
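For anyone trying to reproduce what Bruno describes, a minimal sketch (pool and snapshot names here are made up) looks something like the following; whether the first receive misbehaves is exactly the open question:

# take a recursive snapshot of everything under "tank"
zfs snapshot -r tank@backup

# replicate into the root dataset of another pool -- the case reported as broken
zfs send -R tank@backup | zfs receive -F othertank

# replicate into a child dataset instead -- the case reported to work
zfs send -R tank@backup | zfs receive -F othertank/tank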
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
I just have to say this, and I don't mean it in a bad way... If you really care about your data, why then use USB drives with loose cables and (apparently) no backup? USB-connected drives for data backup are okay, and for playing around and getting to know ZFS they also seem okay. Using them for online data that you care about and expecting them to be reliable... it's just not the right technology for that, IMHO. ..Remco On 2/13/10 11:23 AM, Andy Stenger wrote: I had a very similar problem. 8 external USB drives running OpenSolaris native. When I moved the machine into a different room and powered it back up (there were a couple of reboots and a couple of broken USB cables and drive shutdowns in between), I got the same error. Losing that much data is definitely a shock. I'm running raidz2 and I would have assumed that two levels of redundancy should be fine to toss a lot of roughness at the pool. After panicking a little, stressing my family out, and some playing with zdb that led nowhere, I did a zpool export mypool zpool import mypool It complained about being unable to mount because the mount point was not empty, so I did umount /mypool/mypool zfs mount mypool/mypool zpool status mypool and to my relief it seems all fine. ls /mypool/mypool does show data. A scrub is running right now to be on the safe side. Thought that might help some folks out there. Cheers! Andy ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
I had a very similar problem. 8 external USB drives running OpenSolaris native. When I moved the machine into a different room and powered it back up (there were a couple of reboots and a couple of broken USB cables and drive shutdowns in between), I got the same error. Losing that much data is definitely a shock. I'm running raidz2 and I would have assumed that two levels of redundancy should be fine to toss a lot of roughness at the pool. After panicking a little, stressing my family out, and some playing with zdb that led nowhere, I did a zpool export mypool zpool import mypool It complained about being unable to mount because the mount point was not empty, so I did umount /mypool/mypool zfs mount mypool/mypool zpool status mypool and to my relief it seems all fine. ls /mypool/mypool does show data. A scrub is running right now to be on the safe side. Thought that might help some folks out there. Cheers! Andy ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
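For reference, the sequence Andy describes, written out one command per line ("mypool" is his pool name; this is just his report restated, not a guaranteed recovery recipe):

zpool export mypool
zpool import mypool        # complained that the mountpoint was not empty
umount /mypool/mypool      # clear the stale directory blocking the mount
zfs mount mypool/mypool
zpool status mypool        # check pool health
ls /mypool/mypool          # confirm the data is visible again
zpool scrub mypool         # verify all checksums, to be on the safe side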
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 4-Aug-09, at 9:28 AM, Roch Bourbonnais wrote: Le 26 juil. 09 à 01:34, Toby Thain a écrit : On 25-Jul-09, at 3:32 PM, Frank Middleton wrote: On 07/25/09 02:50 PM, David Magda wrote: Yes, it can be affected. If the snapshot's data structure / record is underneath the corrupted data in the tree then it won't be able to be reached. Can you comment on if/how mirroring or raidz mitigates this, or tree corruption in general? I have yet to lose a pool even on a machine with fairly pathological problems, but it is mirrored (and copies=2). I was also wondering if you could explain why the ZIL can't repair such damage. Finally, a number of posters blamed VB for ignoring a flush, but according to the evil tuning guide, without any application syncs, ZFS may wait up to 5 seconds before issuing a synch, ^^ of course this can never cause inconsistency. The issue under discussion is inconsistency - unexpected corruption of on-disk structures. and there must be all kinds of failure modes even on bare hardware where it never gets a chance to do one at shutdown. This is interesting if you do ZFS over iscsi because of the possibility of someone tripping over a patch cord or a router blowing a fuse. Doesn't this mean /any/ hardware might have this problem, albeit with much lower probability? The problem is assumed *ordering*. In this respect VB ignoring flushes and real hardware are not going to behave the same. --Toby I agree that noone should be ignoring cache flushes. However the path to corruption must involve some dropped acknowledged I/Os. The ueberblock I/O was issued to stable storage but the blocks it pointed to, which had reached the disk firmware earlier, never make it to stable storage. I can see this scenerio when the disk looses power Or if the host O/S crashes. All this applies to virtual IDE devices alone, of course. iSCSI is a different case entirely as presumably flushes/barriers are processed normally. but I don't see it with cutting power to the guest. Right, in this case it's unlikely or nearly impossible. --Toby When managing a zpool on external storage, do people export the pool before taking snapshots of the guest ? -r Thanks ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 26 July 2009 at 01:34, Toby Thain wrote: On 25-Jul-09, at 3:32 PM, Frank Middleton wrote: On 07/25/09 02:50 PM, David Magda wrote: Yes, it can be affected. If the snapshot's data structure / record is underneath the corrupted data in the tree then it won't be able to be reached. Can you comment on if/how mirroring or raidz mitigates this, or tree corruption in general? I have yet to lose a pool even on a machine with fairly pathological problems, but it is mirrored (and copies=2). I was also wondering if you could explain why the ZIL can't repair such damage. Finally, a number of posters blamed VB for ignoring a flush, but according to the evil tuning guide, without any application syncs, ZFS may wait up to 5 seconds before issuing a synch, and there must be all kinds of failure modes even on bare hardware where it never gets a chance to do one at shutdown. This is interesting if you do ZFS over iscsi because of the possibility of someone tripping over a patch cord or a router blowing a fuse. Doesn't this mean /any/ hardware might have this problem, albeit with much lower probability? The problem is assumed *ordering*. In this respect VB ignoring flushes and real hardware are not going to behave the same. --Toby I agree that no one should be ignoring cache flushes. However the path to corruption must involve some dropped acknowledged I/Os. The uberblock I/O was issued to stable storage, but the blocks it pointed to, which had reached the disk firmware earlier, never made it to stable storage. I can see this scenario when the disk loses power, but I don't see it when cutting power to the guest. When managing a zpool on external storage, do people export the pool before taking snapshots of the guest? -r Thanks ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 19 July 2009 at 16:47, Bob Friesenhahn wrote: On Sun, 19 Jul 2009, Ross wrote: The success of any ZFS implementation is *very* dependent on the hardware you choose to run it on. To clarify: "The success of any filesystem implementation is *very* dependent on the hardware you choose to run it on." ZFS requires that the hardware cache sync works and is respected. Yes. Without taking advantage of the drive caches, zfs would be considerably less performant. That, I'm not so sure about. When ZFS first came out, most pools were built on thumpers with a SATA device driver that did not handle NCQ concurrency. Enabling the write cache on a drive was a necessary way to have the drive firmware handle multiple requests with small service times. Today we've got better device drivers, but we've stopped comparing performance data with on/off settings on the disk write caches. The delta today might be a lot smaller than it used to be (and even less noticeable if one uses a slog on SSD). -r Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
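If anyone wants to redo that on/off comparison, the per-drive write cache can usually be toggled from format's expert mode on Solaris; the menu names below are from memory and apply to SCSI/SAS targets, so treat them as approximate (SATA drives behind some HBAs may not expose the setting at all):

format -e                  # expert mode, then select the disk
# format> cache
# cache> write_cache
# write_cache> display     # show the current setting
# write_cache> disable     # or: enable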
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Have you considered this? *Maybe* a little time travel to an old uberblock could help you? http://www.opensolaris.org/jive/thread.jspa?threadID=85794 -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
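For the curious, that thread boils down to inspecting the uberblock ring with zdb and then importing against an older transaction group. A rough sketch (device and pool names are placeholders, and the recovery-mode import only exists in newer builds, if at all on your release):

zdb -l /dev/rdsk/c1t0d0s0    # dump the vdev labels, which hold the uberblock array
zdb -u tank                  # show the active uberblock / txg of an importable pool
zpool import -F tank         # recovery-mode import: discard the last few txgs (recent builds only)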
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Jul 28, 2009, at 6:34 PM, Eric D. Mudama wrote: On Mon, Jul 27 at 13:50, Richard Elling wrote: On Jul 27, 2009, at 10:27 AM, Eric D. Mudama wrote: Can *someone* please name a single drive+firmware or RAID controller+firmware that ignores FLUSH CACHE / FLUSH CACHE EXT commands? Or worse, responds "ok" when the flush hasn't occurred? two seconds with google shows http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=183771&NewLang=en&Hilite=cache+flush Give it up. These things happen. Not much you can do about it, other than design around it. -- richard That example is a windows-specific, and is a software driver, where the data integrity feature must be manually disabled by the end user. The default behavior was always maximum data protection. I don't think you read the post. It specifically says, "Previous versions of the Promise drivers ignored the flush cache command until system power down. " Promise makes RAID controllers and has a firmware fix for this. This is the kind of thing we face: some performance engineer tries to get an edge by assuming there is only one case where cache flush matters. Another 2 seconds with google shows: http://sunsolve.sun.com/search/document.do?assetkey=1-66-27-1 (interestingly, for this one, fsck also fails) http://sunsolve.sun.com/search/document.do?assetkey=1-21-103622-06-1 http://forums.seagate.com/stx/board/message?board.id=freeagent&message.id=5060&query.id=3999#M5060 But they also get cache flush code wrong in the opposite direction. A good example of that is the notorious Seagate 1.5 TB disk "stutter" problem. NB, for the most part, vendors do not air their dirty laundry (eg bug reports) on the internet for those without support contracts. If you have a support contract, your search may show many more cases. While perhaps analagous at some level, the perpetual "your hardware must be crappy/cheap/not-as-expensive-as-mine" doesn't seem to be a sufficient explanation when things go wrong, like complete loss of a pool. As I said before, it is a systems engineering problem. If you do your own systems engineering, then you should make sure the components you select work as you expect. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Hi James Many thanks for finding & posting that link. I'm sure many people on this forum will be interested in trying out Brad Fitzpatrick's perl script 'diskchecker.pl'. It will be interesting to hear their results. I've not yet had time to work out how Brad's script works. It would be good if others here could take a critical look at it and feed back their comments to the forum. I'm disappointed that I've not had a reply from someone at Sun to explain how they test their hard drives. We've had a few people here quick to claim that most hard drives fail to sync/flush correctly, but AFAIK no one is saying how they know this. Have they actually tested, and if so, how have they tested? Or do they just know because of bad experiences, having lost lots of data? Best Regards Nigel Smith -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
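For reference, here is roughly how Brad's script is meant to be driven, going from his write-up, so treat the exact arguments as approximate. It needs a second machine as an observer, and you hard power-off the machine under test between the create and verify steps:

./diskchecker.pl -l                                       # on the observer machine: listen
./diskchecker.pl -s observer-host create /test/file 500   # on the test machine: write and report every acknowledged sync
# ...pull the power on the test machine mid-run, boot it again, then:
./diskchecker.pl -s observer-host verify /test/file       # reports any block acknowledged as synced but lost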
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Nigel Smith wrote: > David Magda wrote: >> This is also (theoretically) why a drive purchased from Sun is more >> that expensive then a drive purchased from your neighbourhood computer >> shop: Sun (and presumably other manufacturers) takes the time and >> effort to test things to make sure that when a drive says "I've synced >> the data", it actually has synced the data. This testing is what >> you're presumably paying for. > > So how do you test a hard drive to check it does actually sync the data? > How would you do it in theory? > And in practice? > > Now say we are talking about a virtual hard drive, > rather than a physical hard drive. > How would that affect the answer to the above questions? http://brad.livejournal.com/2116715.html has a utility that can be used to test if your systems (including virtual ones) properly sync data to disk when asked to. -- James Andrewartha ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Mon, Jul 27 at 13:50, Richard Elling wrote: On Jul 27, 2009, at 10:27 AM, Eric D. Mudama wrote: Can *someone* please name a single drive+firmware or RAID controller+firmware that ignores FLUSH CACHE / FLUSH CACHE EXT commands? Or worse, responds "ok" when the flush hasn't occurred? two seconds with google shows http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=183771&NewLang=en&Hilite=cache+flush Give it up. These things happen. Not much you can do about it, other than design around it. -- richard That example is Windows-specific, and is a software driver, where the data integrity feature must be manually disabled by the end user. The default behavior was always maximum data protection. While perhaps analogous at some level, the perpetual "your hardware must be crappy/cheap/not-as-expensive-as-mine" doesn't seem to be a sufficient explanation when things go wrong, like complete loss of a pool. -- Eric D. Mudama edmud...@mail.bounceswoosh.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
> This is also (theoretically) why a drive purchased > from Sun is more > that expensive then a drive purchased from your > neighbourhood computer > shop: It's more significant than that. Drives aimed at the consumer market are at a competitive disadvantage if they do handle cache flush correctly (since the popular hardware blog of the day will show that the device is far slower than the competitors that throw away the sync requests). Sun (and presumably other manufacturers) takes > the time and > effort to test things to make sure that when a drive > says "I've synced > the data", it actually has synced the data. This > testing is what > you're presumably paying for. It wouldn't cost any more for commercial vendors to implement cache flush properly, it is just that they are penalized by the market for doing so. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
> > Can *someone* please name a single drive+firmware or > RAID > controller+firmware that ignores FLUSH CACHE / FLUSH > CACHE EXT > commands? Or worse, responds "ok" when the flush > hasn't occurred? I think it would be a shorter list if one were to name the drives/controllers that actually implement a flush properly. > Everyone on this list seems to blame lying hardware > for ignoring > commands, but disks are relatively mature and I can't > believe that > major OEMs would qualify disks or other hardware that > willingly ignore > commands. It seems you have too much faith in major OEMs of storage, considering that 99.9% of the market is personal use, and for which a 2% throughput advantage over a competitor can make or break the profit margin on a device. Ignoring cache requests is guaranteed to get the best drive performance benchmarks regardless of what software is driving the device. For example, it is virtually impossible to find a USB drive that honors cache sync (to do so would require that the device stop completely until a fully synchronous USB transaction had made it to the device and the data had been written). Can you imagine how long a USB drive would sit on store shelves if it actually did do a proper cache sync? While USB is the extreme case, and it does get better the more expensive the drive is, it is still far from a given that any particular device properly handles cache flushes. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
I think people can understand the concept of missing flushes. The big conceptual problem is how this manages to hose an entire filesystem, which is assumed to have rather a lot of data which ZFS has already verified to be ok. Hardware ignoring flushes and losing recent data is understandable, I don't think anybody would argue with that. Losing access to your entire pool and multiple gigabytes of data because a few writes failed is a whole different story, and while I understand how it happens, ZFS appears to be unique among modern filesystems in suffering such a catastrophic failure so often. To give a quick personal example: I can plug a FAT32 USB disk into a Windows system, drag some files to it, and pull that drive at any point. I might lose a few files, but I've never lost the entire filesystem. Even if the absolute worst happened, I know I can run scandisk, chkdsk, or any number of file recovery tools and get my data back. I would never, ever attempt this with ZFS. For a filesystem like ZFS, whose integrity and stability are sold as being way better than existing filesystems, losing your entire pool is a bit of a shock. I know that work is going on to be able to recover pools, and I'll sleep a lot sounder at night once it is available. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 27-Jul-09, at 3:44 PM, Frank Middleton wrote: On 07/27/09 01:27 PM, Eric D. Mudama wrote: Everyone on this list seems to blame lying hardware for ignoring commands, but disks are relatively mature and I can't believe that major OEMs would qualify disks or other hardware that willingly ignore commands. You are absolutely correct, but if the cache flush command never makes it to the disk, then it won't see it. The contention is that by not relaying the cache flush to the disk, No - by COMPLETELY ignoring the flush. VirtualBox caused the OP to lose his pool. IMO this argument is bogus because AFAIK the OP didn't actually power his system down, so the data would still have been in the cache, and presumably have eventually have been written. The out-of-order writes theory is also somewhat dubious, since he was able to write 10TB without VB relaying the cache flushes. Huh? Of course he could. The guest didn't crash while he was doing it! The corruption occurred when the guest crashed (iirc). And the "out of order theory" need not be the *only possible* explanation, but it *is* sufficient. This is all highly hardware dependant, Not in the least. It's a logical problem. and AFAIK no one ever asked the OP what hardware he had, instead, blasting him for running VB on MSWindows. Which is certainly not relevant to my hypothesis of what broke. I don't care what host he is running. The argument is the same for all. Since IIRC he was using raw disk access, it is questionable whether or not MS was to blame, but in general it simply shouldn't be possible to lose a pool under any conditions. How about "when flushes are ignored"? It does raise the question of what happens in general if a cache flush doesn't happen if, for example, a system crashes in such a way that it requires a power cycle to restart, and the cache never gets flushed. Previous explanations have not dented your misunderstanding one iota. The problem is not that an attempted flush did not complete. It was that any and all flushes *prior to crash* were ignored. This is where the failure mode diverges from real hardware. Again, look: A B C FLUSH D E F FLUSH Note that it does not matter *at all* whether the 2nd flush completed. What matters from an integrity point of view is that the *previous* flush was completed (and synchronously). Visualise this on the two scenarios: 1) real hardware: (barring actual defects) that A,B,C were written was guaranteed by the first flush (otherwise D would never have been issued). Integrity of system is intact regardless of whether the 2nd flush completed. 2) VirtualBox: flush never happened. Integrity of system is lost, or at best unknown, if it depends on A,B,C all completing before D. ... Of course the ZIL isn't a journal in the traditional sense, and AFAIK it has no undo capability the way that a DBMS usually has, but it needs to be structured so that bizarre things that happen when something as robust as Solaris crashes don't cause data loss. A lot of engineering effort has been expended in UFS and ZFS to achieve just that. Which is why it's so nutty to undermine that by violating semantics in lower layers. The nightmare scenario is when one disk of a mirror begins to fail and the system comes to a grinding halt where even stop-a doesn't respond, and a power cycle is the only way out. Who knows what writes may or may not have been issued or what the state of the disk cache might be at such a time. Again, if the flush semantics are respected*, this is not a problem. 
--Toby * - "When this operation completes, previous writes are verifiably on durable media**." ** - Durable media meaning physical media in a bare metal environment, and potentially "virtual media" in a virtualised environment. -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
David Magda wrote: > This is also (theoretically) why a drive purchased from Sun is more > that expensive then a drive purchased from your neighbourhood computer > shop: Sun (and presumably other manufacturers) takes the time and > effort to test things to make sure that when a drive says "I've synced > the data", it actually has synced the data. This testing is what > you're presumably paying for. So how do you test a hard drive to check it does actually sync the data? How would you do it in theory? And in practice? Now say we are talking about a virtual hard drive, rather than a physical hard drive. How would that affect the answer to the above questions? Thanks Nigel -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 27-Jul-09, at 15:14, David Magda wrote: Also, I think it may have already been posted, but I haven't found the option to disable VirtualBox' disk cache. Anyone have the incantation handy? http://forums.virtualbox.org/viewtopic.php?f=8&t=13661&start=0 It tells VB not to ignore the sync/flush command. Caching is still enabled (it wasn't the problem). Thanks! As Russell points out in the last post to that thread, it doesn't seem possible to do this with virtual SATA disks? Odd. A. -- Adam Sherman CTO, Versature Corp. Tel: +1.877.498.3772 x113 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Jul 27, 2009, at 10:27 AM, Eric D. Mudama wrote: On Sun, Jul 26 at 1:47, David Magda wrote: On Jul 25, 2009, at 16:30, Carson Gaspar wrote: Frank Middleton wrote: Doesn't this mean /any/ hardware might have this problem, albeit with much lower probability? No. You'll lose unwritten data, but won't corrupt the pool, because the on-disk state will be sane, as long as your iSCSI stack doesn't lie about data commits or ignore cache flush commands. But this entire thread started because Virtual Box's virtual disk / did/ lie about data commits. Why is this so difficult for people to understand? Because most people make the (not unreasonable assumption) that disks save data the way that they're supposed to: that the data goes in is the data that comes out, and that when the OS tells them to empty the buffer that they actually flush it. It's only us storage geeks that generally know the ugly truth that this assumption is not always true. :) Can *someone* please name a single drive+firmware or RAID controller+firmware that ignores FLUSH CACHE / FLUSH CACHE EXT commands? Or worse, responds "ok" when the flush hasn't occurred? two seconds with google shows http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=183771&NewLang=en&Hilite=cache+flush Give it up. These things happen. Not much you can do about it, other than design around it. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 07/27/09 01:27 PM, Eric D. Mudama wrote: Everyone on this list seems to blame lying hardware for ignoring commands, but disks are relatively mature and I can't believe that major OEMs would qualify disks or other hardware that willingly ignore commands. You are absolutely correct, but if the cache flush command never makes it to the disk, then it won't see it. The contention is that by not relaying the cache flush to the disk, VirtualBox caused the OP to lose his pool. IMO this argument is bogus because AFAIK the OP didn't actually power his system down, so the data would still have been in the cache, and presumably would eventually have been written. The out-of-order writes theory is also somewhat dubious, since he was able to write 10TB without VB relaying the cache flushes. This is all highly hardware dependent, and AFAIK no one ever asked the OP what hardware he had, instead blasting him for running VB on MS Windows. Since IIRC he was using raw disk access, it is questionable whether or not MS was to blame, but in general it simply shouldn't be possible to lose a pool under any conditions. It does raise the question of what happens in general if a cache flush doesn't happen if, for example, a system crashes in such a way that it requires a power cycle to restart, and the cache never gets flushed. Do disks with volatile caches attempt to flush the cache by themselves if they detect power down? It seems that the ZFS team recognizes this as a problem, hence the CR to address it. It turns out (at least on this almost 4-year-old blog) http://blogs.sun.com/perrin/entry/the_lumberjack that the ZILs /are/ allocated recursively from the main pool. Unless there is a ZIL for the ZILs, ZFS really isn't fully journalled, and this could be the real explanation for all lost pools and/or file systems. It would be great to hear from the ZFS team that writing a ZIL, presumably a transaction in its own right, is protected somehow (by a ZIL for the ZILs?). Of course the ZIL isn't a journal in the traditional sense, and AFAIK it has no undo capability the way that a DBMS usually has, but it needs to be structured so that bizarre things that happen when something as robust as Solaris crashes don't cause data loss. The nightmare scenario is when one disk of a mirror begins to fail and the system comes to a grinding halt where even stop-a doesn't respond, and a power cycle is the only way out. Who knows what writes may or may not have been issued or what the state of the disk cache might be at such a time. -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Mon, July 27, 2009 13:59, Adam Sherman wrote: > Also, I think it may have already been posted, but I haven't found the > option to disable VirtualBox' disk cache. Anyone have the incantation > handy? http://forums.virtualbox.org/viewtopic.php?f=8&t=13661&start=0 It tells VB not to ignore the sync/flush command. Caching is still enabled (it wasn't the problem). ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
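For the archives, the incantation from that thread is a per-VM extradata key, roughly as below (from memory, so double-check against the forum post); the controller name and LUN number are the parts most likely to differ on your setup, and it is unclear whether the SATA/ahci device honours the same key:

# tell VirtualBox to stop ignoring flush requests on the first IDE disk of the VM "opensolaris"
VBoxManage setextradata "opensolaris" "VBoxInternal/Devices/piix3ide/0/LUN#0/Config/IgnoreFlush" 0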
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Mon, Jul 27, 2009 at 12:54 PM, Chris Ridd wrote: > > On 27 Jul 2009, at 18:49, Thomas Burgess wrote: > >> >> i was under the impression it was virtualbox and it's default setting that >> ignored the command, not the hard drive > > Do other virtualization products (eg VMware, Parallels, Virtual PC) have the > same default behaviour as VirtualBox? I've lost a pool due to LDoms doing the same. This bug seems to be related. http://bugs.opensolaris.org/view_bug.do?bug_id=6684721 -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 27-Jul-09, at 13:54 , Chris Ridd wrote: i was under the impression it was virtualbox and it's default setting that ignored the command, not the hard drive Do other virtualization products (eg VMware, Parallels, Virtual PC) have the same default behaviour as VirtualBox? I've a suspicion they all behave similarly dangerously, but actual data would be useful. Also, I think it may have already been posted, but I haven't found the option to disable VirtualBox' disk cache. Anyone have the incantation handy? Thanks, A -- Adam Sherman CTO, Versature Corp. Tel: +1.877.498.3772 x113 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 27 Jul 2009, at 18:49, Thomas Burgess wrote: i was under the impression it was virtualbox and it's default setting that ignored the command, not the hard drive Do other virtualization products (eg VMware, Parallels, Virtual PC) have the same default behaviour as VirtualBox? I've a suspicion they all behave similarly dangerously, but actual data would be useful. Cheers, Chris ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
i was under the impression it was virtualbox and it's default setting that ignored the command, not the hard drive On Mon, Jul 27, 2009 at 1:27 PM, Eric D. Mudama wrote: > On Sun, Jul 26 at 1:47, David Magda wrote: > >> >> On Jul 25, 2009, at 16:30, Carson Gaspar wrote: >> >> Frank Middleton wrote: >>> >>> Doesn't this mean /any/ hardware might have this problem, albeit with much lower probability? >>> >>> No. You'll lose unwritten data, but won't corrupt the pool, because the >>> on-disk state will be sane, as long as your iSCSI stack doesn't lie about >>> data commits or ignore cache flush commands. >>> >> >> But this entire thread started because Virtual Box's virtual disk / >> did/ lie about data commits. >> >> Why is this so difficult for people to understand? >>> >> >> Because most people make the (not unreasonable assumption) that disks save >> data the way that they're supposed to: that the data goes in is the data >> that comes out, and that when the OS tells them to empty the buffer that >> they actually flush it. >> >> It's only us storage geeks that generally know the ugly truth that this >> assumption is not always true. :) >> > > Can *someone* please name a single drive+firmware or RAID > controller+firmware that ignores FLUSH CACHE / FLUSH CACHE EXT > commands? Or worse, responds "ok" when the flush hasn't occurred? > > Everyone on this list seems to blame lying hardware for ignoring > commands, but disks are relatively mature and I can't believe that > major OEMs would qualify disks or other hardware that willingly ignore > commands. > > --eric > > -- > Eric D. Mudama > edmud...@mail.bounceswoosh.org > > > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Sun, Jul 26 at 1:47, David Magda wrote: On Jul 25, 2009, at 16:30, Carson Gaspar wrote: Frank Middleton wrote: Doesn't this mean /any/ hardware might have this problem, albeit with much lower probability? No. You'll lose unwritten data, but won't corrupt the pool, because the on-disk state will be sane, as long as your iSCSI stack doesn't lie about data commits or ignore cache flush commands. But this entire thread started because Virtual Box's virtual disk / did/ lie about data commits. Why is this so difficult for people to understand? Because most people make the (not unreasonable assumption) that disks save data the way that they're supposed to: that the data goes in is the data that comes out, and that when the OS tells them to empty the buffer that they actually flush it. It's only us storage geeks that generally know the ugly truth that this assumption is not always true. :) Can *someone* please name a single drive+firmware or RAID controller+firmware that ignores FLUSH CACHE / FLUSH CACHE EXT commands? Or worse, responds "ok" when the flush hasn't occurred? Everyone on this list seems to blame lying hardware for ignoring commands, but disks are relatively mature and I can't believe that major OEMs would qualify disks or other hardware that willingly ignore commands. --eric -- Eric D. Mudama edmud...@mail.bounceswoosh.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Heh, I'd kill for failures to be handled in 2 or 3 seconds. I saw the failure of a mirrored iSCSI disk lock the entire pool for 3 minutes. That has been addressed now, but device hangs have the potential to be *very* disruptive. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
> That's only one element of it Bob. ZFS also needs > devices to fail quickly and in a predictable manner. > > A consumer grade hard disk could lock up your entire > pool as it fails. The kit Sun supply is more likely > to fail in a manner ZFS can cope with. I agree 100%. Hardware, firmware, and drivers should be fully integrated for a mission-critical app. With the wrong firmware and consumer-grade HDs, disk failures stall the entire pool. I have experience with disks failing and taking two or three seconds for the system to cope with (not just ZFS, but the controller, etc). Leal. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 26-Jul-09, at 11:08 AM, Frank Middleton wrote: On 07/25/09 04:30 PM, Carson Gaspar wrote: No. You'll lose unwritten data, but won't corrupt the pool, because the on-disk state will be sane, as long as your iSCSI stack doesn't lie about data commits or ignore cache flush commands. Why is this so difficult for people to understand? Let me create a simple example for you. Are you sure about this example? AFAIK metadata refers to things like the file's name, atime, ACLs, etc., etc. Your example seems to be more about how a journal works, which has little to do with metatdata other than to manage it. Now if you were too lazy to bother to follow the instructions properly, we could end up with bizarre things. This is what happens when storage lies and re-orders writes across boundaries. On 07/25/09 07:34 PM, Toby Thain wrote: The problem is assumed *ordering*. In this respect VB ignoring flushes and real hardware are not going to behave the same. Why? An ignored flush is ignored. It may be more likely in VB, but it can always happen. And whenever it does: guess what happens? It mystifies me that VB would in some way alter the ordering. Carson already went through a more detailed explanation. Let me try a different one: ZFS issues writes A, B, C, FLUSH, D, E, F. case 1) the semantics of the flush* allow ZFS to presume that A, B, C are all 'committed' at the point that D is issued. You can understand that A, B, C may be done in any order, and D, E, F may be done in any order, due to the numerous abstraction layers involved - all the way down to the disk's internal scheduling. ANY of these layers can affect the ordering of durable, physical writes _in the absence of a flush/barrier_. case 2) but if the flush does NOT occur with the necessary semantics, the ordering of ALL SIX operations is now indeterminate, and by the time ZFS issues D, any of the first 3 (A, B, C) may well not have been committed at all. There is a very good chance this will violate an integrity assumption (I haven't studied the source so I can't point you to a specific design detail or line; rather I am working from how I understand transactional/journaled systems to work. Assuming my argument is valid, I am sure a ZFS engineer can cite a specific violation). As has already been mentioned in this context, I think by David Magda, ordinary hardware will show this problem _if flushes are not functioning_ (an unusual case on bare metal), while on VirtualBox this is the default! ... Doesn't ZIL effectively make ZFS into a journalled file system Of course ZFS is transactional, as are other filesystems and software systems, such as RDBMS. But integrity of such systems depends on a hardware flush primitive that actually works. We are getting hoarse repeating this. --Toby * Essentially 'commit' semantics: Flush synchronously, operation is complete only when data is durably stored. ... ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Sun, 26 Jul 2009, David Magda wrote: That's the whole point of this thread: what should happen, or what should the file system do, when the drive (real or virtual) lies about the syncing? It's just as much a problem with any other POSIX file system (which have to deal with fsync(2))--ZFS isn't that special in that regard. The Linux folks went through a protracted debate on a similar issue not too long ago: Zfs is pretty darn special. RAIDed disk setups under Linux or *BSD work differently than zfs in a rather big way. Consider that with a normal software-based RAID setup, you use OS tools to create a virtual RAIDed device (LUN) which appears as a large device that you can then create (e.g. mkfs) a traditional filesystem on top of. Zfs works quite differently in that it uses a pooled design which incorporates several RAID strategies directly. Instead of sending the data to a virtual device which then arranges the underlying data according to a policy (striping, mirror, RAID5), zfs incorporates knowledge of the vdev RAID strategy and intelligently issues data to the disks in an ideal order, executing the disk drive commit requests directly. Zfs removes the RAID obfuscation which exists in traditional RAID systems. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
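A concrete way to see the difference Bob describes (device names below are placeholders): with the traditional stack you build an opaque RAID LUN first and then put a filesystem on top of it, whereas with ZFS the redundancy is declared inside the pool and the filesystem issues and commits I/O with that knowledge:

# traditional layering, with Solaris Volume Manager as the example
metainit d11 1 1 c1t0d0s0       # submirror on the first disk
metainit d12 1 1 c1t1d0s0       # submirror on the second disk
metainit d10 -m d11             # create the mirror from one half
metattach d10 d12               # attach the second half
newfs /dev/md/rdsk/d10          # then put UFS on top of the opaque LUN

# ZFS: the redundancy policy lives in the pool itself, no intermediate LUN
zpool create tank mirror c1t0d0 c1t1d0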
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 07/25/09 04:30 PM, Carson Gaspar wrote: No. You'll lose unwritten data, but won't corrupt the pool, because the on-disk state will be sane, as long as your iSCSI stack doesn't lie about data commits or ignore cache flush commands. Why is this so difficult for people to understand? Let me create a simple example for you. Are you sure about this example? AFAIK metadata refers to things like the file's name, atime, ACLs, etc., etc. Your example seems to be more about how a journal works, which has little to do with metadata other than to manage it. Now if you were too lazy to bother to follow the instructions properly, we could end up with bizarre things. This is what happens when storage lies and re-orders writes across boundaries. On 07/25/09 07:34 PM, Toby Thain wrote: The problem is assumed *ordering*. In this respect VB ignoring flushes and real hardware are not going to behave the same. Why? An ignored flush is ignored. It may be more likely in VB, but it can always happen. It mystifies me that VB would in some way alter the ordering. I wonder if the OP could tell us what actual disks and controller he used to see if the hardware might actually have done out-of-order writes despite the fact that ZFS already does write optimization. Maybe the disk didn't like the physical location of the log relative to the data so it wrote the data first? Even then it isn't obvious why this would cause the pool to be lost. A traditional journalling file system should survive the loss of a flush. Either the log entry was written or it wasn't. Even if the disk, for some bizarre reason, writes some of the actual data before writing the log, the repair process should undo that. If written properly, it will use the information in the most current complete journal entry to repair the file system. Doing syncs is devastating to performance, so usually there's an option to disable them, at the known risk of losing a lot more data. I've been using SPARCs and Solaris from the beginning. Ever since UFS supported journalling, I've never lost a file unless the disk went totally bad, and none since mirroring. Didn't miss fsck either :-) Doesn't ZIL effectively make ZFS into a journalled file system (in another thread, Bob Friesenhahn says it isn't, but I would submit that the general opinion is correct that it is; "log" and "journal" have similar semantics)? The evil tuning guide is pretty emphatic about not disabling it! My intuition (and this is entirely speculative) is that the ZFS ZIL either doesn't contain everything needed to restore the superstructure, or that if it does, the recovery process is ignoring it. I think I read that the ZIL is per-file system, but one hopes it doesn't rely on the superstructure recursively, or this would be impossible to fix (maybe there's a ZIL for the ZILs :) ). On 07/21/09 11:53 AM, George Wilson wrote: We are working on the pool rollback mechanism and hope to have that soon. The ZFS team recognizes that not all hardware is created equal and thus the need for this mechanism. We are using the following CR as the tracker for this work: 6667683 need a way to rollback to an uberblock from a previous txg so maybe this discussion is moot :-) -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Jul 25, 2009, at 15:32, Frank Middleton wrote: Can you comment on if/how mirroring or raidz mitigates this, or tree corruption in general? I have yet to lose a pool even on a machine with fairly pathological problems, but it is mirrored (and copies=2). Presumably at least one of the drives in the mirror or RAID set would have the correct data or non-corrupted data structures. There was a thread a while back on the risks involved in a SAN LUN (served from something like an EMC array), and whether you could trust the array or whether you should mirror LUNs. (I think the consensus was it was best to mirror LUNs--even from SANs, which presumably are more reliable than consumer SATA drives). I was also wondering if you could explain why the ZIL can't repair such damage. Beyond my knowledge. Finally, a number of posters blamed VB for ignoring a flush, but according to the evil tuning guide, without any application syncs, ZFS may wait up to 5 seconds before issuing a synch, and there Yes, it will sync every 5 to 30 seconds, but how do you know the data is actually synced?! If the five second timer triggers and ZFS says "okay, time to sync", and goes through the proper procedures, what happens if the drive lies about the sync operation? What then? That's the whole point of this thread: what should happen, or what should the file system do, when the drive (real or virtual) lies about the syncing? It's just as much a problem with any other POSIX file system (which have to deal with fsync(2))--ZFS isn't that special in that regard. The Linux folks went through a protracted debate on a similar issue not too long ago: http://thunk.org/tytso/blog/2009/03/15/dont-fear-the-fsync/ http://lwn.net/Articles/322823/ tripping over a patch cord or a router blowing a fuse. Doesn't this mean /any/ hardware might have this problem, albeit with much lower probability? Yes, which is why it's always recommended to have redundancy in your configuration (mirroring or RAID-Z). This way, hopefully, at least one drive is in a consistent state. This is also (theoretically) why a drive purchased from Sun is more expensive than a drive purchased from your neighbourhood computer shop: Sun (and presumably other manufacturers) takes the time and effort to test things to make sure that when a drive says "I've synced the data", it actually has synced the data. This testing is what you're presumably paying for. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Jul 25, 2009, at 16:30, Carson Gaspar wrote: Frank Middleton wrote: Doesn't this mean /any/ hardware might have this problem, albeit with much lower probability? No. You'll lose unwritten data, but won't corrupt the pool, because the on-disk state will be sane, as long as your iSCSI stack doesn't lie about data commits or ignore cache flush commands. But this entire thread started because Virtual Box's virtual disk /did/ lie about data commits. Why is this so difficult for people to understand? Because most people make the (not unreasonable) assumption that disks save data the way they're supposed to: that the data that goes in is the data that comes out, and that when the OS tells them to empty the buffer they actually flush it. It's only us storage geeks that generally know the ugly truth that this assumption is not always true. :) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 25-Jul-09, at 3:32 PM, Frank Middleton wrote: On 07/25/09 02:50 PM, David Magda wrote: Yes, it can be affected. If the snapshot's data structure / record is underneath the corrupted data in the tree then it won't be able to be reached. Can you comment on if/how mirroring or raidz mitigates this, or tree corruption in general? I have yet to lose a pool even on a machine with fairly pathological problems, but it is mirrored (and copies=2). I was also wondering if you could explain why the ZIL can't repair such damage. Finally, a number of posters blamed VB for ignoring a flush, but according to the evil tuning guide, without any application syncs, ZFS may wait up to 5 seconds before issuing a synch, and there must be all kinds of failure modes even on bare hardware where it never gets a chance to do one at shutdown. This is interesting if you do ZFS over iscsi because of the possibility of someone tripping over a patch cord or a router blowing a fuse. Doesn't this mean /any/ hardware might have this problem, albeit with much lower probability? The problem is assumed *ordering*. In this respect VB ignoring flushes and real hardware are not going to behave the same. --Toby Thanks ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Frank Middleton wrote: Finally, a number of posters blamed VB for ignoring a flush, but according to the evil tuning guide, without any application syncs, ZFS may wait up to 5 seconds before issuing a synch, and there must be all kinds of failure modes even on bare hardware where it never gets a chance to do one at shutdown. This is interesting if you do ZFS over iscsi because of the possibility of someone tripping over a patch cord or a router blowing a fuse. Doesn't this mean /any/ hardware might have this problem, albeit with much lower probability? No. You'll lose unwritten data, but won't corrupt the pool, because the on-disk state will be sane, as long as your iSCSI stack doesn't lie about data commits or ignore cache flush commands. Why is this so difficult for people to understand? Let me create a simple example for you. Get yourself 4 small pieces of paper, and number them 1 through 4. On piece 1, write "Four" (app write disk A) On piece 2, write "Score" (app write disk B) Place pieces 1 and 2 together on the side (metadata write, cache flush) On piece 3, write "Every" (app overwrite disk A) On piece 4, write "Good" (app overwrite disk B) Place pieces 3 and 4 on top of pieces 1 and 2 (metadata write, cache flush) IFF you obeyed the instructions, the only things you could ever have on the side are nothing, "Four Score", or "Every Good" (we assume that side placement is atomic). You could get killed after writing something on pieces 3 or 4, and lose them, but you could never have garbage. Now if you were too lazy to bother to follow the instructions properly, we could end up with bizarre things. This is what happens when storage lies and re-orders writes across boundaries. -- Carson ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 07/25/09 02:50 PM, David Magda wrote: Yes, it can be affected. If the snapshot's data structure / record is underneath the corrupted data in the tree then it won't be able to be reached. Can you comment on if/how mirroring or raidz mitigates this, or tree corruption in general? I have yet to lose a pool even on a machine with fairly pathological problems, but it is mirrored (and copies=2). I was also wondering if you could explain why the ZIL can't repair such damage. Finally, a number of posters blamed VB for ignoring a flush, but according to the evil tuning guide, without any application syncs, ZFS may wait up to 5 seconds before issuing a synch, and there must be all kinds of failure modes even on bare hardware where it never gets a chance to do one at shutdown. This is interesting if you do ZFS over iscsi because of the possibility of someone tripping over a patch cord or a router blowing a fuse. Doesn't this mean /any/ hardware might have this problem, albeit with much lower probability? Thanks ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
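For anyone wondering what a setup like Frank's looks like in practice, a minimal sketch (disk names invented): the mirror protects against a failed device, and copies=2 additionally stores two copies of each block in that dataset, so single-block damage can often be self-healed even within one side of the mirror (it only applies to data written after the property is set):

zpool create tank mirror c1t0d0 c1t1d0   # two-way mirror
zfs create tank/home
zfs set copies=2 tank/home               # keep two copies of every block written from now on
zpool scrub tank                         # periodically verify all checksums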
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Jul 25, 2009, at 14:17, roland wrote: thanks for the explanation ! one more question: there are situations where the disks doing strange things (like lying) have caused the ZFS data structures to become wonky. The 'broken' data structure will cause all branches underneath it to be lost--and if it's near the top of the tree, it could mean a good portion of the pool is inaccessible. can snapshots also be affected by such issue or are they somewhat "immune" here? Yes, it can be affected. If the snapshot's data structure / record is underneath the corrupted data in the tree then it won't be able to be reached. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Thanks for the explanation! One more question: > there are situations where the disks doing strange things >(like lying) have caused the ZFS data structures to become wonky. The >'broken' data structure will cause all branches underneath it to be >lost--and if it's near the top of the tree, it could mean a good >portion of the pool is inaccessible. Can snapshots also be affected by such an issue, or are they somewhat "immune" here? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Jul 25, 2009, at 12:24, roland wrote: why can I lose a whole 10TB pool including all the snapshots with the logging/transactional nature of zfs? Because ZFS does not (yet) have an (easy) way to go back to a previous state. That's what this bug is about: need a way to rollback to an uberblock from a previous txg http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6667683 While in most cases ZFS will cleanly recover after a non-clean shutdown, there are situations where the disks doing strange things (like lying) have caused the ZFS data structures to become wonky. The 'broken' data structure will cause all branches underneath it to be lost--and if it's near the top of the tree, it could mean a good portion of the pool is inaccessible. Fixing the above bug should hopefully allow users / sysadmins to tell ZFS to go 'back in time' and look up previous versions of the data structures. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
>As soon as you have more than one disk in the equation, then it is >vital that the disks commit their data when requested since otherwise >the data on disk will not be in a consistent state. ok, but doesn't that refer only to the most recent data? why can i lose a whole 10TB pool including all the snapshots with the logging/transactional nature of zfs? isn't the data in the snapshots set to read only so all blocks with snapshotted data don't change over time (and thus give a secure "entry" to a consistent point in time)? ok, these are probably some short-sighted questions, but i'm trying to understand how things could go wrong with zfs and how issues like these happen. on other filesystems, we have tools for fsck as a last resort or tools to recover data from unmountable filesystems. with zfs i don't know of any of these, so it's that "will solaris mount my zfs after the next crash?" question which frightens me a little bit. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Sat, 25 Jul 2009, roland wrote: When that happens, ZFS believes the data is safely written, but a power cut or crash can cause severe problems with the pool. didn't i read a million times that zfs ensures an "always consistent state" and is self healing, too? so, if new blocks are always written at new positions - why can't we just roll back to a point in time (for example the last snapshot) which is known to be safe/consistent? As soon as you have more than one disk in the equation, then it is vital that the disks commit their data when requested since otherwise the data on disk will not be in a consistent state. If the disks simply do whatever they want then some disks will have written the data while other disks will still have it cached. This blows the "consistent state on disk" even though zfs wrote the data in order and did all the right things. Any uncommitted data in disk cache will be forgotten if the system loses power. There is an additional problem if, when the disks finally get around to writing the cached data, they write it in a different order than requested while ignoring the commit request. It is common for the disks to write data in the most efficient order, but they absolutely must commit all of the data when requested so that the checkpoint is valid. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
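If you want to know whether a particular drive in your pool is running with its volatile write cache turned on, the expert mode of format(1M) can show and toggle it on many controllers (not all; SATA disks behind some HBAs will not expose the menu). A rough interactive sketch, with the disk name being just an example:

format -e
(pick the disk, e.g. c1t2d0)
format> cache
cache> write_cache
write_cache> display
write_cache> disable     (only if you are prepared to trade performance for safety)

Note that when ZFS is given whole disks it enables the write cache itself and relies on issuing cache flushes at the right moments, so the real question is whether everything between ZFS and the platters honours those flushes.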
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
>Running this kind of setup absolutely can give you NO guarantees at all. >Virtualisation, OSOL/zfs on WinXP. It's nice to play with and see it >"working" but would I TRUST precious data to it? No way! why not? if i write some data through the virtualization layer which goes straight through to the raw disk - what's the problem? do a snapshot and you can be sure you have a safe state. or not? you can check if you are consistent by doing a scrub. or not? taking buffers/caches into consideration, you could possibly lose some seconds/minutes of work, but doesn't zfs use a transactional design which ensures consistency? so, how can what's being reported here happen, if zfs takes so much care of consistency? >When that happens, ZFS believes the data is safely written, but a power cut or >crash can cause severe problems with the pool. didn't i read a million times that zfs ensures an "always consistent state" and is self healing, too? so, if new blocks are always written at new positions - why can't we just roll back to a point in time (for example the last snapshot) which is known to be safe/consistent? i don't give a shit about the last 5 minutes of work if i can recover my TB sized pool instead. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 07/21/09 01:21 PM, Richard Elling wrote: I never win the lottery either :-) Let's see. Your chance of winning a 49 ball lottery is apparently around 1 in 14*10^6, although it's much better than that because of submatches (smaller payoffs for matches on less than 6 balls). There are about 32*10^6 seconds in a year. If ZFS saves its writes for 30 seconds and batches them out, that means 1 write leaves the buffer exposed for roughly one millionth of a year. If you have 4GB of memory, you might get 50 errors a year, but you say ZFS uses only 1/10 of this for writes, so that memory could see 5 errors/year. If your single write was 1/70th of that (say around 6 MB), your chance of a hit is around 5/70/10^-6 or 1 in 14*10^6, so you are correct! So if you do one 6MB write/year, your chances of a hit in a year are about the same as that of winning a grand slam lottery. Hopefully not every hit will trash a file or pool, but odds are that you'll do many more writes than that, so on the whole I think a ZFS hit is quite a bit more likely than winning the lottery each year :-). Conversely, if you average one big write every 3 minutes or so (20% occupancy), odds are almost certain that you'll get one hit a year. So some SOHO users who do far fewer writes won't see any hits (say) over a 5 year period. But some will, and they will be most unhappy -- calculate your odds and then make a decision! I daresay the PC makers have done this calculation, which is why PCs don't have ECC, and hence IMO make for insufficiently reliable servers. Conclusions from what I've gleaned from all the discussions here: if you are too cheap to opt for mirroring, your best bet is to disable checksumming and set copies=2. If you mirror but don't have ECC then at least set copies=2 and consider disabling checksums. Actually, set copies=2 regardless, so that you have some redundancy if one half of the mirror fails and you have a 10 hour resilver, in which time you could easily get a (real) disk read error. It seems to me some vendor is going to cotton onto the SOHO server problem and make a bundle at the right price point. Sun's offerings seem unfortunately mostly overkill for the SOHO market, although the X4140 looks rather interesting... Shame there aren't any entry level SPARCs any more :-(. Now what would doctors' front offices do if they couldn't blame the computer for being down all the time? It is quite simple -- ZFS sent the flush command and VirtualBox ignored it. Therefore the bits on the persistent store are consistent. But even on the most majestic of hardware, a flush command could be lost, could it not? An obvious case in point is ZFS over iscsi and a router glitch. But the discussion seems to be moot since CR 6667683 is being addressed. Now about those writes to mirrored disks :) Cheers -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
To All : The ECC discussion was very interesting as I had never considered it that way! I will be buying ECC memory for my home machine!! You have to make sure your mainboard, chipset and/or CPU support it, otherwise any ECC modules will just work like regular modules. The mainboard needs to have the necessary lanes to either the chipset that supports ECC (in the case of Intel) or the CPU (in the case of AMD). I think all Xeon chipsets do ECC, as do various consumer ones (I only know of X38/X48, there are also some 9xx ones that do). For consumer boards, it's hard to figure out which actually do support it. I have an X48-DQ6 mainboard from Gigabyte, which does it. Regards, -mg ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
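One way to check whether the box is actually running with ECC, rather than just having ECC DIMMs fitted, is to look at what the BIOS reports via SMBIOS. On an x86 Solaris/OpenSolaris system something along these lines should work; the exact field names vary by BIOS, so treat it as a sketch rather than a recipe:

smbios -t SMB_TYPE_MEMARRAY
(look at the ECC field of the physical memory array; "None" means the board is not doing ECC even if the modules themselves are ECC parts)

On a Linux live CD, dmidecode -t memory shows roughly the same information as "Error Correction Type".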
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Once these bits are available in OpenSolaris then users will be able to upgrade rather easily. This would allow you to take a liveCD running these bits and recover older pools. Do you currently have a pool which needs recovery? Thanks, George Alexander Skwar wrote: Hi. Good to know! But how do we deal with that on older systems, which don't have the patch applied, once it is out? Thanks, Alexander On Tuesday, July 21, 2009, George Wilson wrote: Russel wrote: OK. So do we have an zpool import --xtg 56574 mypoolname or help to do it (script?) Russel We are working on the pool rollback mechanism and hope to have that soon. The ZFS team recognizes that not all hardware is created equal and thus the need for this mechanism. We are using the following CR as the tracker for this work: 6667683 need a way to rollback to an uberblock from a previous txg Thanks, George ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Hi. Good to know! But how do we deal with that on older systems, which don't have the patch applied, once it is out? Thanks, Alexander On Tuesday, July 21, 2009, George Wilson wrote: > Russel wrote: > > OK. > > So do we have an zpool import --xtg 56574 mypoolname > or help to do it (script?) > > Russel > > > We are working on the pool rollback mechanism and hope to have that soon. The > ZFS team recognizes that not all hardware is created equal and thus the need > for this mechanism. We are using the following CR as the tracker for this > work: > > 6667683 need a way to rollback to an uberblock from a previous txg > > Thanks, > George > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > -- Alexander -- [[ http://zensursula.net ]] [ Soc. => http://twitter.com/alexs77 | http://www.plurk.com/alexs77 ] [ Mehr => http://zyb.com/alexws77 ] [ Chat => Jabber: alexw...@jabber80.com | Google Talk: a.sk...@gmail.com ] [ Mehr => AIM: alexws77 ] [ $[ $RANDOM % 6 ] = 0 ] && rm -rf / || echo 'CLICK!' ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Thanks for the feedback George. I hope we get the tools soon. At home I have now blown the ZFS pool away and am creating a HW raid-5 set :-( Hopefully in the future when the tools are there I will return to ZFS. To All : The ECC discussion was very interesting as I had never considered it that way! I will be buying ECC memory for my home machine!! Again many many thanks to all who have replied; it has been a very interesting and informative discussion for me. Best regards Russel -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Jul 20, 2009, at 12:48 PM, Frank Middleton wrote: On 07/19/09 06:10 PM, Richard Elling wrote: Not that bad. Uncommitted ZFS data in memory does not tend to live that long. Writes are generally out to media in 30 seconds. Yes, but memory hits are instantaneous. On a reasonably busy system there may be buffers in queue all the time. You may have a buffer in memory for 100uS but it only takes 1nS for that buffer to be clobbered. If that happened to be metadata about to be written to both sides of a mirror than you are toast. Good thing this never happens, right :-) I never win the lottery either :-) Beware, if you go down this path of thought for very long, you'll soon be afraid to get out of bed in the morning... wait... most people actually die in beds, so perhaps you'll be afraid to go to bed instead :-) Not at all. As with any rational business, my servers all have ECC, and getting up and out isn't a problem :-). Maybe I've had too many disks go bad, so I have ECC, mirrors, and backup to a system with ECC and mirrors (and copies=2, as well). Maybe I've read too many of your excellent blogs :-). Sun doesn't even sell machines without ECC. There's a reason for that. Yes, but all of the discussions in this thread can be classified as systems engineering problems, not product design problems. Not sure I follow. We've had this discussion before. OSOL+ZFS lets you build enterprise class systems on cheap hardware that has errors. ZFS gives the illusion of being fragile because it, uniquely, reports these errors. Running OSOL as a VM in VirtualBox using MSWanything as a host is a bit like building on sand, but there's nothing in documentation anywhere to even warn folks that they shouldn't rely on software to get them out of trouble on cheap hardware. ECC is just one (but essential) part of that. It is a systems engineering problem because ZFS is working as designed and VirtualBox is also working as designed. If you file a bug against either, the bug should be closed as "not a defect." That means the responsibility for making sure that the two interoperate lies at the systems level -- where systems engineers do their job. For an analogy, guns don't kill people, bullets kill people. The gun is just a platform for directing bullets. If you shoot yourself in the foot, then the failure is not with the gun or bullet, it is one layer above -- in the system. It hurts when you do that, so don't do that. On 07/19/09 08:29 PM, David Magda wrote: It's a nice-to-have, but at some point we're getting into the tinfoil hat-equivalent of data protection. But it is going to happen! Sun sells only machines with ECC because that is the only way to ensure reliability. Someone who spends weeks building a media server at home isn't going to be happy if they lose one media file let alone a whole pool. At least they should be warned that without ECC at some point they will lose files. I'm not convinced that there is any reasonable scenario for losing an entire pool though, which was the original complaint in this thread. Even trusty old SPARCs occasionally hang without a panic (in my experience especially when a disk is about to go bad). If this happens, and you have to power cycle because even stop-A doesn't respond, are you all saying that there is a risk of losing a pool at that point? Surely the whole point of a journalled file system is that it is pretty much proof against any catastrophe, even the one described initially. There have been a couple of (to me) unconvincing explanations of how this pool was lost. 
It is quite simple -- ZFS sent the flush command and VirtualBox ignored it. Therefore the bits on the persistent store are consistent. Surely if there is a mechanism whereby unflushed i/os can cause fatal metadata corruption, this should be a high priority bug since this can happen on /any/ hardware; it is just more likely if the foundations are shaky, so the explanation must require more than that if it isn't a bug. It isn't a bug in ZFS or VirtualBox. They work as designed. As has been mentioned before, many times, the recovery of the data is now a forensics exercise. All ZFS knows is that the consistency is broken, and it is implementing the policy that consistency is more important than automated access. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Russel wrote: OK. So do we have an zpool import --xtg 56574 mypoolname or help to do it (script?) Russel We are working on the pool rollback mechanism and hope to have that soon. The ZFS team recognizes that not all hardware is created equal and thus the need for this mechanism. We are using the following CR as the tracker for this work: 6667683 need a way to rollback to an uberblock from a previous txg Thanks, George ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
My understanding of the root cause of these issues is that the vast majority are happening with consumer grade hardware that is reporting to ZFS that writes have succeeded, when in fact they are still in the cache. When that happens, ZFS believes the data is safely written, but a power cut or crash can cause severe problems with the pool. This is (I think) the reason for comments about this being a system engineering, not design problem - ZFS assumes the disks are telling the truth and has been designed this way. It is up to the administrator to engineer the server from components that accurately report their status. However, while the majority of these cases are with consumer hardware, the BBC have reported that they hit this problem using Sun T2000 servers and commodity SATA drives, so unless somebody from Sun can say otherwise, I feel that there is still some risk of this occurring on Sun hardware. I feel the ZFS marketing and documentation is very misleading in that it completely ignores the issue of your entire pool being at risk unless you are careful about the hardware used, leading to a lot of stories like this from enthusiasts and early adopters. I also believe ZFS needs recovery tools as a matter of urgency, to protect its reputation if nothing else. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 07/19/09 06:10 PM, Richard Elling wrote: Not that bad. Uncommitted ZFS data in memory does not tend to live that long. Writes are generally out to media in 30 seconds. Yes, but memory hits are instantaneous. On a reasonably busy system there may be buffers in queue all the time. You may have a buffer in memory for 100uS but it only takes 1nS for that buffer to be clobbered. If that happened to be metadata about to be written to both sides of a mirror than you are toast. Good thing this never happens, right :-) Beware, if you go down this path of thought for very long, you'll soon be afraid to get out of bed in the morning... wait... most people actually die in beds, so perhaps you'll be afraid to go to bed instead :-) Not at all. As with any rational business, my servers all have ECC, and getting up and out isn't a problem :-). Maybe I've had too many disks go bad, so I have ECC, mirrors, and backup to a system with ECC and mirrors (and copies=2, as well). Maybe I've read too many of your excellent blogs :-). Sun doesn't even sell machines without ECC. There's a reason for that. Yes, but all of the discussions in this thread can be classified as systems engineering problems, not product design problems. Not sure I follow. We've had this discussion before. OSOL+ZFS lets you build enterprise class systems on cheap hardware that has errors. ZFS gives the illusion of being fragile because it, uniquely, reports these errors. Running OSOL as a VM in VirtualBox using MSWanything as a host is a bit like building on sand, but there's nothing in documentation anywhere to even warn folks that they shouldn't rely on software to get them out of trouble on cheap hardware. ECC is just one (but essential) part of that. On 07/19/09 08:29 PM, David Magda wrote: It's a nice-to-have, but at some point we're getting into the tinfoil hat-equivalent of data protection. But it is going to happen! Sun sells only machines with ECC because that is the only way to ensure reliability. Someone who spends weeks building a media server at home isn't going to be happy if they lose one media file let alone a whole pool. At least they should be warned that without ECC at some point they will lose files. I'm not convinced that there is any reasonable scenario for losing an entire pool though, which was the original complaint in this thread. Even trusty old SPARCs occasionally hang without a panic (in my experience especially when a disk is about to go bad). If this happens, and you have to power cycle because even stop-A doesn't respond, are you all saying that there is a risk of losing a pool at that point? Surely the whole point of a journalled file system is that it is pretty much proof against any catastrophe, even the one described initially. There have been a couple of (to me) unconvincing explanations of how this pool was lost. Surely if there is a mechanism whereby unflushed i/os can cause fatal metadata corruption, this should be a high priority bug since this can happen on /any/ hardware; it is just more likely if the foundations are shaky, so the explanation must require more than that if it isn't a bug. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 20-Jul-09, at 6:26 AM, Russel wrote: Well I did have a UPS on the machine :-) but the machine hung and I had to power it off... (yep it was virtual, but that happens on direct HW too, As has been discussed here before, the failure modes are different as the layer stack from filesystem to disk is obviously very different. --Toby and virtualisation is the happening thing at Sun and elsewhere! I have a version of the data backed up, but will take ages (10 days) to restore). -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
OK. So do we have an zpool import --xtg 56574 mypoolname or help to do it (script?) Russel -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
> the machine hung and I had to power it off. kinda getting off the "zpool import --tgx -3" request, but "hangs" are exceptionally rare and usually a RAM or other hardware issue; Solaris usually abends on software faults. r...@pdm # uptime 9:33am up 1116 day(s), 21:12, 1 user, load average: 0.07, 0.05, 0.05 r...@pdm # date Mon Jul 20 09:33:07 EDT 2009 r...@pdm # uname -a SunOS pdm 5.9 Generic_112233-12 sun4u sparc SUNW,Ultra-250 Rob ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Well I did have a UPS on the machine :-) but the machine hung and I had to power it off... (yep it was virtual, but that happens on direct HW too, and virtualisation is the happening thing at Sun and elsewhere! I have a version of the data backed up, but will take ages (10 days) to restore). -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Sun, 19 Jul 2009, Richard Elling wrote: I do, even though I have a small business. Neither InDesign nor Illustrator will be ported to Linux or OpenSolaris in my lifetime... besides, iTunes rocks and it is the best iPhone developer's environment on the planet. Richard, I think the point that Gavin was trying to make is that a sensible business would commit their valuable data back to a fileserver running on solid hardware with a solid operating system rather than relying on their single-spindle laptops to store their valuable content - not making any statement on the actual desktop platform. For example, I use a mixture of Windows, MacOS, Solaris and OpenBSD around here, but all the valuable data is stored on a zpool located on a SPARC server (obviously with ECC RAM) with UPS power. With Windows around, I like the fact that I don't need to think twice before reinstalling those machines. Andre. -- Andre van Eyssen. mail: an...@purplecow.org jabber: an...@interact.purplecow.org purplecow.org: UNIX for the masses http://www2.purplecow.org purplecow.org: PCOWpix http://pix.purplecow.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Gavin Maltby wrote: Hi, David Magda wrote: On Jul 19, 2009, at 20:13, Gavin Maltby wrote: No, ECC memory is a must too. ZFS checksumming verifies and corrects data read back from a disk, but once it is read from disk it is stashed in memory for your application to use - without ECC you erode confidence that what you read from memory is correct. Right, because once (say) Apple incorporates ZFS into Mac OS X they'll also start shipping MacBooks and iMacs with ECC. If customers were committing valuable business data to MacBooks and iMacs then ECC would be a requirement. I don't know of terribly many customers running their business off of a laptop. I do, even though I have a small business. Neither InDesign nor Illustrator will be ported to Linux or OpenSolaris in my lifetime... besides, iTunes rocks and it is the best iPhone developer's environment on the planet. The bigger problem is that not all of Intel's CPU products do ECC... the embedded and server models do, but it is the low-margin PC market that is willing to make that cost trade-off. If people demanded ECC, like they do in the embedded and server markets, then we wouldn't be having this conversation. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Hi, David Magda wrote: On Jul 19, 2009, at 20:13, Gavin Maltby wrote: No, ECC memory is a must too. ZFS checksumming verifies and corrects data read back from a disk, but once it is read from disk it is stashed in memory for your application to use - without ECC you erode confidence that what you read from memory is correct. Right, because once (say) Apple incorporates ZFS into Mac OS X they'll also start shipping MacBooks and iMacs with ECC. If customers were committing valuable business data to MacBooks and iMacs then ECC would be a requirement. I don't know of terribly many customers running their business off of a laptop. If it's so necessary we might as well have any kernel that has ZFS in it only allow 'zpool create' to be run if the kernel detects ECC modules. Come on. > It's a nice-to-have, but at some point we're getting into the tinfoil hat-equivalent of data protection. On a laptop zfs is a huge amount safer than other filesystems, still has all the great usability features etc - but zfs does not magically turn your laptop into a server-grade system. What you refer to as a tinfoil hat is an essential component of any server if it is housing business-vital data; obviously it is just a nice-to-have on a laptop, but recognise what you're losing. Gavin ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Sun, 19 Jul 2009, David Magda wrote: Right, because once (say) Apple incorporates ZFS into Mac OS X they'll also start shipping MacBooks and iMacs with ECC. If it's so necessary we might as well have any kernel that has ZFS in it only allow 'zpool create' to be run if the kernel detects ECC modules. The MacBooks and iMacs are only used as an execution environment for the Safari web browser. ECC is only necessary for computers which save data somewhere so the MacBook and iMac do not need ECC. Regardless (in order to stay on topic) it is worth mentioning that the 10TB data lost to a failed pool was not lost due to lack of ECC. It was lost because VirtualBox intentionally broke the guest operating system. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Jul 19, 2009, at 20:13, Gavin Maltby wrote: No, ECC memory is a must too. ZFS checksumming verifies and corrects data read back from a disk, but once it is read from disk it is stashed in memory for your application to use - without ECC you erode confidence that what you read from memory is correct. Right, because once (say) Apple incorporates ZFS into Mac OS X they'll also start shipping MacBooks and iMacs with ECC. If it's so necessary we might as well have any kernel that has ZFS in it only allow 'zpool create' to be run if the kernel detects ECC modules. Come on. It's a nice-to-have, but at some point we're getting into the tinfoil hat-equivalent of data protection. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
dick hoogendijk wrote: true. Furthermore, much so-called consumer hardware is very good these days. My guess is ZFS should work quite reliably on that hardware. (i.e. non ECC memory should work fine!) / mirroring is a -must- ! No, ECC memory is a must too. ZFS checksumming verifies and corrects data read back from a disk, but once it is read from disk it is stashed in memory for your application to use - without ECC you erode confidence that what you read from memory is correct. Gavin ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Sun, 19 Jul 2009, Miles Nordin wrote: "r" == Ross writes: "tt" == Toby Thain writes: r> ZFS was never designed to run on consumer hardware, this is markedroid garbage, as well as post-facto apologetics. Don't lower the bar. Don't blame the victim. I think that the standard disclaimer "Always use protection" applies here. Victims who do not use protection should assume substantial guilt for their subsequent woes. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Frank Middleton wrote: On 07/19/09 05:00 AM, dick hoogendijk wrote: (i.e. non ECC memory should work fine!) / mirroring is a -must- ! Yes, mirroring is a must, although it doesn't help much if you have memory errors (see several other threads on this topic): http://en.wikipedia.org/wiki/Dynamic_random_access_memory#Errors_and_error_correction "Tests[ecc]give widely varying error rates, but about 10^-12 error/bit·h is typical, roughly one bit error, per month, per gigabyte of memory." That's roughly 1 per week in 4GB. If 1 error in 50 results in a ZFS hit, that's one/year per user on average. Some get more, some get less.That sounds like pretty bad odds... Not that bad. Uncommitted ZFS data in memory does not tend to live that long. Writes are generally out to media in 30 seconds. Solaris scrubs memory, with a 12-hour cycle time, so memory does not remain untouched for a month. For high-end systems, memory scrubs are also performed by the memory controllers. Beware, if you go down this path of thought for very long, you'll soon be afraid to get out of bed in the morning... wait... most people actually die in beds, so perhaps you'll be afraid to go to bed instead :-) "In most computers used for serious scientific or financial computing and as servers, ECC is the rule rather than the exception, as can be seen by examining manufacturers' specifications." Sun doesn't even sell machines without ECC. There's a reason for that. Yes, but all of the discussions in this thread can be classified as systems engineering problems, not product design problems. If you do your own systems engineering, then add this to your (hopefully long) checklist. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
> "r" == Ross writes: > "tt" == Toby Thain writes: r> ZFS was never designed to run on consumer hardware, this is markedroid garbage, as well as post-facto apologetics. Don't lower the bar. Don't blame the victim. tt> I posted about that insane default, six months ago. Obviously tt> ZFS isn't the only subsystem that this breaks. yes, but remember, in this case the host did not crash, so the insane default should be irrelevant. pgpc8tzQ0aGF0.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Sun, 19 Jul 2009, Frank Middleton wrote: Yes, mirroring is a must, although it doesn't help much if you have memory errors (see several other threads on this topic): http://en.wikipedia.org/wiki/Dynamic_random_access_memory#Errors_and_error_correction "Tests[ecc]give widely varying error rates, but about 10^-12 error/bit·h is typical, roughly one bit error, per month, per gigabyte of memory." That's roughly 1 per week in 4GB. If 1 error in 50 results in a ZFS hit, that's one/year per user on average. Some get more, some get less.That sounds like pretty bad odds... I fail to see anything zfs-specific in the above. It does not have anything more to do with zfs than it does with any other software running on the system. I do have a couple of Windows PCs here without ECC, but they were gifts from other people, and not hardware that I purchased, and not used for any critical application. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 07/19/09 05:00 AM, dick hoogendijk wrote: (i.e. non ECC memory should work fine!) / mirroring is a -must- ! Yes, mirroring is a must, although it doesn't help much if you have memory errors (see several other threads on this topic): http://en.wikipedia.org/wiki/Dynamic_random_access_memory#Errors_and_error_correction "Tests [ecc] give widely varying error rates, but about 10^-12 error/bit·h is typical, roughly one bit error, per month, per gigabyte of memory." That's roughly 1 per week in 4GB. If 1 error in 50 results in a ZFS hit, that's one/year per user on average. Some get more, some get less. That sounds like pretty bad odds... "In most computers used for serious scientific or financial computing and as servers, ECC is the rule rather than the exception, as can be seen by examining manufacturers' specifications." Sun doesn't even sell machines without ECC. There's a reason for that. IMO you'd be nuts to run ZFS on a machine without ECC unless you don't care about losing some or all of the data. Having said that, we have yet to lose an entire pool - this is pretty hard to do! I should add that since setting copies=2 and forcing the files to be copied, there have been no more unrecoverable errors on a particularly low end machine that was plagued with them even with mirrors (and a UPS with a bad battery :-) ). On 19-Jul-09, at 7:12 AM, Russel wrote: As this was not clear to me. I use VB like others use vmware etc to run solaris because its the ONLY way I can, Given that PC hardware is so cheap these days (used SPARCs even cheaper), surely it makes far more sense to build a nice robust OSOL/ZFS based file server *with* ECC. Then you can use iSCSI for your VirtualBox VMs and solve all kinds of interesting problems. But you still need to do backups. My solution for that is to replicate the server and back up to it using zfs send/recv. If a disk fails, you switch to the backup and there are no worries about the second disk of the mirror failing during a resilver. A small price to pay for peace of mind. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
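For anyone wanting to copy that setup, the moving parts are just a property and a replication stream; a rough sketch, where the pool, filesystem, snapshot and host names are examples rather than anything canonical:

# extra intra-pool redundancy; only applies to blocks written after the property is set
zfs set copies=2 tank/data
# replicate the whole tree, including snapshots and properties, to the backup box
zfs snapshot -r tank@backup-20090719
zfs send -R tank@backup-20090719 | ssh backuphost zfs receive -d -F backuppool
# later, send only the changes since the previous backup snapshot
zfs send -R -I tank@backup-20090719 tank@backup-20090726 | ssh backuphost zfs receive -d -F backuppool

The -R/-I send options and the -d/-F receive options exist in the Solaris 10 and OpenSolaris releases contemporary with this thread, but check zfs(1M) on both ends, since an older receiver will not understand newer stream features.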
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
That's only one element of it Bob. ZFS also needs devices to fail quickly and in a predictable manner. A consumer grade hard disk could lock up your entire pool as it fails. The kit Sun supply is more likely to fail in a manner ZFS can cope with. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 19-Jul-09, at 7:12 AM, Russel wrote: Guys guys please chill... First thanks to the info about virtualbox option to bypass the cache (I don't suppose you can give me a reference for that info? (I'll search the VB site :-)) I posted about that insane default, six months ago. Obviously ZFS isn't the only subsystem that this breaks. http://forums.virtualbox.org/viewtopic.php?f=8&t=13661&start=0 As this was not clear to me. I use VB like others use vmware etc to run solaris because its the ONLY way I can, Convenience always has a price. --Toby ... ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Sun, 19 Jul 2009, Ross wrote: The success of any ZFS implementation is *very* dependent on the hardware you choose to run it on. To clarify: "The success of any filesystem implementation is *very* dependent on the hardware you choose to run it on." ZFS requires that the hardware cache sync works and is respected. Without taking advantage of the drive caches, zfs would be considerably less performant. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Heh, yes, I assumed similar things Russel. I also assumed that a faulty disk in a raid-z set wouldn't hang my entire pool indefinitely, that hot plugging a drive wouldn't reboot Solaris, and that my pool would continue working after I disconnected one half of an iscsi mirror. I also like yourself assumed that if ZFS is using copy on write, then even after a really nasty crash, the vast majority of my data would be accessible. And I also believed that when I had disconnected every drive from a ZFS pool, that ZFS wouldn't accept writes to it any more... Unfortunately, all of these assumptions turned out to be false. Learning ZFS has been a painful experience. I still like it, but I am very aware of its limitations, and am cautious how I apply it these days. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
From the experience myself and others have had, and Sun's approach with their Amber Road storage (FISHWORKS - fully integrated *hardware* and software), my feeling is very much that ZFS was designed by Sun to run on Sun's own hardware, and as such, they were able to make certain assumptions with their design. ZFS was never designed to run on consumer hardware, it makes assumptions that devices and drivers will always be well behaved when errors occur, and in general is quite fragile if you're running it on the wrong system. On the right hardware, I've no doubt that ZFS is incredibly reliable and easy to manage. On the wrong hardware, disk errors can hang your entire system, hot swap can down your pool, and a power cut or other error can render your entire pool inaccessible. The success of any ZFS implementation is *very* dependent on the hardware you choose to run it on. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Guys guys please chill... First, thanks for the info about the VirtualBox option to bypass the cache (I don't suppose you can give me a reference for that info? (I'll search the VB site :-)) as this was not clear to me. I use VB like others use vmware etc to run solaris because it's the ONLY way I can, as I can't get the drivers for most of the H/W out there in hobby land, so a virtualised system allows me to run my SilImage chipset to link 3Gb/s to the SATA multi-port array I got for £160. ===anyway let's stop there or we will be off topic even more ===just thought you should know why I did it, even BSD does ===not have a driver or I would have gone there to get zfs :-) Anyway... my view on zfs was quite simple: it looked after bit rot, did self healing, and most importantly for me, running it on consumer kit, it seemed to avoid the RAID-5 write hole in the case of a crash! So if stuff falls over, e.g. Windows, VB, OpenSolaris etc, I would not suffer unknown data corruption and would just lose that write, which was fine as the thing crashed. So for a flaky environment ZFS sounds even more like the one you want, LOL. Loved all the technical stuff; I have had rather good deep dives from Sun's best here in the UK/Europe (I'm lucky, as I was a very early employee of Sun, and now work for a major firm :-)). Liked the idea that you can build your own storage server etc etc. I knew most bugs, as I saw them, were fixed in the Jan 09 patch. I THOUGHT/ASSUMED (yes, you should never :-()) that given everything else it would be blatantly obvious that when you try to mount a zpool the thing would either roll back to the last consistent state (that includes the uberblock and metadata, thank you) or have a tool like fsck which lets you do it. BUT, you know, once you start rolling back (just like clearing inodes) you're not going to be in such a good place and you'd need to scrub or something; even if it says these files are now corrupt, FINE, but I DIDN'T lose the filesystem, just a file or two. We should never lose the filesystem. But in ZFS land that's the most likely data-loss fault we have, it sounds. SUMMARY = What I see here is the lack of the (not needed, lol) fsck-type tool. WELL WE DO NEED it; we need to be able to roll back and recover and repair. I have lost data stored on large Sun 6790 arrays and now my home system. So PLEASE, has anyone got a beta version of a tool to perform the roll back? Russel (It will take me 10 days to pull my data off my little drives again, and 5 days to format with raid5 (H/W) and NTFS, which is not what I want, nor is the raid-5 hole :-)) -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Sun, 19 Jul 2009 01:48:40 PDT Ross wrote: > As far as I can see, the ZFS Administrator Guide is sorely lacking in > any warning that you are risking data loss if you run on consumer > grade hardware. And yet, ZFS is not only for NON-consumer grade hardware, is it? The fact that many, many people run "normal" consumer hardware does not rule them out from ZFS, does it? The "best filesystem ever", the "end of all other filesystems" would be nothing more than a dream if that was true. Furthermore, much so-called consumer hardware is very good these days. My guess is ZFS should work quite reliably on that hardware. (i.e. non ECC memory should work fine!) / mirroring is a -must- ! -- Dick Hoogendijk -- PGP/GnuPG key: 01D2433D + http://nagual.nl/ | nevada / OpenSolaris 2010.02 B118 + All that's really worth doing is what we do for others (Lewis Carrol) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
While I agree with Brent, I think this is something that should be stressed in the ZFS documentation. Those of us with long term experience of ZFS know that it's really designed to work with hardware meeting quite specific requirements. Unfortunately, that isn't documented anywhere, and more and more people are being bitten by quite severe dataloss by virtue of the fact that ZFS is far less forgiving than other filesystems when data hasn't been properly written to disk. As far as I can see, the ZFS Administrator Guide is sorely lacking in any warning that you are risking data loss if you run on consumer grade hardware. In fact, the requirements section states nothing more than: "ZFS Hardware and Software Requirements and Recommendations Make sure you review the following hardware and software requirements and recommendations before attempting to use the ZFS software: * A SPARC® or x86 system that is running the or the Solaris 10 6/06 release or later release. * The minimum disk size is 128 Mbytes. The minimum amount of disk space required for a storage pool is approximately 64 Mbytes. * Currently, the minimum amount of memory recommended to install a Solaris system is 768 Mbytes. However, for good ZFS performance, at least one Gbyte or more of memory is recommended. * If you create a mirrored disk configuration, multiple controllers are recommended." -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Sun, 19 Jul 2009 00:00:06 -0700 Brent Jones wrote: > No offense, but you trusted 10TB of important data, running in > OpenSolaris from inside Virtualbox (not stable) on top of Windows XP > (arguably not stable, especially for production) on probably consumer > grade hardware with unknown support for any of the above products? Running this kind of setup absolutely can give you NO guarantees at all. Virtualisation, OSOL/zfs on WinXP. It's nice to play with and see it "working" but would I TRUST precious data to it? No way! -- Dick Hoogendijk -- PGP/GnuPG key: 01D2433D + http://nagual.nl/ | nevada / OpenSolaris 2010.02 B118 + All that's really worth doing is what we do for others (Lewis Carrol) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
I would be interested in how to roll back to certain txg points in case of disaster; that was what Russel was after anyway. Yours Markus Kovero -Original Message- From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Miles Nordin Sent: 19 July 2009 11:24 To: zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work >>>>> "bj" == Brent Jones writes: bj> many levels of fail here, pft. Virtualbox isn't unstable in any of my experience. It doesn't by default pass cache flushes from guest to host unless you set VBoxManage setextradata VMNAME "VBoxInternal/Devices/piix3ide/0/LUN#[x]/Config/IgnoreFlush" 0 however OP does not mention the _host_ crashing, so this questionable ``optimization'' should not matter. Yanking the guest's virtual cord is something ZFS is supposed to tolerate: remember the ``crash-consistent backup'' concept (not to mention the ``always consistent on disk'' claim, but really any filesystem even without that claim should tolerate having the guest's virtual cord yanked, or the guest's kernel crashing, without losing all its contents---the claim only means no time-consuming fsck after reboot). bj> to blame ZFS seems misplaced, -1 The fact that it's a known problem doesn't make it not a problem. bj> the subject on this thread especially inflammatory. so what? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
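Until a supported rollback exists, about the only thing you can do from the command line is inspect what is on disk with zdb, which at least tells you which txg the active uberblock points at. zdb is not a stable or supported interface and its options drift between builds, so treat the following as a sketch (pool and device names are examples):

zdb -l /dev/rdsk/c1t0d0s0    (dump the four vdev labels on one disk)
zdb -uuu tank                (print the active uberblock, with its txg and timestamp, of an imported pool)
zdb -e -uuu tank             (the same for an exported or unimportable pool, read straight from the labels)

Actually rewinding to one of the older uberblocks still needs the work tracked in CR 6667683; doing it by hand today means unsupported label surgery, which is exactly why people keep asking for the tool.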
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
> "bj" == Brent Jones writes: bj> many levels of fail here, pft. Virtualbox isn't unstable in any of my experience. It doesn't by default pass cache flushes from guest to host unless you set VBoxManage setextradata VMNAME "VBoxInternal/Devices/piix3ide/0/LUN#[x]/Config/IgnoreFlush" 0 however OP does not mention the _host_ crashing, so this questionable ``optimization'' should not matter. Yanking the guest's virtual cord is something ZFS is supposed to tolerate: remember the ``crash-consistent backup'' concept (not to mention the ``always consistent on disk'' claim, but really any filesystem even without that claim should tolerate having the guest's virtual cord yanked, or the guest's kernel crashing, without losing all its contents---the claim only means no time-consuming fsck after reboot). bj> to blame ZFS seems misplaced, -1 The fact that it's a known problem doesn't make it not a problem. bj> the subject on this thread especially inflammatory. so what? pgpsa1Xq1kR3M.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Sat, Jul 18, 2009 at 7:39 PM, Russel wrote: > Yes you'll find my name all over VB at the moment, but I have found it to be > stable > (don't install the addons disk for solaris!!, use 3.0.2, and for me > winXP32bit and > OpenSolaris 2009.6 has been rock solid, it was (seems) to be opensolaris > failed > with extract_boot_list doesn't belong to 101, but noone on opensol, seems > interested about it as other have reported it to, prob a rare issue. > > But yer, I hope Vicktor or someone will take a look. My worry is that if we > can't recover from this, which a number of people (in variuos forms) have > come accross zfs may be introuble. We had this happen at work about 18 months > ago > lost all the data (20TB)(didn't know about zdb nor did sun support) so we > have start > to back away, but I though since jan 2009 patches things were meant to be > alot better, esp with sun using it in there storage servers now > -- > This message posted from opensolaris.org > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > No offense, but you trusted 10TB of important data, running in OpenSolaris from inside Virtualbox (not stable) on top of Windows XP (arguably not stable, especially for production) on probably consumer grade hardware with unknown support for any of the above products? I'd like to say this was an unfortunate circumstance, but there are many levels of fail here, and to blame ZFS seems misplaced, and the subject on this thread especially inflammatory. -- Brent Jones br...@servuhome.net ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Yes you'll find my name all over VB at the moment, but I have found it to be stable (don't install the addons disk for Solaris!!, use 3.0.2, and for me winXP32bit and OpenSolaris 2009.6 has been rock solid). It was (it seems) OpenSolaris that failed, with "extract_boot_list doesn't belong to 101", but no one on opensol seems interested in it, as others have reported it too; prob a rare issue. But yer, I hope Vicktor or someone will take a look. My worry is that if we can't recover from this, which a number of people (in various forms) have come across, zfs may be in trouble. We had this happen at work about 18 months ago and lost all the data (20TB) (didn't know about zdb, nor did Sun support), so we have started to back away, but I thought since the Jan 2009 patches things were meant to be a lot better, esp with Sun using it in their storage servers now -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Sorry to hear that, but you do know that VirtualBox is not really stable? VirtualBox does show some instability from time to time. You haven't read the VirtualBox forums? I would advise against VirtualBox for saving all your data in ZFS. I would use OpenSolaris without virtualization. I hope your problem gets fixed, though. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss