On Sun, Nov 8, 2009 at 7:55 AM, Robert Milkowski <mi...@task.gda.pl> wrote:
>
> fyi
>
> Robert Milkowski wrote:
>>
>> XXX wrote:
>>>
>>> | Have you actually tried to roll back to previous uberblocks when you
>>> | hit the issue?  I'm asking as I haven't yet heard about any case
>>> | of the issue which was not solved by rolling back to a previous
>>> | uberblock. The problem though was that the way to do it was "hackish".
>>>
>>>  Until recently I didn't even know that this was possible or a likely
>>> solution to 'pool panics system on import' and similar pool destruction,
>>> and I don't have any tools to do it. (Since we run Solaris 10, we won't
>>> have official support for it for quite some time.)
>>>
>>
>> I wouldn't be that surprised if this particular feature actually got
>> backported to S10 soon. At least you may raise a CR asking for it - maybe
>> you will get access to an IDR first (I'm not saying there is or isn't
>> already one).
>>
>>>  If there are (public) tools for doing this, I will give them a try
>>> the next time I get a test pool into this situation.
>>>
>>
>> IIRC someone sent one to the zfs-discuss list some time ago.
>> Then usually you will also need to poke around with zdb.
>> A sketchy and unsupported procedure was discussed on the list as well.
>> Look at the archives.
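>>
>> As a rough illustration only (the device and pool names are made up, and
>> the exact zdb options vary between builds), inspecting the on-disk labels
>> and the active uberblock looks something like this:
>>
>>   # dump the vdev label(s) from a pool device - read-only, safe to run
>>   zdb -l /dev/dsk/c0t1d0s0
>>
>>   # examine an exported/unimportable pool by name without importing it
>>   # (-e) and print the active uberblock (-u)
>>   zdb -e -u mypool
>>
>> The manual rollback itself then means convincing the import code to pick
>> an older uberblock than the newest one - that is the "hackish" part
>> described in the archives.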
>>
>>> | The bugs which prevented importing a pool in some circumstances were
>>> | really "annoying" but let's face it - it was bound to happen and they
>>> | are just bugs which are getting fixed. ZFS is still young after all.
>>> | And when you google for data loss on other filesystems I'm sure you
>>> | will find lots of user testimonies - be it ufs, ext3, reiserfs or your
>>> | favourite one.
>>>
>>>  The difference between ZFS and those other filesystems is that with
>>> a few exceptions (XFS, ReiserFS), which sysadmins in the field didn't
>>> like either, those filesystems didn't generally lose *all* your data
>>> when something went wrong. Their official repair tools could usually
>>> put things back together to at least some extent.
>>>
>>
>> Generally they didn't, although I've seen situations where an entire ext2 or
>> ufs filesystem was lost and fsck was not able to get it even mounted (kernel
>> panics right after mounting it). On another occasion fsck was crashing the
>> box; in yet another one fsck claimed everything was ok, but then the system
>> crashed when doing backups (fsck can't really properly fix filesystem state -
>> it is more of a guess, and sometimes the guess goes terribly wrong).
>>
>> But I agree that generally with other file systems you can recover most or
>> all data just fine.
>> And generally that is the case with zfs too - there were probably more bugs in
>> ZFS as it is a much younger filesystem, but most of them were very quickly
>> fixed. And the uberblock one - I 100% agree that when you hit the issue and
>> didn't know about the manual recovery method it was very bad - but it has
>> finally been fixed.
>>
>>> (Just as importantly, when they couldn't put things back together you
>>> could honestly tell management and the users 'we ran the recovery tools
>>> and this is all they could get back'. At the moment, we would have
>>> to tell users and management 'well, there are no (official) recovery
>>> tools...', unless Sun Support came through for once.)
>>>
>>
>> But these tools are built into zfs and run automatically, with virtually
>> 100% confidence that if something can be fixed it is fixed correctly, and
>> that if something is wrong it will be detected - thanks to end-to-end
>> checksumming of data and meta-data. The problem *was* that the one scenario
>> where rolling back to a previous uberblock is required was not handled by
>> the code and required a complicated and undocumented procedure to follow.
>> It wasn't a high priority for Sun as it was very rare and wasn't affecting
>> many enterprise customers, and although the procedure is complicated, there
>> was one, and it was successfully used on many occasions even for non-paying
>> customers thanks to guys like Victor on the zfs mailing list who helped
>> people in such situations.
>>
>> But you didn't know about it and it seems like Sun's support service was
>> of no use to you - which is really a shame.
>> In your case I would probably point that out to them and at least get a
>> good deal as compensation or something...
>>
>> But what is most important is that a fully supported, built-in and
>> easy-to-use procedure to recover from such situations is finally available.
>> As time progresses and more bugs are fixed, ZFS will behave much better in
>> many corner cases, as it already does in OpenSolaris - the last 6 months or
>> so were really very productive in fixing many bugs like that.
>>
>>> | However the whole point of the discussion is that zfs really doesn't
>>> | need an fsck tool.
>>> | All the problems encountered so far were bugs and most of them are
>>> | already fixed. One missing feature was built-in support for rolling
>>> | back the uberblock, which has just been integrated. But I'm sure
>>> | there are more bugs to be found..
>>>
>>>  I disagree strongly. Fsck tools have multiple purposes; ZFS obsoletes
>>> some of them but not all. One thing fsck is there for is to recover as
>>> much as possible after things happen that are supposed to be impossible,
>>> like operating system bugs or crazy corruption. ZFS's current attitude
>>> is more or less that impossible things won't happen so it doesn't have
>>> to do anything (except, perhaps, panic with assert failures).
>>>
>>
>> This is not true - I will try to explain why.
>> Generally, if you want to recover some data from a filesystem you need to
>> get it into a state where you can mount it (at least read-only). Most legacy
>> filesystems, when they hit a problem where the metadata doesn't make sense
>> to them and they think it is wrong, won't allow you to mount the filesystem
>> and will ask you to run fsck. Now, as there are no checksums in these
>> filesystems, there is generally no accurate way of telling how the bad
>> metadata should be fixed. Fsck looks for obvious things and has to "guess"
>> in many cases; sometimes it is right and sometimes it is not. Sometimes it
>> won't even detect that there was corruption. Also keep in mind that in most
>> filesystems fsck does not even try to check user data - just metadata.
>> The main reason is that it can't really do it.
>> Now, because running fsck could potentially be disastrous to a filesystem
>> and lead to even more damage if it is started automatically (for example
>> during system boot), it is started in an interactive mode, and if some less
>> obvious fixes are required it will ask a human to confirm its actions.
>> But even then it is still just guessing what it is supposed to do. And
>> sometimes the situation gets even worse.
>>
>> Then sometimes there were bugs both in filesystems and in fsck, and the user
>> was left with no access to data at all until these bugs were fixed (or the
>> user was skilled enough to fix/work around them on his/her own). I came
>> across such problems on EMC IP4700, EMC Celerra and a couple of other systems
>> in my life. For example, fsck was running for well over 10h consuming more
>> and more memory until finally the server ran out of memory and fsck died...
>> and it all started over again, and failed again.... In another case fsck was
>> just crashing during repair at the same location, and the file system was
>> crashing the OS a couple of minutes after mounting it..
>>
>> The other problem with fsck is that even if it thinks the filesystem is
>> ok, it actually might not be - not even its metadata state. Then all sorts
>> of things might happen - like the system panicking when accessing a given
>> file or directory, or more data getting corrupted... I was in such a
>> situation a couple of times and it took days to copy files from such a
>> filesystem to another one, with many panics in between when we had to skip
>> such files or directories, etc. fsck didn't help and reported everything
>> was fine.
>>
>> Now with ZFS it is a completely different world. ZFS is able in virtually
>> all cases to detect whether its meta-data and data on disk are corrupted,
>> thanks to its end-to-end checksumming. If someone is concerned about how
>> strong the default checksum is (fletcher4), one can currently switch zfs
>> to use sha256 and sleep well. So here is the first big difference compared
>> to most filesystems on the market - if some data is corrupted, ZFS does not
>> have to *guess* whether that is the case but can actually detect it with
>> almost 100% confidence.
>>
>> Once such a case is detected, ZFS will try to automatically fix the issue
>> if there is a redundant copy of the corrupted block available - if there
>> is, it will all happen transparently to applications without any need to
>> unmount filesystems or run external tools like fsck. Then, because ZFS
>> checksums both metadata and user data, it will be able to detect and
>> possibly fix data corruption in both cases (which fsck can't do even if it
>> is lucky). Even if you are not using any redundancy at the pool level, ZFS
>> metadata blocks are always kept in at least two copies, physically
>> separated on disk if possible. What this means is that even in a
>> single-disk configuration (or a stripe), if some data is corrupted zfs will
>> be able to detect it, and if it is a meta-data block it will be able not
>> only to detect it but also automatically and transparently fix it and
>> preserve filesystem consistency. (The property settings for stronger
>> checksums and extra data copies are sketched just below.)
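>>
>> If you do want the stronger checksum or extra copies of user data mentioned
>> above, both are ordinary per-dataset properties (mypool/fs is just a
>> hypothetical dataset name here):
>>
>>   # new writes get SHA-256 checksums; existing blocks keep whatever
>>   # checksum they were originally written with
>>   zfs set checksum=sha256 mypool/fs
>>
>>   # keep two copies of user data even on a single-disk pool
>>   zfs set copies=2 mypool/fs
>>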
>> There is a simple test you may run - a rough sketch of it follows below.
>> Create a pool on top of one disk drive, put some files in it, then
>> overwrite let's say 20% of the disk drive with some random data or zeros
>> while zfs is running. Then flush caches (export/import the pool) and try to
>> access all metadata by doing a full recursive ls on the filesystem. You
>> should be able to get a full listing with proper attributes, etc., but if
>> you check zpool status it will probably report many checksum errors which
>> were corrected. (When overwriting, target a portion near the beginning of
>> the disk, as zfs will usually start writing to a disk from the beginning.)
>> Now if you actually try to read file contents, it should be fine if you are
>> lucky enough to read blocks which were not overwritten; if you are unlucky
>> you won't be able to read the blocks which are corrupted (since you don't
>> have any redundancy at the zfs level it can't fix its user data, but it can
>> detect the damage), yet you will still be able to read all the other blocks
>> of the file. Now try to do something like this with any other file system -
>> you will probably end up with an OS panic, and in many cases fsck won't be
>> able to bring the file system back to a point where you can recover some
>> data.... and when fixing it, it will only be guessing what to do and will
>> skip user data entirely...
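>>
>> A minimal sketch of that test, assuming a scratch disk c1t1d0 that holds
>> nothing you care about (the dd step is destructive, so only ever run this
>> on a throwaway disk; device and pool names are made up):
>>
>>   # create a single-disk pool and put some data in it
>>   zpool create testpool c1t1d0
>>   cp -r /usr/share/man /testpool/
>>
>>   # corrupt a chunk of the disk behind ZFS's back - destructive!
>>   # (skip the first ~100MB so the front vdev labels survive)
>>   dd if=/dev/urandom of=/dev/rdsk/c1t1d0s0 bs=1024k seek=100 count=200
>>
>>   # flush caches by exporting/importing, then walk all metadata
>>   zpool export testpool
>>   zpool import testpool
>>   ls -laR /testpool > /dev/null
>>
>>   # corrupted metadata should show up as checksum errors that were
>>   # detected and repaired from the ditto copies
>>   zpool status -v testpool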
>>
>> Now there is a specific case of the above: the corrupted metadata is the
>> metadata describing the pool itself or its root block, and it can't be
>> fixed because all copies are bad. ZFS can also detect this, but the extra
>> functionality to actually fall back to the N-1 rootblock in such a case was
>> not implemented until very recently. This was very unfortunate, but because
>> it was very rare in the field and resources are limited as usual, it wasn't
>> implemented - instead there was an undocumented, unsupported and
>> hard-to-follow procedure on how to do it manually, and some people did use
>> it successfully (check the zfs-discuss archives). Of course it shouldn't be
>> like that, and the ZFS developers did recognize it by accepting a bug
>> report on it. But limited resources...... Fortunately a built-in mechanism
>> to deal with such a case has finally been implemented. So now when it
>> happens a user will have the choice of importing the pool with an extra
>> option to roll back to a previous txg so the pool can be imported. From
>> then on all the mechanisms described above will kick in. And again - no
>> guessing here, but a guarantee of detecting corruption and fixing it if
>> possible. And you don't even have to run any check and wait hours,
>> sometimes days, on large filesystems with millions of files before you can
>> access your data (and still not be sure what exactly you're accessing and
>> whether it will cause further issues). Of course it would probably be wise
>> to run zpool scrub to force reading all data and metadata, check their
>> checksums and fix them if possible, at a time convenient for you, but in
>> the meantime you may run your applications and any corruption will be
>> detected and fixed while data is being accessed.
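>>
>> On builds which already have the new recovery code, the whole thing comes
>> down to something like the following (the rewind option shown here is the
>> one recently integrated into OpenSolaris; the exact flags may differ on
>> older releases, and the pool name is made up):
>>
>>   # a normal import first
>>   zpool import mypool
>>
>>   # if that fails due to corrupted pool-level metadata, ask ZFS to rewind
>>   # to an earlier txg/uberblock; -n only reports what would be discarded
>>   zpool import -F -n mypool
>>   zpool import -F mypool
>>
>>   # later, at a convenient time, verify everything end to end
>>   zpool scrub mypool
>>   zpool status -v mypool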
>>
>> So from the practical point of view you may think of the mechanisms in
>> ZFS as a built-in fsck with the ability to actually detect when corruption
>> happens (instead of just guessing, and not just for meta-data) and get it
>> fixed if a redundant copy is available (and do it transparently to
>> applications). Having a separate tool doesn't really make sense here. Of
>> course you can always write a script called fsck.zfs which imports a pool
>> and runs zpool scrub if you want - a sketch follows below. And sometimes
>> people will do exactly that before going back into production. But having a
>> genuine extra tool like fsck doesn't really make sense - what exactly would
>> such a tool do (keeping in mind all the above)?
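>>
>> Such a wrapper is trivial; a minimal sketch (not an official tool, just the
>> import-and-scrub idea; the status text matched below may differ slightly
>> between releases):
>>
>>   #!/bin/sh
>>   # fsck.zfs <pool> - import a pool and scrub it before putting it back
>>   # into production; purely a convenience wrapper
>>   pool="$1"
>>   zpool import "$pool" || exit 1
>>   zpool scrub "$pool"
>>   # wait for the scrub to finish, then show the result
>>   while zpool status "$pool" | grep -q "scrub in progress"; do
>>       sleep 60
>>   done
>>   zpool status -v "$pool"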
>>
>> Then there were a couple of bugs which prevented ZFS from importing a pool
>> with some specific corruptions which were entirely fixable (AFAIK all known
>> ones were fixed in OpenSolaris). When you think about it - we are talking
>> about bugs here - if you put all the recovery mechanisms into a separate
>> tool called fsck, with the same bugs it wouldn't be able to repair such a
>> pool anyway, would it? So you would need to fix these bugs first - but once
>> you fixed them zfs itself would be able to import such a pool, and an
>> external tool to do so is still not needed (or after applying a patch/fix do
>> 'alias fsck='zpool import'' and then fsck pool will get your pool fixed... :)
>> You might ask what you are supposed to do until such a bug is fixed.
>> Well, what would you do if you couldn't mount an ext2 filesystem (or
>> any other) and there was a bug in its fsck which prevented it from
>> getting the fs into a mountable state.... you would have to wait for a fix,
>> or get it fixed yourself, or play with its on-disk format with tools like
>> e2fsck, fsdb, ... and try to fix the filesystem manually. Well, on zfs you
>> also have zdb...
>> Or you would probably be forced to recover data from backup.
>>
>> The point here is that most filesystems and their tools had such bugs, and
>> zfs is one of the youngest filesystems on the market, so it is no wonder in
>> a way that such bugs are getting fixed now and not 5-7 years ago. Then
>> there is a critical mass of users required for a given filesystem, so that
>> it is deployed in many different environments, workloads, hardware,
>> drivers, usage cases, ... so all these corner cases can surface, users
>> hopefully will report them, and they will get fixed. ZFS has only become
>> widely deployed in the last couple of years or so, so no wonder that most
>> of these bugs were spotted (and fixed) during the same period.
>>
>> But then, thanks to the fundamentally different architecture of ZFS, once
>> most (all? :)) of the bugs like these are fixed, ZFS offers something MUCH
>> better than legacy filesystems + fsck. It offers a guarantee of detecting
>> data corruption and fixing it properly when possible, while reporting what
>> can't be fixed and still providing access to all the other data in your pool.
>>
>>
>> btw: the email exchange is private so I don't want to include zfs-discuss
>> without your consent, but if you want to forward this email to zfs-discuss
>> for other users' benefit feel free to do so.
>>
>>
>>> As the evolution of ZFS has demonstrated, impossible things *do* happen
>>> and you *do* need the ability to recover as much as possible.  ZFS is
>>> busy slapping bandaids over specific problems instead of dealing with
>>> the general issue.
>>>
>>
>> Just a quick "google" and:
>>
>> 1. fsck fails and causes a Linux kernel panic
>> https://bugzilla.redhat.com/show_bug.cgi?id=126238
>>
>> 2. btrfs - filesystem gets corrupted, running btrfsck causes even more
>> damage and the entire filesystem is nuked due to a bug. BTRFS is not the
>> best example as it is far from being production ready, but still...
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=497821
>>
>> 3. linux gfs2 - fsck has a bug (or lacks a feature) and is not able to fix
>> the filesystem with a specific corruption, and the filesystem is unmountable.
>> The only option is to manually fix data on-disk with help from a support
>> service on a case-by-case basis...
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=457557
>>
>> 4. e2fsck segfaults + dumps core when trying to check a filesystem
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=108075
>>
>> 5. ext3 filesystem crashes - fsck can't repair it and goes into an infinite
>> loop.... fixed in a development version of fsck
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=467677
>>
>> 6. gfs2 corruption causes the linux kernel to panic.... fsck says it
>> fixes the issue but it doesn't, and the system crashes all over again under
>> load...
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=519049
>>
>> 7. ext3 filesystem can't be mounted and fsck won't finish after 10 days of
>> running (probably some kind of infinite-looping bug again)
>>
>> http://ubuntuforums.org/archive/index.php/t-394744.html
>>
>> 8. AIX JFS2 filesystem corruption - due to a bug in fsck it can't fix the
>> fs, so data had to be recovered from backup
>>
>>
>> http://unix.ittoolbox.com/groups/technical-functional/ibm-aix-l/error-518-file-system-corruption-366503
>>
>>
>> 9.
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=514511
>> https://bugzilla.redhat.com/show_bug.cgi?id=477856
>>
>>
>> And there are many more...

You missed the fun vxfs ones where a full fs can corrupt itself so
badly that your only option is to restore from backup (fsck won't help
you).  Then there was the vxfs memory leak on Solaris 10 (it didn't cause
corruption, but at some point you had to take outages to work around
the problem).

Or the 'feature' that was there for a long time, where unclean
shutdowns could (not always, but often enough to be annoying) mess up
vxvm so much that you had to run vxprivutil on your LUNs, send the
output to Veritas, and then they would create a custom file for vxmake
to repair the private area just to be able to import the disk group.

>> The point again is that bugs happen even in fsck, and until they are fixed
>> a common user/sysadmin quite often won't be able to recover on their own.
>> ZFS is no exception here when it comes to bugs. But thanks to its different
>> approach (mostly end-to-end checksumming + COW), its ability to detect data
>> corruption and deal with it exceeds that of most generally available
>> solutions on the market. The fixes for some of the bugs mentioned before
>> only make it more robust and reliable, even for those users who were
>> unlucky before... :)

And they even happen in 'mature' and 'proven' filesystems too...

>>
>>
>>
>
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
