On Fri, 2008-10-10 at 11:23 -0700, Eric Schrock wrote:
> But I haven't actually heard a reasonable proposal for what a
> fsck-like tool (i.e. one that could "repair" things automatically) would
> actually *do*, let alone how it would work in the variety of situations
> it needs to (compressed RAID-Z?) where the standard ZFS infrastructure
> fails.

I'd say an fsck-like tool for ZFS should not worry much about compression,
checksums, RAID-Z and whatnot. In essence, it would try to do what an
fsck tool does for a typical filesystem, and so would be mostly
oblivious to the layout or encoding of the blocks, perhaps treating
blocks with failed checksums as blocks full of zeros.

Here's how it could work (of course, this is all easier said than done):

1) Open all the devices specified by the user. Optionally, take just a
pool name/guid and scan for the right devices in /dev/[r]dsk.
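
Just to make the scanning part concrete, something along these lines
(Python purely for illustration; the 256 KB label size and 16 KB config
offset are my assumptions about the on-disk layout, and a real tool
would decode the XDR-encoded nvlist in the label instead of doing a
substring match):

    import glob

    LABEL_SIZE = 256 * 1024      # assumed size of one vdev label
    CONFIG_OFFSET = 16 * 1024    # assumed offset of the config inside the label

    def scan_for_pool(pool_name):
        """Shortlist devices in /dev/rdsk whose front label mentions the pool."""
        candidates = []
        for dev in glob.glob('/dev/rdsk/*'):
            try:
                with open(dev, 'rb') as f:
                    label = f.read(LABEL_SIZE)
            except OSError:
                continue
            # A real tool would decode the nvlist here and compare pool
            # name/guid; a raw substring match only shortlists candidates.
            if pool_name.encode() in label[CONFIG_OFFSET:]:
                candidates.append(dev)
        return candidates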

2) Check whether the pool configuration read from the devices is sane -- if
not, try to generate a consistent configuration. Some elements of the
pool configuration, such as the correct pool version, could be checked
in later steps, depending on features that were found.
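
For the "generate a consistent configuration" part, one simple approach
is to let the labels vote, e.g. (sketch only; I'm pretending the label
configs were already parsed into flat dicts, and key names such as
'txg' are placeholders):

    from collections import Counter

    def choose_pool_config(label_configs):
        """Pick the config that the most labels agree on; break ties by
        taking the one with the highest transaction group number."""
        def canon(cfg):
            # Assumes the parsed configs are flat dicts of hashable values.
            return tuple(sorted(cfg.items()))

        votes = Counter(canon(cfg) for cfg in label_configs)
        best = max(votes, key=lambda c: (votes[c], dict(c).get('txg', 0)))
        return dict(best)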

3) Starting from the most recent uberblock, fully traverse the first few
levels of the block tree. If less than 100% of the blocks can be read
without errors, do the same for previous uberblocks and offer the user
the choice of which uberblock to use, or, if running non-interactively,
choose the one with the best success rate.
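
The "best success rate" selection could be as simple as this (sketch;
read_block is a hypothetical callback that tries to read one block
pointer and return its children, and .rootbp/.txg are placeholder
attribute names):

    def score_uberblock(ub, read_block, max_depth=3):
        """Walk a few levels of the block tree under 'ub' and return the
        fraction of blocks that read back cleanly."""
        total = good = 0
        frontier = [(ub.rootbp, 0)]
        while frontier:
            bp, depth = frontier.pop()
            total += 1
            ok, children = read_block(bp)
            if not ok:
                continue
            good += 1
            if depth < max_depth:
                frontier.extend((child, depth + 1) for child in children)
        return good / total if total else 0.0

    def pick_uberblock(uberblocks, read_block):
        """Non-interactive mode: take the best-scoring uberblock,
        preferring newer ones (higher txg) on ties."""
        return max(uberblocks,
                   key=lambda ub: (score_uberblock(ub, read_block), ub.txg))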

4) Traverse the list/tree of filesystems, snapshots and clones. Make
sure that they are well-connected. For each filesystem, try to replay
its ZIL and then clear it.
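
The connectivity check is basically a reachability walk over the
dataset tree, e.g. (sketch; the attribute names are placeholders for
whatever the dataset directory objects actually carry):

    def check_dataset_links(datasets, root_id):
        """datasets maps object id -> object with .children (ids) and
        .parent (id or None).  Returns datasets unreachable from the root
        and those whose parent pointer disagrees with the tree walk."""
        reachable = set()
        parent_seen = {root_id: None}
        stack = [root_id]
        while stack:
            did = stack.pop()
            if did in reachable or did not in datasets:
                continue
            reachable.add(did)
            for child in datasets[did].children:
                parent_seen.setdefault(child, did)
                stack.append(child)
        orphans = set(datasets) - reachable
        bad_parents = [d for d in reachable
                       if datasets[d].parent != parent_seen.get(d)]
        return orphans, bad_parents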

5) Now fully traverse the pool. Compute the space maps and FS space
usage on the fly, as blocks are read.
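
This is where doing the accounting during the walk pays off -- one
traversal instead of several. Roughly (sketch; read_children and
owner_of are hypothetical callbacks, and .offset/.asize placeholder
block pointer fields):

    from collections import defaultdict

    def traverse_and_account(root_bp, read_children, owner_of):
        """Walk the whole pool once, tallying per-dataset usage and
        collecting the allocated extents that the space maps will later
        be checked against."""
        used = defaultdict(int)    # dataset id -> bytes accounted so far
        allocated = []             # (offset, size) extents seen in the walk
        stack = [root_bp]
        while stack:
            bp = stack.pop()
            used[owner_of(bp)] += bp.asize
            allocated.append((bp.offset, bp.asize))
            stack.extend(read_children(bp))
        return used, allocated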

6) For each metadata block read, check whether the fields are sane, and
fix or zero them out if they're not. Basically, we're assuming here that
we may have corrupted metadata with correct checksums.

If some metadata block cannot be read due to a failed checksum, assume
the block is full of zeros and fix it accordingly.

By the way, this includes every field of every kind of metadata block,
including ZAPs, ACLs, FID maps, znode fields, everything.

For fields that reference other objects, make sure that the object they
reference is of the correct type and that the object itself is correct.

For objects that are missing, create empty ones if necessary.
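
To make the "sane fields" idea a bit more concrete: each kind of
metadata block would get a checker that clamps or zeroes anything it
doesn't like, roughly along these lines (the field names and limits
here are invented for illustration, not the real on-disk ones):

    def sanitize_znode(fields, valid_object_ids):
        """Return a repaired copy of a znode-like dict of fields.
        Out-of-range values are zeroed and dangling references cleared,
        so later stages can relink or recreate the target."""
        fixed = dict(fields)
        if not (0 <= fixed.get('size', 0) < 2**63):    # invented limit
            fixed['size'] = 0
        if fixed.get('links', 0) < 0:
            fixed['links'] = 0
        # A reference to a nonexistent or wrong-type object gets dropped;
        # step 7 then moves the orphaned data under /lost+found.
        if fixed.get('parent') not in valid_object_ids:
            fixed['parent'] = 0
        return fixed

    def read_metadata_block(read_raw, bp, blocksize):
        """Failed checksum?  Treat the block as all zeros, as above."""
        ok, data = read_raw(bp)
        return data if ok else b'\0' * blocksize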

7) Check that every object is referenced somewhere and link unreferenced
objects to /lost+found/object-type/, or similar.
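
The relinking itself could look something like this (sketch; 'fs'
stands for some hypothetical handle with mkdir/link-style operations on
the pool being repaired):

    def relink_unreferenced(fs, all_objects, referenced):
        """Move every allocated-but-unreferenced object under
        /lost+found/<object-type>/<object-id>."""
        for obj_id in sorted(all_objects):
            if obj_id in referenced:
                continue
            obj_type = all_objects[obj_id]        # e.g. 'file', 'dir', 'zap'
            parent = fs.mkdirs('/lost+found/%s' % obj_type)
            fs.link(parent, str(obj_id), obj_id)  # its id becomes its name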

8) Probably do other things that I'm forgetting.

9) In the end, check whether the space maps on disk are consistent with
the ones computed during the traversal, and write out correct ones if
not. Check that space usage/reservations/quotas are correct.
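
That final comparison is just a difference between the extents the
traversal saw and the extents the space maps claim, e.g. (sketch; both
inputs are lists of (offset, size) pairs, and the 512-byte unit is an
arbitrary choice):

    def extents_to_units(extents, unit=512):
        """Flatten (offset, size) extents into a set of allocation units.
        Coarse and memory-hungry, but enough to show the idea."""
        units = set()
        for offset, size in extents:
            units.update(range(offset // unit, (offset + size) // unit))
        return units

    def diff_space_maps(computed, on_disk, unit=512):
        """Return (leaked, missing): units the space maps allocate but
        nothing references, and units in use that they think are free."""
        seen = extents_to_units(computed, unit)
        claimed = extents_to_units(on_disk, unit)
        return claimed - seen, seen - claimed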

Essentially, the goal is that at the end of this process the pool
should contain consistent information, should hold as much data as could
be recovered, and should never cause any further errors in ZFS due to
invalid metadata/fields, whether when importing it, reading from it or
writing/modifying it (except that it would still return EIO when trying
to read corrupted file data blocks, of course).

Now, a problem with fsck-like tools, and perhaps especially with ZFS, is
that some of these steps may either require lots of memory or multiple
filesystem/pool traversals.
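
For example, just remembering which allocation units have already been
visited can blow up in memory on a large pool if done naively; a
disk-backed bitmap keeps the footprint bounded, at the cost of the
temporary storage I mention below (sketch; one bit per assumed 512-byte
unit):

    import mmap, os

    class DiskBitmap:
        """One bit per allocation unit, kept in a temporary file and
        mmap'ed, so the tool's memory use stays bounded regardless of
        pool size."""

        def __init__(self, path, nunits):
            size = (nunits + 7) // 8
            fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
            os.ftruncate(fd, size)
            self.map = mmap.mmap(fd, size)
            os.close(fd)    # mmap keeps its own duplicate of the fd

        def set(self, unit):
            self.map[unit // 8] |= 1 << (unit % 8)

        def test(self, unit):
            return bool(self.map[unit // 8] & (1 << (unit % 8)))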

I'd say having such a tool, even if it required additional temporary
storage for operation (hopefully not a very large fraction of the pool
size), would be *very* useful and would clear up any worries that people
currently have.

Kind regards,
Ricardo
