On Fri, Feb 26, 2010 at 09:49:14PM +0100, Diego Calleja wrote:
> On Friday, 26 February 2010 20:09:15, Chris Mason wrote:
> > My first suspect would be the super block; it is updated more often and so
> > is more likely to get stuck in the array's cache.
> 
> IIRC, this is exactly the same problem that ZFS users have been
> hitting. Some users got cheap disks that don't honour barriers
> correctly, so their uberblock didn't have the correct data.

This isn't new; XFS and reiserfs v3 have had problems as well.  But
this is just my first suspect; Bill might be hitting something entirely
different.

> They developed an app that tries to rollback transactions to
> get the pool into a sane state...I guess that fsck will be able
> to do that at some point?

Yes, this is something that fsck will need to fix.  This corruption is
the hardest kind because it involves the tree that maps all the other
trees (ugh).

The ioctl I'm working on for snapshot/subvol listing will make it easier
to create a program to backup the chunk tree externally.

> 
> Stupid question from someone who is not a fs dev... isn't it possible
> to solve this issue by doing some sort of "superblock journaling"?
> Since there are several superblock copies you could:
>  -Modify a secondary superblock copy to point to the tree root block
>   that still has not been written to disk
>  -Write whatever tree root block has been COW'ed
>  -Modify the primary superblock
> 
> So in case of these failures, the mount code could look in the secondary
> superblock copy before failing. Since barriers are not being honoured,
> there's still a possibility that the tree root blocks would be written
> before the secondary superblock that was submitted earlier, but that
> problem would be much harder to hit, I guess. But maybe the fs code
> cannot know where the tree root blocks are going to be written before
> writing them, and hence it can't generate a valid superblock?
> 
> Sorry if all this makes no sense at all; I'm just wondering if there's
> a way to solve these drive issues without any kind of recovery tools

The problem is that with a writeback cache, any write is likely to be
missed on a power failure.  Journalling in general requires some notion
of being able to wait for block A to be on disk before you write block
B, and that's difficult to do when the disk lies about what is really
there ;)
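To make the ordering point concrete, here's a toy model (not btrfs or
kernel code; the Disk class, block names, and "newest-first" writeback
policy are all invented for illustration) of a drive with a volatile
writeback cache that acks a barrier without actually flushing:

```python
# Toy model: a journalling fs needs block A durable before block B is
# written.  A barrier is supposed to guarantee that; a drive that acks
# the barrier but ignores it can leave B on media without A.

class Disk:
    def __init__(self, honours_barriers=True):
        self.media = {}           # what survives a power cut
        self.cache = []           # volatile writeback cache
        self.honours_barriers = honours_barriers

    def write(self, block, data):
        self.cache.append((block, data))   # acked, but only cached

    def barrier(self):
        if self.honours_barriers:
            self.flush()                   # cache really hits media
        # a lying drive acks the barrier and does nothing

    def flush(self):
        for block, data in self.cache:
            self.media[block] = data
        self.cache = []

    def background_writeback(self, n):
        # firmware drains the cache in its own order; model: newest first
        for _ in range(min(n, len(self.cache))):
            block, data = self.cache.pop()
            self.media[block] = data

    def power_fail(self):
        self.cache = []                    # volatile cache is lost

def commit(disk):
    disk.write("tree_root", "new tree")        # block A
    disk.barrier()                             # A must be durable first
    disk.write("super", "points at new tree")  # block B

good = Disk(honours_barriers=True)
commit(good)
good.power_fail()
# tree root is durable; the super update was lost, so mount sees the
# old, consistent superblock
assert good.media == {"tree_root": "new tree"}

bad = Disk(honours_barriers=False)
commit(bad)
bad.background_writeback(1)   # firmware flushes the hot super first
bad.power_fail()
# the super made it to media, the tree root it points at did not
assert "super" in bad.media and "tree_root" not in bad.media
```

The "bad" case is exactly the suspected failure above: the frequently
rewritten super block sits hot in the array's cache and can reach the
platter ahead of the tree it references.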

To make things especially difficult, you can't really just roll back to
an older state.  Internally the filesystem does something like this:

allocate a bunch of blocks
free a bunch of blocks

commit

reuse blocks that were freed

Basically, once that commit is on disk, we're allowed to (and likely
to) start writing over blocks that were freed in the earlier
transaction.  If you try to roll back to the state at the start of that
transaction, many of those blocks won't have the same data they did
before.
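The allocate/free/commit/reuse sequence can be sketched as a few lines
of Python (block numbers and contents are made up; only the sequence
matters):

```python
# Toy model of why rolling back to an old root fails after block reuse.
disk = {}

# transaction N: tree root at block 100, pointing at leaf block 7
disk[7] = "leaf: old file data"
disk[100] = "root -> block 7"
old_root = 100

# transaction N+1: COW the leaf to block 8, free block 7, commit
disk[8] = "leaf: new file data"
disk[200] = "root -> block 8"

# after that commit, block 7 is fair game and gets reused
disk[7] = "unrelated metadata"

# "rolling back" to the old root now follows a stale pointer:
assert disk[old_root] == "root -> block 7"
assert disk[7] != "leaf: old file data"   # the old leaf is gone
```

The old root still parses, but some of the blocks it references no
longer hold the data they held when that root was committed.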

Now, the corruption in the rolled-back transaction might be smaller
than in the main transaction, or it might be much worse.

-chris