Re: Adventures in btrfs raid5 disk recovery

Hugo Mills Fri, 24 Jun 2016 04:00:32 -0700

On Fri, Jun 24, 2016 at 01:19:30PM +0300, Andrei Borzenkov wrote:
> On Fri, Jun 24, 2016 at 1:16 PM, Hugo Mills <h...@carfax.org.uk> wrote:
> > On Fri, Jun 24, 2016 at 12:52:21PM +0300, Andrei Borzenkov wrote:
> >> On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills <h...@carfax.org.uk> wrote:
> >> > On Fri, Jun 24, 2016 at 07:02:34AM +0300, Andrei Borzenkov wrote:
> >> >> 24.06.2016 04:47, Zygo Blaxell wrote:
> >> >> > On Thu, Jun 23, 2016 at 06:26:22PM -0600, Chris Murphy wrote:
> >> >> >> On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli 
> >> >> >> <kreij...@inwind.it> wrote:
> >> >> >>> The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the checksum.
> >> >> >>
> >> >> >> Yeah I'm kinda confused on this point.
> >> >> >>
> >> >> >> https://btrfs.wiki.kernel.org/index.php/RAID56
> >> >> >>
> >> >> >> It says there is a write hole for Btrfs. But defines it in terms of
> >> >> >> parity possibly being stale after a crash. I think the term comes not
> >> >> >> from merely parity being wrong but parity being wrong *and* then being
> >> >> >> used to wrongly reconstruct data because it's blindly trusted.
> >> >> >
> >> >> > I think the opposite is more likely, as the layers above raid56
> >> >> > seem to check the data against sums before raid56 ever sees it.
> >> >> > (If those layers seem inverted to you, I agree, but OTOH there are
> >> >> > probably good reasons to do it that way).
> >> >> >
> >> >>
> >> >> Yes, that's how I read the code as well. The btrfs layer that does
> >> >> checksumming is unaware of parity blocks; for all practical purposes
> >> >> they do not exist. What happens is approximately:
> >> >>
> >> >> 1. logical extent is allocated and checksum computed
> >> >> 2. it is mapped to physical area(s) on disks, skipping over what would
> >> >> be parity blocks
> >> >> 3. when these areas are written out, RAID56 parity is computed and filled in
> >> >>
> >> >> IOW btrfs checksums are for (meta)data and RAID56 parity is not data.
> >> >
> >> >    Checksums are not parity, correct. However, every data block
> >> > (including, I think, the parity) is checksummed and put into the csum
> >> > tree. This allows the FS to determine where damage has occurred,
> >> > rather than simply detecting that it has occurred (which would be the
> >> > case if the parity doesn't match the data, or if the two copies of a
> >> > RAID-1 array don't match).
> >> >
> >>
> >> Yes, that is what I wrote below. But that means that RAID5 with one
> >> degraded disk won't be able to reconstruct data on this degraded disk
> >> because reconstructed extent content won't match checksum. Which kinda
> >> makes RAID5 pointless.
> >
> >    Eh? How do you come to that conclusion?
> >
> >    For data, say you have n-1 good devices, with n-1 blocks on them.
> > Each block has a checksum in the metadata, so you can read that
> > checksum, read the blocks, and verify that they're not damaged. From
> > those n-1 known-good blocks (all data, or one parity and the rest
> 
> We do not know whether parity is good or not because as far as I can
> tell parity is not checksummed.


   I was about to write a devastating rebuttal of this... then I
actually tested it, and holy crap you're right.

   I've just closed the terminal in question by accident, so I can't
copy-and-paste, but the way I checked was:

# mkfs.btrfs -mraid1 -draid5 /dev/loop{0,1,2}
# mount /dev/loop0 foo
# dd if=/dev/urandom of=foo/file bs=4k count=32
# umount /dev/loop0
# btrfs-debug-tree /dev/loop0

then look at the csum tree:

     item 0 key (EXTENT_CSUM EXTENT_CSUM 351469568) itemoff 16155 itemsize 128
          extent csum item

There is a single csum item in it, of length 128. At 4 bytes per csum,
that's 32 checksums, which covers the 32 4KiB blocks I wrote, leaving
nothing for the parity.
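
   The arithmetic above can be double-checked with a quick sketch. It
assumes 4-byte crc32c checksums and 4KiB blocks, as in the test; the
variable names are only for illustration and are not btrfs identifiers.

```python
# Assumptions: 4-byte checksums (crc32c) and 4 KiB data blocks.
CSUM_SIZE = 4                      # bytes per checksum
BLOCK_SIZE = 4096                  # bytes per data block
item_size = 128                    # itemsize shown by btrfs-debug-tree
bytes_written = 32 * BLOCK_SIZE    # dd bs=4k count=32

csums_in_item = item_size // CSUM_SIZE
data_blocks = bytes_written // BLOCK_SIZE

# Checksums left over once every data block has one. A checksum for
# parity would have to come out of this remainder.
spare_csums = csums_in_item - data_blocks

print("checksums in item:", csums_in_item)
print("data blocks written:", data_blocks)
print("checksums left for parity:", spare_csums)
```

Every checksum in the item is accounted for by a data block, so none of
them can belong to parity.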

   This is fundamentally broken, and I think we need to change the
wiki to indicate that the parity RAID implementation is not
recommended, because it doesn't actually do the job it's meant to in a
reliable way. :(
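
   To make the concern concrete, here is a toy sketch, not btrfs code:
plain XOR parity over two data blocks, with zlib.crc32 standing in for
crc32c. All names are hypothetical. With one data device missing, the
rebuilt block is only as good as the parity, and the parity has no
checksum of its own. The only thing that could flag a bad rebuild is
the csum of the data block being rebuilt, and whether the rebuild path
actually consults it is the point under discussion in this thread.

```python
import zlib


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length blocks (toy RAID5 parity)."""
    return bytes(x ^ y for x, y in zip(a, b))


# One stripe: two data blocks plus their parity.
d0 = b"A" * 16
d1 = b"B" * 16
parity = xor_blocks(d0, d1)

# Checksums exist for the data blocks only, mirroring the csum tree above.
csums = {"d0": zlib.crc32(d0), "d1": zlib.crc32(d1)}

# Simulate stale or damaged parity: flip one bit in its first byte.
bad_parity = bytes([parity[0] ^ 0x01]) + parity[1:]

# The device holding d1 is lost; rebuild it from d0 and the parity.
rebuilt_ok = xor_blocks(d0, parity)
rebuilt_bad = xor_blocks(d0, bad_parity)

# Checking the rebuilt block against the stored data csum catches the
# bad rebuild; skipping this check would accept it silently.
ok_matches = zlib.crc32(rebuilt_ok) == csums["d1"]
bad_matches = zlib.crc32(rebuilt_bad) == csums["d1"]

print("good parity rebuild matches csum:", ok_matches)
print("bad parity rebuild matches csum:", bad_matches)
```

In this sketch the damaged parity is only noticed indirectly, after
reconstruction, and only if the rebuilt data is compared with its csum.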

   Hugo.

> > data) you can reconstruct the remaining block. That reconstructed
> > block won't be checked against the csum for the missing block -- it'll
> > just be written and a new csum for it written with it.
> >
> 
> So we have silent corruption. I fail to understand how it is an improvement :)
> 
> >    Hugo.
> >
> >> ...
> >> >
> >> >> > It looks like uncorrectable failures might occur because parity is
> >> >> > correct, but the parity checksum is out of date, so the parity checksum
> >> >> > doesn't match even though data blindly reconstructed from the parity
> >> >> > *would* match the data.
> >> >> >
> >> >>
> >> >> Yep, that is how I read it too. So if your data is checksummed, it
> >> >> should at least avoid silent corruption.
> >> >>
> >

-- 
Hugo Mills             | Debugging is like hitting yourself in the head with a
hugo@... carfax.org.uk | hammer: it feels so good when you find the bug, and
http://carfax.org.uk/  | you're allowed to stop debugging.
PGP: E2AB1DE4          |                                        PotatoEngineer
