On Fri, Jun 24, 2016 at 1:16 PM, Hugo Mills <h...@carfax.org.uk> wrote:
> On Fri, Jun 24, 2016 at 12:52:21PM +0300, Andrei Borzenkov wrote:
>> On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills <h...@carfax.org.uk> wrote:
>> > On Fri, Jun 24, 2016 at 07:02:34AM +0300, Andrei Borzenkov wrote:
>> >> 24.06.2016 04:47, Zygo Blaxell wrote:
>> >> > On Thu, Jun 23, 2016 at 06:26:22PM -0600, Chris Murphy wrote:
>> >> >> On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli 
>> >> >> <kreij...@inwind.it> wrote:
>> >> >>> The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the 
>> >> >>> checksum.
>> >> >>
>> >> >> Yeah I'm kinda confused on this point.
>> >> >>
>> >> >> https://btrfs.wiki.kernel.org/index.php/RAID56
>> >> >>
>> >> >> It says there is a write hole for Btrfs. But defines it in terms of
>> >> >> parity possibly being stale after a crash. I think the term comes not
>> >> >> from merely parity being wrong but parity being wrong *and* then being
>> >> >> used to wrongly reconstruct data because it's blindly trusted.
>> >> >
>> >> > I think the opposite is more likely, as the layers above raid56
>> >> > seem to check the data against sums before raid56 ever sees it.
>> >> > (If those layers seem inverted to you, I agree, but OTOH there are
>> >> > probably good reasons to do it that way).
>> >> >
>> >>
>> >> Yes, that's how I read the code as well. The btrfs layer that does
>> >> checksumming is entirely unaware of parity blocks; for all practical
>> >> purposes they do not exist. What happens is approximately:
>> >>
>> >> 1. logical extent is allocated and checksum computed
>> >> 2. it is mapped to physical area(s) on disks, skipping over what would
>> >> be parity blocks
>> >> 3. when these areas are written out, RAID56 parity is computed and filled 
>> >> in
>> >>
>> >> IOW btrfs checksums are for (meta)data and RAID56 parity is not data.
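To make that ordering concrete, here is a toy Python sketch of the model
described above (this is not btrfs code; BLOCK, csum() and write_stripe()
are made-up names, and zlib.crc32 just stands in for btrfs's crc32c):

import zlib

BLOCK = 4096

def csum(block: bytes) -> int:
    # stand-in for the btrfs data checksum (crc32c in reality)
    return zlib.crc32(block)

def write_stripe(data_blocks):
    # step 1: each logical data block gets a checksum
    csum_tree = [csum(b) for b in data_blocks]
    # step 2: the blocks are laid out on disk, skipping the parity slot
    # step 3: parity is computed at write-out time and filled in; note
    # that it never gets an entry in csum_tree in this model
    parity = bytes(BLOCK)
    for b in data_blocks:
        parity = bytes(x ^ y for x, y in zip(parity, b))
    return data_blocks + [parity], csum_tree

stripe, sums = write_stripe([b"A" * BLOCK, b"B" * BLOCK])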
>> >
>> >    Checksums are not parity, correct. However, every data block
>> > (including, I think, the parity) is checksummed and put into the csum
>> > tree. This allows the FS to determine where damage has occurred,
>> > rather than simply detecting that it has occurred (which would be the
>> > case if the parity doesn't match the data, or if the two copies of a
>> > RAID-1 array don't match).
>> >
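That distinction -- knowing *where* the damage is rather than only *that*
something is wrong -- can be sketched like this (purely illustrative,
made-up function names):

import zlib

def find_bad_block(blocks, csums):
    # with a per-block csum you can point at the damaged block
    for i, (blk, c) in enumerate(zip(blocks, csums)):
        if zlib.crc32(blk) != c:
            return i
    return None

def stripe_matches_parity(blocks, parity):
    # with parity (or two RAID-1 copies) alone you only learn that the
    # stripe is inconsistent, not which block needs to be rebuilt
    acc = bytes(len(parity))
    for blk in blocks:
        acc = bytes(x ^ y for x, y in zip(acc, blk))
    return acc == parity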
>>
>> Yes, that is what I wrote below. But that means that RAID5 with one
>> degraded disk won't be able to reconstruct data on the degraded disk,
>> because the reconstructed extent content won't match its checksum.
>> Which kinda makes RAID5 pointless.
>
>    Eh? How do you come to that conclusion?
>
>    For data, say you have n-1 good devices, with n-1 blocks on them.
> Each block has a checksum in the metadata, so you can read that
> checksum, read the blocks, and verify that they're not damaged. From
> those n-1 known-good blocks (all data, or one parity and the rest

We do not know whether parity is good or not because as far as I can
tell parity is not checksummed.

> data) you can reconstruct the remaining block. That reconstructed
> block won't be checked against the csum for the missing block -- it'll
> just be written and a new csum for it written with it.
>

So we have silent corruption. I fail to understand how it is an improvement :)
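For reference, the reconstruction path Hugo describes above would look
roughly like this (just a sketch; reconstruct_missing() is a made-up name,
and whether the parity block carries its own csum is exactly the point in
question):

import zlib

def reconstruct_missing(data_blocks, data_csums, parity):
    # verify the n-1 surviving data blocks against their csums first
    for blk, c in zip(data_blocks, data_csums):
        assert zlib.crc32(blk) == c, "a surviving data block is damaged"
    # XOR the survivors with parity to rebuild the missing block
    missing = parity
    for blk in data_blocks:
        missing = bytes(x ^ y for x, y in zip(missing, blk))
    # per the description above, the result is not checked against the
    # missing block's old csum -- a fresh csum is computed and written
    return missing, zlib.crc32(missing)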

>    Hugo.
>
>> ...
>> >
>> >> > It looks like uncorrectable failures might occur because parity is
>> >> > correct, but the parity checksum is out of date, so the parity checksum
>> >> > doesn't match even though data blindly reconstructed from the parity
>> >> > *would* match the data.
>> >> >
>> >>
>> >> Yep, that is how I read it too. So if your data is checksummed, it
>> >> should at least avoid silent corruption.
>> >>
>
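To illustrate that last point -- checksummed data at least turns a
stale-parity reconstruction into a detected error rather than silent
corruption -- a tiny toy example (not btrfs code; tiny block size just
for the demo):

import zlib

block = 16  # tiny blocks, just for the demo

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

# stripe written before the crash: d0, d1 and matching parity
d0_old, d1 = b"A" * block, b"B" * block
parity = xor(d0_old, d1)

# d0 is rewritten and its csum updated, but the crash hits before the
# parity block is rewritten -- the classic write-hole window
d0_new = b"C" * block
csum_d0 = zlib.crc32(d0_new)

# later the device holding d0 fails; rebuilding from d1 and the stale
# parity yields the old contents, which no longer match the stored csum
rebuilt = xor(d1, parity)
print(zlib.crc32(rebuilt) == csum_d0)   # False: detected, not silent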
> --
> Hugo Mills             | Debugging is like hitting yourself in the head with
> hugo@... carfax.org.uk | hammer: it feels so good when you find the bug, and
> http://carfax.org.uk/  | you're allowed to stop debugging.
> PGP: E2AB1DE4          |                                        PotatoEngineer