Re: Adventures in btrfs raid5 disk recovery

Hugo Mills Fri, 24 Jun 2016 01:51:07 -0700

On Fri, Jun 24, 2016 at 07:02:34AM +0300, Andrei Borzenkov wrote:
> 24.06.2016 04:47, Zygo Blaxell wrote:
> > On Thu, Jun 23, 2016 at 06:26:22PM -0600, Chris Murphy wrote:
> >> On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli <kreij...@inwind.it> wrote:
> >>> The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the
> >>> checksum.
> >>
> >> Yeah I'm kinda confused on this point.
> >>
> >> https://btrfs.wiki.kernel.org/index.php/RAID56
> >>
> >> It says there is a write hole for Btrfs. But defines it in terms of
> >> parity possibly being stale after a crash. I think the term comes not
> >> from merely parity being wrong but parity being wrong *and* then being
> >> used to wrongly reconstruct data because it's blindly trusted.
> > 
> > I think the opposite is more likely, as the layers above raid56
> > seem to check the data against sums before raid56 ever sees it.
> > (If those layers seem inverted to you, I agree, but OTOH there are
> > probably good reasons to do it that way).
> > 
> 
> Yes, that's how I read the code as well. The btrfs layer that does
> checksumming is unaware of parity blocks at all; for all practical
> purposes they do not exist. What happens is approximately:
> 
> 1. logical extent is allocated and checksum computed
> 2. it is mapped to physical area(s) on disks, skipping over what would
> be parity blocks
> 3. when these areas are written out, RAID56 parity is computed and filled in
> 
> IOW btrfs checksums are for (meta)data and RAID56 parity is not data.
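
   The three steps quoted above can be sketched roughly as follows. This
is a toy model in Python, not btrfs code: write_stripe and xor_blocks are
made-up names, zlib.crc32 is only a stand-in for the filesystem's real
checksum, and a stripe is assumed to be a list of equal-length byte
blocks with a single XOR parity block.

```python
import zlib
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together (RAID5-style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def write_stripe(data_blocks):
    """Toy model of the write path described in the quoted steps."""
    # Step 1: the checksumming layer sees only data blocks.
    # (zlib.crc32 is a stand-in, not the checksum btrfs really uses.)
    csums = [zlib.crc32(block) for block in data_blocks]
    # Step 2: data blocks are mapped to data slots; the parity slot is
    # skipped, so the checksumming layer never learns it exists.
    # Step 3: parity is computed when the stripe is written out, and in
    # this model no checksum is recorded for it.
    parity = xor_blocks(data_blocks)
    return {"data": list(data_blocks), "parity": parity, "csums": csums}


stripe = write_stripe([b"AAAA", b"BBBB", b"CCCC"])
```

   In this model there is one checksum per data block and none for the
parity block, which is the point of the "parity is not data" remark.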


   Checksums are not parity, correct. However, every data block
(including, I think, the parity) is checksummed and put into the csum
tree. This allows the FS to determine where damage has occurred,
rather than simply detecting that it has occurred (which would be the
case if the parity doesn't match the data, or if the two copies of a
RAID-1 array don't match).
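
   To illustrate the difference between locating damage and merely
detecting it, here is a minimal sketch. It is a toy model, not btrfs
code: all function names are invented for the example, zlib.crc32 stands
in for the real checksum, and it assumes single XOR parity with
checksums kept for the data blocks only.

```python
import zlib
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together (RAID5-style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity_mismatch(data, parity):
    """Detection only: says something is wrong, not which block."""
    return xor_blocks(data) != parity


def find_bad_blocks(data, csums):
    """Location: indexes of data blocks whose checksum does not match."""
    return [i for i, (block, csum) in enumerate(zip(data, csums))
            if zlib.crc32(block) != csum]


def repair(data, parity, csums):
    """Rebuild exactly one bad data block from parity, then verify it.

    Returns (index, rebuilt_block), or None if there is nothing to fix,
    more than one block is bad, or the rebuilt block fails its checksum
    (for instance because the parity itself is stale).
    """
    bad = find_bad_blocks(data, csums)
    if len(bad) != 1:
        return None
    i = bad[0]
    others = [block for j, block in enumerate(data) if j != i]
    rebuilt = xor_blocks(others + [parity])
    if zlib.crc32(rebuilt) != csums[i]:
        return None
    return i, rebuilt


good = [b"AAAA", b"BBBB", b"CCCC"]
csums = [zlib.crc32(block) for block in good]
parity = xor_blocks(good)

damaged = list(good)
damaged[1] = b"BxBB"  # simulate corruption of one data block
result = repair(damaged, parity, csums)
```

   The parity comparison alone flags the stripe as inconsistent, while
the per-block checksums single out block 1. Checking the rebuilt block
against its checksum is what keeps a stale parity block from being
trusted blindly.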

   (Note that csums for metadata are stored in the metadata block
itself, not in the csum tree).

   Hugo.

> > It looks like uncorrectable failures might occur because parity is
> > correct, but the parity checksum is out of date, so the parity checksum
> > doesn't match even though data blindly reconstructed from the parity
> > *would* match the data.
> > 
> 
> Yep, that is how I read it too. So if your data is checksummed, it
> should at least avoid silent corruption.
> 
> >> I don't read code well enough, but I'd be surprised if Btrfs
> >> reconstructs from parity and doesn't then check the resulting
> >> reconstructed data to its EXTENT_CSUM.
> > 
> > I wouldn't be surprised if both things happen in different code paths,
> > given the number of different paths leading into the raid56 code and
> > the number of distinct failure modes it seems to have.
> > 
> 
> Well, the problem is that a parity block cannot be redirected on write
> the way data blocks are, which makes it impossible to keep versions of
> it. The only solution I see is to always use full-stripe writes, either
> by wasting space in fixed-width stripes or by using variable-width
> stripes, so that every stripe always gets a new version of parity. This
> makes it possible to keep parity checksums like data checksums.
> 



-- 
Hugo Mills             | Darkling's First Law of Filesystems:
hugo@... carfax.org.uk | The user hates their data
http://carfax.org.uk/  |
PGP: E2AB1DE4          |
