On Fri, Jun 24, 2016 at 12:52:21PM +0300, Andrei Borzenkov wrote:
> On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills <h...@carfax.org.uk> wrote:
> > On Fri, Jun 24, 2016 at 07:02:34AM +0300, Andrei Borzenkov wrote:
> >> 24.06.2016 04:47, Zygo Blaxell wrote:
> >> > On Thu, Jun 23, 2016 at 06:26:22PM -0600, Chris Murphy wrote:
> >> >> On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli
> >> >> <kreij...@inwind.it> wrote:
> >> >>> The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to
> >> >>> the checksum.
> >> >>
> >> >> Yeah, I'm kinda confused on this point.
> >> >>
> >> >> https://btrfs.wiki.kernel.org/index.php/RAID56
> >> >>
> >> >> It says there is a write hole for Btrfs, but defines it in terms
> >> >> of parity possibly being stale after a crash. I think the term
> >> >> comes not from merely parity being wrong, but parity being wrong
> >> >> *and* then being used to wrongly reconstruct data because it's
> >> >> blindly trusted.
> >> >
> >> > I think the opposite is more likely, as the layers above raid56
> >> > seem to check the data against sums before raid56 ever sees it.
> >> > (If those layers seem inverted to you, I agree, but OTOH there are
> >> > probably good reasons to do it that way.)
> >>
> >> Yes, that's how I read the code as well. The btrfs layer that does
> >> checksumming is unaware of parity blocks at all; for all practical
> >> purposes they do not exist. What happens is approximately:
> >>
> >> 1. a logical extent is allocated and its checksum computed;
> >> 2. it is mapped to physical area(s) on disks, skipping over what
> >>    would be parity blocks;
> >> 3. when these areas are written out, RAID56 parity is computed and
> >>    filled in.
> >>
> >> IOW, btrfs checksums are for (meta)data, and RAID56 parity is not
> >> data.
> >
> > Checksums are not parity, correct. However, every data block
> > (including, I think, the parity) is checksummed and put into the csum
> > tree. This allows the FS to determine where damage has occurred,
> > rather than simply detecting that it has occurred (which would be the
> > case if the parity doesn't match the data, or if the two copies of a
> > RAID-1 array don't match).
>
> Yes, that is what I wrote below. But that means that RAID5 with one
> degraded disk won't be able to reconstruct data on this degraded disk,
> because the reconstructed extent content won't match the checksum.
> Which kinda makes RAID5 pointless.
Eh? How do you come to that conclusion?

For data, say you have n-1 good devices, with n-1 blocks on them. Each
block has a checksum in the metadata, so you can read that checksum,
read the blocks, and verify that they're not damaged. From those n-1
known-good blocks (all data, or one parity and the rest data) you can
reconstruct the remaining block. That reconstructed block won't be
checked against the csum for the missing block -- it'll just be
written, and a new csum for it written with it.

   Hugo.

> ...
>
> >> > It looks like uncorrectable failures might occur because parity
> >> > is correct, but the parity checksum is out of date, so the parity
> >> > checksum doesn't match even though data blindly reconstructed
> >> > from the parity *would* match the data.
> >>
> >> Yep, that is how I read it too. So if your data is checksummed, it
> >> should at least avoid silent corruption.

-- 
Hugo Mills             | Debugging is like hitting yourself in the head with
hugo@... carfax.org.uk | hammer: it feels so good when you find the bug, and
http://carfax.org.uk/  | you're allowed to stop debugging.
PGP: E2AB1DE4          |                                       PotatoEngineer
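
For concreteness, here is a minimal userspace C sketch of the behaviour
described in this thread. It is not btrfs code: the block size, the
device count, the FNV-1a stand-in for btrfs's default crc32c, and every
name in it (csum_tree, degraded_read, and so on) are invented for
illustration. main() mirrors steps 1-3 from Andrei's message (checksum
the data, then fill in parity at write-out), and degraded_read() mirrors
Hugo's description of a degraded read: verify the n-1 surviving blocks
against their csums, XOR them to rebuild the missing block, and write a
fresh csum for the rebuilt block rather than checking it against an old
one.

/*
 * Toy sketch only -- not btrfs code. All sizes and names are invented.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 16   /* toy stripe element, not a real btrfs size */
#define NDEVS      4    /* 3 data blocks + 1 parity block per stripe */

/* Stand-in for the csum tree: checksums for data blocks only.
 * Per the discussion above, parity gets no csum entry here. */
static uint32_t csum_tree[NDEVS - 1];

/* Toy checksum (FNV-1a); btrfs actually defaults to crc32c. */
static uint32_t csum(const uint8_t *buf, size_t len)
{
        uint32_t h = 2166136261u;
        while (len--)
                h = (h ^ *buf++) * 16777619u;
        return h;
}

/* Verify the n-1 surviving blocks against their csums, XOR them to
 * rebuild the missing block, then record a fresh csum for it.  The
 * rebuilt block is not checked against any old csum, as noted above. */
static int degraded_read(uint8_t blk[NDEVS][BLOCK_SIZE], int missing)
{
        uint8_t rebuilt[BLOCK_SIZE] = { 0 };

        for (int d = 0; d < NDEVS; d++) {
                if (d == missing)
                        continue;
                /* Parity (the last device here) has no csum entry. */
                if (d < NDEVS - 1 &&
                    csum(blk[d], BLOCK_SIZE) != csum_tree[d])
                        return -1;  /* a survivor is bad: unrecoverable */
                for (int i = 0; i < BLOCK_SIZE; i++)
                        rebuilt[i] ^= blk[d][i];
        }

        memcpy(blk[missing], rebuilt, BLOCK_SIZE);
        if (missing < NDEVS - 1)    /* rebuilt a data block: new csum */
                csum_tree[missing] = csum(rebuilt, BLOCK_SIZE);
        return 0;
}

int main(void)
{
        uint8_t blk[NDEVS][BLOCK_SIZE];

        /* Checksum the data first, then fill in parity at write-out:
         * the same order as steps 1-3 in the quoted message. */
        for (int d = 0; d < NDEVS - 1; d++) {
                memset(blk[d], 'a' + d, BLOCK_SIZE);
                csum_tree[d] = csum(blk[d], BLOCK_SIZE);
        }
        memset(blk[NDEVS - 1], 0, BLOCK_SIZE);
        for (int d = 0; d < NDEVS - 1; d++)
                for (int i = 0; i < BLOCK_SIZE; i++)
                        blk[NDEVS - 1][i] ^= blk[d][i];

        memset(blk[1], 0, BLOCK_SIZE);  /* "lose" device 1 */
        if (degraded_read(blk, 1) == 0)
                printf("rebuilt block 1: %.16s\n", (const char *)blk[1]);
        return 0;
}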