On Fri, Jun 24, 2016 at 11:40:56AM -0600, Chris Murphy wrote:
> On Fri, Jun 24, 2016 at 4:16 AM, Hugo Mills <h...@carfax.org.uk> wrote:
> > On Fri, Jun 24, 2016 at 12:52:21PM +0300, Andrei Borzenkov wrote:
> > For data, say you have n-1 good devices, with n-1 blocks on them.
> > Each block has a checksum in the metadata, so you can read that
> > checksum, read the blocks, and verify that they're not damaged. From
> > those n-1 known-good blocks (all data, or one parity and the rest
> > data) you can reconstruct the remaining block. That reconstructed
> > block won't be checked against the csum for the missing block -- it'll
> > just be written and a new csum for it written with it.
>
> The last sentence is hugely problematic. Parity doesn't appear to be
> either CoW'd or checksummed. If it is used for reconstruction and the
> reconstructed data isn't compared to the data's EXTENT_CSUM entry, but
> that entry is rather recomputed and written, that's just like blindly
> trusting the parity is correct and then authenticating it with a csum.
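To make the two possibilities concrete, here is a rough sketch of the
two behaviours under discussion (Python pseudocode of my own, nothing
like the real raid56 code; plain crc32 stands in for the crc32c data
csum, and all the names are made up):

import zlib

def csum_of(block):
    # stand-in for the crc32c csum recorded in the csum tree
    return zlib.crc32(block)

def xor_reconstruct(surviving_blocks):
    # rebuild the missing block by XORing the surviving blocks together
    out = bytearray(len(surviving_blocks[0]))
    for b in surviving_blocks:
        for i, x in enumerate(b):
            out[i] ^= x
    return bytes(out)

def rebuild_and_recompute_csum(surviving_blocks):
    # The behaviour Chris is worried about: trust whatever falls out of
    # the XOR and generate a fresh csum for it, so bad parity gets
    # "authenticated" and the corruption becomes silent.
    rebuilt = xor_reconstruct(surviving_blocks)
    return rebuilt, csum_of(rebuilt)

def rebuild_leave_csum_alone(surviving_blocks, stored_csum):
    # The other possibility: write the rebuilt block but leave the
    # recorded csum untouched, so a later read catches the mismatch
    # if the parity was bad.
    rebuilt = xor_reconstruct(surviving_blocks)
    return rebuilt, stored_csum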
I think what happens is the data is recomputed, but the csum on the
data is _not_ updated (the csum does not reside in the raid56 code).
A read of the reconstructed data would get a csum failure (of course,
every 4 billionth time this happens the csum is correct by random
chance, so you wouldn't want to be reading parity blocks from a drive
full of garbage, but that's a different matter).

> It's not difficult to test. Corrupt one byte of parity. Yank a drive.
> Add a new one. Start a reconstruction with scrub or balance (or both
> to see if they differ) and find out what happens. What should happen
> is the reconstruct should work for everything except that one file. If
> it's reconstructed silently, it should contain visible corruption and
> we all collectively raise our eyebrows.

I've done something like that test: write random data to 1000 random
blocks on one disk, then run scrub (a rough sketch of the corruption
step is below). It reconstructs the data without problems (except for
the minor wart that 'scrub status -d' counts the errors randomly
against every device, while 'dev stats' counts all the errors on the
disk that was corrupted).

Disk-side data corruption is a thing I have to deal with a few times
each year, so I tested the btrfs raid5 implementation for that case
before I started using it.

As far as I can tell so far, everything in btrfs raid5 works properly
if a disk fails _while the filesystem is not mounted_.

The problem I see in the field is not *silent* corruption. It's a
whole lot of very *noisy* corruption detected under circumstances
where I'd expect to see no corruption at all (silent or otherwise).
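For anyone who wants to repeat the random-corruption test, the
corruption step was roughly this (a sketch only, not the exact script
I used; the device path and block count are just examples, and you
point it at exactly one member device of an otherwise healthy array):

import os
import random

dev = "/dev/sdX"       # one member device of the test array (example name)
blocksize = 4096       # clobber whole 4 KiB blocks
nblocks = 1000

with open(dev, "r+b") as f:
    devsize = f.seek(0, os.SEEK_END)
    for _ in range(nblocks):
        # pick a random block-aligned offset and overwrite it with garbage
        offset = random.randrange(devsize // blocksize) * blocksize
        f.seek(offset)
        f.write(os.urandom(blocksize))

Then mount the filesystem and scrub it; in my tests the corrupted
blocks were repaired, with the error-counting quirks noted above.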