On Mon, Jun 20, 2016 at 2:40 PM, Zygo Blaxell <ce3g8...@umail.furryterror.org> wrote:
> On Mon, Jun 20, 2016 at 01:30:11PM -0600, Chris Murphy wrote:
>> For me the critical question is what does "some corrupted sectors" mean?
>
> On other raid5 arrays, I would observe a small amount of corruption every
> time there was a system crash (some of which were triggered by disk
> failures, some not).

What test are you using to determine there is corruption, and how much
data is corrupted? Is this on every disk? Non-deterministically fewer
than all disks? Have you identified this as a torn write or a misdirected
write, or is it just garbage at some sectors? And what's the size?
A partial sector? A partial md chunk (or fs block)?

> It looked like any writes in progress at the time
> of the failure would be damaged. In the past I would just mop up the
> corrupt files (they were always the last extents written, easy to find
> with find-new or scrub) and have no further problems.

This is on Btrfs? This isn't supposed to be possible. Even a literal
overwrite of a file is not an overwrite on Btrfs unless the file is
nodatacow. Data extents get written first, then the metadata is updated
to point to those new blocks. Flush or FUA requests are supposed to
enforce that ordering, so the filesystem points to either the old file
or the new one, uncorrupted in either case. That's why I'm curious about
the nature of this corruption; it sounds like your hardware is not
exactly honoring flush requests.

With md raid and any other filesystem, it's pure luck that such
corrupted writes would only affect data extents and not the fs metadata.
Corrupted fs metadata is not well tolerated by any filesystem, not least
because most of them have no idea the metadata is corrupt. At least
Btrfs can detect this and either use another copy, if one exists, or
stop and face-plant before more damage happens. Maybe an exception now
is XFS v5 metadata, which employs checksumming. But even XFS still
doesn't know when data extents are wrong (i.e. a torn or misdirected
write).

I've had perhaps a hundred power-offs during writes with Btrfs on SSD,
and I don't ever see corrupt files.
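For reference, this is roughly how I'd triage it (a sketch; the mount
point /data, the generation number, and /dev/sda are placeholders for
your actual setup):

```shell
# List files changed since a given transaction generation; after a
# crash this surfaces the last extents written. The generation number
# here (12345) is a placeholder -- note the current one before a test.
btrfs subvolume find-new /data 12345

# Check whether the drive's volatile write cache is enabled. A cache
# that advertises but ignores flush/FUA can tear writes on power loss.
hdparm -W /dev/sda
```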
It's definitely not normal to see this with Btrfs.

> In the earlier
> cases there were no new instances of corruption after the initial failure
> event and manual cleanup.
>
> Now that I dig a little deeper into this, I do see one fairly significant
> piece of data:
>
> root@host:~# btrfs dev stat /data | grep -v ' 0$'
> [/dev/vdc].corruption_errs  16774
> [/dev/vde].write_io_errs    121
> [/dev/vde].read_io_errs     4
> [devid:8].read_io_errs      16
>
> Prior to the failure of devid:8, vde had 121 write errors and 4 read
> errors (these counter values are months old and the errors were long
> since repaired by scrub). The 16774 corruption errors on vdc are all
> new since the devid:8 failure, though.

On md RAID 5 and 6, if a scrub (echo check > md/sync_action) leaves the
array's parity mismatch count above 0, there's a hardware problem.

It's entirely possible you've found a bug, but it must be extremely
obscure for basically everyone trying Btrfs raid56 not to have hit it.
I think you need to track down the source of this corruption and stop it
however possible, whether that's changing hardware or making sure the
system isn't crashing.

--
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
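P.S. The md check I mentioned looks like this (a sketch; md0 and /data
are placeholders for your array and mount point):

```shell
# Kick off an md consistency check and read the result; any nonzero
# mismatch_cnt after the check completes points at hardware trouble.
echo check > /sys/block/md0/md/sync_action
cat /sys/block/md0/md/mismatch_cnt

# On the Btrfs side: scrub in the foreground (-B), then zero the device
# error counters (-z) so any future corruption stands out immediately.
btrfs scrub start -B /data
btrfs device stats -z /data
```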