On Sat, Jun 25, 2016 at 10:39 AM, Steven Haigh <net...@crc.id.au> wrote:
> Well, I did end up recovering the data that I cared about. I'm not
> really keen to ride the BTRFS RAID6 train again any time soon :\
>
> I now have the same as I've had for years - md RAID6 with XFS on top of
> it. I'm still copying data back to the array from the various sources I
> had to copy it to so I had enough space to do so.

Just make sure you've got each drive's SCT ERC shorter than the kernel
SCSI command timer for each block device, which lives in
/sys/block/device-name/device/timeout, or you can very easily end up
with the same if not a worse problem: total array collapse. It's rarer
to see the problem on md raid6 because the extra parity tends to paper
over the consequences of this misconfiguration, but it is a
misconfiguration, and it's the default unless you're using
enterprise/NAS specific drives that ship with short recoveries already
set. The linux-raid@ list is full of problems resulting from this
issue. (I've sketched the relevant commands at the end of this mail.)

I think the obvious mistake here, though, is assuming reshapes entail
no risk. There's a -f required for a reason. You could have ended up
in just as bad a situation doing a reshape of an md or lvm based array
without a backup. Yes, it should work, and if it doesn't it's a bug,
but how much data do you want to lose today?

> What I find interesting is that the patterns of corruption in the BTRFS
> RAID6 is quite clustered. I have ~80Gb of MP3s ripped over the years -
> of that, the corruption would take out 3-4 songs in a row, then the next
> 10 albums or so were intact. What made recovery VERY hard, is that it
> got to several situations that just caused a complete system hang.

The data stripe size is 64KiB * (number of disks - 2), so in your case
I think that's 64KiB * 3 = 192KiB. That's less than the size of one
song, so each bad song means roughly 15 bad stripes in a row. It's
also less than a block group.

The Btrfs conversion should be safer than the methods used by mdadm
and lvm because the operation is CoW: the raid6 block group is
supposed to remain intact and "live", if you will, until the single
block group has been written to stable media. The full set of kernel
messages from the crash might be useful for finding out what
instigated all of this corruption. But even then, a subsequent mount
should at worst roll back to a state with block groups of mixed
profiles, where the most recent (failed) conversion still has its
raid6 block group intact.

So I'd still say: btrfs-image it and host the image somewhere, file a
bug, cross-reference this thread in the bug and the bug URL in this
thread. It might take months or even a year before a dev looks at it,
but that's better than nothing. (There's an example btrfs-image
invocation at the end of this mail.)

> I tried it on bare metal - just in case it was a Xen thing, but it hard
> hung the entire machine then. In every case, it was a flurry of csum
> error messages, then instant death. I would have been much happier if
> the file had been skipped or returned as unavailable instead of having
> the entire machine crash.

Of course. The unanswered question, though, is why there are so many
csum errors in the first place. Are they metadata csum errors, or are
they EXTENT_CSUM errors, and how are they becoming wrong? Wrongly
read, wrongly written, wrongly recomputed from parity? And if it's the
parity, how did the parity go bad? It needs an autopsy, or it just
doesn't get better.
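Checking and fixing the ERC/timeout relationship looks something like
this (a sketch, untested here; /dev/sda is just an example, and not
every drive supports SCT ERC):

  # Show the drive's current SCT ERC setting, if the drive supports it
  smartctl -l scterc /dev/sda

  # Set read/write error recovery to 7.0 seconds (units are deciseconds)
  smartctl -l scterc,70,70 /dev/sda

  # The kernel's SCSI command timer for that device, in seconds
  # (the default is 30)
  cat /sys/block/sda/device/timeout

  # If the drive doesn't support SCT ERC at all, raise the kernel timer
  # instead, so the drive's own long recovery can finish before the
  # kernel gives up and resets the link
  echo 180 > /sys/block/sda/device/timeout

The point is the drive must give up on a bad sector before the kernel
gives up on the drive; otherwise the read error never gets reported,
and the array never gets a chance to fix it.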
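And the metadata image for the bug report would be along these lines
(again a sketch; the device and output path are examples, and -s
sanitizes file names in case anything there is private):

  # Compressed (-c9), multi-threaded (-t4), sanitized (-s) metadata-only
  # image of the unmounted filesystem, suitable for attaching to a bug
  btrfs-image -c9 -t4 -s /dev/sda3 /tmp/raid6-fail.btrfs-image

--
Chris Murphy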