On Sat, Jun 25, 2016 at 10:39 AM, Steven Haigh <net...@crc.id.au> wrote:
> Well, I did end up recovering the data that I cared about. I'm not
> really keen to ride the BTRFS RAID6 train again any time soon :\
>
> I now have the same as I've had for years - md RAID6 with XFS on top of
> it. I'm still copying data back to the array from the various sources I
> had to copy it to so I had enough space to do so.

Just make sure you've got each drive's SCT ERC shorter than the kernel
SCSI command timer for each block device, which lives in
/sys/block/device-name/device/timeout, or you can very easily end up
with the same if not a worse problem: total array collapse. It's rarer
to see the problem on md raid6 because the extra parity tends to paper
over the consequences of this misconfiguration, but it is a
misconfiguration, and it's the default unless you're using
enterprise/NAS specific drives that ship with short recoveries already
set. The linux-raid@ list is full of problems resulting from this
issue. (I've sketched the relevant commands at the end of this mail.)

I think the obvious mistake here, though, is assuming reshapes entail
no risk. There's a -f required for a reason. You could have ended up
in just as bad a situation doing a reshape of an md or lvm based array
without a backup. Yes, it should work, and if it doesn't it's a bug,
but how much data do you want to lose today?

> What I find interesting is that the patterns of corruption in the BTRFS
> RAID6 is quite clustered. I have ~80Gb of MP3s ripped over the years -
> of that, the corruption would take out 3-4 songs in a row, then the next
> 10 albums or so were intact. What made recovery VERY hard, is that it
> got to several situations that just caused a complete system hang.

The data stripe size is 64KiB * (number of disks - 2), so in your case
I think that's 64KiB * 3 = 192KiB. That's less than the size of one
song, so each bad song means roughly 15 bad stripes in a row. It's
also less than a block group.

The Btrfs conversion should be safer than the methods used by mdadm
and lvm because the operation is CoW: the raid6 block group is
supposed to remain intact and "live", if you will, until the single
block group has been written to stable media. The full set of kernel
messages from the crash might be useful for finding out what
instigated all of this corruption. But even then, a subsequent mount
should at worst roll back to a state with block groups of mixed
profiles, where the most recent (failed) conversion still has its
raid6 block group intact.

So I'd still say: btrfs-image it and host the image somewhere, file a
bug, cross-reference this thread in the bug and the bug URL in this
thread. It might take months or even a year before a dev looks at it,
but that's better than nothing. (There's an example btrfs-image
invocation at the end of this mail.)

> I tried it on bare metal - just in case it was a Xen thing, but it hard
> hung the entire machine then. In every case, it was a flurry of csum
> error messages, then instant death. I would have been much happier if
> the file had been skipped or returned as unavailable instead of having
> the entire machine crash.

Of course. The unanswered question, though, is why there are so many
csum errors in the first place. Are they metadata csum errors, or are
they EXTENT_CSUM errors, and how are they becoming wrong? Wrongly
read, wrongly written, wrongly recomputed from parity? And if it's the
parity, how did the parity go bad? It needs an autopsy, or it just
doesn't get better.
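Checking and fixing the ERC/timeout relationship looks something like
this (a sketch, untested here; /dev/sda is just an example, and not
every drive supports SCT ERC):

  # Show the drive's current SCT ERC setting, if the drive supports it
  smartctl -l scterc /dev/sda

  # Set read/write error recovery to 7.0 seconds (units are deciseconds)
  smartctl -l scterc,70,70 /dev/sda

  # The kernel's SCSI command timer for that device, in seconds
  # (the default is 30)
  cat /sys/block/sda/device/timeout

  # If the drive doesn't support SCT ERC at all, raise the kernel timer
  # instead, so the drive's own long recovery can finish before the
  # kernel gives up and resets the link
  echo 180 > /sys/block/sda/device/timeout

The point is the drive must give up on a bad sector before the kernel
gives up on the drive; otherwise the read error never gets reported,
and the array never gets a chance to fix it.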
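And the metadata image for the bug report would be along these lines
(again a sketch; the device and output path are examples, and -s
sanitizes file names in case anything there is private):

  # Compressed (-c9), multi-threaded (-t4), sanitized (-s) metadata-only
  # image of the unmounted filesystem, suitable for attaching to a bug
  btrfs-image -c9 -t4 -s /dev/sda3 /tmp/raid6-fail.btrfs-image

--
Chris Murphy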