Thanks Chris.

Everything is/was raid6.  Oddly, when I created the filesystem there
was a mix of raid1 and raid6 profiles, but a balance with
-dconvert/-mconvert after creation set everything to raid6.
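For reference, the convert I ran was along these lines (the mount
point here is a placeholder, not the exact path I used):

```shell
# Rebalance both data (-d) and metadata (-m) chunks into the raid6
# profile; /mnt2/pool stands in for the real mount point.
btrfs balance start -dconvert=raid6 -mconvert=raid6 /mnt2/pool

# Verify that every chunk type now reports raid6
btrfs fi df /mnt2/pool
```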

I did previously try btrfs-image, since some Google searching turned
it up as a "first thing to do", but that command won't run either; it
fails with essentially the same errors (there are additional "device
is missing" errors now, but the output is otherwise identical to what
I saw before).

I'm happy to help file a bug report, but can I still provide
actionable information without btrfs-image working?

[root@san01 btrfs-progs]# ./btrfs-image -c9 -t4 /dev/sdc /mnt2/backup/sdc.img
warning, device 4 is missing
warning devid 4 not found already
checksum verify failed on 21364736 found EC809498 wanted 0863292E
checksum verify failed on 21364736 found 925303CE wanted 09150E74
checksum verify failed on 21364736 found 925303CE wanted 09150E74
bytenr mismatch, want=21364736, have=1065943040
Couldn't read chunk tree
Open ctree failed
create failed (Bad file descriptor)

After the chunk-recover failed, I suspected there might be some
correlation with the read of /dev/sdg stopping early.  I say early
because the other 4 drives of the same capacity continued reading for
quite some time.

So I tested a dd of sdg to a file; after running for about two hours
it stopped prematurely at some 700-odd gigs and left errors in the
logs (I'll tack them onto the end of this email for the curious).
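For the curious, the dd was essentially this (destination path is
approximate); note that plain dd aborts at the first unrecovered read
error, which is why it stopped rather than skipping ahead:

```shell
# Straight image of the suspect drive; without conv=noerror,sync dd
# exits on the first hard read error, as it did here ~700 GB in.
dd if=/dev/sdg of=/mnt2/backup/sdg.img bs=1M
```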

At this point I decided sdg was dead and couldn't be doing any good
while installed, so I pulled it.  Still unable to mount, I rebooted.
Unfortunately I am still unable to mount after the reboot (I tried
again just now with all the options you posted; no dice), so I am
running the chunk-recover command again.
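Concretely, the two steps I'm retrying are these (device and mount
point are from my setup; anyone else's would come from a current
'btrfs fi show'):

```shell
# Most conservative mount attempt, per Chris's suggestion
mount -o ro,recovery,degraded,skip_balance /dev/sdc /mnt2/pool

# Failing that, re-run chunk recovery against a surviving member
btrfs rescue chunk-recover -v /dev/sdc
```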

It would be neat if I could contribute somehow!

Thanks again,
Donald

Here's the drive vomiting in my logs after it got halfway through the
dd image attempt.

Jul  1 17:05:51 san01 kernel: sd 0:0:6:0: [sdg] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jul  1 17:05:51 san01 kernel: sd 0:0:6:0: [sdg] Sense Key : Medium Error [current]
Jul  1 17:05:51 san01 kernel: sd 0:0:6:0: [sdg] Add. Sense: Unrecovered read error
Jul  1 17:05:51 san01 kernel: sd 0:0:6:0: [sdg] CDB: Read(10) 28 00 5a 5b f1 e0 00 01 00 00
Jul  1 17:05:51 san01 kernel: blk_update_request: critical medium error, dev sdg, sector 1515975136
Jul  1 17:05:57 san01 kernel: sd 0:0:6:0: [sdg] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jul  1 17:05:57 san01 kernel: sd 0:0:6:0: [sdg] Sense Key : Medium Error [current]
Jul  1 17:05:57 san01 kernel: sd 0:0:6:0: [sdg] Add. Sense: Unrecovered read error
Jul  1 17:05:57 san01 kernel: sd 0:0:6:0: [sdg] CDB: Read(10) 28 00 5a 5b f2 e0 00 01 00 00
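Side note: decoding the Read(10) CDB above confirms the kernel's
sector number; in a READ(10) CDB, bytes 2-5 are the big-endian LBA
and bytes 7-8 the transfer length in blocks:

```shell
# LBA bytes 5a 5b f1 e0 and length bytes 01 00 from the first CDB
lba=$((0x5a5bf1e0))
len=$((0x0100))
echo "lba=$lba blocks=$len"   # matches "sector 1515975136" above
```

The second failed CDB (5a 5b f2 e0) is exactly 256 blocks later, i.e.
the next read picked up right where the first one died.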

On Wed, Jul 1, 2015 at 6:29 PM, Chris Murphy <li...@colorremedies.com> wrote:
> On Wed, Jul 1, 2015 at 3:35 PM, Donald Pearson
> <donaldwhpear...@gmail.com> wrote:
>
>> *** Error in `./btrfs': free(): invalid next size (fast): 0x0000000001332100 ***
>> Segmentation fault
>
> Blek. Well that's a bug then too. If you have space somewhere to put a
> btrfs-image -c9 -t4, I'd do that now before making any more changes.
> Write up a bugzilla.kernel.org bug, include the URL for the image file
> (which will be large). Include the URL for the bug in this thread. And
> then it's wait time basically. I'm not a dev but this sounds rather
> serious.
>
> The pisser is that this is exactly the use case for raid6. You have a
> failed drive, want an extra margin to cover possible additional
> errors, you get a "BTRFS: failed to read chunk root on sdc" which
> could be construed as a problem with sdc, so a 2nd failure, and yet no
> reconstruction of the necessary metadata.
>
> Is metadata also raid6? Or just data? I don't see a 'btrfs fi df'
> probably because you can't mount the volume. Do you know if it was
> created with -d raid6 -m raid6 at mkfs time? (Include this info in the
> bug report.)
>
> Failing device handling with Btrfs is still weak. In many cases it
> will keep trying to use a device that produces spurious or even failed
> read and write errors. It's possible this caused some confusion.
>
> I propose trying the following. You could wait to see if someone else
> has better suggestions, but this seems reasonably safe.
>
> - Physically remove sdg from the system, reboot, and see if you can
> mount the volume with the most conservative mount option: -o
> ro,recovery,degraded,skip_balance
>
> If that doesn't work, and you still get the message about chunk root
> on devid 1/sdc (thing is, when you remove sdg it's possible drive
> letters will change, so be sure to correlate any errors to devid by
> using a current 'btrfs fi show' listing), then yuck.
>
> I would try chunk recover again, now that known bad drive sdg is
> physically removed. Do you get a different result, or still a seg
> fault?
>
> If those two things still fail, what's next is a toss up between two options:
>
> - Find or build a "4.2" kernel (there is no rc1 yet); Fedora has
> several "4.2"/linux-next binaries already built in the koji build
> system, so your distro might have extremely new kernels available
> somewhere for bleeding edgers. Try this with the above mount options
> again. In the recent git pull for this kernel there were nearly 2000
> lines added, and nearly that many deleted. A lot of changes. So it's
> worth a shot. It could produce a good result or a worse result, or the
> same result. *shrug* What I probably wouldn't try while running the
> 4.2 kernel is another chunk recover. Seems doubtful it will make much
> difference.
>
> and the other option:
>
> - Physically remove the device that still produces the "BTRFS: failed
> to read chunk root on sdX" error, which in the current state as you
> posted it, was /dev/sdc (devid 1). Physically remove it. Reboot. And
> then retry the same mount options from above and see what that results
> in. If there were no problems with your file system, removing two
> devices and mounting degraded should work without errors (I've done
> it), so it seems like a valid thing to try seeing as two devices are
> giving you a hard time. Will a 3rd? Dunno.
>
> Anyway, not good news. But you're helping make Btrfs better!
>
>
>
> --
> Chris Murphy
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html