Re: btrfs rescue chunk-recover segfaults

2017-01-26 Thread Simon Waid



Am 24.01.2017 um 11:05 schrieb Duncan:

Simon Waid posted on Mon, 23 Jan 2017 09:42:28 +0100 as excerpted:



I have a btrfs raid5 array that has become unmountable.

[As a list regular and btrfs user myself, not a dev, but I try to help
with replies where I can, in order to let the devs and real experts put
their time to better use where I can't help.]

As stated on the btrfs wiki and in the mkfs.btrfs manpage, btrfs raid56
hasn't stabilized and is now known to have critical defects in its
original implementation that make it unfit for the purposes most people
normally use parity-raid for.  It's not recommended except for testing
with purely sacrificial data that might well be eaten by the test.

Thus, anyone on btrfs raid56 mode should only be testing with
effectively throw-away data, either because it's backed up and can be
easily retrieved from that backup, or because it really /is/ throw-away
data, making the damage from losing that testing filesystem minimal.

As such, if you're done with testing, the fastest and most efficient way
back to production is to forget about the broken filesystem and blow it
away with a mkfs to a new filesystem, either some other btrfs mode or
something other than the still-maturing btrfs entirely, your choice.
Then you can restore from backups, if the data was worth keeping them.

Tho of course blowing it away does mean it can't be used as a lab
specimen to perhaps help find and fix some of the problems that do affect
raid56 mode at this time.

Qu Wenruo in particular, and others, have been gradually working thru at
least some of the raid56 mode bugs, tho it's still possible the current
code is beyond hope and may need to be entirely rewritten to properly
stabilize.  If you don't have to put the space the filesystem was taking
directly back in service, and can build and work with the newest code,
possibly including patches they ask you to apply, you may be able to use
your deployment as a lab specimen to help them test their newest recovery
code, and possibly help fix additional bugs in the process.

However, even then, don't expect that you'll necessarily recover most of
what was on the filesystem, as raid56 mode really is seriously bugged
ATM, and it's quite possible that the data has already been wiped out by
those bugs.  Mostly, you'd simply be continuing to use the filesystem as
an in-the-wild test deployment gone bad, now testing diagnostics and
possible recovery.  There's not necessarily a good chance of recovering
the data, but that's OK, since btrfs raid56 mode was never out of
unstable, testing-only status in the first place, so any data put on it
always was effectively sacrificial data, known to be potentially eaten
by the testing itself.


Thank you, Duncan, for the information.

Before wiping the filesystem, is there anything I should do to help fix
the segfault in chunk-recover?



RAID5: btrfs rescue chunk-recover segfaults.

2017-01-23 Thread Simon Waid
Dear all,

I have a btrfs raid5 array that has become unmountable. When I try to
mount it, dmesg contains the following:

[ 5686.334384] BTRFS info (device sdb): disk space caching is enabled
[ 5688.377244] BTRFS info (device sdb): bdev /dev/sdb errs: wr 2517, rd
77, flush 0, corrupt 0, gen 0
[ 5688.377254] BTRFS info (device sdb): bdev /dev/sdc errs: wr 0, rd 0,
flush 0, corrupt 10, gen 0
[ 5688.377261] BTRFS info (device sdb): bdev /dev/sdd1 errs: wr 0, rd 0,
flush 0, corrupt 5, gen 0
[ 5688.377268] BTRFS info (device sdb): bdev /dev/sde errs: wr 21, rd
8807, flush 0, corrupt 0, gen 0
[ 5688.744249] BTRFS error (device sdb): parent transid verify failed on
16227387371520 wanted 88711 found 88395
[ 5689.533817] BTRFS error (device sdb): parent transid verify failed on
16227388260352 wanted 88711 found 88395
[ 5689.609355] BTRFS error (device sdb): parent transid verify failed on
16227415158784 wanted 88711 found 88397
[ 5689.627715] BTRFS error (device sdb): parent transid verify failed on
16227415158784 wanted 88711 found 88397
[ 5689.627731] BTRFS error (device sdb): failed to read block groups: -5
[ 5689.675017] BTRFS error (device sdb): open_ctree failed
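
[For context on what these errors mean: whenever btrfs reads a metadata
block, it compares the generation (transaction id) stamped in the block's
header against the transid the parent node recorded when it last pointed
at that block. A mismatch means the copy on disk is stale, typically
because writes were lost during an interruption. The C sketch below is
modelled loosely on the kernel's verify_parent_transid(); the struct and
helper here are simplifications for illustration, not the actual kernel
code.]

#include <stdint.h>
#include <stdio.h>

/* Simplified stand-in for a btrfs metadata block ("extent buffer"). */
struct tree_block {
    uint64_t bytenr;        /* logical address of the block */
    uint64_t generation;    /* transid stamped when the block was written */
};

/* Return 0 if the block carries the generation its parent recorded for
 * it; otherwise report the mismatch the way the dmesg lines above do. */
int verify_parent_transid(const struct tree_block *eb,
                          uint64_t parent_transid)
{
    if (parent_transid == 0 || eb->generation == parent_transid)
        return 0;
    fprintf(stderr,
            "parent transid verify failed on %llu wanted %llu found %llu\n",
            (unsigned long long)eb->bytenr,
            (unsigned long long)parent_transid,
            (unsigned long long)eb->generation);
    return -1;    /* the read fails; -5 in the log above is -EIO */
}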

I tried to recover from the problem using:

btrfs rescue chunk-recover -v /dev/sdb

The command runs for a few minutes. Then it segfaults. I used gdb to
debug. This is the backtrace:

Starting program: btrfs-progs/btrfs rescue chunk-recover -v /dev/sdb
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
All Devices:
 Device: id = 4, name = /dev/sde
 Device: id = 1, name = /dev/sdd1
 Device: id = 2, name = /dev/sdc
 Device: id = 3, name = /dev/sdb

[New Thread 0x76f6e700 (LWP 8155)]
[New Thread 0x7676d700 (LWP 8156)]
[New Thread 0x75f6c700 (LWP 8157)]
[New Thread 0x7576b700 (LWP 8158)]
Scanning: 24603734016 in dev0, 32581337088 in dev1, 37911248896 in dev2,
32217350144 in dev3
Thread 2 "btrfs" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x76f6e700 (LWP 8155)]
btrfs_new_device_extent_record (leaf=leaf@entry=0x78c0,
key=key@entry=0x76f6dc90, slot=slot@entry=12)
 at cmds-check.c:6656
6656            rec->chunk_objecteid =
(gdb) backtrace
#0  btrfs_new_device_extent_record (leaf=leaf@entry=0x78c0,
key=key@entry=0x76f6dc90, slot=slot@entry=12)
 at cmds-check.c:6656
#1  0x004370d2 in process_device_extent_item (slot=12,
key=0x76f6dc90, leaf=0x78c0,
 devext_cache=0x7fffe410) at chunk-recover.c:332
#2  extract_metadata_record (rc=rc@entry=0x7fffe3c0,
leaf=leaf@entry=0x78c0) at chunk-recover.c:727
#3  0x0043759b in scan_one_device (dev_scan_struct=0x6ae420) at
chunk-recover.c:807
#4  0x7733f6ba in start_thread (arg=0x76f6e700) at
pthread_create.c:333
#5  0x7707582d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:109
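
[For anyone reading along: the faulting frame is where chunk-recover
copies the fields of a device extent item out of a scanned leaf into a
freshly allocated record. Below is a minimal self-contained C sketch of
that pattern; the struct layout and the leaf_read_u64() helper are
assumptions for illustration, not the btrfs-progs source, but they show
the two usual ways such a parser segfaults: an unchecked allocation, or
an item offset from a corrupted leaf that points outside the buffer.]

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for the leaf buffer that chunk-recover scans. */
struct extent_buffer {
    const uint8_t *data;    /* raw leaf bytes as read from disk */
    size_t len;             /* size of the buffer */
};

struct device_extent_record {
    uint64_t chunk_objecteid;   /* spelling as shown at cmds-check.c:6656 */
    uint64_t chunk_offset;
    uint64_t length;
};

/* Hypothetical helper: copy a u64 field out of the leaf, refusing to
 * read past the end of the buffer. Skipping this bounds check and
 * trusting a corrupted item offset is the classic way to fault here. */
uint64_t leaf_read_u64(const struct extent_buffer *leaf, size_t off)
{
    uint64_t v = 0;

    if (off + sizeof(v) <= leaf->len)
        memcpy(&v, leaf->data + off, sizeof(v));
    return v;
}

struct device_extent_record *
new_device_extent_record(const struct extent_buffer *leaf, size_t item_off)
{
    struct device_extent_record *rec = calloc(1, sizeof(*rec));

    /* calloc can fail; without this check, the first field assignment
     * below would write through NULL and fault, as in the backtrace. */
    if (!rec)
        return NULL;
    rec->chunk_objecteid = leaf_read_u64(leaf, item_off);
    rec->chunk_offset    = leaf_read_u64(leaf, item_off + 8);
    rec->length          = leaf_read_u64(leaf, item_off + 16);
    return rec;
}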

Information about the system:

uname -a: Linux 4.10.0-041000rc4-generic #201701152031 SMP Mon Jan 16
01:33:39 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
btrfs-progs --version: v4.9 (from
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git)
sudo btrfs fi show
Label: none  uuid: a27cc0cf-1665-43ba-8c63-bf236d31fcd2
 Total devices 4 FS bytes used 6.51TiB
 devid    1 size 2.73TiB used 2.73TiB path /dev/sdd1
 devid    2 size 7.28TiB used 2.73TiB path /dev/sdc
 devid    3 size 3.64TiB used 3.56TiB path /dev/sdb
 devid    4 size 1.82TiB used 1.46TiB path /dev/sde
btrfs fi df won't work as the filesystem is not mountable.

Any help would be appreciated!

Best regards,
Simon


PS: I'd also like to mention how the raid array became unmountable.

The system I was running at that time was:
Kernel: 4.8.0-34 generic #36~16.04.1 Ubuntu SMP
btrfs-progs --version: v4.4

- I issued a replace command on disk 2. During the replace, disk 4 was
disconnected. I noticed it and rebooted the system just a few seconds
after the event. After the reboot, the replace continued and eventually
finished. However, dmesg showed errors like: parent transid verify
failed on 16227387371520 wanted 88711 found 88395.

- I issued a resize command on the new drive to make the additional
space available: btrfs resize 2:max. It completed without errors.

- I issued a balance without any filters in the hope that it would
correct the "parent transid verify failed" errors. The balance started
normally. However, after about one hour, I saw that no I/O was happening
and lots of errors appeared in dmesg. I tried to reboot, but the command
had no effect, so I disconnected the PC from the power supply.

I have attached the dmesg for the resize and balance operations.


