Hello,

TL;DR ==

btrfs 3x500GB RAID5 - one device failed. I added a new device (btrfs device 
add) and tried to remove the failed device (btrfs device delete).

I tried to mount the array in degraded mode, but that didn't work either. After 
multiple attempts (including adding back the failed HDD), I finally ran the 
btrfs rescue chunk-recover command on the primary member /dev/sdb.

This ran for about 4 hours, and then failed with "floating point exception (core 
dumped)".
==

I am testing out btrfs to gain familiarity with it. I am quite amazed at its 
capabilities and performance. However, I have either misunderstood RAID5 fault 
tolerance or failed to implement it correctly.

I understand from the wiki that RAID56 is experimental. The data I am working 
with is backed up elsewhere and is, for all intents and purposes, discardable.

I have set up a btrfs RAID5 with 3x500GB Seagate HDDs, with a mount point of 
/storage. Booting is off a fourth HDD (ext4, lubuntu 64bit) that is not 
involved in the RAID.
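
For reference, the array was created more or less like this (reconstructed 
from memory; the device names are my best guess, and I believe I set metadata 
to raid5 as well, though I may have left it at the default):

  # create a 3-device btrfs RAID5; -d/-m set the data/metadata profiles
  # (device names reconstructed from memory, may not be exact)
  mkfs.btrfs -L storage -d raid5 -m raid5 /dev/sdb /dev/sdc /dev/sdd
  mkdir -p /storage
  mount /dev/sdb /storage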

Everything was working amazingly well, until one HDD failed and was quietly 
offlined. For a couple of days, the RAID was running off 2 HDDs and I didn't 
notice.

When I DID realize, I shut down the system and ordered a new HDD (2TB), which 
took a couple of days to arrive.

When I powered up the system again, the failed 500GB HDD was back. Everything 
loaded fine and looked good. To be on the safe side, I ran a read-only 
badblocks test on the failing HDD.
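
That was badblocks' default non-destructive mode, something like (the failing 
HDD was /dev/sdc at the time; device names shifted between boots):

  # read-only scan; badblocks only writes if -w or -n is given
  # -s shows progress, -v reports bad blocks as they are found
  badblocks -sv /dev/sdc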

Halfway through the test, the HDD disappeared again. After a cold reboot, it 
loaded fine again.

At this point, I decided to replace the failed HDD. I shut down, plugged the 
new HDD in place of the boot HDD, booted a Lubuntu live session, mounted the 
array at /storage, and added the new device to the RAID.
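
From the live session, the steps were roughly (device names as they enumerated 
at the time; /dev/sda was the new 2TB disk):

  # make the kernel aware of all btrfs member devices first
  btrfs device scan
  mount /dev/sdb /storage
  # grow the array onto the new 2TB disk
  btrfs device add /dev/sda /storage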

After adding the device successfully, I issued a device delete command for the 
failed HDD. Partway through the process, the failing HDD (/dev/sdc) 
disappeared again. After waiting a couple of hours, I hard-reset the system 
and removed the failing HDD, assuming that the RAID would rebuild on the 
remaining devices.
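
The delete command was along the lines of:

  # removing a member migrates its chunks onto the remaining
  # devices before releasing it; this is the step that stalled
  btrfs device delete /dev/sdc /storage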

Now the RAID (/storage) refused to mount; I got an open_ctree error (please 
see the enclosed logs below).

I tried to mount the array in degraded mode, but that didn't work either. After 
multiple attempts (including adding back the failed HDD), I finally ran the 
btrfs rescue chunk-recover command on the primary member /dev/sdb.
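
That is, something like:

  # degraded mount attempt (failed with open_ctree errors, logs below)
  mount -o degraded /dev/sdb /storage
  # last resort: rebuild the chunk tree by scanning the device;
  # run against the unmounted filesystem
  btrfs rescue chunk-recover -vv /dev/sdb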

This ran for about 4 hours, and then failed with "floating point exception (core 
dumped)".

Can I recover the array, or should I start over? The data is not important, 
but I would like to understand the recovery process, and to know whether I am 
mistaken in thinking that RAID5 with 3 devices is enough for SOHO-level fault 
tolerance.

Any advice, pointers, etc., much appreciated. Tech level: medium-high (RHCE).

Relevant system information:
== uname -a
Linux lubuntu 4.2.0-16-generic #19-Ubuntu SMP Thu Oct 8 15:35:06 UTC 2015 
x86_64 x86_64 x86_64 GNU/Linux


== btrfs --version
btrfs-progs v4.0

== btrfs fi show
warning, device 2 is missing
Label: 'storage'  uuid: 5a3d6590-df08-4520-b61b-802d350849c7
        Total devices 4 FS bytes used 176.91GiB
        devid    1 size 465.76GiB used 90.03GiB path /dev/sdb
        devid    3 size 465.76GiB used 90.01GiB path /dev/sdc
        devid    4 size 1.82TiB used 10.00GiB path /dev/sda
        *** Some devices missing

== dmesg info
...
Jan  5 01:45:22 lubuntu kernel: [   10.338295] Btrfs loaded
Jan  5 01:45:22 lubuntu kernel: [   10.338899] BTRFS: device label storage 
devid 4 transid 969 /dev/sda
Jan  5 01:45:22 lubuntu kernel: [   10.340448] BTRFS info (device sda): disk 
space caching is enabled
Jan  5 01:45:22 lubuntu kernel: [   10.340454] BTRFS: has skinny extents
Jan  5 01:45:22 lubuntu kernel: [   10.343395] BTRFS: failed to read the system 
array on sda
Jan  5 01:45:22 lubuntu kernel: [   10.352137] BTRFS: open_ctree failed
Jan  5 01:45:22 lubuntu kernel: [   10.382199] BTRFS: device label storage 
devid 1 transid 969 /dev/sdb
Jan  5 01:45:22 lubuntu kernel: [   10.383740] BTRFS info (device sdb): disk 
space caching is enabled
Jan  5 01:45:22 lubuntu kernel: [   10.383744] BTRFS: has skinny extents
Jan  5 01:45:22 lubuntu kernel: [   10.384469] BTRFS: failed to read the system 
array on sdb
Jan  5 01:45:22 lubuntu kernel: [   10.392116] BTRFS: open_ctree failed
Jan  5 01:45:22 lubuntu kernel: [   10.423075] BTRFS: device label storage 
devid 3 transid

... // after btrfs rescue chunk-recover had run for about 4 hours
Jan  5 06:01:45 lubuntu kernel: [15404.828156] traps: btrfs[3016] trap divide 
error ip:4211a0 sp:7ffd7dbb03a8 error:0 in btrfs[400000+73000]
...

== some output from btrfs rescue chunk-recover -vv
...
            Stripes list:
            [ 0] Stripe: devid = 3, offset = 21484273664
            [ 1] Stripe: devid = 2, offset = 21484273664
            [ 2] Stripe: devid = 1, offset = 21504196608
        Chunk: start = 45134905344, len = 2147483648, type = 81, num_stripes = 3
            Stripes list:
            [ 0] Stripe: devid = 3, offset = 22558015488
            [ 1] Stripe: devid = 2, offset = 22558015488
            [ 2] Stripe: devid = 1, offset = 22577938432
        Chunk: start = 47282388992, len = 2147483648, type = 81, num_stripes = 3
            Stripes list:
            [ 0] Stripe: devid = 3, offset = 23631757312
            [ 1] Stripe: devid = 2, offset = 23631757312
            [ 2] Stripe: devid = 1, offset = 23651680256
...
        Device extent: devid = 4, start = 5369757696, len = 1073741824, chunk 
offset = 201901211648
        Device extent: devid = 4, start = 6443499520, len = 1073741824, chunk 
offset = 204048695296
        Device extent: devid = 4, start = 7517241344, len = 1073741824, chunk 
offset = 206196178944
        Device extent: devid = 4, start = 8590983168, len = 1073741824, chunk 
offset = 208343662592
// floating point error

Regards,
PRShah
