Chris Murphy wrote:
On Sun, Dec 27, 2015 at 6:59 AM, Waxhead <waxh...@online.no> wrote:
Hi,

I have a "toy-array" of 6x USB drives hooked up to a hub where I made a
btrfs raid 6 data+metadata filesystem.

I copied some files to the filesystem, ripped out one USB drive, and ruined it
by writing from /dev/random (via dd) to various locations on the drive. I put
the USB drive back and the filesystem mounts OK.
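
The commands were along these lines; the device name, offsets, and sizes below
are placeholders, since I did not record the exact ones:

# Illustrative only: scribble random data over a few scattered regions
# of the pulled drive. /dev/sdX and the offsets are made up.
dd if=/dev/random of=/dev/sdX bs=1M seek=100  count=4
dd if=/dev/random of=/dev/sdX bs=1M seek=2048 count=4
dd if=/dev/random of=/dev/sdX bs=1M seek=5000 count=4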

If I start a scrub, I get the following after a few seconds:

  kernel:[   50.844026] CPU: 1 PID: 91 Comm: kworker/u4:2 Not tainted 4.3.0-1-686-pae #1 Debian 4.3.3-2
  kernel:[   50.844026] Hardware name: Acer AOA150/        , BIOS v0.3310 10/06/2008
  kernel:[   50.844026] Workqueue: btrfs-endio-raid56 btrfs_endio_raid56_helper [btrfs]
  kernel:[   50.844026] task: f642c040 ti: f664c000 task.ti: f664c000
  kernel:[   50.844026] Stack:
  kernel:[   50.844026]  00000005 f0d20800 f664ded0 f86d0262 00000000 f664deac c109a0fc 00000001
  kernel:[   50.844026]  f79eac40 edb4a000 edb7a000 edb8a000 edbba000 eccc1000 ecca1000 00000000
  kernel:[   50.844026]  00000000 f664de68 00000003 f664de74 ecb23000 f664de5c f5cda6a4 f0d20800
  kernel:[   50.844026] Call Trace:
  kernel:[   50.844026]  [<f86d0262>] ? finish_parity_scrub+0x272/0x560 [btrfs]
  kernel:[   50.844026]  [<c109a0fc>] ? set_next_entity+0x8c/0xba0
  kernel:[   50.844026]  [<c127d130>] ? bio_endio+0x40/0x70
  kernel:[   50.844026]  [<f86891fe>] ? btrfs_scrubparity_helper+0xce/0x270 [btrfs]
  kernel:[   50.844026]  [<c107ca7d>] ? process_one_work+0x14d/0x360
  kernel:[   50.844026]  [<c107ccc9>] ? worker_thread+0x39/0x440
  kernel:[   50.844026]  [<c107cc90>] ? process_one_work+0x360/0x360
  kernel:[   50.844026]  [<c10821a6>] ? kthread+0xa6/0xc0
  kernel:[   50.844026]  [<c1536181>] ? ret_from_kernel_thread+0x21/0x30
  kernel:[   50.844026]  [<c1082100>] ? kthread_create_on_node+0x130/0x130
  kernel:[   50.844026] Code: 6e c1 e8 ac dd f2 ff 83 c4 04 5b 5d c3 8d b6 00 00 00 00 31 c9 81 3d 84 f0 6e c1 84 f0 6e c1 0f 95 c1 eb b9 8d b4 26 00 00 00 00 0f 0b 8d b4 26 00 00 00 00 8d bc 27 00
  kernel:[   50.844026] EIP: [<c1174858>] kunmap_high+0xa8/0xc0 SS:ESP 0068:f664de40

This is only a test setup, and I will keep this filesystem around for a while
if it can be of any use...
Sounds like a bug, but it might also be functionality that is still missing.
If you can include the steps to reproduce, including the exact locations and
lengths of the random writes, that's probably useful.
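
If you redo the experiment, something along these lines would capture them;
the device name and offsets here are placeholders:

# Log each corruption so the exact locations+lengths can be posted later.
DEV=/dev/sdX                        # placeholder: the drive that gets pulled
for off in 100 2048 5000; do        # example offsets, in MiB
    echo "wrote 4MiB of random data at ${off}MiB on $DEV" >> corruption.log
    dd if=/dev/random of=$DEV bs=1M seek=$off count=4
done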

More than one thing could be going on. First, I don't know that Btrfs even
understands the device went missing, because it doesn't yet have a concept of
faulty devices. I've also seen it get confused when drives reappear with new
drive designations (not uncommon), and from your call trace we can't tell
whether that happened, because there's not enough information posted. Second,
if the damage to a device is severe enough, it almost certainly isn't
recognized when reattached, although this depends on which locations were
damaged. If Btrfs doesn't recognize the drive as part of the array, then the
scrub request is effectively a scrub of a volume with a missing drive, which
you probably wouldn't ever do; you'd first replace the missing device. Scrubs
happen on normally operating arrays, not degraded ones. So it's uncertain
whether either Btrfs or the user had any idea what state the volume was
actually in at the time.
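
A quick way to see at least part of that state (these only report per-device
membership and error counters, not a true faulty flag, since Btrfs doesn't
have one):

btrfs filesystem show          # is the reattached drive still listed under its devid?
btrfs device stats /mnt        # per-device read/write/corruption error counters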

By contrast, mdadm knows in such a case to mark the device as faulty and the
array automatically goes degraded, but when the drive is reattached it is not
automatically re-added. When the user re-adds it, a complete rebuild typically
happens unless there's a write-intent bitmap, which isn't created by default.
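
Roughly, the mdadm sequence looks like this (md0 and sdX are placeholders):

mdadm /dev/md0 --fail /dev/sdX            # mark the device faulty; array goes degraded
mdadm /dev/md0 --remove /dev/sdX
mdadm /dev/md0 --re-add /dev/sdX          # full rebuild unless a write-intent bitmap exists
mdadm --grow /dev/md0 --bitmap=internal   # the bitmap must be added explicitly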

I am afraid I can't include exact steps to reproduce.
I do, however, still have the filesystem in a "bad state", so if there is anything I can do, let me know.

First of all... a "btrfs filesystem show" does list all the drives:
Label: none  uuid: 2832346e-0720-499f-8239-355534e5721b
        Total devices 6 FS bytes used 8.53GiB
        devid    1 size 7.68GiB used 3.08GiB path /dev/sdb1
        devid    2 size 7.68GiB used 3.08GiB path /dev/sdc1
        devid    3 size 7.68GiB used 3.08GiB path /dev/sdd1
        devid    4 size 7.68GiB used 3.08GiB path /dev/sde1
        devid    5 size 7.68GiB used 3.08GiB path /dev/sdf1
        devid    6 size 7.68GiB used 3.08GiB path /dev/sdg1

mount /dev/sdb1 /mnt/
btrfs filesystem df /mnt

Data, RAID6: total=12.00GiB, used=8.45GiB
System, RAID6: total=64.00MiB, used=16.00KiB
Metadata, RAID6: total=256.00MiB, used=84.58MiB
GlobalReserve, single: total=32.00MiB, used=0.00B

btrfs scrub status /mnt
scrub status for 2832346e-0720-499f-8239-355534e5721b
        scrub started at Sun Mar 29 23:21:04 2015 and finished after 00:01:04
        total bytes scrubbed: 1.97GiB with 14549 errors
        error details: super=2 csum=14547
        corrected errors: 0, uncorrectable errors: 14547, unverified errors: 0

Now here is the first worrying part: it says the scrub started on Sun Mar 29. That is NOT true; the first scrub I ran on this filesystem was only a few days ago. And it reports a lot of uncorrectable errors. Why? This is, after all, a RAID6 filesystem, correct?!
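
My guess is that it is reporting a stale record. If I understand correctly,
scrub status is kept in a file under /var/lib/btrfs (that path is my
assumption), so something like this might show where that old date comes from:

ls -l /var/lib/btrfs/
cat /var/lib/btrfs/scrub.status.2832346e-0720-499f-8239-355534e5721b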

btrfs scrub start -B /mnt

Message from syslogd@a150 at Dec 27 23:44:22 ...
kernel:[  611.478448] CPU: 0 PID: 1200 Comm: kworker/u4:1 Not tainted 4.3.0-1-686-pae #1 Debian 4.3.3-2

Message from syslogd@a150 at Dec 27 23:44:22 ...
kernel:[  611.478448] Hardware name: Acer AOA150/        , BIOS v0.3310 10/06/2008
kernel:[  611.478448] Workqueue: btrfs-endio-raid56 btrfs_endio_raid56_helper [btrfs]
kernel:[  611.478448] task: ec403040 ti: ec4a2000 task.ti: ec4a2000
kernel:[  611.478448] Stack:
kernel:[  611.478448]  00000005 ecd78800 ec4a3ed0 f8768262 00000000 0000008e 5ead4067 0000008e
kernel:[  611.478448]  5ead3301 ec5bd000 ec5ce000 ec5fd000 ec62d000 ec5a9000 ec5a8000 f79d27cc
kernel:[  611.478448]  00000000 ec4a3e68 00000003 ec4a3e74 ec32d700 ec4a3e5c f5ccaba0 ecd78800
kernel:[  611.478448] Call Trace:
kernel:[  611.478448]  [<f8768262>] ? finish_parity_scrub+0x272/0x560 [btrfs]
kernel:[  611.478448]  [<c127d130>] ? bio_endio+0x40/0x70
kernel:[  611.478448]  [<f87211fe>] ? btrfs_scrubparity_helper+0xce/0x270 [btrfs]
kernel:[  611.478448]  [<c107ca7d>] ? process_one_work+0x14d/0x360
kernel:[  611.482350]  [<c107ccc9>] ? worker_thread+0x39/0x440
kernel:[  611.482350]  [<c107cc90>] ? process_one_work+0x360/0x360
kernel:[  611.482350]  [<c10821a6>] ? kthread+0xa6/0xc0
kernel:[  611.482350]  [<c1536181>] ? ret_from_kernel_thread+0x21/0x30
kernel:[  611.482350]  [<c1082100>] ? kthread_create_on_node+0x130/0x130
kernel:[  611.482350] Code: c4 04 5b 5d c3 8d b6 00 00 00 00 31 c9 81 3d 84 f0 6e c1 84 f0 6e c1 0f 95 c1 eb b9 8d b4 26 00 00 00 00 0f 0b 8d b6 00 00 00 00 <0f> 0b 8d b4 26 00 00 00 00 8d bc 27 00 00 00 00 55 89 e5 56 53
kernel:[  611.482350] EIP: [<c1174860>] kunmap_high+0xb0/0xc0 SS:ESP 0068:ec4a3e40

This is what I got over my SSH login; there is a longer stack trace on the computer I am testing on. What I can read on the screen is as follows (I hope I got all the numbers right):

? print_oops_end_marker+0x41/0x70
? oops_end+0x92/0xd0
? no_context+0x100/0x2b0
? __bad_area_nosemaphore+0xb5/0x140
? dequeue_task_fair+0x4c/0xbd0
? check_preempt_curr+0x7a/0x90
? __do_page_fault+0x460/0x460
? bad_area_nosemaphore+0x17/0x20
? error_code+0x67/0x6c
? alloc_pid+0x5b/0x420
? kthread_data+0xf/0x20
? wq_worker_sleeping+0x10/0x90
? __schedule+0x4e2/0x8c0
? schedule+0x2b/0x80
? do_exit+0x746/0x9f0
? vprintk_default+0x37/0x40
? printk+0x17/0x19
? oops_end+0x92/0xd0
? do_error_trap+0x8a/0x120
? kunmap_high+0xb0/0xc0
? __alloc_pages_nodemask+0x13b/0x850
? do_overflow+0x30/0x30
? do_invalid_op+0x24/0x30
? error_code+0x67/0x6c
? compact_unlock_should_abort.isra.31+0x7b/0x90
? kunmap_high+0xb0/0xc0
? finish_parity_scrub+0x272/0x560 [btrfs]
? bio_endio+0x40/0x70
? btrfs_scrubparity_helper+0xce/0x270 [btrfs]
? process_one_work+0x14d/0x360
? worker_thread+0x39/0x440
? process_one_work+0x360/0x360
? kthread+0xa6/0xc0
? ret_from_kernel_thread+0x21/0x30
? kthread_create_on_node+0x130/0x130
---[ end trace ... ]---

I hope this is of more help. Again, if there is anything I can do, I am happy to help. I don't need this filesystem, so there is no need to recover it.