On 25/06/16 02:59, ronnie sahlberg wrote:
> What I would do in this situation :
> 
> 1, Immediately stop writing to these disks/filesystem. ONLY access it
> in read-only mode until you have salvaged what can be salvaged.

That's ok - I can't even mount it in RW mode :)

> 2, get a new 5TB USB drive (they are cheap) and copy file by file off the
> array.

I've actually got enough combined space to store stuff in other places in
the meantime...

> 3, when you hit files that cause panics, make a note of the inode and
> avoid touching that file again.

What I have in mind here is that a zero-byte file still seems to get
CREATED in the target directory when I copy a file that crashes the
system. I'm thinking that if I 'cp -an source/ target/' after each crash,
it will make this somewhat easier (the -n means cp won't overwrite the
existing zero-byte file, so the crashing file gets skipped on the next
pass).
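
Something along these lines is what I'm picturing (the paths and log file
name here are just examples, not my real layout; this assumes GNU cp/find
and that the crash really does leave the zero-byte file behind on the
target):

$ # copy file by file, never overwriting anything already at the target
$ cp -an /mnt/fileshare/. /mnt/salvage/ 2>>/root/salvage-errors.log

$ # after each crash/reboot, list the zero-byte placeholders...
$ find /mnt/salvage/ -type f -size 0

$ # ...and note the inode of the matching source file so it can be avoided
$ stat -c '%i %n' /mnt/fileshare/path/to/bad-file

Each pass should then get a little further through the tree before hitting
the next bad file.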

> Will likely take a lot of work and time since I suspect it is a
> largely manual process. But if the data is important ...

Yeah - there's only about 80GB on the array that I *really* care about -
the rest is just a bonus if it's there - not rage-worthy :P

> Once you have all salvageable data copied to the new drive you can
> decide on how to proceed.
> I.e. if you want to try to repair the filesystem (I have low
> confidence in this for parity raid case) or if you will simply rebuild
> a new fs from scratch.

I honestly think it'll be scorched earth and starting again with a new FS.
I'm thinking of going back to mdadm for the RAID (which has worked
perfectly for years here) and maybe putting a vanilla BTRFS on top of that
block device.
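
Roughly this (device names are purely illustrative, not the real ones):

$ # 5-device RAID6 created degraded with two 'missing' members to add later
$ mdadm --create /dev/md0 --level=6 --raid-devices=5 \
      /dev/sdx1 /dev/sdy1 /dev/sdz1 missing missing

$ # plain single-device btrfs on top of the md block device
$ mkfs.btrfs -L fileshare /dev/md0
$ mount /dev/md0 /mnt/fileshare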

Anything else seems like too much work for too little reward - and I just
don't have the confidence in it.

> On Fri, Jun 24, 2016 at 9:26 AM, Steven Haigh <net...@crc.id.au> wrote:
>> On 25/06/16 00:52, Steven Haigh wrote:
>>> Ok, so I figured that despite what the BTRFS wiki seems to imply, the
>>> 'multi parity' support just isn't stable enough to be used. So, I'm
>>> trying to revert to what I had before.
>>>
>>> My setup consists of:
>>>       * 2 x 3TB drives +
>>>       * 3 x 2TB drives.
>>>
>>> I've got (had?) about 4.9TB of data.
>>>
>>> My idea was to use a balance to convert the existing setup to a 'single'
>>> layout, delete the 3 x 2TB drives from the BTRFS system, then create a
>>> new mdadm based RAID6 (5 drives degraded to 3), create a new filesystem
>>> on that, then copy the data across.
>>>
>>> So, great - first the balance:
>>> $ btrfs balance start -dconvert=single -mconvert=single -f (yes, I know
>>> it'll reduce the metadata redundancy).
>>>
>>> This promptly was followed by a system crash.
>>>
>>> After a reboot, I can no longer mount the BTRFS in read-write:
>>> [  134.768908] BTRFS info (device xvdd): disk space caching is enabled
>>> [  134.769032] BTRFS: has skinny extents
>>> [  134.769856] BTRFS: failed to read the system array on xvdd
>>> [  134.776055] BTRFS: open_ctree failed
>>> [  143.900055] BTRFS info (device xvdd): allowing degraded mounts
>>> [  143.900152] BTRFS info (device xvdd): not using ssd allocation scheme
>>> [  143.900243] BTRFS info (device xvdd): disk space caching is enabled
>>> [  143.900330] BTRFS: has skinny extents
>>> [  143.901860] BTRFS warning (device xvdd): devid 4 uuid
>>> 61ccce61-9787-453e-b793-1b86f8015ee1 is missing
>>> [  146.539467] BTRFS: missing devices(1) exceeds the limit(0), writeable
>>> mount is not allowed
>>> [  146.552051] BTRFS: open_ctree failed
>>>
>>> I can mount it read only - but then I also get crashes when it seems to
>>> hit a read error:
>>> BTRFS info (device xvdc): csum failed ino 42179 extent 8690008064
>>> csum 3245290974 wanted 982056704 mirror 0
>>> BTRFS info (device xvdc): csum failed ino 42179 extent 8690008064 csum
>>> 390821102 wanted 982056704 mirror 1
>>> BTRFS info (device xvdc): csum failed ino 42179 extent 8690008064 csum
>>> 550556475 wanted 982056704 mirror 1
>>> BTRFS info (device xvdc): csum failed ino 42179 extent 8690008064 csum
>>> 1279883714 wanted 982056704 mirror 1
>>> BTRFS info (device xvdc): csum failed ino 42179 extent 8690008064 csum
>>> 2566472073 wanted 982056704 mirror 1
>>> BTRFS info (device xvdc): csum failed ino 42179 extent 8690008064 csum
>>> 1876236691 wanted 982056704 mirror 1
>>> BTRFS info (device xvdc): csum failed ino 42179 extent 8690008064 csum
>>> 3350537857 wanted 982056704 mirror 1
>>> BTRFS info (device xvdc): csum failed ino 42179 extent 8690008064 csum
>>> 3319706190 wanted 982056704 mirror 1
>>> BTRFS info (device xvdc): csum failed ino 42179 extent 8690008064 csum
>>> 2377458007 wanted 982056704 mirror 1
>>> BTRFS info (device xvdc): csum failed ino 42179 extent 8690008064 csum
>>> 2066127208 wanted 982056704 mirror 1
>>> BTRFS info (device xvdc): csum failed ino 42179 extent 8690008064 csum
>>> 657140479 wanted 982056704 mirror 1
>>> BTRFS info (device xvdc): csum failed ino 42179 extent 8690008064 csum
>>> 1239359620 wanted 982056704 mirror 1
>>> BTRFS info (device xvdc): csum failed ino 42179 extent 8690008064 csum
>>> 1598877324 wanted 982056704 mirror 1
>>> BTRFS info (device xvdc): csum failed ino 42179 extent 8690008064 csum
>>> 1082738394 wanted 982056704 mirror 1
>>> BTRFS info (device xvdc): csum failed ino 42179 extent 8690008064 csum
>>> 371906697 wanted 982056704 mirror 1
>>> BTRFS info (device xvdc): csum failed ino 42179 extent 8690008064 csum
>>> 2156787247 wanted 982056704 mirror 1
>>> BTRFS info (device xvdc): csum failed ino 42179 extent 8690008064 csum
>>> 3777709399 wanted 982056704 mirror 1
>>> BTRFS info (device xvdc): csum failed ino 42179 extent 8690008064 csum
>>> 180814340 wanted 982056704 mirror 1
>>> ------------[ cut here ]------------
>>> kernel BUG at fs/btrfs/extent_io.c:2401!
>>> invalid opcode: 0000 [#1] SMP
>>> Modules linked in: btrfs x86_pkg_temp_thermal coretemp crct10dif_pclmul
>>> xor aesni_intel aes_x86_64 lrw gf128mul glue_helper pcspkr raid6_pq
>>> ablk_helper cryptd nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables
>>> xen_netfront crc32c_intel xen_gntalloc xen_evtchn ipv6 autofs4
>>> BTRFS info (device xvdc): csum failed ino 42179 extent 8690008064 csum
>>> 2610978113 wanted 982056704 mirror 1
>>> BTRFS info (device xvdc): csum failed ino 42179 extent 8690008064 csum
>>> 59610051 wanted 982056704 mirror 1
>>> CPU: 1 PID: 1273 Comm: kworker/u4:4 Not tainted 4.4.13-1.el7xen.x86_64 #1
>>> Workqueue: btrfs-endio btrfs_endio_helper [btrfs]
>>> task: ffff880079ce12c0 ti: ffff880078788000 task.ti: ffff880078788000
>>> RIP: e030:[<ffffffffa039e0e0>]  [<ffffffffa039e0e0>]
>>> btrfs_check_repairable+0x100/0x110 [btrfs]
>>> RSP: e02b:ffff88007878bcc8  EFLAGS: 00010297
>>> RAX: 0000000000000001 RBX: ffff880079db2080 RCX: 0000000000000003
>>> RDX: 0000000000000003 RSI: 000004db13730000 RDI: ffff88007889ef38
>>> RBP: ffff88007878bce0 R08: 000004db01c00000 R09: 000004dbc1c00000
>>> R10: ffff88006bb0c1b8 R11: 0000000000000000 R12: 0000000000000000
>>> R13: ffff88007b213ea8 R14: 0000000000001000 R15: 0000000000000000
>>> FS:  00007fbf2fdc0880(0000) GS:ffff88007f500000(0000) knlGS:0000000000000000
>>> CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> CR2: 00007fbf2d96702b CR3: 000000007969f000 CR4: 0000000000042660
>>> Stack:
>>>  ffffea00019db180 0000000000010000 ffff88007b213f30 ffff88007878bd88
>>>  ffffffffa03a0808 ffff880002d15500 ffff88007878bd18 ffff880079ce12c0
>>>  ffff88007b213e40 000000000000001f ffff880000000000 ffff88006bb0c048
>>> Call Trace:
>>>  [<ffffffffa03a0808>] end_bio_extent_readpage+0x428/0x560 [btrfs]
>>>  [<ffffffff812f40c0>] bio_endio+0x40/0x60
>>>  [<ffffffffa0375a6c>] end_workqueue_fn+0x3c/0x40 [btrfs]
>>>  [<ffffffffa03af3f1>] normal_work_helper+0xc1/0x300 [btrfs]
>>>  [<ffffffff810a1352>] ? finish_task_switch+0x82/0x280
>>>  [<ffffffffa03af702>] btrfs_endio_helper+0x12/0x20 [btrfs]
>>>  [<ffffffff81093844>] process_one_work+0x154/0x400
>>>  [<ffffffff8109438a>] worker_thread+0x11a/0x460
>>>  [<ffffffff8165a24f>] ? __schedule+0x2bf/0x880
>>>  [<ffffffff81094270>] ? rescuer_thread+0x2f0/0x2f0
>>>  [<ffffffff810993f9>] kthread+0xc9/0xe0
>>>  [<ffffffff81099330>] ? kthread_park+0x60/0x60
>>>  [<ffffffff8165e14f>] ret_from_fork+0x3f/0x70
>>>  [<ffffffff81099330>] ? kthread_park+0x60/0x60
>>> Code: 00 31 c0 eb d5 8d 48 02 eb d9 31 c0 45 89 e0 48 c7 c6 a0 f8 3f a0
>>> 48 c7 c7 00 05 41 a0 e8 c9 f2 fa e0 31 c0 e9 70 ff ff ff 0f 0b <0f> 0b
>>> 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90
>>> RIP  [<ffffffffa039e0e0>] btrfs_check_repairable+0x100/0x110 [btrfs]
>>>  RSP <ffff88007878bcc8>
>>> ------------[ cut here ]------------
>>> <more crashes until the system hangs>
>>>
>>> So, where to from here? Sadly, I feel there is data loss in my future,
>>> but I'm not sure how to minimise it :\
>>>
>>
>> The more I look at this, the more I'm wondering if this is a total
>> corruption scenario:
>>
>> $ btrfs restore -D -l /dev/xvdc
>> warning, device 4 is missing
>> checksum verify failed on 11224137433088 found EF5DE164 wanted 62BE2322
>> bytenr mismatch, want=11224137433088, have=11224137564160
>> Couldn't read chunk tree
>> Could not open root, trying backup super
>> warning, device 2 is missing
>> warning, device 4 is missing
>> warning, device 5 is missing
>> warning, device 3 is missing
>> checksum verify failed on 11224137433088 found EF5DE164 wanted 62BE2322
>> checksum verify failed on 11224137433088 found EF5DE164 wanted 62BE2322
>> bytenr mismatch, want=11224137433088, have=59973363410688
>> Couldn't read chunk tree
>> Could not open root, trying backup super
>> warning, device 2 is missing
>> warning, device 4 is missing
>> warning, device 5 is missing
>> warning, device 3 is missing
>> checksum verify failed on 11224137433088 found EF5DE164 wanted 62BE2322
>> checksum verify failed on 11224137433088 found EF5DE164 wanted 62BE2322
>> bytenr mismatch, want=11224137433088, have=59973363410688
>> Couldn't read chunk tree
>> Could not open root, trying backup super
>>
>> $ btrfs restore -D -l /dev/xvdd
>> warning, device 4 is missing
>> checksum verify failed on 11224137433088 found EF5DE164 wanted 62BE2322
>> bytenr mismatch, want=11224137433088, have=11224137564160
>> Couldn't read chunk tree
>> Could not open root, trying backup super
>> warning, device 1 is missing
>> warning, device 4 is missing
>> warning, device 5 is missing
>> warning, device 3 is missing
>> bytenr mismatch, want=11224137170944, have=0
>> ERROR: cannot read chunk root
>> Could not open root, trying backup super
>> warning, device 1 is missing
>> warning, device 4 is missing
>> warning, device 5 is missing
>> warning, device 3 is missing
>> bytenr mismatch, want=11224137170944, have=0
>> ERROR: cannot read chunk root
>> Could not open root, trying backup super
>>
>> $ btrfs restore -D -l /dev/xvde
>> warning, device 4 is missing
>> checksum verify failed on 11224137433088 found EF5DE164 wanted 62BE2322
>> bytenr mismatch, want=11224137433088, have=11224137564160
>> Couldn't read chunk tree
>> Could not open root, trying backup super
>> warning, device 1 is missing
>> warning, device 2 is missing
>> warning, device 4 is missing
>> warning, device 5 is missing
>> checksum verify failed on 11224137170944 found C9115A93 wanted 14526E28
>> checksum verify failed on 11224137170944 found C9115A93 wanted 14526E28
>> bytenr mismatch, want=11224137170944, have=59973365311232
>> ERROR: cannot read chunk root
>> Could not open root, trying backup super
>> warning, device 1 is missing
>> warning, device 2 is missing
>> warning, device 4 is missing
>> warning, device 5 is missing
>> checksum verify failed on 11224137170944 found C9115A93 wanted 14526E28
>> checksum verify failed on 11224137170944 found C9115A93 wanted 14526E28
>> bytenr mismatch, want=11224137170944, have=59973365311232
>> ERROR: cannot read chunk root
>> Could not open root, trying backup super
>>
>> $ btrfs restore -D -l /dev/xvdf
>> warning, device 4 is missing
>> checksum verify failed on 11224137433088 found EF5DE164 wanted 62BE2322
>> bytenr mismatch, want=11224137433088, have=11224137564160
>> Couldn't read chunk tree
>> Could not open root, trying backup super
>> warning, device 1 is missing
>> warning, device 2 is missing
>> warning, device 4 is missing
>> warning, device 5 is missing
>> warning, device 3 is missing
>> bytenr mismatch, want=11224137170944, have=0
>> ERROR: cannot read chunk root
>> Could not open root, trying backup super
>> warning, device 1 is missing
>> warning, device 2 is missing
>> warning, device 4 is missing
>> warning, device 5 is missing
>> warning, device 3 is missing
>> bytenr mismatch, want=11224137170944, have=0
>> ERROR: cannot read chunk root
>> Could not open root, trying backup super
>>
>> $ btrfs restore -D -l /dev/xvdg
>> warning, device 4 is missing
>> checksum verify failed on 11224137433088 found EF5DE164 wanted 62BE2322
>> bytenr mismatch, want=11224137433088, have=11224137564160
>> Couldn't read chunk tree
>> Could not open root, trying backup super
>> warning, device 1 is missing
>> warning, device 2 is missing
>> warning, device 4 is missing
>> warning, device 3 is missing
>> bytenr mismatch, want=11224137170944, have=11224137105408
>> ERROR: cannot read chunk root
>> Could not open root, trying backup super
>> warning, device 1 is missing
>> warning, device 2 is missing
>> warning, device 4 is missing
>> warning, device 3 is missing
>> bytenr mismatch, want=11224137170944, have=11224137105408
>> ERROR: cannot read chunk root
>> Could not open root, trying backup super
>>
>> If I mount it read only:
>> $ mount -o nossd,degraded,ro /dev/xvdc /mnt/fileshare/
>>
>> $ btrfs device usage /mnt/fileshare/
>>
>> /dev/xvdc, ID: 1
>>    Device size:             2.73TiB
>>    Device slack:              0.00B
>>    Data,single:             5.00GiB
>>    Data,RAID6:              1.60TiB
>>    Data,RAID6:              2.75GiB
>>    Data,RAID6:              1.00GiB
>>    Metadata,RAID6:          2.06GiB
>>    System,RAID6:           32.00MiB
>>    Unallocated:             1.12TiB
>>
>> /dev/xvdd, ID: 2
>>    Device size:             2.73TiB
>>    Device slack:              0.00B
>>    Data,single:             1.00GiB
>>    Data,RAID6:              1.60TiB
>>    Data,RAID6:              7.07GiB
>>    Data,RAID6:              1.00GiB
>>    Metadata,RAID6:          2.06GiB
>>    System,RAID6:           32.00MiB
>>    Unallocated:             1.12TiB
>>
>> /dev/xvde, ID: 3
>>    Device size:             1.82TiB
>>    Device slack:              0.00B
>>    Data,RAID6:              1.60TiB
>>    Data,RAID6:              7.07GiB
>>    Metadata,RAID6:          2.06GiB
>>    System,RAID6:           32.00MiB
>>    Unallocated:           213.23GiB
>>
>> /dev/xvdf, ID: 6
>>    Device size:             1.82TiB
>>    Device slack:              0.00B
>>    Data,RAID6:            882.62GiB
>>    Data,RAID6:              1.00GiB
>>    Metadata,RAID6:          2.06GiB
>>    Unallocated:           977.33GiB
>>
>> /dev/xvdg, ID: 5
>>    Device size:             1.82TiB
>>    Device slack:              0.00B
>>    Data,RAID6:              1.60TiB
>>    Data,RAID6:              7.07GiB
>>    Metadata,RAID6:          2.06GiB
>>    System,RAID6:           32.00MiB
>>    Unallocated:           213.23GiB
>>
>> missing, ID: 4
>>    Device size:               0.00B
>>    Device slack:           16.00EiB
>>    Data,RAID6:            758.00GiB
>>    Data,RAID6:              4.31GiB
>>    System,RAID6:           32.00MiB
>>    Unallocated:             1.07TiB
>>
>> Hoping this isn't a total loss ;)
>>

-- 
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897
