On 2019/1/9 3:33 AM, Thiago Ramon wrote:
> I have a pretty complicated setup here, so first a general description:
> 8 HDs: 4x5TB, 2x4TB, 2x8TB
>
> Each disk is an LVM PV containing a BCACHE backing device, which then
> contains the BTRFS disks. All the drives were in writeback mode
> on an SSD BCACHE cache partition (terrible setup, I know, but without
> the caching the system was getting too slow to use).
>
> I had all my data, metadata and system blocks on RAID1, but as I'm
> running out of space, and the new kernels have been getting better
> RAID5/6 support recently, I finally decided to migrate to RAID6 and
> was starting with the metadata.
>
> It was running well (I was already expecting it to be slow, so no
> problem there), but I had to spend some days away from the machine.
> Due to an air conditioning failure, the room temperature went pretty
> high and one of the disks decided to die (apparently only
> temporarily). BCACHE couldn't write to the backing device anymore, so
> it ejected all drives and let them cope with it by themselves. I
> caught the trouble some 12 hours later, still away, and shut down
> anything accessing the disks until I could be physically there to
> handle the issue.
>
> After I got back and got the temperature down to acceptable levels,
> I checked the failed drive, which seems to be working well after
> being re-inserted, but it's of course out of date with the rest of
> the drives. Apparently the rest picked up some corruption as well
> when they got ejected from the cache, and I'm getting some errors I
> haven't been able to handle.
>
> I've gone through the steps that helped me before when having
> complicated crashes on this system, but this time it wasn't enough,
> and I'll need some advice from people who know the BTRFS internals
> better than me to get this back running.
> I have around 20TB of data on the drives, so copying the data out is
> the last resort; I'd rather let most of it die than buy a few more
> disks to hold all of that.
>
> Now on to the errors:
>
> I've tried both with the "failed" drive in (which gives me additional
> transid errors) and without it.
>
> Trying to mount with it gives me:
> [Jan 7 20:18] BTRFS info (device bcache0): enabling auto defrag
> [ +0.000010] BTRFS info (device bcache0): disk space caching is enabled
> [ +0.671411] BTRFS error (device bcache0): parent transid verify failed on 77292724051968 wanted 1499510 found 1499467
> [ +0.005950] BTRFS critical (device bcache0): corrupt leaf: root=2 block=77292724051968 slot=2, bad key order, prev (39029522223104 168 212992) current (39029521915904 168 16384)
Heavily corrupted extent tree. And there is a very good experimental
patch for you:
https://patchwork.kernel.org/patch/10738583/

Then mount with the "skip_bg,ro" mount options.

Please note this can only help you to salvage data (it's effectively a
kernel version of btrfs restore). AFAIK the corruption may affect the
fs trees too, so be aware of corrupted data.

Thanks,
Qu

> [ +0.000378] BTRFS error (device bcache0): failed to read block groups: -5
> [ +0.022884] BTRFS error (device bcache0): open_ctree failed
>
> Trying without the disk (and -o degraded) gives me:
> [Jan 8 12:51] BTRFS info (device bcache1): enabling auto defrag
> [ +0.000002] BTRFS info (device bcache1): allowing degraded mounts
> [ +0.000002] BTRFS warning (device bcache1): 'recovery' is deprecated, use 'usebackuproot' instead
> [ +0.000000] BTRFS info (device bcache1): trying to use backup root at mount time
> [ +0.000002] BTRFS info (device bcache1): disabling disk space caching
> [ +0.000001] BTRFS info (device bcache1): force clearing of disk cache
> [ +0.001334] BTRFS warning (device bcache1): devid 2 uuid 27f87964-1b9a-466c-ac18-b47c0d2faa1c is missing
> [ +1.049591] BTRFS critical (device bcache1): corrupt leaf: root=2 block=77291982323712 slot=0, unexpected item end, have 685883288 expect 3995
> [ +0.000739] BTRFS error (device bcache1): failed to read block groups: -5
> [ +0.017842] BTRFS error (device bcache1): open_ctree failed
>
> btrfs check output (without drive):
> warning, device 2 is missing
> checksum verify failed on 77088164081664 found 715B4470 wanted 580444F6
> checksum verify failed on 77088164081664 found 98775719 wanted FA63AD42
> checksum verify failed on 77088164081664 found 98775719 wanted FA63AD42
> bytenr mismatch, want=77088164081664, have=274663271295232
> Couldn't read chunk tree
> ERROR: cannot open file system
>
> I've already tried super-recover, zero-log and chunk-recover without
> any results, and check with --repair fails the same way as without.
>
> So, any ideas?
> I'll be happy to run experiments and grab more logs if anyone wants
> more details.
>
> And thanks for any suggestions.
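
For reference, the salvage procedure would look roughly like the
following. This is only a sketch: it assumes the experimental patch
above has been applied to the kernel (the skip_bg mount option does not
exist in mainline), and the device name /dev/bcache0, the mount point
/mnt/salvage, and the destination path are examples only.

```shell
# Mount read-only, skipping the block group items in the corrupted
# extent tree. skip_bg requires the experimental patch; ro is mandatory
# because nothing can be safely written without the extent tree.
mkdir -p /mnt/salvage
mount -o ro,skip_bg /dev/bcache0 /mnt/salvage

# Copy out whatever is still readable; expect I/O errors on files whose
# data or fs-tree metadata was also corrupted.
cp -a /mnt/salvage/important-data /path/to/backup/

umount /mnt/salvage
```

The user-space equivalent (no patched kernel needed, but slower and
less complete) would be `btrfs restore` pointed at one of the member
devices.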