I converted my (significantly smaller) raid5 array to raid1 a little less
than a year ago and ran into some similar issues.

What I ended up doing was starting the balance again and again with
slightly different arguments (usually thresholds for which block groups to
move), and eventually, after a week or two even with a small array, I got a
full conversion with only some data loss, which scrub let me find and then
restore from backups.
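In case it helps, the runs looked roughly like the following (the usage
and limit values here are only illustrative, not the exact numbers I used;
I varied them from run to run):

  # convert data chunks a slice at a time: "soft" skips chunks that are
  # already raid1, "usage=20" only touches block groups that are less
  # than 20% full, "limit=50" stops after 50 block groups
  btrfs balance start -dconvert=raid1,soft,usage=20,limit=50 /storage

  # repeat with higher usage values (and eventually no usage filter),
  # and convert metadata in a separate run
  btrfs balance start -mconvert=raid1,soft /storage

  # check whether the current run is still making progress
  btrfs balance status -v /storage

If a run wedges on the bad block group, cancelling it with
"btrfs balance cancel /storage" and retrying with different filters should
let the conversion creep forward around the damaged area.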

On Mon, Aug 29, 2016 at 5:57 AM, henkjan gersen <h.ger...@gmail.com> wrote:
> Following the recent posts on the mailing list I'm trying to convert a
> running raid5 system to raid1. This conversion fails to complete with
> checksum verify failures. Running a scrub does not fix these checksum
> failures, and moreover scrub itself aborts after ~9TB (despite repeated
> tries).
>
> All disks in the array complete a long smartctl test without any
> errors. Running a scrub after remounting the array with the
> recovery option also makes no difference; it still aborts. For
> clarity: I can mount the array without issues, and copying all files
> and directories to /dev/zero completes without any errors in the logs.
>
> Any suggestions on how to salvage the array would be highly
> appreciated, as I'm out of options/ideas for this. I do have a backup
> of the important bits, but restoring it will still take time.
>
> The information of the system:
>
> --
>
> Linux kernel: 4.4.6 (Slackware)
> btrfs-progs v4.5.3
>
> [root@quasar:~] # btrfs fi show
> Label: 'btr_pool2' uuid: 7c9b2b91-1e89-45fe-8726-91a97663bb5c
>         Total devices 7 FS bytes used 9.97TiB
>         devid 3 size 3.64TiB used 3.34TiB path /dev/sdh
>         devid 4 size 3.64TiB used 3.34TiB path /dev/sdd
>         devid 5 size 1.82TiB used 1.53TiB path /dev/sdb
>         devid 6 size 1.82TiB used 1.53TiB path /dev/sdc
>         devid 7 size 3.64TiB used 3.34TiB path /dev/sdg
>         devid 10 size 3.64TiB used 3.34TiB path /dev/sde
>         devid 11 size 3.64TiB used 3.34TiB path /dev/sdf
>
> [root@quasar:~] # btrfs fi df /storage
> Data, RAID1: total=9.50TiB, used=9.48TiB
> Data, RAID5: total=1.72GiB, used=1.72GiB
> Data, RAID6: total=496.76GiB, used=490.45GiB
> System, RAID1: total=32.00MiB, used=1.44MiB
> Metadata, RAID1: total=10.00GiB, used=7.68GiB
> Metadata, RAID5: total=4.09GiB, used=3.22GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
> --
>
> The mixture of raid1 and raid5 is the result of the balancing
> operation stopping. If I try to restart the balance with the
> soft option, it aborts when balancing only metadata. For the
> data blocks it hangs, with no IO activity in iostat for many hours,
> once it hits the logical address that fails checksum verification.
>
> The output from the scrub operation shows that it almost fully
> completes. Note how the errors are on different devices than the ones
> flagged up in dmesg when the status is shown per device.
>
> --
>
> [root@quasar:~] # btrfs scrub status /storage/
> scrub status for 7c9b2b91-1e89-45fe-8726-91a97663bb5c
>         scrub started at Sun Aug 28 14:58:27 2016 and was aborted after 08:15:15
>         total bytes scrubbed: 8.91TiB with 33 errors
>         error details: read=32 csum=1
>         corrected errors: 0, uncorrectable errors: 33, unverified errors: 0
>
> [root@quasar:~] # btrfs scrub status -d /storage/
> scrub status for 7c9b2b91-1e89-45fe-8726-91a97663bb5c
> scrub device /dev/sdh (id 3) history
>         scrub started at Sun Aug 28 14:58:27 2016 and was aborted after 01:04:54
>         total bytes scrubbed: 429.36GiB with 0 errors
> scrub device /dev/sdd (id 4) history
>         scrub started at Sun Aug 28 14:58:27 2016 and was aborted after 01:04:24
>         total bytes scrubbed: 425.46GiB with 16 errors
>         error details: read=16
>         corrected errors: 0, uncorrectable errors: 16, unverified errors: 0
> scrub device /dev/sdb (id 5) history
>         scrub started at Sun Aug 28 14:58:27 2016 and finished after 08:15:15
>         total bytes scrubbed: 1.52TiB with 0 errors
> scrub device /dev/sdc (id 6) history
>         scrub started at Sun Aug 28 14:58:27 2016 and finished after 08:02:51
>         total bytes scrubbed: 1.52TiB with 1 errors
>         error details: csum=1
>         corrected errors: 0, uncorrectable errors: 1, unverified errors: 0
> scrub device /dev/sdg (id 7) history
>         scrub started at Sun Aug 28 14:58:27 2016 and was aborted after 03:07:32
>         total bytes scrubbed: 1.16TiB with 0 errors
> scrub device /dev/sde (id 10) history
>         scrub started at Sun Aug 28 14:58:27 2016 and was aborted after 06:51:31
>         total bytes scrubbed: 1.94TiB with 0 errors
> scrub device /dev/sdf (id 11) history
>         scrub started at Sun Aug 28 14:58:27 2016 and was aborted after 06:03:00
>         total bytes scrubbed: 1.94TiB with 16 errors
>         error details: read=16
>         corrected errors: 0, uncorrectable errors: 16, unverified errors: 0
>
> --
>
> The relevant chunk from dmesg when mounting the array itself. I'm not
> sure what the "corrupt" errs for devices sdb and sdc are, as there
> seems to be no documentation for them. Both drives pass a smartctl -t
> long test without errors, as said.
>
> I needed to reboot when the balancing hung, but the errors in dmesg
> looked similar to these.
>
> --
>
> [ 1067.179062] BTRFS info (device sde): disk space caching is enabled
> [ 1067.414416] BTRFS info (device sde): bdev /dev/sdc errs: wr 0, rd 0, flush 0, corrupt 47, gen 0
> [ 1067.414423] BTRFS info (device sde): bdev /dev/sdb errs: wr 0, rd 0, flush 0, corrupt 337, gen 0
> [ 1111.375181] BTRFS: checking UUID tree
> [ 1111.375206] BTRFS info (device sde): continuing balance
> [ 1116.413445] BTRFS info (device sde): relocating block group 95050853777408 flags 257
> [ 1134.882061] BTRFS warning (device sde): sde checksum verify failed on 99586523447296 wanted D883E9B found DF677297 level 0
> [ 1135.032077] BTRFS warning (device sde): sde checksum verify failed on 99586523447296 wanted D883E9B found DF677297 level 0
> [ 1135.032318] BTRFS warning (device sde): sde checksum verify failed on 99586523447296 wanted D883E9B found DF677297 level 0
> [ 1135.032455] BTRFS warning (device sde): sde checksum verify failed on 99586523447296 wanted D883E9B found DF677297 level 0
> [ 1135.032646] BTRFS warning (device sde): sde checksum verify failed on 99586523447296 wanted D883E9B found DF677297 level 0
> [ 1135.032742] BTRFS warning (device sde): sde checksum verify failed on 99586523447296 wanted D883E9B found DF677297 level 0
> [ 1135.032907] BTRFS warning (device sde): sde checksum verify failed on 99586523447296 wanted D883E9B found DF677297 level 0
> [ 1135.033035] BTRFS warning (device sde): sde checksum verify failed on 99586523447296 wanted D883E9B found DF677297 level 0
> [ 1135.033227] BTRFS warning (device sde): sde checksum verify failed on 99586523447296 wanted D883E9B found DF677297 level 0
> [ 1135.033330] BTRFS warning (device sde): sde checksum verify failed on 99586523447296 wanted D883E9B found DF677297 level 0
> [ 1143.682132] BTRFS info (device sde): found 455 extents
> [ 1143.823628] csum_tree_block: 8106 callbacks suppressed
> [ 1143.823635] BTRFS warning (device sde): sde checksum verify failed on 99586523447296 wanted D883E9B found DF677297 level 0
> [ 1143.823754] BTRFS warning (device sde): sde checksum verify failed on 99586523447296 wanted D883E9B found DF677297 level 0
>
> --
>
> The output of btrfs check shows checksum failures all relating to the
> same logical address:
>
> --
> [root@quasar:~] # btrfs check -p /dev/sdc
> Checking filesystem on /dev/sdc
> UUID: 7c9b2b91-1e89-45fe-8726-91a97663bb5c
> checksum verify failed on 99586523447296 found DF677297 wanted 0D883E9B
> checksum verify failed on 99586523447296 found DF677297 wanted 0D883E9B
> checksum verify failed on 99586523447296 found 87B38132 wanted B1BF7088
> checksum verify failed on 99586523447296 found 87B38132 wanted B1BF7088
> bytenr mismatch, want=99586523447296, have=458752
> owner ref check failed [99586523447296 16384]
>
> cache and super generation don't match, space cache will be invalidated
> checksum verify failed on 99586523447296 found DF677297 wanted 0D883E9B
> checksum verify failed on 99586523447296 found DF677297 wanted 0D883E9B
> checksum verify failed on 99586523447296 found 87B38132 wanted B1BF7088
> checksum verify failed on 99586523447296 found 87B38132 wanted B1BF7088
> bytenr mismatch, want=99586523447296, have=458752
> checking fs roots [O]
> checking csums
> checksum verify failed on 99586523447296 found DF677297 wanted 0D883E9B
> checksum verify failed on 99586523447296 found DF677297 wanted 0D883E9B
> checksum verify failed on 99586523447296 found 87B38132 wanted B1BF7088
> checksum verify failed on 99586523447296 found 87B38132 wanted B1BF7088
> bytenr mismatch, want=99586523447296, have=458752
> Error going to next leaf -5
> checking root refs
> found 10966788235264 bytes used err is 0
> total csum bytes: 10698166420
> total tree bytes: 11712806912
> total fs tree bytes: 405241856
> total extent tree bytes: 265453568
> btree space waste bytes: 347751364
> file data blocks allocated: 10955252420608
> referenced 10992993153024
>
> --
>
> Trying to relate that logical address to any real file or directory
> fails. I've seen messages on this mailing list saying that I would
> need to give it a subvolume, but that doesn't seem to make any
> difference; it gives me the same error:
>
> --
> [root@quasar:~] # btrfs inspect-internal logical-resolve 99586523447296 /storage/
> ERROR: logical ino ioctl: No such file or directory
> --
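
A quick note on that logical-resolve error: as far as I understand it,
99586523447296 is a metadata (tree) block rather than a data extent (the
check output reports it as [99586523447296 16384] and dmesg shows it as a
level 0 leaf), and logical-resolve only maps data extents back to files,
so the ENOENT there is expected rather than a separate problem. If you
want to see which tree that block belongs to, something along these lines
should print it; I'm going from memory on the exact invocation for progs
4.5.3, so treat it as a pointer rather than a recipe:

  # dump only that one tree block to see its owner, level and items
  # (invocation from memory, check the btrfs-debug-tree man page first)
  btrfs-debug-tree -b 99586523447296 /dev/sdc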
>
> With the above things completed I've tried running btrfs check with
> repair enabled, but that crashes with an assertion failure, so that
> doesn't help either.
>
> --
>
> [root@quasar:~] # btrfs check -p --repair /dev/sdc
> enabling repair mode
> Checking filesystem on /dev/sdc
> UUID: 7c9b2b91-1e89-45fe-8726-91a97663bb5c
> checksum verify failed on 99586523447296 found DF677297 wanted 0D883E9B
> checksum verify failed on 99586523447296 found DF677297 wanted 0D883E9B
> checksum verify failed on 99586523447296 found 87B38132 wanted B1BF7088
> checksum verify failed on 99586523447296 found 87B38132 wanted B1BF7088
> bytenr mismatch, want=99586523447296, have=458752
> owner ref check failed [99586523447296 16384]
> Unable to find block group for 0
> extent-tree.c:289: find_search_start: Assertion `1` failed.
> btrfs(btrfs_reserve_extent+0x993)[0x44ef37]
> btrfs(btrfs_alloc_free_block+0x50)[0x44f2c7]
> btrfs(__btrfs_cow_block+0x19d)[0x43eca8]
> btrfs(btrfs_cow_block+0xec)[0x43f6ff]
> btrfs(btrfs_search_slot+0x1b9)[0x442004]
> btrfs[0x42080b]
> btrfs[0x42a1e9]
> btrfs(cmd_check+0x156e)[0x42c461]
> btrfs(main+0x155)[0x40a75d]
> /lib64/libc.so.6(__libc_start_main+0xf0)[0x7fb45d9b17d0]
> btrfs(_start+0x29)[0x40a2e9]
>
> --
> Any suggestion would be much appreciated. Thanks for getting this far
> in reading!
>
> Best wishes,
> Henkjan Gersen