On 2018-07-31 08:43, Sterling Windmill wrote:
> I am using a two disk raid1 btrfs filesystem spanning two external hard
> drives connected via USB 3.0.

Is there any speed difference between the two devices?
And are the two devices under the same USB 3.0 root hub or under different
root hubs?

lsusb output (lsusb -t prints the tree) would help to determine the hierarchy.

> 
> While copying ~6TB of data from this filesystem to local disk via rsync
> I am seeing messages like the following in dmesg output:
> 
> [ 2213.406267] BTRFS warning (device sdj1): csum failed root 5 ino 830
> off 2124197888 csum 0xb5da0cd2 expected csum 0x6e478250 mirror 2

Since only one copy shows the problem, the other copy should be good,
so the read should still succeed without problems.

> [ 4890.178727] BTRFS warning (device sdj1): csum failed root 5 ino 1058
> off 26052067328 csum 0x8ccd1067 expected csum 0x4adb8254 mirror 2
> [27463.940218] BTRFS warning (device sdj1): csum failed root 5 ino 5372
> off 7954096128 csum 0x9f9b697e expected csum 0xbd61a0e2 mirror 2
> [29405.832643] BTRFS warning (device sdj1): csum failed root 5 ino 31374
> off 7893983232 csum 0x12fd0ddc expected csum 0xddcd2f8e mirror 2
> [31224.279082] BTRFS warning (device sdj1): csum failed root 5 ino
> 150903 off 183635968 csum 0xea025eb4 expected csum 0x46d64878 mirror 2
> [32282.635615] BTRFS warning (device sdj1): csum failed root 5 ino
> 162774 off 31092424704 csum 0x1ee9b38d expected csum 0x4022e3de mirror 2
> [41052.643493] BTRFS warning (device sdj1): csum failed root 5 ino
> 163742 off 52214816768 csum 0x6723208c expected csum 0x0377e68a mirror 2
> [47723.500430] BTRFS warning (device sdj1): csum failed root 5 ino
> 470775 off 12533760 csum 0x9f50f9a0 expected csum 0x23ddc68e mirror 2
> [60060.843425] BTRFS warning (device sdj1): csum failed root 5 ino
> 786762 off 4178321408 csum 0xcd520ead expected csum 0x46fe6ebc mirror 2
> [60900.058745] BTRFS warning (device sdj1): csum failed root 5 ino
> 786900 off 896303104 csum 0x4c7e26e7 expected csum 0x86554095 mirror 2
> [68149.417236] BTRFS warning (device sdj1): csum failed root 5 ino 1058
> off 3101224960 csum 0x2b8c363c expected csum 0x8df2991a mirror 1
> [69072.272010] BTRFS warning (device sdj1): csum failed root 5 ino 1141
> off 2939588608 csum 0xa2969f63 expected csum 0xddf33efd mirror 1
> [71342.354453] BTRFS warning (device sdj1): csum failed root 5 ino 1328
> off 57047568384 csum 0xd57f5bb7 expected csum 0x421f96e5 mirror 1
> 
> Because the device was consistent, it seemed that one of the disks held
> bad data. I wasn't sure if btrfs was correcting the issue by using the
> other seemingly good copy on the second disk or if I was copying bad
> data to the destination filesystem, so I aborted the copy and ran a
> scrub of the filesystem that includes sdj1 by issuing the following command:
> 
> btrfs scrub start /external
> 
> I let the scrub finish and monitored the result using the following command:
> 
> btrfs scrub status /external
> 
> Which showed the following output:
> 
> scrub status for ece518d2-4af0-4ef7-a31d-8c89b13a5ad9
>         scrub started at Sun Jul 29 11:34:44 2018 and finished after
> 14:34:58
>         total bytes scrubbed: 12.80TiB with 0 errors

Would you provide the dmesg output from during the scrub?

> 
> Alright, perhaps btrfs had already fixed the issues upon encountering
> them. I ran my copy again only to see very similar messages show up in
> dmesg:
> 
> [154842.551604] BTRFS warning (device sdj1): csum failed root 5 ino 1284
> off 858886144 csum 0x8caf203c expected csum 0x9a3acab6 mirror 2

At least the corrupted inodes and offsets are different, so the old
corruption has been fixed, but somehow new corruption has been introduced.

> [159949.727412] BTRFS warning (device sdj1): csum failed root 5 ino 1636
> off 4463370240 csum 0x8dfaf00c expected csum 0xa7ab457e mirror 2
> [160911.893913] BTRFS warning (device sdj1): csum failed root 5 ino 1729
> off 8181428224 csum 0xd57845b5 expected csum 0x6904c54e mirror 2
> [165210.245890] BTRFS warning (device sdj1): csum failed root 5 ino 2927
> off 1013219328 csum 0xf2d2820d expected csum 0x812222bb mirror 2
> [169279.620570] BTRFS warning (device sdj1): csum failed root 5 ino 3363
> off 900493312 csum 0x6c6a35a2 expected csum 0x2a983a9c mirror 2
> [169990.401373] BTRFS warning (device sdj1): csum failed root 5 ino 4277
> off 186707968 csum 0xbdd075d5 expected csum 0xf302e9df mirror 2
> [171411.085425] BTRFS warning (device sdj1): csum failed root 5 ino 4719
> off 593842176 csum 0xcdabc7e6 expected csum 0xc137d47a mirror 2
> [173370.025471] BTRFS warning (device sdj1): csum failed root 5 ino 5267
> off 2605592576 csum 0xcd2cb8a8 expected csum 0x9de364e9 mirror 2
> [180329.942125] BTRFS warning (device sdj1): csum failed root 5 ino
> 162774 off 22459506688 csum 0xc38e7a53 expected csum 0xad11854c mirror 2

Since all the corruption shown above is on mirror 2, would you mind
scrubbing a single device rather than the whole fs, and attaching the dmesg?

# btrfs scrub start <device>
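
To check things like "which mirror" and "same ino/offset or a new one"
against the raw log instead of by eye, something like the following Python
sketch might help. It assumes plain dmesg text with one warning per line (not
the wrapped, "> "-quoted form in this mail); the function names are made up
for illustration, and the sample lines are copied from the warnings quoted
above.

```python
import re
from collections import Counter

# Matches the "csum failed" warning format quoted in this thread.
CSUM_RE = re.compile(
    r"BTRFS warning \(device (?P<dev>[^)]+)\): csum failed root (?P<root>\d+) "
    r"ino (?P<ino>\d+) off (?P<off>\d+) csum 0x[0-9a-f]+ "
    r"expected csum 0x[0-9a-f]+ mirror (?P<mirror>\d+)"
)

def parse_csum_failures(dmesg_text):
    """Return an (ino, off, mirror) tuple for every csum failure found."""
    return [
        (int(m["ino"]), int(m["off"]), int(m["mirror"]))
        for m in CSUM_RE.finditer(dmesg_text)
    ]

def mirror_counts(failures):
    """Count failures per mirror number."""
    return Counter(mirror for _, _, mirror in failures)

def repeated_blocks(before, after):
    """(ino, off) pairs that fail in both captures."""
    return {(i, o) for i, o, _ in before} & {(i, o) for i, o, _ in after}

# Two sample captures, one from before the scrub and one from after it.
before_scrub = (
    "[ 2213.406267] BTRFS warning (device sdj1): csum failed root 5 ino 830 "
    "off 2124197888 csum 0xb5da0cd2 expected csum 0x6e478250 mirror 2\n"
    "[68149.417236] BTRFS warning (device sdj1): csum failed root 5 ino 1058 "
    "off 3101224960 csum 0x2b8c363c expected csum 0x8df2991a mirror 1\n"
)
after_scrub = (
    "[154842.551604] BTRFS warning (device sdj1): csum failed root 5 ino 1284 "
    "off 858886144 csum 0x8caf203c expected csum 0x9a3acab6 mirror 2\n"
)

if __name__ == "__main__":
    before = parse_csum_failures(before_scrub)
    after = parse_csum_failures(after_scrub)
    print("per-mirror counts before:", dict(mirror_counts(before)))
    print("per-mirror counts after: ", dict(mirror_counts(after)))
    print("blocks failing in both:  ", repeated_blocks(before, after))
```

An empty set from repeated_blocks() only says that no (ino, off) pair failed
in both captures; it does not by itself prove the earlier blocks were
rewritten.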



> 
> I would have expected the scrub to find these issues or to show some
> number of corrected errors. Perhaps I misunderstand what scrub does?

Your understanding is completely correct.
In fact, reading a corrupted block should already trigger a rewrite of the
corrupted data.

I'm wondering if it's related to some scrub race, since for multi-device
btrfs a full-fs scrub is done by running multiple scrubs
simultaneously, one for each device.
That used to cause problems for raid5/6, but I have never heard of it causing
corruption on raid1.

Would you provide the kernel version and the full dmesg (covering the read
errors, the scrub, and the later read)?

> 
> I also tried tracking down individual files via the referenced inode
> numbers with the following command:
> 
> btrfs inspect-internal inode-resolve $INODE /external
> 
> And ran checksums of the source and destination versions of these files
> to find them to be identical. So at least the copy on the source and
> destination appear to match.

Since btrfs will switch to the good copy, the data should be correct.
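
For repeating that source/destination comparison over many files, a minimal
Python sketch could look like the one below. It is only an illustration: the
helper names are hypothetical, SHA-256 is an arbitrary choice (the checksum
tool used above is not stated), and the paths would be whatever
inode-resolve printed plus the matching destination path.

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def same_content(src_path, dst_path):
    """True if both files hash to the same SHA-256 digest."""
    return sha256_of(src_path) == sha256_of(dst_path)
```

A read that hits an unrecoverable csum error would surface here as an I/O
error from read() rather than as a mismatching digest.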

> 
> Maybe I'm experiencing some sort of intermittent USB device / bus issue?

The full dmesg may help here, if it contains anything related to USB.

Thanks,
Qu

> Can anyone help explain what might be happening here?
> 
> Thanks!
> 
> 
> 
> 

