On Sun, Oct 29, 2017 at 3:05 PM, Chris Murphy <li...@colorremedies.com> wrote:
> On Thu, Oct 26, 2017 at 4:34 AM, Zak Kohler <y...@y2kbugger.com> wrote:
>
>> I will gladly repeat this process, but I am very concerned why this
>> corruption happened in the first place.
>
> Right. So everyone who can, needs to run the three scrubs on all
> available Btrfs volumes/devices and see if they get any discrepancies.
> I only ever use the online scrub so I have no idea if --offline or the
> older check --check-data-csum differ from it.
>
> scrub start
> scrub start --offline
> btrfs check --check-data-csum
>
> I think you've hit a software bug if those three methods don't exactly
> agree with each other. And it's a question which one is correct, or if
> they all have different bugs in them?
>
>
>
>>
>> More tests:
>>
>> scrub start --offline
>>     All devices had errors in differing amounts
>>     I will verify that these counts are repeatable.
>>     Csum error: 150
>>     Csum error: 238
>>     Csum error: 175
>>
>> btrfs check
>>     found 2179745955840 bytes used, no error found
>>
>> btrfs check --check-data-csum
>>     mirror 0 bytenr 13348855808 csum 2387937020 expected csum 562782116
>>     mirror 0 bytenr 23398821888 csum 3602081170 expected csum 1963854755
>
>
> Offhand that sounds like three different results, which is sorta fakaked.

Here are the results from passing each of the three devices:

$ sudo btrfs scrub start --offline --progress /dev/disk/by-id/WD-XX1
Scrub result:
Tree bytes scrubbed: 5234425856
Tree extents scrubbed: 638968
Data bytes scrubbed: 4353723670528
Data extents scrubbed: 374300
Data bytes without csum: 533200896
Read error: 0
Verify error: 0
Csum error: 150

$ sudo btrfs scrub start --offline --progress /dev/disk/by-id/WD-XX2
Scrub result:
Tree bytes scrubbed: 5234425856
Tree extents scrubbed: 638967
Data bytes scrubbed: 4353723314176
Data extents scrubbed: 374300
Data bytes without csum: 533200896
Read error: 0
Verify error: 0
Csum error: 238

$ sudo btrfs scrub start --offline --progress /dev/disk/by-id/WD-XX3
Scrub result:
Tree bytes scrubbed: 5234491392
Tree extents scrubbed: 638975
Data bytes scrubbed: 4353723572224
Data extents scrubbed: 374300
Data bytes without csum: 533200896
Read error: 0
Verify error: 0
Csum error: 175 #first run
Csum error: 112 #second run...
Csum error: 285 #third run...

I ran the /dev/disk/by-id/WD-XX3 device three times and, as you can
see, the Csum error count is different on every run.


So I ran memtest86+ 5.01 for >4 days:
Pass: 39 Errors: 0


The only other thing I can think to try is moving those drives to
another system.

I'm running '$ sudo btrfs scrub start --offline --progress
/dev/disk/by-id/WD-XX3' one more time, just to make sure that I
wasn't misreading something.
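
To take the guesswork out of repeated runs, a small wrapper can rerun
the same offline scrub and pull the csum error count out of each
report. This is only a sketch: it assumes a btrfs-progs build that has
`scrub start --offline` (as used above) and a report containing a
`Csum error: N` line like the output pasted above. `csum_errors`,
`repeat_scrub` and `main` are hypothetical helper names. It searches
both stdout and stderr because I'm not sure which stream the report
goes to, and it does not treat a non-zero exit status as fatal, since
scrub may exit non-zero precisely because it found errors.

```python
import re
import subprocess
import sys
from typing import List

# Matches the "Csum error: N" line at the end of an offline scrub report.
CSUM_RE = re.compile(r"^Csum error:\s*(\d+)", re.MULTILINE)


def csum_errors(report: str) -> int:
    """Return the 'Csum error' count from one scrub report."""
    match = CSUM_RE.search(report)
    if match is None:
        raise ValueError("no 'Csum error' line found in the report")
    return int(match.group(1))


def repeat_scrub(cmd: List[str], runs: int) -> List[int]:
    """Run the same read-only command `runs` times and collect csum error counts."""
    counts = []
    for _ in range(runs):
        # No check=True: scrub may exit non-zero because it found errors.
        result = subprocess.run(cmd, capture_output=True, text=True)
        # Assumption: the report may land on either stream, so search both.
        counts.append(csum_errors(result.stdout + result.stderr))
    return counts


def main(argv: List[str]) -> None:
    if len(argv) != 2:
        print(f"usage: {argv[0]} DEVICE", file=sys.stderr)
        return
    # Same command as in this thread, e.g. DEVICE=/dev/disk/by-id/WD-XX3
    cmd = ["btrfs", "scrub", "start", "--offline", "--progress", argv[1]]
    counts = repeat_scrub(cmd, runs=3)
    print("csum errors per run:", counts)
    # Assuming the filesystem stays unmounted and nothing writes to it
    # between runs, healthy hardware should give the same count every time.
    print("consistent" if len(set(counts)) == 1 else "INCONSISTENT between runs")


if __name__ == "__main__":
    main(sys.argv)
```

The script would be run as root, e.g. `sudo python3 repeat_scrub.py
/dev/disk/by-id/WD-XX3` (the file name is just an example).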

I just noticed that the activity lights on all three drives are
blinking, so can I assume that
$ sudo btrfs scrub start --offline --progress /dev/disk/by-id/WD-XX1
$ sudo btrfs scrub start --offline --progress /dev/disk/by-id/WD-XX2
$ sudo btrfs scrub start --offline --progress /dev/disk/by-id/WD-XX3
are all equivalent and each checks the entire filesystem?
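
One rough way to probe that question is to diff the three reports
counter by counter. The sketch below uses the counters copied from the
three reports above; `parse_report` and `differing_fields` are
hypothetical helper names. It only shows which counters agree, not why
the others differ.

```python
from typing import Dict, Optional

# Counters copied from the three offline scrub reports above.
REPORTS = {
    "WD-XX1": """Tree bytes scrubbed: 5234425856
Tree extents scrubbed: 638968
Data bytes scrubbed: 4353723670528
Data extents scrubbed: 374300
Data bytes without csum: 533200896
Read error: 0
Verify error: 0
Csum error: 150""",
    "WD-XX2": """Tree bytes scrubbed: 5234425856
Tree extents scrubbed: 638967
Data bytes scrubbed: 4353723314176
Data extents scrubbed: 374300
Data bytes without csum: 533200896
Read error: 0
Verify error: 0
Csum error: 238""",
    "WD-XX3": """Tree bytes scrubbed: 5234491392
Tree extents scrubbed: 638975
Data bytes scrubbed: 4353723572224
Data extents scrubbed: 374300
Data bytes without csum: 533200896
Read error: 0
Verify error: 0
Csum error: 175""",
}


def parse_report(text: str) -> Dict[str, int]:
    """Turn 'Key: number' lines into a dict; all other lines are ignored."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():
            fields[key.strip()] = int(value)
    return fields


def differing_fields(reports: Dict[str, str]) -> Dict[str, Dict[str, Optional[int]]]:
    """Return only the counters whose value is not identical on every device."""
    parsed = {dev: parse_report(text) for dev, text in reports.items()}
    keys = sorted(set().union(*(p.keys() for p in parsed.values())))
    diffs = {}
    for key in keys:
        values = {dev: p.get(key) for dev, p in parsed.items()}
        if len(set(values.values())) > 1:
            diffs[key] = values
    return diffs


if __name__ == "__main__":
    for key, values in differing_fields(REPORTS).items():
        print(f"{key}: {values}")
```

With these numbers, Data extents scrubbed, Data bytes without csum,
Read error and Verify error are identical for all three devices, while
Csum error and the Tree bytes, Tree extents and Data bytes scrubbed
totals differ slightly. Identical extent counts suggest each invocation
walks the same data extents, but that is an inference, not something
confirmed from the btrfs-progs source.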

The original problem was noticed during a btrfs send: it would stop
and log an error to dmesg after a varying amount of data. You could
run it 3 times and it would stop at the same spot, and then the next
time it would get further.
Given dmesg errors like the one below, I've tabulated the info to show
the intermittent nature more clearly:

[  114.450699] BTRFS warning (device sdn): csum failed ino 6407 \
off 7683907584 csum 1745651892 expected csum 3952841867

   time     ino        off          csum    expected
[  114.43]  6407    7683907584  1745651892  3952841867
[  114.45]  6407    7683907584  1745651892  3952841867
[38494.97]  4708    27529216    876064455   874979996
[38494.98]  4708    27529216    2615801759  874979996
[38541.07]  4708    27529216    876064455   874979996
[38571.24]  4708    27529216    2615801759  874979996
[39434.21]  4708    27529216    2615801759  874979996
[73132.65]  4708    27529216    2615801759  874979996
[73167.89]  4708    27529216    2615801759  874979996

Sometimes it gets stuck at the same ino, sometimes not. Sometimes the
actual incorrect csum repeats, sometimes it doesn't.
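
The table can also be built mechanically by grouping the warnings by
(ino, off). My reading of the message (not confirmed against the kernel
source) is that "expected csum" is the value stored on disk, which stays
the same in every message, while "csum" is recomputed from the bytes
just read. If nothing is modifying the file, two different computed
values for the same ino/off would mean the read returned different bytes
on different attempts. The sample lines are rebuilt from the table above
with the timestamp and device prefix omitted; `group_csum_failures` is a
hypothetical helper name.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Matches the payload of the kernel warning quoted above.
CSUM_FAIL_RE = re.compile(
    r"csum failed ino (\d+) off (\d+) csum (\d+) expected csum (\d+)"
)


def group_csum_failures(
    lines: Iterable[str],
) -> Dict[Tuple[int, int], Tuple[Set[int], Set[int]]]:
    """Map (ino, off) -> (computed csums seen, expected csums seen)."""
    groups = defaultdict(lambda: (set(), set()))
    for line in lines:
        m = CSUM_FAIL_RE.search(line)
        if m is None:
            continue  # not a csum failure message
        ino, off, got, expected = (int(g) for g in m.groups())
        computed, stored = groups[(ino, off)]
        computed.add(got)
        stored.add(expected)
    return dict(groups)


# (ino, off, csum, expected csum), rebuilt from the table above.
ROWS = [
    (6407, 7683907584, 1745651892, 3952841867),
    (6407, 7683907584, 1745651892, 3952841867),
    (4708, 27529216, 876064455, 874979996),
    (4708, 27529216, 2615801759, 874979996),
    (4708, 27529216, 876064455, 874979996),
    (4708, 27529216, 2615801759, 874979996),
    (4708, 27529216, 2615801759, 874979996),
    (4708, 27529216, 2615801759, 874979996),
    (4708, 27529216, 2615801759, 874979996),
]
SAMPLE = [f"csum failed ino {i} off {o} csum {c} expected csum {e}" for i, o, c, e in ROWS]

if __name__ == "__main__":
    for (ino, off), (computed, stored) in sorted(group_csum_failures(SAMPLE).items()):
        verdict = "computed csum varies" if len(computed) > 1 else "computed csum repeats"
        print(f"ino {ino} off {off}: computed {sorted(computed)}, "
              f"expected {sorted(stored)} ({verdict})")
```

On these rows, ino 4708 shows two different computed csums against a
single expected csum, while ino 6407 repeats the same wrong value. In
real use the input would be the saved dmesg output instead of SAMPLE.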



>
>>     ...
>>
>> The only thing I could think of is that the btrfs version that I used to mkfs
>> was not up to date. Is there a way to determine which version was used to
>> create the filesystem?
>
> That information isn't in the superblock. I think it could be added to
> the device tree as a PERSISTENT_ITEM, although I'm not sure how useful
> it is.
>
> Anyway, I don't think it's related. mkfs.btrfs writes a tiny amount to
> the drive, and almost certainly the problem is happening later.
>
>
>
> --
> Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
