The result of the scrubbing came back today, and it was not pretty:

...
scrub done for b64daec7-6c14-4996-94b3-80c6abfa26ce
        scrub started at Wed Oct 23 23:01:22 2013 and finished after 34990 seconds
        total bytes scrubbed: 12.55TB with 3859542 errors
        error details: csum=3859542
        corrected errors: 0, uncorrectable errors: 3859542, unverified errors: 0
---
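For scale: btrfs keeps one checksum per sector, so assuming the default 4 KiB sector size and one error per block, 3859542 csum errors cover roughly 3859542 * 4096 bytes, about 14.7 GiB of the 12.55 TB scrubbed.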
Still only two folder structures are affected, but they seem to be unrecoverable. I noticed the mail about getting the fix into 3.12. Yippee! Until it is included I will have to postpone rebalancing over the four new drives.

Best regards

Hans-Kristian Bakke


On 23 October 2013 23:49, Hans-Kristian Bakke <hkba...@gmail.com> wrote:
> OK. btrfs scrub and dmesg are hitting me with lots of unfixable
> errors, all in the same file. Example:
>
> [13313.441091] btrfs: unable to fixup (regular) error at logical 560107954176 on dev /dev/sdn
> [13321.532223] scrub_handle_errored_block: 1510 callbacks suppressed
> [13321.532309] btrfs_dev_stat_print_on_error: 1510 callbacks suppressed
> [13321.532314] btrfs: bdev /dev/sdq errs: wr 0, rd 0, flush 0, corrupt 40016, gen 0
> [13321.532420] btrfs: bdev /dev/sdq errs: wr 0, rd 0, flush 0, corrupt 40017, gen 0
> [13321.532545] btrfs: bdev /dev/sdq errs: wr 0, rd 0, flush 0, corrupt 40018, gen 0
> [13321.532605] btrfs: bdev /dev/sdq errs: wr 0, rd 0, flush 0, corrupt 40019, gen 0
> [13321.533039] btrfs: bdev /dev/sdq errs: wr 0, rd 0, flush 0, corrupt 40020, gen 0
> [13321.537519] scrub_handle_errored_block: 1508 callbacks suppressed
> [13321.537525] btrfs: unable to fixup (regular) error at logical 560630136832 on dev /dev/sdq
> [13321.537821] btrfs: bdev /dev/sdq errs: wr 0, rd 0, flush 0, corrupt 40021, gen 0
> [13321.538081] btrfs: unable to fixup (regular) error at logical 560630140928 on dev /dev/sdq
> [13321.538438] btrfs: bdev /dev/sdq errs: wr 0, rd 0, flush 0, corrupt 40022, gen 0
> [13321.538715] btrfs: unable to fixup (regular) error at logical 560630145024 on dev /dev/sdq
> [13321.539016] btrfs: bdev /dev/sdq errs: wr 0, rd 0, flush 0, corrupt 40023, gen 0
> [13321.539234] btrfs: unable to fixup (regular) error at logical 560630149120 on dev /dev/sdq
> [13321.539522] btrfs: bdev /dev/sdq errs: wr 0, rd 0, flush 0, corrupt 40024, gen 0
> [13321.539739] btrfs: unable to fixup (regular) error at logical 560630153216 on dev /dev/sdq
> [13321.540027] btrfs: bdev /dev/sdq errs: wr 0, rd 0, flush 0, corrupt 40025, gen 0
> [13321.540242] btrfs: unable to fixup (regular) error at logical 560630157312 on dev /dev/sdq
> [13321.540620] btrfs: unable to fixup (regular) error at logical 560630161408 on dev /dev/sdq
> [13321.541140] btrfs: unable to fixup (regular) error at logical 560630165504 on dev /dev/sdq
> [13321.541571] btrfs: unable to fixup (regular) error at logical 560630169600 on dev /dev/sdq
> [13321.541931] btrfs: unable to fixup (regular) error at logical 560630173696 on dev /dev/sdq
>
> Luckily all the corruption seems to be in a single very large file,
> though in different parts of it on different disks. The file was
> written by rtorrent, which has the option "system.file_allocate.set =
> yes" configured. I also have Samba configured with "strict allocate =
> yes", because that is recommended for best performance on
> extent-based filesystems. Does that mean files written through Samba
> are vulnerable to this corruption too? If so, this could become very
> ugly very fast on certain systems.
>
> Best regards
>
> Hans-Kristian Bakke
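To make the failure mode above concrete: the trigger is a file that was preallocated and then only partially written, so the checksums begin at a bytenr inside the extent rather than at its front. Here is a minimal sketch of that write pattern in the spirit of rtorrent's file allocation (the path and sizes are invented for illustration; this is not rtorrent's actual code):

/* Preallocate a large extent, then write only into the middle of it.
 * Path and sizes are made up for illustration. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/btrfs/prealloc-test", O_CREAT | O_WRONLY, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Reserve 128 MiB of prealloc (unwritten) extent space. */
    if (fallocate(fd, 0, 0, 128 << 20) < 0) { perror("fallocate"); return 1; }

    /* Write a single 4 KiB block into the middle of the reservation,
     * so the only csum in the extent starts 64 MiB past its front. */
    char buf[4096];
    memset(buf, 0xab, sizeof(buf));
    if (pwrite(fd, buf, sizeof(buf), 64 << 20) != (ssize_t)sizeof(buf)) {
        perror("pwrite");
        return 1;
    }

    fsync(fd);
    close(fd);
    return 0;
}

Whether fallocate() produces one extent or several depends on the allocator, so treat this as a sketch of the layout rather than a guaranteed reproducer; on an affected kernel, balancing the data chunk holding such an extent should exercise the csum lookup path that misfires here.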
> On 23 October 2013 23:24, Hans-Kristian Bakke <hkba...@gmail.com> wrote:
>> I was hit by this today when trying to rebalance a 16TB RAID10 into
>> a 32TB RAID10, going from 4 to 8 WD SE 4TB drives. I cannot finish a
>> rebalance because of failed csums.
>>
>> [10228.850910] BTRFS info (device sdq): csum failed ino 487 off 65536 csum 2566472073 private 151366068
>> [10228.850967] BTRFS info (device sdq): csum failed ino 487 off 69632 csum 2566472073 private 3056924305
>> [10228.850973] BTRFS info (device sdq): csum failed ino 487 off 593920 csum 2566472073 private 906093395
>> [10228.851004] BTRFS info (device sdq): csum failed ino 487 off 73728 csum 2566472073 private 2680502892
>> [10228.851014] BTRFS info (device sdq): csum failed ino 487 off 598016 csum 2566472073 private 1940162924
>> [10228.851029] BTRFS info (device sdq): csum failed ino 487 off 77824 csum 2566472073 private 2939385278
>> [10228.851051] BTRFS info (device sdq): csum failed ino 487 off 602112 csum 2566472073 private 645310077
>> [10228.851055] BTRFS info (device sdq): csum failed ino 487 off 81920 csum 2566472073 private 3600741549
>> [10228.851078] BTRFS info (device sdq): csum failed ino 487 off 86016 csum 2566472073 private 200201951
>> [10228.851091] BTRFS info (device sdq): csum failed ino 487 off 606208 csum 2566472073 private 1002916440
>>
>> The system is running a scrub now and I will return with more
>> details later. I do not think systemd is logging to this volume, but
>> the scrub will probably show which files are affected.
>>
>> As this is a very serious issue for those hit by the corruption (it
>> basically makes it impossible to run a rebalance, with all the
>> consequences that has), hopefully this will go upstream soon.
>> I am on kernel 3.11.6, by the way.
>>
>> Best regards
>>
>> Hans-Kristian Bakke
>> Mob: 91 76 17 38
>>
>>
>> On 4 October 2013 23:19, Johannes Hirte <johannes.hi...@datenkhaos.de> wrote:
>>> On Fri, 27 Sep 2013 09:37:00 -0400
>>> Josef Bacik <jba...@fusionio.com> wrote:
>>>
>>>> A user reported a problem where they were getting csum errors when
>>>> running a balance and running systemd's journal. This is because
>>>> systemd is awesome and fallocate()'s its log space and writes into
>>>> it. Unfortunately we assume that when we read in all the csums for
>>>> an extent that they are sequential starting at the bytenr we care
>>>> about. This obviously isn't the case for prealloc extents, where
>>>> we could have written to the middle of the prealloc extent only,
>>>> which means the csum would be for the bytenr in the middle of our
>>>> range and not the front of our range. Fix this by offsetting the
>>>> new bytenr we are logging to based on the original bytenr the csum
>>>> was for. With this patch I no longer see the csum errors I was
>>>> seeing. Thanks,
>>>
>>> Any assessment when this goes upstream? Until it hits Linus' tree
>>> it won't appear in stable. And this seems rather important.
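For anyone curious what the fix amounts to: per Josef's description it is an offset correction applied when csums are re-logged at an extent's new location. A sketch of the idea with invented names (this is not the actual kernel patch):

#include <stdio.h>

typedef unsigned long long u64;

/* When relocation copies the csums of an extent to its new location,
 * a csum run may start anywhere inside a partially written prealloc
 * extent. The buggy path logged every run as if it began at the front
 * of the extent; the fix preserves the run's offset within it. */
static u64 relocated_csum_bytenr(u64 old_extent_start,
                                 u64 new_extent_start,
                                 u64 csum_bytenr)
{
    return new_extent_start + (csum_bytenr - old_extent_start);
}

int main(void)
{
    /* Invented numbers: an extent moves from 560630000000 to
     * 910000000000 and its csum run starts 136832 bytes in. */
    printf("%llu\n", relocated_csum_bytenr(560630000000ULL,
                                           910000000000ULL,
                                           560630136832ULL));
    return 0;
}

Dropping the (csum_bytenr - old_extent_start) term is exactly what leaves stale csums behind after a balance touches a partially written prealloc extent.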