On 2018-05-14 18:29, james harvey wrote:
> On Mon, May 14, 2018 at 2:36 AM, Qu Wenruo <quwenruo.bt...@gmx.com> wrote:
>> OK, I could reproduce it now.
>>
>> Just mount with -o nodatasum, then create a file.
>> Remount with compress-force=lzo, then write something.
>>
>> So at least btrfs should disallow such a thing.
>>
>> Thanks,
>> Qu
> 
> Would the corrupted and correct dumps of the file, plus the kernel's
> kasan output, still help?  Or, with what you reproduced, do you have
> what you need?

The dumps are good enough; kasan would be a little overkill.

In my reproduced case the data is all good, so I'm unable to reproduce
the wild kernel memory corruption.

I could try corrupting it myself, but I'm not sure the same symptom can
be reproduced, so the binary good/bad dumps still help.
The heavy kasan lifting can be done in my environment.
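For the record, the reproducer above boils down to something like this
(a sketch; the device and paths are just placeholders):

# mount -o nodatasum /dev/sdX /mnt
# touch /mnt/foobar         # inode gets the NODATASUM flag
# mount -o remount,compress-force=lzo /mnt
# dd if=/dev/zero of=/mnt/foobar bs=4K count=1 conv=notrunc
                            # the write gets compressed anyway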

> 
> 
> On Mon, May 14, 2018 at 1:30 AM, Qu Wenruo <quwenruo.bt...@gmx.com> wrote:
>> So there is something wrong in that btrfs allows compressed data to
>> be generated for such a file.
>> (I could not reproduce the same behavior with a 4.16 kernel; could
>> such a problem happen in older kernels, or did it just get fixed
>> recently?)
>>
>> Then some corruption screwed up the compressed data, and when we
>> decompress, the kernel is screwed up.
> 
> In this thread, Chris Murphy noted systemd sets the "C" attribute, and
> discussed what sounds to me like what happened here: "Usually nocow
> also means no compression. But in the archives is a thread where I
> found that compression can be forced on nocow if the file is submitted
> for defragmentation

Oh, defrag makes things more complex here.
(A sketch of that defrag-forced-compression path follows the quoted
text below.)

But at least the kernel patch should also address that case.

> and either the volume is mounted with compression
> or the file has inherited chattr +c (I don't remember which or
> possibly both). And systemd does submit rotated logs for
> defragmentation."
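For reference, that path can be exercised with something like the
following (a sketch, assuming a filesystem mounted with -o compress=lzo;
the file path is a placeholder):

# touch /mnt/log
# chattr +C /mnt/log        # nocow, as systemd sets on its journals
# dd if=/dev/urandom of=/mnt/log bs=128K count=8
# btrfs filesystem defragment -clzo /mnt/log
                            # forces lzo compression despite nocow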
> 
> 
> 
> (If you don't need the dumps and kernel kasan output, you can skip the
> rest of this reply)
> 
> 
> 
>> Yep, even in the last case it still looks like kernel memory gets
>> corrupted.
>>
>> From the thread, since you have already located the corrupted mirror,
>> would you please provide the corrupted dump along with correct one?
>>
>> It would help a lot for us to understand what's going on.
> 
> Absolutely.  I'm not sure how to best get you that, though.

It's great that you'd like to help.
Considering you already have experience with btrfs-map-logical, it
should be a piece of cake.
And considering you're an Arch user, it won't really take much of your
time.
(Yeah, Arch users rock!)

> 
> "filefrag -v" on one of the files can be seen here:
> https://bugzilla.kernel.org/attachment.cgi?id=275953
> 
> It lists 58 fragments.

That filefrag output is less useful here; the main problem is that it
only provides the uncompressed extent size.
That's why I asked for the debug-tree dump.

And now those debug-tree dumps shine.

I'll take the dump for inode 72267 as an example.

------
        item 182 key (72267 EXTENT_DATA *0* ) itemoff 6375 itemsize 53
                generation 41083 type 1 (regular)
                extent data disk byte *2625134592* nr *4096*
                extent data offset 0 nr 131072 ram 131072
                extent compression 2 (lzo)
------
Important numbers are surrounded by '*'.

The "0" means the offset inside the file.
"2625134592" means the logical address that compressed extent lies.
"4096" means the compressed size of that extent.

So to dump it, pass the needed numbers to btrfs-map-logical. In this
example you just need:

# btrfs-map-logical -l 2625134592 -b 4096 <device>

And you would get output like the following (these numbers come from my
own test environment, hence they differ from the command above):
mirror 1 logical 1104150528 physical 9437184 device /dev/mapper/test-scratch2
mirror 2 logical 1104150528 physical 1095761920 device /dev/mapper/test-scratch1

Then grab them using dd.
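For example, to grab the 4 KiB copy from each mirror at the physical
offsets shown above (substitute your own devices and offsets):

# dd if=/dev/mapper/test-scratch2 of=mirror1.dump bs=1 skip=9437184 count=4096
# dd if=/dev/mapper/test-scratch1 of=mirror2.dump bs=1 skip=1095761920 count=4096

(bs=1 keeps the offset arithmetic trivial; for larger extents a bigger
block size with skip=offset/bs is faster.)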

> 
> filefrag lists the ending offsets and length based on the uncompressed
> sizes.  filefrag doesn't account for the compression.

Now, with the debug-tree dump, you have the lengths you need. :)

> 
> So, in this thread, I compared the first 4k of fragments 0-57 on each
> disk and found all the corruption was on disk 1.  (And the entire
> 207*4096 bytes on fragment 58.)  Grabbing more than 4k of each
> fragment brings in data from other files.

And indeed, from the dump they are all 4K in size.

>  So, I might have compared
> all of the data (fragments 0-57 are 128k uncompressed, and at least
> fragment 0 uncompressed does lzop down to about 2k, so perhaps all the
> other 128k fragments compress to within 4k, but maybe not) but this
> might not have grabbed all the data.

Any bad/good pair is enough.

> 
> I could give you (56) 128k, (1) 68k, and (1) 828k fragments, but
> they'd include extra data not involved, so you'd have to find a way to
> use them, and without the metadata saying how many bytes of each
> fragment to use, it might not be easy to put together.  (Maybe
> chopping off all the trailing 0's in each fragment would do the
> trick.)  Maybe the first 9 byte header on each fragment encodes the
> length actually used?

No need to worry now. :)

> 
> If this is useful to you, I'd be happy to provide it, along with the
> correct one.

Any pair which differs would be sufficient.

> 
> If there's a better way than this, I'd be happy to do that instead.  I
> of course can't just copy the file, so have to do something like dd or
> "btrfs-map-logical -o".  "btrfs-map-logical -o" can't automatically
> grab the proper length, because it needs a size argument, and if not
> given, defaults to the 16k nodesize.
> 
>> The dump indicates the same conclusion you reached.
>> The inode has the NODATACOW NODATASUM flags, which means it should
>> have neither csums nor compressed data.
>> While in fact we have tons of compressed extents.
>>
>> But the following fiemap result also shows that these extents get
>> shared. This could happen when there is a snapshot.
> 
> I do run snapper.
> 
>> To pin down the lzo decompress corruption, kasan would be a nice try.
>> However this means you need to enable it at compile time, and recompile
>> a kernel.
>> Not to mention kasan has a great impact on performance.
>>
>> But it should provide more info before memory gets corrupted.
> 
> Sure, it's compiling.  I'll probably be available to run it and send
> results in 14 hours, if needed.

No need to bother.

But it's still a pretty nice and fun way to learn how to hack on the
complex Linux kernel. :)
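(For reference, enabling kasan is just a compile-time config switch; a
minimal sketch:

CONFIG_KASAN=y
CONFIG_KASAN_INLINE=y       # inline mode: faster checks, larger kernel

then rebuild and boot the instrumented kernel.)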

Thanks,
Qu

> 
