On 2018-04-14 21:45, Timo Nentwig wrote:
> On 04/14/2018 11:42 AM, Qu Wenruo wrote:
>> And the work load when the RO happens is also helpful.
>> (Well, the dmesg of RO happens would be the best though)
> Surprisingly nothing special AFAIR. It's a private, mostly idle machine.
> Probably "just" browsing with chrome.
> I didn't notice the remount right away as there were no obvious
> failures. And even then I kept it running for a couple more hours/a day
> or so.
> 
> I had a glance at dmesg but don't remember anything specific (think the
> usual "---- [cut here] ---" + dump of registers, but I'm not even sure
> about that). Sorry.
> 
> Actually the same thing happened just a few days earlier, and after a
> reboot (and maybe fsck) it was back up and good. I was optimistic it would
> go the same way this time as well :) In general I've had to hard-reset
> (+ fsck) a couple of times recently.

So after each hard reset, fsck was executed and btrfs check exposed no
problem (before the RW mount)?

That's interesting.
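
For reference, the read-only check (which never writes to the device) is:

# btrfs check --readonly /dev/sda2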

> Except for the SSD it's an
> all-new machine that I'm still OC/stress-testing. But not when that
> particular event happened.

A little off-topic: Linux + overclocking is not that common, in my opinion,
especially since we have nothing like AMD Ryzen Master or Intel XTU under
Linux.

>> Besides the above salvage method, please also consider providing the
>> following data, as your case is pretty special and may help us catch a
>> long-hidden bug.
> If only I had known, I would have saved dmesg! :)
> Sure, I'd be happy to help. If you need any more information just let me
> know.
>> 1) Extent tree dump
>>     Needs the above 2 patches applied first.
>>
>>     # btrfs inspect dump-tree -t extent /dev/sda2 &> \
>>       /tmp/extent_tree_dump
>>     If the above dump is too large, "grep -C20 166030671872" of the
>>     output is also good enough.
> 
> I'll send you a link to the full dump directly.

The grepped result is good enough; feel free to delete the full dump.
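
For reference, the grep can be run directly against the saved dump:

# grep -C20 166030671872 /tmp/extent_tree_dump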


>     item 16 key (166030671872 EXTENT_ITEM 4096) itemoff 3096 itemsize 51
>         refs 1 gen 1702074 flags TREE_BLOCK
>         tree block key (162793705472 EXTENT_ITEM 4096) level 0
>         tree block backref root 2

So at least btrfs still considers that this tree block belongs to the
extent tree.

>     item 17 key (166030671872 BLOCK_GROUP_ITEM 1073741824) itemoff 3072 itemsize 24
>         block group used 96915456 chunk_objectid 256 flags METADATA

Your metadata uses the SINGLE profile, the default for SSDs. Nothing
special here.
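
(For reference, on a mounted fs the profiles in use can be checked with:

# btrfs filesystem df /

Here it can be read directly from the block group flags above.)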

Currently, the problem looks like a log tree block got allocated on top of
an extent tree block (the log tree is the only place where btrfs allocates
tree blocks without updating the extent tree).
And when the log tree got replayed, your fs got corrupted.
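
If the log tree is indeed the culprit, it can also be discarded before the
next mount attempt. Note this throws away the data protected by the last
fsync(), so treat it as a last resort:

# btrfs rescue zero-log /dev/sda2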

Did you have several hard resets before the fs remounted itself RO?

>> 2) Super block dump
>>     # btrfs inspect dump-super -f /dev/sda2
> superblock: bytenr=65536, device=/dev/sda2
> ---------------------------------------------------------
> csum_type        0 (crc32c)
> csum_size        4
> csum            0xef0068ba [match]
> bytenr            65536
> flags            0x1
>             ( WRITTEN )
> magic            _BHRfS_M [match]
> fsid            22e778f7-2499-4379-99d2-cdd399d1cc6e
> label            830
> generation        1706541

The offending tree block has generation 1705980, which is 561 generations
(1706541 - 1705980) behind the current superblock generation.

Although it's hard to map that to real-world time, at least the problem was
not directly caused by your first automatic RO remount.

The problem must have existed for a while.

> root            167104118784
> sys_array_size        97
> chunk_root_generation    1702072
> root_level        1
> chunk_root        186120536064
> chunk_root_level    1
> log_root        180056702976
> log_root_transid    0

Not sure if this is common; I need to double-check later.

> log_root_level        0
> total_bytes        63879249920
> bytes_used        36929691648
> sectorsize        4096
> nodesize        4096

The nodesize is not the default 16K; any reason for this?
(Maybe performance?)
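
For reference, nodesize is fixed at mkfs time and can't be changed
afterwards, so a 4K nodesize must have been requested explicitly at
creation, e.g. (device name is just a placeholder):

# mkfs.btrfs -n 4096 /dev/sdXN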

>> 3) Extra hardware info about your sda
>>     Things like SMART and hardware model would also help here.
> smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.15.15-1-ARCH] (local build)
> Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
> 
> === START OF INFORMATION SECTION ===
> Model Family:     Samsung based SSDs
> Device Model:     SAMSUNG SSD 830 Series

At least I haven't heard of many problems with Samsung SSDs, so I don't
think the hardware is to blame. (Unlike the Intel 600p.)


>> 4) The mount option of /dev/sda2
> 
> /dev/sda2    /    btrfs    compress=zstd,discard,autodefrag,subvol=/    0    0

The discard mount option used to cause some problems, but those should be
fixed in recent releases IIRC.

Despite that, the discard option is still not recommended IIRC; a routine
fstrim is preferred instead.
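
For example, either a one-shot trim or the fstrim.timer unit that
util-linux ships on most distributions:

# fstrim -v /
# systemctl enable --now fstrim.timer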

> 
> And if that matters (AFAIK subvolume mount options have no effect anyway):
> 
> /dev/sda2    /var/lib/postgres    btrfs    compress=zstd,discard,nodatacow,subvol=var/lib/postgres    0    0

An RDBMS like PostgreSQL can issue a lot of fsync() calls; at least this
explains why the tree log is so large.

To be safe, it's recommended to use the notreelog mount option, which
degrades fsync() to a full sync() for btrfs, so no log tree will be used.
Although it will impact fsync() performance, it could help us determine
whether the tree log is really to blame.
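
A sketch of how to apply it without a reboot (assuming your fstab layout
above):

# mount -o remount,notreelog /

Adding notreelog to the option lists in /etc/fstab makes it persistent.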

> /dev/sda2    /var/cache    btrfs    compress=off,discard,subvol=var/cache    0    0
> /dev/sda2    /var/tmp    btrfs    compress=zstd,discard,subvol=var/tmp    0    0
> 
>> Thanks,
>> Qu
> 
> Got a couple of these:
> We seem to be looping a lot on /mnt/sda2/var/lib/postgres/data/.., do
> you want to keep going on ? (y/N/a): y

I'm not familiar with btrfs restore, so it's hard to say.
But it seems to report false alerts quite a lot, so keeping it running
seems fine.

Another way to verify whether only your extent tree is corrupted is btrfs
inspect dump-tree:

# btrfs inspect dump-tree -t <subvolid> /dev/sda2 > /dev/null

If nothing is printed to stderr for any of your subvolids, then it should
be pretty safe.
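
A minimal sketch to check all subvolumes in one go (assuming the fs can
still be mounted so the subvolume IDs can be listed; 5 is the top-level
tree):

# for id in 5 $(btrfs subvolume list / | awk '{print $2}'); do
      btrfs inspect dump-tree -t "$id" /dev/sda2 > /dev/null \
          || echo "subvolume $id has errors"
  done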

Thanks for your info; it indeed provides a pretty useful clue here.
The 4K nodesize (thus a taller tree with smaller lock ranges) plus the
RDBMS workload may be the key to the problem.


Thanks,
Qu

> 
> Is this something I need to be worried about? Postgres did at least
> start up.
> 
> 
> Thanks a lot for your help!
> Timo