On 2021/1/18 7:38 AM, chainofflowers wrote:
> Hi all,
> Hi Qu,
>
> I am also getting this very same error on my system.
> Actually, I have been experiencing it ever since the following old bug
> was introduced, and even after that bug was fixed:
> https://lore.kernel.org/linux-btrfs/20190521190023.GA68070@glet/T/
>
> That (dm-related) bug was claimed to be fixed, and users confirmed that
> their btrfs partitions were working correctly again, but I am still
> experiencing issues from time to time - and, notably, only on SSD
> devices.
>
> Just to clarify: I am using btrfs volumes on encrypted LUKS partitions,
> where every partition is encrypted individually.
> I am *NOT* using LVM at all: just btrfs directly on top of LUKS (which
> is different from the users' setups in the above-mentioned bug report).
> I trim the partitions only via fstrim: the mount points are configured
> with the "nodiscard" option, and the LUKS volumes in /etc/crypttab with
> "discard", so that discard requests pass through when I run fstrim.
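
For reference, a minimal sketch of that kind of pass-through setup (the
device names, UUIDs and mount points below are placeholders, not taken
from the setup above):

   # /etc/crypttab: "discard" lets TRIM requests pass through dm-crypt
   cryptroot  UUID=<luks-partition-uuid>  none  luks,discard

   # /etc/fstab: "nodiscard" disables online discard on the btrfs mount,
   # so the device is trimmed only when fstrim is run explicitly
   /dev/mapper/cryptroot  /  btrfs  defaults,nodiscard  0  0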

> Unlike Justin, my partitions are all aligned.
>
> When this happens on my root partition, I cannot launch any command
> anymore because the file system is not responding (e.g. I get "ls:
> command not found"). In fact I cannot do anything at all, because the
> system console is flooded with messages like:
>
>   "sd <....> [sdX] tag#29 access beyond end of device"

The best way to debug such a problem is to recompile the kernel with
some extra debug output added.
(Maybe it can also be done with bpftrace, but I'm not yet familiar with
that - a rough sketch is below.)
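
Something along these lines might work, but treat it as an untested
sketch; it assumes btrfs_issue_discard() has not been inlined by the
compiler (it is a static function, so the probe may fail to attach on
some builds):

   # arg1/arg2 map to the start/len parameters of
   # btrfs_issue_discard(bdev, start, len, &discarded_bytes)
   bpftrace -e 'kprobe:btrfs_issue_discard {
           printf("discard start=%llu len=%llu\n", arg1, arg2);
   }'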

If you're able to recompile the kernel (using abs + makepkg for an
Arch-based kernel), please try the following diff.

This will add extra debugging to show where the offending length comes
from - either an extent discard or an unallocated-space discard.
From that output we can continue the investigation.

Thanks,
Qu

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 30b1a630dc2f..7451fa0b14b9 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5776,6 +5776,7 @@ static int btrfs_trim_free_extents(struct btrfs_device *device, u64 *trimmed)

        ret = 0;

+       pr_info("%s: enter devid=%llu\n", __func__, device->devid);
        while (1) {
                struct btrfs_fs_info *fs_info = device->fs_info;
                u64 bytes;
@@ -5820,6 +5821,8 @@ static int btrfs_trim_free_extents(struct btrfs_device *device, u64 *trimmed)
                        break;
                }

+               pr_info("%s: devid=%llu start=%llu len=%llu\n",
+                       __func__, device->devid, start, len);
                ret = btrfs_issue_discard(device->bdev, start, len,
                                          &bytes);
                if (!ret)
@@ -5842,6 +5845,7 @@ static int btrfs_trim_free_extents(struct btrfs_device *device, u64 *trimmed)
                cond_resched();
        }

+       pr_info("%s: done devid=%llu ret=%d\n", __func__, device->devid,
ret);
        return ret;
 }

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 379bef967e1d..03046fca53a2 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -3772,6 +3772,8 @@ int btrfs_trim_block_group(struct btrfs_block_group *block_group,
                spin_unlock(&block_group->lock);
                return 0;
        }
+       pr_info("%s: enter bg start=%llu start=%llu end=%llu minlen=%llu\n",
+               __func__, block_group->start, start, end, minlen);
        btrfs_freeze_block_group(block_group);
        spin_unlock(&block_group->lock);

@@ -3786,6 +3788,8 @@ int btrfs_trim_block_group(struct btrfs_block_group *block_group,
                reset_trimming_bitmap(ctl, offset_to_bitmap(ctl, end));
 out:
        btrfs_unfreeze_block_group(block_group);
+       pr_info("%s: enter bg start=%llu ret=%d\n",
+               __func__, block_group->start, ret);
        return ret;
 }
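
To apply the diff above and rebuild, something along these lines should
work (the source directory and the diff file name are just examples):

   cd linux                           # your kernel source tree
   patch -p1 < btrfs-trim-debug.diff  # the diff above, saved to a file
   make olddefconfig
   make -j$(nproc)
   sudo make modules_install install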



> and that "clogs" the system. Since the root fs is unusable, the system
> log cannot store those messages, so I can't find them after the next
> reboot.
> I can only soft-reset (CTRL-ALT-DEL); it's the "cleanest" (and only)
> way I can get back to a working system.
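
For what it's worth, one way to capture kernel messages even when the
root fs is unusable is netconsole, which streams the kernel log over UDP
to another machine. A minimal sketch - the interface name and target IP
are placeholders:

   # On the affected machine, before the problem occurs:
   modprobe netconsole netconsole=@/enp3s0,@192.168.1.10/
   # On the receiving machine (netconsole sends to UDP port 6666 by default):
   nc -u -l 6666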

> When the system restarts, it takes 2 seconds longer than usual to mount
> the file systems, and then I can use the PC again.
> Immediately after login, if I run btrfs scrub, I get no errors (I scrub
> ALL of my partitions: they're all fine). So it seems that at least the
> auto-recovery capability of btrfs works fine, thanks to you devs :-)

> Then, if I boot from an external device and run btrfs check on the
> unmounted file systems, it also reports NO errors - the opposite of
> what happened while the dm bug was still open. To me this really means
> that btrfs today is able to heal itself from this issue (that was not
> always the case in 2019, when the dm bug was reported).
> I have not yet tried booting from an external device directly after
> the issue occurs - that is, running btrfs check without first going
> through the btrfs scrub step. I will do that next time and see what
> output I get.

> All my partitions are snapshotted, and surely this could help with
> auto-recovery.

> What I have noticed is that when this bug happens, it ALWAYS happens
> after I have purged the old snapshots - that is, when the root
> partition has only one "fresh" (read-only) snapshot. It never happens
> when I have more than one snapshot. Maybe that means nothing, but it
> looks systematic to me.

> I have attached a file with my setup.
> Could you maybe spot anything weird there? It looks fine to me. The
> USER and SCRATCH volumes are in RAID-0.

> I am unable to provide any dmesg output or system log because, as I
> said, when this happens nothing gets written to the SYS partition
> (where /var/log is). I will move at least /var/log/journal to another
> device, so hopefully next time I will be able to provide some useful
> info.
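
A minimal sketch of that idea - a dedicated, unaffected device mounted
at /var/log/journal, so journald can keep writing while the root fs
hangs; the device name and fs type are placeholders:

   # /etc/fstab
   /dev/sdb1  /var/log/journal  ext4  defaults  0  2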

> One more piece of information: of course, I have tried (twice!) to
> rebuild the system SSD from scratch, because I wanted to be sure this
> did not depend on some exotic hardware issue, and each time I used a
> brand-new device.
> So this issue has occurred with a SanDisk Ultra II and with two
> different Samsung EVO 860 drives.

> Is it possible that what we are experiencing is still an effect of
> that dm bug, i.e. that it was not completely fixed?


> Thanks for your help, and for reading this far :)

> (c)
