Bug#982459: mdadm examine corrupts host ext4
On Sun, 31 Jul 2022, Chris Hofstaedtler wrote:

> I can't see a difference that should matter from userspace. I have stared a bit at the kernel code... there have been quite some changes and fixes in this area. Which kernel version were you running when testing this? Could you retry on something >= 5.9, i.e. some version with patch 08fc1ab6d748ab1a690fd483f41e2938984ce353?

Dear Chris,

I believe that I was running 5.10 (bullseye). It looks like 5.18 (from backports) does not show the issue, i.e. it works.

Some more details. I have now tried again with:

host: linux-image-5.10.0-16-amd64 5.10.127-2, mdadm 4.2-1~bpo11+1
chroot: mdadm 4.1-11

This time I did get some dmesg BUG output as well (attached). The backtrace was not the same on the two occurrences. I also noticed that the BUG report in dmesg does not appear directly when running 'mdadm --examine --scan --config=partitions'; rather, it occurs when some activity happens on the host filesystem afterwards, e.g. a 'touch /root/a' command.

host: linux-image-5.18.0-0.bpo.1-amd64 5.18.2-1~bpo11+1 (nothing else re-installed, except ZFS upgraded, also from backports, since the pure bullseye version would not compile with 5.18)

This kernel does not exhibit the problem. I tried both kernels several times, and it was repeatable: 5.10 got stuck, while 5.18 showed no issues.

Reminder: to trigger the issue, /dev should not be mounted in the chroot. With /dev mounted, 5.10 also works.
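For reference, the reproduction described above can be sketched as the following commands. The chroot path /srv/chroot/bullseye is a hypothetical placeholder; the commands must be run as root on a disposable test machine, since on an affected 5.10 kernel they can hang the host:

```shell
# Sketch of the reproduction, assuming an affected 5.10 kernel and an
# existing bullseye chroot at the hypothetical path /srv/chroot/bullseye.
# /proc is mounted in the chroot, but /dev is deliberately NOT mounted --
# with /dev bind-mounted, the problem does not occur.
mount -t proc proc /srv/chroot/bullseye/proc

# Run the chroot's older mdadm (4.1-11 in the report); this is the
# command that sets up the corruption.
chroot /srv/chroot/bullseye mdadm --examine --scan --config=partitions

# The BUG does not fire immediately; it is triggered by subsequent
# activity on the host filesystem, e.g.:
touch /root/a
```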
Best regards,
Håkan

[mån aug 1 15:53:08 2022] BUG: kernel NULL pointer dereference, address: 0010
[mån aug 1 15:53:08 2022] #PF: supervisor read access in kernel mode
[mån aug 1 15:53:08 2022] #PF: error_code(0x) - not-present page
[mån aug 1 15:53:08 2022] PGD 0 P4D 0
[mån aug 1 15:53:08 2022] Oops: [#1] SMP PTI
[mån aug 1 15:53:08 2022] CPU: 2 PID: 284256 Comm: cron Tainted: P OE 5.10.0-16-amd64 #1 Debian 5.10.127-2
[mån aug 1 15:53:08 2022] Hardware name: Dell Computer Corporation PowerEdge 2850/0T7971, BIOS A04 09/22/2005
[mån aug 1 15:53:08 2022] RIP: 0010:__ext4_journal_get_write_access+0x29/0x120 [ext4]
[mån aug 1 15:53:08 2022] Code: 00 0f 1f 44 00 00 41 57 41 56 41 89 f6 41 55 41 54 49 89 d4 55 48 89 cd 53 48 83 ec 10 48 89 3c 24 e8 ab d7 bb e1 48 8b 45 30 <4c> 8b 78 10 4d 85 ff 74 2f 49 8b 87 e0 00 00 00 49 8b 9f 88 03 00
[mån aug 1 15:53:08 2022] RSP: 0018:ae27c059fd60 EFLAGS: 00010246
[mån aug 1 15:53:08 2022] RAX: RBX: 9d1b94505480 RCX: 9d1bc52e5e38
[mån aug 1 15:53:08 2022] RDX: 9d1bc13782d8 RSI: 0c14 RDI: c096feb0
[mån aug 1 15:53:08 2022] RBP: 9d1bc52e5e38 R08: 9d1be04d5230 R09: 0001
[mån aug 1 15:53:08 2022] R10: 9d1bc985f000 R11: 001d R12: 9d1bc13782d8
[mån aug 1 15:53:08 2022] R13: 9d1be04d5000 R14: 0c14 R15: 9d1bc13782d8
[mån aug 1 15:53:08 2022] FS: 7fed5ecb1840() GS:9d1cd7c8() knlGS:
[mån aug 1 15:53:08 2022] CS: 0010 DS: ES: CR0: 80050033
[mån aug 1 15:53:08 2022] CR2: 0010 CR3: 0001a46d8000 CR4: 06e0
[mån aug 1 15:53:08 2022] Call Trace:
[mån aug 1 15:53:08 2022]  ext4_orphan_del+0x23f/0x290 [ext4]
[mån aug 1 15:53:08 2022]  ext4_evict_inode+0x31f/0x630 [ext4]
[mån aug 1 15:53:08 2022]  evict+0xd1/0x1a0
[mån aug 1 15:53:08 2022]  __dentry_kill+0xe4/0x180
[mån aug 1 15:53:08 2022]  dput+0x149/0x2f0
[mån aug 1 15:53:08 2022]  __fput+0xe4/0x240
[mån aug 1 15:53:08 2022]  task_work_run+0x65/0xa0
[mån aug 1 15:53:08 2022]  exit_to_user_mode_prepare+0x111/0x120
[mån aug 1 15:53:08 2022]  syscall_exit_to_user_mode+0x28/0x140
[mån aug 1 15:53:08 2022]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[mån aug 1 15:53:08 2022] RIP: 0033:0x7fed5eea2d77
[mån aug 1 15:53:08 2022] Code: 44 00 00 48 8b 15 19 a1 0c 00 f7 d8 64 89 02 b8 ff ff ff ff eb bc 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 01 c3 48 8b 15 e9 a0 0c 00 f7 d8 64 89 02 b8
[mån aug 1 15:53:08 2022] RSP: 002b:7ffd50452818 EFLAGS: 0202 ORIG_RAX: 0003
[mån aug 1 15:53:08 2022] RAX: RBX: 55dab4578910 RCX: 7fed5eea2d77
[mån aug 1 15:53:08 2022] RDX: 7fed5ef6e8a0 RSI: RDI: 0006
[mån aug 1 15:53:08 2022] RBP: R08: R09: 7fed5ef6dbe0
[mån aug 1 15:53:08 2022] R10: 006f R11: 0202 R12: 7fed5ef6f4a0
[mån aug 1 15:53:08 2022] R13: R14: R15: 0001
[mån aug 1 15:53:08 2022] Modules linked in: msr autofs4 nfsd auth_rpcgss nfsv3 nfs_acl nfs lockd grace sunrpc nfs_ssc fscache xt_mac xt_length xt_recent xt_multiport xt_tcpudp xt_state xt_conntrack
Hi,

I believe that I have been hit by this bug too.

What happens for me is that the machine in question 'almost' locks up, with a read-only /, such that most commands I try for further debugging never complete, as they wait on filesystem activity. It then requires a reboot. 'dmesg' has worked, and shows ext4-related issues; however, these were not recorded to /var/log. I generally do not find any corruption on the filesystem itself when running fsck afterwards.

On the machine I have a number of chroot Debian installations of different releases. By pure chance I found that 'update-initramfs' was the trigger for the system hangs, and I could then repeatably trigger the issue. (Before this, it would happen as part of system maintenance (unattended upgrades in the chroots), so it would just spuriously hang the machine.)

In my case, the chroot installations live on a ZFS filesystem, but the host system itself is on MD raid1 (multiple arrays: /, /usr, /var). I have had /proc mounted in the chroots, but had forgotten /dev. After mounting /dev (and /dev/pts) in the chroots, the issue has not happened again.

The issue first appeared while the host system ran Buster. I upgraded to Bullseye ~2 weeks ago, hoping it would be resolved, but the issue was still present after the upgrade; only after that upgrade did I find the update-initramfs trigger. I am running sysvinit, both on the host and in the chroots.

Currently I do not have hands-on access to the system, so I cannot inspect or reboot it reliably. I should be able to do some further tests in a few weeks.

Best regards,
Håkan
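The workaround described above (keeping /dev and /dev/pts mounted in the chroots) can be made persistent with bind-mount entries in the host's /etc/fstab. This is only a sketch: the chroot path /srv/chroot/bullseye is a hypothetical placeholder, and one pair of entries is needed per chroot.

```
# /etc/fstab fragment -- bind /dev and /dev/pts into the chroot so that
# tools run inside it (e.g. mdadm) see the host's device nodes.
# /srv/chroot/bullseye is a placeholder path.
/dev      /srv/chroot/bullseye/dev      none  bind  0 0
/dev/pts  /srv/chroot/bullseye/dev/pts  none  bind  0 0
```

The same effect can be had ad hoc with 'mount --bind /dev /srv/chroot/bullseye/dev' (and likewise for /dev/pts) before entering the chroot.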