Bug#982459: mdadm examine corrupts host ext4

2022-08-01 Thread Håkan T Johansson


On Sun, 31 Jul 2022, Chris Hofstaedtler wrote:


> I can't see a difference that should matter from userspace.
>
> I have stared a bit at the kernel code... there have been quite some
> changes and fixes in this area. Which kernel version were you
> running when testing this?
>
> Could you retry on something >= 5.9? I.e. some version with patch
>    08fc1ab6d748ab1a690fd483f41e2938984ce353.


Dear Chris,

I believe that I was running 5.10 (bullseye).

It looks like 5.18 (from backports) does not show the issue!  (i.e. works)

Some more details:

I have now tried again:

host:
  linux-image-5.10.0-16-amd64   5.10.127-2
  mdadm 4.2-1~bpo11+1
chroot:
  mdadm 4.1-11

  Some more details:

  This time I did get some dmesg BUG output as well (attached).
  The backtrace does not seem to be the same on the two occurrences.

  I also noticed that the BUG: report in dmesg does not happen directly
  when doing 'mdadm --examine --scan --config=partitions'.  It rather
  occurs when some activity happens on the host filesystem, e.g.
  a 'touch /root/a' command.

host:
  linux-image-5.18.0-0.bpo.1-amd64  5.18.2-1~bpo11+1

  (I did not reinstall anything else, except for upgrading zfs, also from
  backports, since the pure bullseye version would not compile with 5.18.)

  Does not exhibit the problem.

I have tried with both kernels several times, and it was repeatable that 
5.10 got stuck while 5.18 does not show issues.
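
When testing several hosts, the kernel comparison could be scripted. The sketch below only compares the major.minor part of a kernel release string against a threshold. The function name version_ge and the 5.18 threshold are illustrative assumptions (5.18 is simply the version that worked here, not a confirmed fix boundary), and it relies on GNU sort -V.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the major.minor of release $1 is >= $2.
# Relies on GNU sort -V for version ordering.
version_ge() {
    have=$(printf '%s\n' "$1" | cut -d. -f1,2)
    lowest=$(printf '%s\n%s\n' "$have" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# 5.18 is only the version observed to work in this report (assumption).
if version_ge "$(uname -r)" 5.18; then
    echo "kernel >= 5.18: matches the setup that did not show the issue"
else
    echo "kernel < 5.18: may match the setup that got stuck"
fi
```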


Reminder: to get the issue, /dev/ should not be mounted in the chroot.
With /dev/ mounted, 5.10 also works.
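
A possible guard for maintenance scripts, given the observation above: check that /dev inside the chroot is a mount point before running anything that may call mdadm. This is a minimal sketch. chroot_has_dev and the default CHROOT path are hypothetical, and mountpoint(1) from util-linux is assumed to be available.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if <chroot>/dev is a mount point.
chroot_has_dev() {
    mountpoint -q "$1/dev"
}

# CHROOT is an assumed example path; adjust to the real chroot location.
CHROOT=${CHROOT:-/srv/chroot/example}
if chroot_has_dev "$CHROOT"; then
    echo "/dev is mounted in $CHROOT"
else
    echo "/dev is NOT mounted in $CHROOT; skip mdadm and update-initramfs"
fi
```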

Best regards,
Håkan

[mån aug  1 15:53:08 2022] BUG: kernel NULL pointer dereference, address: 0010
[mån aug  1 15:53:08 2022] #PF: supervisor read access in kernel mode
[mån aug  1 15:53:08 2022] #PF: error_code(0x) - not-present page
[mån aug  1 15:53:08 2022] PGD 0 P4D 0 
[mån aug  1 15:53:08 2022] Oops:  [#1] SMP PTI
[mån aug  1 15:53:08 2022] CPU: 2 PID: 284256 Comm: cron Tainted: P   OE 5.10.0-16-amd64 #1 Debian 5.10.127-2
[mån aug  1 15:53:08 2022] Hardware name: Dell Computer Corporation PowerEdge 2850/0T7971, BIOS A04 09/22/2005
[mån aug  1 15:53:08 2022] RIP: 0010:__ext4_journal_get_write_access+0x29/0x120 [ext4]
[mån aug  1 15:53:08 2022] Code: 00 0f 1f 44 00 00 41 57 41 56 41 89 f6 41 55 41 54 49 89 d4 55 48 89 cd 53 48 83 ec 10 48 89 3c 24 e8 ab d7 bb e1 48 8b 45 30 <4c> 8b 78 10 4d 85 ff 74 2f 49 8b 87 e0 00 00 00 49 8b 9f 88 03 00
[mån aug  1 15:53:08 2022] RSP: 0018:ae27c059fd60 EFLAGS: 00010246
[mån aug  1 15:53:08 2022] RAX:  RBX: 9d1b94505480 RCX: 9d1bc52e5e38
[mån aug  1 15:53:08 2022] RDX: 9d1bc13782d8 RSI: 0c14 RDI: c096feb0
[mån aug  1 15:53:08 2022] RBP: 9d1bc52e5e38 R08: 9d1be04d5230 R09: 0001
[mån aug  1 15:53:08 2022] R10: 9d1bc985f000 R11: 001d R12: 9d1bc13782d8
[mån aug  1 15:53:08 2022] R13: 9d1be04d5000 R14: 0c14 R15: 9d1bc13782d8
[mån aug  1 15:53:08 2022] FS:  7fed5ecb1840() GS:9d1cd7c8() knlGS:
[mån aug  1 15:53:08 2022] CS:  0010 DS:  ES:  CR0: 80050033
[mån aug  1 15:53:08 2022] CR2: 0010 CR3: 0001a46d8000 CR4: 06e0
[mån aug  1 15:53:08 2022] Call Trace:
[mån aug  1 15:53:08 2022]  ext4_orphan_del+0x23f/0x290 [ext4]
[mån aug  1 15:53:08 2022]  ext4_evict_inode+0x31f/0x630 [ext4]
[mån aug  1 15:53:08 2022]  evict+0xd1/0x1a0
[mån aug  1 15:53:08 2022]  __dentry_kill+0xe4/0x180
[mån aug  1 15:53:08 2022]  dput+0x149/0x2f0
[mån aug  1 15:53:08 2022]  __fput+0xe4/0x240
[mån aug  1 15:53:08 2022]  task_work_run+0x65/0xa0
[mån aug  1 15:53:08 2022]  exit_to_user_mode_prepare+0x111/0x120
[mån aug  1 15:53:08 2022]  syscall_exit_to_user_mode+0x28/0x140
[mån aug  1 15:53:08 2022]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[mån aug  1 15:53:08 2022] RIP: 0033:0x7fed5eea2d77
[mån aug  1 15:53:08 2022] Code: 44 00 00 48 8b 15 19 a1 0c 00 f7 d8 64 89 02 b8 ff ff ff ff eb bc 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 01 c3 48 8b 15 e9 a0 0c 00 f7 d8 64 89 02 b8
[mån aug  1 15:53:08 2022] RSP: 002b:7ffd50452818 EFLAGS: 0202 ORIG_RAX: 0003
[mån aug  1 15:53:08 2022] RAX:  RBX: 55dab4578910 RCX: 7fed5eea2d77
[mån aug  1 15:53:08 2022] RDX: 7fed5ef6e8a0 RSI:  RDI: 0006
[mån aug  1 15:53:08 2022] RBP:  R08:  R09: 7fed5ef6dbe0
[mån aug  1 15:53:08 2022] R10: 006f R11: 0202 R12: 7fed5ef6f4a0
[mån aug  1 15:53:08 2022] R13:  R14:  R15: 0001
[mån aug  1 15:53:08 2022] Modules linked in: msr autofs4 nfsd auth_rpcgss nfsv3 nfs_acl nfs lockd grace sunrpc nfs_ssc fscache xt_mac xt_length xt_recent xt_multiport xt_tcpudp xt_state xt_conntrack
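
Since the backtrace differs between occurrences, one way to compare saved logs is to reduce each trace to its function names. This is a minimal sketch using sed. It assumes the 'dmesg -T' line layout seen in this log (timestamp in brackets, two spaces, then symbol+offset/length), and trace_functions is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: print the symbol names of call-trace lines read from
# stdin. Assumes lines shaped like "[timestamp]  symbol+0xOFF/0xLEN ...".
trace_functions() {
    sed -n 's|^\[[^]]*\]  \([A-Za-z0-9_.]*\)+0x[0-9a-f]*/0x[0-9a-f]*.*$|\1|p'
}

# Example with three lines taken from the trace in this message.
trace_functions <<'EOF'
[mån aug  1 15:53:08 2022] Call Trace:
[mån aug  1 15:53:08 2022]  ext4_orphan_del+0x23f/0x290 [ext4]
[mån aug  1 15:53:08 2022]  evict+0xd1/0x1a0
EOF
```

Running the helper over two saved dmesg captures and diffing the results would show where the two backtraces diverge.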

Bug#982459:

2021-08-15 Thread Håkan T Johansson


Hi,

I believe that I have been hit by this bug too.

What happens for me is that the machine in question 'almost' locks up:
/ becomes read-only, and most commands for further debugging never
complete because they wait on filesystem activity.  It then requires a
reboot.


'dmesg' has worked, and then shows ext4-related issues.  However, they 
were not recorded to /var/log.  I generally do not find any corruption on 
the filesystem itself when running fsck afterwards.


On the machine I have a number of chroot debian installations of different 
releases. By pure chance I found that 'update-initramfs' was the trigger 
for the system hangs. I could then repeatably trigger the issue again.
(Before this, it would happen as part of system maintenance (unattended
upgrades in the chroots), so the machine would just hang sporadically.)


In my case, the chroot installations live on a ZFS filesystem.  But the 
host system itself is on (multiple; /, /usr/, /var/ ) MD raid1.


I have had /proc mounted in the chroots.  But had forgotten /dev .  After 
mounting /dev (and /dev/pts) in the chroots, the issue has not happened 
again.
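
The mounts described above could be set up along these lines. This is a sketch only. mount_chroot_dev and the default CHROOT path are hypothetical. By default the function just prints the commands; set RUN=1 (as root) to execute them.

```shell
#!/bin/sh
# Hypothetical helper: bind-mount the host's /dev and /dev/pts into a chroot.
# Dry run by default; set RUN=1 to actually mount (requires root).
mount_chroot_dev() {
    root=$1
    for d in /dev /dev/pts; do
        if [ "${RUN:-0}" = 1 ]; then
            # Skip targets that are already mount points.
            mountpoint -q "$root$d" || mount --bind "$d" "$root$d"
        else
            echo "mount --bind $d $root$d"
        fi
    done
}

# CHROOT is an assumed example path; adjust to the real chroot location.
mount_chroot_dev "${CHROOT:-/srv/chroot/example}"
```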


The issue first appeared when the host system ran Buster.  I upgraded to
Bullseye ~2 weeks ago, hoping that would resolve it, but the issue was
still present after the upgrade.  Only after that upgrade did I find the
update-initramfs trigger.


I am running with sysvinit, both on host and chroots.

Currently, I do not have hands-on access to the system, so I cannot
inspect or reboot it reliably.  I should be able to do some further tests
in a few weeks.


Best regards,
Håkan