Bug#1071501: linux-image-6.1.0-21-arm64: Linux NFS client hangs in nfs4_lookup_revalidate
Dear Salvatore,

I've already started bisecting. It will take some time. The bug usually appears after a few hours, and unfortunately I am not able to trigger it faster. So if the bug appears, I can step forward easily, but if it does not, it is hard to decide whether it is still present and simply has not occurred yet, or whether the current version is a good one. I'll do my best.

I will also contact the linux-nfs mailing list.

As I remember, it started nearly a year ago, when I switched to Debian's kernel. I don't know exactly which version it was at that time. However, I've checked Debian's patches, and I did not find anything related to NFS.

Regards,
Richard

On 2024-05-20 21:07, Salvatore Bonaccorso wrote:
> Hi Richard,
>
> On Mon, May 20, 2024 at 09:27:24AM +, Richard Kojedzinszky wrote:
>> Package: src:linux
>> Version: 6.1.90-1
>> Severity: normal
>> X-Debbugs-Cc: richard+debian+bugrep...@kojedz.in
>>
>> Dear Maintainer,
>>
>> I am running Kubernetes on Debian, and pods mount multiple NFS
>> shares. I am running dovecot processes in pods, which receive mail from
>> the internet and also serve as an IMAP server for clients. I monitor
>> my mail system by sending mails periodically (every 15 seconds) and
>> also downloading them via IMAP. A few times I found a dovecot process
>> stuck in D state; a reboot was always needed to recover from that state.
>>
>> Unfortunately, I was not able to trigger the bug quickly; I don't
>> really know which operations dovecot issues, and in what order, to trigger
>> this behavior. So until I get closer, I've set up a similar but smaller
>> environment with just a single dovecot process, which does the
>> same work: delivering only test mails locally and serving them via IMAP
>> to the monitoring client, storing everything on NFS. Fortunately, this also
>> triggers the bug: after a few hours, one of the dovecot processes is stuck
>> in D state. The kernel also reports the blocked state:
>
> As you seem to be in the lucky position of being able to trigger the
> issue in a more localized setup, could you:
>
> - Also try more recent kernels from the upper suites (6.8.9-1 in
>   unstable would be ideal, to check whether the issue is present there
>   as well).
>
> - I read that you cannot trigger it with 5.15. If you build 6.1.90 from
>   upstream without Debian patches, I assume you can trigger the issue
>   likewise? If so, could you bisect the changes introducing the issue?
>
> This is a cumbersome process, in particular if you need a few hours to
> trigger it. So maybe the following point could be done first:
>
> - Can you report the issue to the linux-nfs list, keeping us in the
>   loop?
>
> Regards,
> Salvatore
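The reply above points at the hard part of this bisect: deciding how long a kernel must run without hanging before it can be marked good. A rough sketch of that trade-off follows. It assumes, purely for illustration, that hangs arrive at a roughly constant rate, so that the time to the first hang is exponentially distributed; nothing in this report confirms that model, and the function name and the numbers are hypothetical. Under that assumption a bad kernel survives t hours with probability exp(-t/mean), so the wait needed to keep the risk of a false "good" below p is -mean * ln(p).

```python
import math


def hours_to_wait(mean_hours_to_hang: float, false_good_risk: float) -> float:
    """Hours a kernel must run without hanging before it is called good.

    Assumes an exponential time-to-hang model (constant hazard rate).
    That model is an illustrative assumption, not a measured property
    of this bug.
    """
    if mean_hours_to_hang <= 0:
        raise ValueError("mean_hours_to_hang must be positive")
    if not 0 < false_good_risk < 1:
        raise ValueError("false_good_risk must be strictly between 0 and 1")
    # P(no hang within t | kernel is bad) = exp(-t / mean) <= risk
    return -mean_hours_to_hang * math.log(false_good_risk)


if __name__ == "__main__":
    # Hypothetical numbers: a hang after about 4 hours on average,
    # and a 5% chance of wrongly marking a bad kernel as good.
    print(f"{hours_to_wait(4.0, 0.05):.1f} hours")
```

With those hypothetical numbers the wait comes out at about 12 hours per bisect step, which shows why a wrongly marked "good" step is the main risk with a bug that takes hours to appear.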
Bug#1071501: linux-image-6.1.0-21-arm64: Linux NFS client hangs in nfs4_lookup_revalidate
On Monday, 20 May 2024 21:07:49 CEST Salvatore Bonaccorso wrote:
> - I did read you cannot trigger with 5.15. If you build 6.1.90 from
>   upstream without Debian patches I assume you can trigger the issue
>   likewise? If so could you bisect the changes introducing the issue?

If the test with the upstream 6.1.90 version also has this problem, there's another (series of) test(s) worth doing, which could shorten the bisect operation significantly.

I got the impression that you have only tried it with version 6.1.90. Have you tried it with earlier versions in the 6.1 series to see if the issue is present there?

Via https://snapshot.debian.org/package/linux-signed-arm64/ you can find earlier versions from the 6.1 series already compiled and packaged. To take version 6.1.52 as (random) example:

- click on the ``6.1.52+1`` link
- In the ``Binary packages`` list, click on the linux-image-6.1.0-X-arm64 link, where 'X' is 12 in this case
- Click the ``linux-image-6.1.0-12-arm64_6.1.52-1_arm64.deb`` link to download the deb file which you can then install as root or with sudo by executing ``apt install ./linux-image-_arm64.deb``

If the problem does NOT occur with 6.1.52-1, then you try a higher version. Continue that process until you've found the latest version that works and the earliest version where it stopped working. If the problem also occurs with 6.1.52-1, then you try an (even) older version.

This is to test whether it was a regression *within* the 6.1 series and if so, to get the narrowest range without having to compile yourself.

HTH
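The narrowing procedure described above is in effect a binary search over an ordered list of packaged versions. Here is a minimal sketch of that idea. It assumes the issue, once introduced, is present in every later version. The `is_bad` callback is hypothetical and stands for "install this package, boot it, run the test load long enough, and report whether the hang occurred". The version list in the demo is illustrative only; the versions actually available should be taken from the snapshot page. None of these names belong to any Debian tool.

```python
from typing import Callable, Optional, Sequence, Tuple


def narrow_regression(
    versions: Sequence[str],
    is_bad: Callable[[str], bool],
) -> Tuple[Optional[str], Optional[str]]:
    """Return (last_good, first_bad) for versions ordered oldest to newest.

    last_good is None if every version is bad;
    first_bad is None if no version is bad.
    """
    # Invariant: everything before index lo tested good,
    # everything from index hi onwards tested bad.
    lo, hi = 0, len(versions)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid
        else:
            lo = mid + 1
    last_good = versions[lo - 1] if lo > 0 else None
    first_bad = versions[lo] if lo < len(versions) else None
    return last_good, first_bad


if __name__ == "__main__":
    # Illustrative version strings, oldest first (hypothetical list).
    candidates = ["6.1.52-1", "6.1.66-1", "6.1.76-1", "6.1.90-1"]

    # Stand-in for a real multi-hour test run: pretend, for demonstration
    # only, that the hang first shows up in the third entry.
    def pretend_is_bad(version: str) -> bool:
        return candidates.index(version) >= 2

    print(narrow_regression(candidates, pretend_is_bad))
```

Each call to `is_bad` corresponds to one multi-hour test run, so halving the range at every step keeps the number of runs close to log2 of the number of candidate packages.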
Bug#1071501: linux-image-6.1.0-21-arm64: Linux NFS client hangs in nfs4_lookup_revalidate
Hi Richard,

On Mon, May 20, 2024 at 09:27:24AM +, Richard Kojedzinszky wrote:
> Package: src:linux
> Version: 6.1.90-1
> Severity: normal
> X-Debbugs-Cc: richard+debian+bugrep...@kojedz.in
>
> Dear Maintainer,
>
> I am running Kubernetes on Debian, and pods mount multiple NFS
> shares. I am running dovecot processes in pods, which receive mail from
> the internet and also serve as an IMAP server for clients. I monitor
> my mail system by sending mails periodically (every 15 seconds) and
> also downloading them via IMAP. A few times I found a dovecot process
> stuck in D state; a reboot was always needed to recover from that state.
>
> Unfortunately, I was not able to trigger the bug quickly; I don't
> really know which operations dovecot issues, and in what order, to trigger
> this behavior. So until I get closer, I've set up a similar but smaller
> environment with just a single dovecot process, which does the
> same work: delivering only test mails locally and serving them via IMAP
> to the monitoring client, storing everything on NFS. Fortunately, this also
> triggers the bug: after a few hours, one of the dovecot processes is stuck
> in D state. The kernel also reports the blocked state:

As you seem to be in the lucky position of being able to trigger the issue in a more localized setup, could you:

- Also try more recent kernels from the upper suites (6.8.9-1 in unstable would be ideal, to check whether the issue is present there as well).

- I read that you cannot trigger it with 5.15. If you build 6.1.90 from upstream without Debian patches, I assume you can trigger the issue likewise? If so, could you bisect the changes introducing the issue?

This is a cumbersome process, in particular if you need a few hours to trigger it. So maybe the following point could be done first:

- Can you report the issue to the linux-nfs list, keeping us in the loop?

Regards,
Salvatore
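Since every bisect step depends on noticing that a dovecot process has become stuck in D state, one option is to poll /proc rather than wait for the kernel's 120-second hung-task message. A minimal sketch follows, assuming the standard Linux /proc/<pid>/stat layout ("pid (comm) state ..."). The function names `parse_stat` and `find_d_state` are made up for this example. Only the main thread of each process is inspected; threads under /proc/<pid>/task are not.

```python
import os
from typing import List, Tuple


def parse_stat(line: str) -> Tuple[int, str, str]:
    """Split a /proc/<pid>/stat line into (pid, comm, state).

    comm is wrapped in parentheses and may itself contain spaces or
    parentheses, so split on the first '(' and the last ')'.
    """
    left = line.index("(")
    right = line.rindex(")")
    pid = int(line[:left].strip())
    comm = line[left + 1:right]
    state = line[right + 1:].split()[0]
    return pid, comm, state


def find_d_state(proc_root: str = "/proc") -> List[Tuple[int, str]]:
    """Return (pid, comm) for every process whose main thread is in state D."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited during the scan, or entry unreadable
        if state == "D":
            blocked.append((pid, comm))
    return blocked


if __name__ == "__main__":
    for pid, comm in find_d_state():
        print(pid, comm)
```

A process can sit in D state briefly during ordinary I/O, so a single hit means little. A process that shows up in every poll over several minutes is the likelier sign of the hang reported here.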
Bug#1071501: linux-image-6.1.0-21-arm64: Linux NFS client hangs in nfs4_lookup_revalidate
Package: src:linux
Version: 6.1.90-1
Severity: normal
X-Debbugs-Cc: richard+debian+bugrep...@kojedz.in

Dear Maintainer,

I am running Kubernetes on Debian, and pods mount multiple NFS shares. I am running dovecot processes in pods, which receive mail from the internet and also serve as an IMAP server for clients. I monitor my mail system by sending mails periodically (every 15 seconds) and also downloading them via IMAP. A few times I found a dovecot process stuck in D state; a reboot was always needed to recover from that state.

Unfortunately, I was not able to trigger the bug quickly; I don't really know which operations dovecot issues, and in what order, to trigger this behavior. So until I get closer, I've set up a similar but smaller environment with just a single dovecot process, which does the same work: delivering only test mails locally and serving them via IMAP to the monitoring client, storing everything on NFS. Fortunately, this also triggers the bug: after a few hours, one of the dovecot processes is stuck in D state. The kernel also reports the blocked state:

May 19 12:16:49 k8s-node07 kernel: INFO: task lmtp:665683 blocked for more than 120 seconds.
May 19 12:16:49 k8s-node07 kernel: Not tainted 6.1.0-21-arm64 #1 Debian 6.1.90-1
May 19 12:16:49 k8s-node07 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 19 12:16:49 k8s-node07 kernel: task:lmtp state:D stack:0 pid:665683 ppid:2881 flags:0x
May 19 12:16:49 k8s-node07 kernel: Call trace:
May 19 12:16:49 k8s-node07 kernel: __switch_to+0xf0/0x170
May 19 12:16:49 k8s-node07 kernel: __schedule+0x340/0x940
May 19 12:16:49 k8s-node07 kernel: schedule+0x58/0xf0
May 19 12:16:49 k8s-node07 kernel: __nfs_lookup_revalidate+0x118/0x160 [nfs]
May 19 12:16:49 k8s-node07 kernel: nfs4_lookup_revalidate+0x20/0x30 [nfs]
May 19 12:16:49 k8s-node07 kernel: lookup_fast+0x138/0x150
May 19 12:16:49 k8s-node07 kernel: walk_component+0x30/0x1a0
May 19 12:16:49 k8s-node07 kernel: path_lookupat+0x80/0x1a4
May 19 12:16:49 k8s-node07 kernel: filename_lookup+0xb4/0x1b0
May 19 12:16:49 k8s-node07 kernel: vfs_statx+0x94/0x19c
May 19 12:16:49 k8s-node07 kernel: vfs_fstatat+0x68/0x90
May 19 12:16:49 k8s-node07 kernel: __do_sys_newfstatat+0x58/0xa0
May 19 12:16:49 k8s-node07 kernel: __arm64_sys_newfstatat+0x28/0x34
May 19 12:16:49 k8s-node07 kernel: invoke_syscall+0x78/0x100
May 19 12:16:49 k8s-node07 kernel: el0_svc_common.constprop.0+0x4c/0xf4
May 19 12:16:49 k8s-node07 kernel: do_el0_svc+0x34/0xd0
May 19 12:16:49 k8s-node07 kernel: el0_svc+0x34/0xd4
May 19 12:16:49 k8s-node07 kernel: el0t_64_sync_handler+0xf4/0x120
May 19 12:16:49 k8s-node07 kernel: el0t_64_sync+0x18c/0x190

Or, for another process:

May 20 04:50:01 k8s-node07 kernel: INFO: task imap:8337 blocked for more than 120 seconds.
May 20 04:50:01 k8s-node07 kernel: Not tainted 6.1.0-21-arm64 #1 Debian 6.1.90-1
May 20 04:50:01 k8s-node07 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 20 04:50:01 k8s-node07 kernel: task:imap state:D stack:0 pid:8337 ppid:3164 flags:0x
May 20 04:50:01 k8s-node07 kernel: Call trace:
May 20 04:50:01 k8s-node07 kernel: __switch_to+0xf0/0x170
May 20 04:50:01 k8s-node07 kernel: __schedule+0x340/0x940
May 20 04:50:01 k8s-node07 kernel: schedule+0x58/0xf0
May 20 04:50:01 k8s-node07 kernel: __nfs_lookup_revalidate+0x118/0x160 [nfs]
May 20 04:50:01 k8s-node07 kernel: nfs4_lookup_revalidate+0x20/0x30 [nfs]
May 20 04:50:01 k8s-node07 kernel: lookup_fast+0x138/0x150
May 20 04:50:01 k8s-node07 kernel: walk_component+0x30/0x1a0
May 20 04:50:01 k8s-node07 kernel: path_lookupat+0x80/0x1a4
May 20 04:50:01 k8s-node07 kernel: filename_lookup+0xb4/0x1b0
May 20 04:50:01 k8s-node07 kernel: vfs_statx+0x94/0x19c
May 20 04:50:01 k8s-node07 kernel: vfs_fstatat+0x68/0x90
May 20 04:50:01 k8s-node07 kernel: __do_sys_newfstatat+0x58/0xa0
May 20 04:50:01 k8s-node07 kernel: __arm64_sys_newfstatat+0x28/0x34
May 20 04:50:01 k8s-node07 kernel: invoke_syscall+0x78/0x100
May 20 04:50:01 k8s-node07 kernel: el0_svc_common.constprop.0+0x4c/0xf4
May 20 04:50:01 k8s-node07 kernel: do_el0_svc+0x34/0xd0
May 20 04:50:01 k8s-node07 kernel: el0_svc+0x34/0xd4
May 20 04:50:01 k8s-node07 kernel: el0t_64_sync_handler+0xf4/0x120
May 20 04:50:01 k8s-node07 kernel: el0t_64_sync+0x18c/0x190

Of course, the NFS server is running, and other NFS mounts are still working from the node.

Also, this started to happen with Debian's kernel. Before that, I was compiling my own upstream kernel, version 5.15. With that, I never experienced such a lockup.

Unfortunately, I don't know how to go further or how I should collect more relevant debugging information. I expect that dovecot, being just an application, should not be able to cause any kernel-side lockups. In my test lab, this specific NFS mount is mounted on just one machine, so it really suggests to me a