Bug#1071501: linux-image-6.1.0-21-arm64: Linux NFS client hangs in nfs4_lookup_revalidate

2024-05-21 Thread Richard Kojedzinszky

Dear Salvatore,

I've already started bisecting. It will take some time. The bug usually
appears after a few hours; unfortunately, I am not able to trigger it
faster. So if the bug appears, I can step forward easily, but if it does
not, it's hard to decide whether it is still present and simply has not
occurred yet, or whether the current version is a good one. I'll do my
best.
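
The difficulty described above (a run without a hang does not prove the
kernel is good) can be made concrete with a rough model. The sketch below
assumes, purely for illustration, that the time until a hang on a bad
kernel is exponentially distributed; the 3-hour mean is a guess standing
in for "a few hours", and the function names are hypothetical.

```python
import math


def miss_probability(wait_hours: float, mean_hours_to_hang: float) -> float:
    """Probability that a kernel which HAS the bug shows no hang within
    wait_hours, under the assumed exponential time-to-hang model."""
    return math.exp(-wait_hours / mean_hours_to_hang)


def hours_needed(max_miss_probability: float, mean_hours_to_hang: float) -> float:
    """Wait time after which a bad kernel would stay quiet with at most
    the given probability (inverse of miss_probability)."""
    return -mean_hours_to_hang * math.log(max_miss_probability)


if __name__ == "__main__":
    mean = 3.0  # assumption: "a few hours" taken as a 3-hour mean
    for wait in (3, 6, 12, 24):
        p = miss_probability(wait, mean)
        print(f"{wait:>2} h quiet: a bad kernel stays quiet with probability {p:.1%}")
    print(f"hours to wait for a 1% miss rate: {hours_needed(0.01, mean):.1f}")
```

Under these assumptions, waiting several multiples of the typical
time-to-hang before marking a bisect step as good keeps the chance of a
wrong "good" verdict small; one wrong verdict sends the bisect down the
wrong branch.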


I will also contact linux-nfs mailing list.

As I remember, it started nearly a year ago, when I switched to Debian's
kernel. I don't know exactly what version it was at that time. However,
I've checked Debian's patches, and I did not find anything related to NFS.


Regards,
Richard


On 2024-05-20 at 21:07, Salvatore Bonaccorso wrote:

> Hi Richard,
>
> On Mon, May 20, 2024 at 09:27:24AM +, Richard Kojedzinszky wrote:
> > Package: src:linux
> > Version: 6.1.90-1
> > Severity: normal
> > X-Debbugs-Cc: richard+debian+bugrep...@kojedz.in
> >
> > Dear Maintainer,
> >
> > I am running Kubernetes on Debian, and pods are mounting multiple NFS
> > shares. I am running dovecot processes in pods, which receive mail from
> > the internet and also serve as IMAP servers for clients. I am
> > monitoring my mail system by sending mails periodically (every 15
> > seconds) and also downloading them via IMAP. A few times I found some
> > dovecot process stuck in D state; a reboot was always needed to recover
> > from that state.
> >
> > Unfortunately, I was not able to trigger the bug quickly, and I don't
> > really know what operations dovecot issues, and in what order, to
> > trigger this behavior. So until I get closer, I've set up a similar but
> > smaller environment with just a single dovecot process, and it also
> > does the same work, delivering only test mails locally and serving them
> > via IMAP to the monitoring client, storing everything on NFS.
> > Fortunately, this also triggers the bug: after a few hours, one of the
> > dovecot processes is stuck in D state. The kernel also reports the
> > blocked task:
>
> As you seem to be in the lucky position of being able to trigger the
> issue in a more localized setup, could you:
>
> - also try more recent kernels from upper suites (6.8.9-1 in
>   unstable would be ideal) to check whether the issue is present there
>   as well?
> - I read that you cannot trigger it with 5.15. If you build 6.1.90 from
>   upstream without Debian patches, I assume you can trigger the issue
>   likewise? If so, could you bisect the changes introducing the issue?
>   This is a cumbersome process, in particular if you need a few hours to
>   trigger it. So maybe the following point could be done first:
> - Can you report the issue to the linux-nfs list, keeping us in the
>   loop?
>
> Regards,
> Salvatore




Bug#1071501: linux-image-6.1.0-21-arm64: Linux NFS client hangs in nfs4_lookup_revalidate

2024-05-20 Thread Diederik de Haas
On Monday, 20 May 2024 21:07:49 CEST Salvatore Bonaccorso wrote:
> - I read that you cannot trigger it with 5.15. If you build 6.1.90 from
>   upstream without Debian patches, I assume you can trigger the issue
>   likewise? If so, could you bisect the changes introducing the issue?

If the test with the upstream 6.1.90 version also has this problem, there's 
another (series of) test(s) worth doing, which could shorten the bisect 
operation significantly.

I got the impression that you have only tried it with version 6.1.90.
Have you tried it with earlier versions in the 6.1 series to see if the issue 
is present there? 

Via https://snapshot.debian.org/package/linux-signed-arm64/ you can find 
earlier versions from the 6.1 series already compiled and packaged.
To take version 6.1.52 as a (random) example:
- click on the ``6.1.52+1`` link
- In the ``Binary packages`` list, click on the linux-image-6.1.0-X-arm64 
link, where 'X' is 12 in this case
- Click the ``linux-image-6.1.0-12-arm64_6.1.52-1_arm64.deb`` link to download 
the deb file which you can then install as root or with sudo by executing
``apt install ./linux-image-_arm64.deb``

If the problem does NOT occur with 6.1.52-1, then you try a higher version. 
Continue that process until you've found the latest version that works and the 
earliest version where it stopped working.

If the problem also occurs with 6.1.52-1, then you try an (even) older 
version.

This is to test whether it was a regression *within* the 6.1 series and if so, 
to get the narrowest range without having to compile yourself.
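
The narrowing procedure above is a binary search over the packaged
versions. A minimal sketch follows, assuming the list is sorted oldest to
newest, that the first entry is already known to be good and the last one
known to be bad; `narrow_regression` and `is_bad` are hypothetical names,
and `is_bad` stands for "install this package, run the workload for long
enough, report whether a hang occurred".

```python
from typing import Callable, Sequence, Tuple


def narrow_regression(
    versions: Sequence[str],
    is_bad: Callable[[str], bool],
) -> Tuple[str, str]:
    """Return (latest good version, earliest bad version).

    Assumes versions are sorted oldest to newest, versions[0] is known
    good and versions[-1] is known bad; neither endpoint is re-tested.
    """
    if len(versions) < 2:
        raise ValueError("need at least one good and one bad version")
    lo, hi = 0, len(versions) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid  # regression is at mid or earlier
        else:
            lo = mid  # regression is after mid
    return versions[lo], versions[hi]


if __name__ == "__main__":
    # Illustrative list only: check snapshot.debian.org for the versions
    # that actually exist. Only 6.1.52-1 and 6.1.90-1 are named in this
    # thread, and 6.1.52-1 being good is an assumption for the demo.
    candidates = ["6.1.52-1", "<version A>", "<version B>", "<version C>", "6.1.90-1"]
    # Stand-in for hours of real testing: pretend the hang starts at B.
    first_bad = candidates.index("<version B>")
    print(narrow_regression(candidates, lambda v: candidates.index(v) >= first_bad))
```

With N packaged versions between the endpoints, this needs about log2(N)
test runs, which matters when each run takes hours.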

HTH



Bug#1071501: linux-image-6.1.0-21-arm64: Linux NFS client hangs in nfs4_lookup_revalidate

2024-05-20 Thread Salvatore Bonaccorso
Hi Richard,

On Mon, May 20, 2024 at 09:27:24AM +, Richard Kojedzinszky wrote:
> Package: src:linux
> Version: 6.1.90-1
> Severity: normal
> X-Debbugs-Cc: richard+debian+bugrep...@kojedz.in
> 
> Dear Maintainer,
> 
> I am running Kubernetes on Debian, and pods are mounting multiple NFS
> shares. I am running dovecot processes in pods, which receive mail from
> the internet and also serve as IMAP servers for clients. I am
> monitoring my mail system by sending mails periodically (every 15
> seconds) and also downloading them via IMAP. A few times I found some
> dovecot process stuck in D state; a reboot was always needed to recover
> from that state.
>
> Unfortunately, I was not able to trigger the bug quickly, and I don't
> really know what operations dovecot issues, and in what order, to
> trigger this behavior. So until I get closer, I've set up a similar but
> smaller environment with just a single dovecot process, and it also
> does the same work, delivering only test mails locally and serving them
> via IMAP to the monitoring client, storing everything on NFS.
> Fortunately, this also triggers the bug: after a few hours, one of the
> dovecot processes is stuck in D state. The kernel also reports the
> blocked task:

As you seem to be in the lucky position of being able to trigger the
issue in a more localized setup, could you:

- also try more recent kernels from upper suites (6.8.9-1 in
  unstable would be ideal) to check whether the issue is present there
  as well?
- I read that you cannot trigger it with 5.15. If you build 6.1.90 from
  upstream without Debian patches, I assume you can trigger the issue
  likewise? If so, could you bisect the changes introducing the issue?
  This is a cumbersome process, in particular if you need a few hours to
  trigger it. So maybe the following point could be done first:
- Can you report the issue to the linux-nfs list, keeping us in the
  loop?

Regards,
Salvatore



Bug#1071501: linux-image-6.1.0-21-arm64: Linux NFS client hangs in nfs4_lookup_revalidate

2024-05-20 Thread Richard Kojedzinszky
Package: src:linux
Version: 6.1.90-1
Severity: normal
X-Debbugs-Cc: richard+debian+bugrep...@kojedz.in

Dear Maintainer,

I am running Kubernetes on Debian, and pods are mounting multiple NFS
shares. I am running dovecot processes in pods, which receive mail from
the internet and also serve as IMAP servers for clients. I am
monitoring my mail system by sending mails periodically (every 15
seconds) and also downloading them via IMAP. A few times I found some
dovecot process stuck in D state; a reboot was always needed to recover
from that state.

Unfortunately, I was not able to trigger the bug quickly, and I don't
really know what operations dovecot issues, and in what order, to
trigger this behavior. So until I get closer, I've set up a similar but
smaller environment with just a single dovecot process, and it also
does the same work, delivering only test mails locally and serving them
via IMAP to the monitoring client, storing everything on NFS.
Fortunately, this also triggers the bug: after a few hours, one of the
dovecot processes is stuck in D state. The kernel also reports the
blocked task:

May 19 12:16:49 k8s-node07 kernel: INFO: task lmtp:665683 blocked for more than 120 seconds.
May 19 12:16:49 k8s-node07 kernel:   Not tainted 6.1.0-21-arm64 #1 Debian 6.1.90-1
May 19 12:16:49 k8s-node07 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 19 12:16:49 k8s-node07 kernel: task:lmtp state:D stack:0 pid:665683 ppid:2881   flags:0x
May 19 12:16:49 k8s-node07 kernel: Call trace:
May 19 12:16:49 k8s-node07 kernel:  __switch_to+0xf0/0x170
May 19 12:16:49 k8s-node07 kernel:  __schedule+0x340/0x940
May 19 12:16:49 k8s-node07 kernel:  schedule+0x58/0xf0
May 19 12:16:49 k8s-node07 kernel:  __nfs_lookup_revalidate+0x118/0x160 [nfs]
May 19 12:16:49 k8s-node07 kernel:  nfs4_lookup_revalidate+0x20/0x30 [nfs]
May 19 12:16:49 k8s-node07 kernel:  lookup_fast+0x138/0x150
May 19 12:16:49 k8s-node07 kernel:  walk_component+0x30/0x1a0
May 19 12:16:49 k8s-node07 kernel:  path_lookupat+0x80/0x1a4
May 19 12:16:49 k8s-node07 kernel:  filename_lookup+0xb4/0x1b0
May 19 12:16:49 k8s-node07 kernel:  vfs_statx+0x94/0x19c
May 19 12:16:49 k8s-node07 kernel:  vfs_fstatat+0x68/0x90
May 19 12:16:49 k8s-node07 kernel:  __do_sys_newfstatat+0x58/0xa0
May 19 12:16:49 k8s-node07 kernel:  __arm64_sys_newfstatat+0x28/0x34
May 19 12:16:49 k8s-node07 kernel:  invoke_syscall+0x78/0x100
May 19 12:16:49 k8s-node07 kernel:  el0_svc_common.constprop.0+0x4c/0xf4
May 19 12:16:49 k8s-node07 kernel:  do_el0_svc+0x34/0xd0
May 19 12:16:49 k8s-node07 kernel:  el0_svc+0x34/0xd4
May 19 12:16:49 k8s-node07 kernel:  el0t_64_sync_handler+0xf4/0x120
May 19 12:16:49 k8s-node07 kernel:  el0t_64_sync+0x18c/0x190

Or, for another process:

May 20 04:50:01 k8s-node07 kernel: INFO: task imap:8337 blocked for more than 120 seconds.
May 20 04:50:01 k8s-node07 kernel:   Not tainted 6.1.0-21-arm64 #1 Debian 6.1.90-1
May 20 04:50:01 k8s-node07 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 20 04:50:01 k8s-node07 kernel: task:imap state:D stack:0 pid:8337  ppid:3164   flags:0x
May 20 04:50:01 k8s-node07 kernel: Call trace:
May 20 04:50:01 k8s-node07 kernel:  __switch_to+0xf0/0x170
May 20 04:50:01 k8s-node07 kernel:  __schedule+0x340/0x940
May 20 04:50:01 k8s-node07 kernel:  schedule+0x58/0xf0
May 20 04:50:01 k8s-node07 kernel:  __nfs_lookup_revalidate+0x118/0x160 [nfs]
May 20 04:50:01 k8s-node07 kernel:  nfs4_lookup_revalidate+0x20/0x30 [nfs]
May 20 04:50:01 k8s-node07 kernel:  lookup_fast+0x138/0x150
May 20 04:50:01 k8s-node07 kernel:  walk_component+0x30/0x1a0
May 20 04:50:01 k8s-node07 kernel:  path_lookupat+0x80/0x1a4
May 20 04:50:01 k8s-node07 kernel:  filename_lookup+0xb4/0x1b0
May 20 04:50:01 k8s-node07 kernel:  vfs_statx+0x94/0x19c
May 20 04:50:01 k8s-node07 kernel:  vfs_fstatat+0x68/0x90
May 20 04:50:01 k8s-node07 kernel:  __do_sys_newfstatat+0x58/0xa0
May 20 04:50:01 k8s-node07 kernel:  __arm64_sys_newfstatat+0x28/0x34
May 20 04:50:01 k8s-node07 kernel:  invoke_syscall+0x78/0x100
May 20 04:50:01 k8s-node07 kernel:  el0_svc_common.constprop.0+0x4c/0xf4
May 20 04:50:01 k8s-node07 kernel:  do_el0_svc+0x34/0xd0
May 20 04:50:01 k8s-node07 kernel:  el0_svc+0x34/0xd4
May 20 04:50:01 k8s-node07 kernel:  el0t_64_sync_handler+0xf4/0x120
May 20 04:50:01 k8s-node07 kernel:  el0t_64_sync+0x18c/0x190
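
Since a hang can take hours to appear, it may help to detect reports like
the two above automatically. The following is a minimal sketch that picks
the "blocked for more than" lines out of syslog-style text; the regular
expression is modelled only on the lines shown here, and `HungTask` and
`find_hung_tasks` are hypothetical names.

```python
import re
from typing import Iterable, Iterator, NamedTuple


class HungTask(NamedTuple):
    timestamp: str
    host: str
    comm: str
    pid: int


# Matches e.g. "May 19 12:16:49 k8s-node07 kernel: INFO: task lmtp:665683
# blocked for more than ..." (pattern derived from the excerpts above).
HUNG_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2}) (?P<host>\S+) kernel: "
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) blocked for more than"
)


def find_hung_tasks(lines: Iterable[str]) -> Iterator[HungTask]:
    """Yield one HungTask per hung-task report found in the given lines."""
    for line in lines:
        m = HUNG_RE.match(line)
        if m:
            yield HungTask(m["ts"], m["host"], m["comm"], int(m["pid"]))


if __name__ == "__main__":
    import sys

    # Usage idea: pipe kernel log text into the script on stdin.
    for task in find_hung_tasks(sys.stdin):
        print(f"{task.timestamp} {task.host}: {task.comm} (pid {task.pid}) is blocked")
```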


Of course the NFS server is running, and other NFS mounts are still
working from the node. Also, this started to happen with Debian's
kernel. Before that, I was compiling my own upstream kernel, version
5.15. With that, I've never experienced such a lockup.

Unfortunately, I don't know how to go further or how I should collect
more relevant debugging information.
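
One possible starting point for collecting more data is to list the tasks
that are currently in D state together with their kernel stacks. The
sketch below reads the standard /proc files; it only walks top-level
process entries (not individual threads), reading /proc/<pid>/stack
normally requires root, and the helper names are hypothetical.

```python
import os
from typing import Iterator, Tuple


def parse_stat(stat_text: str) -> Tuple[int, str, str]:
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split around the first '(' and the last ')'.
    left = stat_text.index("(")
    right = stat_text.rindex(")")
    pid = int(stat_text[:left])
    comm = stat_text[left + 1:right]
    state = stat_text[right + 1:].split()[0]
    return pid, comm, state


def read_text(path: str) -> str:
    """Read a small /proc file, returning a placeholder if it is unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError as exc:
        return f"<unavailable: {exc.strerror}>"


def d_state_tasks(proc: str = "/proc") -> Iterator[Tuple[int, str]]:
    """Yield (pid, comm) for every process in uninterruptible sleep."""
    if not os.path.isdir(proc):
        return
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError):
            continue  # the process exited while we were scanning
        if state == "D":
            yield pid, comm


if __name__ == "__main__":
    for pid, comm in d_state_tasks():
        print(f"{comm} (pid {pid}) is in D state")
        print("  wchan:", read_text(f"/proc/{pid}/wchan"))
        print("  kernel stack:")
        print(read_text(f"/proc/{pid}/stack"))
```

Short-lived D states are normal during I/O, so only tasks that stay in
the list across repeated runs are of interest.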

I expect that dovecot is just an application, which should not cause any
kernel-side lockups. In my test lab, this specific NFS mount is just
mounted on one machine, so it really suggests me a