I encountered the same problem. After several days of testing, I can
reproduce it every time. The setup is Ubuntu 24.04, with a 10 Gb/s
optical fiber link between the login node and the compute node. The
compute node mounts /home from the login node over NFS, and the whole
system is managed with Slurm.

Jobs submitted from the login node do a large amount of file reading and
writing on /home. Each job reads a local txt file of about 10 GB with
Python/NumPy, splits it into many small pieces of about 100 MB each, and
saves them as npy files.
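
The actual program is not included in this report. A minimal sketch of
this kind of job, assuming the txt file holds whitespace-separated
numeric rows, might look like the following. The function name
`split_txt_to_npy`, the `chunk_NNNNN.npy` naming and the
`rows_per_chunk` parameter are all hypothetical and chosen only for
illustration.

```python
import itertools
from pathlib import Path

import numpy as np


def split_txt_to_npy(src, out_dir, rows_per_chunk):
    """Read a numeric text file block by block and save each block as .npy.

    Hypothetical helper: the real job's file layout and chunking rule
    are not given in the report.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with open(src) as fh:
        for index in itertools.count():
            # Pull at most rows_per_chunk lines so the whole file is never
            # held in memory at once.
            lines = list(itertools.islice(fh, rows_per_chunk))
            if not lines:
                break
            block = np.loadtxt(lines, ndmin=2)
            path = out_dir / f"chunk_{index:05d}.npy"
            np.save(path, block)
            written.append(path)
    return written
```

With many such jobs running at once and `out_dir` on the NFS-mounted
/home, every `np.save` call becomes a write that the login node's nfsd
has to serve.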

I submit 252 such jobs at once so that they run concurrently. Within one
hour, the NFS service on the login node hangs. At that point the
nfs-server service on the login node cannot be restarted, and the login
node cannot ssh to the compute node. Restarting the compute node does
not fix the problem. Restarting only the login node does: ssh works
again and the compute node automatically reconnects to NFS.

```bash
root        1548  0.0  0.0   5632  1792 ?        Ss   18:19   0:00 /usr/sbin/nfsdcld
root        2347  4.6  0.0      0     0 ?        D    18:19   8:04 [nfsd]
root       53326  0.0  0.0      0     0 ?        D    20:00   0:00 [kworker/u112:2+nfsd4_callbacks]
root       68918  0.0  0.0   2704  1792 ?        Is   20:47   0:00 /usr/sbin/rpc.nfsd 0
root       74448  0.0  0.0   9436  2240 pts/6    S+   21:11   0:00 grep --color=auto --ex
```
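
In the STAT column, `D` marks a task in uninterruptible sleep, which is
the state the `[nfsd]` thread and the `nfsd4_callbacks` kworker are in
here. As a rough sketch for spotting such tasks automatically, the
helper below filters unwrapped `ps aux` output. The function name
`uninterruptible_tasks` is hypothetical, and it assumes the standard
11-column `ps aux` layout with STAT in the eighth column.

```python
def uninterruptible_tasks(ps_aux_output):
    """Return (pid, command) for each task whose STAT starts with 'D'."""
    tasks = []
    for line in ps_aux_output.splitlines():
        # ps aux prints 11 columns; COMMAND is last and may contain spaces.
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # skip the header row and malformed lines
        if fields[7].startswith("D"):
            tasks.append((int(fields[1]), fields[10]))
    return tasks
```

One way to feed it live data would be
`subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout`.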

```log
Jun 23 20:48:52 icpcs systemd[1]: nfs-server.service: Stopping timed out. Terminating.
Jun 23 20:49:10 icpcs sudo[69464]:     root : TTY=pts/6 ; PWD=/root ; USER=root ; COMMAND=/usr/bin/systemctl status nfs-server.service
Jun 23 20:50:23 icpcs systemd[1]: nfs-server.service: State 'stop-sigterm' timed out. Killing.
Jun 23 20:50:23 icpcs systemd[1]: nfs-server.service: Killing process 68918 (rpc.nfsd) with signal SIGKILL.
Jun 23 20:50:27 icpcs kernel: INFO: task nfsd:2347 blocked for more than 1105 seconds.
Jun 23 20:50:27 icpcs kernel: task:nfsd            state:D stack:0     pid:2347  tgid:2347  ppid:2      flags:0x00004000
Jun 23 20:50:27 icpcs kernel:  nfsd4_probe_callback_sync+0x1a/0x30 [nfsd]
Jun 23 20:50:27 icpcs kernel:  nfsd4_destroy_session+0x186/0x260 [nfsd]
Jun 23 20:50:27 icpcs kernel:  nfsd4_proc_compound+0x3af/0x770 [nfsd]
Jun 23 20:50:27 icpcs kernel:  nfsd_dispatch+0xd4/0x220 [nfsd]
Jun 23 20:50:27 icpcs kernel:  ? __pfx_nfsd_dispatch+0x10/0x10 [nfsd]
Jun 23 20:50:27 icpcs kernel:  ? __pfx_nfsd+0x10/0x10 [nfsd]
Jun 23 20:50:27 icpcs kernel:  nfsd+0x8b/0xe0 [nfsd]
```

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2062568

Title:
  nfsd gets unresponsive after some hours of operation
