I encountered the same problem. After several days of testing, I can reproduce it 100% of the time. Setup: Ubuntu 24.04, with a 10 Gb/s optical fiber link between the login node and the compute node. The compute node mounts /home from the login node over NFS. The whole system is managed with Slurm.
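For anyone trying to reproduce this, the per-job workload described in this report (read a large txt file with NumPy, split it, save the pieces as npy files) looks roughly like the sketch below. This is not the actual script: the function name `split_txt_to_npy`, the chunking by row count, the output file names, and the assumption of a whitespace-delimited numeric file are all illustrative.

```python
from pathlib import Path

import numpy as np


def split_txt_to_npy(txt_path, out_dir, rows_per_chunk):
    """Load a numeric text file and save it as several smaller .npy files.

    Hypothetical helper: the real job's file layout and chunk sizing
    are not known, so rows_per_chunk stands in for "about 100 MB".
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Assumes whitespace-delimited numeric columns; ndmin=2 keeps a 2-D shape
    # even for single-row or single-column input.
    data = np.loadtxt(txt_path, ndmin=2)

    written = []
    for i, start in enumerate(range(0, data.shape[0], rows_per_chunk)):
        chunk_path = out_dir / f"chunk_{i:05d}.npy"
        np.save(chunk_path, data[start:start + rows_per_chunk])
        written.append(chunk_path)
    return written
```

Pointing `out_dir` at a directory under the NFS-mounted /home and launching many copies at once through Slurm should approximate the load pattern, though whether this alone triggers the hang is untested.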
The jobs submitted from the login node do a large amount of file reading and writing on /home. Each job uses Python with NumPy to read a local txt file of about 10 GB, splits it into multiple small pieces of about 100 MB each, and saves them as npy files. I submit 252 such jobs at once and they run concurrently. Within one hour, the NFS service on the login node hangs. At that point the nfs-server service on the login node cannot be restarted, and the login node cannot ssh to the compute node. Restarting the compute node does not help. Restarting only the login node, however, clears the problem: ssh works again and the compute node automatically reconnects to NFS.

```bash
root        1548  0.0  0.0   5632  1792 ?        Ss   18:19   0:00 /usr/sbin/nfsdcld
root        2347  4.6  0.0      0     0 ?        D    18:19   8:04 [nfsd]
root       53326  0.0  0.0      0     0 ?        D    20:00   0:00 [kworker/u112:2+nfsd4_callbacks]
root       68918  0.0  0.0   2704  1792 ?        Is   20:47   0:00 /usr/sbin/rpc.nfsd 0
root       74448  0.0  0.0   9436  2240 pts/6    S+   21:11   0:00 grep --color=auto --ex
```

```log
Jun 23 20:48:52 icpcs systemd[1]: nfs-server.service: Stopping timed out. Terminating.
Jun 23 20:49:10 icpcs sudo[69464]: root : TTY=pts/6 ; PWD=/root ; USER=root ; COMMAND=/usr/bin/systemctl status nfs-server.service
Jun 23 20:50:23 icpcs systemd[1]: nfs-server.service: State 'stop-sigterm' timed out. Killing.
Jun 23 20:50:23 icpcs systemd[1]: nfs-server.service: Killing process 68918 (rpc.nfsd) with signal SIGKILL.
Jun 23 20:50:27 icpcs kernel: INFO: task nfsd:2347 blocked for more than 1105 seconds.
Jun 23 20:50:27 icpcs kernel: task:nfsd state:D stack:0 pid:2347 tgid:2347 ppid:2 flags:0x00004000
Jun 23 20:50:27 icpcs kernel: nfsd4_probe_callback_sync+0x1a/0x30 [nfsd]
Jun 23 20:50:27 icpcs kernel: nfsd4_destroy_session+0x186/0x260 [nfsd]
Jun 23 20:50:27 icpcs kernel: nfsd4_proc_compound+0x3af/0x770 [nfsd]
Jun 23 20:50:27 icpcs kernel: nfsd_dispatch+0xd4/0x220 [nfsd]
Jun 23 20:50:27 icpcs kernel: ? __pfx_nfsd_dispatch+0x10/0x10 [nfsd]
Jun 23 20:50:27 icpcs kernel: ? __pfx_nfsd+0x10/0x10 [nfsd]
Jun 23 20:50:27 icpcs kernel: nfsd+0x8b/0xe0 [nfsd]
```

--
https://bugs.launchpad.net/bugs/2062568

Title:
  nfsd gets unresponsive after some hours of operation