[Bug 2062568] Re: nfsd gets unresponsive after some hours of operation
Status changed to 'Confirmed' because the bug affects multiple users. ** Changed in: nfs-utils (Ubuntu Noble) Status: New => Confirmed -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/2062568 Title: nfsd gets unresponsive after some hours of operation To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2062568/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Bug 2062568] Re: nfsd gets unresponsive after some hours of operation
** Also affects: nfs-utils (Ubuntu Noble) Importance: Undecided Status: New ** Also affects: linux (Ubuntu Noble) Importance: Undecided Status: New ** Changed in: linux (Ubuntu Noble) Status: New => In Progress ** Changed in: linux (Ubuntu Noble) Assignee: (unassigned) => Mehmet Basaran (mehmetbasaran)
[Bug 2062568] Re: nfsd gets unresponsive after some hours of operation
We installed the unofficial kernel 6.8.0-46-generic-nfs on several NFS client servers on Saturday and have been testing it with high IO loads since then. Unfortunately the server crashed again after about 40 hours with "rcu: INFO: rcu_sched self-detected stall on CPU".

The kernel 6.8.0-46-generic-nfs prevents the error message "RPC: Could not send backchannel reply error: -110", but not the crashes, which we have been struggling with since August 19th, when we switched the kernel from 6.5.0-44-generic to 6.8.0-40-generic.

Our experiences with NFS server crashes are:
- We were able to reproduce the crashes in our production and test environments: rarely after minutes, sometimes after hours or days, and sometimes not at all, as we often stopped the experiments after 12 to 24 hours.
- We have not yet been able to reproduce a crash between a bare metal NFS server and a bare metal NFS client, only between a bare metal NFS server and a virtualized client.
- We could not reproduce a crash with NFS vers=4.0.
- The crashes happen with or without GSSPROXY.

Our setup:
- virtualized NFS 4.2 server with Ubuntu 22.04.5 LTS and kernel 5.15.0-122-generic
- virtualized NFS client with Ubuntu 22.04.5 LTS and kernel 6.8.0-40-generic or kernel 6.8.0-45-generic
- /etc/exports : /mnt/home nfsclient(sec=krb5,rw,root_squash,sync,no_subtree_check)
- /etc/fstab : nfsserver:/mnt/home /home nfs vers=4.2,rw,soft,sec=krb5,proto=tcp 0 0
- apt info nfs-common : Version: 1:2.6.1-1ubuntu1.2

syslog of NFS server after crash:

Sep 30 01:15:51 nfs-server.domain.de kernel: rcu: INFO: rcu_sched self-detected stall on CPU
Sep 30 01:15:51 nfs-server.domain.de kernel: rcu: 54-: (14998 ticks this GP) idle=2db/1/0x4000 softirq=32173387/32173387 fqs=7449
Sep 30 01:15:51 nfs-server.domain.de kernel: (t=15000 jiffies g=144775177 q=49782)
Sep 30 01:15:51 nfs-server.domain.de kernel: NMI backtrace for cpu 54
Sep 30 01:15:51 nfs-server.domain.de kernel: CPU: 54 PID: 153154 Comm: kworker/u480:36 Not tainted 5.15.0-122-generic #132-Ubuntu
Sep 30 01:15:51 nfs-server.domain.de kernel: Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.0 12/17/2019
Sep 30 01:15:51 nfs-server.domain.de kernel: Workqueue: rpciod rpc_async_schedule [sunrpc]
Sep 30 01:15:51 nfs-server.domain.de kernel: Call Trace:
Sep 30 01:15:51 nfs-server.domain.de kernel:
Sep 30 01:15:51 nfs-server.domain.de kernel: show_stack+0x52/0x5c
Sep 30 01:15:51 nfs-server.domain.de kernel: dump_stack_lvl+0x4a/0x63
Sep 30 01:15:51 nfs-server.domain.de kernel: dump_stack+0x10/0x16
Sep 30 01:15:51 nfs-server.domain.de kernel: nmi_cpu_backtrace.cold+0x4d/0x93
Sep 30 01:15:51 nfs-server.domain.de kernel: ? lapic_can_unplug_cpu+0x90/0x90
Sep 30 01:15:51 nfs-server.domain.de kernel: nmi_trigger_cpumask_backtrace+0xec/0x100
Sep 30 01:15:51 nfs-server.domain.de kernel: arch_trigger_cpumask_backtrace+0x19/0x20
Sep 30 01:15:51 nfs-server.domain.de kernel: trigger_single_cpu_backtrace+0x44/0x4f
Sep 30 01:15:51 nfs-server.domain.de kernel: rcu_dump_cpu_stacks+0x102/0x149
Sep 30 01:15:51 nfs-server.domain.de kernel: print_cpu_stall.cold+0x2f/0xe2
Sep 30 01:15:51 nfs-server.domain.de kernel: check_cpu_stall+0x1d8/0x270
Sep 30 01:15:51 nfs-server.domain.de kernel: rcu_sched_clock_irq+0x9a/0x250
Sep 30 01:15:51 nfs-server.domain.de kernel: update_process_times+0x94/0xd0
Sep 30 01:15:51 nfs-server.domain.de kernel: tick_sched_handle+0x29/0x70
Sep 30 01:15:51 nfs-server.domain.de kernel: tick_sched_timer+0x6f/0x90
Sep 30 01:15:51 nfs-server.domain.de kernel: ? tick_sched_do_timer+0xa0/0xa0
Sep 30 01:15:51 nfs-server.domain.de kernel: __hrtimer_run_queues+0x104/0x230
Sep 30 01:15:51 nfs-server.domain.de kernel: ? read_hv_clock_tsc_cs+0x9/0x30
Sep 30 01:15:51 nfs-server.domain.de kernel: hrtimer_interrupt+0x101/0x220
Sep 30 01:15:51 nfs-server.domain.de kernel: hv_stimer0_isr+0x1d/0x30
Sep 30 01:15:51 nfs-server.domain.de kernel: __sysvec_hyperv_stimer0+0x2f/0x70
Sep 30 01:15:51 nfs-server.domain.de kernel: sysvec_hyperv_stimer0+0x7b/0x90
Sep 30 01:15:51 nfs-server.domain.de kernel:
Sep 30 01:15:51 nfs-server.domain.de kernel:
Sep 30 01:15:51 nfs-server.domain.de kernel: asm_sysvec_hyperv_stimer0+0x1b/0x20
Sep 30 01:15:51 nfs-server.domain.de kernel: RIP: 0010:read_hv_clock_tsc+0x1b/0x60
Sep 30 01:15:51 nfs-server.domain.de kernel: Code: eb bc 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 8b 35 2a 89 97 02 85 f6 74 38 4c 8b 05 27 89 97 02 48 8b 3d 28 89 97 02 0f 01 f9 <66> 90 8b 0d 0d 89 97 02 39 ce 75 d9 48 c1 e2 20 48 09 d0 49 f7 e0
Sep 30 01:15:51 nfs-server.domain.de kernel: RSP: 0018:ada44ab33dc8 EFLAGS: 0202
Sep 30 01:15:51 nfs-server.domain.de kernel: RAX: 5d52dc50 RBX: 0002197146e8f7ec RCX: 0036
Sep 30 01:15:51 nfs-server.domain.de kernel: RDX: 000571f0 RSI: 0002 RDI: 000a
Sep 30 01:15:51 nfs-server.domain.de kernel: RBP:
[Bug 2062568] Re: nfsd gets unresponsive after some hours of operation
If this patch proves to be a fix, we plan to release it in the next update. In that case, the update will install the official kernel (which will also include this patch) and change the grub settings to boot that kernel automatically.
[Bug 2062568] Re: nfsd gets unresponsive after some hours of operation
** Changed in: linux (Ubuntu) Importance: Undecided => Medium ** Changed in: linux (Ubuntu) Status: Confirmed => In Progress
[Bug 2062568] Re: nfsd gets unresponsive after some hours of operation
Performed two upgrades from 22.04 yesterday; both locked up overnight with the trace below. It feels like this is the same issue. Can anyone confirm for me?

kernel: INFO: task nfsd:2029 blocked for more than 122 seconds.
kernel: Tainted: G OE 6.8.0-45-generic #45-Ubuntu
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: task:nfsd state:D stack:0 pid:2029 tgid:2029 ppid:2 flags:0x4000
kernel: Call Trace:
kernel:
kernel: __schedule+0x27c/0x6b0
kernel: ? __smp_call_single_queue+0xe0/0x180
kernel: schedule+0x33/0x110
kernel: schedule_timeout+0x157/0x170
kernel: wait_for_completion+0x88/0x150
kernel: __flush_workqueue+0x140/0x3e0
kernel: ? nfsd4_run_cb+0x30/0x70 [nfsd]
kernel: nfsd4_probe_callback_sync+0x1a/0x30 [nfsd]
kernel: nfsd4_destroy_session+0x186/0x260 [nfsd]
kernel: nfsd4_proc_compound+0x3b7/0x780 [nfsd]
kernel: nfsd_dispatch+0xd7/0x220 [nfsd]
kernel: svc_process_common+0x450/0x710 [sunrpc]
kernel: ? __pfx_nfsd_dispatch+0x10/0x10 [nfsd]
kernel: svc_process+0x132/0x1b0 [sunrpc]
kernel: svc_handle_xprt+0x4d3/0x5d0 [sunrpc]
kernel: svc_recv+0x18b/0x2e0 [sunrpc]
kernel: ? __pfx_nfsd+0x10/0x10 [nfsd]
kernel: nfsd+0x8b/0xe0 [nfsd]
kernel: kthread+0xf2/0x120
kernel: ? __pfx_kthread+0x10/0x10
kernel: ret_from_fork+0x47/0x70
kernel: ? __pfx_kthread+0x10/0x10
kernel: ret_from_fork_asm+0x1b/0x30
kernel:

The most annoying element is that nothing seems to allow recovery without a reload. Unless someone knows some trick for getting it back up?

These machines host VMs, and the NFS share is purely for some bulk data backups. I will shift them to the kernel mentioned above later today.

If this proves to be a fix, how soon might it roll out? I've got a large number of hosts to move to 24.04 and will be holding off until this is fixed, as it's quite a showstopper.

Cheers!
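For anyone wanting to check their own logs for the same signature, here is a minimal sketch that pulls the PID and blocked time out of hung-task reports like the one above. The function name `hung_nfsd_tasks` is made up for illustration, and the pattern assumes the exact "INFO: task nfsd:PID blocked for more than N seconds" wording shown in this report.

```bash
#!/bin/sh
# Print "PID SECONDS" for every hung-task report about nfsd found on stdin.
# Assumes kernel log lines worded like:
#   kernel: INFO: task nfsd:2029 blocked for more than 122 seconds.
# (hung_nfsd_tasks is a hypothetical helper name, not an existing tool.)
hung_nfsd_tasks() {
    sed -n 's/.*INFO: task nfsd:\([0-9][0-9]*\) blocked for more than \([0-9][0-9]*\) seconds.*/\1 \2/p'
}

# Possible use on a live system with systemd-journald:
#   journalctl -k --no-pager | hung_nfsd_tasks
```

Lines that do not match the pattern are dropped, so the call trace that follows each report is ignored.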
[Bug 2062568] Re: nfsd gets unresponsive after some hours of operation
Hi all,

I am from Canonical's kernel team and am currently investigating this issue.

jammy-hwe, mantic-hwe, and noble all use the 6.8 kernel by default (a generic jammy or mantic install uses the hwe version by default). So the issue is with the 6.8 kernel rather than with a particular series.

I was not able to reproduce the error with the generic 6.8.0-45.45 kernel after 1 hour of stress testing. I am still working on this, and I really appreciate all the feedback you have provided.

Meanwhile, for those who are having the problem, I have created an unofficial version of the 6.8.0-45.45 kernel which includes the upstream fix from "6ddc9deacc1312762c2edd9de00ce76b00f69f7c":
- for jammy: https://launchpad.net/~mehmetbasaran/+archive/ubuntu/linux-hwe-6.8-6.8.0-45.45-nfs-patch
- for noble: https://launchpad.net/~mehmetbasaran/+archive/ubuntu/linux-6.8.0-45.45-nfs-patch

Installation instructions:

Note that if you are using secure boot, you will not be able to boot into these kernels. You will need to disable it first.

# Add the unofficial ppa. Pick the correct one depending on your series
# For jammy:
$ sudo add-apt-repository ppa:mehmetbasaran/linux-hwe-6.8-6.8.0-45.45-nfs-patch
# For noble:
$ sudo add-apt-repository ppa:mehmetbasaran/linux-6.8.0-45.45-nfs-patch

$ sudo apt update
$ sudo apt install linux-buildinfo-6.8.0-46-generic-nfs \
    linux-cloud-tools-6.8.0-46-generic-nfs \
    linux-cloud-tools-common \
    linux-headers-6.8.0-46-generic-nfs \
    linux-image-unsigned-6.8.0-46-generic-nfs \
    linux-modules-6.8.0-46-generic-nfs \
    linux-modules-extra-6.8.0-46-generic-nfs \
    linux-modules-ipu6-6.8.0-46-generic-nfs \
    linux-modules-iwlwifi-6.8.0-46-generic-nfs \
    linux-modules-usbio-6.8.0-46-generic-nfs \
    linux-nfs-6.8-cloud-tools-6.8.0-46 \
    linux-nfs-6.8-headers-6.8.0-46 \
    linux-nfs-6.8-tools-6.8.0-46 \
    linux-tools-6.8.0-46-generic-nfs

Next time you boot, you will be using the patched 6.8.0-45.45:

$ uname -r
# 6.8.0-46-generic-nfs

To return to the previous kernel (official 6.8.0-45.45) you just need to update grub:

$ grep 'menuentry \|submenu ' /boot/grub/grub.cfg | cut -f2 -d "'"
# Prints available kernels on your machine, in my case:
Ubuntu
Advanced options for Ubuntu
Ubuntu, with Linux 6.8.0-46-generic-nfs
Ubuntu, with Linux 6.8.0-46-generic-nfs (recovery mode)
Ubuntu, with Linux 6.8.0-45-generic
Ubuntu, with Linux 6.8.0-45-generic (recovery mode)
Ubuntu, with Linux 6.5.0-18-generic
Ubuntu, with Linux 6.5.0-18-generic (recovery mode)

# Change GRUB_DEFAULT in /etc/default/grub
# from GRUB_DEFAULT=0
# to GRUB_DEFAULT="Advanced options for Ubuntu>Ubuntu, with Linux 6.8.0-45-generic"
$ sudo update-grub
$ sudo reboot
$ uname -r
# 6.8.0-45-generic

After changing your kernel back to the previous version, you can completely remove the unofficial kernel:

# Now these packages will be safe to remove
$ sudo apt remove linux-buildinfo-6.8.0-46-generic-nfs \
    linux-cloud-tools-6.8.0-46-generic-nfs \
    linux-headers-6.8.0-46-generic-nfs \
    linux-image-unsigned-6.8.0-46-generic-nfs \
    linux-modules-6.8.0-46-generic-nfs \
    linux-modules-extra-6.8.0-46-generic-nfs \
    linux-modules-ipu6-6.8.0-46-generic-nfs \
    linux-modules-iwlwifi-6.8.0-46-generic-nfs \
    linux-modules-usbio-6.8.0-46-generic-nfs \
    linux-nfs-6.8-cloud-tools-6.8.0-46 \
    linux-nfs-6.8-headers-6.8.0-46 \
    linux-nfs-6.8-tools-6.8.0-46 \
    linux-tools-6.8.0-46-generic-nfs

# Remove the unofficial ppa from the update list
# (the noble ppa is shown; use the jammy ppa name if that is the one you added)
$ sudo add-apt-repository --remove ppa:mehmetbasaran/linux-6.8.0-45.45-nfs-patch

# Restore grub settings
# Change GRUB_DEFAULT in /etc/default/grub to GRUB_DEFAULT=0
$ sudo update-grub
$ sudo reboot
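The GRUB_DEFAULT string in the instructions above can also be assembled mechanically from the menu titles. The following is only a sketch: `grub_default_for` is a hypothetical helper, and it assumes the titles arrive one per line, as printed by the grep/cut pipeline in the instructions, with a single "Advanced options for ..." submenu.

```bash
#!/bin/sh
# Read grub menu titles (one per line) on stdin and print a GRUB_DEFAULT value
# of the form "<submenu>><entry>" for the kernel version given as $1.
# Recovery-mode entries are skipped because their titles do not end with the
# version string. (grub_default_for is a hypothetical helper name.)
grub_default_for() {
    awk -v want="with Linux $1" '
        /^Advanced options for / { submenu = $0 }
        length($0) >= length(want) &&
            substr($0, length($0) - length(want) + 1) == want { entry = $0 }
        END {
            if (submenu == "" || entry == "") exit 1
            print submenu ">" entry
        }
    '
}

# Possible use:
#   grep 'menuentry \|submenu ' /boot/grub/grub.cfg | cut -f2 -d "'" \
#       | grub_default_for 6.8.0-45-generic
```

The function exits non-zero when either the submenu or the requested kernel entry is missing, so a typo in the version does not silently produce a broken setting.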
[Bug 2062568] Re: nfsd gets unresponsive after some hours of operation
Additionally, for those who prefer to migrate to a newer kernel, you can try our mainline builds here: https://kernel.ubuntu.com/mainline/
[Bug 2062568] Re: nfsd gets unresponsive after some hours of operation
** Changed in: linux (Ubuntu) Assignee: (unassigned) => Mehmet Basaran (mehmetbasaran)
[Bug 2062568] Re: nfsd gets unresponsive after some hours of operation
For those who can't switch to NFSv3 and don't want to run the .0 release 6.11.0 kernel from Oriole, there are 6.10 kernels from xanmod for Ubuntu https://xanmod.org/#apt_repository
[Bug 2062568] Re: nfsd gets unresponsive after some hours of operation
Status changed to 'Confirmed' because the bug affects multiple users. ** Changed in: linux (Ubuntu) Status: New => Confirmed
[Bug 2062568] Re: nfsd gets unresponsive after some hours of operation
I also have this issue on Ubuntu 22.04.5 LTS with the Linux kernel tagged as 6.8.0-40-generic #40~22.04.3-Ubuntu SMP PREEMPT_DYNAMIC Tue Jul 30 17:30:19 UTC 2

The NFSv3 workaround has serious consequences for all the clients that refused to downgrade the protocol. My only option, until there is a patch, is to downgrade to vmlinuz-6.5.0-44-generic.
[Bug 2062568] Re: nfsd gets unresponsive after some hours of operation
Is anybody in a position to try out the kernel from the upcoming Ubuntu 24.10 release? A beta of Ubuntu Oracular 24.10 will be out this week.
[Bug 2062568] Re: nfsd gets unresponsive after some hours of operation
Thanks all for your input. I'll add a kernel task to this bug, but keep the userspace one open for now. ** Also affects: linux (Ubuntu) Importance: Undecided Status: New
[Bug 2062568] Re: nfsd gets unresponsive after some hours of operation
I also have this issue and can't go more than about 8 hours without this breaking. This was not an issue in 20.04. I am currently attempting the NFSv3 workaround.

There are mentions that this bug is fixed in kernel 6.9.8: https://lists.proxmox.com/pipermail/pve-devel/2024-July/064614.html
[Bug 2062568] Re: nfsd gets unresponsive after some hours of operation
Confirmed on 24.04.1 and previously on 23.10 (both server and client), also using large files (1-100GB) and 10G networking to large/fast disk arrays, which others have suggested to be a key factor. All mountpoints are running BTRFS (in some cases a brand new filesystem) without any LUKS.

My observations with throughput also match, e.g.:
- host B as client connects to host A's NFS server with high traffic; this fails after ~12 hours, requiring a reboot of server A to recover
- host A as client connects to host B's NFS server with low traffic; the mount has not failed, even though neither server has been rebooted for days

I have amended both my nfs.conf and fstab on all devices to force NFSv3 only as a workaround until there's a more permanent fix, or we migrate to Debian.

__schedule+0x27c/0x6b0
? __smp_call_single_queue+0xfd/0x180
schedule+0x33/0x110
schedule_timeout+0x157/0x170
wait_for_completion+0x88/0x150
__flush_workqueue+0x140/0x3e0
? nfsd4_run_cb+0x30/0x70 [nfsd]
nfsd4_probe_callback_sync+0x1a/0x30 [nfsd]
nfsd4_destroy_session+0x186/0x260 [nfsd]
nfsd4_proc_compound+0x3b7/0x780 [nfsd]
nfsd_dispatch+0xd7/0x220 [nfsd]
svc_process_common+0x450/0x710 [sunrpc]
? __pfx_nfsd_dispatch+0x10/0x10 [nfsd]
svc_process+0x132/0x1b0 [sunrpc]
svc_handle_xprt+0x4d3/0x5d0 [sunrpc]
svc_recv+0x18b/0x2e0 [sunrpc]
? __pfx_nfsd+0x10/0x10 [nfsd]
nfsd+0x8b/0xe0 [nfsd]
kthread+0xf2/0x120
? __pfx_kthread+0x10/0x10
ret_from_fork+0x47/0x70
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
[Bug 2062568] Re: nfsd gets unresponsive after some hours of operation
I get the same on 22.04 now that 6.8 has been rolled out as the hwe kernel.
[Bug 2062568] Re: nfsd gets unresponsive after some hours of operation
This is a me too. We encountered the same problem, with the exact same message of a tainted and hung nfsd task. Like yuhldr (yuh), we investigated, and the problem happens regularly.

In our case, we have a smallish cluster (100 machines) on a switched gigabit ethernet network. The nfsd machine serves the cluster as a storage pool. We had installed netdata (a nagios-like server monitoring tool) the day before and suspect that was our root cause, as we had not experienced this hang in the few weeks before that, after bringing up a new cloudstack configuration on this cluster of machines. We disabled netdata and have not seen a recurrence for 24 hours, though we are still monitoring.

We did not try the suggestion to set nfsversion to 3, but we may do that if we decide we would like to get netdata back on this machine.
[Bug 2062568] Re: nfsd gets unresponsive after some hours of operation
For the benefit of others... Like #10, we also have a cluster with this issue. As a workaround, we are using version 3 of the NFS protocol (`nfsvers=3` in `/etc/fstab`), which so far seems to have eliminated the problem for us.
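To see which mounts still need the workaround, a small check along these lines may help. It is a sketch only: `nfs_mounts_not_v3` is a hypothetical helper that reads fstab-format text on stdin and reports NFS entries whose options contain neither `vers=3` nor `nfsvers=3` (the two spellings nfs(5) accepts).

```bash
#!/bin/sh
# Print the mount point of every NFS entry on stdin (fstab format) that does
# not pin protocol version 3 via vers=3 or nfsvers=3.
# (nfs_mounts_not_v3 is a hypothetical helper name.)
nfs_mounts_not_v3() {
    awk '
        /^[ \t]*#/ || NF < 4 { next }           # skip comments and short lines
        $3 != "nfs" && $3 != "nfs4" { next }    # only NFS file systems
        {
            n = split($4, opts, ",")
            pinned = 0
            for (i = 1; i <= n; i++)
                if (opts[i] == "vers=3" || opts[i] == "nfsvers=3") pinned = 1
            if (!pinned) print $2
        }
    '
}

# Possible use:
#   nfs_mounts_not_v3 < /etc/fstab
```

An empty result means every NFS entry already forces version 3; entries of type `nfs4` are always reported, since they cannot use NFSv3.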
[Bug 2062568] Re: nfsd gets unresponsive after some hours of operation
I encountered the same problem. After several days of testing, the problem can be reproduced 100% of the time.

The system runs Ubuntu 24.04, with a 10Gb/s optical fiber connection between the login node and the computing node. The computing node uses NFS to mount the /home of the login node. The entire system is managed using slurm.

The login node submits jobs that read and write a large number of files on /home. Each job reads a local txt file of about 10GB with python-numpy, splits it into multiple small files of 100MB, and saves them as npy files. I submit 252 such jobs at once and run them at the same time; within one hour, the NFS service of the login node gets stuck.

At this point, the nfs-server service of the login node cannot be restarted, and the login node cannot ssh to the computing node. Restarting the computing node does not resolve the problem. Restarting the login node does: ssh is restored, and the computing node automatically reconnects to NFS.

```bash
root        1548  0.0  0.0   5632  1792 ?        Ss   18:19   0:00 /usr/sbin/nfsdcld
root        2347  4.6  0.0      0     0 ?        D    18:19   8:04 [nfsd]
root       53326  0.0  0.0      0     0 ?        D    20:00   0:00 [kworker/u112:2+nfsd4_callbacks]
root       68918  0.0  0.0   2704  1792 ?        Is   20:47   0:00 /usr/sbin/rpc.nfsd 0
root       74448  0.0  0.0   9436  2240 pts/6    S+   21:11   0:00 grep --color=auto --ex
```

```log
Jun 23 20:48:52 icpcs systemd[1]: nfs-server.service: Stopping timed out. Terminating.
Jun 23 20:49:10 icpcs sudo[69464]: root : TTY=pts/6 ; PWD=/root ; USER=root ; COMMAND=/usr/bin/systemctl status nfs-server.service
Jun 23 20:50:23 icpcs systemd[1]: nfs-server.service: State 'stop-sigterm' timed out. Killing.
Jun 23 20:50:23 icpcs systemd[1]: nfs-server.service: Killing process 68918 (rpc.nfsd) with signal SIGKILL.
Jun 23 20:50:27 icpcs kernel: INFO: task nfsd:2347 blocked for more than 1105 seconds.
Jun 23 20:50:27 icpcs kernel: task:nfsd state:D stack:0 pid:2347 tgid:2347 ppid:2 flags:0x4000
Jun 23 20:50:27 icpcs kernel: nfsd4_probe_callback_sync+0x1a/0x30 [nfsd]
Jun 23 20:50:27 icpcs kernel: nfsd4_destroy_session+0x186/0x260 [nfsd]
Jun 23 20:50:27 icpcs kernel: nfsd4_proc_compound+0x3af/0x770 [nfsd]
Jun 23 20:50:27 icpcs kernel: nfsd_dispatch+0xd4/0x220 [nfsd]
Jun 23 20:50:27 icpcs kernel: ? __pfx_nfsd_dispatch+0x10/0x10 [nfsd]
Jun 23 20:50:27 icpcs kernel: ? __pfx_nfsd+0x10/0x10 [nfsd]
Jun 23 20:50:27 icpcs kernel: nfsd+0x8b/0xe0 [nfsd]
```
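The process listing above shows the telltale sign: nfsd and its callback worker sitting in state D (uninterruptible sleep). A rough way to pick those out is sketched below; `stuck_nfsd_procs` is a hypothetical helper, and it assumes three-column "PID STAT COMMAND" input such as `ps -eo pid,stat,args` prints.

```bash
#!/bin/sh
# Read "PID STAT COMMAND" rows on stdin and print the PID of each process in
# uninterruptible sleep (STAT begins with D) whose command mentions nfsd.
# (stuck_nfsd_procs is a hypothetical helper name.)
stuck_nfsd_procs() {
    awk '$2 ~ /^D/ && $0 ~ /nfsd/ { print $1 }'
}

# Possible use on a live system:
#   ps -eo pid,stat,args | stuck_nfsd_procs
```

A process briefly in D state is normal under heavy I/O, so this only suggests the hang when the same PIDs keep showing up across repeated runs.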
[Bug 2062568] Re: nfsd gets unresponsive after some hours of operation
The tricky part for me is that the client was regularly changing, so I can't confidently say when the errors started appearing. It's just very suspicious that it needs high load (a new host and higher network bandwidth made the issue more frequent) and uploading to the server, as pure downloading doesn't seem to be a problem even if cached data is getting sent at full bandwidth for minutes.

Moved the server to 24.04, but I've also moved some I/O heavy tasks to it so there would be less need for uploading. The client was on 23.10 and I'm still holding back on upgrading for some more weeks.

Can't say a whole lot about the current situation as I'm not uploading much anymore to avoid the issue, but I actually ran into a hang a few days ago. I didn't have time to debug it, and the server didn't want to restart gracefully, so I ended up hard rebooting. I believe it was the first time since moving the I/O heavy tasks; I wanted to upload a few hundred GiB of data back to the server, which had been downloaded from there a while ago without problems. Otherwise light I/O doesn't seem to run into this problem: the occasional backup to the server is fine, but that rarely saturates the network, and likely fits completely into the page cache almost every time.

A few hopefully helpful points for reproducing the problem:
- As mentioned multiple times, download alone seems to be unaffected; uploading is what should be stressed, and I suspect that either there's no need to download at the same time, or just casual filesystem browsing is a good enough load.
- A fast client with high bandwidth is key. Ran into this issue a couple of times with an older host on 1 Gb/s, but a new fast host with 2.5 Gb/s made the issue appear significantly more frequently.
- It likely doesn't matter how the link gets saturated, but I either processed files cached on the server (mixed R/W), or uploaded cached files (a fast SSD should be fine too), meaning that the bottleneck was always the network, at least while the caches were large enough.
- Files were large, so there wasn't any stopping for fiddling with metadata as would happen with small files, and the page cache was often exhausted. The target was a single HDD the majority of the time, which often meant that writes started blocking (a 100-ish MiB/s HDD catching up with close to 250 MiB/s of data), occasionally making the hosts freeze, as the kernel's background I/O handling is still bad; we just pretend the issue is gone with SSDs being fast enough not to run into this. The page cache draining freezes may be good at exposing race conditions.

It may be more efficient to start by looking for what's causing the "RPC: Could not send backchannel reply error: -110" log spam, which might be related. The lockup may take significant time to catch, while that kernel message showed up quite frequently. Even now I have plenty of those lines without experiencing issues and without uploading much, mostly just downloading large files.

Some extra info which may or may not matter:
- The server hardware is quite weak, with an old 4 core Broadwell CPU, possibly helping to expose race condition problems
- All file systems are Btrfs with noatime,discard=async,compress-force=zstd, the latter part surely adding more load
- LUKS is used everywhere, also adding some extra load
- There's a Btrfs (on LUKS) image mounted over NFS (with not a whole lot of usage though)
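Since the backchannel message shows up far more often than the lockup itself, counting it may be a quicker signal while testing. A minimal sketch, assuming the message text is exactly as quoted above; `count_backchannel_errors` is a made-up name.

```bash
#!/bin/sh
# Count log lines on stdin containing the backchannel timeout message quoted
# in this thread. Prints 0 (and still succeeds) when none are found.
# (count_backchannel_errors is a hypothetical helper name.)
count_backchannel_errors() {
    grep -cF 'RPC: Could not send backchannel reply error: -110' || true
}

# Possible use on a live system with systemd-journald:
#   journalctl -k --no-pager | count_backchannel_errors
```

Whether the count correlates with the hang is exactly the open question here, so treat the number as a hint rather than a diagnosis.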
[Bug 2062568] Re: nfsd gets unresponsive after some hours of operation
IIUC, on the NFS server side, both 23.10 (6.5 series) and 24.04 (6.8 series) have a similar issue, but 22.04 (probably a 5.15.x kernel) was OK. And what is the kernel version on the NFS client? Did it change, or did it stay on one version?

Anyway, I guess the efficient way is to bisect to find which commit caused the issue. Thanks!
[Bug 2062568] Re: nfsd gets unresponsive after some hours of operation
** Attachment added: "rpc_tasks.txt" https://bugs.launchpad.net/ubuntu/+source/nfs-utils/+bug/2062568/+attachment/5770302/+files/rpc_tasks.txt
[Bug 2062568] Re: nfsd gets unresponsive after some hours of operation
Oh fun, trying to add multiple attachments, I just ended up finding a bug report from 2007 complaining about apport being able to do this while the web interface is limited, so I guess I'll get a bit spammy.

Ran into the same problem as usual again, even while avoiding heavy utilization. I'm not seeing anything too interesting in the gathered info, and apparently the kernel log buffer is not large enough to hold a complete task list dump, but it seems like most of the information is at least there.

I will try to use RPCNFSDCOUNT=1 for some time in case this is a silly deadlock; at least it's definitely easier to try than downgrading to Ubuntu 22.04, which worked well.

** Attachment added: "dmesg.txt" https://bugs.launchpad.net/ubuntu/+source/nfs-utils/+bug/2062568/+attachment/5770300/+files/dmesg.txt
[Bug 2062568] Re: nfsd gets unresponsive after some hours of operation
** Attachment added: "nfs_threads.txt" https://bugs.launchpad.net/ubuntu/+source/nfs-utils/+bug/2062568/+attachment/5770301/+files/nfs_threads.txt
[Bug 2062568] Re: nfsd gets unresponsive after some hours of operation
Ran into this again just hours after commenting, while attempting to unpack a large archive file. Apparently the new setup, a higher performance host with more network bandwidth, is just too overwhelming to be usable with NFS while this issue exists. Reading alone still seems to be fine, though.

Torturing 50% of the memory with memtester didn't reveal any problems, and getting the server to unpack the files itself produces the same mixed I/O with no issues so far, progressing way beyond where I could get over NFS, so it doesn't look like an HDD issue either.

The info dumping described here seems potentially useful for the next catch, so linking it here: https://www.spinics.net/lists/linux-nfs/msg97486.html

At some point I may try to reproduce the issue with writes only to see whether the mixed workload is really required for the freeze, but I'm not sure when I will get to it: I can't just restart NFSd, and the mandatory reboot to get it going again is only feasible when all other tasks running on the host can be interrupted.
[Bug 2062568] Re: nfsd gets unresponsive after some hours of operation
Status changed to 'Confirmed' because the bug affects multiple users. ** Changed in: nfs-utils (Ubuntu) Status: New => Confirmed
[Bug 2062568] Re: nfsd gets unresponsive after some hours of operation
This kind of issue appeared with Ubuntu 23.10 for me, on a server mostly using an HDD for bulk storage, with a not exactly powerful CPU that is also occupied with running WireGuard to secure the NFS connection. I'm mentioning the performance details because I have a feeling they matter.

A client that is also not exactly high performance, connecting over 1 Gb/s, only very occasionally caused this problem; however, given a 10 Gb/s connection, the issue appeared significantly more often. A higher performance setup utilizing a 2.5 Gb/s connection triggered this bug within a couple of days after setup.

The lockup always seems to occur with heavy NFS usage, suspiciously mostly when there's both reading and writing going on; at least I don't recall it happening with reading only, but I'm not confident in stating that it didn't happen with a write-only load.

I found this bug report via the client error message; the server side differs due to the different version:

```
[300146.04] INFO: task nfsd:1426 blocked for more than 241 seconds.
[300146.046732] Not tainted 6.5.0-27-generic #28~22.04.1-Ubuntu
[300146.046770] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[300146.046813] task:nfsd state:D stack:0 pid:1426 ppid:2 flags:0x4000
[300146.046827] Call Trace:
[300146.046832] <TASK>
[300146.046839] __schedule+0x2cb/0x750
[300146.046860] schedule+0x63/0x110
[300146.046870] schedule_timeout+0x157/0x170
[300146.046881] wait_for_completion+0x88/0x150
[300146.046894] __flush_workqueue+0x140/0x3e0
[300146.046908] nfsd4_probe_callback_sync+0x1a/0x30 [nfsd]
[300146.047074] nfsd4_destroy_session+0x193/0x260 [nfsd]
[300146.047219] nfsd4_proc_compound+0x3b7/0x770 [nfsd]
[300146.047365] nfsd_dispatch+0xbf/0x1d0 [nfsd]
[300146.047497] svc_process_common+0x420/0x6e0 [sunrpc]
[300146.047695] ? __pfx_read_tsc+0x10/0x10
[300146.047706] ? __pfx_nfsd_dispatch+0x10/0x10 [nfsd]
[300146.047848] ? __pfx_nfsd+0x10/0x10 [nfsd]
[300146.047977] svc_process+0x132/0x1b0 [sunrpc]
[300146.048157] nfsd+0xdc/0x1c0 [nfsd]
[300146.048287] kthread+0xf2/0x120
[300146.048299] ? __pfx_kthread+0x10/0x10
[300146.048310] ret_from_fork+0x47/0x70
[300146.048321] ? __pfx_kthread+0x10/0x10
[300146.048331] ret_from_fork_asm+0x1b/0x30
[300146.048341] </TASK>
```

This seems to match, but the lockups I experienced previously may have been somewhat different. I mostly remember the client whining about the server not responding instead of the message presented here, and the server call trace used to have btrfs in it, which made me suspect the problem may be exclusive to Btrfs, although the issue was always with NFS; nothing else locked up despite some other sources of heavy I/O.
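For anyone comparing several such reports, a small Python sketch for pulling the hung-task header lines out of a saved kernel log may help. The `find_hung_tasks` helper is hypothetical, the regular expression only assumes the "INFO: task NAME:PID blocked for more than N seconds" wording seen in this bug, and the two sample lines are copied from traces quoted in this thread.

```python
import re
from typing import List, Tuple

# Matches the header the hung task detector prints, as quoted in this thread.
HUNG_TASK = re.compile(
    r"INFO: task (?P<comm>\S+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(log_text: str) -> List[Tuple[str, int, int]]:
    """Return (task name, pid, blocked seconds) for every hung-task report."""
    return [
        (m["comm"], int(m["pid"]), int(m["secs"]))
        for m in HUNG_TASK.finditer(log_text)
    ]


sample = (
    "[300146.04] INFO: task nfsd:1426 blocked for more than 241 seconds.\n"
    "[11183.290548] INFO: task nfsd:1419 blocked for more than 1228 seconds.\n"
)

for comm, pid, secs in find_hung_tasks(sample):
    print(f"{comm} (pid {pid}) blocked for more than {secs}s")
```

Feeding it the text of a saved dmesg file gives a quick list of which tasks were reported and for how long; it does not look at the call trace that follows each header.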
[Bug 2062568] Re: nfsd gets unresponsive after some hours of operation
** Description changed:

- I installed the 22.04 Beta on two test machines that were running 22.04
+ I installed the 24.04 Beta on two test machines that were running 22.04
  without issues before. One of them exports two volumes that are mounted
  by the other machine, which primarily uses them as a secondary storage
  for ccache.

  After being up for a couple of hours (happened twice since yesterday
  evening) it seems that nfsd on the machine exporting the volumes hangs
  on something.

  From dmesg on the server (repeated a few times):

  [11183.290548] INFO: task nfsd:1419 blocked for more than 1228 seconds.
  [11183.290558] Not tainted 6.8.0-22-generic #22-Ubuntu
  [11183.290563] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [11183.290582] task:nfsd state:D stack:0 pid:1419 tgid:1419 ppid:2 flags:0x4000
  [11183.290587] Call Trace:
  [11183.290602] <TASK>
  [11183.290606] __schedule+0x27c/0x6b0
  [11183.290612] schedule+0x33/0x110
  [11183.290615] schedule_timeout+0x157/0x170
  [11183.290619] wait_for_completion+0x88/0x150
  [11183.290623] __flush_workqueue+0x140/0x3e0
  [11183.290629] nfsd4_probe_callback_sync+0x1a/0x30 [nfsd]
  [11183.290689] nfsd4_destroy_session+0x186/0x260 [nfsd]
  [11183.290744] nfsd4_proc_compound+0x3af/0x770 [nfsd]
  [11183.290798] nfsd_dispatch+0xd4/0x220 [nfsd]
  [11183.290851] svc_process_common+0x44d/0x710 [sunrpc]
  [11183.290924] ? __pfx_nfsd_dispatch+0x10/0x10 [nfsd]
  [11183.290976] svc_process+0x132/0x1b0 [sunrpc]
  [11183.291041] svc_handle_xprt+0x4d3/0x5d0 [sunrpc]
  [11183.291105] svc_recv+0x18b/0x2e0 [sunrpc]
  [11183.291168] ? __pfx_nfsd+0x10/0x10 [nfsd]
  [11183.291220] nfsd+0x8b/0xe0 [nfsd]
  [11183.291270] kthread+0xef/0x120
  [11183.291274] ? __pfx_kthread+0x10/0x10
  [11183.291276] ret_from_fork+0x44/0x70
  [11183.291279] ? __pfx_kthread+0x10/0x10
  [11183.291281] ret_from_fork_asm+0x1b/0x30
  [11183.291286] </TASK>

  From dmesg on the client (repeated a number of times):

  [ 6596.911785] RPC: Could not send backchannel reply error: -110
  [ 6596.972490] RPC: Could not send backchannel reply error: -110
  [ 6837.281307] RPC: Could not send backchannel reply error: -110

  ProblemType: Bug
  DistroRelease: Ubuntu 24.04
  Package: nfs-kernel-server 1:2.6.4-3ubuntu5
  ProcVersionSignature: Ubuntu 6.8.0-22.22-generic 6.8.1
  Uname: Linux 6.8.0-22-generic x86_64
  .etc.request-key.d.id_resolver.conf: create id_resolver * * /usr/sbin/nfsidmap -t 600 %k %d
  ApportVersion: 2.28.1-0ubuntu1
  Architecture: amd64
  CasperMD5CheckResult: pass
  Date: Fri Apr 19 14:10:25 2024
  InstallationDate: Installed on 2024-04-16 (3 days ago)
  InstallationMedia: Ubuntu-Server 24.04 LTS "Noble Numbat" - Beta amd64 (20240410.1)
  NFSMounts:
-
+
  NFSv4Mounts:
-
+
  ProcEnviron:
- LANG=en_US.UTF-8
- PATH=(custom, no user)
- SHELL=/bin/bash
- TERM=xterm-256color
- XDG_RUNTIME_DIR=
+ LANG=en_US.UTF-8
+ PATH=(custom, no user)
+ SHELL=/bin/bash
+ TERM=xterm-256color
+ XDG_RUNTIME_DIR=
  SourcePackage: nfs-utils
  UpgradeStatus: No upgrade log present (probably fresh install)
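As a reading aid for the client-side messages, the following Python sketch extracts the timestamp and error number from each "Could not send backchannel reply" line and looks up the symbolic errno name. The `parse_backchannel_errors` helper is hypothetical, and the lookup assumes the script runs on Linux, where Python's errno table follows the kernel's numbering and 110 is ETIMEDOUT ("Connection timed out"). The sample lines are the three quoted in the description.

```python
import errno
import re
from typing import List, Tuple

# Matches the client message quoted in the bug description.
BACKCHANNEL = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\] RPC: Could not send backchannel reply error: (?P<err>-\d+)"
)


def parse_backchannel_errors(log_text: str) -> List[Tuple[float, int, str]]:
    """Return (timestamp, error number, symbolic name) for each matching line."""
    results = []
    for m in BACKCHANNEL.finditer(log_text):
        err = int(m["err"])
        # The kernel prints the error negated; look up the positive value.
        name = errno.errorcode.get(-err, "unknown")
        results.append((float(m["ts"]), err, name))
    return results


sample = """\
[ 6596.911785] RPC: Could not send backchannel reply error: -110
[ 6596.972490] RPC: Could not send backchannel reply error: -110
[ 6837.281307] RPC: Could not send backchannel reply error: -110
"""

for ts, err, name in parse_backchannel_errors(sample):
    print(f"{ts:.6f} {err} {name}")
```

This only decodes the number; it says nothing about why the backchannel reply timed out, which is the open question in this bug.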