[Kernel-packages] [Bug 2015827] Re: NFS performance issue while clearing the file access cache upon login
Hi @kleber-souza, @chengendu

Thanks for your attention.

Allow me to give my perception of the impact of fixing "bug" LP: #2003053.

The original patchset introduced *two* regressions. The first (the NFS deathlock) hit everybody and was fixed by #2009325. The remaining one is now hitting those of us who spawn new user processes frequently, which causes new "login times" to be created and the access cache to be zapped. As a result, we are looking at a 300-400% increase in *overall* NFS operations, making the current kernels unusable for production. We do not have that kind of headroom on our NFS servers.

The result is that we are simply stuck with kernels prior to the #2003053 fixes. With recent CVE fixes in the current kernel, we have now also resorted to building our own kernels. This is very counter-productive.

I understand the use case for the changes that went into "bug" #2003053. The reason I call this a "bug" (in quotes) is that the behaviour has been around for more than 15 years. While age alone is not a qualifier, I am just saying that this has been accepted behaviour for that long. Furthermore, #2003053 only applies in environments where the NFS server has knowledge of users and their secondary groups and validates them for ACCESS calls (ours don't).

From the original upstream commit message 0eb43812c0270ee3d005ff32f91f7d0a6c4943af:

"While it is reasonable to expect that such group membership changes are rare, and that we do not want to optimise the cache to accommodate them, it is also not unreasonable for the user to expect that if they log out and log back in again, that the staleness would clear up".

It is clear that a trade-off was considered; however, the use case was a "user" (a physical, interactive person), not a service of any kind. I am quite certain that, had a use case with a 3-4x increase in NFS ops been known, this would not have gone in the way it did.
I understand that there are sometimes strong reasons to cherry-pick changes from upstream, or to make your own changes. IMHO, the use case for #2003053 was not strong enough to justify that. The regression risk for #2003053 was assessed as low because these were upstream changes. We now know this was not the case. With that knowledge, and compared with the weak use case the changes were trying to address, the right decision would have been to revert the changes.

The suggested upstream change to introduce a mount option to address this should be turned around: the option should be added for those wanting to zap/re-validate their access caches on re-login, leaving the default behaviour as it was.

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2015827

Title:
NFS performance issue while clearing the file access cache upon login

Status in linux package in Ubuntu:
In Progress

Bug description:
The performance issue that has been observed may be attributed to an increase in NFS ACCESS operations, possibly due to a new mechanism introduced in the Linux 6.2-rc3 NFS client side. This mechanism clears the access cache as soon as the cache timestamp becomes older than the user's login time, with the primary objective of preventing the NFS client's access cache from becoming stale due to any changes made to the user's group membership on the server after the user has already logged in on the client.

It's worth noting that POSIX only refreshes the user's supplementary group information upon login. Upstream has taken into consideration that users may reasonably expect the access cache to be cleared when they log out and log back in again, with all behavior returning to normal after the replacement.
The performance overhead can be particularly noticeable when applications or users switch to other privileged users via commands such as "su" to operate on NFS-mounted folders. In such cases, the privileged user's login time will be renewed, and NFS ACCESS operations will need to be re-sent, potentially leading to performance degradation.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2015827/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
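The mechanism in the bug description above can be modelled in a few lines of shell. This is only a conceptual sketch, not the kernel code: the `simulate` function, the `login:T` / `open:T` event notation and the single-file cache are assumptions made for illustration, and attribute-cache timeouts are ignored entirely.

```shell
#!/bin/sh
# Conceptual model only (hypothetical names, not kernel code).
# A cached ACCESS result is reused unless it is older than the most
# recent login time; in that case a new ACCESS call goes to the server.

# simulate EVENT...
#   login:T  a new login (or su/sudo) happens at time T
#   open:T   the file is opened at time T
# Prints the number of ACCESS calls that would reach the server.
simulate() {
    login_ts=0      # login time of the current session
    cache_ts=-1     # timestamp of the cached ACCESS result, -1 = none
    rpcs=0          # ACCESS calls sent to the server
    for ev in "$@"; do
        kind=${ev%%:*}
        ts=${ev#*:}
        case $kind in
            login)
                login_ts=$ts
                ;;
            open)
                # Cache miss, or cached entry older than the login time.
                if [ "$cache_ts" -lt 0 ] || [ "$cache_ts" -lt "$login_ts" ]; then
                    rpcs=$((rpcs + 1))
                    cache_ts=$ts
                fi
                ;;
        esac
    done
    echo "$rpcs"
}

# One long-lived session: only the first open costs an ACCESS call.
#   simulate open:1 open:2 open:3                           -> 1
# A fresh "login" before every open: every open costs an ACCESS call.
#   simulate login:0 open:1 login:2 open:3 login:4 open:5   -> 3
```

The second case is the pattern described in the thread: processes that are spawned frequently under a new login time never benefit from the cached result.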
[Kernel-packages] [Bug 2015827] Re: NFS performance issue while clearing the file access cache upon login
Du ChengEn,

I would second Jan's opinion. The whole chain of fixes that went in for LP: #2003053 should be rolled back. There were no strong arguments for cherry-picking those changes in the first place (they are not in upstream LTS either). Once it was discovered what kind of impact they had, they should have been rolled back.

It is now also clear that LP: #2003053 cannot be solved without impacting other use cases, and only by introducing an extra mount option; this is just another argument for reverting this set of patches.
[Kernel-packages] [Bug 2009325] Re: NFS deathlock with last Kernel 5.4.0-144.161 and 5.15.0-67.74
Hi Du ChengEn,

Thanks for your feedback and your understanding of our issue. I will be watching the NFS mailing list as well, but kindly post a reference to the bug here once it is opened. I support your idea of a separate mount option, if it is not possible to address both issues.

Looking at the overall impact in our environment, we seem to be trading some GETATTRs for ACCESS calls. While the isolated increase in ACCESS calls is very high, the combined GETATTR+ACCESS increase is not as high. However, we are still talking about 4-5x here, i.e. a 300-400% increase. As our workloads are very metadata-intensive, this will push our NFS servers to unacceptable load levels (~50k -> ~200k NFS ops).

Good news with regard to the suggested increase in LOOKUPs, though: I cannot confirm any issues here. I have been running 5.15.69, 5.15.60 and your 1st test kernel (5.15.0-67-generic #74+test20230307b2h2cbb6062f8eb) for days. Once the caches had settled, they all hovered at the same level of LOOKUP calls, so I see no regression on this matter.

Thanks in advance.

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux-aws in Ubuntu.
https://bugs.launchpad.net/bugs/2009325

Title:
NFS deathlock with last Kernel 5.4.0-144.161 and 5.15.0-67.74

Status in linux package in Ubuntu: In Progress
Status in linux-aws package in Ubuntu: Invalid
Status in linux source package in Bionic: Fix Released
Status in linux-aws source package in Bionic: Fix Released
Status in linux source package in Focal: Fix Released
Status in linux-aws source package in Focal: Fix Released
Status in linux source package in Jammy: Fix Released
Status in linux-aws source package in Jammy: Fix Released
Status in linux source package in Kinetic: Fix Released
Status in linux-aws source package in Kinetic: Fix Released
Status in linux source package in Lunar: In Progress
Status in linux-aws source package in Lunar: Invalid

Bug description:
After updating to kernel 5.4.0-144.161 on Ubuntu 18 and 5.15.0-67.74 on Ubuntu 20, we see 100% CPU utilization and 20 to 30 Mbit of traffic to the clients on our NFS servers. All clients are extremely slow when accessing NFS resources. Rebooting into an older kernel fixed the problem:

Ubuntu 18: 5.4.0-139-generic
Ubuntu 20: 5.15.0-60-generic

I do not have an NFS problem with these kernels. The problem came with the last release on March 3rd, 2023.

--- ProblemType: Bug AlsaDevices: total 0 crw-rw 1 root audio 116, 1 Mär 4 15:00 seq crw-rw 1 root audio 116, 33 Mär 4 15:00 timer AplayDevices: Error: [Errno 2] No such file or directory: 'aplay' ApportVersion: 2.20.11-0ubuntu27.25 Architecture: amd64 ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord' AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1: CasperMD5CheckResult: skip DistroRelease: Ubuntu 20.04 IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig' Lsusb: Error: command ['lsusb'] failed with exit code 1: Lsusb-t: Lsusb-v: Error: command ['lsusb', '-v'] failed with exit code 1: MachineType: VMware, Inc.
VMware Virtual Platform Package: linux (not installed) PciMultimedia: ProcFB: 0 svgadrmfb ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-5.15.0-67-generic root=/dev/mapper/vg1-root ro net.ifnames=0 biosdevname=0 kvm.nx_huge_pages=auto elevator=noop ProcVersionSignature: Ubuntu 5.15.0-67.74~20.04.1-generic 5.15.85 RelatedPackageVersions: linux-restricted-modules-5.15.0-67-generic N/A linux-backports-modules-5.15.0-67-generic N/A linux-firmware 1.187.36 RfKill: Error: [Errno 2] No such file or directory: 'rfkill' Tags: focal Uname: Linux 5.15.0-67-generic x86_64 UpgradeStatus: No upgrade log present (probably fresh install) UserGroups: N/A _MarkForUpload: True dmi.bios.date: 11/12/2020 dmi.bios.release: 4.6 dmi.bios.vendor: Phoenix Technologies LTD dmi.bios.version: 6.00 dmi.board.name: 440BX Desktop Reference Platform dmi.board.vendor: Intel Corporation dmi.board.version: None dmi.chassis.asset.tag: No Asset Tag dmi.chassis.type: 1 dmi.chassis.vendor: No Enclosure dmi.chassis.version: N/A dmi.ec.firmware.release: 0.0 dmi.modalias: dmi:bvnPhoenixTechnologiesLTD:bvr6.00:bd11/12/2020:br4.6:efr0.0:svnVMware,Inc.:pnVMwareVirtualPlatform:pvrNone:rvnIntelCorporation:rn440BXDesktopReferencePlatform:rvrNone:cvnNoEnclosure:ct1:cvrN/A:sku: dmi.product.name: VMware Virtual Platform dmi.product.version: None dmi.sys.vendor: VMware, Inc. --- ProblemType: Bug AlsaDevices: total 0 crw-rw 1 root audio 116, 1 Mär 4 15:03 seq crw-rw 1 root audio 116, 33 Mär 4 15:03 timer AplayDevices: Error: [Errno 2] No such file or directory: 'aplay': 'aplay' ApportVersion: 2.20.9-0ubuntu7.28 Architecture: amd64 ArecordDe
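The "4-5x, i.e. 300-400% increase" figure in the message above mixes a multiplier with a percentage increase, which are easy to confuse. A minimal sketch of the arithmetic follows; the helper names `multiplier` and `pct_increase` are hypothetical, and the only inputs taken from the thread are the approximate ~50k -> ~200k NFS ops figures.

```shell
#!/bin/sh
# Hypothetical helpers (not from the thread): relate a "before -> after"
# operation rate to a multiplier and to a percentage increase.

# multiplier BEFORE AFTER
# Prints AFTER / BEFORE with one decimal place.
multiplier() {
    awk -v b="$1" -v a="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

# pct_increase BEFORE AFTER
# Prints (AFTER - BEFORE) / BEFORE * 100, rounded to an integer.
pct_increase() {
    awk -v b="$1" -v a="$2" 'BEGIN { printf "%.0f\n", (a - b) / b * 100 }'
}

# Approximate figures quoted in the message: ~50k -> ~200k NFS ops.
multiplier 50000 200000      # prints 4.0
pct_increase 50000 200000    # prints 300
```

A 4x load is therefore a 300% increase, and 5x is a 400% increase, which matches the range given in the message.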
[Kernel-packages] [Bug 2009325] Re: NFS deathlock with last Kernel 5.4.0-144.161 and 5.15.0-67.74
Hi Du ChengEn,

Thanks for the clarification on the test kernels. I am sad that I did not get to test your 2nd test kernel, as that would have revealed this issue immediately.

While the changes for LP: #2003053 went in to fix a "bug", they also created this major regression. But the bug fix just tries to get around the metadata inconsistencies that have always been an issue with NFS (by design). Not a hard case IMHO, and also "rare", as pointed out in the original commit message. I follow Jan here: these changes should be reverted, as they create more problems than they tried to fix. That does not stop anybody from working on an upstream solution that would fit both purposes.

For whatever it is worth, this reproducer points out the pain point:

$ cd /nfsdir
$ touch myfiles.{1..1000}
$ md5sum myfiles.{1..1000} > /dev/null
$ sudo -u <user> md5sum myfiles.{1..1000} > /dev/null

The latter 'md5sum' command produces 1000 GETATTR and 1000 ACCESS calls. The GETATTRs are there to ensure close-to-open consistency, which is fine. However, ACCESS calls should not be produced in this case, which shows the issue.

This is even more pronounced if the filesystem is mounted using 'nocto' (as we do). The GETATTRs are gone (as they should be). The ACCESS calls, however, remain. This is where we get that huge increase in numbers.

With regard to a potential increase in LOOKUPs, it is still too early to say. Caches are still not settled (lightly loaded system).

So, where do we go from here: create a new bug? Give feedback on the linux-nfs mailing list? I can see you already made a post to the mailing list. I will be happy to be of assistance, if you see fit.
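One way to put numbers on the reproducer in the message above is to compare per-operation counters before and after a command. The sketch below assumes a Linux NFS client that exposes `/proc/self/mountstats` with a "per-op statistics" section whose first number per line is the operation count, and it assumes the mount point contains no spaces. The function names `op_count` and `ops_delta` are hypothetical, and `/nfsdir` is simply the directory used in the reproducer.

```shell
#!/bin/sh
# Sketch under stated assumptions; helper names are hypothetical.

# op_count MOUNTPOINT OP
# Reads mountstats-formatted text on stdin and prints the operation
# count for OP (e.g. ACCESS, GETATTR) on MOUNTPOINT, or 0 if not found.
op_count() {
    awk -v mp="$1" -v op="$2:" '
        # "device SRC mounted on MOUNTPOINT with fstype ..." starts a section
        $1 == "device" { cur = ($5 == mp) }
        # Inside the wanted section, the per-op line looks like "ACCESS: N ..."
        cur && $1 == op { print $2; found = 1; exit }
        END { if (!found) print 0 }
    '
}

# ops_delta MOUNTPOINT OP COMMAND [ARGS...]
# Runs COMMAND and prints how many OP calls were issued on MOUNTPOINT
# while it ran (other activity on the same mount is counted too).
ops_delta() {
    mp=$1
    op=$2
    shift 2
    before=$(op_count "$mp" "$op" < /proc/self/mountstats)
    "$@" > /dev/null
    after=$(op_count "$mp" "$op" < /proc/self/mountstats)
    echo $((after - before))
}

# Example (not executed here):
#   ops_delta /nfsdir ACCESS md5sum /nfsdir/myfiles.*
```

Running the same command once as the current user and once through `sudo -u` should, on an affected kernel, show the ACCESS delta jumping in the second case if the behaviour described in the message holds.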
[Kernel-packages] [Bug 2009325] Re: NFS deathlock with last Kernel 5.4.0-144.161 and 5.15.0-67.74
Hi Du ChengEn,

Thanks for your efforts in making this change for this SRU cycle. However, I can confirm that we too are facing challenges with the newly released kernels: increases in ACCESS calls of an order of magnitude (>10x). They behave better than (or at least differently from) 5.15.0-67, but are not suitable for production.

Kernel 5.15.0-69 behaves significantly differently from test kernel 5.15.0-67.74+test20230307b2h2cbb6062f8eb. I do not think 5.15.0-67.74+test20230308b0h1a13a615ee3 was ever available from your PPA? We ran 5.15.0-67.74+test20230307b2h2cbb6062f8eb for 2 weeks without issues.

In what looks like a similar (apache2) environment to Jan's and Ben's, we have short(er)-lived processes in a suexec-like setup, relying heavily on caching using nocto and increased attribute-cache (AC) timeout options.

While I understand the reasoning for avoiding stale access caches and clearing them in response to group-membership changes, I do not think it was ever the intention to break long-standing default behaviour.

As of now, I cannot confirm issues with increased LOOKUPs on our end, as our caches are still warming.