[Kernel-packages] [Bug 2015827] Re: NFS performance issue while clearing the file access cache upon login

2023-05-03 Thread Allan G Soeby
Hi @kleber-souza, @chengendu

Thanks for your attention.

Allow me to give my view of the impact of fixing "bug" LP: #2003053.

The original patchset introduced *two* regressions. One (the NFS
deadlock) hit everybody and was fixed by #2009325. The remaining one is
now hitting those of us who spawn new user processes frequently, which
causes new "login times" to be created and the access cache to be
zapped. As a result we are looking at a 300-400% increase in *overall*
NFS operations, making the current kernels unusable for production. We
do not have that kind of headroom on our NFS servers.
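
To illustrate why frequently spawned user processes are hit so hard, here
is a toy model in Python. It is only a sketch under simplifying
assumptions and not the kernel's implementation: the names
`ToyAccessCache` and `simulate` are hypothetical, and the "login time" is
simply modelled as a new timestamp per short-lived process.

```python
from dataclasses import dataclass, field


@dataclass
class ToyAccessCache:
    """Toy model of an access cache keyed by (inode, uid). Not kernel code."""
    # When True, entries cached before the caller's login time are treated as stale.
    zap_on_new_login: bool = True
    entries: dict = field(default_factory=dict)  # (inode, uid) -> time cached
    access_rpcs: int = 0                         # ACCESS calls sent to the server

    def check(self, inode: int, uid: int, login_time: float, now: float) -> None:
        cached_at = self.entries.get((inode, uid))
        stale = (cached_at is not None
                 and self.zap_on_new_login
                 and cached_at < login_time)
        if cached_at is None or stale:
            self.access_rpcs += 1                # cache miss: ask the server
            self.entries[(inode, uid)] = now


def simulate(zap_on_new_login: bool, processes: int = 5, files: int = 1000) -> int:
    """Each short-lived process gets a fresh login time and touches every file."""
    cache = ToyAccessCache(zap_on_new_login=zap_on_new_login)
    for p in range(processes):
        login_time = float(p * 10)               # new process => new login time
        for inode in range(files):
            cache.check(inode, uid=1000, login_time=login_time,
                        now=login_time + 1.0)
    return cache.access_rpcs
```

In this model, `simulate(True)` sends one ACCESS call per file for every
new process, while `simulate(False)` only pays for the first process and
serves the rest from the cache.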

The result is that we are simply stuck with kernels prior to the
#2003053 fixes. With recent CVE fixes in current kernels, we have now
also resorted to building our own kernels. This is very
counter-productive.

I understand the use case for the changes that went into "bug"
#2003053. The reason I call this a "bug" (in quotes) is that the
behaviour has been around for more than 15 years. While age alone is
not a qualifier, I am just saying that this has been accepted behaviour
for that long. Furthermore, #2003053 only applies in environments where
the NFS server knows about users and their secondary groups and
validates them for ACCESS calls (ours don't).

From the original upstream commit message
0eb43812c0270ee3d005ff32f91f7d0a6c4943af : "While it is reasonable to
expect that such group membership changes are rare, and that we do not
want to optimise the cache to accommodate them, it is also not
unreasonable for the user to expect that if they log out and log back in
again, that the staleness would clear up".

It is clear that a trade-off was considered; however, the use case was a
"user" (a physical, interactive person), not a service of any kind. I am
quite certain that, had a use case with a 3-4x increase in NFS ops been
considered, this would not have gone in the way it did.

I understand that sometimes there are strong reasons to cherry-pick
changes from upstream, or to make your own changes. IMHO, the use case
for #2003053 was not strong enough to justify that.

The regression potential for #2003053 was assessed as low, as these
were upstream changes. We now know that this was not the case.

With that knowledge, and comparing it to the weak use case the changes
were trying to address, the right decision would have been to revert
the changes.

The suggested upstream change to introduce a mount option to address
this should be turned around. The option should be added for those
wanting to zap/re-validate their access caches on re-login, while
leaving the default behaviour as it was.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2015827

Title:
  NFS performance issue while clearing the file access cache upon login

Status in linux package in Ubuntu:
  In Progress

Bug description:
  The performance issue that has been observed may be attributed to an
  increase in NFS ACCESS operations, possibly due to a new mechanism
  introduced in the Linux 6.2-rc3 NFS client side.
  This mechanism clears the access cache as soon as the cache timestamp
  becomes older than the user's login time, with the primary objective of
  preventing the NFS client's access cache from becoming stale due to any
  changes made to the user's group membership on the server after the
  user has already logged in on the client.

  It's worth noting that POSIX only refreshes the user's supplementary
  group information upon login.
  Upstream has taken into consideration that users may reasonably expect
  the access cache to be cleared when they log out and log back in again,
  with all behavior returning to normal after the replacement.

  The performance overhead can be particularly noticeable when
  applications or users switch to other privileged users via commands
  such as "su" to operate on NFS-mounted folders.
  In such cases, the privileged user's login time will be renewed, and
  NFS ACCESS operations will need to be re-sent, potentially leading to
  performance degradation.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2015827/+subscriptions


-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp


[Kernel-packages] [Bug 2015827] Re: NFS performance issue while clearing the file access cache upon login

2023-04-12 Thread Allan G Soeby
Du ChengEn, I would second Jan's opinion.

This whole chain of fixes that has gone in to fix LP: #2003053 should
be rolled back. There were no strong arguments for cherry-picking those
changes in the first place (they are not in upstream LTS either).

Once it was discovered what kind of impact it had, it should have been
rolled back.

It is also now clear that LP: #2003053 cannot be solved without
impacting other use cases, except by introducing an extra mount option.
This is just another argument for reverting this set of patches.



[Kernel-packages] [Bug 2009325] Re: NFS deathlock with last Kernel 5.4.0-144.161 and 5.15.0-67.74

2023-03-31 Thread Allan G Soeby
Hi Du ChengEn,

Thanks for your feedback and for understanding our issue.

I will be watching the nfs mailing list as well, but kindly post
references to the bug here, once the bug is opened.

I support your idea of a separate mount option, if it is not possible
to address both issues.

Looking at the overall impact of this in our environment, we seem to be
trading some GETATTRs for ACCESS calls. While the isolated increase in
ACCESS calls is very high, the combined GETATTR+ACCESS increase is not
as high. However, we are still talking about 4-5x here, i.e. a 300-400%
increase.

As our workloads are very metadata-intensive, this will push our NFS
servers to unacceptable load levels (~50k -> ~200k NFS ops).
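
For clarity on how the multiplier and the percentage figures relate, here
is a small worked example in Python. The two helper functions are
hypothetical names introduced only for illustration; the input numbers
are the approximate figures quoted above.

```python
def percent_increase(before: float, after: float) -> float:
    """Relative increase of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


def multiplier(before: float, after: float) -> float:
    """How many times larger `after` is than `before`."""
    return after / before


before_ops, after_ops = 50_000, 200_000          # approximate NFS ops, as quoted
print(multiplier(before_ops, after_ops))         # 4.0
print(percent_increase(before_ops, after_ops))   # 300.0
```

So a 4x load corresponds to a 300% increase, and a 5x load to a 400%
increase.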

Good news with regard to the suggested increase in LOOKUPs, though: I
cannot confirm any issues here. I have been running 5.15.69, 5.15.60 and
your 1st test kernel (5.15.0-67-generic #74+test20230307b2h2cbb6062f8eb)
for days. Once the caches had settled, they were all hovering at the
same level of LOOKUP calls. So I see no regression on this matter.

Thanks in advance.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux-aws in Ubuntu.
https://bugs.launchpad.net/bugs/2009325

Title:
  NFS deathlock with last Kernel 5.4.0-144.161 and 5.15.0-67.74

Status in linux package in Ubuntu:
  In Progress
Status in linux-aws package in Ubuntu:
  Invalid
Status in linux source package in Bionic:
  Fix Released
Status in linux-aws source package in Bionic:
  Fix Released
Status in linux source package in Focal:
  Fix Released
Status in linux-aws source package in Focal:
  Fix Released
Status in linux source package in Jammy:
  Fix Released
Status in linux-aws source package in Jammy:
  Fix Released
Status in linux source package in Kinetic:
  Fix Released
Status in linux-aws source package in Kinetic:
  Fix Released
Status in linux source package in Lunar:
  In Progress
Status in linux-aws source package in Lunar:
  Invalid

Bug description:
  After updating to kernel
  5.4.0-144.161 on Ubuntu 18 and
  5.15.0-67.74 on Ubuntu 20,
  we see 100% CPU utilization and 20 to 30 Mbit of traffic to the clients
  on our NFS servers.

  All clients are extremely slow when accessing the NFS resources.

  Restarting with an older kernel fixed the problem:
  Ubuntu 18 5.4.0-139-generic
  Ubuntu 20 5.15.0-60-generic
  I don't have an NFS problem with these kernels.

  The problem came with the last release on March 3rd, 2023.
  --- 
  ProblemType: Bug
  AlsaDevices:
   total 0
   crw-rw 1 root audio 116,  1 Mär  4 15:00 seq
   crw-rw 1 root audio 116, 33 Mär  4 15:00 timer
  AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
  ApportVersion: 2.20.11-0ubuntu27.25
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
  AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', 
'/dev/snd/timer'] failed with exit code 1:
  CasperMD5CheckResult: skip
  DistroRelease: Ubuntu 20.04
  IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
  Lsusb: Error: command ['lsusb'] failed with exit code 1:
  Lsusb-t:
   
  Lsusb-v: Error: command ['lsusb', '-v'] failed with exit code 1:
  MachineType: VMware, Inc. VMware Virtual Platform
  Package: linux (not installed)
  PciMultimedia:
   
  ProcFB: 0 svgadrmfb
  ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-5.15.0-67-generic 
root=/dev/mapper/vg1-root ro net.ifnames=0 biosdevname=0 kvm.nx_huge_pages=auto 
elevator=noop
  ProcVersionSignature: Ubuntu 5.15.0-67.74~20.04.1-generic 5.15.85
  RelatedPackageVersions:
   linux-restricted-modules-5.15.0-67-generic N/A
   linux-backports-modules-5.15.0-67-generic  N/A
   linux-firmware 1.187.36
  RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
  Tags:  focal
  Uname: Linux 5.15.0-67-generic x86_64
  UpgradeStatus: No upgrade log present (probably fresh install)
  UserGroups: N/A
  _MarkForUpload: True
  dmi.bios.date: 11/12/2020
  dmi.bios.release: 4.6
  dmi.bios.vendor: Phoenix Technologies LTD
  dmi.bios.version: 6.00
  dmi.board.name: 440BX Desktop Reference Platform
  dmi.board.vendor: Intel Corporation
  dmi.board.version: None
  dmi.chassis.asset.tag: No Asset Tag
  dmi.chassis.type: 1
  dmi.chassis.vendor: No Enclosure
  dmi.chassis.version: N/A
  dmi.ec.firmware.release: 0.0
  dmi.modalias: 
dmi:bvnPhoenixTechnologiesLTD:bvr6.00:bd11/12/2020:br4.6:efr0.0:svnVMware,Inc.:pnVMwareVirtualPlatform:pvrNone:rvnIntelCorporation:rn440BXDesktopReferencePlatform:rvrNone:cvnNoEnclosure:ct1:cvrN/A:sku:
  dmi.product.name: VMware Virtual Platform
  dmi.product.version: None
  dmi.sys.vendor: VMware, Inc.
  --- 
  ProblemType: Bug
  AlsaDevices:
   total 0
   crw-rw 1 root audio 116,  1 Mär  4 15:03 seq
   crw-rw 1 root audio 116, 33 Mär  4 15:03 timer
  AplayDevices: Error: [Errno 2] No such file or directory: 'aplay': 'aplay'
  ApportVersion: 2.20.9-0ubuntu7.28
  Architecture: amd64
  

[Kernel-packages] [Bug 2009325] Re: NFS deathlock with last Kernel 5.4.0-144.161 and 5.15.0-67.74

2023-03-29 Thread Allan G Soeby
Hi Du ChengEn,

Thanks for the clarification on the test kernels. I am sorry I did not
get to test your 2nd test kernel, as that would have revealed this issue
immediately.

While the changes for LP: #2003053 went in to fix a "bug", they also
created this major regression. And the bug fix just tries to get around
the metadata inconsistencies that have always been an issue with NFS
(by design). Not a strong case IMHO, and also "rare", as pointed out in
the original commit message.

I agree with Jan here that these changes should be reverted, as they
create more problems than they tried to fix.

That does not stop anybody from working on an upstream solution that
would fit both purposes.

For whatever it is worth, this reproducer points out the pain point:

$ cd /nfsdir
$ touch myfiles.{1..1000}
$ md5sum myfiles.{1..1000} > /dev/null
$ sudo -u  md5sum myfiles.{1..1000} > /dev/null

The latter 'md5sum' command produces 1000 GETATTR and 1000 ACCESS calls.
The GETATTRs are there to ensure close-to-open consistency, which is
fine. However, ACCESS calls should not be produced in this case, which
shows the issue.

This is even more pronounced if the file system is mounted using
'nocto' (as ours is). The GETATTRs are gone (as they should be). The
ACCESS calls, however, remain. This is where we get that huge increase
in numbers.
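
One way to put numbers on the GETATTR and ACCESS calls of such a
reproducer is to compare the client's per-operation counters before and
after the command. The Python sketch below parses text in the layout of
/proc/self/mountstats; it assumes that each line under "per-op
statistics" has the form `OPNAME: <count> ...` with the operation count
as the first number. The function names `op_counts` and `delta` are
hypothetical helpers, not part of any existing tool.

```python
import re


def op_counts(mountstats_text: str, mount_point: str) -> dict:
    """Return {operation: count} for one NFS mount (assumed mountstats layout)."""
    counts = {}
    in_mount = False   # inside the section for the requested mount point
    in_ops = False     # past the "per-op statistics" marker of that section
    for line in mountstats_text.splitlines():
        if line.startswith("device "):
            in_mount = f" mounted on {mount_point} with fstype nfs" in line
            in_ops = False
        elif in_mount and line.strip() == "per-op statistics":
            in_ops = True
        elif in_mount and in_ops:
            m = re.match(r"\s*([A-Z_0-9]+):\s+(\d+)", line)
            if m:
                counts[m.group(1)] = int(m.group(2))
    return counts


def delta(before: dict, after: dict, ops=("GETATTR", "ACCESS")) -> dict:
    """Difference in call counts between two snapshots for the given operations."""
    return {op: after.get(op, 0) - before.get(op, 0) for op in ops}
```

To use it, read /proc/self/mountstats once before and once after the
second 'md5sum' run, pass both texts to `op_counts` with the mount point
(for example "/nfsdir"), and compare the two results with `delta`.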

With regard to the potential increase in LOOKUPs, it is still too early
to say. The caches have not settled yet (lightly loaded system).

So, where do we go from here: create a new bug? Give feedback on the
linux-nfs mailing list?

I can see you have already posted to the mailing list. I will be happy
to be of assistance, if you see fit.


[Kernel-packages] [Bug 2009325] Re: NFS deathlock with last Kernel 5.4.0-144.161 and 5.15.0-67.74

2023-03-28 Thread Allan G Soeby
Hi Du ChengEn,

Thanks for your efforts in making this change for this SRU cycle.

However, I can confirm that we too are facing challenges with
order-of-magnitude (>10x) increases in ACCESS calls with the newly
released kernels. They are still better than (or at least different
from) 5.15.0-67, but not suitable for production.

Kernel 5.15.0-69 behaves significantly differently from test kernel
5.15.0-67.74+test20230307b2h2cbb6062f8eb. I do not think
5.15.0-67.74+test20230308b0h1a13a615ee3 was ever available from your
PPA? We ran 5.15.0-67.74+test20230307b2h2cbb6062f8eb for 2 weeks without
issues.

In what looks like an environment similar to Jan's and Ben's (apache2),
we have short(er)-lived processes with a suexec-like setup that relies
heavily on caching, using nocto and increased AC-timeout options.

While I understand the reasoning for avoiding stale access caches, and
for clearing them when group membership changes, I do not think it was
ever the intention to break long-standing default behaviour.

As of now I cannot confirm issues with increased LOOKUPs on our end, as
our caches are still warming up.
