[Kernel-packages] [Bug 2050098] Re: cgroup2 broken since 5.15.0-90-generic?
** Package changed: linux-signed (Ubuntu) => slurm-wlm (Ubuntu)

** Changed in: slurm-wlm (Ubuntu)
       Status: Confirmed => Invalid

** Summary changed:

- cgroup2 broken since 5.15.0-90-generic?
+ load_ebpf_prog() fails for long bpf() logs

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux-signed in Ubuntu.
https://bugs.launchpad.net/bugs/2050098

Title:
  load_ebpf_prog() fails for long bpf() logs

Status in slurm-wlm package in Ubuntu:
  Invalid

Bug description:
  We're using the Slurm workload manager in a cluster running Ubuntu
  22.04 with the linux-generic kernel (amd64). We use cgroups (cgroup2)
  for resource allocation with Slurm.

  With kernel version linux-image-5.15.0-91-generic 5.15.0-91.101 amd64
  I'm seeing a new issue. It must have been introduced recently: I can
  confirm that the issue does not exist with kernel 5.15.0-88-generic.

  When I request a single GPU on a node with kernel 5.15.0-88-generic,
  all is well:

  $ srun -G 1 -w gpu59 nvidia-smi -L
  GPU 0: NVIDIA [...]

  With kernel 5.15.0-91-generic, however:

  $ srun -G 1 -w gpu59 nvidia-smi -L
  slurmstepd: error: load_ebpf_prog: BPF load error (No space left on device). Please check your system limits (MEMLOCK).
  GPU 0: NVIDIA [...]
  GPU 1: NVIDIA [...]
  GPU 2: NVIDIA [...]
  GPU 3: NVIDIA [...]
  GPU 4: NVIDIA [...]
  GPU 5: NVIDIA [...]
  GPU 6: NVIDIA [...]
  GPU 7: NVIDIA [...]

  So I get this error about the MEMLOCK limit and see all GPUs in the
  system instead of only the one requested. Hence I assume the problem
  is related to cgroups.

  $ cat /proc/version_signature
  Ubuntu 5.15.0-91.101-generic 5.15.131

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/slurm-wlm/+bug/2050098/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
[Kernel-packages] [Bug 2050098] Re: cgroup2 broken since 5.15.0-90-generic?
So it turns out this is not a kernel bug after all. As @hedrick mentioned, it is indeed related to the bpf() logs; I suppose kernel 5.15 simply produces longer verifier logs here than the newer kernels do. Here is the original Slurm bug report, which includes a patch: https://bugs.schedmd.com/show_bug.cgi?id=17210. Please close this bug report here.

** Bug watch added: bugs.schedmd.com/ #17210
   https://bugs.schedmd.com/show_bug.cgi?id=17210
[Kernel-packages] [Bug 2050098] Re: cgroup2 broken since 5.15.0-90-generic?
@hedrick: Regarding the workaround you mentioned above, I would guess it only suppresses the error message but doesn't fix the broken cgroup confinement. Is that correct?

I've done more testing and have identified the following commit:

https://github.com/torvalds/linux/commit/2516deeb872ab9dda3bf4c66cf24b1ee900f25bf

Reverting that one (and the follow-up 683d2969a0820072ddc05dbcc667664f7a34ac90) makes things work again as far as I can tell. I haven't tested that on the Ubuntu source yet, so far only on upstream 5.15.127. The only explanation I can imagine for why newer kernels work is that this is an incomplete backport.
[Kernel-packages] [Bug 2050098] Re: cgroup2 broken since 5.15.0-90-generic?
Furthermore, Slurm doesn't actually use the log data that the kernel would pass back, so my change is an improvement even on kernels that work. Why pass a log buffer that you aren't going to use?
[Kernel-packages] [Bug 2050098] Re: cgroup2 broken since 5.15.0-90-generic?
However, the same patch is in 6.5, which works fine. I give up. At least I have a workaround.
[Kernel-packages] [Bug 2050098] Re: cgroup2 broken since 5.15.0-90-generic?
From the timing I suspect kernel.org commit 2dcb31e65d26a29a6842500e904907180e80a091, but I don't understand the code, so I can't tell whether there's actually a problem there.
[Kernel-packages] [Bug 2050098] Re: cgroup2 broken since 5.15.0-90-generic?
In src/plugins/cgroup/v2/ebpf.c, comment out the logging, i.e. change

    attr.log_level = 1;
    attr.log_buf = (size_t) log;
    attr.log_size = sizeof(log);

to

    attr.log_level = 0;
    attr.log_buf = NULL;
    attr.log_size = 0;

I think you'll find that this fixes it. I have no idea why this is a problem in this specific kernel release.
[Kernel-packages] [Bug 2050098] Re: cgroup2 broken since 5.15.0-90-generic?
I'm concerned about the security implications of being frozen on a kernel we can't update. I suspect we should start testing an HWE kernel. That raises other issues, since there are many other features we also need to keep working, so it's going to require a fair amount of testing. I'd accelerate our move to 24.04, except there is too much uncertainty about updates to user packages, which often lag new system versions.
[Kernel-packages] [Bug 2050098] Re: cgroup2 broken since 5.15.0-90-generic?
We are also seeing this. We're now stuck on old kernels until it is fixed.
[Kernel-packages] [Bug 2050098] Re: cgroup2 broken since 5.15.0-90-generic?
** Summary changed:

- cgroup2 appears to be broken
+ cgroup2 broken since 5.15.0-90-generic?