[slurm-users] Re: Restricting local disk storage of jobs
Hey Jeffrey, thanks for this suggestion! This is probably the way to go if one can find a way to access GRES in the prolog. I read somewhere that people were calling scontrol to get this information, but this seems a bit unclean. Anyway, if I find some time I will try it out.

Best, Tim

On 2/6/24 16:30, Jeffrey T Frey wrote:

Most of my ideas have revolved around creating file systems on-the-fly as part of the job prolog and destroying them in the epilog. The issue with that mechanism is that formatting a file system (e.g. mkfs) can be time-consuming. E.g. if you format your local scratch SSD as an LVM PV+VG and allocate per-job volumes, you'd still need to run e.g. mkfs.xfs and mount the new file system. ZFS file system creation is much quicker (it basically combines the LVM + mkfs steps above), but I don't know of any clusters using ZFS to manage local file systems on the compute nodes :-)

One /could/ leverage XFS project quotas. E.g. for Slurm job 2147483647:

[root@r00n00 /]# mkdir /tmp-alloc/slurm-2147483647
[root@r00n00 /]# xfs_quota -x -c 'project -s -p /tmp-alloc/slurm-2147483647 2147483647' /tmp-alloc
Setting up project 2147483647 (path /tmp-alloc/slurm-2147483647)...
Processed 1 (/etc/projects and cmdline) paths for project 2147483647 with recursion depth infinite (-1).
[root@r00n00 /]# xfs_quota -x -c 'limit -p bhard=1g 2147483647' /tmp-alloc
[root@r00n00 /]# cd /tmp-alloc/slurm-2147483647
[root@r00n00 slurm-2147483647]# dd if=/dev/zero of=zeroes bs=5M count=1000
dd: error writing ‘zeroes’: No space left on device
205+0 records in
204+0 records out
1073741824 bytes (1.1 GB) copied, 2.92232 s, 367 MB/s
:
[root@r00n00 /]# rm -rf /tmp-alloc/slurm-2147483647
[root@r00n00 /]# xfs_quota -x -c 'limit -p bhard=0 2147483647' /tmp-alloc

Since Slurm job ids max out at 0x03FFFFFF (and 2147483647 = 0x7FFFFFFF), we have an easy on-demand project id to use on the file system. Slurm tmpfs plugins have to do a mkdir to create the per-job directory; adding two xfs_quota commands (which run in more or less O(1) time) won't extend the prolog by much. Likewise, Slurm tmpfs plugins have to scrub the directory at job cleanup, so adding another xfs_quota command will not do much to change their epilog execution times. The main question is: where does the tmpfs plugin find the quota limit for the job?

On Feb 6, 2024, at 08:39, Tim Schneider via slurm-users wrote:

Hi,

In our SLURM cluster, we are using the job_container/tmpfs plugin to ensure that each user can use /tmp and it gets cleaned up after them. Currently, we are mapping /tmp into the node's RAM, which means that the cgroups make sure that users can only use a certain amount of storage inside /tmp.

Now we would like to use the node's local SSD instead of its RAM to hold the files in /tmp. I have seen people define local storage as GRES, but I am wondering how to make sure that users do not exceed the storage space they requested in a job. Does anyone have an idea how to configure local storage as a proper tracked resource?

Thanks a lot in advance!

Best, Tim
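To illustrate the prolog approach discussed above, here is a minimal sketch. It assumes local scratch is requested as a custom count-only GRES called "tmpdisk" (an arbitrary name, counted in GB), that /tmp-alloc is an XFS file system mounted with project quotas enabled, and that the requested amount is recovered by parsing "scontrol show job"; the exact field layout of that output varies between Slurm versions, so treat the grep as an example rather than something guaranteed to work:

#!/bin/bash
# prolog sketch: create a per-job scratch dir and cap it with an XFS project quota
set -euo pipefail
BASE=/tmp-alloc                       # assumed XFS mount with prjquota enabled
DIR="$BASE/slurm-$SLURM_JOB_ID"
# How much "tmpdisk" did the job request? (field name/format is version-dependent)
GB=$(scontrol show job "$SLURM_JOB_ID" | grep -oP 'tmpdisk[:=]\K[0-9]+' | head -n1)
GB=${GB:-10}                          # fall back to a small default limit
mkdir -p "$DIR"
chown "$SLURM_JOB_USER" "$DIR"
xfs_quota -x -c "project -s -p $DIR $SLURM_JOB_ID" "$BASE"
xfs_quota -x -c "limit -p bhard=${GB}g $SLURM_JOB_ID" "$BASE"

The matching epilog would drop the limit again ("limit -p bhard=0 $SLURM_JOB_ID") and remove the directory, as in Jeffrey's transcript above.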
[slurm-users] Re: [ext] Restricting local disk storage of jobs
Hi Magnus,

I understand. Thanks a lot for your suggestion.

Best, Tim

On 06.02.24 15:34, Hagdorn, Magnus Karl Moritz wrote:

Hi Tim, in the end the InitScript didn't contain anything useful, because slurmd rejected the option:

slurmd: error: _parse_next_key: Parsing error at unrecognized key: InitScript

At this stage I gave up. This was with SLURM 23.02. My plan was to set up the local scratch directory with XFS and then get the script to apply a project quota, i.e. a quota attached to the directory. I would start by checking if slurm recognises the InitScript option.

Regards, magnus

On Tue, 2024-02-06 at 15:24 +0100, Tim Schneider wrote:

Hi Magnus, thanks for your reply! If you can, would you mind sharing the InitScript of your attempt at getting it to work?

Best, Tim

On 06.02.24 15:19, Hagdorn, Magnus Karl Moritz wrote:

Hi Tim, we are using the container/tmpfs plugin to map /tmp to a local NVMe drive, which works great. I did consider setting up directory quotas. I thought the InitScript [1] option should do the trick. Alas, I didn't get it to work. If I remember correctly, slurm complained about the option being present. In the end we recommend our users to make exclusive use of a node if they are going to use a lot of local scratch space. I don't think this happens very often, if at all.

Regards, magnus

[1] https://slurm.schedmd.com/job_container.conf.html#OPT_InitScript

On Tue, 2024-02-06 at 14:39 +0100, Tim Schneider via slurm-users wrote:

Hi,

In our SLURM cluster, we are using the job_container/tmpfs plugin to ensure that each user can use /tmp and it gets cleaned up after them. Currently, we are mapping /tmp into the node's RAM, which means that the cgroups make sure that users can only use a certain amount of storage inside /tmp.

Now we would like to use the node's local SSD instead of its RAM to hold the files in /tmp. I have seen people define local storage as GRES, but I am wondering how to make sure that users do not exceed the storage space they requested in a job. Does anyone have an idea how to configure local storage as a proper tracked resource?

Thanks a lot in advance!

Best, Tim
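For reference, the setup Magnus describes would look roughly like the sketch below. InitScript is the documented job_container.conf option linked above, but whether a given Slurm version accepts it (23.02 apparently did not here), whether SLURM_JOB_ID is exported to the script, and the directory layout under BasePath are all assumptions; paths and the 100g limit are examples only.

# /etc/slurm/job_container.conf (sketch)
AutoBasePath=true
BasePath=/local/scratch
InitScript=/etc/slurm/tmpfs_init.sh

#!/bin/bash
# /etc/slurm/tmpfs_init.sh (sketch): attach an XFS project quota to the
# per-job directory the plugin creates under BasePath (layout assumed)
JOBDIR=/local/scratch/$SLURM_JOB_ID
xfs_quota -x -c "project -s -p $JOBDIR $SLURM_JOB_ID" /local/scratch
xfs_quota -x -c "limit -p bhard=100g $SLURM_JOB_ID" /local/scratch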
[slurm-users] Re: [ext] Restricting local disk storage of jobs
Hi Magnus,

thanks for your reply! If you can, would you mind sharing the InitScript of your attempt at getting it to work?

Best, Tim

On 06.02.24 15:19, Hagdorn, Magnus Karl Moritz wrote:

Hi Tim, we are using the container/tmpfs plugin to map /tmp to a local NVMe drive, which works great. I did consider setting up directory quotas. I thought the InitScript [1] option should do the trick. Alas, I didn't get it to work. If I remember correctly, slurm complained about the option being present. In the end we recommend our users to make exclusive use of a node if they are going to use a lot of local scratch space. I don't think this happens very often, if at all.

Regards, magnus

[1] https://slurm.schedmd.com/job_container.conf.html#OPT_InitScript

On Tue, 2024-02-06 at 14:39 +0100, Tim Schneider via slurm-users wrote:

Hi,

In our SLURM cluster, we are using the job_container/tmpfs plugin to ensure that each user can use /tmp and it gets cleaned up after them. Currently, we are mapping /tmp into the node's RAM, which means that the cgroups make sure that users can only use a certain amount of storage inside /tmp.

Now we would like to use the node's local SSD instead of its RAM to hold the files in /tmp. I have seen people define local storage as GRES, but I am wondering how to make sure that users do not exceed the storage space they requested in a job. Does anyone have an idea how to configure local storage as a proper tracked resource?

Thanks a lot in advance!

Best, Tim
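For anyone who has not used it, the plugin Magnus mentions is wired up roughly as follows; the BasePath value is an example (it should point at the node-local NVMe/SSD mount), and the rest follows the job_container.conf documentation:

# slurm.conf (sketch)
JobContainerType=job_container/tmpfs
PrologFlags=Contain          # required so the per-job namespace is set up at job start

# /etc/slurm/job_container.conf (sketch)
AutoBasePath=true
BasePath=/local/scratch      # node-local SSD/NVMe mount point

With this in place each job sees a private /tmp backed by a directory under BasePath, which is the thing the rest of this thread is about restricting in size.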
[slurm-users] Restricting local disk storage of jobs
Hi,

In our SLURM cluster, we are using the job_container/tmpfs plugin to ensure that each user can use /tmp and it gets cleaned up after them. Currently, we are mapping /tmp into the node's RAM, which means that the cgroups make sure that users can only use a certain amount of storage inside /tmp.

Now we would like to use the node's local SSD instead of its RAM to hold the files in /tmp. I have seen people define local storage as GRES, but I am wondering how to make sure that users do not exceed the storage space they requested in a job. Does anyone have an idea how to configure local storage as a proper tracked resource?

Thanks a lot in advance!

Best, Tim
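The GRES approach mentioned here makes the requested amount schedulable, but on its own it does not enforce anything; enforcement is what the replies above (a prolog or InitScript plus a file system quota) add. A minimal sketch of the scheduling half, with "tmpdisk" as an arbitrary GRES name counted in GB and node names/sizes as examples:

# slurm.conf (sketch)
GresTypes=tmpdisk
NodeName=cn[01-10] Gres=tmpdisk:400 ...

# gres.conf on each node (count-only GRES, no device file)
Name=tmpdisk Count=400

# job submission: request 50 GB of node-local scratch
sbatch --gres=tmpdisk:50 job.sh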
Re: [slurm-users] slurmstepd: error: load_ebpf_prog: BPF load error (No space left on device). Please check your system limits (MEMLOCK).
Hi,

I just tested with 23.02.7-1 and the issue is gone. So it seems like the patch got released.

Best, Tim

On 1/24/24 16:55, Stefan Fleischmann wrote:

On Wed, 24 Jan 2024 12:37:04 -0300 Cristóbal Navarro wrote:

Many thanks! One question: do we have to apply this patch (and recompile Slurm, I guess) only on the compute node with problems? Also, I noticed the patch now appears as "obsolete", is that ok?

We have Slurm installed on an NFS share, so what I did was to recompile it and then I only replaced the library lib/slurm/cgroup_v2.so. Good enough for now, I've been planning to update to 23.11 anyway soon.

I suppose it's marked as obsolete because the patch went into a release. According to the info in the bug report it should have been included in 23.02.4.

Cheers, Stefan

On Wed, Jan 24, 2024 at 9:52 AM Stefan Fleischmann wrote:

Turns out I was wrong, this is not a problem in the kernel at all. It's a known bug that is triggered by long bpf logs, see here: https://bugs.schedmd.com/show_bug.cgi?id=17210 There is a patch included there.

Cheers, Stefan

On Tue, 23 Jan 2024 15:28:59 +0100 Stefan Fleischmann wrote:

I don't think there is much for SchedMD to do. As I said, since it is working fine with newer kernels there doesn't seem to be any breaking change in cgroup2 in general, but only a regression introduced in one of the latest updates in 5.15. If Slurm was doing something wrong with cgroup2, and it accidentally worked until this recent change, then other kernel versions should show the same behavior. But as far as I can tell it still works just fine with newer kernels.

Cheers, Stefan

On Tue, 23 Jan 2024 15:20:56 +0100 Tim Schneider wrote:

Hi, I have filed a bug report with SchedMD (https://bugs.schedmd.com/show_bug.cgi?id=18623), but the support told me they cannot invest time in this issue since I don't have a support contract. Maybe they will look into it once it affects more people or someone important enough. So far, I have resorted to using 5.15.0-89-generic, but I am also a bit concerned about the security aspect of this choice.

Best, Tim

On 23.01.24 14:59, Stefan Fleischmann wrote:

Hi! I'm seeing the same in our environment. My conclusion is that it is a regression in the Ubuntu 5.15 kernel, introduced with 5.15.0-90-generic. The last working kernel version is 5.15.0-89-generic. I have filed a bug report here: https://bugs.launchpad.net/bugs/2050098 Please add yourself to the affected users in the bug report so it hopefully gets more attention.

I've tested with newer kernels (6.5, 6.6 and 6.7) and the problem does not exist there. 6.5 is the latest hwe kernel for 22.04 and would be an option for now. Reverting back to 5.15.0-89 would work as well, but I haven't looked into the security aspects of that.

Cheers, Stefan

On Mon, 22 Jan 2024 13:31:15 -0300 cristobal.navarro.g at gmail.com wrote:

Hi Tim and community,

We have been having the same issue (cgroups not working it seems, showing all GPUs on jobs) on a GPU compute node (DGX A100) since a couple of days ago, after a full update (apt upgrade). Now whenever we launch a job for that partition, we get the error message mentioned by Tim. As a note, we have another custom GPU compute node with L40s, on a different partition, and that one works fine. Before this error, we always had small differences in kernel version between nodes, so I am not sure if this can be the problem. Nevertheless, here is the info of our nodes as well.

[Problem node] The DGX A100 node has this kernel:
cnavarro at nodeGPU01:~$ uname -a
Linux nodeGPU01 5.15.0-1042-nvidia #42-Ubuntu SMP Wed Nov 15 20:28:30 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

[Functioning node] The custom GPU node (L40s) has this kernel:
cnavarro at nodeGPU02:~$ uname -a
Linux nodeGPU02 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

And the login node (slurmctld):
uname -a
Linux patagon-master 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Any ideas what we should check?

On Thu, Jan 4, 2024 at 3:03 PM Tim Schneider wrote:

Hi,

I am using SLURM 22.05.9 on a small compute cluster. Since I reinstalled two of our nodes, I get the following error when launching a job:

slurmstepd: error: load_ebpf_prog: BPF load error (No space left on device). Please check your system limits (MEMLOCK).

Also the cgroups do not seem to work properly anymore, as I am able to see all GPUs even if I do not request them, which is not the case on the other nodes. One difference I found between the updated nodes and the original nodes (both are Ubuntu 22.04) is the kernel version, which is "5.15.0-89-generic #99-Ubuntu SMP" on the functioning nodes and "5.15.0-91-generic #101-Ubuntu SMP" on the updated nodes. I could not figure out how to install the exact first kernel version on the updated nodes, but I noticed that when I reinstall 5.15.0 with this tool: https://github.com/pimlie/ubuntu-mainline-kernel.sh, the error message disappears. However, once I do that, the network driver does not function properly anymore, so this does not seem to be a good solution.

Has anyone seen this issue before or is there maybe something else I should take a look at? I am also happy to just find a workaround such that I can take these nodes back online. I appreciate any help!

Thanks a lot in advance and best wishes,

Tim
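The workaround Stefan describes (rebuild Slurm with the patch and swap only the cgroup v2 plugin on the NFS-shared installation) looks roughly like this; the version, the install prefix and the location of the built plugin inside the source tree are assumptions:

# rebuild with the patch from https://bugs.schedmd.com/show_bug.cgi?id=17210 applied
tar xf slurm-23.02.3.tar.bz2 && cd slurm-23.02.3
# ... apply the patch here ...
./configure --prefix=/nfs/apps/slurm && make -j
# replace only the cgroup v2 plugin in the shared install
# (the path of the built .so inside the source tree may differ between versions)
cp src/plugins/cgroup/v2/.libs/cgroup_v2.so /nfs/apps/slurm/lib/slurm/cgroup_v2.so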
Re: [slurm-users] slurmstepd: error: load_ebpf_prog: BPF load error (No space left on device). Please check your system limits (MEMLOCK).
Hi,

I have filed a bug report with SchedMD (https://bugs.schedmd.com/show_bug.cgi?id=18623), but the support told me they cannot invest time in this issue since I don't have a support contract. Maybe they will look into it once it affects more people or someone important enough.

So far, I have resorted to using 5.15.0-89-generic, but I am also a bit concerned about the security aspect of this choice.

Best, Tim

On 23.01.24 14:59, Stefan Fleischmann wrote:

Hi! I'm seeing the same in our environment. My conclusion is that it is a regression in the Ubuntu 5.15 kernel, introduced with 5.15.0-90-generic. The last working kernel version is 5.15.0-89-generic. I have filed a bug report here: https://bugs.launchpad.net/bugs/2050098 Please add yourself to the affected users in the bug report so it hopefully gets more attention.

I've tested with newer kernels (6.5, 6.6 and 6.7) and the problem does not exist there. 6.5 is the latest hwe kernel for 22.04 and would be an option for now. Reverting back to 5.15.0-89 would work as well, but I haven't looked into the security aspects of that.

Cheers, Stefan

On Mon, 22 Jan 2024 13:31:15 -0300 cristobal.navarro.g at gmail.com wrote:

Hi Tim and community,

We have been having the same issue (cgroups not working it seems, showing all GPUs on jobs) on a GPU compute node (DGX A100) since a couple of days ago, after a full update (apt upgrade). Now whenever we launch a job for that partition, we get the error message mentioned by Tim. As a note, we have another custom GPU compute node with L40s, on a different partition, and that one works fine. Before this error, we always had small differences in kernel version between nodes, so I am not sure if this can be the problem. Nevertheless, here is the info of our nodes as well.

[Problem node] The DGX A100 node has this kernel:
cnavarro at nodeGPU01:~$ uname -a
Linux nodeGPU01 5.15.0-1042-nvidia #42-Ubuntu SMP Wed Nov 15 20:28:30 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

[Functioning node] The custom GPU node (L40s) has this kernel:
cnavarro at nodeGPU02:~$ uname -a
Linux nodeGPU02 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

And the login node (slurmctld):
uname -a
Linux patagon-master 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Any ideas what we should check?

On Thu, Jan 4, 2024 at 3:03 PM Tim Schneider wrote:

Hi,

I am using SLURM 22.05.9 on a small compute cluster. Since I reinstalled two of our nodes, I get the following error when launching a job:

slurmstepd: error: load_ebpf_prog: BPF load error (No space left on device). Please check your system limits (MEMLOCK).

Also the cgroups do not seem to work properly anymore, as I am able to see all GPUs even if I do not request them, which is not the case on the other nodes. One difference I found between the updated nodes and the original nodes (both are Ubuntu 22.04) is the kernel version, which is "5.15.0-89-generic #99-Ubuntu SMP" on the functioning nodes and "5.15.0-91-generic #101-Ubuntu SMP" on the updated nodes. I could not figure out how to install the exact first kernel version on the updated nodes, but I noticed that when I reinstall 5.15.0 with this tool: https://github.com/pimlie/ubuntu-mainline-kernel.sh, the error message disappears. However, once I do that, the network driver does not function properly anymore, so this does not seem to be a good solution.

Has anyone seen this issue before or is there maybe something else I should take a look at? I am also happy to just find a workaround such that I can take these nodes back online. I appreciate any help!

Thanks a lot in advance and best wishes,

Tim
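Tim mentions pinning the nodes to 5.15.0-89; on Ubuntu 22.04 that can be done roughly as below (package names assumed for the "generic" kernel flavour; holding them keeps the next apt upgrade from pulling the regressed kernel back in):

apt-get install linux-image-5.15.0-89-generic linux-headers-5.15.0-89-generic
apt-mark hold linux-image-5.15.0-89-generic linux-headers-5.15.0-89-generic
# then select that kernel at boot (e.g. via the GRUB menu) and reboot the node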
[slurm-users] slurmstepd: error: load_ebpf_prog: BPF load error (No space left on device). Please check your system limits (MEMLOCK).
Hi,

I am using SLURM 22.05.9 on a small compute cluster. Since I reinstalled two of our nodes, I get the following error when launching a job:

slurmstepd: error: load_ebpf_prog: BPF load error (No space left on device). Please check your system limits (MEMLOCK).

Also the cgroups do not seem to work properly anymore, as I am able to see all GPUs even if I do not request them, which is not the case on the other nodes.

One difference I found between the updated nodes and the original nodes (both are Ubuntu 22.04) is the kernel version, which is "5.15.0-89-generic #99-Ubuntu SMP" on the functioning nodes and "5.15.0-91-generic #101-Ubuntu SMP" on the updated nodes. I could not figure out how to install the exact first kernel version on the updated nodes, but I noticed that when I reinstall 5.15.0 with this tool: https://github.com/pimlie/ubuntu-mainline-kernel.sh, the error message disappears. However, once I do that, the network driver does not function properly anymore, so this does not seem to be a good solution.

Has anyone seen this issue before or is there maybe something else I should take a look at? I am also happy to just find a workaround such that I can take these nodes back online. I appreciate any help!

Thanks a lot in advance and best wishes,

Tim
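Since the error text points at MEMLOCK, checking the locked-memory limits is a reasonable first step, even though in this thread the root cause turned out to be a kernel regression (see the replies above). A few standard checks, assuming slurmd runs as a systemd service:

# locked-memory limit of the slurmd service and the running daemon
systemctl show slurmd -p LimitMEMLOCK
prlimit --memlock --pid "$(pidof slurmd)"
# and from inside an allocation on the affected node
srun -w cn02 -n1 bash -c 'ulimit -l'     # cn02 is an example node name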
Re: [slurm-users] scontrol reboot does not allow new jobs to be scheduled if nextstate=RESUME is set
Hi Ole,

thanks for your reply. The curious thing is that when I run "scontrol reboot nextstate=RESUME <nodelist>", the drain flag of that node is not set (sinfo shows "mix@" and "scontrol show node <nodename>" shows no DRAIN in State, just MIXED+REBOOT_REQUESTED), yet no jobs are scheduled on that node until reboot. If I specifically request that node for a job with "-w <nodename>", I get "Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions".

Not using nextstate=RESUME is inconvenient for me, as sometimes we have parts of our cluster drained and I would like to run a single command that reboots all non-drained nodes once they become idle and all drained nodes immediately, resuming them once they are done reinstalling.

Best, Tim

On 25.10.23 14:59, Ole Holm Nielsen wrote:

Hi Tim,

I think the scontrol manual page explains the "scontrol reboot" function fairly well:

reboot [ASAP] [nextstate={RESUME|DOWN}] [reason=<reason>] {ALL|<NodeList>}
Reboot the nodes in the system when they become idle using the RebootProgram as configured in Slurm's slurm.conf file. Each node will have the "REBOOT" flag added to its node state. After a node reboots and the slurmd daemon starts up again, the HealthCheckProgram will run once. Then, the slurmd daemon will register itself with the slurmctld daemon and the "REBOOT" flag will be cleared. The node's "DRAIN" state flag will be cleared if the reboot was "ASAP", nextstate=resume or down. The "ASAP" option adds the "DRAIN" flag to each node's state, preventing additional jobs from running on the node so it can be rebooted and returned to service "As Soon As Possible" (i.e. ASAP).

It seems to be implicitly understood that if nextstate is specified, this implies setting the "DRAIN" state flag: The node's "DRAIN" state flag will be cleared if the reboot was "ASAP", nextstate=resume or down. You can verify the node's DRAIN flag with "scontrol show node <nodename>".

IMHO, if you want nodes to continue accepting new jobs, then nextstate is irrelevant. We always use "reboot ASAP" because our cluster is usually so busy that nodes never become idle if left to themselves :-)

FYI: We regularly make package updates and firmware updates using the "scontrol reboot asap" method, which is explained in this script: https://github.com/OleHolmNielsen/Slurm_tools/blob/master/nodes/update.sh

Best regards, Ole

On 10/25/23 13:39, Tim Schneider wrote:

Hi Chris,

thanks a lot for your response. I just realized that I made a mistake in my post. In the section you cite, the command is supposed to be "scontrol reboot nextstate=RESUME" (without ASAP). So to clarify: my problem is that if I type "scontrol reboot nextstate=RESUME", no new jobs get scheduled anymore until the reboot. On the other hand, if I type "scontrol reboot", jobs continue to get scheduled, which is what I want. I just don't understand why setting nextstate results in the nodes not accepting jobs anymore.

My use case is similar to the one you describe. We use the ASAP option when we install a new image to ensure that from the point of the reinstallation onwards, all jobs end up on nodes with the new configuration only. However, in some cases when we do only minor changes to the image configuration, we prefer to cause as little disruption as possible and just reinstall the nodes whenever they are idle. Here, being able to set nextstate=RESUME is useful, since we usually want the nodes to resume after reinstallation, no matter what their previous state was.

Hope that clears it up and sorry for the confusion!

Best, Tim

On 25.10.23 02:10, Christopher Samuel wrote:

On 10/24/23 12:39, Tim Schneider wrote:

Now my issue is that when I run "scontrol reboot ASAP nextstate=RESUME <nodelist>", the node goes in "mix@" state (not drain), but no new jobs get scheduled until the node reboots. Essentially I get draining behavior, even though the node's state is not "drain". Note that this behavior is caused by "nextstate=RESUME"; if I leave that away, jobs get scheduled as expected. Does anyone have an idea why that could be?

The intent of the "ASAP" flag for "scontrol reboot" is to not let any more jobs onto a node until it has rebooted. IIRC that was from work we sponsored, the idea being that (for how our nodes are managed) we would build new images with the latest software stack, test them on a separate test system and then once happy bring them over to the production system and do an "scontrol reboot ASAP nextstate=resume reason=... $NODES" to ensure that from that point onwards no new jobs would start in the old software configuration, only the new one.

Also slurmctld would know that these nodes are due to come back in "ResumeTimeout" seconds after the reboot is issued and so could plan for them as part of scheduling large jobs, rather than thinking there was no way it could do so and letting lots of smaller jobs get in the way.

Hope that helps!

All the best, Chris
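To see which of the flags discussed here a node actually carries (DRAIN vs. only REBOOT_REQUESTED), the following is usually enough; "cn02" stands in for the node name:

scontrol show node cn02 | grep -E 'State=|Reason='
# or, more compact: node name, long state and reason
sinfo -n cn02 -o '%N %T %E'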
Re: [slurm-users] scontrol reboot does not allow new jobs to be scheduled if nextstate=RESUME is set
Hi Chris,

thanks a lot for your response. I just realized that I made a mistake in my post. In the section you cite, the command is supposed to be "scontrol reboot nextstate=RESUME" (without ASAP). So to clarify: my problem is that if I type "scontrol reboot nextstate=RESUME", no new jobs get scheduled anymore until the reboot. On the other hand, if I type "scontrol reboot", jobs continue to get scheduled, which is what I want. I just don't understand why setting nextstate results in the nodes not accepting jobs anymore.

My use case is similar to the one you describe. We use the ASAP option when we install a new image to ensure that from the point of the reinstallation onwards, all jobs end up on nodes with the new configuration only. However, in some cases when we do only minor changes to the image configuration, we prefer to cause as little disruption as possible and just reinstall the nodes whenever they are idle. Here, being able to set nextstate=RESUME is useful, since we usually want the nodes to resume after reinstallation, no matter what their previous state was.

Hope that clears it up and sorry for the confusion!

Best, Tim

On 25.10.23 02:10, Christopher Samuel wrote:

On 10/24/23 12:39, Tim Schneider wrote:

Now my issue is that when I run "scontrol reboot ASAP nextstate=RESUME <nodelist>", the node goes in "mix@" state (not drain), but no new jobs get scheduled until the node reboots. Essentially I get draining behavior, even though the node's state is not "drain". Note that this behavior is caused by "nextstate=RESUME"; if I leave that away, jobs get scheduled as expected. Does anyone have an idea why that could be?

The intent of the "ASAP" flag for "scontrol reboot" is to not let any more jobs onto a node until it has rebooted. IIRC that was from work we sponsored, the idea being that (for how our nodes are managed) we would build new images with the latest software stack, test them on a separate test system and then once happy bring them over to the production system and do an "scontrol reboot ASAP nextstate=resume reason=... $NODES" to ensure that from that point onwards no new jobs would start in the old software configuration, only the new one.

Also slurmctld would know that these nodes are due to come back in "ResumeTimeout" seconds after the reboot is issued and so could plan for them as part of scheduling large jobs, rather than thinking there was no way it could do so and letting lots of smaller jobs get in the way.

Hope that helps!

All the best, Chris
[slurm-users] scontrol reboot does not allow new jobs to be scheduled if nextstate=RESUME is set
Hi,

from my understanding, if I run "scontrol reboot <nodelist>", the node should continue to operate as usual and reboot once it is idle. When adding the ASAP flag ("scontrol reboot ASAP <nodelist>"), the node should go into drain state and not accept any more jobs.

Now my issue is that when I run "scontrol reboot ASAP nextstate=RESUME <nodelist>", the node goes in "mix@" state (not drain), but no new jobs get scheduled until the node reboots. Essentially I get draining behavior, even though the node's state is not "drain". Note that this behavior is caused by "nextstate=RESUME"; if I leave that away, jobs get scheduled as expected. Does anyone have an idea why that could be?

I am running slurm 22.05.9.

Steps to reproduce:

# To prevent the node from rebooting immediately
sbatch -t 1:00:00 -c 1 --mem-per-cpu 1G -w <node> ./long_running_script.sh

# Request reboot
scontrol reboot nextstate=RESUME <node>

# Run an interactive command, which does not start until
# "scontrol cancel_reboot <node>" is executed in another shell
srun -t 1:00:00 -c 1 --mem-per-cpu 1G -w <node> --pty bash

Thanks a lot in advance!

Best, Tim
Re: [slurm-users] task/cgroup plugin causes "srun: error: task 0 launch failed: Plugin initialization failed" error on Ubuntu 22.04
Hi,

I just want to wrap this up in case someone has the same issue in the future. As Reed pointed out, Ubuntu 22 does not support cgroups v1 anymore. At the same time, the slurm-wlm package in the Ubuntu repositories uses cgroups v1, which makes its task/cgroup plugin incompatible with Ubuntu 22. My solution was to build Slurm 22.05 manually, while ensuring that libdbus-1-dev is installed (as otherwise cgroups v2 support does not get built). This takes a bit more time but seems to work so far.

Thanks a lot Reed & Abel for your advice!

Best, Tim

On 6/16/23 10:42, Tim Schneider wrote:

Hi again,

I just realized that the author of https://groups.google.com/g/slurm-users/c/0dJhe5r6_2Q?pli=1 wrote at some point that he built Slurm 22 instead of using the Ubuntu repo version. So I guess I will have to look into that.

Best, Tim

On 6/16/23 10:36, Tim Schneider wrote:

Hi Abel and Reed,

thanks a lot for your quick replies! I did indeed just install slurm-wlm from the Ubuntu repos. Following the advice of https://groups.google.com/g/slurm-users/c/0dJhe5r6_2Q?pli=1, I tried disabling cgroups v1 on Ubuntu, but that just leads to an error during startup of slurmd:

slurmd: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/proctrack_cgroup.so
slurmd: error: unable to mount freezer cgroup namespace: Invalid argument
slurmd: error: unable to create freezer cgroup namespace
slurmd: error: Couldn't load specified plugin name for proctrack/cgroup: Plugin init() callback failed
slurmd: error: cannot create proctrack context for proctrack/cgroup
slurmd: error: slurmd initialization failed

So it seems that slurmd is using cgroups v1. This is also reflected in the mounts (for the output below, cgroups v1 is enabled again):

$ mount | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)

What is still confusing to me is that the slurmd logs indicate no error when I try running with cgroups v1 enabled and the error only appears on the slurmctld side. Do you know how I can enable cgroups v2 in Slurm? To me it seems that this is what https://groups.google.com/g/slurm-users/c/0dJhe5r6_2Q?pli=1 did.

Best, Tim

On 6/16/23 03:28, abel pinto wrote:

Indeed, the issue seems to be that Ubuntu 22.04 does not support cgroups v1 anymore. Does SLURM support cgroups v2? It seems so: https://slurm.schedmd.com/cgroup_v2.html

/Abel

On Jun 15, 2023, at 20:20, Reed Dier wrote:

I don’t have any direct advice off-hand, but I figure I will try to help steer the conversation in the right direction for figuring it out. I’m going to assume that since you mention 21.08.5, this means you are using the slurm-wlm packages from the Ubuntu repos, and not building yourself? And have all the components (slurmctld(s), slurmdbd, slurmd(s)) been upgraded as well?

The only thing that immediately comes to mind is that I remember reading a good bit about Ubuntu 22.04’s use of cgroups v2, which as I understand it are very different from cgroups v1, and plenty of people have had issues with v1/v2 mismatches with slurm and other applications.

https://www.reddit.com/r/SLURM/comments/vjquih/error_cannot_find_cgroup_plugin_for_cgroupv2/
https://groups.google.com/g/slurm-users/c/0dJhe5r6_2Q?pli=1
https://discuss.linuxcontainers.org/t/after-updated-to-more-recent-ubuntu-version-with-cgroups-v2-ubuntu-16-04-container-is-not-working-properly/14022

Hope that at least steers the conversation in a good direction.

Reed

On Jun 15, 2023, at 5:04 PM, Tim Schneider wrote:

Hi,

I am maintaining the SLURM cluster of my research group. Recently I updated to Ubuntu 22.04 and Slurm 21.08.5 and ever since, I am unable to launch jobs. When launching a job, I receive the following error:

$ srun --nodes=1 --ntasks-per-node=1 -c 1 --mem-per-cpu 1G --time=01:00:00 --pty -p amd -w cn02 --pty bash -i
srun: error: task 0 launch failed: Plugin initialization failed

Strangely, I cannot find any indication of this problem in the logs (find the logs attached). The problem must be related to the task/cgroup plugin, as it does not occur when I disable it. After reading in the documentation, I tried adding the cgroup_enable=memory swapaccount=1 kernel parameters, but the problem persisted.

I would be very grateful for any advice where to look since I have no idea how to investigate this issue further.

Thanks a lot in advance.

Best, Tim
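The fix Tim describes at the top of this thread (building Slurm 22.05 yourself so that the cgroup v2 plugin gets compiled) looks roughly like this on Ubuntu 22.04; the package list is trimmed to the parts relevant here and the prefix is an example, so a real site install will need more packages and configure options:

apt-get install build-essential munge libmunge-dev libdbus-1-dev libhwloc-dev
tar xf slurm-22.05.9.tar.bz2 && cd slurm-22.05.9
./configure --prefix=/opt/slurm --sysconfdir=/etc/slurm
make -j && make install
# confirm that cgroup v2 support was actually built
ls /opt/slurm/lib/slurm/cgroup_v2.so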
[slurm-users] Fwd: task/cgroup plugin causes "srun: error: task 0 launch failed: Plugin initialization failed" error on Ubuntu 22.04
Hi,

I am maintaining the SLURM cluster of my research group. Recently I updated to Ubuntu 22.04 and Slurm 21.08.5 and ever since, I am unable to launch jobs. When launching a job, I receive the following error:

$ srun --nodes=1 --ntasks-per-node=1 -c 1 --mem-per-cpu 1G --time=01:00:00 --pty -p amd -w cn02 --pty bash -i
srun: error: task 0 launch failed: Plugin initialization failed

Strangely, I cannot find any indication of this problem in the logs (find the logs attached). The problem must be related to the task/cgroup plugin, as it does not occur when I disable it. After reading in the documentation, I tried adding the cgroup_enable=memory swapaccount=1 kernel parameters, but the problem persisted.

I would be very grateful for any advice where to look since I have no idea how to investigate this issue further.

Thanks a lot in advance.

Best, Tim

###
# Slurm cgroup support configuration file
###
CgroupAutomount=yes
CgroupMountpoint=/sys/fs/cgroup
ConstrainKmemSpace=no
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
# This will be necessary for controlling GPU access
ConstrainDevices=yes
#

# slurmd -D -vv --conf-server nas:6817
slurmd: debug: Log file re-opened
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: debug2: hwloc_topology_export_xml
slurmd: debug: CPUs:16 Boards:1 Sockets:1 CoresPerSocket:16 ThreadsPerCore:1
slurmd: debug4: CPU map[0]=>0 S:C:T 0:0:0
slurmd: debug4: CPU map[1]=>1 S:C:T 0:1:0
slurmd: debug4: CPU map[2]=>2 S:C:T 0:2:0
slurmd: debug4: CPU map[3]=>3 S:C:T 0:3:0
slurmd: debug4: CPU map[4]=>4 S:C:T 0:4:0
slurmd: debug4: CPU map[5]=>5 S:C:T 0:5:0
slurmd: debug4: CPU map[6]=>6 S:C:T 0:6:0
slurmd: debug4: CPU map[7]=>7 S:C:T 0:7:0
slurmd: debug4: CPU map[8]=>8 S:C:T 0:8:0
slurmd: debug4: CPU map[9]=>9 S:C:T 0:9:0
slurmd: debug4: CPU map[10]=>10 S:C:T 0:10:0
slurmd: debug4: CPU map[11]=>11 S:C:T 0:11:0
slurmd: debug4: CPU map[12]=>12 S:C:T 0:12:0
slurmd: debug4: CPU map[13]=>13 S:C:T 0:13:0
slurmd: debug4: CPU map[14]=>14 S:C:T 0:14:0
slurmd: debug4: CPU map[15]=>15 S:C:T 0:15:0
slurmd: debug3: _set_slurmd_spooldir: initializing slurmd spool directory `/var/spool/slurmd`
slurmd: debug2: hwloc_topology_init
slurmd: debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurmd/hwloc_topo_whole.xml) found
slurmd: debug: CPUs:16 Boards:1 Sockets:1 CoresPerSocket:16 ThreadsPerCore:1
slurmd: debug4: CPU map[0]=>0 S:C:T 0:0:0
slurmd: debug4: CPU map[1]=>1 S:C:T 0:1:0
slurmd: debug4: CPU map[2]=>2 S:C:T 0:2:0
slurmd: debug4: CPU map[3]=>3 S:C:T 0:3:0
slurmd: debug4: CPU map[4]=>4 S:C:T 0:4:0
slurmd: debug4: CPU map[5]=>5 S:C:T 0:5:0
slurmd: debug4: CPU map[6]=>6 S:C:T 0:6:0
slurmd: debug4: CPU map[7]=>7 S:C:T 0:7:0
slurmd: debug4: CPU map[8]=>8 S:C:T 0:8:0
slurmd: debug4: CPU map[9]=>9 S:C:T 0:9:0
slurmd: debug4: CPU map[10]=>10 S:C:T 0:10:0
slurmd: debug4: CPU map[11]=>11 S:C:T 0:11:0
slurmd: debug4: CPU map[12]=>12 S:C:T 0:12:0
slurmd: debug4: CPU map[13]=>13 S:C:T 0:13:0
slurmd: debug4: CPU map[14]=>14 S:C:T 0:14:0
slurmd: debug4: CPU map[15]=>15 S:C:T 0:15:0
slurmd: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/gres_gpu.so
slurmd: debug: gres/gpu: init: loaded
slurmd: debug3: Success.
slurmd: debug3: _merge_gres2: From gres.conf, using gpu:rtx2080:1:/dev/nvidia0
slurmd: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/gpu_generic.so
slurmd: debug: gpu/generic: init: init: GPU Generic plugin loaded
slurmd: debug3: Success.
slurmd: debug3: gres_device_major : /dev/nvidia0 major 195, minor 0
slurmd: Gres Name=gpu Type=rtx2080 Count=1
slurmd: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/topology_none.so
slurmd: topology/none: init: topology NONE plugin loaded
slurmd: debug3: Success.
slurmd: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/route_default.so
slurmd: route/default: init: route default plugin loaded
slurmd: debug3: Success.
slurmd: debug2: Gathering cpu frequency information for 16 cpus
slurmd: debug: Resource spec: No specialized cores configured by default on this node
slurmd: debug: Resource spec: Reserved system memory limit not configured for this node
slurmd: debug3: NodeName= cn02
slurmd: debug3: TopoAddr= cn02
slurmd: debug3: TopoPattern = node
slurmd: debug3: ClusterName = iascluster
slurmd: debug3: Confile = `/var/spool/slurmd/conf-cache/slurm.conf'
slurmd: debug3: Debug = 5
slurmd: debug3: CPUs= 16 (CF: 16, HW: 16)
slurmd: debug3: Boards = 1 (CF: 1, HW: 1)
slurmd: debug3: Sockets = 1 (CF: 1, HW: 1)
slurmd: debug3: Cores = 16 (CF: 16, HW: 16)
slurmd: debug3: Threads = 1 (CF: 1, HW: 1)
slurmd: debug3: UpTime = 2377 = 00:39:37
slurmd: debug3: Block Map = 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
slurmd: debug3: Inverse Map = 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
slurmd: debug3: RealMemory = 64216
slurmd: debug3: TmpDisk = 32108
slurmd: debug3: Epilog