[slurm-users] Re: Restricting local disk storage of jobs

2024-02-07 Thread Tim Schneider via slurm-users

Hey Jeffrey,

thanks for this suggestion! This is probably the way to go if one can 
find a way to access GRES in the prolog. I read somewhere that people 
were calling scontrol to get this information, but this seems a bit 
unclean. Anyway, if I find some time I will try it out.
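
For illustration, a rough prolog sketch of that scontrol-based lookup could look like the following (untested; the GRES name "tmpdisk", the MB unit and the fallback default are assumptions, not anything Slurm defines by itself):

#!/bin/bash
# Prolog sketch: ask slurmctld for the job's (hypothetical) "tmpdisk" GRES,
# since GRES details are not exported into the prolog environment.
# The exact field the count appears in (TresPerNode=, Gres=, ...) differs
# between Slurm versions, so match loosely.
TMP_MB=$(scontrol show job -d "$SLURM_JOB_ID" 2>/dev/null \
           | grep -oE 'tmpdisk[:=][0-9]+' | head -n1 | grep -oE '[0-9]+')
TMP_MB=${TMP_MB:-1024}   # default if the job did not request any local scratch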


Best,

Tim

On 2/6/24 16:30, Jeffrey T Frey wrote:
Most of my ideas have revolved around creating file systems on-the-fly 
as part of the job prolog and destroying them in the epilog.  The 
issue with that mechanism is that formatting a file system (e.g. with 
mkfs) can be time-consuming.  E.g. if you format your local 
scratch SSD as an LVM PV+VG and allocate per-job volumes, you'd 
still need to run e.g. mkfs.xfs and mount the new file system for each job.
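
For what it's worth, a bare-bones sketch of that LVM-based prolog/epilog pair might look like this (untested; the volume group name "vg_scratch", the mount point and the ${TMP_MB} size variable are made up for the example):

# Prolog (sketch): carve a per-job logical volume out of a pre-created VG
lvcreate -y -L "${TMP_MB}M" -n "slurm-${SLURM_JOB_ID}" vg_scratch
mkfs.xfs -q "/dev/vg_scratch/slurm-${SLURM_JOB_ID}"     # this is the slow part
mkdir -p "/tmp-alloc/slurm-${SLURM_JOB_ID}"
mount "/dev/vg_scratch/slurm-${SLURM_JOB_ID}" "/tmp-alloc/slurm-${SLURM_JOB_ID}"

# Epilog (sketch): tear everything down again
umount "/tmp-alloc/slurm-${SLURM_JOB_ID}"
rmdir "/tmp-alloc/slurm-${SLURM_JOB_ID}"
lvremove -y "vg_scratch/slurm-${SLURM_JOB_ID}"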



ZFS file system creation is much quicker (basically combines the LVM + 
mkfs steps above) but I don't know of any clusters using ZFS to manage 
local file systems on the compute nodes :-)
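
For comparison, the ZFS variant would collapse to roughly one command per job, with the quota applied at creation time (sketch only; the pool name "scratch" is hypothetical):

# Prolog (sketch): one dataset per job, quota set at creation
zfs create -o quota="${TMP_MB}M" \
    -o mountpoint="/tmp-alloc/slurm-${SLURM_JOB_ID}" \
    "scratch/slurm-${SLURM_JOB_ID}"

# Epilog (sketch)
zfs destroy "scratch/slurm-${SLURM_JOB_ID}"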



One could leverage XFS project quotas.  E.g. for Slurm job 2147483647:


[root@r00n00 /]# mkdir /tmp-alloc/slurm-2147483647
[root@r00n00 /]# xfs_quota -x -c 'project -s -p /tmp-alloc/slurm-2147483647 2147483647' /tmp-alloc
Setting up project 2147483647 (path /tmp-alloc/slurm-2147483647)...
Processed 1 (/etc/projects and cmdline) paths for project 2147483647 with recursion depth infinite (-1).
[root@r00n00 /]# xfs_quota -x -c 'limit -p bhard=1g 2147483647' /tmp-alloc
[root@r00n00 /]# cd /tmp-alloc/slurm-2147483647
[root@r00n00 slurm-2147483647]# dd if=/dev/zero of=zeroes bs=5M count=1000
dd: error writing ‘zeroes’: No space left on device
205+0 records in
204+0 records out
1073741824 bytes (1.1 GB) copied, 2.92232 s, 367 MB/s

   :

[root@r00n00 /]# rm -rf /tmp-alloc/slurm-2147483647
[root@r00n00 /]# xfs_quota -x -c 'limit -p bhard=0 2147483647' /tmp-alloc


Since Slurm jobids max out at 0x03FFFFFF (and 2147483647 = 0x7FFFFFFF) 
we have an easy on-demand project id to use on the file system.  Slurm 
tmpfs plugins have to do a mkdir to create the per-job directory, so 
adding two xfs_quota commands (which run in more or less O(1) time) 
won't extend the prolog by much. Likewise, Slurm tmpfs plugins have to 
scrub the directory at job cleanup, so adding another xfs_quota 
command will not do much to change their epilog execution times.  The 
main question is "where does the tmpfs plugin find the quota limit for 
the job?"






On Feb 6, 2024, at 08:39, Tim Schneider via slurm-users 
 wrote:


Hi,

In our SLURM cluster, we are using the job_container/tmpfs plugin to 
ensure that each user can use /tmp and it gets cleaned up after them. 
Currently, we are mapping /tmp into the node's RAM, which means that 
cgroups make sure that users can only use a certain amount of 
storage inside /tmp.


Now we would like to use the node's local SSD instead of its RAM 
to hold the files in /tmp. I have seen people define local storage as 
a GRES, but I am wondering how to make sure that users do not exceed 
the storage space they requested in a job. Does anyone have an idea 
how to configure local storage as a proper tracked resource?


Thanks a lot in advance!

Best,

Tim


--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: [ext] Restricting local disk storage of jobs

2024-02-06 Thread Tim Schneider via slurm-users

Hi Magnus,

I understand. Thanks a lot for your suggestion.

Best,

Tim

On 06.02.24 15:34, Hagdorn, Magnus Karl Moritz wrote:

Hi Tim,
in the end the InitScript didn't contain anything useful because

slurmd: error: _parse_next_key: Parsing error at unrecognized key: InitScript

At this stage I gave up. This was with SLURM 23.02. My plan was to set
up the local scratch directory with XFS and then get the script to
apply a project quota, i.e. a quota attached to the directory.

I would start by checking if Slurm recognises the InitScript option.

Regards
magnus
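
For reference, the configuration Magnus describes would presumably look something like the snippet below (paths are made up; note that, per the parse error above, the 23.02 build in use did not accept InitScript, so check the job_container.conf man page of your own Slurm version first):

# job_container.conf (sketch)
AutoBasePath=true
BasePath=/local/scratch/slurm
InitScript=/etc/slurm/tmpfs_init.sh   # script that would apply e.g. an XFS project quota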

On Tue, 2024-02-06 at 15:24 +0100, Tim Schneider wrote:

Hi Magnus,

thanks for your reply! If you can, would you mind sharing the
InitScript of your attempt at getting it to work?

Best,

Tim

On 06.02.24 15:19, Hagdorn, Magnus Karl Moritz wrote:

Hi Tim,
we are using the container/tmpfs plugin to map /tmp to a local NVMe
drive, which works great. I did consider setting up directory quotas. I
thought the InitScript [1] option should do the trick. Alas, I didn't
get it to work. If I remember correctly, Slurm complained about the
option being present. In the end we recommend that our users make
exclusive use of a node if they are going to use a lot of local scratch
space. I don't think this happens very often, if at all.
Regards
magnus

[1]
https://slurm.schedmd.com/job_container.conf.html#OPT_InitScript


On Tue, 2024-02-06 at 14:39 +0100, Tim Schneider via slurm-users
wrote:

Hi,

In our SLURM cluster, we are using the job_container/tmpfs plugin to
ensure that each user can use /tmp and it gets cleaned up after them.
Currently, we are mapping /tmp into the node's RAM, which means that
cgroups make sure that users can only use a certain amount of storage
inside /tmp.

Now we would like to use the node's local SSD instead of its RAM to
hold the files in /tmp. I have seen people define local storage as a
GRES, but I am wondering how to make sure that users do not exceed the
storage space they requested in a job. Does anyone have an idea how to
configure local storage as a proper tracked resource?

Thanks a lot in advance!

Best,

Tim




--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: [ext] Restricting local disk storage of jobs

2024-02-06 Thread Tim Schneider via slurm-users

Hi Magnus,

thanks for your reply! If you can, would you mind sharing the InitScript 
of your attempt at getting it to work?


Best,

Tim

On 06.02.24 15:19, Hagdorn, Magnus Karl Moritz wrote:

Hi Tim,
we are using the container/tmpfs plugin to map /tmp to a local NVMe
drive, which works great. I did consider setting up directory quotas. I
thought the InitScript [1] option should do the trick. Alas, I didn't
get it to work. If I remember correctly, Slurm complained about the
option being present. In the end we recommend that our users make
exclusive use of a node if they are going to use a lot of local scratch
space. I don't think this happens very often, if at all.
Regards
magnus

[1]
https://slurm.schedmd.com/job_container.conf.html#OPT_InitScript


On Tue, 2024-02-06 at 14:39 +0100, Tim Schneider via slurm-users wrote:

Hi,

In our SLURM cluster, we are using the job_container/tmpfs plugin to
ensure that each user can use /tmp and it gets cleaned up after them.
Currently, we are mapping /tmp into the node's RAM, which means that
cgroups make sure that users can only use a certain amount of storage
inside /tmp.

Now we would like to use the node's local SSD instead of its RAM to
hold the files in /tmp. I have seen people define local storage as a
GRES, but I am wondering how to make sure that users do not exceed the
storage space they requested in a job. Does anyone have an idea how to
configure local storage as a proper tracked resource?

Thanks a lot in advance!

Best,

Tim




--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Restricting local disk storage of jobs

2024-02-06 Thread Tim Schneider via slurm-users

Hi,

In our SLURM cluster, we are using the job_container/tmpfs plugin to 
ensure that each user can use /tmp and it gets cleaned up after them. 
Currently, we are mapping /tmp into the node's RAM, which means that 
cgroups make sure that users can only use a certain amount of storage 
inside /tmp.


Now we would like to use the node's local SSD instead of its RAM to 
hold the files in /tmp. I have seen people define local storage as a 
GRES, but I am wondering how to make sure that users do not exceed the 
storage space they requested in a job. Does anyone have an idea how to 
configure local storage as a proper tracked resource?
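
For concreteness, the GRES-based setups I have seen look roughly like the sketch below. The name "tmpdisk", the MB unit and all counts are made up, the exact syntax varies between Slurm versions, and this only makes Slurm account for the space; actually enforcing the limit on the node is exactly the open question.

# slurm.conf (sketch)
GresTypes=tmpdisk
NodeName=cn[01-04] Gres=tmpdisk:400000        # ~400 GB of local scratch, in MB

# gres.conf on each node (sketch)
Name=tmpdisk Count=400000

# job submission
sbatch --gres=tmpdisk:20000 job.sh            # request ~20 GB of local scratch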


Thanks a lot in advance!

Best,

Tim


--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


Re: [slurm-users] slurmstepd: error: load_ebpf_prog: BPF load error (No space left on device). Please check your system limits (MEMLOCK).

2024-01-24 Thread Tim Schneider

Hi,

I just tested with 23.02.7-1 and the issue is gone. So it seems like the 
patch got released.


Best,

Tim

On 1/24/24 16:55, Stefan Fleischmann wrote:

On Wed, 24 Jan 2024 12:37:04 -0300 Cristóbal Navarro
 wrote:

Many thanks
One question: do we have to apply this patch (and recompile Slurm, I
guess) only on the compute node with problems?
Also, I noticed the patch now appears as "obsolete"; is that OK?

We have Slurm installed on an NFS share, so what I did was to recompile
it and then replace only the library lib/slurm/cgroup_v2.so. That is
good enough for now; I've been planning to update to 23.11 soon anyway.
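
In case it helps anyone, a rough sketch of that "rebuild and swap only the plugin" step (untested here; source tree layout, install prefix and version numbers are examples and will differ per site):

# Sketch: rebuild a patched tree and replace just the cgroup v2 plugin
cd slurm-23.02.3
patch -p1 < bug17210.patch            # patch attached to the SchedMD bug report
./configure --prefix=/nfs/slurm && make -j
cp src/plugins/cgroup/v2/.libs/cgroup_v2.so /nfs/slurm/lib/slurm/cgroup_v2.so
systemctl restart slurmd              # on the affected compute node(s)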

I suppose it's marked as obsolete because the patch went into a
release. According to the info in the bug report it should have been
included in 23.02.4.

Cheers,
Stefan


On Wed, Jan 24, 2024 at 9:52 AM Stefan Fleischmann 
wrote:


Turns out I was wrong, this is not a problem in the kernel at all.
It's a known bug that is triggered by long bpf logs, see here
  https://bugs.schedmd.com/show_bug.cgi?id=17210

There is a patch included there.

Cheers,
Stefan

On Tue, 23 Jan 2024 15:28:59 +0100 Stefan Fleischmann 
wrote:

I don't think there is much for SchedMD to do. As I said since it
is working fine with newer kernels there doesn't seem to be any
breaking change in cgroup2 in general, but only a regression
introduced in one of the latest updates in 5.15.

If Slurm was doing something wrong with cgroup2, and it
accidentally worked until this recent change, then other kernel
versions should show the same behavior. But as far as I can tell
it still works just fine with newer kernels.

Cheers,
Stefan

On Tue, 23 Jan 2024 15:20:56 +0100
Tim Schneider  wrote:
  

Hi,

I have filed a bug report with SchedMD
(https://bugs.schedmd.com/show_bug.cgi?id=18623), but the
support told me they cannot invest time in this issue since I
don't have a support contract. Maybe they will look into it
once it affects more people or someone important enough.

So far, I have resorted to using 5.15.0-89-generic, but I am
also a bit concerned about the security aspect of this choice.

Best,

Tim

On 23.01.24 14:59, Stefan Fleischmann wrote:

Hi!

I'm seeing the same in our environment. My conclusion is that
it is a regression in the Ubuntu 5.15 kernel, introduced with
5.15.0-90-generic. Last working kernel version is
5.15.0-89-generic. I have filed a bug report here:
https://bugs.launchpad.net/bugs/2050098

Please add yourself to the affected users in the bug report
so it hopefully gets more attention.

I've tested with newer kernels (6.5, 6.6 and 6.7) and the
problem does not exist there. 6.5 is the latest hwe kernel
for 22.04 and would be an option for now. Reverting back to
5.15.0-89 would work as well, but I haven't looked into the
security aspects of that.

Cheers,
Stefan

On Mon, 22 Jan 2024 13:31:15 -0300
cristobal.navarro.g at gmail.com wrote:
  

Hi Tim and community,
We have been having the same issue (cgroups not working,
it seems, showing all GPUs to jobs) on a GPU-compute node
(DGX A100) since a couple of days ago, after a full update
(apt upgrade). Now whenever we launch a job on that partition,
we get the error message mentioned by Tim. As a note, we
have another custom GPU-compute node with L40s, on a
different partition, and that one works fine. Before this
error, we always had small differences in kernel version
between nodes, so I am not sure if this can be the problem.
Nevertheless, here is the info of our nodes as well.

[Problem node] The DGX A100 node has this kernel
cnavarro at nodeGPU01:~$ uname -a
Linux nodeGPU01 5.15.0-1042-nvidia #42-Ubuntu SMP Wed Nov 15
20:28:30 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

[Functioning node] The Custom GPU node (L40s) has this kernel
cnavarro at nodeGPU02:~$ uname -a
Linux nodeGPU02 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14
13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

And the login node (slurmctld)
~ uname -a
Linux patagon-master 5.15.0-91-generic #101-Ubuntu SMP Tue
Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Any ideas what we should check?

On Thu, Jan 4, 2024 at 3:03 PM Tim Schneider  wrote:
  

Hi,

I am using SLURM 22.05.9 on a small compute cluster. Since I
reinstalled two of our nodes, I get the following error when
launching a job:

slurmstepd: error: load_ebpf_prog: BPF load error (No space
left on device). Please check your system limits (MEMLOCK).

Also the cgroups do not seem to work properly anymore, as I
am able to see all GPUs even if I do not request them,
which is not the case on the other nodes.

One difference I found between the updated nodes and the
original nodes (both are Ubuntu 22.04) is the kernel
version, which is "5.15.0-89-generic #99-Ubuntu SMP" on the
functioning nodes and "5.15.0-91-generic #101-Ubuntu SMP"
on the updated nodes. I could not figure out how to install
the exact first kernel version on the updated nodes, but I
noticed that when I reinstall 5.15.0 with this tool:
ht

Re: [slurm-users] slurmstepd: error: load_ebpf_prog: BPF load error (No space left on device). Please check your system limits (MEMLOCK).

2024-01-23 Thread Tim Schneider

Hi,

I have filed a bug report with SchedMD 
(https://bugs.schedmd.com/show_bug.cgi?id=18623), but the support told 
me they cannot invest time in this issue since I don't have a support 
contract. Maybe they will look into it once it affects more people or 
someone important enough.


So far, I have resorted to using 5.15.0-89-generic, but I am also a bit 
concerned about the security aspect of this choice.


Best,

Tim

On 23.01.24 14:59, Stefan Fleischmann wrote:

Hi!

I'm seeing the same in our environment. My conclusion is that it is a
regression in the Ubuntu 5.15 kernel, introduced with 5.15.0-90-generic.
Last working kernel version is 5.15.0-89-generic. I have filed a bug
report here: https://bugs.launchpad.net/bugs/2050098

Please add yourself to the affected users in the bug report so it
hopefully gets more attention.

I've tested with newer kernels (6.5, 6.6 and 6.7) and the problem does
not exist there. 6.5 is the latest hwe kernel for 22.04 and would be an
option for now. Reverting back to 5.15.0-89 would work as well, but I
haven't looked into the security aspects of that.

Cheers,
Stefan

On Mon, 22 Jan 2024 13:31:15 -0300
cristobal.navarro.g at gmail.com wrote:


Hi Tim and community,
We have been having the same issue (cgroups not working, it seems,
showing all GPUs to jobs) on a GPU-compute node (DGX A100) since a
couple of days ago, after a full update (apt upgrade). Now whenever we
launch a job on that partition, we get the error message mentioned by Tim.
As a note, we have another custom GPU-compute node with L40s, on a
different partition, and that one works fine.
Before this error, we always had small differences in kernel version
between nodes, so I am not sure if this can be the problem.
Nevertheless, here is the info of our nodes as well.

[Problem node] The DGX A100 node has this kernel
cnavarro at nodeGPU01:~$ uname -a
Linux nodeGPU01 5.15.0-1042-nvidia #42-Ubuntu SMP Wed Nov 15 20:28:30
UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

[Functioning node] The Custom GPU node (L40s) has this kernel
cnavarro at nodeGPU02:~$ uname -a
Linux nodeGPU02 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08
UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

And the login node (slurmctld)
~ uname -a
Linux patagon-master 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14
13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Any ideas what we should check?

On Thu, Jan 4, 2024 at 3:03 PM Tim Schneider  wrote:


Hi,

I am using SLURM 22.05.9 on a small compute cluster. Since I
reinstalled two of our nodes, I get the following error when
launching a job:

slurmstepd: error: load_ebpf_prog: BPF load error (No space left on
device). Please check your system limits (MEMLOCK).

Also the cgroups do not seem to work properly anymore, as I am able
to see all GPUs even if I do not request them, which is not the
case on the other nodes.

One difference I found between the updated nodes and the original
nodes (both are Ubuntu 22.04) is the kernel version, which is
"5.15.0-89-generic #99-Ubuntu SMP" on the functioning nodes and
"5.15.0-91-generic #101-Ubuntu SMP" on the updated nodes. I could
not figure out how to install the exact first kernel version on the
updated nodes, but I noticed that when I reinstall 5.15.0 with this
tool: https://github.com/pimlie/ubuntu-mainline-kernel.sh, the
error message disappears. However, once I do that, the network
driver does not function properly anymore, so this does not seem to
be a good solution.

Has anyone seen this issue before or is there maybe something else I
should take a look at? I am also happy to just find a workaround
such that I can take these nodes back online.

I appreciate any help!

Thanks a lot in advance and best wishes,

Tim


  




[slurm-users] slurmstepd: error: load_ebpf_prog: BPF load error (No space left on device). Please check your system limits (MEMLOCK).

2024-01-04 Thread Tim Schneider

Hi,

I am using SLURM 22.05.9 on a small compute cluster. Since I reinstalled 
two of our nodes, I get the following error when launching a job:


slurmstepd: error: load_ebpf_prog: BPF load error (No space left on 
device). Please check your system limits (MEMLOCK).


Also the cgroups do not seem to work properly anymore, as I am able to 
see all GPUs even if I do not request them, which is not the case on the 
other nodes.


One difference I found between the updated nodes and the original nodes 
(both are Ubuntu 22.04) is the kernel version, which is 
"5.15.0-89-generic #99-Ubuntu SMP" on the functioning nodes and 
"5.15.0-91-generic #101-Ubuntu SMP" on the updated nodes. I could not 
figure out how to install the exact first kernel version on the updated 
nodes, but I noticed that when I reinstall 5.15.0 with this tool: 
https://github.com/pimlie/ubuntu-mainline-kernel.sh, the error message 
disappears. However, once I do that, the network driver does not 
function properly anymore, so this does not seem to be a good solution.


Has anyone seen this issue before or is there maybe something else I 
should take a look at? I am also happy to just find a workaround such 
that I can take these nodes back online.


I appreciate any help!

Thanks a lot in advance and best wishes,

Tim




Re: [slurm-users] scontrol reboot does not allow new jobs to be scheduled if nextstate=RESUME is set

2023-10-25 Thread Tim Schneider

Hi Ole,

thanks for your reply.

The curious thing is that when I run "scontrol reboot nextstate=RESUME 
<nodename>", the drain flag of that node is not set (sinfo shows mix@ and 
"scontrol show node <nodename>" shows no DRAIN in State, just 
MIXED+REBOOT_REQUESTED), yet no jobs are scheduled on that node until 
reboot. If I specifically request that node for a job with "-w <nodename>", 
I get "Nodes required for job are DOWN, DRAINED or reserved for jobs in 
higher priority partitions".


Not using nextstate=RESUME is inconvenient for me as sometimes we have 
parts of our cluster drained and I would like to run a single command 
that reboots all non-drained nodes once they become idle and all drained 
nodes immediately, resuming them once they are done reinstalling.
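
A sketch of how that single command could be scripted (untested; the sinfo state names and options may need adjusting, and it assumes nextstate=RESUME behaves as documented):

# Reboot drained nodes immediately, everything else once it becomes idle
DRAINED=$(sinfo -h -N -t drain -o '%N' | sort -u | paste -sd,)
OTHERS=$(sinfo -h -N -t idle,mix,alloc -o '%N' | sort -u | paste -sd,)
[ -n "$DRAINED" ] && scontrol reboot ASAP nextstate=RESUME reason=reinstall "$DRAINED"
[ -n "$OTHERS" ]  && scontrol reboot nextstate=RESUME reason=reinstall "$OTHERS"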


Best,

Tim

On 25.10.23 14:59, Ole Holm Nielsen wrote:

Hi Tim,

I think the scontrol manual page explains the "scontrol reboot" function
fairly well:


reboot  [ASAP]  [nextstate={RESUME|DOWN}] [reason=<reason>] {ALL|<NodeList>}
   Reboot the nodes in the system when they become idle  using  the
   RebootProgram  as  configured  in Slurm's slurm.conf file.  Each
   node will have the "REBOOT" flag added to its node state.  After
   a  node  reboots  and  the  slurmd  daemon  starts up again, the
   HealthCheckProgram will run once. Then, the slurmd  daemon  will
   register  itself with the slurmctld daemon and the "REBOOT" flag
   will be cleared.  The node's "DRAIN" state flag will be  cleared
   if  the reboot was "ASAP", nextstate=resume or down.  The "ASAP"
   option adds the "DRAIN" flag to each  node's  state,  preventing
   additional  jobs  from running on the node so it can be rebooted
   and returned to service  "As  Soon  As  Possible"  (i.e.  ASAP).

It seems to be implicitly understood that if nextstate is specified, this
implies setting the "DRAIN" state flag:


The node's "DRAIN" state flag will be  cleared if the reboot was "ASAP", 
nextstate=resume or down.

You can verify the node's DRAIN flag with "scontrol show node <nodename>".

IMHO, if you want nodes to continue accepting new jobs, then nextstate is
irrelevant.

We always use "reboot ASAP" because our cluster is usually so busy that
nodes never become idle if left to themselves :-)

FYI: We regularly make package updates and firmware updates using the
"scontrol reboot asap" method which is explained in this script:
https://github.com/OleHolmNielsen/Slurm_tools/blob/master/nodes/update.sh

Best regards,
Ole


On 10/25/23 13:39, Tim Schneider wrote:

Hi Chris,

thanks a lot for your response.

I just realized that I made a mistake in my post. In the section you cite,
the command is supposed to be "scontrol reboot nextstate=RESUME" (without
ASAP).

So to clarify: my problem is that if I type "scontrol reboot
nextstate=RESUME" no new jobs get scheduled anymore until the reboot. On
the other hand, if I type "scontrol reboot", jobs continue to get
scheduled, which is what I want. I just don't understand, why setting
nextstate results in the nodes not accepting jobs anymore.

My usecase is similar to the one you describe. We use the ASAP option when
we install a new image to ensure that from the point of the reinstallation
onwards, all jobs end up on nodes with the new configuration only.
However, in some cases when we do only minor changes to the image
configuration, we prefer to cause as little disruption as possible and
just reinstall the nodes whenever they are idle. Here, being able to set
nextstate=RESUME is useful, since we usually want the nodes to resume
after reinstallation, no matter what their previous state was.

Hope that clears it up and sorry for the confusion!

Best,

tim

On 25.10.23 02:10, Christopher Samuel wrote:

On 10/24/23 12:39, Tim Schneider wrote:


Now my issue is that when I run "scontrol reboot ASAP nextstate=RESUME
<nodename>", the node goes into "mix@" state (not drain), but no new jobs get
scheduled until the node reboots. Essentially I get draining behavior,
even though the node's state is not "drain". Note that this behavior is
caused by "nextstate=RESUME"; if I leave that away, jobs get scheduled
as expected. Does anyone have an idea why that could be?

The intent of the "ASAP` flag for "scontrol reboot" is to not let any
more jobs onto a node until it has rebooted.

IIRC that was from work we sponsored, the idea being that (for how our
nodes are managed) we would build new images with the latest software
stack, test them on a separate test system and then once happy bring
them over to the production system and do an "scontrol reboot ASAP
nextstate=resume reason=... $NODES&

Re: [slurm-users] scontrol reboot does not allow new jobs to be scheduled if nextstate=RESUME is set

2023-10-25 Thread Tim Schneider

Hi Chris,

thanks a lot for your response.

I just realized that I made a mistake in my post. In the section you 
cite, the command is supposed to be "scontrol reboot nextstate=RESUME" 
(without ASAP).


So to clarify: my problem is that if I type "scontrol reboot 
nextstate=RESUME" no new jobs get scheduled anymore until the reboot. On 
the other hand, if I type "scontrol reboot", jobs continue to get 
scheduled, which is what I want. I just don't understand, why setting 
nextstate results in the nodes not accepting jobs anymore.


My usecase is similar to the one you describe. We use the ASAP option 
when we install a new image to ensure that from the point of the 
reinstallation onwards, all jobs end up on nodes with the new 
configuration only. However, in some cases when we do only minor changes 
to the image configuration, we prefer to cause as little disruption as 
possible and just reinstall the nodes whenever they are idle. Here, 
being able to set nextstate=RESUME is useful, since we usually want the 
nodes to resume after reinstallation, no matter what their previous 
state was.


Hope that clears it up and sorry for the confusion!

Best,

tim

On 25.10.23 02:10, Christopher Samuel wrote:

On 10/24/23 12:39, Tim Schneider wrote:


Now my issue is that when I run "scontrol reboot ASAP nextstate=RESUME
<nodename>", the node goes into "mix@" state (not drain), but no new jobs get
scheduled until the node reboots. Essentially I get draining behavior,
even though the node's state is not "drain". Note that this behavior is
caused by "nextstate=RESUME"; if I leave that away, jobs get scheduled
as expected. Does anyone have an idea why that could be?

The intent of the "ASAP` flag for "scontrol reboot" is to not let any
more jobs onto a node until it has rebooted.

IIRC that was from work we sponsored, the idea being that (for how our
nodes are managed) we would build new images with the latest software
stack, test them on a separate test system and then once happy bring
them over to the production system and do an "scontrol reboot ASAP
nextstate=resume reason=... $NODES" to ensure that from that point
onwards no new jobs would start in the old software configuration, only
the new one.

Also slurmctld would know that these nodes are due to come back in
"ResumeTimeout" seconds after the reboot is issued and so could plan for
them as part of scheduling large jobs, rather than thinking there was no
way it could do so and letting lots of smaller jobs get in the way.

Hope that helps!

All the best,
Chris




[slurm-users] scontrol reboot does not allow new jobs to be scheduled if nextstate=RESUME is set

2023-10-24 Thread Tim Schneider

Hi,

from my understanding, if I run "scontrol reboot <nodename>", the node 
should continue to operate as usual and reboot once it is idle. When 
adding the ASAP flag (scontrol reboot ASAP <nodename>), the node should 
go into drain state and not accept any more jobs.


Now my issue is that when I run "scontrol reboot ASAP nextstate=RESUME 
<nodename>", the node goes into "mix@" state (not drain), but no new jobs get 
scheduled until the node reboots. Essentially I get draining behavior, 
even though the node's state is not "drain". Note that this behavior is 
caused by "nextstate=RESUME"; if I leave that away, jobs get scheduled 
as expected. Does anyone have an idea why that could be?


I am running slurm 22.05.9.

Steps to reproduce:

# To prevent node from rebooting immediately
sbatch -t 1:00:00 -c 1 --mem-per-cpu 1G -w <nodename> ./long_running_script.sh

# Request reboot
scontrol reboot nextstate=RESUME <nodename>

# Run interactive command, which does not start until
# "scontrol cancel_reboot <nodename>" is executed in another shell
srun -t 1:00:00 -c 1 --mem-per-cpu 1G -w <nodename> --pty bash


Thanks a lot in advance!

Best,

Tim




Re: [slurm-users] task/cgroup plugin causes "srun: error: task 0 launch failed: Plugin initialization failed" error on Ubuntu 22.04

2023-06-17 Thread Tim Schneider

Hi,

I just want to wrap this up in case someone has the same issue in the 
future.


As Reed pointed out, Ubuntu 22 does not support cgroups v1 anymore. At 
the same time, the slurm-wlm package in the Ubuntu repositories uses 
cgroups v1, which makes its task/cgroup plugin incompatible with Ubuntu 22.


My solution was to build Slurm 22.05 manually, while ensuring that 
libdbus-1-dev is installed (as otherwise cgroups v2 support does not 
get built). This takes a bit more time but seems to work so far.
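
Roughly, such a build might look like this (a sketch, not a complete recipe; the package list is not exhaustive and paths/versions will vary):

# Sketch: build Slurm 22.05 with cgroup v2 support on Ubuntu 22.04
sudo apt-get install build-essential libmunge-dev munge libdbus-1-dev \
                     libpam0g-dev libhwloc-dev
tar xjf slurm-22.05.9.tar.bz2 && cd slurm-22.05.9
./configure --prefix=/usr/local --sysconfdir=/etc/slurm
make -j && sudo make install
ls /usr/local/lib/slurm/cgroup_v2.so    # confirm the cgroup v2 plugin was built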


Thanks a lot Reed & Abel for your advice!

Best,

Tim

On 6/16/23 10:42, Tim Schneider wrote:


Hi again,

I just realized that the author of 
https://groups.google.com/g/slurm-users/c/0dJhe5r6_2Q?pli=1 mentions at 
some point that he built Slurm 22 instead of using the Ubuntu repo 
version. So I guess I will have to look into that.


Best,

Tim

On 6/16/23 10:36, Tim Schneider wrote:


Hi Abel and Reed,

thanks a lot for your quick replies!

I did indeed just install slurm-wlm from the Ubuntu repos.

Following the advice of 
https://groups.google.com/g/slurm-users/c/0dJhe5r6_2Q?pli=1, I tried 
disabling cgroups v1 on Ubuntu, but that just leads to an error 
during startup of slurmd:


slurmd: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/proctrack_cgroup.so
slurmd: error: unable to mount freezer cgroup namespace: Invalid argument
slurmd: error: unable to create freezer cgroup namespace
slurmd: error: Couldn't load specified plugin name for proctrack/cgroup: Plugin init() callback failed
slurmd: error: cannot create proctrack context for proctrack/cgroup
slurmd: error: slurmd initialization failed

So it seems that slurmd is using cgroups v1. This is also reflected 
in the mounts (for the output below, cgroups v1 is enabled again):


$ mount | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)


What is still confusing to me is that the slurmd logs indicate no 
error when I try running with cgroups v1 enabled and the error only 
appears on the slurmctld side.


Do you know how I can enable cgroups v2 in Slurm? To me it seems that 
this is what 
https://groups.google.com/g/slurm-users/c/0dJhe5r6_2Q?pli=1 did.


Best,

Tim

On 6/16/23 03:28, abel pinto wrote:
Indeed, the issue seems to be that Ubuntu 22.04 does not support 
cgroups v1 anymore. Does SLURM support cgroups v2? It seems so: 
https://slurm.schedmd.com/cgroup_v2.html


/Abel


On Jun 15, 2023, at 20:20, Reed Dier  wrote:

I don’t have any direct advice off-hand, but I figure I will try 
to help steer the conversation in the right direction for figuring 
it out.


I’m going to assume that since you mention 21.08.5, you are using 
the slurm-wlm packages from the Ubuntu repos, and not 
building yourself?


And have all the components (slurmctld(s), slurmdbd, slurmd(s)) 
been upgraded as well?


The only thing that immediately comes to mind is that I remember 
reading a good bit about Ubuntu 22.04’s use of cgroups v2, which as 
I understand it are very different from cgroups v1, and plenty of 
people have had issues with v1/v2 mismatches with slurm and other 
applications.


https://www.reddit.com/r/SLURM/comments/vjquih/error_cannot_find_cgroup_plugin_for_cgroupv2/
https://groups.google.com/g/slurm-users/c/0dJhe5r6_2Q?pli=1
https://discuss.linuxcontainers.org/t/after-updated-to-more-recent-ubuntu-version-with-cgroups-v2-ubuntu-16-04-container-is-not-working-properly/14022

Hope that at least steers the conversation in a good direction.

Reed

On Jun 15, 2023, at 5:04 PM, Tim Schneider 
 wrote:


Hi,

I am maintaining the SLURM cluster of my research group. Recently 
I updated to Ubuntu 22.04 and Slurm 21.08.5 and ever since, I am 
unable to launch jobs. When launching a job, I receive the 
following error:


$ srun --nodes=1 --ntasks-per-node=1 -c 1 --mem-per-cpu 1G --time=01:00:00 --pty -p amd -w cn02 --pty bash -i

srun: error: task 0 launch failed: Plugin initialization failed

Strangely, I cannot find any indication of this problem in the 
logs (find the logs attached). The problem must be related to the 
task/cgroup plugin, as it does not occur when I disable it.


After reading the documentation, I tried adding the 
cgroup_enable=memory swapaccount=1 kernel parameters, but the 
problem persisted.


I would be very grateful for any advice where to look since I have 
no idea how to investigate this issue further.


Thanks a lot in advance.

Best,

Tim






[slurm-users] Fwd: task/cgroup plugin causes "srun: error: task 0 launch failed: Plugin initialization failed" error on Ubuntu 22.04

2023-06-15 Thread Tim Schneider

Hi,

I am maintaining the SLURM cluster of my research group. Recently I 
updated to Ubuntu 22.04 and Slurm 21.08.5 and ever since, I am unable to 
launch jobs. When launching a job, I receive the following error:


$ srun --nodes=1 --ntasks-per-node=1 -c 1 --mem-per-cpu 1G --time=01:00:00 --pty -p amd -w cn02 --pty bash -i

srun: error: task 0 launch failed: Plugin initialization failed

Strangely, I cannot find any indication of this problem in the logs 
(find the logs attached). The problem must be related to the task/cgroup 
plugin, as it does not occur when I disable it.


After reading the documentation, I tried adding the 
cgroup_enable=memory swapaccount=1 kernel parameters, but the problem 
persisted.
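
(For reference, kernel parameters like these are typically added via GRUB, roughly as sketched below; they relate to cgroup v1 memory accounting and, as noted, did not resolve the issue here.)

# Sketch: adding kernel parameters on Ubuntu
# in /etc/default/grub:
#   GRUB_CMDLINE_LINUX="cgroup_enable=memory swapaccount=1"
sudo update-grub && sudo reboot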


I would be very grateful for any advice where to look since I have no 
idea how to investigate this issue further.


Thanks a lot in advance.

Best,

Tim

###
# Slurm cgroup support configuration file
###
CgroupAutomount=yes
CgroupMountpoint=/sys/fs/cgroup
ConstrainKmemSpace=no
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
# This will be necessary for controlling GPU access
ConstrainDevices=yes
#
# slurmd -D -vv --conf-server nas:6817
slurmd: debug:  Log file re-opened
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: debug2: hwloc_topology_export_xml
slurmd: debug:  CPUs:16 Boards:1 Sockets:1 CoresPerSocket:16 ThreadsPerCore:1
slurmd: debug4: CPU map[0]=>0 S:C:T 0:0:0
slurmd: debug4: CPU map[1]=>1 S:C:T 0:1:0
slurmd: debug4: CPU map[2]=>2 S:C:T 0:2:0
slurmd: debug4: CPU map[3]=>3 S:C:T 0:3:0
slurmd: debug4: CPU map[4]=>4 S:C:T 0:4:0
slurmd: debug4: CPU map[5]=>5 S:C:T 0:5:0
slurmd: debug4: CPU map[6]=>6 S:C:T 0:6:0
slurmd: debug4: CPU map[7]=>7 S:C:T 0:7:0
slurmd: debug4: CPU map[8]=>8 S:C:T 0:8:0
slurmd: debug4: CPU map[9]=>9 S:C:T 0:9:0
slurmd: debug4: CPU map[10]=>10 S:C:T 0:10:0
slurmd: debug4: CPU map[11]=>11 S:C:T 0:11:0
slurmd: debug4: CPU map[12]=>12 S:C:T 0:12:0
slurmd: debug4: CPU map[13]=>13 S:C:T 0:13:0
slurmd: debug4: CPU map[14]=>14 S:C:T 0:14:0
slurmd: debug4: CPU map[15]=>15 S:C:T 0:15:0
slurmd: debug3: _set_slurmd_spooldir: initializing slurmd spool directory `/var/spool/slurmd`
slurmd: debug2: hwloc_topology_init
slurmd: debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurmd/hwloc_topo_whole.xml) found
slurmd: debug:  CPUs:16 Boards:1 Sockets:1 CoresPerSocket:16 ThreadsPerCore:1
slurmd: debug4: CPU map[0]=>0 S:C:T 0:0:0
slurmd: debug4: CPU map[1]=>1 S:C:T 0:1:0
slurmd: debug4: CPU map[2]=>2 S:C:T 0:2:0
slurmd: debug4: CPU map[3]=>3 S:C:T 0:3:0
slurmd: debug4: CPU map[4]=>4 S:C:T 0:4:0
slurmd: debug4: CPU map[5]=>5 S:C:T 0:5:0
slurmd: debug4: CPU map[6]=>6 S:C:T 0:6:0
slurmd: debug4: CPU map[7]=>7 S:C:T 0:7:0
slurmd: debug4: CPU map[8]=>8 S:C:T 0:8:0
slurmd: debug4: CPU map[9]=>9 S:C:T 0:9:0
slurmd: debug4: CPU map[10]=>10 S:C:T 0:10:0
slurmd: debug4: CPU map[11]=>11 S:C:T 0:11:0
slurmd: debug4: CPU map[12]=>12 S:C:T 0:12:0
slurmd: debug4: CPU map[13]=>13 S:C:T 0:13:0
slurmd: debug4: CPU map[14]=>14 S:C:T 0:14:0
slurmd: debug4: CPU map[15]=>15 S:C:T 0:15:0
slurmd: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/gres_gpu.so
slurmd: debug:  gres/gpu: init: loaded
slurmd: debug3: Success.
slurmd: debug3: _merge_gres2: From gres.conf, using gpu:rtx2080:1:/dev/nvidia0
slurmd: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/gpu_generic.so
slurmd: debug:  gpu/generic: init: init: GPU Generic plugin loaded
slurmd: debug3: Success.
slurmd: debug3: gres_device_major : /dev/nvidia0 major 195, minor 0
slurmd: Gres Name=gpu Type=rtx2080 Count=1
slurmd: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/topology_none.so
slurmd: topology/none: init: topology NONE plugin loaded
slurmd: debug3: Success.
slurmd: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/route_default.so
slurmd: route/default: init: route default plugin loaded
slurmd: debug3: Success.
slurmd: debug2: Gathering cpu frequency information for 16 cpus
slurmd: debug:  Resource spec: No specialized cores configured by default on this node
slurmd: debug:  Resource spec: Reserved system memory limit not configured for this node
slurmd: debug3: NodeName= cn02
slurmd: debug3: TopoAddr= cn02
slurmd: debug3: TopoPattern = node
slurmd: debug3: ClusterName = iascluster
slurmd: debug3: Confile = `/var/spool/slurmd/conf-cache/slurm.conf'
slurmd: debug3: Debug   = 5
slurmd: debug3: CPUs= 16 (CF: 16, HW: 16)
slurmd: debug3: Boards  = 1  (CF:  1, HW:  1)
slurmd: debug3: Sockets = 1  (CF:  1, HW:  1)
slurmd: debug3: Cores   = 16 (CF: 16, HW: 16)
slurmd: debug3: Threads = 1  (CF:  1, HW:  1)
slurmd: debug3: UpTime  = 2377 = 00:39:37
slurmd: debug3: Block Map   = 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
slurmd: debug3: Inverse Map = 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
slurmd: debug3: RealMemory  = 64216
slurmd: debug3: TmpDisk = 32108
slurmd: debug3: Epilog