[slurm-users] Re: salloc+srun vs just srun

2024-02-28 Thread wdennis--- via slurm-users
Thanks for the logical explanation, Paul. So when I rewrite my user 
documentation, I'll mention using `salloc` instead of `srun`.

Yes, we do have `LaunchParameters=use_interactive_step` set on our cluster, so 
salloc gives a shell on the allocated host.
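
For reference, the pieces involved are roughly the following (the partition and
resource values are just illustrative, not our actual config):

    # slurm.conf
    LaunchParameters=use_interactive_step

    # what the user docs will show: request an interactive shell on a compute node
    salloc --partition=interactive --cpus-per-task=2 --mem=4G --time=2:00:00

With use_interactive_step set, the salloc above drops the user into a shell on
the allocated node instead of leaving them on the login node.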

Best,
Will

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: [ext] Re: canonical way to run longer shell/bash interactive job (instead of srun inside of screen/tmux at front-end)?

2024-02-28 Thread Brian Andrus via slurm-users

Most of my stuff is in the cloud, so I use their load balancing services.

HAProxy does have sticky sessions, which you can enable based on source IP, so 
it also works with non-HTTP protocols such as SSH. See, for example, the guide 
"2 Ways to Enable Sticky Sessions in HAProxy".
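
A minimal sketch of what source-IP stickiness for SSH to the login nodes can
look like in HAProxy (hostnames and addresses below are made up; cloud load
balancers have their own equivalent setting):

    frontend login_ssh
        bind *:22
        mode tcp
        default_backend login_nodes

    backend login_nodes
        mode tcp
        balance source          # hash the client IP -> same login node each time
        server login1 10.0.0.11:22 check
        server login2 10.0.0.12:22 check

A stick-table with "stick on src" is the other common way to get the same effect.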



Brian Andrus

On 2/28/2024 12:54 PM, Dan Healy wrote:

Are most of us using HAProxy or something else?

On Wed, Feb 28, 2024 at 3:38 PM Brian Andrus via slurm-users wrote:


Magnus,

That is a feature of the load balancer. Most of them have that
these days.

Brian Andrus

On 2/28/2024 12:10 AM, Hagdorn, Magnus Karl Moritz via slurm-users
wrote:
> On Tue, 2024-02-27 at 08:21 -0800, Brian Andrus via slurm-users
wrote:
>> for us, we put a load balancer in front of the login nodes with
>> session
>> affinity enabled. This makes them land on the same backend node
each
>> time.
> Hi Brian,
> that sounds interesting - how did you implement session affinity?
> cheers
> magnus
>
>

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com

To unsubscribe send an email to slurm-users-le...@lists.schedmd.com



--
Thanks,

Daniel Healy
-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: [ext] Re: canonical way to run longer shell/bash interactive job (instead of srun inside of screen/tmux at front-end)?

2024-02-28 Thread Cutts, Tim via slurm-users
HAProxy, for on-prem things.  In the cloud I just use their load balancers 
rather than implement my own.

Tim

--
Tim Cutts
Scientific Computing Platform Lead
AstraZeneca



From: Dan Healy via slurm-users 
Date: Wednesday, 28 February 2024 at 20:56
To: Brian Andrus 
Cc: slurm-users@lists.schedmd.com 
Subject: [slurm-users] Re: [ext] Re: canonical way to run longer shell/bash 
interactive job (instead of srun inside of screen/tmux at front-end)?
Are most of us using HAProxy or something else?

On Wed, Feb 28, 2024 at 3:38 PM Brian Andrus via slurm-users 
<slurm-users@lists.schedmd.com> wrote:
Magnus,

That is a feature of the load balancer. Most of them have that these days.

Brian Andrus

On 2/28/2024 12:10 AM, Hagdorn, Magnus Karl Moritz via slurm-users wrote:
> On Tue, 2024-02-27 at 08:21 -0800, Brian Andrus via slurm-users wrote:
>> for us, we put a load balancer in front of the login nodes with
>> session
>> affinity enabled. This makes them land on the same backend node each
>> time.
> Hi Brian,
> that sounds interesting - how did you implement session affinity?
> cheers
> magnus
>
>

--
slurm-users mailing list -- 
slurm-users@lists.schedmd.com
To unsubscribe send an email to 
slurm-users-le...@lists.schedmd.com


--
Thanks,

Daniel Healy


AstraZeneca UK Limited is a company incorporated in England and Wales with 
registered number:03674842 and its registered office at 1 Francis Crick Avenue, 
Cambridge Biomedical Campus, Cambridge, CB2 0AA.

This e-mail and its attachments are intended for the above named recipient only 
and may contain confidential and privileged information. If they have come to 
you in error, you must not copy or show them to anyone; instead, please reply 
to this e-mail, highlighting the error to the sender and then immediately 
delete the message. For information about how AstraZeneca UK Limited and its 
affiliates may process information, personal data and monitor communications, 
please see our privacy notice at 
www.astrazeneca.com

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: [ext] Re: canonical way to run longer shell/bash interactive job (instead of srun inside of screen/tmux at front-end)?

2024-02-28 Thread Dan Healy via slurm-users
Are most of us using HAProxy or something else?

On Wed, Feb 28, 2024 at 3:38 PM Brian Andrus via slurm-users <
slurm-users@lists.schedmd.com> wrote:

> Magnus,
>
> That is a feature of the load balancer. Most of them have that these days.
>
> Brian Andrus
>
> On 2/28/2024 12:10 AM, Hagdorn, Magnus Karl Moritz via slurm-users wrote:
> > On Tue, 2024-02-27 at 08:21 -0800, Brian Andrus via slurm-users wrote:
> >> for us, we put a load balancer in front of the login nodes with
> >> session
> >> affinity enabled. This makes them land on the same backend node each
> >> time.
> > Hi Brian,
> > that sounds interesting - how did you implement session affinity?
> > cheers
> > magnus
> >
> >
>
> --
> slurm-users mailing list -- slurm-users@lists.schedmd.com
> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
>


-- 
Thanks,

Daniel Healy

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: [ext] Re: canonical way to run longer shell/bash interactive job (instead of srun inside of screen/tmux at front-end)?

2024-02-28 Thread Brian Andrus via slurm-users

Magnus,

That is a feature of the load balancer. Most of them have that these days.

Brian Andrus

On 2/28/2024 12:10 AM, Hagdorn, Magnus Karl Moritz via slurm-users wrote:

On Tue, 2024-02-27 at 08:21 -0800, Brian Andrus via slurm-users wrote:

for us, we put a load balancer in front of the login nodes with
session
affinity enabled. This makes them land on the same backend node each
time.

Hi Brian,
that sounds interesting - how did you implement session affinity?
cheers
magnus




--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Enforcing relative resource restrictions in submission script

2024-02-28 Thread Jason Simms via slurm-users
Hello Matthew,

You may be aware of this already, but most sites would make these kinds of
checks/validations using job_submit.lua. I'm not an expert in that - though
plenty of others on this list are - but I'm positive you could implement
this type of validation logic. I'd like to say that I've come across a good
tutorial for job_submit.lua, but I haven't really found one. This is kind
of a good intro:

https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#job-submit-plugins

You can also find some sample scripts, such as:

https://github.com/SchedMD/slurm/blob/master/contribs/lua/job_submit.lua
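
Purely as a rough sketch of the shape such a check can take (untested; the
job_desc field names vary between Slurm versions, unset fields may come back as
nil or as Slurm's NO_VAL sentinel, and this deliberately ignores several of the
request forms Matthew lists below):

    -- job_submit.lua (sketch only)
    local MAX_CPUS_PER_GPU = 4

    -- pull a GPU count out of strings like "gres:gpu:2", "gres/gpu:2" or "gpu:2"
    local function gpu_count(tres)
       if tres == nil then return 0 end
       return tonumber(string.match(tres, "gpu[:=](%d+)")) or 0
    end

    function slurm_job_submit(job_desc, part_list, submit_uid)
       local gpus = gpu_count(job_desc.tres_per_node) + gpu_count(job_desc.tres_per_job)
       if gpus > 0 then
          -- NB: real code must also handle nil / NO_VAL here
          local cpus = (job_desc.cpus_per_task or 1) * (job_desc.num_tasks or 1)
          if cpus / gpus > MAX_CPUS_PER_GPU then
             slurm.log_user(string.format(
                "No more than %d CPUs per GPU are allowed", MAX_CPUS_PER_GPU))
             -- a specific ESLURM_* return code could also be used here
             return slurm.ERROR
          end
       end
       return slurm.SUCCESS
    end

    function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
       return slurm.SUCCESS
    end

A memory-per-GPU check would follow the same pattern, but as Matthew notes it
has to cover the mem_per_node, mem_per_cpu and mem_per_tres cases separately.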

Warmest regards,
Jason

On Tue, Feb 27, 2024 at 5:02 PM Matthew R. Baney via slurm-users <
slurm-users@lists.schedmd.com> wrote:

> Hello Slurm users,
>
> I'm trying to write a check in our job_submit.lua script that enforces
> relative resource requirements such as disallowing more than 4 CPUs or 48GB
> of memory per GPU. The QOS itself has a MaxTRESPerJob of
> cpu=32,gres/gpu=8,mem=384G (roughly one full node), but we're looking to
> prevent jobs from "stranding" GPUs, e.g., a 32 CPU/384GB memory job with
> only 1 GPU.
>
> I might be missing something obvious, but the rabbit hole I'm going down
> at the moment is trying to check all of the different ways job arguments
> could be set in the job descriptor.
>
> i.e., the following should all be disallowed:
>
> srun --gres=gpu:1 --mem=49G ... (tres_per_node, mem_per_node set in the
> descriptor)
>
> srun --gpus=1 --mem-per-gpu=49G ... (tres_per_job, mem_per_tres)
>
> srun --gres=gpu:1 --ntasks-per-gpu=5 ... (tres_per_node, num_tasks,
> ntasks_per_tres)
>
> srun --gpus=1 --ntasks=2 --mem-per-cpu=25G ... (tres_per_job, num_tasks,
> mem_per_cpu)
>
> ...
>
> Essentially what I'm looking for is a way to access the ReqTRES string
> from the job record before it exists, and then run some logic against it;
> i.e., if (CPU count / GPU count) > 4 or (mem count / GPU count) > 48G,
> error out.
>
> Is something like this possible?
>
> Thanks,
> Matthew
>
> --
> Matthew Baney
> Assistant Director of Computational Systems
> mba...@umd.edu | (301) 405-6756
> University of Maryland Institute for Advanced Computer Studies
> 3154 Brendan Iribe Center
> 8125 Paint Branch Dr.
> College Park, MD 20742
>
> --
> slurm-users mailing list -- slurm-users@lists.schedmd.com
> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
>


-- 
*Jason L. Simms, Ph.D., M.P.H.*
Manager of Research Computing
Swarthmore College
Information Technology Services
(610) 328-8102
Schedule a meeting: https://calendly.com/jlsimms

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: pty jobs are killed when another job on the same node terminates

2024-02-28 Thread Jason Simms via slurm-users
Hello Thomas,

I know I'm a few days late to this, so I'm wondering whether you've made
any progress. We experience this, too, but in a different way.

First, though, you may be aware, but you should use salloc rather than srun
--pty for an interactive session. That's been the preferred method for a
while, and one reason is that you can't run an srun from within an srun. So
I wonder whether that has something to do with it.

We run an old version of Open OnDemand, and what we see is when a user
starts a virtual desktop session on a node and then submits a job with
sbatch, once it terminates/dies, the virtual desktop session terminates
too. I think this happens only when the job ends up on the same node on
which the virtual desktop session is running. I haven't delved too deeply
into that, but I suspect the virtual desktop session might be launched with
an srun in some way, and somehow this is affected by something that happens
when submitting an sbatch. I know that's super vague, but I haven't really
gone too far with it, though the errors are similar (and in fact might be
identical, it's been a few weeks!).

Warmest regards,
Jason

On Thu, Feb 22, 2024 at 5:41 AM Thomas Hartmann via slurm-users <
slurm-users@lists.schedmd.com> wrote:

> Hi,
>
> when I start an interactive job like this:
>
> srun --pty --mem=3G -c2 bash
>
> And then I schedule and run other jobs (can be interactive or non
> interactive) and one of these jobs that runs on the same node terminates,
> the interactive job gets killed with this message:
>
> srun: error: node01.abc.at: task 0: Killed
>
> I attached our slurm config. Does anybody have an idea what is going on
> here or where I could look to debug? I'm quite new to slurm, so I don't
> know all the places to look...
>
> Thanks a lot in advance!
>
> Thomas
>
> --
> slurm-users mailing list -- slurm-users@lists.schedmd.com
> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
>


-- 
*Jason L. Simms, Ph.D., M.P.H.*
Manager of Research Computing
Swarthmore College
Information Technology Services
(610) 328-8102
Schedule a meeting: https://calendly.com/jlsimms

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: salloc+srun vs just srun

2024-02-28 Thread Paul Edmon via slurm-users
He's talking about recent versions of Slurm which now have this option: 
https://slurm.schedmd.com/slurm.conf.html#OPT_use_interactive_step


-Paul Edmon-

On 2/28/2024 10:46 AM, Paul Raines wrote:


What do you mean "operate via the normal command line"?  When
you salloc, you are still on the login node.

$ salloc -p rtx6000 -A sysadm -N 1 --ntasks-per-node=1 --mem=20G 
--time=1-10:00:00 --gpus=2 --cpus-per-task=2 /bin/bash

salloc: Pending job allocation 3798364
salloc: job 3798364 queued and waiting for resources
salloc: job 3798364 has been allocated resources
salloc: Granted job allocation 3798364
salloc: Waiting for resource configuration
salloc: Nodes rtx-02 are ready for job
mesg: cannot open /dev/pts/91: Permission denied
mlsc-login[0]:~$ hostname
mlsc-login.nmr.mgh.harvard.edu
mlsc-login[0]:~$ printenv | grep SLURM_JOB_NODELIST
SLURM_JOB_NODELIST=rtx-02

Seems you MUST use srun


-- Paul Raines (http://help.nmr.mgh.harvard.edu)



On Wed, 28 Feb 2024 10:25am, Paul Edmon via slurm-users wrote:

  External Email - Use Caution

salloc is the currently recommended way for interactive sessions. srun is now 
intended for launching steps or MPI applications. So properly you would salloc 
and then srun inside the salloc.

As you've noticed, with srun you tend to lose control of your shell as it 
takes over, so you have to background the process unless it is the main 
process. We've hit this before when people use srun to subschedule in 
a salloc.


You can also just launch the salloc and then operate via the normal 
command line reserving srun for things like launching MPI.


The reason they changed from srun to salloc is that you can't srun 
inside a srun. So if you were a user who started a srun interactive 
session and then you tried to invoke MPI it would get weird as you 
would be invoking another srun. By using salloc you avoid this issue.


We used to use srun for interactive sessions as well but swapped to 
salloc a few years back and haven't had any issues.


-Paul Edmon-

On 2/28/2024 10:17 AM, wdennis--- via slurm-users wrote:

 Hi list,

 In our institution, our instructions to users who want to spawn an
 interactive job (for us, a bash shell) have always been to do "srun ..."
 from the login node, which has always been working well for us. But when
 we had a recent Slurm training, the SchedMD folks advised us to use
 "salloc" and then "srun" to do interactive jobs. I tried this today,
 "salloc" gave me a shell on a server, the same as srun does, but then when
 I tried to "srun [programname]" it hung there with no output. Of course
 when I tried "srun [programname] &" it spawned the background job, and
 gave me back a prompt. Either time I had to Ctrl-C the running srun job,
 and got no output other than the srun/slurmstepd termination output.

 I think I read somewhere that directly invoking srun creates an
 allocation; why then would I want to do an initial salloc, and then srun?
 (in the case that I want a foreground program, such as a bash shell)

 I have surveyed some other institution's Slurm interactive jobs
 documentation for users, I see both examples of advice to run srun
 directly, or salloc and then srun.

 Please help me to understand how this is intended to work, and if we are
 "doing it wrong" :)

 Thanks,
 Will



--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com






--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: salloc+srun vs just srun

2024-02-28 Thread Paul Raines via slurm-users



What do you mean "operate via the normal command line"?  When
you salloc, you are still on the login node.

$ salloc -p rtx6000 -A sysadm -N 1 --ntasks-per-node=1 --mem=20G 
--time=1-10:00:00 --gpus=2 --cpus-per-task=2 /bin/bash

salloc: Pending job allocation 3798364
salloc: job 3798364 queued and waiting for resources
salloc: job 3798364 has been allocated resources
salloc: Granted job allocation 3798364
salloc: Waiting for resource configuration
salloc: Nodes rtx-02 are ready for job
mesg: cannot open /dev/pts/91: Permission denied
mlsc-login[0]:~$ hostname
mlsc-login.nmr.mgh.harvard.edu
mlsc-login[0]:~$ printenv | grep SLURM_JOB_NODELIST
SLURM_JOB_NODELIST=rtx-02

Seems you MUST use srun


-- Paul Raines (http://help.nmr.mgh.harvard.edu)



On Wed, 28 Feb 2024 10:25am, Paul Edmon via slurm-users wrote:

  External Email - Use Caution 
salloc is the currently recommended way for interactive sessions. srun is now 
intended for launching steps or MPI applications. So properly you would 
salloc and then srun inside the salloc.


As you've noticed, with srun you tend to lose control of your shell as it takes 
over, so you have to background the process unless it is the main process. We've 
hit this before when people use srun to subschedule in a salloc.


You can also just launch the salloc and then operate via the normal command 
line reserving srun for things like launching MPI.


The reason they changed from srun to salloc is that you can't srun inside a 
srun. So if you were a user who started a srun interactive session and then 
you tried to invoke MPI it would get weird as you would be invoking another 
srun. By using salloc you avoid this issue.


We used to use srun for interactive sessions as well but swapped to salloc a 
few years back and haven't had any issues.


-Paul Edmon-

On 2/28/2024 10:17 AM, wdennis--- via slurm-users wrote:

 Hi list,

 In our institution, our instructions to users who want to spawn an
 interactive job (for us, a bash shell) have always been to do "srun ..."
 from the login node, which has always been working well for us. But when
 we had a recent Slurm training, the SchedMD folks advised us to use
 "salloc" and then "srun" to do interactive jobs. I tried this today,
 "salloc" gave me a shell on a server, the same as srun does, but then when
 I tried to "srun [programname]" it hung there with no output. Of course
 when I tried "srun [programname] &" it spawned the background job, and
 gave me back a prompt. Either time I had to Ctrl-C the running srun job,
 and got no output other than the srun/slurmstepd termination output.

 I think I read somewhere that directly invoking srun creates an
 allocation; why then would I want to do an initial salloc, and then srun?
 (in the case that I want a foreground program, such as a bash shell)

 I have surveyed some other institution's Slurm interactive jobs
 documentation for users, I see both examples of advice to run srun
 directly, or salloc and then srun.

 Please help me to understand how this is intended to work, and if we are
 "doing it wrong" :)

 Thanks,
 Will



--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com








--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: [EXTERN] Re: sbatch and cgroup v2

2024-02-28 Thread Josef Dvoracek via slurm-users

> I'm running slurm 22.05.11 which is available with OpenHPC 3.x
> Do you think an upgrade is needed?

I feel that a lot of slurm operators tend not to use 3rd-party sources of 
slurm binaries, as you do not have the build environment fully in your 
hands.


But before making such a complex decision, perhaps look for the build logs 
of the slurm you use (somewhere in the OpenHPC build system?) and check whether 
it was built with the libraries needed for cgroup v2 to work.


Missing cgroup v2 dependencies at build time is only one of several 
possible causes.
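
Not definitive, but a quick way to see whether your packages were even built
with the v2 plugin (paths assume an RPM install into /usr/lib64/slurm; adjust
for your layout):

    # is the cgroup/v2 plugin present at all?
    ls /usr/lib64/slurm/ | grep cgroup

    # a cgroup/v2-capable build should pull in dbus
    ldd /usr/lib64/slurm/cgroup_v2.so | grep -i dbus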


josef







-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: salloc+srun vs just srun

2024-02-28 Thread Paul Edmon via slurm-users
salloc is the currently recommended way for interactive sessions. srun 
is now intended for launching steps or MPI applications. So properly you 
would salloc and then srun inside the salloc.


As you've noticed, with srun you tend to lose control of your shell as it 
takes over, so you have to background the process unless it is the main 
process. We've hit this before when people use srun to subschedule in a 
salloc.


You can also just launch the salloc and then operate via the normal 
command line reserving srun for things like launching MPI.


The reason they changed from srun to salloc is that you can't srun 
inside a srun. So if you were a user who started a srun interactive 
session and then you tried to invoke MPI it would get weird as you would 
be invoking another srun. By using salloc you avoid this issue.
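
So the pattern we point users at looks roughly like this (the resource numbers
and program name are just an example):

    # get an interactive allocation; if LaunchParameters=use_interactive_step
    # is set, this lands you in a shell on the allocated node
    salloc -N 1 -n 4 --time=1:00:00

    # inside the allocation, use srun only to launch steps / MPI ranks
    srun -n 4 ./my_mpi_program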


We used to use srun for interactive sessions as well but swapped to 
salloc a few years back and haven't had any issues.


-Paul Edmon-

On 2/28/2024 10:17 AM, wdennis--- via slurm-users wrote:

Hi list,

In our institution, our instructions to users who want to spawn an interactive job (for us, a bash shell) have always been to do "srun 
..." from the login node, which has always been working well for us. But when we had a recent Slurm training, the SchedMD folks advised us 
to use "salloc" and then "srun" to do interactive jobs. I tried this today, "salloc" gave me a shell on a server, 
the same as srun does, but then when I tried to "srun [programname]" it hung there with no output. Of course when I tried "srun 
[programname] &" it spawned the background job, and gave me back a prompt. Either time I had to Ctrl-C the running srun job, and got 
no output other than the srun/slurmstepd termination output.

I think I read somewhere that directly invoking srun creates an allocation; why 
then would I want to do an initial salloc, and then srun? (in the case that I 
want a foreground program, such as a bash shell)

I have surveyed some other institution's Slurm interactive jobs documentation 
for users, I see both examples of advice to run srun directly, or salloc and 
then srun.

Please help me to understand how this is intended to work, and if we are "doing it 
wrong" :)

Thanks,
Will



--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] salloc+srun vs just srun

2024-02-28 Thread wdennis--- via slurm-users
Hi list,

In our institution, our instructions to users who want to spawn an interactive 
job (for us, a bash shell) have always been to do "srun ..." from the login 
node, which has always been working well for us. But when we had a recent Slurm 
training, the SchedMD folks advised us to use "salloc" and then "srun" to do 
interactive jobs. I tried this today, "salloc" gave me a shell on a server, the 
same as srun does, but then when I tried to "srun [programname]" it hung there 
with no output. Of course when I tried "srun [programname] &" it spawned the 
background job, and gave me back a prompt. Either time I had to Ctrl-C the 
running srun job, and got no output other than the srun/slurmstepd termination 
output.

I think I read somewhere that directly invoking srun creates an allocation; why 
then would I want to do an initial salloc, and then srun? (in the case that I 
want a foreground program, such as a bash shell)

I have surveyed some other institution's Slurm interactive jobs documentation 
for users, I see both examples of advice to run srun directly, or salloc and 
then srun.

Please help me to understand how this is intended to work, and if we are "doing 
it wrong" :)

Thanks,
Will

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: slurmdbd error - Symbol `slurm_conf' has different size in shared object

2024-02-28 Thread Josef Dvoracek via slurm-users

I think installing/upgrading the "slurm" rpm will replace this shared lib.

Indeed, as always, test it first on a not-so-critical system and use VM 
snapshots to be able to travel back in time ... because once you upgrade the 
DB schema (if that is part of the upgrade) you AFAIK cannot go back.
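
Roughly (exact package names depend on how your RPMs were built):

    # see what is still at the old release
    rpm -qa 'slurm*'

    # upgrade the main slurm package (it owns /usr/lib64/slurm/libslurmfull.so)
    # together with slurm-slurmdbd, e.g.:
    dnf upgrade slurm slurm-slurmdbd

    # then confirm the library now comes from the new package
    rpm -q --whatprovides /usr/lib64/slurm/libslurmfull.so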


josef

On 28. 02. 24 15:51, Miriam Olmi via slurm-users wrote:

I installed the new version of slurm 23.11.0-1 by rpm.
How can I fix this?



-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] How to get usage data for a QOS

2024-02-28 Thread thomas.hartmann--- via slurm-users
Hi,
so, I figured out that I can give some users priority access for a specific 
amount of TRES by creating a qos with the GrpTRESMins property and the 
DenyOnLimit,NoDecay flags. This works nicely.
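
For reference, the QOS was created roughly like this (name and numbers are just 
an example):

    sacctmgr add qos prio_gpu
    sacctmgr modify qos prio_gpu set Flags=DenyOnLimit,NoDecay GrpTRESMins=gres/gpu=100000
    sacctmgr modify user someuser set qos+=prio_gpu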

However, I would like to know how much of this has already been consumed, and I 
have not yet found a way to do this. In other words: how can I get the amount of 
TRES/TRES minutes consumed under a certain QOS?

Thanks a lot!
Thomas

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: slurmdbd error - Symbol `slurm_conf' has different size in shared object

2024-02-28 Thread Miriam Olmi via slurm-users
Hi Josef,

thanks a lot for your reply!

I just checked and you are right!!!

My library comes from the old version of slurm:

$ rpm -q --whatprovides /usr/lib64/slurm/libslurmfull.so
slurm-23.02.3-1.el8.x86_64

I installed the new version of slurm 23.11.0-1 by rpm.
How can I fix this?

Many thanks in advance again,
Miriam


> I see this question unanswered so far.. so I'll give you my 2 cents:
>
> Quick check reveals that mentioned symbol is in libslurmfull.so :
>
> [root@slurmserver2 ~]# nm -gD /usr/lib64/slurm/libslurmfull.so | grep
> "slurm_conf$"
> 000d2c06 T free_slurm_conf
> 000d3345 T init_slurm_conf
> 0041d000 B slurm_conf
> [root@slurmserver2 ~]#
>
> Could it be that this dynamic lib is still the old one?
>
> Depending on whether you install slurm from rpms, a manual in-place build, or
> something else, the reasons why an old lib is in place may vary.
>
> cheers
>
> josef
>
>
> On 28. 02. 24 11:16, Miriam Olmi via slurm-users wrote:
>> `slurm_conf' has different size in shared object, consider re-linking
> --
> slurm-users mailing list -- slurm-users@lists.schedmd.com
> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
>


-- 
***
Miriam Olmi
Computing & Network Service

Laboratori Nazionali del Gran Sasso - INFN
Via G. Acitelli, 22
67100 Assergi (AQ) Italy
https://www.lngs.infn.it

email: miriam.o...@lngs.infn.it
   office: +39 0862 437222
***


-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: [EXTERN] Re: sbatch and cgroup v2

2024-02-28 Thread Dietmar Rieder via slurm-users

Hi,

I'm running slurm 22.05.11 which is available with OpenHPC 3.x
Do you think an upgrade is needed?

Best
  Dietmar

On 2/28/24 14:55, Josef Dvoracek via slurm-users wrote:

Hi Dietmar;

I tried this on ${my cluster}, as I switched to cgroupsv2 quite recently..

I must say that on my setup it looks like it works as expected; see the 
grepped stdout from your reproducer below.


I use recent slurm 23.11.4 .

Wild guess: does your build machine have the bpf and dbus devel packages installed?
(Both packages can be absent when building Slurm for cgroup v1 only.)


cheers

josef

[jose@koios1 test_cgroups]$ cat slurm-7177217.out | grep eli
ValueError: CPU number 7 is not eligible; choose between [0, 1, 2, 3]
ValueError: CPU number 4 is not eligible; choose between [0, 1, 2, 3]
ValueError: CPU number 5 is not eligible; choose between [0, 1, 2, 3]
ValueError: CPU number 11 is not eligible; choose between [0, 1, 2, 3]
ValueError: CPU number 9 is not eligible; choose between [0, 1, 2, 3]
ValueError: CPU number 10 is not eligible; choose between [0, 1, 2, 3]
ValueError: CPU number 14 is not eligible; choose between [0, 1, 2, 3]
ValueError: CPU number 8 is not eligible; choose between [0, 1, 2, 3]
ValueError: CPU number 12 is not eligible; choose between [0, 1, 2, 3]
ValueError: CPU number 6 is not eligible; choose between [0, 1, 2, 3]
ValueError: CPU number 13 is not eligible; choose between [0, 1, 2, 3]
ValueError: CPU number 15 is not eligible; choose between [0, 1, 2, 3]
[jose@koios1 test_cgroups]$

On 28. 02. 24 14:28, Dietmar Rieder via slurm-users wrote:
...








-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: [EXTERN] Re: sbatch and cgroup v2

2024-02-28 Thread Dietmar Rieder via slurm-users

Hi Hermann,

I get:

Cpus_allowed:   ,,
Cpus_allowed_list:  0-95

Best
   Dietmar

P.S.: best regards from the CCB

On 2/28/24 15:01, Hermann Schwärzler via slurm-users wrote:

Hi Dietmar,

what do you find in the output-file of this job

sbatch --time 5 --cpus-per-task=1 --wrap 'grep Cpus /proc/$$/status'

On our 64 cores machines with enabled hyperthreading I see e.g.

Cpus_allowed:   0400,,0400,
Cpus_allowed_list:  58,122

Greetings
Hermann


On 2/28/24 14:28, Dietmar Rieder via slurm-users wrote:

Hi,

I'm new to slurm, but maybe someone can help me:

I'm trying to restrict the CPU usage to the actually 
requested/allocated resources using cgroup v2.


For this I made the following settings in slurmd.conf:


ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity

And in cgroup.conf

CgroupPlugin=cgroup/v2
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes
AllowedRAMSpace=98


cgroup v2 seems to be active on the compute node:

# mount | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 
(rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)


# cat /sys/fs/cgroup/cgroup.subtree_control
cpuset cpu io memory pids
# cat /sys/fs/cgroup/system.slice/cgroup.subtree_control
cpuset cpu io memory pids


Now, when I use sbatch to submit the following test script, the python 
script which is started from the batch script is utilizing all CPUs 
(96) at 100% on the allocated node, although I only ask for 4 cpus 
(--cpus-per-task=4). I'd expect that the task cannot use more than 
these 4.


#!/bin/bash
#SBATCH --output=/local/users/appadmin/test-%j.log
#SBATCH --job-name=test
#SBATCH --chdir=/local/users/appadmin
#SBATCH --cpus-per-task=4
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --mem=64gb
#SBATCH --time=4:00:00
#SBATCH --partition=standard
#SBATCH --gpus=0
#SBATCH --export
#SBATCH --get-user-env=L

export 
PATH=/usr/local/bioinf/jupyterhub/bin:/usr/local/bioinf/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/local/bioinf/miniforge/condabin


source .bashrc
conda activate test
python test.py


The python code in test.py is the following using the 
cpu_load_generator package from [1]:


#!/usr/bin/env python

import sys
from cpu_load_generator import load_single_core, load_all_cores, 
from_profile


load_all_cores(duration_s=120, target_load=1)  # generates load on all 
cores



Interestingly, when I use srun to launch an interactive job, and run 
the python script manually, I see with top that only 4 cpus are 
running at 100%. And I also see python errors thrown when the script tries 
to start the 5th process (which makes sense):


   File 
"/usr/local/bioinf/miniforge/envs/test/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap

 self.run()
   File 
"/usr/local/bioinf/miniforge/envs/test/lib/python3.12/multiprocessing/process.py", line 108, in run

 self._target(*self._args, **self._kwargs)
   File 
"/usr/local/bioinf/miniforge/envs/test/lib/python3.12/site-packages/cpu_load_generator/_interface.py", line 24, in load_single_core

 process.cpu_affinity([core_num])
   File 
"/usr/local/bioinf/miniforge/envs/test/lib/python3.12/site-packages/psutil/__init__.py", line 867, in cpu_affinity

 self._proc.cpu_affinity_set(list(set(cpus)))
   File 
"/usr/local/bioinf/miniforge/envs/test/lib/python3.12/site-packages/psutil/_pslinux.py", line 1714, in wrapper

 return fun(self, *args, **kwargs)
    ^^
   File 
"/usr/local/bioinf/miniforge/envs/test/lib/python3.12/site-packages/psutil/_pslinux.py", line 2213, in cpu_affinity_set

 cext.proc_cpu_affinity_set(self.pid, cpus)
OSError: [Errno 22] Invalid argument


What am I missing, why are the CPU resources not restricted when I use 
sbatch?



Thanks for any input or hint
    Dietmar

[1]: https://pypi.org/project/cpu-load-generator/






--
_
D i e t m a r  R i e d e r
Innsbruck Medical University
Biocenter - Institute of Bioinformatics
Innrain 80, 6020 Innsbruck
Phone: +43 512 9003 71402 | Mobile: +43 676 8716 72402
Email: dietmar.rie...@i-med.ac.at
Web:   http://www.icbi.at





-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: slurmdbd error - Symbol `slurm_conf' has different size in shared object

2024-02-28 Thread Josef Dvoracek via slurm-users

I see this question unanswered so far.. so I'll give you my 2 cents:

Quick check reveals that mentioned symbol is in libslurmfull.so :

[root@slurmserver2 ~]# nm -gD /usr/lib64/slurm/libslurmfull.so | grep 
"slurm_conf$"

000d2c06 T free_slurm_conf
000d3345 T init_slurm_conf
0041d000 B slurm_conf
[root@slurmserver2 ~]#

Could it be that this dynamic lib is still the old one?

Depending on whether you install slurm from rpms, a manual in-place build, or 
something else, the reasons why an old lib is in place may vary.


cheers

josef


On 28. 02. 24 11:16, Miriam Olmi via slurm-users wrote:

`slurm_conf' has different size in shared object, consider re-linking


-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: sbatch and cgroup v2

2024-02-28 Thread Hermann Schwärzler via slurm-users

Hi Dietmar,

what do you find in the output-file of this job

sbatch --time 5 --cpus-per-task=1 --wrap 'grep Cpus /proc/$$/status'

On our 64 cores machines with enabled hyperthreading I see e.g.

Cpus_allowed:   0400,,0400,
Cpus_allowed_list:  58,122

Greetings
Hermann


On 2/28/24 14:28, Dietmar Rieder via slurm-users wrote:

Hi,

I'm new to slurm, but maybe someone can help me:

I'm trying to restrict the CPU usage to the actually requested/allocated 
resources using cgroup v2.


For this I made the following settings in slurmd.conf:


ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity

And in cgroup.conf

CgroupPlugin=cgroup/v2
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes
AllowedRAMSpace=98


cgroup v2 seems to be active on the compute node:

# mount | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 
(rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)


# cat /sys/fs/cgroup/cgroup.subtree_control
cpuset cpu io memory pids
# cat /sys/fs/cgroup/system.slice/cgroup.subtree_control
cpuset cpu io memory pids


Now, when I use sbatch to submit the following test script, the python 
script which is started from the batch script is utilizing all CPUs (96) 
at 100% on the allocated node, although I only ask for 4 cpus 
(--cpus-per-task=4). I'd expect that the task cannot use more than 
these 4.


#!/bin/bash
#SBATCH --output=/local/users/appadmin/test-%j.log
#SBATCH --job-name=test
#SBATCH --chdir=/local/users/appadmin
#SBATCH --cpus-per-task=4
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --mem=64gb
#SBATCH --time=4:00:00
#SBATCH --partition=standard
#SBATCH --gpus=0
#SBATCH --export
#SBATCH --get-user-env=L

export 
PATH=/usr/local/bioinf/jupyterhub/bin:/usr/local/bioinf/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/local/bioinf/miniforge/condabin


source .bashrc
conda activate test
python test.py


The python code in test.py is the following using the cpu_load_generator 
package from [1]:


#!/usr/bin/env python

import sys
from cpu_load_generator import load_single_core, load_all_cores, 
from_profile


load_all_cores(duration_s=120, target_load=1)  # generates load on all 
cores



Interestingly, when I use srun to launch an interactive job, and run the 
python script manually, I see with top that only 4 cpus are running at 
100%. And I also see python errors thrown when the script tries to start the 
5th process (which makes sense):


   File 
"/usr/local/bioinf/miniforge/envs/test/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap

     self.run()
   File 
"/usr/local/bioinf/miniforge/envs/test/lib/python3.12/multiprocessing/process.py", line 108, in run

     self._target(*self._args, **self._kwargs)
   File 
"/usr/local/bioinf/miniforge/envs/test/lib/python3.12/site-packages/cpu_load_generator/_interface.py", line 24, in load_single_core

     process.cpu_affinity([core_num])
   File 
"/usr/local/bioinf/miniforge/envs/test/lib/python3.12/site-packages/psutil/__init__.py", line 867, in cpu_affinity

     self._proc.cpu_affinity_set(list(set(cpus)))
   File 
"/usr/local/bioinf/miniforge/envs/test/lib/python3.12/site-packages/psutil/_pslinux.py", line 1714, in wrapper

     return fun(self, *args, **kwargs)
    ^^
   File 
"/usr/local/bioinf/miniforge/envs/test/lib/python3.12/site-packages/psutil/_pslinux.py", line 2213, in cpu_affinity_set

     cext.proc_cpu_affinity_set(self.pid, cpus)
OSError: [Errno 22] Invalid argument


What am I missing, why are the CPU resources not restricted when I use 
sbatch?



Thanks for any input or hint
    Dietmar

[1]: https://pypi.org/project/cpu-load-generator/




--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: sbatch and cgroup v2

2024-02-28 Thread Josef Dvoracek via slurm-users

Hi Dietmar;

I tried this on ${my cluster}, as I switched to cgroupsv2 quite recently..

I must say that on my setup it looks like it works as expected; see the 
grepped stdout from your reproducer below.


I use recent slurm 23.11.4 .

Wild guess: does your build machine have the bpf and dbus devel packages installed?
(Both packages can be absent when building Slurm for cgroup v1 only.)


cheers

josef

[jose@koios1 test_cgroups]$ cat slurm-7177217.out | grep eli
ValueError: CPU number 7 is not eligible; choose between [0, 1, 2, 3]
ValueError: CPU number 4 is not eligible; choose between [0, 1, 2, 3]
ValueError: CPU number 5 is not eligible; choose between [0, 1, 2, 3]
ValueError: CPU number 11 is not eligible; choose between [0, 1, 2, 3]
ValueError: CPU number 9 is not eligible; choose between [0, 1, 2, 3]
ValueError: CPU number 10 is not eligible; choose between [0, 1, 2, 3]
ValueError: CPU number 14 is not eligible; choose between [0, 1, 2, 3]
ValueError: CPU number 8 is not eligible; choose between [0, 1, 2, 3]
ValueError: CPU number 12 is not eligible; choose between [0, 1, 2, 3]
ValueError: CPU number 6 is not eligible; choose between [0, 1, 2, 3]
ValueError: CPU number 13 is not eligible; choose between [0, 1, 2, 3]
ValueError: CPU number 15 is not eligible; choose between [0, 1, 2, 3]
[jose@koios1 test_cgroups]$

On 28. 02. 24 14:28, Dietmar Rieder via slurm-users wrote:
...



-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: canonical way to run longer shell/bash interactive job (instead of srun inside of screen/tmux at front-end)?

2024-02-28 Thread Josef Dvoracek via slurm-users

For some unclear reason "--wrap" was not part of my /repertoire/ so far.

thanks

On 26. 02. 24 9:47, Ward Poelmans via slurm-users wrote:

sbatch --wrap 'screen -D -m'
srun --jobid  --pty screen -rd


-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: GPU shards not exclusive

2024-02-28 Thread wdennis--- via slurm-users
Hi Reed,

Unfortunately, we had the same issue with 22.05.9; SchedMD's advice was to 
upgrade to 23.11.x, and this appears to have resolved the issue for us. 
SchedMD support told us, "We did a lot of work regarding shards in the 23.11 
release."

HTH,
Will

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] sbatch and cgroup v2

2024-02-28 Thread Dietmar Rieder via slurm-users

Hi,

I'm new to slurm, but maybe someone can help me:


I'm trying to restrict the CPU usage to the actually requested/allocated 
resources using cgroup v2.


For this I made the following settings in slurmd.conf:


ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity

And in cgroup.conf

CgroupPlugin=cgroup/v2
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes
AllowedRAMSpace=98


cgroup v2 seems to be active on the compute node:

# mount | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 
(rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)


# cat /sys/fs/cgroup/cgroup.subtree_control
cpuset cpu io memory pids
# cat /sys/fs/cgroup/system.slice/cgroup.subtree_control
cpuset cpu io memory pids


Now, when I use sbatch to submit the following test script, the python 
script which is started from the batch script is utilizing all CPUs (96) 
at 100% on the allocated node, although I only ask for 4 cpus 
(--cpus-per-task=4). I'd expect that the task cannot use more than these 4.


#!/bin/bash
#SBATCH --output=/local/users/appadmin/test-%j.log
#SBATCH --job-name=test
#SBATCH --chdir=/local/users/appadmin
#SBATCH --cpus-per-task=4
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --mem=64gb
#SBATCH --time=4:00:00
#SBATCH --partition=standard
#SBATCH --gpus=0
#SBATCH --export
#SBATCH --get-user-env=L

export 
PATH=/usr/local/bioinf/jupyterhub/bin:/usr/local/bioinf/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/local/bioinf/miniforge/condabin


source .bashrc
conda activate test
python test.py


The python code in test.py is the following using the cpu_load_generator 
package from [1]:


#!/usr/bin/env python

import sys
from cpu_load_generator import load_single_core, load_all_cores, 
from_profile


load_all_cores(duration_s=120, target_load=1)  # generates load on all cores


Interestingly, when I use srun to launch an interactive job, and run the 
python script manually, I see with top that only 4 cpus are running at 
100%. And I also see python errors thrown when the script tries to start the 
5th process (which makes sense):


  File 
"/usr/local/bioinf/miniforge/envs/test/lib/python3.12/multiprocessing/process.py", 
line 314, in _bootstrap

self.run()
  File 
"/usr/local/bioinf/miniforge/envs/test/lib/python3.12/multiprocessing/process.py", 
line 108, in run

self._target(*self._args, **self._kwargs)
  File 
"/usr/local/bioinf/miniforge/envs/test/lib/python3.12/site-packages/cpu_load_generator/_interface.py", 
line 24, in load_single_core

process.cpu_affinity([core_num])
  File 
"/usr/local/bioinf/miniforge/envs/test/lib/python3.12/site-packages/psutil/__init__.py", 
line 867, in cpu_affinity

self._proc.cpu_affinity_set(list(set(cpus)))
  File 
"/usr/local/bioinf/miniforge/envs/test/lib/python3.12/site-packages/psutil/_pslinux.py", 
line 1714, in wrapper

return fun(self, *args, **kwargs)
   ^^
  File 
"/usr/local/bioinf/miniforge/envs/test/lib/python3.12/site-packages/psutil/_pslinux.py", 
line 2213, in cpu_affinity_set

cext.proc_cpu_affinity_set(self.pid, cpus)
OSError: [Errno 22] Invalid argument


What am I missing, why are the CPU resources not restricted when I use 
sbatch?



Thanks for any input or hint
   Dietmar

[1]: https://pypi.org/project/cpu-load-generator/



-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] User-facing documentation on shard use

2024-02-28 Thread wdennis--- via slurm-users
Hello list,

We have just enabled "gres/shard" in order to enable sharing of GPUs on our 
cluster. I am now looking for examples of user-facing documentation on this 
feature. If anyone has something, and can send a URL or other example, I'd 
appreciate it.
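
For context, what I mainly want to document is the request syntax and when to
use it, e.g. (assuming 4 shards per GPU are configured; the job script name is
made up):

    # request a quarter of a GPU instead of a whole one
    sbatch --gres=shard:1 myjob.sh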

Thanks,
Will

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] slurmdbd error - Symbol `slurm_conf' has different size in shared object

2024-02-28 Thread Miriam Olmi via slurm-users
Hi all,


I am having some issue with the new version of slurm 23.11.0-1.


I had already installed and configured slurm 23.02.3-1 on my cluster and
all the services were active and running properly.


Following the instructions on the official SLURM webpage, for the moment I
upgraded only the slurmdbd service.
In principle the cluster should be able to work properly if slurmdbd
has a higher version than slurmctld and slurmd.


Unfortunately the slurmdbd service fails to start with the following status:


slurmdbd.service - Slurm DBD accounting daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmdbd.service; enabled;
vendor preset: disabled)
   Active: failed (Result: core-dump) since Wed 2024-02-28 10:05:53 CET;
9min ago
  Process: 534938 ExecStart=/usr/sbin/slurmdbd -D -s $SLURMDBD_OPTIONS
(code=dumped, signal=SEGV)
 Main PID: 534938 (code=dumped, signal=SEGV)

Feb 28 10:05:53 slurm-db systemd[1]: Started Slurm DBD accounting daemon.
Feb 28 10:05:53 slurm-db slurmdbd[534938]: /usr/sbin/slurmdbd: Symbol
`slurm_conf' has different size in shared object, consider re-linking
Feb 28 10:05:53 slurm-db systemd[1]: slurmdbd.service: Main process
exited, code=dumped, status=11/SEGV
Feb 28 10:05:53 slurm-db systemd[1]: slurmdbd.service: Failed with result
'core-dump'.



Can anyone help me?

Thanks in advance,
Miriam




-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Partition, Qos Limits & Scheduling of large jobs

2024-02-28 Thread Muck, Katrin via slurm-users
Hi everyone!



I have read the slurm documentation about qos, resource limits, scheduling and 
priority multiple times now, and even looked into the slurm source, but I'm still 
not sure I got everything right, which is why I decided to ask here ...



The problem: we see that larger jobs with e.g. 16 gpus in our (small) gpu queue 
sometimes get delayed and shifted back without any reason that is apparent to us. 
Small jobs that only use e.g. 1 or 2 gpus get scheduled much more quickly, even 
though they have a runtime of 3 days ...



What we want to do:


- We have a number of nodes with 2x gpus that are usable by the users of our 
cluster

- Some of these nodes belong to so-called 'private projects'. Private projects 
have higher priority than other projects. Attached to that is a contingent of 
nodes & "guaranteed" nodes e.g. they could have a contingent of 4 nodes (8 
gpus) and e.g. 2 "guaranteed" nodes (4 gpus)

- Guaranteed nodes are nodes that should always be kept idle for the private 
project, so users of the private project can immediately schedule work on those 
nodes

- The other nodes are shared with other projects in general if they are not "in 
use"


How we are currently doing this (it has history):

Lets assume we have 50 nodes and 100 gpus.

- We have a single partition for all gpu nodes (e.g. 50 nodes)
- Private projects have private queues with a very high priority and a gres 
limit of the number of gpus they reserved (e.g. 10 nodes -> 20 gpus)
- Normal projects only have access to the public queue and schedule work there.
- This public queue has an upper gres limit of "total number of gpus" - 
"guaranteed gpus of all private projects" (e.g. 50 - 10 nodes -> 40 nodes -> 80 
gpus).
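
In sacctmgr terms the above corresponds roughly to the following (names and
numbers are illustrative):

    # private project qos: very high priority, capped at its reserved gpus
    sacctmgr add qos proj_a_priv
    sacctmgr modify qos proj_a_priv set Priority=10000 GrpTRES=gres/gpu=20

    # public qos: total gpus minus all "guaranteed" gpus
    sacctmgr add qos gpu_public
    sacctmgr modify qos gpu_public set GrpTRES=gres/gpu=80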


Regarding the scheduler, we currently use the following settings:

SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_CORE_MEMORY,CR_CORE_DEFAULT_DIST_BLOCK,CR_ONE_TASK_PER_CORE
SchedulerParameters=defer,max_sched_time=4,default_queue_depth=1000,partition_job_depth=500,enable_user_top,bf_max_job_user=20,bf_interval=120,bf_window=4320,bf_resolution=1800,bf_continue

Partition/queue depth is deliberately set high at the moment to avoid problems 
with jobs not even being examined.


The problem in more detail:

One of the last jobs we diagnosed (16 gpus needed) had an approximate start time 
that was beyond all end times of running/scheduled jobs: jobs ending on Feb 22 
would release more than enough gpus, so the job could have been scheduled 
immediately afterwards, but the start time was still Feb 23. Priority-wise the 
job had the highest priority of all pending jobs for the partition.

When we turned on scheduler debugging and increased log levels, we observed the 
following messages for this job:

JobId= being held, if allowed the job request will exceed QOS x group 
max tres(gres/gpu) limit yy with already used yy + requested 16

followed by

sched: JobId=2796696 delayed for accounting policy

So to us this meant that the scheduler was always hitting the qos limits, which 
makes sense because the usage is always very high in the gpu queue and thus the 
job wasn't scheduled ...

At first we were worried that this meant "held"/"delayed" jobs like this would 
never actually get scheduled when contention is high enough, e.g. small jobs 
getting backfilled in so that the qos limits stay at their maximum for a long time.

But, for some reason we could not determine, the job eventually got scheduled at 
one point and then ran at the scheduled start time.


Open Questions:
- Why couldn't it be scheduled in the first place? Initially we thought (from 
the source code I looked into) that "delayed for accounting policy" prevents 
further scheduling in general, but since the job was eventually scheduled this 
assumption must be wrong?
- Why was it scheduled at some point? When it was scheduled, contention was 
still high and the qos limits definitely still applied.
- How could we modify the current setup so that the scheduling of larger jobs 
becomes "better" and more reproducible/explainable?


Apart from all of this, I'm also asking myself whether there is maybe a better 
way to set up a system that works the way we want.


This got a bit long, but I hope it's clear enough :)


Kind regards,
Katrin





-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: [ext] Re: canonical way to run longer shell/bash interactive job (instead of srun inside of screen/tmux at front-end)?

2024-02-28 Thread Hagdorn, Magnus Karl Moritz via slurm-users
On Tue, 2024-02-27 at 08:21 -0800, Brian Andrus via slurm-users wrote:
> for us, we put a load balancer in front of the login nodes with
> session 
> affinity enabled. This makes them land on the same backend node each
> time.

Hi Brian,
that sounds interesting - how did you implement session affinity?
cheers
magnus

-- 
Magnus Hagdorn
Charité – Universitätsmedizin Berlin
Geschäftsbereich IT | Scientific Computing
 
Campus Charité Mitte
BALTIC - Invalidenstraße 120/121
10115 Berlin
 
magnus.hagd...@charite.de
https://www.charite.de
HPC Helpdesk: sc-hpc-helpd...@charite.de



-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com