[slurm-users] Re: Limit GPU depending on type

2024-06-13 Thread Gerhard Strangar via slurm-users
Gestió Servidors via slurm-users wrote:

> What I want is that users can use all of them, but simultaneously a user
> can only use one of the RTX3080s.

How about two partitions: one containing only the RTX3080s, with a QoS
that sets MaxTRESPerUser=gres/gpu=1, and another one containing all the
other GPUs without this QoS. Users then submit to both partitions.
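
A minimal sketch, assuming accounting/QoS support is enabled (node,
partition and QoS names below are made up):

  # QoS limiting each user to one GPU within the partition
  sacctmgr add qos rtx3080limit
  sacctmgr modify qos rtx3080limit set MaxTRESPerUser=gres/gpu=1

  # slurm.conf
  PartitionName=rtx3080 Nodes=gpu01 QOS=rtx3080limit State=UP
  PartitionName=gpu_rest Nodes=gpu[02-04] State=UP

  # users submit to both partitions at once
  sbatch -p rtx3080,gpu_rest --gres=gpu:1 job.sh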



[slurm-users] Avoiding fragmentation

2024-04-08 Thread Gerhard Strangar via slurm-users
Hi,

I'm trying to figure out how to deal with a mix of few- and many-CPU
jobs. By that I mean most jobs use 128 CPUs, but sometimes there are
jobs with only 16. As soon as such a 16-CPU job is running, the
scheduler splits the next 128-CPU jobs into 96+16 across two nodes
instead of assigning a full 128-CPU node to each. Is there a way for the
administrator to make the scheduler prefer full nodes?
The existence of pack_serial_at_end makes me believe there is not,
because that is basically what I need, except that my "serial" jobs use
16 CPUs instead of 1.
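
For reference, pack_serial_at_end is a SchedulerParameters option in
slurm.conf; something analogous for my 16-CPU jobs is what I'm after:

  # slurm.conf - place serial (1-CPU) jobs at the end of the available
  # nodes instead of best-fit, keeping the remaining nodes unfragmented
  SchedulerParameters=pack_serial_at_end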

Gerhard



[slurm-users] Re: Suggestions for Partition/QoS configuration

2024-04-04 Thread Gerhard Strangar via slurm-users
thomas.hartmann--- via slurm-users wrote:

> My idea was to basically have three partitions:
> 
> 1. PartitionName=short MaxTime=04:00:00 State=UP Nodes=node[01-99]  
> PriorityTier=100
> 2. PartitionName=long_safe MaxTime=14-00:00:00 State=UP Nodes=node[01-50] 
> PriorityTier=100
> 3. PartitionName=long_preempt MaxTime=14-00:00:00 State=UP Nodes=node[01-99] 
> PriorityTier=40 PreemptMode=requeue

I don't know why you would consider preemption when the jobs are short;
just wait for them to finish.

My first approach would be two partitions, both containing all nodes,
but with different QoSes assigned to them, so you can limit the short
jobs to a certain number of CPUs and likewise the long jobs - maybe 80%
of the cluster for each of them.
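
A minimal sketch of that, assuming 99 nodes with 128 CPUs each (names
and the 80% figure are placeholders):

  # 80% of 99*128 = 12672 CPUs is roughly 10138
  sacctmgr add qos short_qos
  sacctmgr modify qos short_qos set GrpTRES=cpu=10138
  sacctmgr add qos long_qos
  sacctmgr modify qos long_qos set GrpTRES=cpu=10138

  # slurm.conf
  PartitionName=short MaxTime=04:00:00 Nodes=node[01-99] QOS=short_qos
  PartitionName=long MaxTime=14-00:00:00 Nodes=node[01-99] QOS=long_qos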

Gerhard



[slurm-users] Re: "Optimal" slurm configuration

2024-02-26 Thread Gerhard Strangar via slurm-users
Max Grönke via slurm-users wrote:

> (b) introduce a "small" partition for the <4h jobs with higher priority but
> we're unsure whether this will block all the larger jobs from running...

Just limit the number of CPUs in that partition.
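
For example, via a partition QoS (names and the CPU count are
placeholders):

  sacctmgr add qos small_qos
  sacctmgr modify qos small_qos set GrpTRES=cpu=512

  # slurm.conf
  PartitionName=small MaxTime=04:00:00 PriorityTier=200 QOS=small_qos Nodes=node[01-99]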


Gerhard



[slurm-users] Memory used per node

2024-02-09 Thread Gerhard Strangar via slurm-users
Hello,

I'm wondering if there's a way to tell how much memory my job is using
per node. I'm doing

#SBATCH -n 256
srun solver inputfile

When I run sacct -o MaxVMSize, the result apparently is the maximum VSZ
of the single largest solver process, not the maximum of the sum of them
all (unlike when launching with mpirun instead of srun). When I run
sstat -o TRESUsageInMax, I get the memory summed up over all nodes in
use. Can I get the maximum VSZ per node?
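
For comparison, this is what I'm looking at (the job id is a
placeholder):

  sacct -j 12345 -o JobID,MaxVMSize,MaxVMSizeNode  # largest single task and where
  sstat -j 12345 -o TRESUsageInMax                 # summed over all nodes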


Gerhard



Re: [slurm-users] How to run one maintenance job on each node in the cluster

2023-12-23 Thread Gerhard Strangar
Jeffrey Tunison wrote:
> Is there a straightforward way to create a batch job that runs once on every 
> node in the cluster?

A wrapper around reboot configured as RebootProgram in slurm.conf?
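
A sketch, assuming the wrapper does the maintenance work before
rebooting (the path is made up):

  # slurm.conf
  RebootProgram=/usr/local/sbin/maintenance_then_reboot

  # then, on the controller:
  scontrol reboot ASAP nextstate=resume node[01-99]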



Re: [slurm-users] Reproducible irreproducible problem (timeout?)

2023-12-20 Thread Gerhard Strangar
Laurence Marks wrote:

> After some (irreproducible) time, often one of the three slow tasks hangs.
> A symptom is that if I try to ssh into the main node of the subtask (which
> is running 128 MPI tasks on the 4 nodes), I get "Authentication failed".

How about asking an admin to check why it hangs?



Re: [slurm-users] Slurm versions 23.02.6 and 22.05.10 are now available (CVE-2023-41914)

2023-10-13 Thread Gerhard Strangar
Tim Wickberg wrote:

> A number of race conditions have been identified within the 
> slurmd/slurmstepd processes that can lead to the user taking ownership 
> of an arbitrary file on the system.

Is it any different from CVE-2023-41915 in PMIx, or is it the same issue
under an additional number? Or did someone mistype the number? I
couldn't find any information on CVE-2023-41914.

Gerhard



Re: [slurm-users] Aborting a job from inside the prolog

2023-06-20 Thread Gerhard Strangar
Alexander Grund wrote:
> Although it may be better not to drain it, I'm a bit nervous with
> "exit 0", as it is very important that the job does not start/continue,
> i.e. the user code (sbatch script/srun) is never executed in that case.
> So I want to be sure that an `scancel` on the job in its prolog actually
> always prevents the job from running.

Just return the exit code of scancel, then. If scancel failed, the
prolog failed and the job gets requeued; if it succeeded, the job was
cancelled.
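
A minimal prolog sketch of that logic:

  #!/bin/sh
  # Cancel the job and propagate scancel's exit code as the prolog's:
  # scancel fails  -> prolog fails -> Slurm requeues the job;
  # scancel works  -> job is cancelled, user code never runs.
  scancel "$SLURM_JOB_ID"
  exit $?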



Re: [slurm-users] Aborting a job from inside the prolog

2023-06-19 Thread Gerhard Strangar
Alexander Grund wrote:

> Our first approach with `scancel $SLURM_JOB_ID; exit 1` doesn't seem to 
> work as the (sbatch) job still gets re-queued.

Try to exit with 0, because it's not your prolog that failed.



Re: [slurm-users] Does Slurm have any equivalent to LSF elim for generating dynamic node resources

2023-03-03 Thread Gerhard Strangar
Amir Ben Avi wrote:

> I have looked at the Slurm documentation, but didn't find any way to create
> resources dynamically (in a script) at the node level

Well, basically you could do something like
scontrol update NodeName=$HOSTNAME Gres=myres:367. What you don't get
is decaying resource reservations to compensate for the delay between
your elim replacement reporting a value and jobs actually starting.
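
A sketch of such an elim replacement, run periodically on each node
(the gres name and the probe command are made up):

  #!/bin/sh
  # measure the dynamic resource and push the value into Slurm
  VALUE=$(/usr/local/bin/probe_myres)   # hypothetical probe
  scontrol update NodeName="$(hostname -s)" Gres="myres:${VALUE}"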

Gerhard





Re: [slurm-users] Rolling reboot with at most N machines down simultaneously?

2022-08-03 Thread Gerhard Strangar
Phil Chiu wrote:

>- Individual slurm jobs which reboot nodes - With a for loop, I could
>submit a reboot job for each node. But I'm not sure how to limit this so at
>most N jobs are running simultaneously.

With a fake license called reboot?
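
A sketch, allowing at most 4 nodes to reboot at once (license name and
count are arbitrary):

  # slurm.conf
  Licenses=reboot:4

  # one job per node, each holding one of the four licenses
  for n in $(sinfo -hN -o %N | sort -u); do
      sbatch -w "$n" --licenses=reboot:1 --wrap 'sudo reboot'
  done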



Re: [slurm-users] Jobs fail on specific nodes.

2022-05-25 Thread Gerhard Strangar
Roger Mason wrote:

> I would appreciate any suggestions on what might be causing this problem
> or what I can do to diagnose it.

Run getent hosts node012 on all hosts to see which one can't resolve it.
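
For example, assuming pdsh is available:

  pdsh -a 'getent hosts node012 || echo CANNOT-RESOLVE'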



Re: [slurm-users] Sharing a GPU

2022-04-03 Thread Gerhard Strangar
Eric F. Alemany wrote:
> Another solution would be NVIDIA vGPU
> (Virtual GPU Manager software).
> You can share a GPU among VMs

You can really *share* one, not just delegate one GPU to one VM?



Re: [slurm-users] Limit partition to 1 job at a time

2022-03-23 Thread Gerhard Strangar
Russell Jones wrote:

> I suppose I am confused about how GrpJobs works. The manual shows:
> 
> The total number of jobs able to run at any given time from an association
> and its children QOS
> 
> 
> It is my understanding an association is cluster + account + user. Would
> this not just limit it to 1 job per user in the partition, not 1 job at a
> time total in the partition?

I'm using GrpTRES to limit the number of cores per partition - and
that's not per user. So I'm assuming that MaxJobs is per user while
GrpJobs is not.



Re: [slurm-users] Limit partition to 1 job at a time

2022-03-22 Thread Gerhard Strangar
Russell Jones wrote:

> I am struggling to figure out how to do this. Any tips?

Create a QoS with GrpJobs=1 and assign it to the partition?
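
Roughly (names are placeholders):

  sacctmgr add qos onejob
  sacctmgr modify qos onejob set GrpJobs=1

  # slurm.conf
  PartitionName=serial Nodes=node[01-99] QOS=onejob State=UP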



Re: [slurm-users] How to checkout a slurm node?

2021-11-12 Thread Gerhard Strangar
Joe Teumer wrote:

> However, if the user needs to reboot the node, set BIOS settings, etc.,
> then `salloc` automatically terminates the allocation when the new shell is
What kind of BIOS settings would a user need to change?



Re: [slurm-users] how to check what slurm is doing when job pending with reason=none?

2021-06-16 Thread Gerhard Strangar
taleinterve...@sjtu.edu.cn wrote:

> But after submission, this job still stays in PENDING state for about 30-60s,
> and during that time sacct shows the REASON is "None".

It's the default sched_interval=60 in your slurm.conf.
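
If the delay matters, the interval can be lowered (the value is just an
example):

  # slurm.conf - run the main scheduling loop every 10s instead of 60s
  SchedulerParameters=sched_interval=10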

Gerhard



[slurm-users] Draining hosts because of failing jobs

2021-05-04 Thread Gerhard Strangar
Hello,

how do you implement something like "drain host after 10 consecutive
failed jobs"? Unlike a host check script, which only catches known
errors, this would stop a single faulty node from killing job after job.
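
The closest I can imagine is a node epilog counting failures, roughly
like this (paths and threshold are made up, and I'm assuming
SLURM_JOB_EXIT_CODE is available in the epilog environment):

  #!/bin/sh
  COUNT_FILE=/var/spool/slurmd/fail_count
  THRESHOLD=10
  if [ "${SLURM_JOB_EXIT_CODE:-0}" -ne 0 ]; then
      COUNT=$(( $(cat "$COUNT_FILE" 2>/dev/null || echo 0) + 1 ))
  else
      COUNT=0
  fi
  echo "$COUNT" > "$COUNT_FILE"
  if [ "$COUNT" -ge "$THRESHOLD" ]; then
      scontrol update NodeName="$(hostname -s)" State=DRAIN \
          Reason="$THRESHOLD consecutive failed jobs"
  fi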

Gerhard



[slurm-users] scancel the solver, not MPI

2020-10-02 Thread Gerhard Strangar
Hi,

I'm wondering if it's possible to gracefully terminate a solver that is
using MPI. If srun starts the MPI processes for me, can it tell the
solver to terminate and then wait n seconds before it tells MPI to
terminate? Or is the only way to handle this using scancel -b and
trapping the signal?
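
The scancel -b variant I have in mind would look roughly like this
(signal choice, stop mechanism and timing are all assumptions):

  #!/bin/bash
  #SBATCH -n 256
  graceful() {
      touch stop.now   # hypothetical: the solver polls for a stop file
      sleep 60         # give it time to write its restart data
      scancel "$SLURM_JOB_ID"
  }
  trap graceful TERM
  srun solver inputfile &
  wait                 # a trapped signal interrupts wait

  # cancelled from outside with: scancel -b --signal=TERM <jobid>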



Re: [slurm-users] Limit nodes of a partition without managing users

2020-08-18 Thread Gerhard Strangar
Brian Andrus wrote:
> Most likely, but the specific approach depends on how you define what 
> you want.

My idea was "a high-prio job goes next unless there are too many of them".

> For example, what if there are no jobs in high pri queue but many in 
> low? Should all the low ones run?

Yes.

> What should happen if they get started 
> and use all the nodes and a high-pri request comes in (preemption policy)?

No preemption.

> What about the inverse of that?

The inverse of what? All nodes being used by high prio jobs? That's
exactly what I want to avoid.

> What if you get a steady stream of 
> high-pri jobs? How long should low-pri wait before being allowed to run?

As long as it takes. Since I'm trying to avoid high prio jobs consuming
all nodes, it won't take forever. :-)

> Does it matter if it is all the same user?

No.

> You can handle much of that type of interaction with job priorities and 
> a single queue. As you can see, the devil is in the details on how to 
> define/get what you want.

With a single partition, how do you make sure it doesn't end up running
high-prio jobs exclusively once there are enough of them?

Gerhard



Re: [slurm-users] [External] Limit nodes of a partition without managing users

2020-08-18 Thread Gerhard Strangar
Prentice Bisbal wrote:
>> I'm wondering if it's possible to have slurm 19 run two partitions (low
>> and high prio) that share all the nodes and limit the high prio
>> partition in number of nodes used simultaneously without requiring to
>> manage the users in the database.
> Yes, you can do this using Slurm's QOS facility

I don't think so. The documentation says that setting
"AccountingStorageEnforce=qos" is equivalent to
"AccountingStorageEnforce=qos,associations", and for "associations":
"This will prevent users from running jobs if their association is not
in the database."

Gerhard



[slurm-users] Limit nodes of a partition without managing users

2020-08-17 Thread Gerhard Strangar
Hello,

I'm wondering if it's possible to have Slurm 19 run two partitions (low
and high prio) that share all the nodes, and to limit the number of
nodes the high-prio partition uses simultaneously, without requiring the
users to be managed in the database.
Any ideas?

Regards,
Gerhard



Re: [slurm-users] Debugging communication problems

2020-08-06 Thread Gerhard Strangar
Gerhard Strangar wrote:

> I'm experiencing a connectivity problem and I'm out of ideas as to why
> it is happening. I'm running slurmctld on a multihomed host.
> 
> (10.9.8.0/24) - master - (10.11.12.0/24)
> There is no routing between these two subnets.

My topology.conf contained a loop, which resulted in incorrect message
forwarding.
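
To illustrate the kind of loop (switch and node names are made up, not
my real ones):

  # topology.conf - sw1 and sw2 list each other as children: a loop
  SwitchName=sw1 Switches=sw2
  SwitchName=sw2 Switches=sw1
  SwitchName=leaf Nodes=node[01-20]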

Gerhard



[slurm-users] Debugging communication problems

2020-08-04 Thread Gerhard Strangar
Hi,

I'm experiencing a connectivity problem and I'm out of ideas as to why
it is happening. I'm running slurmctld on a multihomed host.

(10.9.8.0/24) - master - (10.11.12.0/24)
There is no routing between these two subnets.

So far, all slurmds resided in the first subnet and worked fine. I added
some in the second subnet and they keep going into the DOWN state. I
checked the "last slurmd control message" and sometimes it's overdue by
20 minutes and more, with a configured slurmd timeout of 5 minutes. A
tcpdump showed that slurmctld isn't even trying to connect to those
slurmds at the time. I haven't found any packet loss yet, both redundant
DNS servers resolve the host names properly, and slurmctld just reports
a communication error for the ping request while the slurmds are running
and all hosts are idle.
What reasons can there be for not contacting the slurmds? Or is it more
likely that the reply gets lost on its way?

Gerhard