Re: [slurm-users] Need help with running multiple instances/executions of a batch script in parallel (with NVIDIA HGX A100 GPU as a Gres)

2024-01-18 Thread Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)
On Jan 18, 2024, at 7:31 AM, Matthias Loose  wrote:

Hi Hafedh,

I'm no expert in the GPU side of SLURM, but looking at your current configuration, 
to me it's working as intended at the moment. You have defined 4 GPUs and start 
multiple jobs that each consume all 4 GPUs. So the jobs wait for the resources 
to be free again.

I think what you need to look into is the MPS plugin, which seems to do what 
you are trying to achieve:
https://slurm.schedmd.com/gres.html#MPS_Management

I agree with the first paragraph.  How many GPUs are you expecting each job to 
use? I'd have assumed, based on the original text, that each job is supposed to 
use 1 GPU, and the 4 jobs were supposed to be running side-by-side on the one 
node you have (with 4 GPUs).  If so, you need to tell each job to request only 
1 GPU, and currently each one is requesting 4.
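
For example, a minimal per-job script along these lines (the job name, CPU count, and 
executable are placeholders, not from the original post) would let four such jobs pack 
onto the one 4-GPU node:

#!/bin/bash
#SBATCH --job-name=one_gpu_job    # placeholder name
#SBATCH --gres=gpu:1              # request 1 of the node's 4 GPUs
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8         # placeholder; size to your per-job CPU needs

./my_gpu_program                  # placeholder for the real executable

Assuming the node really is defined with Gres=gpu:4 and has enough CPUs and memory, 
Slurm should then run the four 1-GPU jobs side by side instead of serializing them.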

If your jobs are actually supposed to be using 4 GPUs each, I still don't see 
any advantage to MPS (at least for my usual GPU usage pattern): all the 
jobs will take longer to finish, because they are sharing the fixed resource. 
If they take turns, at least the first ones finish as fast as they can, and the 
last one will finish no later than it would have if they were all time-sharing 
the GPUs.  I guess NVIDIA had something in mind when they developed MPS, so I 
guess our pattern may not be typical (or at least not universal), and in that 
case the MPS plugin may well be what you need.
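
For completeness, here is roughly what the MPS route looks like on the Slurm side, going 
by the page linked above (node name and share counts are made up for illustration; see 
that page for the authoritative configuration):

# slurm.conf (fragment)
GresTypes=gpu,mps
NodeName=gpu-node01 Gres=gpu:4,mps:400    # 100 MPS shares per GPU

# gres.conf on the node (fragment)
Name=gpu File=/dev/nvidia[0-3]
Name=mps Count=400

# a job then requests a fraction of a single GPU, e.g. half of one:
sbatch --gres=mps:50 job.sh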


Re: [slurm-users] [External] Re: Troubleshooting job stuck in Pending state

2023-12-12 Thread Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)
Presumably what's in the squeue Reason column isn't enough? It's not 
particularly informative, although it does distinguish "Resources" from 
"Priority", for example, and it'll also list various partition limits, e.g.




Re: [slurm-users] Verifying preemption WON'T happen

2023-09-29 Thread Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)
On Sep 29, 2023, at 2:51 PM, Davide DelVento <davide.quan...@gmail.com> wrote:

I don't really have an answer for you other than a "hallway comment", that it 
sounds like a good thing which I would test with a simulator, if I had one. 
I've been intrigued by (but really not looked much into) 
https://slurm.schedmd.com/SLUG23/LANL-Batsim-SLUG23.pdf

On Fri, Sep 29, 2023 at 10:05 AM Groner, Rob <rug...@psu.edu> wrote:

I could obviously let the test run for an hour to verify the lower priority job 
was never preempted...but that's not really feasible.

Why not? Isn't it going to take longer than an hour to wait for responses to 
this post? Also, you could set the minimum time to a much smaller value, so 
it won't take as long to test.
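
In the meantime, the cluster's effective preemption settings can at least be inspected 
directly; this doesn't prove preemption can't happen, but it's a quick sanity check:

scontrol show config | grep -i preempt
# e.g. PreemptMode, PreemptType, PreemptExemptTime
scontrol show partition | grep -iE 'PartitionName|PreemptMode'
# per-partition PreemptMode overrides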


Re: [slurm-users] Weirdness with partitions

2023-09-21 Thread Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)
On Sep 21, 2023, at 11:37 AM, Feng Zhang <prod.f...@gmail.com> wrote:

Setting the slurm.conf parameter EnforcePartLimits=ANY or NO may help with this; not sure.

Hmm, interesting, but it looks like this is just a check at submission time. 
The slurm.conf web page doesn't indicate that it affects the actual queuing 
decision, just whether or not a job that will never run (at all, or just on 
some of the listed partitions) can be submitted.  If it does help then I think 
that the slurm.conf description is misleading.

Noam


Re: [slurm-users] Weirdness with partitions

2023-09-21 Thread Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)
On Sep 21, 2023, at 9:46 AM, David <dr...@umich.edu> wrote:

Slurm is working as it should. Your own examples prove that: by not submitting 
to b4, the job works. However, looking at man sbatch:

   -p, --partition=
          Request a specific partition for the resource allocation. If not specified,
          the default behavior is to allow the slurm controller to select the default
          partition as designated by the system administrator. If the job can use
          more than one partition, specify their names in a comma separate list and
          the one offering earliest initiation will be used with no regard given to
          the partition name ordering (although higher priority partitions will be
          considered first). When the job is initiated, the name of the partition
          used will be placed first in the job record partition string.

In your example, the job can NOT use more than one partition (given the 
restrictions defined on the partition itself, which preclude certain accounts from 
using it). This seems to me like either a user education issue (i.e. don't 
have them submit to every partition), or you can try the job submit lua route, 
or perhaps the hidden partition route (which I've not tested).

That's not at all how I interpreted this man page description.  By "If the job 
can use more than..." I thought it was completely obvious (although perhaps 
wrong, if your interpretation is correct, but it never crossed my mind) that it 
referred to whether the _submitting user_ is OK with the job using more than one 
partition. The partition where the user is forbidden (because of the 
partition's allowed accounts) should simply never be the one offering the earliest 
initiation (since the job will never initiate there), and therefore the job should 
not run there, but it should still be able to run on the other partitions listed 
in the batch script.
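
In other words, my reading was that a submission like this (partition names other than 
b4 are made up):

#!/bin/bash
#SBATCH --partition=standard,bigmem,b4   # b4 is the one this account can't use
#SBATCH --ntasks=32
srun ./my_program                        # placeholder

should simply never start on b4 and should run on whichever of the other listed 
partitions offers the earliest start, rather than being rejected outright.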

I think it's completely counter-intuitive that submitting with a list of acceptable 
partitions, one of which happens to be forbidden to the submitting user, means the 
job won't run at all.  What if you list multiple partitions and increase the number 
of nodes so that there aren't enough in one of the partitions, without realizing 
the problem?  Would you expect that to prevent the job from ever running on any 
partition?

Noam


Re: [slurm-users] Guarantee minimum amount of GPU resources to a Slurm account

2023-09-12 Thread Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)
Is this what you want?
Magnetic Reservations

The default behavior for reservations is that jobs must request a reservation 
in order to run in it. The MAGNETIC flag allows you to create a reservation 
that will allow jobs to run in it without requiring that they specify the name 
of the reservation. The reservation will only "attract" jobs that meet the 
access control requirements.

(from https://slurm.schedmd.com/reservations.html)
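
A minimal sketch of such a reservation (the account, node names, and size are 
placeholders, not from Stephan's setup):

scontrol create reservation ReservationName=projA_gpus \
    Accounts=projA Flags=MAGNETIC \
    StartTime=now Duration=UNLIMITED \
    Nodes=gpu-node[01-02]

Jobs from account projA that satisfy the reservation's access controls get "attracted" 
into it without naming it, so those GPUs stay guaranteed to that account, while its 
other jobs still compete for the rest of the cluster as usual.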

On Sep 12, 2023, at 10:14 AM, Stephan Roth <stephan.r...@ee.ethz.ch> wrote:

Dear Slurm users,

I'm looking to fulfill the requirement of guaranteeing availability of GPU 
resources to a Slurm account, while allowing this account to use other 
available GPU resources as well.


Re: [slurm-users] What is the minimal configuration for a compute node

2023-08-24 Thread Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)
 it can be quite tedious for more complex configurations with multi-level 
includes

It's either identical or configless, as far as I know, too. What about changing 
your subdirectories to filenames (e.g. slurm/partitions/bob.conf -> 
slurm.partitions.bob.conf) and then going configless, or just "cp -r" or "scp 
-r" (depending on how you can get into the compute node directory structure, 
e.g. vnfs or something) the entire config directory tree?
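
e.g., something along these lines (paths and node name are only illustrative):

# push the flattened config to a node over ssh:
scp -r /etc/slurm node01:/etc/
# or copy it into a shared node image / chroot (vnfs-style):
cp -r /etc/slurm /path/to/node-image/etc/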



Re: [slurm-users] Is there any public scientific-workflow example that can be run through Slurm?

2023-08-18 Thread Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)
I'm the lead developer of another workflow system, wfl 
(github.com/libAtoms/workflow), which works with slurm using an abstraction 
layer we also developed, ExPyRe (github.com/libAtoms/Expyre). In writing 
a recent paper about it we looked at other systems, and the ones we know of 
that use queuing systems include:
ASR atomic simulation recipes (which uses MyQueue)
Atomate/Fireworks
PyIron/Pysqua
AiiDA
icolos (https://github.com/MolecularAI/Icolos)
qmpy (part of OQMD)

Note that I'm not promising that they currently support slurm, but it's a list 
to start your research from.

Noam



[slurm-users] job not running if partition MaxCPUsPerNode < actual max

2023-08-15 Thread Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)
We have a heterogeneous mix of nodes, most 32-core but one group of 36-core, 
grouped into homogeneous partitions.  We like to be able to specify multiple 
partitions so that a job can run on any homogeneous group.  It would be nice if 
we could run on all such nodes using 32 cores per node.  To try to do this, I 
created a partition for the 36-core nodes (call them n2019) which specifies a 
max CPU count of 64:
PartitionName=n2019    DefMemPerCPU=2631 Nodes=compute-4-[0-47]
PartitionName=n2019_32 DefMemPerCPU=2631 Nodes=compute-4-[0-47] MaxCPUsPerNode=64
PartitionName=n2021    DefMemPerCPU=2960 Nodes=compute-7-[0-18]

However, if I try to run a 128-task, 1-task-per-core job on n2019_32, the 
sbatch fails with
> sbatch  --ntasks=128 --exclusive --partition=n2019_32  --ntasks-per-core=1 
> job.pbs
sbatch: error: Batch job submission failed: Requested node configuration is not 
available
(please ignore the ".pbs" - it's a relic, and the job script works with slurm). 
The identical command but with "n2019" or "n2021" for the partition works 
(although the former uses 36 cores per node). If I specify multiple partitions, 
the job only actually runs when nodes outside n2019 (which has the same node set 
as n2019_32) are available.

The job header includes only walltime, job name and stdout/stderr files, shell, 
and a job array range.

I tried to add "-v" to the sbatch to see if that gives more useful info, but I 
couldn't get any more insight.  Does anyone have any idea why it's rejecting my 
job?

thanks,
Noam


Re: [slurm-users] Decreasing time limit of running jobs (notification)

2023-07-06 Thread Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)
Given that the usual way to kill a running job is to use scancel, I would tend 
to agree that killing a job by shortening its walltime to below the time already 
used is likely to be an error, and deserves a warning.
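
A quick pre-check of elapsed time versus the intended new limit is easy enough from the 
command line (the job ID is a placeholder):

squeue -j <jobid> -o "%.12i %.12M %.12l"
# %M = time used so far, %l = current time limit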


Re: [slurm-users] Decreasing time limit of running jobs (notification)

2023-07-06 Thread Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)
Is the issue that the error in the time made it shorter than the time the job 
had already run, so it killed it immediately?

On Jul 6, 2023, at 12:04 PM, Jason Simms <jsim...@swarthmore.edu> wrote:

No, not a bug, I would say. When the time limit is reached, that's it, job 
dies. I wouldn't be aware of a way to manage that. Once the time limit is 
reached, it wouldn't be a hard limit if you then had to notify the user and 
then... what? How long would you give them to extend the time? Wouldn't be much 
of a limit if a job can be extended, plus that would throw off the 
scheduler/estimator. I'd chalk it up to an unfortunate typo.

Jason

On Thu, Jul 6, 2023 at 11:54 AM Amjad Syed <amjad...@gmail.com> wrote:
Hello

We were trying to increase the time limit of a slurm running job

scontrol update job= TimeLimit=16-00:00:00

But we accidentally got it to 16 hours

scontrol update job= TimeLimit=16:00:00

This actually timed out and killed the running job, and did not give any 
notification.

Is this a bug? Should not the user be warned that this job will be killed?

Amjad



--
Jason L. Simms, Ph.D., M.P.H.
Manager of Research Computing
Swarthmore College
Information Technology Services
(610) 328-8102
Schedule a meeting: https://calendly.com/jlsimms



Re: [slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-23 Thread Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)
On Feb 23, 2023, at 7:40 AM, Loris Bennett <loris.benn...@fu-berlin.de> wrote:

Hi David,

David Laehnemann <david.laehnem...@hhu.de> writes:

 by a
workflow management system?

I am probably being a bit naive, but I would have thought that the batch
system should just be able to start your jobs when resources become
available.  Why do you need to check the status of jobs?  I would tend
to think that it is not something users should be doing.

"workflow management system" generally means some other piece of software that 
submits jobs as needed to complete some task.  It might need to know how 
current jobs are doing (running yet, completed, etc) to decide what to submit 
next. I assume that's the use case here.
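
For what it's worth, the two usual ways such a tool polls job state (the job ID is a 
placeholder) are:

# ask the controller directly (fast, but every call hits slurmctld):
scontrol show job <jobid> | grep -o 'JobState=[A-Z_]*'
# or ask the accounting database (needs slurmdbd, but offloads slurmctld
# and still works after the job has left the queue):
sacct -j <jobid> --format=JobID,State,Elapsed --noheader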


Re: [slurm-users] Detecting non-MPI jobs running on multiple nodes

2022-09-29 Thread Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)
On Sep 29, 2022, at 10:34 AM, Steffen Grunewald <steffen.grunew...@aei.mpg.de> wrote:

Hi Noam,

I'm wondering why one would want to know that - given that there are
approaches to multi-node operation beyond MPI (Charm++ comes to mind)?

The thread title requested a way of detecting non-MPI jobs running on multiple 
nodes.  I assumed that the requester knows, maybe based on their users' 
software, that there are no legitimate ways for them to run on multiple nodes 
without MPI. Actually, we have users who run embarrassingly parallel jobs 
which just ssh to the other nodes and gather files, so clearly it can be done 
in a useful way with very low-tech approaches, but that's an oddball (and just 
plain old) software package.


Re: [slurm-users] Detecting non-MPI jobs running on multiple nodes

2022-09-29 Thread Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)
Can you check slurm for a job that requests multiple nodes but doesn't have 
mpirun (or srun, or mpiexec) running on its head node?
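
A rough sketch of that check (the list of "legitimate" launcher names, and password-less 
ssh to the nodes, are assumptions):

# running jobs that span more than one node, with their batch host
squeue -t RUNNING -h -o "%i %D %B" | awk '$2 > 1' | \
while read jobid nnodes batchhost; do
    # is any MPI-ish launcher running on the job's head node?
    if ! ssh "$batchhost" "pgrep -f 'mpirun|mpiexec|srun'" > /dev/null; then
        echo "job $jobid ($nnodes nodes): no MPI launcher on $batchhost"
    fi
done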


[slurm-users] admin users without a database

2022-09-19 Thread Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)
Is it possible to make a user an admin without slurmdbd? The docs I've found 
indicate that I need to set the user's admin level with sacctmgr, but that 
command always says
You are not running a supported accounting_storage plugin
Only 'accounting_storage/slurmdbd' is supported.

I don't especially want any accounting, just making one user an admin.

Noam


Re: [slurm-users] Strange behaviour with dynamically linked binary in batch job

2022-03-30 Thread Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)
One possibility is that something about the environment in the running batch 
job is making the "module load" commands fail, which they can do without any 
error (for old fashioned tcl-based env modules).  Do "module list" after, and 
echo $LD_LIBRARY_PATH, to confirm that it really is being set correctly in the 
batch job.
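
i.e., something like this near the top of the batch script (the module and binary names 
are placeholders):

#!/bin/bash
module load some/runtime            # placeholder for whatever provides the library
module list 2>&1                    # old tcl env-modules print to stderr
echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"
ldd ./my_binary                     # shows which shared libraries actually resolve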


Re: [slurm-users] Prevent users from updating their jobs

2021-12-16 Thread Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)
Is there a meaningful difference between using "scontrol update" and just 
killing the job and resubmitting with those resources already requested?


Re: [slurm-users] A Slurm topological scheduling question

2021-12-07 Thread Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)
You can schedule jobs across the two racks, with any given job only using one 
rack, by specifying
#SBATCH --partition rack1,rack2
It'll only use 1 partition, chosen in order of priority (not the order listed).
I never found a way to get topology to do that - all I could get it to do was 
prefer to keep things within a single switch, but not require it.
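
For reference, a generic topology/tree setup of the kind being discussed looks roughly 
like this (switch and node names are made up), and in my experience it only expresses 
that preference, not a hard requirement:

# slurm.conf
TopologyPlugin=topology/tree

# topology.conf
SwitchName=rack1 Nodes=compute-1-[0-31]
SwitchName=rack2 Nodes=compute-2-[0-31]
SwitchName=top   Switches=rack1,rack2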

Noam


[slurm-users] job not running because of "Resources", but resources are available

2021-03-19 Thread Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)
Can anyone explain why job 1908239 is not running, or what else I can check?  
squeue says "Resources", and start time is always right now, no matter when I 
run "squeue --start", but the resources are available according to "sinfo ... 
state=idle".  It's only a 1 minute job, so it's not because the nodes won't be 
available for long enough to be backfilled.

slurm version is admittedly a bit old, 19.05.7


> squeue -p n2019 --state=PD -l
Fri Mar 19 20:09:17 2021
  JOBID PARTITION     NAME     USER    STATE  TIME   TIME_LIMI  NODES NODELIST(REASON)
1908239 n2019     LiCu_SPA bernstei  PENDING  0:00        1:00      1 (Resources)
1908236 n2019     cspbbr3-  jllyons  PENDING  0:00  2-16:00:00      2 (Priority)
1908227 n2019     Cy3_dupl    yckim  PENDING  0:00 33-08:00:00      4 (Priority)
1908231 n2019,n20 sGC_Fe_N bernstei  PENDING  0:00  7-00:00:00      4 (JobHeldUser)
1908238 n2019     LiCu_SPA bernstei  PENDING  0:00     1:00:00      1 (JobHeldUser)

> squeue -j 1908239 --start
  JOBID PARTITION     NAME     USER ST          START_TIME NODES SCHEDNODES        NODELIST(REASON)
1908239 n2019     LiCu_SPA bernstei PD 2021-03-19T20:09:17     1 compute-4-[18-19] (Resources)

> sinfo -p n2019 state=idle
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
n2019        up   infinite     43  alloc compute-4-[0-11,13-17,20-26,28-39,41-47]
n2019        up   infinite      5   idle compute-4-[12,18-19,27,40]


Re: [slurm-users] Job Step Output Delay

2021-02-11 Thread Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)
Could be this quote from the srun man page:
-u, --unbuffered
By default the connection between slurmstepd and the user launched application 
is over a pipe. The stdio output written by the application is buffered by the 
glibc until it is flushed or the output is set as unbuffered. See setbuf(3). If 
this option is specified the tasks are executed with a pseudo terminal so that 
the application output is unbuffered. This option applies to step allocations.
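
So one thing worth trying (a guess based on that man page text, not a confirmed fix) is 
forcing unbuffered output in the step:

# withsteps.sh, modified
#!/bin/bash
srun -u ./loop.sh          # pseudo-terminal, so glibc doesn't block-buffer stdout
# or keep srun as-is and line-buffer at the application level:
# srun stdbuf -oL ./loop.sh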

On Feb 10, 2021, at 7:14 PM, Maria Semple <ma...@rstudio.com> wrote:

The larger cluster is using NFS. I can see how that could be related to the 
difference of behaviours between the clusters.

 The buffering behaviour is the same if I tail the file from the node running 
the job. The only thing that seems to change the behaviour is whether I use 
srun to create a job step or not.

On Wed, Feb 10, 2021 at 4:09 PM Aaron Jackson <aaron.jack...@nottingham.ac.uk> wrote:
Is it being written to NFS? You say on your local dev cluster it's a
single node. Is it also the login node as well as compute? In that case
I guess there is no NFS. Larger cluster will be using some sort of
shared storage, so whichever shared file system you are using likely has
caching.

If you are able to connect directly to the node which is running the
job, you can try tailing from there. It'll likely update immediately if
what I said above is the case.

Cheers,
Aaron


On  9 February 2021 at 23:47 GMT, Maria Semple wrote:

> Hello all,
>
> I've noticed an odd behaviour with job steps in some Slurm environments.
> When a script is launched directly as a job, the output is written to file
> immediately. When the script is launched as a step in a job, output is
> written in ~30 second chunks. This doesn't happen in all Slurm
> environments, but if it happens in one, it seems to always happen. For
> example, on my local development cluster, which is a single node on Ubuntu
> 18, I don't experience this. On a large Centos 7 based cluster, I do.
>
> Below is a simple reproducible example:
>
> loop.sh:
> #!/bin/bash
> for i in {1..100}
> do
>echo $i
>sleep 1
> done
>
> withsteps.sh:
> #!/bin/bash
> srun ./loop.sh
>
> Then from the command line running sbatch loop.sh followed by tail -f
> slurm-.out prints the job output in smaller chunks, which appears to
> be related to file system buffering or the time it takes for the tail
> process to notice that the file has updated. Running cat on the file every
> second shows that the output is in the file immediately after it is emitted
> by the script.
>
> If you run sbatch withsteps.sh instead, tail-ing or repeatedly cat-ing the
> output file will show that the job output is written in a chunk of 30 - 35
> lines.
>
> I'm hoping this is something that is possible to work around, potentially
> related to an OS setting, the way Slurm was compiled, or a Slurm setting.


--
Research Fellow
School of Computer Science
University of Nottingham



This message and any attachment are intended solely for the addressee
and may contain confidential information. If you have received this
message in error, please contact the sender and delete the email and
attachment.

Any views or opinions expressed by the author of this email do not
necessarily reflect the views of the University of Nottingham. Email
communications with the University of Nottingham may be monitored
where permitted by law.







--
Thanks,
Maria




Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628  F +1 202 404 7546
https://www.nrl.navy.mil