You can probably have a job_submit.lua script that looks at the --gpus flag
(and maybe the --gres=gpu:* flag as well) and forces a GPU type. A bit
complicated, and I am not sure if it will catch srun submissions. I don't think
this is flexible enough to ensure they get the least powerful GPU among all
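A minimal, untested sketch of the idea (the job_desc field names vary
across Slurm versions, and "v100" is a hypothetical GPU type):
function slurm_job_submit(job_desc, part_list, submit_uid)
   -- tres_per_node holds the --gres request on newer releases
   local gres = job_desc.tres_per_node
   if gres ~= nil and string.match(gres, "gpu$") then
      -- request names "gpu" with no type: append a specific type
      job_desc.tres_per_node = gres .. ":v100"
   end
   return slurm.SUCCESS
end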
Running scontrol/sinfo from within a job_submit.lua script seems to be
opening a big can of worms --- it might be doable, but it would scare me.
Since it sounds like you are only doing this for a fairly limited amount of
information, which presumably does not change frequently, perhaps it would
be b
sacctmgr add/delete user basically adds/deletes a Slurm association for
that user/cluster/account.
You need to add (an association for) the user for account B before you can
change their default account to B.
You do *not* need to delete (the association for) the user with account A
if not desired
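E.g., with a hypothetical user alice and accounts A and B as above:
sacctmgr add user name=alice account=B
sacctmgr modify user name=alice set defaultaccount=B
sacctmgr delete user name=alice account=A
(the last step being the optional one)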
Although all three cases ( "-N 1 --cpus-per-task 64 -n 1", "-N 1
--cpus-per-task 1 -n 64", and "-N 1 --cpus-per-task 32 -n 2") will cause
Slurm to allocate 64 cores to the job, there can (and will) be differences
in the other respects.
The variable SLURM_NTASKS will be set to the argument of the -
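For illustration, the three cases above (job.sh being a stand-in script):
sbatch -N 1 --cpus-per-task 64 -n 1 job.sh   # SLURM_NTASKS=1
sbatch -N 1 --cpus-per-task 1 -n 64 job.sh   # SLURM_NTASKS=64
sbatch -N 1 --cpus-per-task 32 -n 2 job.sh   # SLURM_NTASKS=2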
There are a lot of parameters controlling the resources which an account
and/or user can use. I suspect your default setup only uses a few of
them. To temporarily disable an account, you can apply a setting which is
disjoint from your normal methods.
E.g., if you normally set GrpTresMins to limi
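For instance, assuming you never otherwise use MaxSubmitJobs (account name
hypothetical):
sacctmgr modify account lockedacct set MaxSubmitJobs=0    # blocks new submissions
sacctmgr modify account lockedacct set MaxSubmitJobs=-1   # clears it again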
I believe this behavior is intended, as various properties of the jobs that
were requeued no longer match the properties of the rest of the job array.
It might not show on the minimal output you are displaying, but I suspect
the jobs were requeued at different times and so the priorities of the jobs
On our clusters, we typically find that an explicit source of the
initialization dot files is needed IF the default shell of
the user submitting the job does _not_ match the shell being used to run
the script. I.e., for sundry historical and other reasons,
the "default" login shell for users on our
On Fri, Oct 30, 2020 at 5:37 AM Loris Bennett
wrote:
> Hi Zacarias,
>
> Zacarias Benta writes:
>
> > Good morning everyone.
> >
> > I'm having a "issue", I don't know if it is a "bug or a feature".
> > I've created a QOS: "sacctmgr add qos myqos set GrpTRESMins=cpu=10
> > flags=NoDecay". I know
We use a scavenger partition, and although we do not have the policy you
describe, it could be used in your case.
Assume you have 6 nodes (node-[0-5]) and two groups A and B.
Create partitions
partA = node-[0-2]
partB = node-[3-5]
all = node-[0-5]
Create QoSes normal and scavenger.
Allow normal Q
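A slurm.conf sketch of that layout (the PriorityTier value is an
assumption; preemption settings omitted):
PartitionName=partA Nodes=node-[0-2] AllowQos=normal
PartitionName=partB Nodes=node-[3-5] AllowQos=normal
PartitionName=all Nodes=node-[0-5] AllowQos=scavenger PriorityTier=0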
I am not familiar with using Slurm with VMs, but do note that Slurm can
behave a bit "unexpectedly" with memory constraints due to the memory
consumed by OS, etc.
E.g., if I had a 16 core machine with 64 GB of RAM and requested 16 cores
with 4 GB/core, it would not fit on this machine because some
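The arithmetic, with hypothetical numbers: 16 cores x 4096 MB/core =
65536 MB requested, but Slurm might only see, say, RealMemory=63000 MB
once OS overhead is subtracted, so the job can never start there. To check
what Slurm sees for a node:
scontrol show node node-0 | grep RealMemory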
I have not had a chance to look at your code, but find it intriguing,
although I am not sure about use cases. Do you do anything to lock out
other jobs from the affected node?
E.g., you submit a job with unsatisfiable constraint foo.
The tool scanning the cluster detects a job queued with foo cons
We usually set up a reservation for maintenance. This prevents jobs
from starting if they are not expected to end before the reservation
(maintenance) starts.
As Paul indicated, this causes nodes to become idle (and pending job queue
to grow) as maintenance time approaches, but avoids requiring
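A sketch of such a reservation (start time and duration hypothetical):
scontrol create reservation reservationname=maint starttime=2020-12-01T08:00:00 duration=08:00:00 users=root flags=maint,ignore_jobs nodes=ALL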
I was never able to figure out how to use the Perl API shipped with Slurm,
but instead have written some wrappers around some of the Slurm commands
for Perl. My wrappers for the sacctmgr and share commands are available at
CPAN:
https://metacpan.org/release/Slurm-Sacctmgr
https://metacpan.org/rele
While I agree containers can be quite useful in HPC environments for
dealing with applications requiring
different library versions, there are limitations. In particular, the
kernel inside the container is the same
as running outside the container. Where this seems to be most
problematic is when
You can use squeue to see the priority of jobs. I believe it normally
shows jobs in order of priority, even though it does not display priority. If
you want to see actual priority, you need to request it in the format
field. I typically use
squeue -o "%.18i %.12a %.6P %.8u %.2t %.8m %.4D %.4C %12l
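If you just want the numeric priority, the %Q format field prints it, and
sprio shows the per-factor breakdown:
squeue --sort=-p -o "%.18i %.9P %.8u %Q"
sprio -l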
The dual QoSes (or the dual-partition solution suggested by someone else)
should both work to allow select users to submit jobs with longer run
times. We use something like that on our cluster (though I confess it was
our first Slurm cluster and we might have overdone it with QoSes, causing
the scheduler to
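A sketch of the QoS variant (names and the 14-day limit are hypothetical):
sacctmgr add qos long set MaxWall=14-00:00:00
sacctmgr modify user name=alice set qos+=long
sbatch --qos=long --time=10-00:00:00 job.sh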
As partition CLUSTER is not in your /etc/slurm/parts file, it likely was
added via the scontrol command.
Presumably you or a colleague created a CLUSTER partition, whether
intentionally or not.
Use
scontrol show partition CLUSTER
to view it.
On Wed, Mar 27, 2019 at 1:44 PM Mahmood Naderan
wrote:
>
From the sacctmgr man page:
"To clear a previously set value use the modify command with a new value of
-1 for each TRES id."
So something like
# sacctmgr modify user ghatee set GrpTRES=mem=-1
Similarly for other TRES settings.
On Wed, Mar 27, 2019 at 1:44 PM Mahmood Naderan
wrote:
> Hi,
> I want to
application's environment, rather than an issue
> with the Python-Slurm interaction. The main piece of evidence that this
> might be a bug in Slurm is that this issue started after the upgrade from
> 18.08.5-2 to 18.08.6-2, but correlation doesn't necessarily mean causation.
>
> Pre
Assuming the GUI-produced script is as you indicated (I am not sure where
you got the script you showed, but if it is not the actual script used by a
job, it might be worthwhile to examine the Command= file from scontrol show
job to verify), then the only thing that should be different from a GUI
su
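E.g., to find the actual script file (job id hypothetical):
scontrol show job 12345 | grep -i command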
Are you using the priority/multifactor plugin? What are the values of the
various Priority* weight factors?
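You can dump the settings in effect with:
scontrol show config | grep ^Priority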
On Tue, Mar 12, 2019 at 12:42 PM Sean Brisbane
wrote:
> Hi,
>
> Thanks for your help.
>
> Either setting qos or setting priority doesn't work for me. However I
> have found the cause if
My understanding is that with PreemptMode=requeue, the running scavenger
job processes on the node will be killed, but the job will be placed back
in the queue (assuming the job's specific parameters allow this; a job can
have a --no-requeue flag set, in which case I assume it behaves the same as
I do know it isn't
> attempting to start nodes to satisfy the larger job.
>
> > JobId=2210784 delayed for accounting policy is likely the key as it
> indicates the job is currently unable to run, so the lower priority smaller
> job bumps ahead of it.
>
> Yeah, that's e
The "JobId=2210784 delayed for accounting policy is likely the key as it
indicates the job is currently unable to run, so the lower priority smaller
job bumps ahead of it.
You have not provided enough information (cluster configuration, job
information, etc) to diagnose what accounting policy is be
Generally, the add, modify, etc. sacctmgr
commands want a "user" or "account" entity, but can modify associations
through this.
E.g., if user baduser should have GrpTRESmin of cpu=1000 set on partition
special, use something like
sacctmgr add user name=baduser partition=special account=testacct
grptresmins=cpu=1000
Slurm accounting is based on the notion of "associations". An association
is a set of cluster, partition, allocation account, and user. I think most
sites do the accounting so that it is a single limit applied to all
partitions, etc. but you can use sacctmgr to apply limits at any
association lev
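You can inspect the existing associations with something like (user name
hypothetical):
sacctmgr show associations where user=alice format=cluster,account,user,partition,grptresmins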
We make use of a large home-grown library of Perl scripts for
creating allocations, creating users, adding users to allocations, etc.
We have a number of "flavors" of allocations, but most allocation
creation/disabling activity occurs with respect to applications for
allocations which are re
A couple of comments/possible suggestions.
First, it looks to me like all the jobs are run from the same directory
with the same input/output files. Or am I missing something?
Also, what MPI library is being used?
I would suggest verifying if any of the jobs in question are terminating
normally. I.e.
Assuming you plan for users to use R in jobs, it will need to be accessible
to the execute/compute nodes.
I would usually suggest installing it on a shared drive, although it should
be OK if locally installed on each compute
node (you probably want it at the same exact path and with the same R
packages installed). Presumabl
I don't believe that is possible.
The #SBATCH lines are comments to the shell, so it does not do any variable
expansion there.
To my knowledge, Slurm does not do any variable expansion in the parameters
either.
If you really needed that sort of functionality, you would probably need to
have someth
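To illustrate the limitation and a workaround (paths hypothetical):
#SBATCH --output=$HOME/out.txt
# the line above uses the literal string "$HOME"; no expansion happens.
# Workaround: pass the option on the command line, where the shell expands it:
sbatch --output="$HOME/out.txt" job.sh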
I have not had a chance to play with the newest Slurm, but I would suggest
looking at GrpTRESRaw, which is supposed to gather the usage by TRES (in
TRES-minutes).
So if there is a billing TRES in GrpTRESRaw, that might be what you want.
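Something like this might show it, if GrpTRESRaw is available as an sshare
format field on your version:
sshare -l --format=Account,User,GrpTRESRaw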
On Fri, Apr 27, 2018 at 11:21 AM, Roberts, John E.
wrote:
>