Re: [slurm-users] Mixing GPU Types on Same Node

2023-03-29 Thread Thomas M. Payerle
You can probably have a job submit lua script that looks at the --gpus flag (and maybe the --gres=gpu:* flag as well) and forces a GPU type. A bit complicated, and not sure if it will catch srun submissions. I don't think this is flexible enough to ensure they get the least powerful GPU among all

Re: [slurm-users] Can sinfo/scontrol be called from job_submit.lua?

2022-10-11 Thread Thomas M. Payerle
Running scontrol/sinfo from within a job_submit.lua script seems to be opening a big can of worms --- it might be doable, but it would scare me. Since it sounds like you are only doing such for a fairly limited amount of information which presumably does not change frequently, perhaps it would be b

Re: [slurm-users] Changing a user's default account

2022-08-05 Thread Thomas M. Payerle
sacctmgr add/delete user basically adds/deletes a Slurm association for that user/cluster/account. You need to add (an association for) the user for account B before you can change their default account to B. You do *not* need to delete (the association for) the user with account A if not desired
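The sequence above can be sketched as follows; the user and account names (alice, acctA, acctB) are placeholders, and this assumes a standard slurmdbd setup:

```shell
# Add an association for account B first; only then can B become the default.
sacctmgr add user name=alice account=acctB
sacctmgr modify user name=alice set DefaultAccount=acctB
# Optional: drop the old association with account A if it is no longer wanted.
sacctmgr remove user name=alice account=acctA
```

Each command prompts for confirmation unless run with the -i (immediate) flag.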

Re: [slurm-users] Question about sbatch options: -n, and --cpus-per-task

2022-03-24 Thread Thomas M. Payerle
Although all three cases ( "-N 1 --cpus-per-task 64 -n 1", "-N 1 --cpus-per-task 1 -n 64", and "-N 1 --cpus-per-task 32 -n 2") will cause Slurm to allocate 64 cores to the job, there can (and will) be differences in the other respects. The variable SLURM_NTASKS will be set to the argument of the -
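A sketch of the three equivalent-looking submissions (job.sh is a placeholder batch script); the comments note the main environment differences:

```shell
# All three allocate 64 cores on one node, but the task layout differs:
sbatch -N 1 -n 1  --cpus-per-task=64 job.sh   # 1 task,  SLURM_CPUS_PER_TASK=64
sbatch -N 1 -n 64 --cpus-per-task=1  job.sh   # 64 tasks, SLURM_NTASKS=64
sbatch -N 1 -n 2  --cpus-per-task=32 job.sh   # 2 tasks of 32 CPUs each
# Inside job.sh, "srun ./prog" would launch SLURM_NTASKS copies of prog,
# so an MPI code and a multithreaded code want different choices here.
```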

Re: [slurm-users] is there a way to temporarily freeze an account?

2021-10-06 Thread Thomas M. Payerle
There are a lot of parameters controlling the resources which an account and/or user can use. I suspect your default setup only uses a few of them. To temporarily disable an account, you can apply a setting which is disjoint from your normal methods. E.g., if you normally set GrpTresMins to limi
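One minimal way to apply a "disjoint" setting, assuming a hypothetical account projA and that you do not normally manage submit limits:

```shell
# Freeze the account: no new jobs can be submitted under it.
sacctmgr modify account name=projA set GrpSubmitJobs=0
# Unfreeze later: a value of -1 clears a previously set limit.
sacctmgr modify account name=projA set GrpSubmitJobs=-1
```

Note that GrpSubmitJobs=0 blocks new submissions but does not stop jobs already running or queued.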

Re: [slurm-users] squeue: compact pending job-array in one partition, but not in other

2021-02-24 Thread Thomas M. Payerle
I believe this behavior is intended as various properties of the jobs that were requeued no longer match the properties of the rest of the job array. It might not show on the minimal output you are displaying, but I suspect the jobs were requeued at different times and so the priorities of the jobs

Re: [slurm-users] Using "Environment Modules" in a SLURM script

2021-01-22 Thread Thomas M. Payerle
On our clusters, we typically find that an explicit source of the initialization dot files is needed IF the default shell of the user submitting the job does _not_ match the shell being used to run the script. I.e., for sundry historical and other reasons, the "default" login shell for users on our
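A sketch of a batch script that avoids the mismatch; the job name and module name are placeholders:

```shell
#!/bin/bash -l
#SBATCH --job-name=modules-example
# The -l flag above makes bash behave as a login shell, so /etc/profile and
# the profile.d scripts are sourced and the "module" command is defined even
# when the submitting user's login shell is csh/tcsh. Alternatively, source
# the Environment Modules init file explicitly:
#   source /etc/profile.d/modules.sh
module load foo
```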

Re: [slurm-users] Job canceled after reaching QOS limits for CPU time.

2020-10-30 Thread Thomas M. Payerle
On Fri, Oct 30, 2020 at 5:37 AM Loris Bennett wrote: > Hi Zacarias, > > Zacarias Benta writes: > > > Good morning everyone. > > > > I'm having a "issue", I don't know if it is a "bug or a feature". > > I've created a QOS: "sacctmgr add qos myqos set GrpTRESMins=cpu=10 > > flags=NoDecay". I know

Re: [slurm-users] Controlling access to idle nodes

2020-10-06 Thread Thomas M. Payerle
We use a scavenger partition, and although we do not have the policy you describe, it could be used in your case. Assume you have 6 nodes (node-[0-5]) and two groups A and B. Create partitions partA = node-[0-2] partB = node-[3-5] all = node-[0-5] Create QoSes normal and scavenger. Allow normal Q
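A rough sketch of that setup; the exact preemption behavior depends on the PreemptType/PreemptMode settings in slurm.conf, so treat this as an outline rather than a drop-in config:

```shell
# slurm.conf fragment (node names as in the example above):
#   PreemptType=preempt/qos
#   PartitionName=partA Nodes=node-[0-2] AllowQos=normal
#   PartitionName=partB Nodes=node-[3-5] AllowQos=normal
#   PartitionName=all   Nodes=node-[0-5] AllowQos=scavenger
# QoS setup: normal jobs may preempt scavenger jobs, which get requeued.
sacctmgr add qos scavenger set PreemptMode=requeue
sacctmgr add qos normal
sacctmgr modify qos normal set Preempt=scavenger
```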

Re: [slurm-users] EXTERNAL: Re: Memory per CPU

2020-09-29 Thread Thomas M. Payerle
I am not familiar with using Slurm with VMs, but do note that Slurm can behave a bit "unexpectedly" with memory constraints due to the memory consumed by OS, etc. E.g., if I had a 16 core machine with 64 GB of RAM and requested 16 cores with 4 GB/core, it would not fit on this machine because some
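The arithmetic behind that example, with an assumed (site-specific) amount of memory reserved for the OS and slurmd:

```shell
# Hypothetical node from the example: 16 cores, 64 GB of RAM.
total_mb=65536
reserved_mb=2048                 # assumed OS/slurmd reservation; varies by site
avail_mb=$((total_mb - reserved_mb))
request_mb=$((16 * 4096))        # 16 cores at 4 GB/core
if [ "$request_mb" -gt "$avail_mb" ]; then
  echo "request of ${request_mb} MB does not fit in ${avail_mb} MB"
fi
```

This is why requests sized to the nominal RAM of a node often end up pending with no node able to satisfy them.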

Re: [slurm-users] How to queue jobs based on non-existent features

2020-08-13 Thread Thomas M. Payerle
I have not had a chance to look at your code, but find it intriguing, although I am not sure about use cases. Do you do anything to lock out other jobs from the affected node? E.g., you submit a job with unsatisfiable constraint foo. The tool scanning the cluster detects a job queued with foo cons

Re: [slurm-users] Reservation vs. Draining for Maintenance?

2020-08-06 Thread Thomas M. Payerle
We usually set up a reservation for maintenance. This prevents jobs from starting if they are not expected to end before the reservation (maintenance) starts. As Paul indicated, this causes nodes to become idle (and the pending job queue to grow) as maintenance time approaches, but avoids requiring

Re: [slurm-users] Slurm Perl API use and examples

2020-03-23 Thread Thomas M. Payerle
I was never able to figure out how to use the Perl API shipped with Slurm, but instead have written some wrappers around some of the Slurm commands for Perl. My wrappers for the sacctmgr and share commands are available at CPAN: https://metacpan.org/release/Slurm-Sacctmgr https://metacpan.org/rele

Re: [slurm-users] Heterogeneous HPC

2019-09-19 Thread Thomas M. Payerle
While I agree containers can be quite useful in HPC environments for dealing with applications requiring different library versions, there are limitations. In particular, the kernel inside the container is the same as running outside the container. Where this seems to be most problematic is when

Re: [slurm-users] Jobs waiting while plenty of cpu and memory available

2019-07-09 Thread Thomas M. Payerle
You can use squeue to see the priority of jobs. I believe it normally shows jobs in order of priority, even though does not display priority. If you want to see actual priority, you need to request it in the format field. I typically use squeue -o "%.18i %.12a %.6P %.8u %.2t %.8m %.4D %.4C %12l
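For example, the %Q format field prints the computed priority, and sorting can be made explicit:

```shell
# Pending jobs, highest priority first; %Q is the integer priority.
squeue --state=PENDING --sort=-p -o "%.18i %.9P %.8u %.10Q %.8T %r"
# For a per-factor breakdown (age, fairshare, job size, ...):
sprio -l
```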

Re: [slurm-users] Requirement to run longer jobs

2019-07-03 Thread Thomas M. Payerle
The dual QoSes (or dual partition solution suggested by someone else) should both work in allowing select users to submit jobs with longer run times. We use something like that on our cluster (though I confess it was our first Slurm cluster and we may have overdone it with QoSes causing scheduler to

Re: [slurm-users] Multinode MPI job

2019-03-27 Thread Thomas M. Payerle
As partition CLUSTER is not in your /etc/slurm/parts file, it likely was added via scontrol command. Presumably you or a colleague created a CLUSTER partition, whether intentionally or not. Use scontrol show partition CLUSTER to view it. On Wed, Mar 27, 2019 at 1:44 PM Mahmood Naderan wrote: >

Re: [slurm-users] Remove memory limit from GrpTRES

2019-03-27 Thread Thomas M. Payerle
From the sacctmgr man page: "To clear a previously set value use the modify command with a new value of -1 for each TRES id." So something like # sacctmgr modify user ghatee set GrpTRES=mem=-1 Similar for other TRES settings On Wed, Mar 27, 2019 at 1:44 PM Mahmood Naderan wrote: > Hi, > I want to

Re: [slurm-users] Slurm doesn't call mpiexec or mpirun when run through a GUI app

2019-03-22 Thread Thomas M. Payerle
application's environment, rather than an issue > with the Python-Slurm interaction. The main piece of evidence that this > might be a bug in Slurm is that this issue started after the upgrade from > 18.08.5-2 to 18.08.6-2, but correlation doesn't necessarily mean causation. > > Pre

Re: [slurm-users] Slurm doesn't call mpiexec or mpirun when run through a GUI app

2019-03-22 Thread Thomas M. Payerle
Assuming the GUI produced script is as you indicated (I am not sure where you got the script you showed, but if it is not the actual script used by a job it might be worthwhile to examine the Command= file from scontrol show job to verify), then the only thing that should be different from a GUI su

Re: [slurm-users] How to force jobs to run next in queue

2019-03-12 Thread Thomas M. Payerle
Are you using the priority/multifactor plugin? What are the values of the various Priority* weight factors? On Tue, Mar 12, 2019 at 12:42 PM Sean Brisbane wrote: > Hi, > > Thanks for your help. > > Either setting qos or setting priority doesn't work for me. However I > have found the cause if
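A quick, read-only way to answer both questions on any Slurm cluster:

```shell
# Show the active priority plugin and all Priority* weight factors.
scontrol show config | grep -i '^Priority'
# Typical multifactor knobs to look for: PriorityWeightAge,
# PriorityWeightFairshare, PriorityWeightJobSize, PriorityWeightPartition,
# PriorityWeightQOS.
```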

Re: [slurm-users] Priority access for a group of users

2019-03-01 Thread Thomas M. Payerle
My understanding is that with PreemptMode=requeue, the running scavenger job processes on the node will be killed, but the job will be placed back in the queue (assuming the job's specific parameters allow this). A job can have a --no-requeue flag set, in which case I assume it behaves the same as
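A sketch of the relevant settings (job.sh is a placeholder script; the slurm.conf lines are one possible configuration, not the poster's):

```shell
# slurm.conf sketch: preempted jobs are requeued rather than cancelled.
#   PreemptType=preempt/qos
#   PreemptMode=REQUEUE
# A job can opt out of being requeued; if preempted it is then cancelled:
sbatch --no-requeue job.sh
```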

Re: [slurm-users] Large job starvation on cloud cluster

2019-02-27 Thread Thomas M. Payerle
I do know it isn't > attempting to start nodes to satisfy the larger job. > > > JobId=2210784 delayed for accounting policy is likely the key as it > indicates the job is currently unable to run, so the lower priority smaller > job bumps ahead of it. > > Yeah, that's e

Re: [slurm-users] Large job starvation on cloud cluster

2019-02-27 Thread Thomas M. Payerle
The "JobId=2210784 delayed for accounting policy" message is likely the key, as it indicates the job is currently unable to run, so the lower priority smaller job bumps ahead of it. You have not provided enough information (cluster configuration, job information, etc) to diagnose what accounting policy is be

Re: [slurm-users] Accounting configuration

2019-01-15 Thread Thomas M. Payerle
Generally, the add, modify, etc sacctmgr commands want a "user" or "account" entity, but can modify associations through this. E.g., if user baduser should have GrpTRESmin of cpu=1000 set on partition special, use something like sacctmgr add user name=baduser partition=special account=testacct grp
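A complete version of that idea, with the same hypothetical names; the limit applies only to baduser's association on partition special, not cluster-wide:

```shell
sacctmgr add user name=baduser partition=special account=testacct \
    GrpTRESMins=cpu=1000
# Verify the per-partition association was created:
sacctmgr show assoc where user=baduser \
    format=User,Account,Partition,GrpTRESMins
```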

Re: [slurm-users] Noob slurm question

2018-12-12 Thread Thomas M. Payerle
Slurm accounting is based on the notion of "associations". An association is a set of cluster, partition, allocation account, and user. I think most sites do the accounting so that it is a single limit applied to all partitions, etc. but you can use sacctmgr to apply limits at any association lev

Re: [slurm-users] External provisioning for accounts and other things (?)

2018-09-18 Thread Thomas M. Payerle
We make use of a large home-grown library of Perl scripts for creating allocations, creating users, adding users to allocations, etc. We have a number of "flavors" of allocations, but most allocation creation/disabling activity occurs with respect to applications for allocations which are re

Re: [slurm-users] All user's jobs killed at the same time on all nodes

2018-06-29 Thread Thomas M. Payerle
A couple comments/possible suggestions. First, it looks to me that all the jobs are run from the same directory with same input/output files. Or am I missing something? Also, what MPI library is being used? I would suggest verifying if any of the jobs in question are terminating normally. I.e.

Re: [slurm-users] Python and R installation in a SLURM cluster

2018-05-10 Thread Thomas M. Payerle
Assuming you plan for users to use R in jobs, it will need to be accessible to the execute/compute nodes. I would usually suggest on a shared drive. Although it should be OK if locally installed on each compute node (probably want at same exact path and with same R packages installed). Presumabl

Re: [slurm-users] How to access environment variables in submit script?

2018-05-10 Thread Thomas M. Payerle
I don't believe that is possible. The #SBATCH lines are comments to the shell, so it does not do any variable expansion there. To my knowledge, Slurm does not do any variable expansion in the parameters either. If you really needed that sort of functionality, you would probably need to have someth
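A sketch of the usual workaround: do the expansion in the calling shell, where it works, instead of in the #SBATCH comment lines, where it does not (MYNAME and job.sh are placeholders):

```shell
# This would NOT work inside the script -- #SBATCH lines are comments,
# so the shell never expands $MYNAME there:
#   #SBATCH --job-name=$MYNAME
# Instead, pass the option on the command line; options given to sbatch
# override #SBATCH lines in the script:
MYNAME=myjob42
sbatch --job-name="$MYNAME" job.sh
```

A thin wrapper script that builds the sbatch command line is the common way to generalize this.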

Re: [slurm-users] New Billing TRES Issue

2018-04-27 Thread Thomas M. Payerle
I have not had a chance to play with the newest Slurm, but I would suggest looking at GrpTRESRaw, which is supposed to gather the usage by TRES (in TRES-minutes). So if there is a billing TRES in GrpTRESRaw, that might be what you want. On Fri, Apr 27, 2018 at 11:21 AM, Roberts, John E. wrote: >