Re: [slurm-users] Preempt jobs to stay within account TRES limits?

2022-10-23 Thread Steven Dick
QOS Group TRES limits apply to associations. If I recall correctly, an association is a (user, account, partition, cluster) tuple. On Fri, Oct 21, 2022 at 9:46 AM Matthew R. Baney wrote: > > Hello, > > I have noticed that jobs submitted to non-preemptable partitions (PreemptType > =
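As a concrete example, an account-level group limit on the association side can be set with sacctmgr (the account name and numbers below are placeholders):

sacctmgr modify account myacct set GrpTRES=cpu=256,gres/gpu=8
sacctmgr show assoc where account=myacct format=Cluster,Account,User,Partition,GrpTRES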

Re: [slurm-users] MinTRESPerJob on partitions?

2020-10-14 Thread Steven Dick
You can set MinTRESPerJob in a QOS and then only allow that QOS in that partition. Or have a set of QOSes for that partition that all have it set... I'm not sure if a partition QOS would help here, but it could, basically forcing that QOS on all jobs in the partition. I've found that debugging lua
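A minimal sketch of both variants (the QOS name, partition definition, and the 4-GPU minimum are made up):

sacctmgr add qos gpumin
sacctmgr modify qos gpumin set MinTRESPerJob=gres/gpu=4
# in slurm.conf: either only allow that QOS in the partition...
PartitionName=gpu Nodes=gpu[01-08] AllowQos=gpumin
# ...or force it on every job in the partition as a partition QOS
PartitionName=gpu Nodes=gpu[01-08] QOS=gpumin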

Re: [slurm-users] Mocking SLURM to debug job_submit.lua

2020-09-27 Thread Steven Dick
On Wed, Sep 23, 2020 at 12:37 PM Renfro, Michael wrote: > Not having a separate test environment, I put logic into my job_submit.lua to > use either the production settings or the ones under development or testing, > based off the UID of the user submitting the job: I've also done it that way,
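A minimal sketch of that UID-gated pattern, deployed from the shell (the path, the test UID 12345, and the log text are hypothetical):

cat > /etc/slurm/job_submit.lua <<'EOF'
function slurm_job_submit(job_desc, part_list, submit_uid)
   if submit_uid == 12345 then  -- the test user gets the rules under development
      slurm.log_info("job_submit: experimental path for uid %d", submit_uid)
      -- experimental logic goes here
   end
   -- everyone else falls through to the production behavior
   return slurm.SUCCESS
end
function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
   return slurm.SUCCESS
end
EOF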

Re: [slurm-users] [External] [slurm 20.02.3] don't suspend nodes in down state

2020-09-03 Thread Steven Dick
I think there are at least two possible ways to do what you want. You can make a reservation on the node and mark it as a maintenance reservation. I don't know if slurm will shut down the node if it is idle while it has a maintenance reservation, but it certainly won't if you also run a job as
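For the reservation route, something along these lines (node name, duration, and reservation name are placeholders):

scontrol create reservation reservationname=node01_maint nodes=node01 \
    starttime=now duration=7-00:00:00 users=root flags=maint,ignore_jobs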

Re: [slurm-users] Cancel "reboot ASAP" for a node

2020-08-10 Thread Steven Dick
Also, state=resume should work. On Fri, Aug 7, 2020 at 12:25 PM Hanby, Mike wrote: > > This is what's in /var/log/slurmctld > Invalid node state transition requested for node c01 from=DRAINING > to=CANCEL_REBOOT > > > > So it looks like, for version 18.08 at least, you have to first undrain,
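In practice the sequence on a drained node looks something like this (node name taken from the quoted log):

# clear the pending reboot directly, where that transition is allowed...
scontrol update nodename=c01 state=cancel_reboot
# ...or undrain first and then resume, as described above
scontrol update nodename=c01 state=undrain
scontrol update nodename=c01 state=resume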

Re: [slurm-users] Restart Job after sudden reboot of the node

2020-07-24 Thread Steven Dick
Both. See man sbatch, --requeue. The default is to not requeue (unless it was changed in slurm.conf), and your job can check $SLURM_RESTART_COUNT to see if it has been restarted. This is handy if your job can checkpoint / restart. On Fri, Jul 24, 2020 at 3:33 PM Saikat Roy wrote: > Hello, > > I
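A minimal sketch of a job script that opts in to requeueing and notices a restart:

#!/bin/bash
#SBATCH --requeue
if [ "${SLURM_RESTART_COUNT:-0}" -gt 0 ]; then
    echo "restart number $SLURM_RESTART_COUNT - restoring from the last checkpoint"
    # restore checkpoint here
fi
# ... do the work, writing checkpoints as you go ...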

Re: [slurm-users] Automatically cancel jobs not utilizing their GPUs

2020-07-03 Thread Steven Dick
I have collectd running on my gpu nodes with the collectd_nvidianvml plugin from pip. I have a collectd frontend that displays that data along with slurm data for the whole cluster for users to see. Some of my users watch that carefully and tune their jobs to maximize utilization. When I spot

Re: [slurm-users] Slurm and shared file systems

2020-06-19 Thread Steven Dick
Condor's original premise was to have long-running compute jobs on distributed nodes with no shared filesystem. Of course, they played all kinds of dirty tricks to make this work, including intercepting libc and system calls. I see no reason cleverly wrapped slurm jobs couldn't do the same, either
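A crude sketch of such a wrapper, assuming password-less rsync back to a submit host (host names, paths, and the program name are made up):

#!/bin/bash
#SBATCH --job-name=staged
SRC=submit-host:/home/me/project          # hypothetical remote source
WORK=/tmp/work.$SLURM_JOB_ID              # node-local scratch
mkdir -p "$WORK" && cd "$WORK"
rsync -a "$SRC/" .                        # stage input in
./run_model input.dat > output.dat        # the actual computation
rsync -a output.dat "$SRC/results/"       # stage results back out
cd / && rm -rf "$WORK"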

Re: [slurm-users] Node suspend / Power saving - for *idle* nodes only?

2020-05-15 Thread Steven Dick
I've had slurm power off a few nodes I was working on... My normal solution is to just power them back on without slurm's help. Then it brings the node up in state "down / unexpectedly booted" and it doesn't seem to mess with them until I use scontrol to change the state again. (I like scontrol
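If the goal is just to keep slurm's power saving away from nodes you're working on, the exclusion knobs in slurm.conf are the cleaner route (node and partition names are placeholders):

# in slurm.conf
SuspendExcNodes=node[01-02]     # never suspend these nodes
SuspendExcParts=debug           # or exclude a whole partition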

Re: [slurm-users] additional jobs killed by scancel.

2020-05-13 Thread Steven Dick
Hmm, works for me. Maybe they added that in more recent versions of slurm. I'm using version 18+ On Wed, May 13, 2020 at 5:12 PM Alastair Neil wrote: > > invalid field requested: "reason" > > On Tue, 12 May 2020 at 16:47, Steven Dick wrote: >> >> What do

Re: [slurm-users] additional jobs killed by scancel.

2020-05-12 Thread Steven Dick
8-soft Failed, Run time 04:32:51, FAILED >> [2020-05-10T00:26:05.215] _job_complete: JobId=533900 done > > > it is curious, that all the jobs were running on the same processor, perhaps > this is a cgroup related failure? > > On Tue, 12 May 2020 at 10:10, Steven Dick wrote: >>

Re: [slurm-users] additional jobs killed by scancel.

2020-05-12 Thread Steven Dick
I see one job cancelled and two jobs failed. Your slurmd log is incomplete -- it doesn't show the two failed jobs exiting/failing, so the real error is not here. It might also be helpful to look through slurmctld's log starting from when the first job was canceled, looking at any messages

Re: [slurm-users] slurmdbd crashes with segmentation fault following DBD_GET_ASSOCS

2020-05-11 Thread Steven Dick
Previous versions of mysql are supposed to have nasty security issues. I'm not sure why I had mysql instead of mariadb anyway. On Mon, May 11, 2020 at 9:29 AM Relu Patrascu wrote: > > We've experienced the same problem on several versions of slurmdbd > (18, 19) so we downgraded mysql and put a

Re: [slurm-users] slurmdbd crashes with segmentation fault following DBD_GET_ASSOCS

2020-05-10 Thread Steven Dick
The latest releases of slurm (17-20) don't work with mysql 5.7.30; the latest version of mariadb works fine. On Tue, May 5, 2020 at 3:41 PM Dustin Lang wrote: > > I tried upgrading Slurm to 18.08.9 and I am still getting this Segmentation > Fault! > > > > On Tue, May 5, 2020 at 2:39 PM Dustin Lang

Re: [slurm-users] How to get the Average number of CPU cores used by jobs per day?

2020-04-02 Thread Steven Dick
Have you looked at sreport? On Fri, Apr 3, 2020 at 1:09 AM Sudeep Narayan Banerjee wrote: > > How to get the Average number of CPU cores used by jobs per day by a > particular group? > > By group means: say faculty group1, group2 etc. all those groups are having a > certain number of students
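For instance, per-account usage for one day, reported in hours (cluster and account names are placeholders); dividing the core-hours by 24 gives the average cores in use that day:

sreport cluster AccountUtilizationByUser cluster=mycluster accounts=group1 \
    start=2020-04-01 end=2020-04-02 -t hours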

Re: [slurm-users] Is it safe to convert cons_res to cons_tres on a running system?

2020-03-26 Thread Steven Dick
When I changed this on a running system, no jobs were killed, but slurm lost track of jobs on nodes and was unable to kill them or tell when they were finished until slurmd on each node was restarted. I let running jobs complete and monitored them manually, and restarted slurmd on each node as
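For reference, the change itself is small; the catch described above is that restarting slurmctld alone is not enough (node list and the parallel-shell tool are placeholders):

# in slurm.conf
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory     # keep whatever parameters you already had
# a reconfigure is not enough for a SelectType change; restart the daemons
systemctl restart slurmctld
pdsh -w node[01-99] systemctl restart slurmd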

Re: [slurm-users] Environment modules

2019-11-24 Thread Steven Dick
lmod can mark modules as deprecated, so users are warned. I think you might also be able to get it to collect statistics on module usage or something. lmod also has the advantage of being much more complicated and much less efficient if set up incorrectly. On Sun, Nov 24, 2019 at 9:20 PM Brian

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-31 Thread Steven Dick
or your help. > > Looks like QOS is the way to go if I want both job arrays + user limits on > jobs/resources (in the context of a regression-test). > > Regards, > Guillaume. > > On Fri, Aug 30, 2019 at 6:11 PM Steven Dick wrote: >> >> On Fri, Aug 30, 2019 at

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-30 Thread Steven Dick
On Fri, Aug 30, 2019 at 2:58 PM Guillaume Perrault Archambault wrote: > My problem with that though, is what if each script (the 9 scripts in my > earlier example) each require different requirements? For example, run on a > different partition, or set a different time limit? My understanding

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-30 Thread Steven Dick
ould > show me how to set a job limit that takes effect over multiple job arrays. > > I may have very glaring oversights as I don't necessarily have a big picture > view of things (I've never been an admin, most notably), so feel free to poke > holes at the way I've constructed th

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-29 Thread Steven Dick
This makes no sense and seems backwards to me. When you submit an array job, you can specify how many jobs from the array you want to run at once. So, an administrator can create a QOS that explicitly limits the user. However, you keep saying that they probably won't modify the system for just
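Both knobs, side by side (names and numbers are invented):

# user side: at most 10 tasks of this array run at once
sbatch --array=0-199%10 job.sh
# admin side: a QOS that caps one user's running and queued jobs
sacctmgr add qos regtest
sacctmgr modify qos regtest set MaxJobsPerUser=10 MaxSubmitJobsPerUser=500
sacctmgr modify user someuser set qos+=regtest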

Re: [slurm-users] Using cgroups to hide GPUs on a shared controller/node

2019-05-26 Thread Steven Dick
What operating system are you running? Modern versions of systemd automatically put login sessions into their own cgroup which are themselves in a "user" group. When slurm is running parallel to this, it makes its own slurm cgroup. It should be possible to have something at boot modify the
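A rough sketch of the boot-time idea, assuming cgroup v1 with the devices controller mounted and user.slice present in that hierarchy (195 is the NVIDIA character-device major):

# run once at boot, e.g. from a oneshot unit: hide GPUs from login sessions
echo 'c 195:* rwm' > /sys/fs/cgroup/devices/user.slice/devices.deny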

[slurm-users] how to find out why a job won't run?

2018-11-23 Thread Steven Dick
I'm looking for a tool that will tell me why a specific job in the queue is still waiting to run. squeue doesn't give enough detail. If the job is held up on QOS, it's pretty obvious. But if it's resources, it's difficult to tell. If a job is not running because of resources, how can I
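For a single pending job, these at least surface slurm's stated reason and its projected start time (job id is a placeholder):

squeue -j 123456 -o '%i %T %r'            # state and reason
scontrol show job 123456 | grep -E 'Reason|StartTime'
squeue --start -j 123456                  # backfill's expected start time, if any
sprio -j 123456                           # priority components, if it's a priority wait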

Re: [slurm-users] slurmdbd not showing job accounting

2018-10-14 Thread Steven Dick
It is documented that you need to create the cluster in the database. It is not documented that the accounting system won't work until you restart slurmdbd multiple times before it starts collecting accounting records. Also, none of the necessary restarts are needed on an upgrade -- only when

Re: [slurm-users] slurmdbd not showing job accounting

2018-10-13 Thread Steven Dick
I've found that when creating a new cluster, slurmdbd does not function correctly right away. It may be necessary to restart slurmdbd at several points during the slurm installation process to get everything working correctly. Also, slurmctld will buffer the accounting data until slurmdbd starts
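For completeness, the registration step being referred to (the cluster name must match ClusterName= in slurm.conf):

sacctmgr add cluster mycluster       # register the cluster in the accounting database
systemctl restart slurmdbd           # and, per the above, possibly restart it more than once
systemctl restart slurmctld
sacctmgr show cluster                # confirm the cluster is registered
sacct -a                             # accounting records should now start appearing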