[slurm-users] Re: Unsupported RPC version by slurmctld 19.05.3 from client slurmd 22.05.11

2024-06-17 Thread Ryan Novosielski via slurm-users
The benefits are pretty limited if you don’t have the server upgraded anyway, 
unless you’re just saying it’s easier to install a current client.

--
#BlackLivesMatter

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
 `'

On Jun 17, 2024, at 05:40, ivgeokig via slurm-users 
 wrote:

Hello!


I have a question. I have a server running 19.05.3, with no chance to upgrade it.
Is there any chance to connect a new 22.05.11 client?

Here is the log from the server:

[2024-06-14T15:20:40.345] debug:  Unsupported RPC version 9728 msg type 0(0)
[2024-06-14T15:20:40.345] error: g_slurm_auth_unpack: remote plugin_id 12340565 
not found
[2024-06-14T15:20:40.345] error: slurm_unpack_received_msg: Invalid Protocol 
Version 9728 from uid=-1 at 10.82.0.6:52170
[2024-06-14T15:20:40.345] error: slurm_unpack_received_msg: Incompatible 
versions of client and server code
[2024-06-14T15:20:40.355] error: slurm_receive_msg [10.82.0.6:52170]: 
Unspecified error

Regards,
Oleg




[slurm-users] Re: diagnosing why interactive/non-interactive job waits are so long with State=MIXED

2024-06-04 Thread Ryan Novosielski via slurm-users
We do have bf_continue set. And also bf_max_job_user=50, because we discovered 
that one user can submit so many jobs that it will hit the limit of the number 
it’s going to consider and not run some jobs that it could otherwise run.
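
For reference, that combination ends up looking something like this in slurm.conf (just a sketch of what we were describing, not a tuned recommendation):

SchedulerParameters=bf_continue,bf_max_job_user=50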

On Jun 4, 2024, at 16:20, Robert Kudyba  wrote:

Thanks for the quick response Ryan!

Are there any recommendations for bf_ options from 
https://slurm.schedmd.com/sched_config.html that could help with this? 
bf_continue? Decreasing bf_interval= to a value lower than 30?

On Tue, Jun 4, 2024 at 4:13 PM Ryan Novosielski <novos...@rutgers.edu> wrote:
This is relatively true of my system as well, and I believe it’s that the 
backfill scheduler is slower than the main scheduler.

--
#BlackLivesMatter

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
 `'

On Jun 4, 2024, at 16:03, Robert Kudyba via slurm-users <slurm-users@lists.schedmd.com> wrote:

At the moment we have 2 nodes that are having long wait times. Generally this 
is when the nodes are fully allocated. What would be the other reasons a job 
would take so long if there is still enough memory and CPU available? Slurm 
version is 23.02.4 via Bright Computing. Note the compute nodes have 
hyperthreading enabled, but that should be irrelevant. Is there a way to 
determine what else could be holding jobs up?

srun --pty  -t 0-01:00:00 --nodelist=node001 --gres=gpu:1 -A ourts -p short 
/bin/bash
srun: job 672204 queued and waiting for resources

 scontrol show node node001
NodeName=m001 Arch=x86_64 CoresPerSocket=48
   CPUAlloc=24 CPUEfctv=192 CPUTot=192 CPULoad=20.37
   AvailableFeatures=location=local
   ActiveFeatures=location=local
   Gres=gpu:A6000:8
   NodeAddr=node001 NodeHostName=node001 Version=23.02.4
   OS=Linux 5.14.0-70.13.1.el9_0.x86_64 #1 SMP PREEMPT Thu Apr 14 12:42:38 EDT 
2022
   RealMemory=1031883 AllocMem=1028096 FreeMem=222528 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=ours,short
   BootTime=2024-04-29T16:18:30 SlurmdStartTime=2024-05-18T16:48:11
   LastBusyTime=2024-06-03T10:49:49 ResumeAfterTime=None
   CfgTRES=cpu=192,mem=1031883M,billing=192,gres/gpu=8
   AllocTRES=cpu=24,mem=1004G,gres/gpu=2,gres/gpu:a6000=2
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

grep 672204 /var/log/slurmctld
[2024-06-04T15:50:35.627] sched: _slurm_rpc_allocate_resources JobId=672204 
NodeList=(null) usec=852



[slurm-users] Re: diagnosing why interactive/non-interactive job waits are so long with State=MIXED

2024-06-04 Thread Ryan Novosielski via slurm-users
This is relatively true of my system as well, and I believe it’s that the 
backfill scheduler is slower than the main scheduler.

--
#BlackLivesMatter

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
 `'

On Jun 4, 2024, at 16:03, Robert Kudyba via slurm-users 
 wrote:

At the moment we have 2 nodes that are having long wait times. Generally this 
is when the nodes are fully allocated. What would be the other reasons a job 
would take so long if there is still enough memory and CPU available? Slurm 
version is 23.02.4 via Bright Computing. Note the compute nodes have 
hyperthreading enabled, but that should be irrelevant. Is there a way to 
determine what else could be holding jobs up?

srun --pty  -t 0-01:00:00 --nodelist=node001 --gres=gpu:1 -A ourts -p short 
/bin/bash
srun: job 672204 queued and waiting for resources

 scontrol show node node001
NodeName=m001 Arch=x86_64 CoresPerSocket=48
   CPUAlloc=24 CPUEfctv=192 CPUTot=192 CPULoad=20.37
   AvailableFeatures=location=local
   ActiveFeatures=location=local
   Gres=gpu:A6000:8
   NodeAddr=node001 NodeHostName=node001 Version=23.02.4
   OS=Linux 5.14.0-70.13.1.el9_0.x86_64 #1 SMP PREEMPT Thu Apr 14 12:42:38 EDT 
2022
   RealMemory=1031883 AllocMem=1028096 FreeMem=222528 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=ours,short
   BootTime=2024-04-29T16:18:30 SlurmdStartTime=2024-05-18T16:48:11
   LastBusyTime=2024-06-03T10:49:49 ResumeAfterTime=None
   CfgTRES=cpu=192,mem=1031883M,billing=192,gres/gpu=8
   AllocTRES=cpu=24,mem=1004G,gres/gpu=2,gres/gpu:a6000=2
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

grep 672204 /var/log/slurmctld
[2024-06-04T15:50:35.627] sched: _slurm_rpc_allocate_resources JobId=672204 
NodeList=(null) usec=852



[slurm-users] Re: slurmdbd not connecting to mysql (mariadb)

2024-05-30 Thread Ryan Novosielski via slurm-users
Are you looking at the log/what appears on the screen, and do you know for a 
fact that it is all the way up (it should say "slurmdbd version ... started" 
at the end)?

If that’s not it, you could have a permissions thing or something.

I do not expect you’d need to extend the timeout for a normal run. I suspect it 
is doing something.
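
If it does turn out that you need more headroom while investigating, a systemd drop-in along these lines raises the start timeout (the value is arbitrary; adjust to taste):

# systemctl edit slurmdbd, then add:
[Service]
TimeoutStartSec=600
# followed by: systemctl daemon-reload && systemctl restart slurmdbd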

--
#BlackLivesMatter

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
 `'

On May 30, 2024, at 23:57, Radhouane Aniba  wrote:

Manually running it through sudo slurmdbd -D /path/to/conf is very quick on my 
fresh install.

Trying to start slurmdbd through systemctl takes 3 minutes and then it crashes 
and fails.

Is there an alternative to systemctl to start slurmdbd in the background?

But most importantly I wanted to know why it takes so long through systemctl. 
Maybe I can increase the timeout limit?

On Thu, May 30, 2024 at 11:54 PM Ryan Novosielski <novos...@rutgers.edu> wrote:
It may take longer to start than systemd allows for. How long does it take to 
start from the command line? It’s common to need to run it manually for 
upgrades to complete.

--
#BlackLivesMatter

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
 `'

On May 30, 2024, at 20:24, Radhouane Aniba via slurm-users <slurm-users@lists.schedmd.com> wrote:

Ok I made some progress here.

I removed and purged slurmdbd mysql mariadb etc .. and started from scratch.
I added the recommended mysqld requirements

Started slurmdbd manually : sudo slurmdbd -D /path/to/conf and everything 
worked well

When I tried to start the service with sudo systemctl start slurmdbd.service, 
it didn't work:

sudo systemctl status  slurmdbd.service
● slurmdbd.service - Slurm DBD accounting daemon
 Loaded: loaded (/etc/systemd/system/slurmdbd.service; enabled; vendor 
preset: enabled)
 Active: failed (Result: timeout) since Fri 2024-05-31 00:21:30 UTC; 2min 
5s ago
Process: 6258 ExecStart=/usr/sbin/slurmdbd -D /etc/slurm-llnl/slurmdbd.conf 
(code=exited, status=0/SUCCESS)

May 31 00:20:00 hannibal-hn systemd[1]: Starting Slurm DBD accounting daemon...
May 31 00:21:30 hannibal-hn systemd[1]: slurmdbd.service: start operation timed 
out. Terminating.
May 31 00:21:30 hannibal-hn systemd[1]: slurmdbd.service: Failed with result 
'timeout'.
May 31 00:21:30 hannibal-hn systemd[1]: Failed to start Slurm DBD accounting 
daemon.

Even though it is the same command ?!

Any idea ?

--
Rad Aniba, PhD





[slurm-users] Re: Jobs showing running but not running

2024-05-29 Thread Ryan Novosielski via slurm-users
One of the other states — down or fail, from memory — should cause it to 
completely drop the job.

--
#BlackLivesMatter

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
 `'

On May 29, 2024, at 13:15, Sushil Mishra via slurm-users 
 wrote:

Hi All,

I'm managing a cluster with Slurm, consisting of 4 nodes. One of the compute 
nodes appears to be experiencing issues. While the front node's 'squeue' 
command indicates that jobs are running, upon connecting to the problematic 
node, I observe no active processes and GPUs are not being utilized.

[sushil@ccbrc ~]$ sinfo -Nel
Wed May 29 12:00:08 2024
NODELIST  NODES PARTITION       STATE CPUS  S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
gag1          1 defq*           mixed   48 2:24:1    370              1 (null)   none
gag1          1 glycore         mixed   48 2:24:1    370              1 (null)   none
glyco1        1 defq*     completing*  128 2:64:1    500              1 (null)   none
glyco1        1 glycore   completing*  128 2:64:1    500              1 (null)   none
glyco2        1 defq*           mixed  128 2:64:1    500              1 (null)   none
glyco2        1 glycore         mixed  128 2:64:1    500              1 (null)   none
mannose1      1 defq*           mixed   24 2:12:1    180              1 (null)   none
mannose1      1 glycore         mixed   24 2:12:1    180              1 (null)   none


On glyco1 (affected node!):
squeue # gets stuck
sudo systemctl restart slurmd  # gets stuck

I tried the following to clear the jobs stuck in CG state, but any new job 
appears to be stuck in a 'running' state without actually running.
scontrol update nodename=glyco1 state=down reason=cg
scontrol update nodename=glyco1 state=resume reason=cg

There is no I/O issue in that node, and all file systems are under 30% in use.  
Any advice on how to resolve this without rebooting the machine?

Best,
Sushil




[slurm-users] Re: Removing safely a node

2024-05-16 Thread Ryan Novosielski via slurm-users
If I’m not mistaken, the manual for slurm.conf or one of the others lists 
either what action is needed to change every option, or has a combined list of 
what requires what (I can never remember and would have to look it up anyway).
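
From memory (worth confirming against that manual), adding or removing nodes is one of the changes that wants a restart of slurmctld, and of slurmd on the remaining nodes, rather than just a reconfigure. A rough sketch, assuming systemd units:

# after slurm.conf has been updated everywhere
systemctl restart slurmctld
# on the remaining compute nodes
systemctl restart slurmd
# plain 'scontrol reconfigure' is enough for many other slurm.conf changes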

--
#BlackLivesMatter

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
 `'

On May 16, 2024, at 23:16, Ratnasamy, Fritz via slurm-users 
 wrote:

Hi,

 What is the "official" process to remove nodes safely? I have drained the 
nodes so jobs are completed and put them in down state after they are 
completely drained.
I edited the slurm.conf file to remove the nodes. After some time, I can see 
that the nodes were removed from the partition with the command sinfo

However, I was told I might need to restart the service slurmctld, do you know 
if it is necessary? Should I also run scontrol reconfig?
Best,
Fritz Ratnasamy
Data Scientist
Information Technology




[slurm-users] Re: Recover Batch Script Error

2024-02-16 Thread Ryan Novosielski via slurm-users
Are you absolutely certain you’ve done it before for completed jobs? I would 
not expect that to work for completed jobs, with the possible exception of very 
recently completed jobs (or am I thinking of Torque?).

Other replies mention the relatively new feature (21.08?) to store the job 
script in the database. Be mindful of the database implications here (I believe 
I have had conversations about this recently with some experienced sites on 
this mailing list).
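
If the scripts are being stored, a hedged sketch of what that looks like (flag and option names as I remember them; double-check against your version's man pages):

# slurm.conf, 21.08 or newer
AccountingStoreFlags=job_script

# then, for jobs submitted after that was enabled
sacct -j 38960 --batch-script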

--
#BlackLivesMatter

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
 `'

On Feb 16, 2024, at 14:41, Jason Simms via slurm-users 
 wrote:

Hello all,

I've used the "scontrol write batch_script" command to output the job 
submission script from completed jobs in the past, but for some reason, no 
matter which job I specify, it tells me it is invalid. Any way to troubleshoot 
this? Alternatively, is there another way - even if a manual database query - 
to recover the job script, assuming it exists in the database?

sacct --jobs=38960
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
38960        amr_run_v+ tsmith2lab tsmith2lab         72  COMPLETED      0:0
38960.batch       batch            tsmith2lab         40  COMPLETED      0:0
38960.extern     extern            tsmith2lab         72  COMPLETED      0:0
38960.0      hydra_pmi+            tsmith2lab         72  COMPLETED      0:0

scontrol write batch_script 38960
job script retrieval failed: Invalid job id specified

Warmest regards,
Jason

--
Jason L. Simms, Ph.D., M.P.H.
Manager of Research Computing
Swarthmore College
Information Technology Services
(610) 328-8102
Schedule a meeting: https://calendly.com/jlsimms



Re: [slurm-users] error: Couldn't find the specified plugin name for cred/munge looking at all files

2024-01-23 Thread Ryan Novosielski
Ah, I see — no, it’s 24.08. That’s why I didn’t find any reference to it.

Carry on! :-D

--
#BlackLivesMatter

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
 `'

On Jan 23, 2024, at 19:13, Jesse Aiton  wrote:

Yeah, 24.0.8 is the bleeding edge version.  I wanted to try the latest in case 
it was a bug in 20.x.x.  I’m happy to go back to any older Slurm version but I 
don’t think that will matter much if the issue occurs on both Slurm 20 and 
Slurm 24.


git clone https://github.com/SchedMD/slurm.git

Thanks,

Jesse

On Jan 23, 2024, at 4:07 PM, Ryan Novosielski  wrote:

On Jan 23, 2024, at 18:14, Jesse Aiton  wrote:

This is on Ubuntu 20.04 and happens both with Slurm 20.11.09 and 24.0.8

Thank you,

Jesse

I’m not sure what version you’re actually running, but I don’t believe there is 
a 24.0.8. The latest version I’m aware of is 23.11.2.

--
#BlackLivesMatter

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
 `'




Re: [slurm-users] error: Couldn't find the specified plugin name for cred/munge looking at all files

2024-01-23 Thread Ryan Novosielski
On Jan 23, 2024, at 18:14, Jesse Aiton  wrote:

This is on Ubuntu 20.04 and happens both with Slurm 20.11.09 and 24.0.8

Thank you,

Jesse

I’m not sure what version you’re actually running, but I don’t believe there is 
a 24.0.8. The latest version I’m aware of is 23.11.2.

--
#BlackLivesMatter

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
 `'


Re: [slurm-users] sacct --name --status filtering

2024-01-10 Thread Ryan Novosielski
All I can say is that this has to do with --starttime and that you have to read 
the manual really carefully about how they interact, including when you have 
--endtime set. It’s a bit fiddly and annoying, IMO, and I can never quite 
remember how it works.
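
One hedged guess worth trying, purely as an experiment: give it an explicit end time so the state filter and the time window line up, e.g.

sacct --starttime $(date -d "7 days ago" +"%Y-%m-%d") --endtime now \
  -X --format JobID,JobName,State,Elapsed --name zsh --state COMPLETED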

--
#BlackLivesMatter

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
 `'

On Jan 10, 2024, at 19:39, Drucker, Daniel  wrote:

What am I misunderstanding about how sacct filtering works here? I would have 
expected the second command to show the exact same results as the first.



[root@mickey ddrucker]# sacct --starttime $(date -d "7 days ago" +"%Y-%m-%d") 
-X --format JobID,JobName,State,Elapsed   --name zsh
JobID           JobName      State    Elapsed
------------ ---------- ---------- ----------
257713              zsh  COMPLETED   00:01:02
257714              zsh  COMPLETED   00:04:01
257715              zsh  COMPLETED   00:03:01
257716              zsh  COMPLETED   00:03:01

[root@mickey ddrucker]# sacct --starttime $(date -d "7 days ago" +"%Y-%m-%d") 
-X --format JobID,JobName,State,Elapsed   --name zsh --state COMPLETED
JobID           JobName      State    Elapsed
------------ ---------- ---------- ----------

[root@mickey ddrucker]# sinfo --version
slurm 21.08.8-2





--
Daniel M. Drucker, Ph.D.
Director of IT, MGB Imaging at Belmont
McLean Hospital, a Harvard Medical School Affiliate




Re: [slurm-users] SlurmdSpoolDir full

2023-12-10 Thread Ryan Novosielski
This is basically always somebody filling up /tmp and /tmp residing on the same 
filesystem as the actual SlurmdSpoolDirectory.

/tmp, without modifications, is almost certainly the wrong place for temporary 
HPC files; they are too large for it.
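
A quick way to check whether that's what is happening on an affected worker (the spool path here is just an example; use whatever scontrol reports):

scontrol show config | grep -i SlurmdSpoolDir
df -h /tmp /var/spool/slurmd   # compare the filesystems backing both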

Sent from my iPhone

> On Dec 8, 2023, at 10:02, Xaver Stiensmeier  wrote:
> 
> Dear slurm-user list,
> 
> during a larger cluster run (the same I mentioned earlier 242 nodes), I
> got the error "SlurmdSpoolDir full". The SlurmdSpoolDir is apparently a
> directory on the workers that is used for job state information
> (https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdSpoolDir). However,
> I was unable to find more precise information on that directory. We
> compute all data on another volume so SlurmdSpoolDir has roughly 38 GB
> of free space where nothing is intentionally put during the run. This
> error only occurred on very few nodes.
> 
> I would like to understand what Slurmd is placing in this dir that fills
> up the space. Do you have any ideas? Due to the workflow used, we have a
> hard time reconstructing the exact scenario that caused this error. I
> guess, the "fix" is to just pick a bit larger disk, but I am unsure
> whether Slurm behaves normal here.
> 
> Best regards
> Xaver Stiensmeier
> 
> 


Re: [slurm-users] Time spent in PENDING/Priority

2023-12-07 Thread Ryan Novosielski
I can’t quite answer the question, but I know that Open XDMoD does provide a 
field that gives this exact information, so they must have a formula they are 
using. They use exclusively the accounting database, AFAIK.
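
For what it's worth, the raw timestamps being discussed can at least be pulled out of the accounting database in bulk, something along these lines (the dates are placeholders):

sacct -X -P -S 2023-11-01 -E 2023-12-01 --format=JobID,Submit,Eligible,Start,End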

--
#BlackLivesMatter

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
 `'

On Dec 7, 2023, at 15:09, Chip Seraphine  wrote:

Hi all,

I am trying to find some good metrics for our slurm cluster, and want it to 
reflect a factor that is very important to users—how long did they have to wait 
because resources were unavailable.  This is a very key metric for us because 
it is a decent approximation of how much life could be improved if we had more 
capacity, so it’d be an important consideration when doing growth planning, 
setting user expectations, etc.  So we are specifically interested in how long 
jobs were in the PENDING state for reason Priority.

Unfortunately, I’m finding that this is difficult to pull out of squeue or the 
accounting data.My first thought was that I could simply subtract 
SubmitTime from EligibleTime (or StartTime), but that includes time spent in 
expected ways, e.g. waiting while an array chugs along.   The delta between 
StartTime and EligibleTime does not reflect the time spent PENDING at all, so 
it’s not useful either.

I can grab some of my own metrics by polling squeue or the REST interface, I 
suppose, but those will be less accurate, more work, and will not allow me to 
see my past history.  I was wondering if there was something I was missing that 
someone on the list has figured out?   Perhaps some existing bit of accounting 
data that can tell me how long a job was stuck behind other jobs?

--

Chip Seraphine
Grid Operations
For support please use help-grid in email or slack.



Re: [slurm-users] SLURM new user query, does SLURM has GUI /Web based management version also

2023-11-28 Thread Ryan Novosielski
It primarily does other things, but you can interact with Slurm in Open 
OnDemand.

--
#BlackLivesMatter

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
 `'

On Nov 19, 2023, at 03:11, Joseph John  wrote:

Dear All,
I am new user, trying out SLURM
Like to check if the SLURM has a GUI/web based management tool also
Thanks
Joseph John



Re: [slurm-users] ReservedCoresPerGPU

2023-11-27 Thread Ryan Novosielski
It does appear that way. Slurm versions are YY.MM.

On Nov 27, 2023, at 17:43, Mike Mikailov  wrote:

Thank you for the quick reply Ryan.

I heard about ReservedCoresPerGPU at the recent SuperComputing conference.

Do you mean ReservedCoresPerGPU is not available yet?

Thanks,
- Mike



On Nov 27, 2023, at 5:34 PM, Ryan Novosielski  wrote:

 Looks like 24.08 to me, so s/introduced/introduces.

--
#BlackLivesMatter

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
 `'

On Nov 27, 2023, at 17:28, Mike Mikailov  wrote:

Hello,

Can someone please tell me which version of Slurm introduced the 
ReservedCoresPerGPU parameter?

Thanks,
-Mike




Re: [slurm-users] ReservedCoresPerGPU

2023-11-27 Thread Ryan Novosielski
Looks like 24.08 to me, so s/introduced/introduces.

--
#BlackLivesMatter

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
 `'

On Nov 27, 2023, at 17:28, Mike Mikailov  wrote:

Hello,

Can someone please tell me which version of Slurm introduced the 
ReservedCoresPerGPU parameter?

Thanks,
-Mike



Re: [slurm-users] slurm comunication between versions

2023-11-24 Thread Ryan Novosielski
What do you mean by management node, slurmctld? Or just a node with the client 
software on it?

--
#BlackLivesMatter

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
 `'

On Nov 23, 2023, at 12:14, Felix  wrote:

Hello

I have a curiosity and a question at the same time.

Will slurm-20.02, which is installed on a management node, communicate with 
slurm-22.05 installed on the work nodes?

They have the same configuration file, slurm.conf.

Or do the versions have to be the same? Slurm 20.02 was installed manually and 
slurm 22.05 was installed through dnf.

Thank you

Felix

--
Dr. Eng. Farcas Felix
National Institute of Research and Development of Isotopic and Molecular 
Technology,
IT - Department - Cluj-Napoca, Romania
Mobile: +40742195323





Re: [slurm-users] ulimits

2023-11-16 Thread Ryan Novosielski
The pam_slurm.so module has an impact on these values, if you are using it.

--
#BlackLivesMatter

|| \\UTGERS, |---*O*---
||_// the State  |     Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
 `'

On Nov 16, 2023, at 15:00, Ozeryan, Vladimir  
wrote:

Hello everyone,

I am having the following issue, on the compute nodes “POSIX message queues” is 
set to unlimited for soft and hard limits.
However, when I do “srun -w node01 --pty bash -I” and then once I am in the 
node I do “cat /proc/SLURMPID/limits” it shows that “Max msgqueue size” is set 
to 819200 for both soft and hard limits.
Where is it being set and how do I change it?

Thank you,

Vlad Ozeryan
AMDS – AB1 Linux-Support
vladimir.ozer...@jhuapl.edu
Ext. 23966



Re: [slurm-users] cpus-per-task behaviour of srun after 22.05

2023-10-22 Thread Ryan Novosielski
What we say at our site is that you should use srun. If you don’t use srun, you 
will see limited, if any, output on resource usage in the various places you 
can see it (sacct, etc.), and I learned recently that sattach won’t work either. 
I find it’s also easier to make mistakes with resource use if you don’t.

We also recommend using it to launch MPI jobs, instead of mpirun/mpiexec/etc. 
and that is our supported means of operation/the way all of the centrally built 
MPI stacks work.
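
As a minimal sketch of what we mean in a batch script (the program name is made up):

#!/bin/bash
#SBATCH --ntasks=8
#SBATCH --time=01:00:00
# launching through srun, rather than mpirun or a bare ./program, is what
# gives per-step usage in sacct and lets sattach attach to the step
srun ./my_mpi_program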

Sent from my iPhone

On Oct 22, 2023, at 12:52, Jason Simms  wrote:


Hello Michael,

I don't have an elegant solution, but I'm writing mostly to +1 this. I didn't 
catch this in the release notes but am concerned if it is indeed the new 
behavior. Researchers use scripts that rely on --cpus-per-task (or -c) as part 
of, e.g., SBATCH directives. I suppose you could simply include something like 
this, unless someone knows why it wouldn't work, but even if so it seems 
inelegant:

export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK

A related question I have, which has come up a couple of times in various other 
contexts, is truly understanding the difference, in a submit script, between 
including srun and not, for example:

srun myscript
myscript

People have asked whether srun is required, or what the difference is if it is 
not included, and honestly it seems like the common reply is that "it doesn't 
matter that much." But, nobody that I've seen (and I've not done an exhaustive 
search) has articulated whether it actually matters to use srun within a batch 
script. Because if this is now the behavior, it appears that simply not using 
srun will still permit the task to use --cpus-per-task.

Warmest regards,
Jason

On Fri, Oct 20, 2023 at 5:00 AM Michael Müller <michael.muelle...@tu-dresden.de> wrote:
Hello,

I haven't really seen this discussed anywhere, but maybe I didn't look
in the right places.

After our upgrade from 21.08 to 23.02 we had users complaining about
srun not using the specified --cpus-per-task given in sbatch-directives.
The changelog of 22.05 mentions this change and explains the need to set
the Environment variable SRUN_CPUS_PER_TASK. The environment variable
SLURM_CPUS_PER_TASK will be set by the sbatch-directive, but is ignored
by srun.

Does anyone know why this behaviour was changed? Imo the expectation
that an sbatch-directive is the default for the whole job-script is
reasonable.

Is there a config option to reenable the old behaviour, or do we have to
find a workaround with a job_submit script or a profile.d script? If so,
have any of you already implemented such a workaround?


With kind regards
Michael

--
Michael Müller
Application Developer

Dresden University of Technology
Center of Information Services and High Performance Computing (ZIH)
Department of Interdisciplinary Application Development and Coordination (IAK)
01062 Dresden

phone: (0351)463-35261
www:www.tu-dresden.de/zih



--
Jason L. Simms, Ph.D., M.P.H.
Manager of Research Computing
Swarthmore College
Information Technology Services
(610) 328-8102
Schedule a meeting: https://calendly.com/jlsimms


Re: [slurm-users] Slurm versions 23.02.6 and 22.05.10 are now available (CVE-2023-41914)

2023-10-13 Thread Ryan Novosielski
If you look at the downloads page, this has happened before:

https://www.schedmd.com/archives.php

This should probably be updated as well, to indicate the new floor after this 
CVE.

But the point is, basically, you’re going to hit this if you don’t upgrade at 
least every ~18 months.

--
#BlackLivesMatter

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
 `'

On Oct 13, 2023, at 06:22, Taras Shapovalov  wrote:

Oh, does this mean that no one should use Slurm versions <= 21.08 any more?

Best regards,

Taras



From: slurm-users on behalf of Bjørn-Helge Mevik
Sent: Thursday, October 12, 2023 11:56
To: slurm-us...@schedmd.com
Subject: Re: [slurm-users] Slurm versions 23.02.6 and 22.05.10 are now 
available (CVE-2023-41914)

Taras Shapovalov  writes:

> Are the older versions affected as well?

Yes, all older versjons are affected.

--
B/H



Re: [slurm-users] A strange situation of different network cards on the same network

2023-10-10 Thread Ryan Novosielski
We have, and have had it come and go with no clear explanation. I’d watch out 
for MTU and netmask troubles, sysctl limits that might be relevant (apparently 
the default settings for time spent processing Ethernet are really appropriate 
for <1 Gb, not so much for faster links), hot spots on the network, etc.

--
#BlackLivesMatter

|| \\UTGERS, |---*O*---
||_// the State  |     Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
 `'

On Oct 10, 2023, at 22:29, James Lam  wrote:

We have a cluster of 176 nodes with an Infiniband switch and 10GbE, and we are 
using the 10GbE network for SSH. Currently we have the older Marvell 10GbE 
cards from launch:
https://support.hpe.com/connect/s/softwaredetails?language=en_US=MTX_117b0672d7ef4c5bb0eca02886

and
Current batch of 10GbE Qlogic card
https://support.hpe.com/connect/s/softwaredetails?language=en_US=MTX_9bd8f647238c4a5f8c72a5221b=revisionHistory

We are using slurm 20.11.4 on the server, and the node health check daemon is 
also deployed using the OpenHPC method. We have no issue with the Marvell 10GbE 
cards, which don't show the Slurm node down <--> idle flapping. However, on the 
newer cards we do have the flip-flop situation of the down <--> idle state.

We tried increasing the ARP caching and changing the client's minor version to 
20.11.9, which doesn't help with the situation.

We would like to see if anyone has faced a similar situation.




Re: [slurm-users] Verifying preemption WON'T happen

2023-09-29 Thread Ryan Novosielski
You can get some information on that from sdiag, and there are tweaks you can 
make to backfill scheduling that affect how quickly it will get to a job.

That doesn’t really answer your real question, but might help you when you are 
looking into this.

Sent from my iPhone

On Sep 29, 2023, at 16:10, Groner, Rob  wrote:


I'm not looking for a one-time answer.  We run these tests anytime we change 
anything related to Slurm: version, configuration, etc. We certainly run 
the test after the system comes back up after an outage, and an hour would be a 
long time to wait for that.  That's certainly the brute-force approach, but I'm 
hoping there's a definitive way to show, through scontrol job output, that the 
job won't preempt.

I could set the preemptexempttime to a smaller value, say 5 minutes instead of 
1 hour, that is true, but there's a few issues with that.


  1.  I would then no longer be testing the system as it actually is.  I want 
to test the system in its actual production configuration.
  2.  If I did lower its value, what would be a safe value?  5 minutes?  Does 
running for 5 minutes guarantee that the higher priority job had a chance to 
preempt it but didn't?  Or did the scheduler even ever get to it?  On a test 
cluster with few jobs, you could be reasonably assured it did, but running 
tests on the production cluster...isn't it possible the scheduler hasn't yet 
had a chance to process it, even after 5 minutes?  Depends on the slurm 
scheduler  settings I suppose

rob


From: slurm-users  on behalf of 
Bernstein, Noam CIV USN NRL (6393) Washington DC (USA) 

Sent: Friday, September 29, 2023 3:14 PM
To: Slurm User Community List 
Subject: Re: [slurm-users] Verifying preemption WON'T happen

On Sep 29, 2023, at 2:51 PM, Davide DelVento <davide.quan...@gmail.com> wrote:

I don't really have an answer for you other than a "hallway comment", that it 
sounds like a good thing which I would test with a simulator, if I had one. 
I've been intrigued by (but really not looked much into) 
https://slurm.schedmd.com/SLUG23/LANL-Batsim-SLUG23.pdf

On Fri, Sep 29, 2023 at 10:05 AM Groner, Rob <rug...@psu.edu> wrote:

I could obviously let the test run for an hour to verify the lower priority job 
was never preempted...but that's not really feasible.

Why not? Isn't it going to take longer than an hour to wait for responses to 
this post? Also, you could set up the minimum time to a much smaller value, so 
it won't take as long to test.


Re: [slurm-users] Steps to upgrade slurm for a patchlevel change?

2023-09-29 Thread Ryan Novosielski
I’ll just say, we haven’t done an online/jobs running upgrade recently (in part 
because we know our database upgrade will take a long time, and we have some 
processes that rely on -M), but we have done it and it does work fine. So the 
paranoia isn’t necessary unless you know that, like us, the DB upgrade time is 
not tenable (Ole’s wiki has some great suggestions for how to test that, but 
they aren’t especially Slurm specific, it’s just a dry-run).

As far as the shared symlink thing goes, I think you’d be fine, dependent on 
whether or not you have anything else stored in the shared software tree, 
changing the symlink and just not restarting compute nodes’ slurmd until you’re 
ready — though again, you can do this while jobs are running, so there’s not 
really a reason to wait, except in cases like ours where it’s just easier to 
reboot the node than one process for running nodes, and then rebooting, and 
wanting to be sure that the rebooted compute node and the running upgraded node 
will operate exactly the same.

On Sep 29, 2023, at 10:10, Paul Edmon  wrote:

This is one of the reasons we stick with using RPM's rather than the symlink 
process. It's just cleaner and avoids the issue of having the install on shared 
storage that may get overwhelmed with traffic or suffer outages. Also the 
package manager automatically removes the previous versions and local installs 
stuff. I've never been a fan of the symlink method has it runs counter to the 
entire point and design of Linux and package managers which are supposed to do 
this heavy lifting for you.

Rant aside :). Generally for minor upgrades the process is less touchy. For our 
setup we follow the following process that works well for us, but does create 
an outage for the period of the upgrade.

1. Set all partitions to down: This makes sure no new jobs are scheduled.
2. Suspend all jobs: This makes sure jobs aren't running while we upgrade.
3. Stop slurmctld and slurmdbd.
4. Upgrade the slurmdbd. Restart slurmdbd
5. Upgrade the slurmd and slurmctld across the cluster.
6. Restart slurmd and slurmctld simultaneously using choria.
7. Unsuspend all jobs
8. Reopen all partitions.
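
A rough shell sketch of steps 1-2 and 7-8 above (the partition name is an example; adapt to however you drive your cluster):

scontrol update partitionname=compute state=down
squeue -h -t running -o %i | xargs -r -n1 scontrol suspend
# ... steps 3-6: upgrade and restart slurmdbd, slurmctld, slurmd ...
squeue -h -t suspended -o %i | xargs -r -n1 scontrol resume
scontrol update partitionname=compute state=up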

For major upgrades we always take a mysqldump and backup the spool for the 
slurmctld before upgrading just in case something goes wrong. We've had this 
happen before when the slurmdbd upgrade cut out early (note, always run the 
slurmdbd and slurmctld upgrades in -D mode and not via systemctl as systemctl 
can timeout and kill the upgrade midway for large upgrades).
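
As a sketch of those precautions (the database name and spool path are site-specific assumptions):

mysqldump --single-transaction slurm_acct_db > slurm_acct_db.before-upgrade.sql
tar czf slurmctld-spool.before-upgrade.tar.gz /var/spool/slurmctld
# run the daemon upgrade in the foreground so systemd cannot time it out
slurmdbd -D -vvv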

That said I've also skipped steps 1, 2, 7, and 8 before for minor upgrades and 
it works fine. The slurmd, slurmctld, and slurmdbd can all run on different 
versions so long as the slurmdbd > slurmctld > slurmd.  So if you want to do a 
live upgrade you can do it. However out paranoia we general stop everything. 
The entire process takes about an hour start to finish, with the longest part 
being the pausing of all the jobs.

-Paul Edmon-

On 9/29/2023 9:48 AM, Groner, Rob wrote:
I did already see the upgrade section of Jason's talk, but it wasn't much about 
the mechanics of the actual upgrade process, more of a big picture it seemed.  
It dealt a lot with different parts of slurm at different versions, which is 
something we don't have.

One little wrinkle here is that while, yes, we're using a symlink to point to 
what version of slurm is the current one...it's all on a shared filesystem.  
So, ALL nodes, slurmdb, slurmctld are using that same symlink.  There is no 
means to upgrade one component at a time.  That means to upgrade, EVERYTHING 
has to come down before it could come back up.  Jason's slides seemed to 
indicate that, if there were separate symlinks, then I could focus on just the 
slurmdb first and upgrade it...then focus on slurmctld and upgrade it, and then 
finally the nodes (take down their slurmd, upgrade the link, bring up slurmd).  
So maybe that's what I'm missing.

Otherwise, I think what I'm saying is that I see references to a "rolling 
upgrade", but I don't see any guide to a rolling upgrade.  I just see the 14 
steps  in https://slurm.schedmd.com/quickstart_admin.html#upgrade, and I guess 
I'd always thought of that as the full octane, high fat upgrade.  I've only 
ever done upgrades during one of our many scheduled downtimes, because the 
upgrades were always to a new major version, and because I'm a scared little 
chicken, so I figured there were maybe some smaller subset of steps if only 
upgrading a patchlevel change.  Smaller change, less risk, less precautionary 
steps...?  I'm seeing now that's not the case.

Thank you all for the suggestions!

Rob



From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Ryan Novosielski <novos...@rutgers.edu>
Sent: Friday, September 29, 2023 2:48 AM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [s

Re: [slurm-users] Steps to upgrade slurm for a patchlevel change?

2023-09-29 Thread Ryan Novosielski
I started off writing there’s really no particular process for these/just do 
your changes and start the new software (be mindful of any PATH that might 
contain data that’s under your software tree, if you have that setup), and that 
you might need to watch the timeouts, but I figured I’d have a look at the 
upgrade guide to be sure.

There’s really nothing onerous in there. I’d personally back up my database and 
state save directories just because I’d rather be safe than sorry, or for if 
have to go backwards and want to be sure. You can run SlurmCtld for a good 
while with no database (note that -M on the command line will be broken during 
that time), just being mindful of the RAM on the SlurmCtld machine/don’t 
restart it before the DB is back up, and backing up our fairly large database 
doesn’t take all that long. Whether or not 5 is required mostly depends on how 
long you think it will take you to do 6-11 (which could really take you seconds 
if your process is really as simple as stop, change symlink, start), 12 you’re 
going to do no matter what, 13 you don’t need if you skipped 5, and 14 is up to 
you. So practically, that’s what you’re going to do anyway.

We just did an upgrade last week, and the only difference is that our compute 
nodes are stateless, so the compute node upgrades were a reboot (we could 
upgrade them running, but we did it during a maintenance period anyway, so 
why?).

If you want to do this with running jobs, I’d definitely back up the state save 
directory, but as long as you watch the timeouts, it’s pretty uneventful. You 
won’t have that long database upgrade period, since no database modifications 
will be required, so it’s pretty much like upgrading anything else.

--
#BlackLivesMatter

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
 `'

On Sep 28, 2023, at 11:58, Groner, Rob  wrote:


There's 14 steps to upgrading slurm listed on their website, including shutting 
down and backing up the database.  So far we've only updated slurm during a 
downtime, and it's been a major version change, so we've taken all the steps 
indicated.

We now want to upgrade from 23.02.4 to 23.02.5.

Our slurm builds end up in version named directories, and we tell production 
which one to use via symlink.  Changing the symlink will automatically change 
it on our slurm controller node and all slurmd nodes.

Is there an expedited, simple, slimmed down upgrade path to follow if we're 
looking at just a . level upgrade?

Rob



Re: [slurm-users] enabling job script archival

2023-09-28 Thread Ryan Novosielski
Sorry for the duplicate e-mail in a short time: do you (or anyone) know when 
the hashing was added? We were planning to enable this on 21.08, but we then had 
to delay our upgrade to it. I’m assuming it was later than that, as I believe 
that’s when the feature was added.

On Sep 28, 2023, at 13:55, Ryan Novosielski  wrote:

Thank you; we’ll put in a feature request for improvements in that area, and 
also thanks for the warning? I thought of that in passing, but the real world 
experience is really useful. I could easily see wanting that stuff to be 
retained less often than the main records, which is what I’d ask for.

I assume that archiving, in general, would also remove this stuff, since old 
jobs themselves will be removed?

--
#BlackLivesMatter

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
 `'

On Sep 28, 2023, at 13:48, Paul Edmon  wrote:

Slurm should take care of it when you add it.

So far as horror stories, under previous versions our database size ballooned 
to be so massive that it actually prevented us from upgrading and we had to 
drop the columns containing the job_script and job_env.  This was back before 
slurm started hashing the scripts so that it would only store one copy of 
duplicate scripts.  After this point we found that the job_script database 
stayed at a fairly reasonable size as most users use functionally the same 
script each time. However the job_env continued to grow like crazy as there are 
variables in our environment that change fairly consistently depending on where 
the user is. Thus job_envs ended up being too massive to keep around and so we 
had to drop them. Frankly we never really used them for debugging. The 
job_scripts though are super useful and not that much overhead.

In summary my recommendation is to only store job_scripts. job_envs add too 
much storage for little gain, unless your job_envs are basically the same for 
each user in each location.

Also it should be noted that there is no way to prune out job_scripts or 
job_envs right now. So the only way to get rid of them if they get large is to 
0 out the column in the table. You can ask SchedMD for the mysql command to do 
this as we had to do it here to our job_envs.

-Paul Edmon-

On 9/28/2023 1:40 PM, Davide DelVento wrote:
In my current slurm installation, (recently upgraded to slurm v23.02.3), I only 
have

AccountingStoreFlags=job_comment

I now intend to add both

AccountingStoreFlags=job_script
AccountingStoreFlags=job_env

leaving the default 4MB value for max_script_size

Do I need to do anything on the DB myself, or will slurm take care of the 
additional tables if needed?

Any comments/suggestions/gotcha/pitfalls/horror_stories to share? I know about 
the additional diskspace and potentially load needed, and with our resources 
and typical workload I should be okay with that.

Thanks!





Re: [slurm-users] enabling job script archival

2023-09-28 Thread Ryan Novosielski
Thank you; we’ll put in a feature request for improvements in that area, and 
also thanks for the warning? I thought of that in passing, but the real world 
experience is really useful. I could easily see wanting that stuff to be 
retained less often than the main records, which is what I’d ask for.

I assume that archiving, in general, would also remove this stuff, since old 
jobs themselves will be removed?

--
#BlackLivesMatter

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
 `'

On Sep 28, 2023, at 13:48, Paul Edmon  wrote:

Slurm should take care of it when you add it.

So far as horror stories, under previous versions our database size ballooned 
to be so massive that it actually prevented us from upgrading and we had to 
drop the columns containing the job_script and job_env.  This was back before 
slurm started hashing the scripts so that it would only store one copy of 
duplicate scripts.  After this point we found that the job_script database 
stayed at a fairly reasonable size as most users use functionally the same 
script each time. However the job_env continued to grow like crazy as there are 
variables in our environment that change fairly consistently depending on where 
the user is. Thus job_envs ended up being too massive to keep around and so we 
had to drop them. Frankly we never really used them for debugging. The 
job_scripts though are super useful and not that much overhead.

In summary my recommendation is to only store job_scripts. job_envs add too 
much storage for little gain, unless your job_envs are basically the same for 
each user in each location.

Also it should be noted that there is no way to prune out job_scripts or 
job_envs right now. So the only way to get rid of them if they get large is to 
0 out the column in the table. You can ask SchedMD for the mysql command to do 
this as we had to do it here to our job_envs.

-Paul Edmon-

On 9/28/2023 1:40 PM, Davide DelVento wrote:
In my current slurm installation, (recently upgraded to slurm v23.02.3), I only 
have

AccountingStoreFlags=job_comment

I now intend to add both

AccountingStoreFlags=job_script
AccountingStoreFlags=job_env

leaving the default 4MB value for max_script_size

Do I need to do anything on the DB myself, or will slurm take care of the 
additional tables if needed?

Any comments/suggestions/gotcha/pitfalls/horror_stories to share? I know about 
the additional diskspace and potentially load needed, and with our resources 
and typical workload I should be okay with that.

Thanks!




Re: [slurm-users] How to use partition option "Hidden"?

2023-08-24 Thread Ryan Novosielski
Our experience was that it only works with AllowGroups, and we probably opened 
a ticket to confirm since it was a choice we didn’t really want to make. That's 
the route we went, but it's currently causing us some problems with accounting, 
because we didn't bother doing some of the accounting stuff since we were using 
groups to control access.
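
For what it's worth, the shape of what worked for us is roughly this (all names invented):

# in our experience it is the AllowGroups= side, not AllowAccounts=, that lets
# those users see a Hidden partition in sinfo
PartitionName=labX Nodes=node[001-004] State=UP Hidden=YES AllowGroups=labx AllowAccounts=labx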

I would really like for Slurm to be able to hide partitions using either 
parameter, or at least offer the option. We are probably going to go back and 
do accounts for reporting purposes, but it would be nice not to have to 
duplicate these things.

--
#BlackLivesMatter

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
 `'

On Aug 24, 2023, at 11:10, Erin Gwen Roberts  wrote:

Hello,

I would like to set a partition to Hidden, and allow members of the appropriate 
Account to see this partition when running `sinfo` with no parameters.

This partition is configured with AllowAccounts=<account> and 
AllowGroups=<group>. <user> is a member of the Slurm account <account>, 
verified with `sacctmgr show assoc where user=<user>`.

I am running Slurm 23.02.4, and I am testing the behavior with `sudo -u <user> 
sinfo`.  I presume this test is accurate because as an admin user, `sinfo` 
shows all partitions, but when I impersonate the user like this, `sinfo` shows 
only the partitions that have Hidden=NO.

I would prefer to do this with only AllowAccounts, but if this only works with 
AllowGroups, that's not a big deal.

We intend to add partitions corresponding to many small research groups, and it 
would be lovely to have `sinfo` output be clean and clear, rather than showing 
all partitions.  What could I be doing wrong, that Hidden= doesn't seem to be 
taking into account AllowAccounts= or AllowGroups=?

Thanks for your help,

Erin Roberts
System Administrator
The Infrastructure Group
MIT CSAIL





Re: [slurm-users] Transport from SLC to Provo?

2023-08-19 Thread Ryan Novosielski
Or an airport hotel for the first night. Done that many times.

Sent from my iPhone

On Aug 19, 2023, at 13:53, Lloyd Brown  wrote:


Something else to consider that I just thought of.

If you're arriving late on Sunday, and SLUG doesn't start until Tuesday, you 
could just get a hotel in SLC somewhere for Sunday night, and head to Provo 
some time on Monday, when the FrontRunner train *is* running.

The Trax light rail that goes to/from the airport and around SLC still runs on 
Sunday, although less often than weekdays.  Obviously check the schedule.

Lloyd


On August 14, 2023 6:12:21 AM MDT, "Styrk, Daryl"  wrote:

I would try Uber or Lyft. You should check with your hotel; they may offer a 
shuttle.

Daryl

On 8/14/23, 4:58 AM, "slurm-users on behalf of Bjørn-Helge Mevik" <slurm-users-boun...@lists.schedmd.com on behalf of b.h.me...@usit.uio.no> wrote:


Dear all,


I'm going to SLUG in Provo in September. My flight lands in Salt Lake
City Airport (SLC) at 7 pm on Sunday 10. I was planning to go by bus or
train from SLC to Provo, but apparently both bus and train have stopped
running by that time on Sundays.


Does anyone know about any alternative way to get to Provo on a Sunday
night?



--
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [slurm-users] extended list of nodes allocated to a job

2023-08-18 Thread Ryan Novosielski
I didn’t know that one! Thank you.

Sent from my iPhone

On Aug 17, 2023, at 09:50, Alain O' Miniussi  wrote:


Hi Sean,

A colleague pointed to me the following commands:

#scontrol show hostname x[1000,1009,1029-1031]
x1000
x1009
x1029
x1030
x1031
#scontrol show hostlist x[1000,1009,1029,1030,1031]
x[1000,1009,1029-1031]
#


Alain Miniussi
DSI, Pôles Calcul et Genie Log.
Observatoire de la Côte d'Azur
Tél. : +33609650665


- On Aug 17, 2023, at 3:12 PM, Sean Mc Grath  wrote:
Hi Alain,

I don't know if slurm can do that natively. python-hostlist, 
https://www.nsc.liu.se/~kent/python-hostlist/, may provide the functionality 
you need. I have used it in the past to generate a list of hosts that can be 
looped over.

Hope that helps.

Sean

---
Sean McGrath
Senior Systems Administrator, IT Services


From: slurm-users  on behalf of Alain O' 
Miniussi 
Sent: Thursday 17 August 2023 13:44
To: Slurm User Community List 
Subject: [slurm-users] extended list of nodes allocated to a job

Hi,

I'm looking for a way to get the list of nodes where a given job is running in 
an uncompressed way.
That is, I'd like to have node1,node2,node3 instead of node1-node3.
Is there a way to achieve that?
I need the information outside the script.

Thanks


Alain Miniussi
DSI, Pôles Calcul et Genie Log.
Observatoire de la Côte d'Azur
Tél. : +33609650665




Re: [slurm-users] Temporary Stop User Submission

2023-05-25 Thread Ryan Novosielski
I tend not to let them login. It will get their attention, and prevent them 
from just running their work on the login node when they discover they can’t 
submit. But appreciate seeing the other options.
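
If you do want the account-settings route mentioned below, one hedged sketch (username invented) is to zero out the association's submit limit and clear it later:

sacctmgr modify user where name=baduser set MaxSubmitJobs=0
# later, to lift the block
sacctmgr modify user where name=baduser set MaxSubmitJobs=-1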

Sent from my iPhone

> On May 25, 2023, at 19:19, Markuske, William  wrote:
> 
>  Hello,
> 
> I have a badly behaving user that I need to speak with and want to 
> temporarily disable their ability to submit jobs. I know I can change their 
> account settings to stop them. Is there another way to set a block on a 
> specific username that I can lift later without removing the user/account 
> associations?
> 
> Regards,
> 
> --
> Willy Markuske
> 
> HPC Systems Engineer
> SDSC - Research Data Services
> (619) 519-4435
> wmarku...@sdsc.edu
> 


Re: [slurm-users] [External] Re: Slurm 22.05.8 - salloc not starting shell on remote host

2023-05-19 Thread Ryan Novosielski
Make sure you don’t have a firewall blocking connections back to the login node 
from the cluster. We had that problem at Rutgers before.
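
One way to make that manageable is to pin down the ports srun listens on and
open just that range back to the submit hosts. A rough sketch, assuming firewalld
and an arbitrarily chosen port range:

# slurm.conf: constrain the ephemeral ports srun/salloc listen on
SrunPortRange=60001-63000

# on the login node, allow compute nodes to connect back on that range
firewall-cmd --permanent --add-port=60001-63000/tcp
firewall-cmd --reload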

Sent from my iPhone

On May 19, 2023, at 13:13, Prentice Bisbal  wrote:



Brian,

Thanks for the reply, and I was hoping that would be the fix, but that doesn't 
seem to be the case. I'm using 22.05.8, which isn't that old. I double-checked 
the archived documentation for version 22.05.8, and setting

LaunchParameters=use_interactive_step



should be valid here. From  
https://slurm.schedmd.com/archive/slurm-22.05.8/slurm.conf.html:

use_interactive_step
Have salloc use the Interactive Step to launch a shell on an allocated compute 
node rather than locally to wherever salloc was invoked. This is accomplished 
by launching the srun command with InteractiveStepOptions as options.

This does not affect salloc called with a command as an argument. These jobs 
will continue to be executed as the calling user on the calling host.

and

InteractiveStepOptions
When LaunchParameters=use_interactive_step is enabled, launching salloc will 
automatically start an srun process with InteractiveStepOptions to launch a 
terminal on a node in the job allocation. The default value is "--interactive 
--preserve-env --pty $SHELL". The "--interactive" option is intentionally not 
documented in the srun man page. It is meant only to be used in 
InteractiveStepOptions in order to create an "interactive step" that will not 
consume resources so that other steps may run in parallel with the interactive 
step.

According to that, setting LaunchParameters=use_interactive_step should be 
enough, since "--interactive --preserve-env --pty $SHELL" is the default.

A colleague pointed out that my slurm.conf was setting LaunchParameters to 
"user_interactive_step" when it should be "use_interactive_step", but changing 
that didn't fix my problem, just changed it. Now when I try to start an 
interactive shell, it just hangs and eventually returns an error:

[pbisbal@ranger ~]$ salloc -n 1 -t 00:10:00 --mem=1G
salloc: Granted job allocation 29
salloc: Waiting for resource configuration
salloc: Nodes ranger-s22-07 are ready for job
srun: error: timeout waiting for task launch, started 0 of 1 tasks
srun: launch/slurm: launch_p_step_launch: StepId=29.interactive aborted before 
step completely launched.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
salloc: Relinquishing job allocation 29
[pbisbal@ranger ~]$




On 5/19/23 11:28 AM, Brian Andrus wrote:

Defaulting to a shell for salloc is a newer feature.

For your version, you should:

srun -n 1 -t 00:10:00 --mem=1G --pty bash

Brian Andrus

On 5/19/2023 8:24 AM, Ryan Novosielski wrote:
I’m not at a computer, and we still run an older version of Slurm, so I can’t say 
with 100% confidence that this has changed and I can’t be too specific, but 
I know that this is the behavior you should expect from that command. I believe 
that there are configuration options to make it behave differently.

Otherwise, you can use srun to run commands on the assigned node.

I think if you search this list for “interactive,” or search the Slurm bugs 
database, you will see some other conversations about this.

Sent from my iPhone

On May 19, 2023, at 10:35, Prentice Bisbal <pbis...@pppl.gov> wrote:



I'm setting up Slurm from scratch for the first time ever. Using 22.05.8 since 
I haven't had a chance to upgrade our DB server to 23.02 yet. When I try to 
use salloc to get a shell on a compute node (ranger-s22-07), I end up with a 
shell on the login node (ranger):

[pbisbal@ranger ~]$ salloc -n 1 -t 00:10:00  --mem=1G
salloc: Granted job allocation 23
salloc: Waiting for resource configuration
salloc: Nodes ranger-s22-07 are ready for job
[pbisbal@ranger ~]$



Any ideas what's going wrong here? I have the following line in my slurm.conf:

LaunchParameters=user_interactive_step


When I run salloc with -v, here's what I see:

[pbisbal@ranger ~]$ salloc -v -n 1 -t 00:10:00  --mem=1G
salloc: defined options
salloc:  
salloc: mem : 1G
salloc: ntasks  : 1
salloc: time: 00:10:00
salloc: verbose : 5
salloc:  
salloc: end of defined options
salloc: debug3: Trying to load plugin /usr/lib64/slurm/select_cons_res.so
salloc: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin 
name:Consumable Resources (CR) Node Selection plugin type:select/cons_res 
version:0x160508
salloc: select/cons_res: common_init: select/cons_res loaded
salloc: debug3: Success.
salloc: debug3: Trying to load plugin /usr/lib64/slurm/select_cons_tres.so
salloc: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin 
name:Trackable RESources (TRES) Selection

Re: [slurm-users] Slurm 22.05.8 - salloc not starting shell on remote host

2023-05-19 Thread Ryan Novosielski
I’m not at a computer, and we still run an older version of Slurm, so I can’t say 
with 100% confidence that this has changed and I can’t be too specific, but 
I know that this is the behavior you should expect from that command. I believe 
that there are configuration options to make it behave differently.

Otherwise, you can use srun to run commands on the assigned node.

I think if you search this list for “interactive,” or search the Slurm bugs 
database, you will see some other conversations about this.

Sent from my iPhone

On May 19, 2023, at 10:35, Prentice Bisbal  wrote:



I'm setting up Slurm from scratch for the first time ever. Using 22.05.8 since 
I haven't had a chance to upgrade our DB server to 23.02 yet. When I try to 
use salloc to get a shell on a compute node (ranger-s22-07), I end up with a 
shell on the login node (ranger):

[pbisbal@ranger ~]$ salloc -n 1 -t 00:10:00  --mem=1G
salloc: Granted job allocation 23
salloc: Waiting for resource configuration
salloc: Nodes ranger-s22-07 are ready for job
[pbisbal@ranger ~]$



Any ideas what's going wrong here? I have the following line in my slurm.conf:

LaunchParameters=user_interactive_step


When I run salloc with -v, here's what I see:

[pbisbal@ranger ~]$ salloc -v -n 1 -t 00:10:00  --mem=1G
salloc: defined options
salloc:  
salloc: mem : 1G
salloc: ntasks  : 1
salloc: time: 00:10:00
salloc: verbose : 5
salloc:  
salloc: end of defined options
salloc: debug3: Trying to load plugin /usr/lib64/slurm/select_cons_res.so
salloc: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin 
name:Consumable Resources (CR) Node Selection plugin type:select/cons_res 
version:0x160508
salloc: select/cons_res: common_init: select/cons_res loaded
salloc: debug3: Success.
salloc: debug3: Trying to load plugin /usr/lib64/slurm/select_cons_tres.so
salloc: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin 
name:Trackable RESources (TRES) Selection plugin type:select/cons_tres 
version:0x160508
salloc: select/cons_tres: common_init: select/cons_tres loaded
salloc: debug3: Success.
salloc: debug3: Trying to load plugin /usr/lib64/slurm/select_cray_aries.so
salloc: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin 
name:Cray/Aries node selection plugin type:select/cray_aries version:0x160508
salloc: select/cray_aries: init: Cray/Aries node selection plugin loaded
salloc: debug3: Success.
salloc: debug3: Trying to load plugin /usr/lib64/slurm/select_linear.so
salloc: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin 
name:Linear node selection plugin type:select/linear version:0x160508
salloc: select/linear: init: Linear node selection plugin loaded with argument 
20
salloc: debug3: Success.
salloc: debug:  Entering slurm_allocation_msg_thr_create()
salloc: debug:  port from net_stream_listen is 43881
salloc: debug:  Entering _msg_thr_internal
salloc: debug4: eio: handling events for 1 objects
salloc: debug3: eio_message_socket_readable: shutdown 0 fd 6
salloc: debug3: Trying to load plugin /usr/lib64/slurm/auth_munge.so
salloc: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin 
name:Munge authentication plugin type:auth/munge version:0x160508
salloc: debug:  auth/munge: init: Munge authentication plugin loaded
salloc: debug3: Success.
salloc: debug3: Trying to load plugin /usr/lib64/slurm/hash_k12.so
salloc: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin 
name:KangarooTwelve hash plugin type:hash/k12 version:0x160508
salloc: debug:  hash/k12: init: init: KangarooTwelve hash plugin loaded
salloc: debug3: Success.
salloc: Granted job allocation 24
salloc: Waiting for resource configuration
salloc: Nodes ranger-s22-07 are ready for job
salloc: debug:  laying out the 1 tasks on 1 hosts ranger-s22-07 dist 8192
[pbisbal@ranger ~]$

This is all I see in /var/log/slurm/slurmd.log on the compute node:

[2023-05-19T10:21:36.898] [24.extern] task/cgroup: _memcg_initialize: job: 
alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
[2023-05-19T10:21:36.899] [24.extern] task/cgroup: _memcg_initialize: step: 
alloc=1024MB mem.limit=1024MB memsw.limit=unlimited



And this is all I see in /var/log/slurm/slurmctld.log on the controller:


[2023-05-19T10:18:16.815] sched: _slurm_rpc_allocate_resources JobId=23 
NodeList=ranger-s22-07 usec=1136
[2023-05-19T10:18:22.423] Time limit exhausted for JobId=22
[2023-05-19T10:21:36.861] sched: _slurm_rpc_allocate_resources JobId=24 
NodeList=ranger-s22-07 usec=1039

Here's my slurm.conf file:


# grep -v ^# /etc/slurm/slurm.conf  | grep -v ^$

ClusterName=ranger
SlurmctldHost=ranger-master
EnforcePartLimits=ALL
JobSubmitPlugins=lua,require_timelimit
LaunchParameters=user_interactive_step
MaxStepCount=2500
MaxTasksPerNode=32
MpiDefault=none
ProctrackType=proctrack/cgroup
PrologFlags=contain
ReturnToService=0

Re: [slurm-users] Migration of slurm communication network / Steps / how to

2023-04-23 Thread Ryan Novosielski
I think it’s easier than all of this. Are you actually changing names of all of 
these things, or just IP addresses? It they all resolve to an IP now and you 
can bring everything down and change the hosts files or DNS, it seems to me 
that if the names aren’t changing, that’s that. I know that “scontrol show 
cluster” will show the wrong IP address but I think that updates itself. 

The names of the servers are in slurm.conf, but again, if the names don’t 
change, that won’t matter. If you have IPs there, you will need to change them. 

Sent from my iPhone

> On Apr 23, 2023, at 14:01, Purvesh Parmar  wrote:
> 
> Hello,
> 
> We have slurm 21.08 on ubuntu 20. We have a cluster of 8 nodes. Entire slurm 
> communication happens over 192.168.5.x network (LAN). However as per 
> requirement, now we are migrating the cluster to other premises and there we 
> have 172.16.1.x (LAN). I have to migrate the entire network including 
> SLURMDBD (mariadb), SLURMCTLD, SLURMD. Also, the cluster network is changing 
> from 192.168.5.x to 172.16.1.x and each node will be assigned an IP 
> address from the 172.16.1.x network. 
> The cluster has been running for the last 3 months and it is required to 
> maintain the old usage stats as well.
> 
> 
>  Is the procedure correct as below :
> 
> 1) Stop slurm
> 2) suspend all the queued jobs
> 3) backup slurm database
> 4) change the slurm & munge configuration i.e. munge conf, mariadb conf, 
> slurmdbd.conf, slurmctld.conf, slurmd.conf (on compute nodes), gres.conf, 
> service file 
> 5) Later, do the update in the slurm database by executing below command
> sacctmgr modify node where node=old_name set name=new_name
> for all the nodes.
> Also, I think the slurm server name and slurmdbd server name are also required 
> to be updated. How to do it, I am still checking.
> 6) Finally, start slurmdbd, slurmctld on server and slurmd on compute nodes
> 
> Please help and guide for above.
> 
> Regards,
> 
> Purvesh Parmar
> INHAIT


Re: [slurm-users] srun --mem issue

2022-12-08 Thread Ryan Novosielski
On Dec 8, 2022, at 03:57, Loris Bennett <loris.benn...@fu-berlin.de> wrote:

Loris Bennett <loris.benn...@fu-berlin.de> writes:

Moshe Mergy <moshe.me...@weizmann.ac.il> writes:

Hi Sandor

I personnaly block "--mem=0" requests in file job_submit.lua (slurm 20.02):

  if (job_desc.min_mem_per_node == 0 or job_desc.min_mem_per_cpu == 0) then
    slurm.log_info("%s: ERROR: unlimited memory requested", log_prefix)
    slurm.log_info("%s: ERROR: job %s from user %s rejected because of an invalid (unlimited) memory request.", log_prefix, job_desc.name, job_desc.user_name)
    slurm.log_user("Job rejected because of an invalid memory request.")
    return slurm.ERROR
  end

What happens if somebody explicitly requests all the memory, so in
Sandor's case --mem=500G ?

Maybe there is a better or nicer solution...

Can't you just use account and QOS limits:

 https://slurm.schedmd.com/resource_limits.html

?

And anyway, what is the use-case for preventing someone using all the
memory? In our case, if someone really needs all the memory, they should be able
to have it.

However, I do have a chronic problem with users requesting too much
memory. My approach has been to try to get people to use 'seff' to see
what resources their jobs in fact need.  In addition each month we
generate a graphical summary of 'seff' data for each user, like the one
shown here

 
https://www.fu-berlin.de/en/sites/high-performance-computing/Dokumentation/Statistik

and automatically send an email to those with a large percentage of
resource-inefficient jobs telling them to look at their graphs and
correct their resource requirements for future jobs.

Cheers,

Loris

I may be wrong about this, but aren’t people penalized in their fair share 
score for using too much memory, and effectively penalized for wasting it by 
“paying” for it even if they don’t need it? They’re also penalized for it by 
likely having to wait longer to have their request satisfied if they specify 
more than they need. That’s generally what I used to tell people.

I also make quite a bit of use of Ole Holm Nielsen’s pestat, to catch jobs that 
are not running efficiently, but that’s not automated, just a way to review.

https://github.com/OleHolmNielsen/Slurm_tools/blob/master/pestat/pestat
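
For a single job, sacct gives a similar quick read on how much of the requested
memory was actually touched. A sketch, with a placeholder job ID:

# ReqMem is what was asked for, MaxRSS what the largest step actually used
sacct -j 1234567 --format=JobID,ReqMem,MaxRSS,Elapsed,State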

--
#BlackLivesMatter

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'


Re: [slurm-users] srun --mem issue

2022-12-08 Thread Ryan Novosielski
> On Dec 8, 2022, at 21:30, Kilian Cavalotti  
> wrote:
> 
> Hi Loris,
> 
> On Thu, Dec 8, 2022 at 12:59 AM Loris Bennett
>  wrote:
>> However, I do have a chronic problem with users requesting too much
>> memory. My approach has been to try to get people to use 'seff' to see
>> what resources their jobs in fact need.  In addition each month we
>> generate a graphical summary of 'seff' data for each user, like the one
>> shown here
>> 
>>  
>> https://www.fu-berlin.de/en/sites/high-performance-computing/Dokumentation/Statistik
>> 
>> and automatically send an email to those with a large percentage of
>> resource-inefficient jobs telling them to look at their graphs and
>> correct their resource requirements for future jobs.
> 
> That's a very clever approach, and the graphs look awesome!
> Would you be willing to share the scripts you're using to generate
> those reports? That sounds like something many sites could benefit
> from!

Agreed, same.

--
#BlackLivesMatter

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'



Re: [slurm-users] Upgrade from 20.11.0 to Slurm version 22.05.6 ?

2022-11-10 Thread Ryan Novosielski
We basically always do this. Just be mindful of how long it takes to upgrade 
your database (if you have the ability to do a dry run, you might want to do 
that). That’s true of any upgrade, though.

If you have to skip more than one version, you’ll have to upgrade in stages.
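
For the dry run itself, one approach (a sketch with placeholder names; adjust
credentials and paths to taste) is to load a dump of the accounting database into
a scratch MariaDB/MySQL instance and let the new slurmdbd convert the copy:

# dump the live accounting database
mysqldump slurm_acct_db > slurm_acct_db.sql
# load it into a throwaway database
mysql -e "CREATE DATABASE slurm_acct_db_test"
mysql slurm_acct_db_test < slurm_acct_db.sql
# point a scratch slurmdbd.conf at slurm_acct_db_test, then time the conversion
time slurmdbd -D -vvv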

On Nov 10, 2022, at 7:00 PM, Michael Gutteridge <michael.gutteri...@gmail.com> wrote:

Theoretically I think you should be able to.  Slurm should upgrade from the 
previous two releases (see 
this)
 and I think that should include 20.11. (20.11 -> 21.08 -> 22.05).  Not 
something I've done though.

 - Michael


On Thu, Nov 10, 2022 at 2:15 PM Sid Young <sid.yo...@gmail.com> wrote:
Is there a direct upgrade path from  20.11.0 to 22.05.6 or is it in multiple 
steps?


Sid Young



On Fri, Nov 11, 2022 at 7:53 AM Marshall Garey <marsh...@schedmd.com> wrote:
We are pleased to announce the availability of Slurm version 22.05.6.

This includes a fix to core selection for steps which could result in
random task launch failures, alongside a number of other moderate
severity issues.

- Marshall

--
Marshall Garey
Release Management, Support, and Development
SchedMD LLC - Commercial Slurm Development and Support

> * Changes in Slurm 22.05.6
> ==
>  -- Fix a partition's DisableRootJobs=no from preventing root jobs from 
> working.
>  -- Fix the number of allocated cpus for an auto-adjustment case in which the
> job requests --ntasks-per-node and --mem (per-node) but the limit is
> MaxMemPerCPU.
>  -- Fix POWER_DOWN_FORCE request leaving node in completing state.
>  -- Do not count magnetic reservation queue records towards backfill limits.
>  -- Clarify error message when --send-libs=yes or BcastParameters=send_libs
> fails to identify shared library files, and avoid creating an empty
> "_libs" directory on the target filesystem.
>  -- Fix missing CoreSpec on dynamic nodes upon slurmctld restart.
>  -- Fix node state reporting when using specialized cores.
>  -- Fix number of CPUs allocated if --cpus-per-gpu used.
>  -- Add flag ignore_prefer_validation to not validate --prefer on a job.
>  -- Fix salloc/sbatch SLURM_TASKS_PER_NODE output environment variable when 
> the
> number of tasks is not requested.
>  -- Permit using wildcard magic cookies with X11 forwarding.
>  -- cgroup/v2 - Add check for swap when running OOM check after task
> termination.
>  -- Fix deadlock caused by race condition when disabling power save with a
> reconfigure.
>  -- Fix memory leak in the dbd when container is sent to the database.
>  -- openapi/dbv0.0.38 - correct dbv0.0.38_tres_info.
>  -- Fix node SuspendTime, SuspendTimeout, ResumeTimeout being updated after
> altering partition node lists with scontrol.
>  -- jobcomp/elasticsearch - fix data_t memory leak after serialization.
>  -- Fix issue where '*' wasn't accepted in gpu/cpu bind.
>  -- Fix SLURM_GPUS_ON_NODE for shared GPU gres (MPS, shards).
>  -- Add SLURM_SHARDS_ON_NODE environment variable for shards.
>  -- Fix srun error with overcommit.
>  -- Fix bug in core selection for the default cyclic distribution of tasks
> across sockets, that resulted in random task launch failures.
>  -- Fix core selection for steps requesting multiple tasks per core when
> allocation contains more cores than required for step.
>  -- gpu/nvml - Fix MIG minor number generation when GPU minor number
> (/dev/nvidia[minor_number]) and index (as seen in nvidia-smi) do not 
> match.
>  -- Fix accrue time underflow errors after slurmctld reconfig or restart.
>  -- Surpress errant errors from prolog_complete about being unable to locate
> "node:(null)".
>  -- Fix issue where shards were selected from multiple gpus and failed to
> allocate.
>  -- Fix step cpu count calculation when using --ntasks-per-gpu=.
>  -- Fix overflow problems when validating array index parameters in slurmctld
> and prevent a potential condition causing slurmctld to crash.
>  -- Remove dependency on json-c in slurmctld when running with power saving.
> Only the new "SLURM_RESUME_FILE" support relies on this, and it will be
> disabled if json-c support is unavailable instead.




Re: [slurm-users] Using "srun" on compute nodes -- Ray cluster

2022-07-15 Thread Ryan Novosielski
Are you talking about a script that is run via sbatch containing srun 
command lines? If so, there are a lot of reasons to do that. One is 
better instrumentation, as I understand it, but also srun --mpi is a way 
to eliminate mpiexec/mpirun/etc., and is what we recommend at our site 
instead (using the PMI2 or PMIx methods).
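
As a rough sketch of what that looks like inside a batch script (the binary name
is a placeholder, and --mpi=pmix assumes Slurm was built with PMIx support;
--mpi=pmi2 is the other common choice):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --time=01:00:00
# launch the MPI ranks directly with srun instead of mpiexec/mpirun
srun --mpi=pmix ./my_mpi_app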


On 7/15/22 05:17, Kamil Wilczek wrote:

Dear Slurm Users,

one of my cluster users would like to run a Ray cluster on Slurm.
I noticed that the batch script example requires running the "srun"
command on a compute node, which already is allocated:
https://docs.ray.io/en/latest/cluster/examples/slurm-template.html#slurm-template 



This is the first time I see or hear about this type of usage
and I have problems wrapping my head around this.
Is there anything wrong or unusual about this? I understand that
this would allocate some resources on other nodes. Would
Slurm enforce limits properly ("qos" or "partition" limits)?

Kind Regards


--
#BlackLivesMatter

 || \\UTGERS, |--*O*----
 ||_// the State  |Ryan Novosielski - novos...@rutgers.edu
 || \\ University | Sr. Technologist - 973/972.0922 ~*~ RBHS Campus
 ||  \\of NJ  | Office of Advanced Res. Comp. - MSB C630, Newark
  `'



Re: [slurm-users] Need to restart slurmctld for gres jobs to start

2022-06-24 Thread Ryan Novosielski

On 6/2/22 14:02, tluchko wrote:

Hello,

I have recently started to have problems where jobs sit in the queue 
waiting for resources to become available, even when the resources are 
available. If I stop and restart slurmctld, the pending jobs start running.


This seems to be related to GRES jobs.  I have seven nodes with

Gres=bandwidth:ib:no_consume:1G

four nodes with

Gres=gpu:gtx_titan_x:4,bandwidth:ethernet:no_consume:1G

and one node with.

Gres=gpu:rtx_2080_ti:4,bandwidth:ethernet:no_consume:1G

Jobs only sit in the queue with RESOURCES as the REASON when we include 
the flag --gres=bandwidth:ib.  If we remove the flag, the jobs run fine. 
  But we need the flag to ensure that we don't get a mix of IB and 
ethernet nodes because they fail in this case.


It seems that once a node completes a job with --gres=bandwidth:ib it 
won't run another job with this setting until I restart slurmctld.


The only error I can find is in /var/log/slurm/slurmctld.log

[2022-05-31T03:27:49.144] error: gres/bandwidth: _step_dealloc 
StepId=140569.0 dealloc, node_in_use is NULL


These jobs were running consistently but then started giving us trouble 
about a month ago. I have tried restarting slurmd on all nodes and 
slurmctld.  Restarting slurmctld does provide a temporary fix.


I'm using Slurm 21.08.3 and Rocky Linux release 8.5.

Do you have any suggestions as to what is wrong or how to fix it?

Thank you,

Tyler


Another alternate way to deal with this is the topology plugin. We use 
this to keep jobs from spanning two different infiniband fabrics that 
are connected together via lower bandwidth than the rest of the fabric.
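
A rough sketch of that kind of layout, with made-up switch and node names:

# slurm.conf
TopologyPlugin=topology/tree

# topology.conf: one leaf switch per IB fabric, joined by a top-level switch
SwitchName=ibfabric1 Nodes=node[001-032]
SwitchName=ibfabric2 Nodes=node[033-064]
SwitchName=top Switches=ibfabric[1-2]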


--
#BlackLivesMatter

 || \\UTGERS, |--*O*
 ||_// the State  |Ryan Novosielski - novos...@rutgers.edu
 || \\ University | Sr. Technologist - 973/972.0922 ~*~ RBHS Campus
 ||  \\of NJ  | Office of Advanced Res. Comp. - MSB C630, Newark
  `'



Re: [slurm-users] DBD Reset

2022-06-15 Thread Ryan Novosielski
It very much rang a bell!

I think there is also an scontrol command that you can use to show the actual 
running config (probably “show config”), which will include the defaults if you 
are seeing something that you don’t have specified in the config file.
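
Something along these lines, for instance, to confirm what the daemons actually
think the accounting settings and the registered cluster are:

scontrol show config | grep -i accountingstorage
sacctmgr show cluster format=Cluster,ControlHost,ControlPort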

Sent from my iPhone

On Jun 15, 2022, at 21:35, Reed Dier  wrote:

 Well, you nailed it.

Honestly a little surprised it was working to begin with.

In the DBD conf
-#DbdPort=7031
+DbdPort=7031

And then in the slurm.conf
-#AccountingStoragePort=3306
+AccountingStoragePort=7031

I’m not sure how my slurm.conf showed the 3306 mysql port commented out.
I did confirm that the slurmdbd was listening on 6819 before,  so I assumed 
that the default would be 6819 on the dbd and the “client” (ctld or otherwise) 
side, but somehow that wasn’t the case?

Either way, I do feel like things are getting back to the right state.
So thank you so much for pointing me in the correct direction.

Thanks,
Reed

On Jun 15, 2022, at 7:50 PM, Ryan Novosielski <novos...@rutgers.edu> wrote:

Apologies for not having more concrete information available when I’m replying 
to you, but I figured maybe having a fast hint might be better.

Have a look at how the various daemons communicate with one another. This 
sounds to me like a firewall thing between maybe the SlurmCtld and where the 
SlurmDBD is running right now, or vice-versa or something like that. The 
“sacctmgr show cluster” thing is a giveaway. That is populated dynamically, not 
pulled from a config file exactly.

I ran into this exact thing years ago, but can’t remember where the firewall 
was the issue.

Sent from my iPhone

On Jun 15, 2022, at 20:12, Reed Dier <reed.d...@focusvq.com> wrote:

 Hoping this is an easy answer.

My mysql instance somehow corrupted itself, and I’m having to purge and start 
over.
This is ok, because the data in there isn’t too valuable, and we aren’t making 
use of associations or anything like that yet (no AccountingStorageEnforce).

That said, I’ve decided to put the dbd’s mysql instance on my main database 
server, rather than in a small vm alongside the dbd.
Jobs are still submitting alright, and after adding the cluster back with 
`sacctmgr create cluster $cluster` it seems to have stopped the log firehose.
The issue I’m mainly seeing now is in the dbd logs:

[2022-06-15T19:40:43.064] error: _add_registered_cluster: trying to register a 
cluster ($cluster) with no remote port
[2022-06-15T19:40:43.065] error: _add_registered_cluster: trying to register a 
cluster ($cluster) with no remote port
[2022-06-15T19:45:39.827] error: _add_registered_cluster: trying to register a 
cluster ($cluster) with no remote port
[2022-06-15T19:48:01.038] error: _add_registered_cluster: trying to register a 
cluster ($cluster) with no remote port
[2022-06-15T19:48:01.039] error: _add_registered_cluster: trying to register a 
cluster ($cluster) with no remote port
[2022-06-15T19:48:38.104] error: _add_registered_cluster: trying to register a 
cluster ($cluster) with no remote port
[2022-06-15T19:50:39.290] error: _add_registered_cluster: trying to register a 
cluster ($cluster) with no remote port
[2022-06-15T19:55:39.769] error: _add_registered_cluster: trying to register a 
cluster ($cluster) with no remote port

And if I run
$ sacctmgr show cluster
Cluster ControlHost  ControlPort   RPC Share GrpJobs   GrpTRES 
GrpSubmit MaxJobs   MaxTRES MaxSubmit MaxWall  QOS   
Def QOS
 -- ---  - - --- - 
- --- - - ---  
-
   $cluster0 0 1
   normal

I can see the ControlHost, ControlPort, and RPC are all missing.
So I’m not sure what I need to do to figure out how to effectively reset my dbd.
Also, $cluster in sacctmgr matches ClusterName=$cluster in my slurm.conf.

The only thing that has changed is the StorageHost in the dbd conf, and I made 
the database, user, and grant all on slurm_acct_db.*, on the new database 
server.
And I’ve verified that it has made tables, and that I can connect from the host 
with the correct credentials.

mysql> show tables;
+--+
| Tables_in_slurm_acct_db  |
+--+
| acct_coord_table |
| acct_table   |
| $cluster_assoc_table |
| $cluster_assoc_usage_day_table   |
| $cluster_assoc_usage_hour_table  |
| $cluster_assoc_usage_month_table |
| $cluster_event_table |
| $cluster_job_table   |
| $cluster_last_ran_table  |
| $cluster_resv_table  |
| $cluster_step_table  |
| $cluster_suspend_table   |
| $cluster_usage_day_table |
| $cluster_usage_hour_table|
| $cluster_us

Re: [slurm-users] DBD Reset

2022-06-15 Thread Ryan Novosielski
Apologies for not having more concrete information available when I’m replying 
to you, but I figured maybe having a fast hint might be better.

Have a look at how the various daemons communicate with one another. This 
sounds to me like a firewall thing between maybe the SlurmCtld and where the 
SlurmDBD is running right now, or vice-versa or something like that. The 
“sacctmgr show cluster” thing is a giveaway. That is populated dynamically, not 
pulled from a config file exactly.

I ran into this exact thing years ago, but can’t remember where the firewall 
was the issue.

Sent from my iPhone

On Jun 15, 2022, at 20:12, Reed Dier  wrote:

 Hoping this is an easy answer.

My mysql instance somehow corrupted itself, and I’m having to purge and start 
over.
This is ok, because the data in there isn’t too valuable, and we aren’t making 
use of associations or anything like that yet (no AccountingStorageEnforce).

That said, I’ve decided to put the dbd’s mysql instance on my main database 
server, rather than in a small vm alongside the dbd.
Jobs are still submitting alright, and after adding the cluster back with 
`sacctmgr create cluster $cluster` it seems to have stopped the log firehose.
The issue I’m mainly seeing now is in the dbd logs:

[2022-06-15T19:40:43.064] error: _add_registered_cluster: trying to register a 
cluster ($cluster) with no remote port
[2022-06-15T19:40:43.065] error: _add_registered_cluster: trying to register a 
cluster ($cluster) with no remote port
[2022-06-15T19:45:39.827] error: _add_registered_cluster: trying to register a 
cluster ($cluster) with no remote port
[2022-06-15T19:48:01.038] error: _add_registered_cluster: trying to register a 
cluster ($cluster) with no remote port
[2022-06-15T19:48:01.039] error: _add_registered_cluster: trying to register a 
cluster ($cluster) with no remote port
[2022-06-15T19:48:38.104] error: _add_registered_cluster: trying to register a 
cluster ($cluster) with no remote port
[2022-06-15T19:50:39.290] error: _add_registered_cluster: trying to register a 
cluster ($cluster) with no remote port
[2022-06-15T19:55:39.769] error: _add_registered_cluster: trying to register a 
cluster ($cluster) with no remote port

And if I run
$ sacctmgr show cluster
Cluster ControlHost  ControlPort   RPC Share GrpJobs   GrpTRES 
GrpSubmit MaxJobs   MaxTRES MaxSubmit MaxWall  QOS   
Def QOS
 -- ---  - - --- - 
- --- - - ---  
-
   $cluster0 0 1
   normal

I can see the ControlHost, ControlPort, and RPC are all missing.
So I’m not sure what I need to do to figure out how to effectively reset my dbd.
Also, $cluster in sacctmgr matches ClusterName=$cluster in my slurm.conf.

The only thing that has changed is the StorageHost in the dbd conf, and I made 
the database, user, and grant all on slurm_acct_db.*, on the new database 
server.
And I’ve verified that it has made tables, and that I can connect from the host 
with the correct credentials.

mysql> show tables;
+--+
| Tables_in_slurm_acct_db  |
+--+
| acct_coord_table |
| acct_table   |
| $cluster_assoc_table |
| $cluster_assoc_usage_day_table   |
| $cluster_assoc_usage_hour_table  |
| $cluster_assoc_usage_month_table |
| $cluster_event_table |
| $cluster_job_table   |
| $cluster_last_ran_table  |
| $cluster_resv_table  |
| $cluster_step_table  |
| $cluster_suspend_table   |
| $cluster_usage_day_table |
| $cluster_usage_hour_table|
| $cluster_usage_month_table   |
| $cluster_wckey_table |
| $cluster_wckey_usage_day_table   |
| $cluster_wckey_usage_hour_table  |
| $cluster_wckey_usage_month_table |
| clus_res_table   |
| cluster_table|
| convert_version_table|
| federation_table |
| qos_table|
| res_table|
| table_defs_table |
| tres_table   |
| txn_table|
| user_table   |
+--+
29 rows in set (0.01 sec)

Any tips are appreciated.

21.08.7 and Ubuntu 20.04.
Slurmdbd and slurmctld(1) are running on one host, and slurmctld(2) is running 
on another host, and is the primary.

Thanks,
Reed


Re: [slurm-users] sbatch - accept jobs above limits

2022-02-08 Thread Ryan Novosielski
I’m not 100% certain that this affects this situation, but there’s a slurm.conf 
setting called EnforcePartLimits that you might want to change.
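
For what it’s worth, its values are ALL, ANY and NO. Whether it covers this
particular memory check I’m not sure, but the permissive end of it is a one-line
sketch like this, which leaves such jobs queued rather than rejected at submit time:

# slurm.conf
EnforcePartLimits=NO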

--
#BlackLivesMatter

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'

> On Feb 8, 2022, at 5:26 PM, z1...@arcor.de wrote:
> 
> 
> Dear all,
> 
> sbatch jobs are immediately rejected if no suitable node is available in
> the configuration.
> 
> 
> sbatch: error: Memory specification can not be satisfied
> sbatch: error: Batch job submission failed: Requested node configuration
> is not available
> 
> These jobs should be accepted, if a suitable node will be active soon.
> For example, these jobs could be in PartitionConfig.
> 
> Is that configurable?
> 
> 
> Many thanks,
> 
> Mike
> 



Re: [slurm-users] Upgrade from 17.02.11 to 21.08.2 and state information

2022-02-03 Thread Ryan Novosielski


> On Feb 3, 2022, at 2:55 PM, Ole Holm Nielsen  
> wrote:
> 
> On 03-02-2022 16:37, Nathan Smith wrote:
>> Yes, we are running slurmdbd. We could arrange enough downtime to do an 
>> incremental upgrade of major versions as Brian Andrus suggested, at least on 
>> the slurmctld and slurmdbd systems. The slurmds I would just do a direct 
>> upgrade once the scheduler work was completed.
> 
> As Brian Andrus said, you must upgrade Slurm by at most 2 major versions, and 
> that includes slurmd's as well!  Don't do a "direct upgrade" of slurmd by 
> more than 2 versions!
> 
> I recommend separate physical servers for slurmdbd and slurmctld.  Then you 
> can upgrade slurmdbd without taking the cluster offline.  It's OK for 
> slurmdbd to be down for many hours, since slurmctld caches the state 
> information in the meantime.

The one thing you want to watch out for here – maybe more so if you are using a 
VM than a physical server as you may have sized the RAM for how much slurmctld 
appears to need, as we did – is that that caching that takes place on the 
slurmctld uses memory (I guess obviously, when you think about it). The result 
there can be that eventually if you have slurmd down for a long time (we had 
someone who was hitting a bug that would start running jobs right after 
everyone went to sleep for example), your slurmctld can run out of memory, 
crash, and then that cache is lost. You don’t normally see that memory being 
used like that, because slurmdbd is normally up/accepting the accounting data.

--
#BlackLivesMatter

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'



Re: [slurm-users] srun : Communication connection failure

2022-01-25 Thread Ryan Novosielski
I’m coming to this question late, and this is not the answer to your problem 
(well, maybe tangentially), but it may help someone else: my recollection is 
that the compute node that gets assigned the job must be able to contact the 
node you’re starting the interactive job from (so bg-slurmb-login1 here) on a 
wide variety of ports in the case of interactive jobs. For us, we had a 
firewall config that didn’t allow for that and all interactive jobs failed 
until we resolved that. I guess having the wrong address someplace could 
mimic that behavior.

--
#BlackLivesMatter

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'

> On Jan 20, 2022, at 9:40 AM, Durai Arasan  wrote:
> 
> Hello Slurm users,
> 
> We are suddenly encountering strange errors while trying to launch 
> interactive jobs on our cpu partitions. Have you encountered this problem 
> before? Kindly let us know.
> 
> [darasan84@bg-slurmb-login1 ~]$ srun --job-name "admin_test231" --ntasks=1 
> --nodes=1 --cpus-per-task=1 --partition=cpu-short --mem=1G  
> --nodelist=slurm-cpu-hm-7 --time 1:00:00 --pty bash
> srun: error: Task launch for StepId=1137134.0 failed on node slurm-cpu-hm-7: 
> Communication connection failure
> srun: error: Application launch failed: Communication connection failure
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> srun: error: Timed out waiting for job step to complete
> 
> Best regards,
> Durai Arasan
> MPI Tuebingen



Re: [slurm-users] Updated "pestat" tool for printing Slurm nodes status including GRES/GPU

2021-12-14 Thread Ryan Novosielski
Did a git bisect and answered my own question: “yes.”

[novosirj@amarel1 Slurm_tools]$ git bisect good
72cd05d78f1077142143f20c4293c8c367ffb5a7 is the first bad commit
commit 72cd05d78f1077142143f20c4293c8c367ffb5a7
Author: OleHolmNielsen 
Date:   Fri Apr 23 15:11:37 2021 +0200

Changes related to "squeue -O".  May not work with Slurm 19.05 and older.

:04 04 dee11077f72dd898dcadccf9d0dd2cfc438a8d1f 
61880fe14a49a7a96167b89d21dede41f2751d86 M  pestat

> On Dec 14, 2021, at 4:29 PM, Ryan Novosielski  wrote:
> 
> Hi Ole,
> 
> Thanks again for your great tools!
> 
> Is something expected to have broken this script for older versions of Slurm 
> somehow? A version we have with a file time of 1/19/21 will show job IDs and 
> users for a given node, but the version you released yesterday does not seem 
> to (we may have missed versions in the middle, so it may not be this version 
> that did it):
> 
> Older: 
> 
> [root@amarel1 pestat]# ./pestat -F -w slepner080
> Print only nodes that are flagged by * (RED nodes)
> Select only nodes in hostlist=slepner080
> Hostname   Partition Node Num_CPU  CPUload  Memsize  Freemem  Joblist
>State Use/Tot  (MB) (MB)  JobId 
> User ...
> slepner080   main*  mix  22  241.07*   128000   116325  
> 17036194 mt1044 17032319 as2654 17039145 vs670  
> 
> Current:
> 
> [root@amarel1 pestat]# ~novosirj/bin/pestat -F -w slepner080
> Print only nodes that are flagged by * (RED nodes)
> Select only nodes in hostlist=slepner080
> Hostname Partition Node Num_CPU  CPUload  Memsize  Freemem  
> Joblist
>  State Use/Tot  (15min) (MB) (MB)  JobID 
> User ...
> slepner080   main* mix   22  241.07*   128000   116325   
> 
> You can see Joblist and JobID User are not present.
> 
> --
> #BlackLivesMatter
> 
> || \\UTGERS,   
> |---*O*---
> ||_// the State| Ryan Novosielski - novos...@rutgers.edu
> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> ||  \\of NJ| Office of Advanced Research Computing - MSB C630, 
> Newark
> `'
> 
>> On Dec 13, 2021, at 7:09 AM, Ole Holm Nielsen  
>> wrote:
>> 
>> Hi Slurm users,
>> 
>> I have updated the "pestat" tool for printing Slurm nodes status with 1 line 
>> per node including job info.  The download page is 
>> https://github.com/OleHolmNielsen/Slurm_tools/tree/master/pestat
>> (also listed in https://slurm.schedmd.com/download.html).
>> 
>> Improvements:
>> 
>> * The GRES/GPU output option "pestat -G" now prints the job gres/gpu 
>> information as obtained from squeue's tres-alloc output option, which should 
>> contain the most correct GRES/GPU information.
>> 
>> If you have a cluster with GPUs, could you try out the latest version and 
>> send me any feedback?
>> 
>> Thanks to René Sitt for helpful suggestions and testing.
>> 
>> The pestat tool can print a large variety of node and job information, and 
>> is generally useful for monitoring nodes and jobs on Slurm clusters.  For 
>> command options and examples please see the download page.  My own favorite 
>> usage is "pestat -F".
>> 
>> Thanks,
>> Ole
>> 
>> -- 
>> Ole Holm Nielsen
>> PhD, Senior HPC Officer
>> Department of Physics, Technical University of Denmark
>> 
> 



Re: [slurm-users] How to get an estimate of job completion for planned maintenance?

2021-12-14 Thread Ryan Novosielski
Another useful format string – and again, this is if you mess up and don’t do a 
reservation early enough (or your environment has no concept of a time limit) – 
is this one:

squeue -o %u,%i,%L

Will show you username, job id, and remaining time – which is sometimes easier 
to deal with than end date/time.
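
If the question is specifically which of those could collide with a maintenance
window, comparing the end time (%e) against the window start also works. A
sketch, with a made-up maintenance start time:

# running jobs whose projected end time falls after the (placeholder) maintenance start
squeue -h -t running -o "%u %i %e" | awk -v maint="2021-12-20T08:00:00" '$3 > maint'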

--
#BlackLivesMatter

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'

> On Nov 7, 2021, at 7:45 AM, Carsten Beyer  wrote:
> 
> Hi Ahmad,
> 
> you could use squeue -h -t r --format="%i %e" | sort -k2 to get a list of all 
> running jobs sorted by their endtime.
> 
> We normally use a maintenance reservation with the start time of the maintenance (or 
> with some lead time before it) to get the system free of jobs. That makes 
> things easier, because if you just drain your cluster no new jobs can start at all. 
> With the reservation, jobs with a shorter wallclock time can still be backfilled 
> until the reservation/maintenance starts. You can put the reservation into the 
> system at any time, but at the latest at or before "<maintenance start> minus <MaxTime of partition>", e.g.
> 
> scontrol create reservation=<name> starttime=<start time> duration=<duration> 
> user=root flags=maint nodes=ALL
> 
> Hope, that helps a little bit,
> 
> Carsten
> 
> -- 
> Carsten Beyer
> Abteilung Systeme
> 
> Deutsches Klimarechenzentrum GmbH (DKRZ)
> Bundesstraße 45a * D-20146 Hamburg * Germany
> 
> Phone:  +49 40 460094-221
> Fax:+49 40 460094-270
> Email:  be...@dkrz.de
> URL:http://www.dkrz.de
> 
> Geschäftsführer: Prof. Dr. Thomas Ludwig
> Sitz der Gesellschaft: Hamburg
> Amtsgericht Hamburg HRB 39784
> 
> 
> Am 05.11.2021 um 23:16 schrieb Ahmad Khalifa:
>> If I plan maintenance on a certain day, how long before that day should I 
>> set the queue to drain mode?! Is there a way to estimate the completion date 
>> / time of current running jobs?!
>> 
>> Regards.
> 



Re: [slurm-users] Updated "pestat" tool for printing Slurm nodes status including GRES/GPU

2021-12-14 Thread Ryan Novosielski
Hi Ole,

Thanks again for your great tools!

Is something expected to have broken this script for older versions of Slurm 
somehow? A version we have with a file time of 1/19/21 will show job IDs and 
users for a given node, but the version you released yesterday does not seem to 
(we may have missed versions in the middle, so it may not be this version that 
did it):

Older: 

[root@amarel1 pestat]# ./pestat -F -w slepner080
Print only nodes that are flagged by * (RED nodes)
Select only nodes in hostlist=slepner080
Hostname   Partition Node Num_CPU  CPUload  Memsize  Freemem  Joblist
State Use/Tot  (MB) (MB)  JobId 
User ...
slepner080   main*  mix  22  241.07*   128000   116325  
17036194 mt1044 17032319 as2654 17039145 vs670  

Current:

[root@amarel1 pestat]# ~novosirj/bin/pestat -F -w slepner080
Print only nodes that are flagged by * (RED nodes)
Select only nodes in hostlist=slepner080
Hostname Partition Node Num_CPU  CPUload  Memsize  Freemem  Joblist
  State Use/Tot  (15min) (MB) (MB)  JobID 
User ...
slepner080   main* mix   22  241.07*   128000   116325   

You can see Joblist and JobID User are not present.

--
#BlackLivesMatter

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'

> On Dec 13, 2021, at 7:09 AM, Ole Holm Nielsen  
> wrote:
> 
> Hi Slurm users,
> 
> I have updated the "pestat" tool for printing Slurm nodes status with 1 line 
> per node including job info.  The download page is 
> https://github.com/OleHolmNielsen/Slurm_tools/tree/master/pestat
> (also listed in https://slurm.schedmd.com/download.html).
> 
> Improvements:
> 
> * The GRES/GPU output option "pestat -G" now prints the job gres/gpu 
> information as obtained from squeue's tres-alloc output option, which should 
> contain the most correct GRES/GPU information.
> 
> If you have a cluster with GPUs, could you try out the latest version and 
> send me any feedback?
> 
> Thanks to René Sitt for helpful suggestions and testing.
> 
> The pestat tool can print a large variety of node and job information, and is 
> generally useful for monitoring nodes and jobs on Slurm clusters.  For 
> command options and examples please see the download page.  My own favorite 
> usage is "pestat -F".
> 
> Thanks,
> Ole
> 
> -- 
> Ole Holm Nielsen
> PhD, Senior HPC Officer
> Department of Physics, Technical University of Denmark
> 



Re: [slurm-users] [External] Re: PropagateResourceLimits

2021-04-29 Thread Ryan Novosielski
It may not for specifically PropagateResourceLimits – as I said, the docs are a 
little sparse on the “how” this actually works – but you’re not correct that 
PAM doesn’t come into play re: user jobs. If you have “UsePam = 1” set, and 
have an /etc/pam.d/slurm, as our site does, there is some amount of interaction 
here, and PAM definitely affects user jobs.
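
For reference, this is the sort of thing such an /etc/pam.d/slurm stack tends to
contain. A sketch only, since module choices vary by site; pam_limits.so is the
piece that applies /etc/security/limits.conf on the compute node:

# /etc/pam.d/slurm (sketch)
auth     required  pam_localuser.so
account  required  pam_unix.so
session  required  pam_limits.so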

> On Apr 27, 2021, at 11:31 AM, Prentice Bisbal  wrote:
> 
> I don't think PAM comes into play here. Since Slurm is starting the processes 
> on the compute nodes as the user, etc., PAM is being bypassed.
> 
> Prentice
> 
> 
> On 4/22/21 10:55 AM, Ryan Novosielski wrote:
>> My recollection is that this parameter is talking about “ulimit” parameters, 
>> and doesn’t have to do with cgroups. The documentation is not as clear here 
>> as it could be, about what this does, the mechanism by which it’s applied 
>> (PAM module), etc.
>> 
>> Sent from my iPhone
>> 
>>> On Apr 22, 2021, at 09:07, Diego Zuccato  wrote:
>>> 
>>> Hello all.
>>> 
>>> I'd need a clarification about PropagateResourceLimits.
>>> If I set it to NONE, will cgroup still limit the resources a job can use on 
>>> the worker node(s), actually decoupling limits on the frontend from limits 
>>> on the worker nodes?
>>> 
>>> I've been bitten by the default being ALL, so when I tried to limit to 1GB 
>>> soft / 4GB hard the memory users can use on the frontend, the jobs began to 
>>> fail at startup even if they requested 200G (that are available on the 
>>> worker nodes but not on the frontend)...
>>> 
>>> Tks.
>>> 
>>> -- 
>>> Diego Zuccato
>>> DIFA - Dip. di Fisica e Astronomia
>>> Servizi Informatici
>>> Alma Mater Studiorum - Università di Bologna
>>> V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
>>> tel.: +39 051 20 95786
>>> 
> 



Re: [slurm-users] PropagateResourceLimits

2021-04-22 Thread Ryan Novosielski
My recollection is that this parameter is talking about “ulimit” parameters, 
and doesn’t have to do with cgroups. The documentation is not as clear here as 
it could be, about what this does, the mechanism by which it’s applied (PAM 
module), etc. 

Sent from my iPhone

> On Apr 22, 2021, at 09:07, Diego Zuccato  wrote:
> 
> Hello all.
> 
> I'd need a clarification about PropagateResourceLimits.
> If I set it to NONE, will cgroup still limit the resources a job can use on 
> the worker node(s), actually decoupling limits on the frontend from limits on 
> the worker nodes?
> 
> I've been bitten by the default being ALL, so when I tried to limit to 1GB 
> soft / 4GB hard the memory users can use on the frontend, the jobs began to 
> fail at startup even if they requested 200G (that are available on the worker 
> nodes but not on the frontend)...
> 
> Tks.
> 
> -- 
> Diego Zuccato
> DIFA - Dip. di Fisica e Astronomia
> Servizi Informatici
> Alma Mater Studiorum - Università di Bologna
> V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
> tel.: +39 051 20 95786
> 


Re: [slurm-users] Jobs that may still be running at X time?

2021-04-16 Thread Ryan Novosielski
I knew we weren’t alone! Thanks, Juergen!

If the scheduling engine were slightly better for reservations (e.g. “Third 
Tuesday” type stuff), it would probably happen a little less often. I know it’s 
sort of getting there.

--
#BlackLivesMatter

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'

> On Apr 16, 2021, at 6:21 PM, Juergen Salk  wrote:
> 
> * Ryan Novosielski  [210416 21:33]:
> 
>> Does anyone have a particularly clever way, either built-in or
>> scripted, to find out which jobs will still be running at
>> such-and-such time? 
> 
> Hi Ryan,
> 
> coincidentally, I just did this today. For exactly the same reason.
> squeue does have a "%L" format option which will print out the time
> left for the jobs in days-hours:minutes:seconds.
> 
> For example: squeue -t r -o "%u %i %L"
> 
> This may help to identify jobs that already started and may 
> eventually run into a maintenance reservation. 
> 
>> I bet anyone who’s made the mistake of not
>> entering a maintenance reservation soon enough knows the feeling.
> 
> Yes. ;-)
> 
> Best regards
> Jürgen
> 
> 



[slurm-users] Jobs that may still be running at X time?

2021-04-16 Thread Ryan Novosielski
Hi there,

Does anyone have a particularly clever way, either built-in or scripted, to 
find out which jobs will still be running at such-and-such time? I bet anyone 
who’s made the mistake of not entering a maintenance reservation soon enough 
knows the feeling.

I know that jobs /may/ end earlier, but we’re looking for the potential.

I originally started looking for jobs with an EndTime= that overlapped with the 
maintenance, using "scontrol show job” and excluding jobs that had a StartTime= 
that was after the maintenance. Today I had a minor panic as many many new jobs 
had an EndTime= that overlapped with the maintenance, but a StartTime= that was 
after we put in the maintenance reservation but before the maintenance period. 
Come to find they were still pending. Not sure why they have a StartTime= that 
is before the reservation, but it appears as if they won’t be running.

Anyway, I figure this is something people probably need to know often enough. 
Any tips?

--
#BlackLivesMatter

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'



Re: [slurm-users] Exclude Slurm packages from the EPEL yum repository

2021-02-03 Thread Ryan Novosielski
My main point here is that essentially upgrading someone from, for example, 
SLURM 20.02 to SLURM 20.11 is not desirable, and that’s why upgrades between 
major versions, IMO, should not happen automatically. There’s a whole section 
of the documentation about how to do this properly, and I’m not sure that one 
should ever count on a package to do this properly, at least without advance 
planning (which I guess you could argue someone should do anyway). That doesn’t 
have much to do with the initial complaint of them just appearing, which is a 
one-time problem as now it’s in the repo. At any rate, not wanting to upgrade 
automatically is at least true of slurmdbd and slurmctld, maybe less true of 
the client versions. Sure, someone should know what they’re upgrading to, but I 
don’t imagine anyone’s goal here is to make it easier to shoot oneself in the 
foot. I imagine no one releases packages with the express goal of having people 
version lock them out. :-D

None of this affects me, as I’m not installing SLURM this way, but if someone 
/were/ asking me what I thought, I’d say that the current “slurm” package 
should be renamed “slurm-20.11,” which can follow an upgrade path with no issue 
as it’s safe to upgrade within a SLURM release, but that should only ever 
upgrade to another minor release of SLURM 20.11. The next one should be called 
slurm-21.02 or whatever version it ultimately becomes (I don’t think it’s even 
been pre-released yet). I don’t know enough about EPEL to know if that’s an 
allowable solution, but it’s one I’ve seen used for other software packages 
where one needs to manually upgrade to a new major version (I’m thinking on 
Ubuntu where the VirtualBox package has an embedded major/minor version number 
in it so you only automatically upgrade to a new point release).

To give an example of why we don’t just upgrade via packages, our SlurmDBD 
upgrade has at times taken more than 24 hours to do, and if something else gets 
upgraded sometime before that, it could cause problems.
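
For anyone who does consume the EPEL packages and wants a guard on their own
side, the versionlock plugin mentioned below is mechanically simple. A rough
sketch, assuming the EL7 plugin package name:

yum -y install yum-plugin-versionlock
yum versionlock add 'slurm*'
# later, when an upgrade is actually intended:
yum versionlock delete 'slurm*'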

--
#BlackLivesMatter

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'

> On Feb 3, 2021, at 1:06 PM, Philip Kovacs  wrote:
> 
> I am familiar with the package rename process and it would not have the 
> effect you might think it would.
> If I provide an upgrade path to a new package name, e.g. slurm-xxx, the net 
> effect would be to tell yum or
> dnf-managed systems that the new package name for slurm is slurm-xxx.  That 
> would make the problem
> worse by not only upgrading slurm from EPEL but also renaming it.
> 
> The way to handle the problem properly is to use any of the other methods I 
> described earlier that make
> the name collision moot, i.e. tell your systems that your local repo has a 
> higher priority or that packages
> in that repo are protected from upgrade or lock slurm onto a designated 
> version.
> 
> If you accidentally got pulled into an upgrade that you didn't want, those 
> are the steps to prevent it in the future.
> 
> You can easily create test cases for yum by placing (artificially) higher 
> versions of the packages you want
> to protect into a test repo.  Then use yum protectbase or yum priorities to 
> tell your system that your local repo
> is protected and/or that the local repo has a higher priority than the test 
> repo, depending on which yum plugins
> you use. Then verify that that `yum check-update` does not report an upgrade 
> for your test packages.
> 
> 
> Phil
> On Wednesday, February 3, 2021, 04:03:00 AM EST, Jürgen Salk 
>  wrote:
> 
> 
> Hi Phil,
> 
> assuming that all sites maintaining their own Slurm rpm packages must 
> now somehow ensure that these are not replaced by the EPEL packages 
> anyway, why wouldn't it be possible, in the long run, to follow the 
> Fedora packaging guidelines for renaming existing packages?
> 
> https://docs.fedoraproject.org/en-US/packaging-guidelines/#renaming-or-replacing-existing-packages
> 
> Best regards
> Jürgen
> 
> 
> On 03.02.21 01:58, Philip Kovacs wrote:
> > Lots of mixed reactions here, many in favor (and grateful) for the add 
> > to EPEL, many much less enthusiastic.
> > 
> > I cannot rename an EPEL package that is now in the wild without 
> > providing an upgrade path to the new name.
> > Such an upgrade path would defeat the purpose of the rename and won't 
> > help at all.
> > 
> > The best option, in my opinion, would be to use one of the following yum 
> > plugins:
> > 
> > yum-plugin-versionlock
> > yum-plugin-priorities
> 

Re: [slurm-users] Exclude Slurm packages from the EPEL yum repository

2021-01-24 Thread Ryan Novosielski
I agree that by and large it’s no big deal, but a suggestion might be to 
provide SLURM as versioned packages, with slurm-<major version>-* being the set of packages you install, 
so that updating between major versions wouldn’t happen by surprise, given how 
careful one needs to be with SLURM upgrades — ordering, timing, etc. VirtualBox 
does something like that, and their upgrades aren’t even as disruptive.

More work still, though, and I install SLURM via OpenHPC so I’m not a 
constituent necessarily. Just an idea.

Thanks for your effort!

--
#BlackLivesMatter

|| \\UTGERS,   |---*O*---
||_// the State | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ | Office of Advanced Research Computing - MSB C630, Newark
`'

On Jan 23, 2021, at 15:44, Philip Kovacs  wrote:


I can assure you it was easier for you to filter slurm from your repos than it 
was for me to make them available to both epel7 and epel8.

No good deed goes unpunished I guess.
On Saturday, January 23, 2021, 07:03:08 AM EST, Ole Holm Nielsen 
 wrote:


We use the EPEL yum repository on our CentOS 7 nodes.  Today EPEL
surprisingly delivers Slurm 20.11.2 RPMs, and the daily yum updates
(luckily) fail with some errors:

--> Running transaction check
---> Package slurm.x86_64 0:20.02.6-1.el7 will be updated
--> Processing Dependency: slurm(x86-64) = 20.02.6-1.el7 for package:
slurm-libpmi-20.02.6-1.el7.x86_64
--> Processing Dependency: libslurmfull.so()(64bit) for package:
slurm-libpmi-20.02.6-1.el7.x86_64
---> Package slurm.x86_64 0:20.11.2-2.el7 will be an update
--> Processing Dependency: pmix for package: slurm-20.11.2-2.el7.x86_64
--> Processing Dependency: libfreeipmi.so.17()(64bit) for package:
slurm-20.11.2-2.el7.x86_64
--> Processing Dependency: libipmimonitoring.so.6()(64bit) for package:
slurm-20.11.2-2.el7.x86_64
--> Processing Dependency: libslurmfull-20.11.2.so()(64bit) for package:
slurm-20.11.2-2.el7.x86_64
---> Package slurm-contribs.x86_64 0:20.02.6-1.el7 will be updated
---> Package slurm-contribs.x86_64 0:20.11.2-2.el7 will be an update
---> Package slurm-devel.x86_64 0:20.02.6-1.el7 will be updated
---> Package slurm-devel.x86_64 0:20.11.2-2.el7 will be an update
---> Package slurm-perlapi.x86_64 0:20.02.6-1.el7 will be updated
---> Package slurm-perlapi.x86_64 0:20.11.2-2.el7 will be an update
---> Package slurm-slurmdbd.x86_64 0:20.02.6-1.el7 will be updated
---> Package slurm-slurmdbd.x86_64 0:20.11.2-2.el7 will be an update
--> Running transaction check
---> Package freeipmi.x86_64 0:1.5.7-3.el7 will be installed
---> Package pmix.x86_64 0:1.1.3-1.el7 will be installed
---> Package slurm.x86_64 0:20.02.6-1.el7 will be updated
--> Processing Dependency: slurm(x86-64) = 20.02.6-1.el7 for package:
slurm-libpmi-20.02.6-1.el7.x86_64
--> Processing Dependency: libslurmfull.so()(64bit) for package:
slurm-libpmi-20.02.6-1.el7.x86_64
---> Package slurm-libs.x86_64 0:20.11.2-2.el7 will be installed
--> Finished Dependency Resolution
Error: Package: slurm-libpmi-20.02.6-1.el7.x86_64
(@/slurm-libpmi-20.02.6-1.el7.x86_64)
Requires: libslurmfull.so()(64bit)
Removing: slurm-20.02.6-1.el7.x86_64
(@/slurm-20.02.6-1.el7.x86_64)
libslurmfull.so()(64bit)
Updated By: slurm-20.11.2-2.el7.x86_64 (epel)
Not found
Error: Package: slurm-libpmi-20.02.6-1.el7.x86_64
(@/slurm-libpmi-20.02.6-1.el7.x86_64)
Requires: slurm(x86-64) = 20.02.6-1.el7
Removing: slurm-20.02.6-1.el7.x86_64
(@/slurm-20.02.6-1.el7.x86_64)
slurm(x86-64) = 20.02.6-1.el7
Updated By: slurm-20.11.2-2.el7.x86_64 (epel)
slurm(x86-64) = 20.11.2-2.el7
  You could try using --skip-broken to work around the problem
  You could try running: rpm -Va --nofiles --nodigest


We still run Slurm 20.02 and don't want EPEL to introduce any Slurm
updates!!  Slurm must be upgraded with some care, see for example
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm

Therefore we must disable EPEL's slurm RPMs permanently.  The fix is to
add to the file /etc/yum.repos.d/epel.repo an "exclude=slurm*" line like
the last line in:

[epel]
name=Extra Packages for Enterprise Linux 7 - $basearch
#baseurl=http://download.fedoraproject.org/pub/epel/7/$basearch
metalink=https://mirrors.fedoraproject.org/metalink?repo=epel-7&arch=$basearch&infra=$infra&content=$contentdir
failovermethod=priority
enabled=1
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7
exclude=slurm*

/Ole



Re: [slurm-users] Compute node process monitoring tools updated

2021-01-19 Thread Ryan Novosielski
Thanks, that’s great! I do a lot of that by hand (including lots over this 
weekend), so it will be a nice timesaver.

--
#BlackLivesMatter

|| \\UTGERS,   |---*O*---
||_// the State | Ryan Novosielski - 
novos...@rutgers.edu<mailto:novos...@rutgers.edu>
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ | Office of Advanced Research Computing - MSB C630, Newark
`'

On Jan 18, 2021, at 09:08, Ole Holm Nielsen  wrote:

FYI: My Slurm tools for displaying batch job user process information have 
been updated.  Besides the user process list from "ps", a summary of the number 
of processes and threads is now printed as well.  We use this for monitoring 
the sanity of user jobs.  For example, we often see jobs that run too many 
threads per process and overload the CPUs.

The tools are:

* psjob   for all user processes in a job
* psnode   for all user processes on a node or list of nodes

Download the psjob and psnode tools from:
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/jobs
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/nodes

--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark



Re: [slurm-users] [EXT] GPU Jobs with Slurm

2021-01-15 Thread Ryan Novosielski
Do you have any more information about that? I think that’s the bug I alluded 
to earlier in the conversation, and I believe I’m affected by it, but don’t 
know how to tell, how to fix it, or how to refer to it if I wanted to ask 
SchedMD (we have a contract).

--
#BlackLivesMatter

|| \\UTGERS,   |---*O*---
||_// the State | Ryan Novosielski - 
novos...@rutgers.edu<mailto:novos...@rutgers.edu>
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ | Office of Advanced Research Computing - MSB C630, Newark
`'

On Jan 14, 2021, at 18:56, Fulcomer, Samuel  wrote:


Also note that there was a bug in an older version of SLURM (pre-17-something) 
that corrupted the database in a way that prevented GPU/gres fencing. If that 
affected you and you're still using the same database, GPU fencing probably 
isn't working. There's a way of fixing this manually through sql hacking; 
however, we just went with a virgin database when we last upgraded in order to 
get it working (and sucked the accounting data into XDMoD).



On Thu, Jan 14, 2021 at 6:36 PM Fulcomer, Samuel 
mailto:samuel_fulco...@brown.edu>> wrote:
AllowedDevicesFile should not be necessary. The relevant devices are identified 
in gres.conf. "ConstrainDevices=yes" should be all that's needed.

nvidia-smi will only see the allocated GPUs. Note that a single allocated GPU 
will always be shown by nvidia-smi to be GPU 0, regardless of its actual 
hardware ordinal, and GPU_DEVICE_ORDINAL will be set to 0. The value of 
SLURM_STEP_GPUS will be set to the actual device number (N, where the device is 
/dev/nvidiaN).
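
A quick, hedged way to sanity-check the fencing from inside an allocation (the
partition name here is just an example):

srun -p gpu --gres=gpu:1 bash -c 'echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES SLURM_STEP_GPUS=$SLURM_STEP_GPUS; nvidia-smi -L'

With ConstrainDevices=yes in cgroup.conf, nvidia-smi -L should list only the
single allocated device.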

On Thu, Jan 14, 2021 at 6:20 PM Ryan Novosielski 
mailto:novos...@rutgers.edu>> wrote:
AFAIK, if you have this set up correctly, nvidia-smi will be restricted too, 
though I think we were seeing a bug there at one time in this version.

--
#BlackLivesMatter

|| \\UTGERS,   |---*O*---
||_// the State     | Ryan Novosielski - 
novos...@rutgers.edu<mailto:novos...@rutgers.edu>
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ | Office of Advanced Research Computing - MSB C630, Newark
`'

On Jan 14, 2021, at 18:05, Abhiram Chintangal 
mailto:achintan...@berkeley.edu>> wrote:


Sean,

Thanks for the clarification. I noticed that I am missing the "AllowedDevices" 
option in mine. After adding this, the GPU allocations started working. (Slurm 
version 18.08.8)

I was also incorrectly using "nvidia-smi" as a check.

Regards,

Abhiram

On Thu, Jan 14, 2021 at 12:22 AM Sean Crosby 
mailto:scro...@unimelb.edu.au>> wrote:
Hi Abhiram,

You need to configure cgroup.conf to constrain the devices a job has access to. 
See https://slurm.schedmd.com/cgroup.conf.html

My cgroup.conf is

CgroupAutomount=yes
AllowedDevicesFile="/usr/local/slurm/etc/cgroup_allowed_devices_file.conf"

ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
ConstrainDevices=yes

TaskAffinity=no

CgroupMountpoint=/sys/fs/cgroup

The ConstrainDevices=yes is the key to stopping jobs from having access to GPUs 
they didn't request.

Sean

--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia



On Thu, 14 Jan 2021 at 18:36, Abhiram Chintangal 
mailto:achintan...@berkeley.edu>> wrote:


Hello,

I recently set up a small cluster at work using Warewulf/Slurm. Currently, I am 
not able to get the scheduler to
work well with GPU's (Gres).

While slurm is able to filter by GPU type, it allocates all the GPU's on the 
node. See below:

[abhiram@whale ~]$ srun --gres=gpu:p100:2 -n 1 --partition=gpu nvidia-smi 
--query-gpu=index,name --format=csv
index, name
0, Tesla P100-PCIE-16GB
1, Tesla P100-PCIE-16GB
2, Tesla P100-PCIE-16GB
3, Tesla P100-PCIE-16GB
[abhiram@whale ~]$ srun --gres=gpu:titanrtx:2 -n 1 --partition=gpu nvidia-smi 
--query-gpu=index,name --format=csv
index, name
0, TITAN RTX
1, TITAN RTX
2, TITAN RTX
3, TITAN RTX
4, TITAN RTX
5, TITAN RTX
6, TITAN RTX
7, TITAN RTX

I am fairly new to Slurm and still figuring out my way around it. I would 
really appreciate any help with this.

For your reference, I attached the slurm.conf and gres.conf files.

Best,

Abhiram

--

Abhiram Chintangal
QB3 Nogales Lab
Bioinformatics Specialist @ Howard Hughes Medical Institute
University of California Berkeley
708D Stanley Hall, Berkeley, CA 94720
Phone (510)666-3344


--

Abhiram Chintangal
QB3 Nogales Lab
Bioinformatics Specialist @ Howard Hughes Medical Institute
University of California Berkeley
708D Stanley Hall, Berkeley, CA 94720
Phone (510)666-3344


Re: [slurm-users] [EXT] GPU Jobs with Slurm

2021-01-14 Thread Ryan Novosielski
AFAIK, if you have this set up correctly, nvidia-smi will be restricted too, 
though I think we were seeing a bug there at one time in this version.

--
#BlackLivesMatter

|| \\UTGERS,   |---*O*---
||_// the State | Ryan Novosielski - 
novos...@rutgers.edu<mailto:novos...@rutgers.edu>
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ | Office of Advanced Research Computing - MSB C630, Newark
`'

On Jan 14, 2021, at 18:05, Abhiram Chintangal  wrote:


Sean,

Thanks for the clarification. I noticed that I am missing the "AllowedDevices" 
option in mine. After adding this, the GPU allocations started working. (Slurm 
version 18.08.8)

I was also incorrectly using "nvidia-smi" as a check.

Regards,

Abhiram

On Thu, Jan 14, 2021 at 12:22 AM Sean Crosby 
mailto:scro...@unimelb.edu.au>> wrote:
Hi Abhiram,

You need to configure cgroup.conf to constrain the devices a job has access to. 
See https://slurm.schedmd.com/cgroup.conf.html

My cgroup.conf is

CgroupAutomount=yes
AllowedDevicesFile="/usr/local/slurm/etc/cgroup_allowed_devices_file.conf"

ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
ConstrainDevices=yes

TaskAffinity=no

CgroupMountpoint=/sys/fs/cgroup

The ConstrainDevices=yes is the key to stopping jobs from having access to GPUs 
they didn't request.

Sean

--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia



On Thu, 14 Jan 2021 at 18:36, Abhiram Chintangal 
mailto:achintan...@berkeley.edu>> wrote:


Hello,

I recently set up a small cluster at work using Warewulf/Slurm. Currently, I am 
not able to get the scheduler to
work well with GPU's (Gres).

While slurm is able to filter by GPU type, it allocates all the GPU's on the 
node. See below:

[abhiram@whale ~]$ srun --gres=gpu:p100:2 -n 1 --partition=gpu nvidia-smi 
--query-gpu=index,name --format=csv
index, name
0, Tesla P100-PCIE-16GB
1, Tesla P100-PCIE-16GB
2, Tesla P100-PCIE-16GB
3, Tesla P100-PCIE-16GB
[abhiram@whale ~]$ srun --gres=gpu:titanrtx:2 -n 1 --partition=gpu nvidia-smi 
--query-gpu=index,name --format=csv
index, name
0, TITAN RTX
1, TITAN RTX
2, TITAN RTX
3, TITAN RTX
4, TITAN RTX
5, TITAN RTX
6, TITAN RTX
7, TITAN RTX

I am fairly new to Slurm and still figuring out my way around it. I would 
really appreciate any help with this.

For your reference, I attached the slurm.conf and gres.conf files.

Best,

Abhiram

--

Abhiram Chintangal
QB3 Nogales Lab
Bioinformatics Specialist @ Howard Hughes Medical Institute
University of California Berkeley
708D Stanley Hall, Berkeley, CA 94720
Phone (510)666-3344


--

Abhiram Chintangal
QB3 Nogales Lab
Bioinformatics Specialist @ Howard Hughes Medical Institute
University of California Berkeley
708D Stanley Hall, Berkeley, CA 94720
Phone (510)666-3344


Re: [slurm-users] [External] Re: can't lengthen my jobs log

2020-12-04 Thread Ryan Novosielski
As root, -a is effectively applied to every command I’m aware of.

--
#BlackLivesMatter

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'

> On Dec 4, 2020, at 1:54 PM, Prentice Bisbal  wrote:
> 
> I know I'm very late to this thread, but were/are you using the --allusers 
> flag to sacct? If not, sacct only returns results for the user running the 
> command (not sure if this is the case for root - I never need to run sacct as 
> root). This minor detail tripped me up a few days ago when I was expecting 
> hundreds of thousands of results, and only got a couple hundred, which were 
> *my* jobs from the period I was searching, not *all* jobs. I was about to 
> have a heart attack because I thought someone purged the SlurmDB. 
> 
> For those wondering, I was using  -o to get only a few pieces of data per job 
>  from my query, and userid wasn't one of them, which is why my mistake wasn't 
> so obvious at first. 
> 
> Prentice
> 
> On 11/12/2020 4:49 PM, Erik Bryer wrote:
>> That worked pretty well in that I got more data than I ever have before by a 
>> lot. It only goes back about 18 days, but I'm not sure why. The 
>> slurmdbd.conf back then contained no directives on retaining logs, which is 
>> supposed to mean it defaults to retaining them indefinitely. On another test 
>> cluster it shows records back 2 days, which is about when I started fiddling 
>> with the settings. Could that have wiped the previous records, if they 
>> existed, or have my changes started the saving of older data. Still, this is 
>> progress.
>> 
>> Erik
>> From: slurm-users  on behalf of 
>> Sebastian T Smith 
>> Sent: Thursday, November 12, 2020 2:32 PM
>> To: slurm-users@lists.schedmd.com 
>> Subject: Re: [slurm-users] can't lengthen my jobs log
>>  
>> Hi John,
>> 
>> Have you tried specifying a start time?  The default is 00:00:00 of the 
>> current day (depending on other options).  Example:
>> 
>> sacct -S 2020-11-01T00:00:00
>> 
>> Our accounting database retains all job data from the epoch of our system.
>> 
>> Best,
>> 
>> Sebastian
>> 
>> --
>> 
>>  
>> Sebastian Smith
>> High-Performance Computing Engineer
>> Office of Information Technology
>> 1664 North Virginia Street
>> MS 0291
>> 
>> work-phone: 775-682-5050
>> email: stsm...@unr.edu
>> website: http://rc.unr.edu
>> 
>> From: slurm-users  on behalf of john 
>> abignail 
>> Sent: Thursday, November 12, 2020 12:57 PM
>> To: slurm-users@lists.schedmd.com 
>> Subject: [slurm-users] can't lengthen my jobs log
>>  
>> Hi,
>> 
>> My jobs database empties after about 1 day. "sacct -a" returns no results. 
>> I've tried to lengthen that, but have been unsuccessful. I've tried adding 
>> the following to slurmdbd.conf and restarting slurmdbd:
>> ArchiveJobs=yes
>> PurgeEventAfter=1month
>> PurgeJobAfter=12month
>> PurgeResvAfter=1month
>> PurgeStepAfter=1month
>> PurgeSuspendAfter=1month
>> PurgeTXNAfter=12month
>> PurgeUsageAfter=24month
>> No job archives appear (in the default /tmp dir) either. What I'd like to do 
>> is have the slurm database retain information on jobs for at least a few 
>> weeks, writing out data beyond that threshold to files, but mainly I just 
>> want to keep job data in the database for longer.
>> 
>> Regards,
>> John
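
Putting the two points above together, a hedged example that covers all users
from an explicit start time (the date and field list are arbitrary):

sacct -a -S 2020-10-01T00:00:00 -E now -o JobID,User,Partition,State,Elapsed,End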



Re: [slurm-users] How to contact slurm developers

2020-09-30 Thread Ryan Novosielski
I’ve previously seen code contributed back in that way. See bug 1611 as an 
example (happened to have looked at that just yesterday).

--

|| \\UTGERS,   |---*O*---
||_// the State | Ryan Novosielski - 
novos...@rutgers.edu<mailto:novos...@rutgers.edu>
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ | Office of Advanced Research Computing - MSB C630, Newark
`'

On Sep 30, 2020, at 11:29, Relu Patrascu  wrote:


Thanks Ryan, I'll try the bugs site. And indeed, one person in our organization 
has already said "let's pay for support, maybe they'll listen." :) It's a 
little bit funny to me that we don't actually need support, but get it hoping 
that they might consider adding a feature which we think would benefit everyone.

We have actually modified the code on both v 19 and 20 to do what we would 
like, preemption within the same QOS, but we think that the community would 
benefit from this feature, hence our request to have it in the release version.
Relu

On 2020-09-30 11:02, Ryan Novosielski wrote:
Depends on the issue I think, but the bugs site is often a way to request 
enhancements, etc. Of course, requests coming from an entity with a support 
contract carry more weight.

--

|| \\UTGERS,   |---*O*---
||_// the State     | Ryan Novosielski - 
novos...@rutgers.edu<mailto:novos...@rutgers.edu>
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ | Office of Advanced Research Computing - MSB C630, Newark
`'

On Sep 30, 2020, at 10:57, Relu Patrascu 
<mailto:r...@cs.toronto.edu> wrote:

Hi all,

I posted recently on this mailing list a feature request and got no reply from 
the developers. Is there a better way to contact the slurm developers or we 
should just accept that they are not interested in community feedback?

Regards,

Relu




Re: [slurm-users] How to contact slurm developers

2020-09-30 Thread Ryan Novosielski
Depends on the issue I think, but the bugs site is often a way to request 
enhancements, etc. Of course, requests coming from an entity with a support 
contract carry more weight.

--

|| \\UTGERS,   |---*O*---
||_// the State | Ryan Novosielski - 
novos...@rutgers.edu<mailto:novos...@rutgers.edu>
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ | Office of Advanced Research Computing - MSB C630, Newark
`'

On Sep 30, 2020, at 10:57, Relu Patrascu  wrote:

Hi all,

I posted recently on this mailing list a feature request and got no reply from 
the developers. Is there a better way to contact the slurm developers or we 
should just accept that they are not interested in community feedback?

Regards,

Relu




Re: [slurm-users] EXTERNAL: Re: Memory per CPU

2020-09-30 Thread Ryan Novosielski
Absolutely not. It’s recommended.

--

|| \\UTGERS,   |---*O*---
||_// the State | Ryan Novosielski - 
novos...@rutgers.edu<mailto:novos...@rutgers.edu>
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ | Office of Advanced Research Computing - MSB C630, Newark
`'

On Sep 30, 2020, at 10:46, Luecht, Jeff A  wrote:


So just to confirm, there is not inherent issue using srun within an SBATCH 
file?

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
Ryan Novosielski
Sent: Wednesday, September 30, 2020 10:01 AM
To: Slurm User Community List 
Subject: Re: [slurm-users] EXTERNAL: Re: Memory per CPU


Primary one I’m aware of is that resource use is better reported (or at all in 
some cases) via srun, and srun can take care of MPI for an MPI job.  I’m sure 
there are others as well (I guess avoiding another place where you have to 
describe the resources to be used and making sure they match, in the case of 
mpirun, etc.).
--

|| \\UTGERS,   
|---*O*---
||_// the State |     Ryan Novosielski - 
novos...@rutgers.edu<mailto:novos...@rutgers.edu>
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ | Office of Advanced Research Computing - MSB C630, Newark
`'


On Sep 30, 2020, at 09:38, Luecht, Jeff A 
mailto:jeff.lue...@pnc.com>> wrote:
First off, I want to thank everyone for their input and suggestions.  They 
were very helpful and ultimately pointed me in the right direction.  I spent 
several hours playing around with various settings.

Some additional background. When the srun command is used to execute this job,  
we do not see this issue.  We only see it in SBATCH.

What I ultimately did was the following:

1 - Change the NodeName to add the specific parameters Sockets, Cores and 
Threads.
2 - Changed the DefMemPerCPU/MaxMemPerCPU to 16144/12228 instead of 6000/12000 
respectively

I tested jobs after the above changes and used 'scontrol --defaults job ' 
command.  The CPU allocation now works as expected.

I do have one question though - what is the benefit/recommendation of using 
srun to execute a process within SBATCH?  We are running primarily python jobs, 
but need to also support R jobs.

-Original Message-
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
Diego Zuccato
Sent: Wednesday, September 30, 2020 2:18 AM
To: Slurm User Community List 
mailto:slurm-users@lists.schedmd.com>>; Michael 
Di Domenico mailto:mdidomeni...@gmail.com>>
Subject: EXTERNAL: Re: [slurm-users] Memory per CPU


On 29/09/20 16:19, Michael Di Domenico wrote:


what leads you to believe that you're getting 2 CPU's instead of 1?
I think I saw that too, once, but thought it was related to hyperthreading.

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna V.le Berti-Pichat 6/2 - 40127 
Bologna - Italy
tel.: +39 051 20 95786







Re: [slurm-users] EXTERNAL: Re: Memory per CPU

2020-09-30 Thread Ryan Novosielski
Primary one I’m aware of is that resource use is better reported (or at all in 
some cases) via srun, and srun can take care of MPI for an MPI job.  I’m sure 
there are others as well (I guess avoiding another place where you have to 
describe the resources to be used and making sure they match, in the case of 
mpirun, etc.).
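
For illustration, a minimal batch script along those lines (the resource values
and script name are placeholders):

#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --mem-per-cpu=2G
#SBATCH --time=01:00:00
# Launching through srun creates a job step, so sstat/sacct report per-step
# usage; for MPI codes, srun also replaces mpirun and inherits the allocation.
srun python my_script.py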

--

|| \\UTGERS,   |---*O*---
||_// the State | Ryan Novosielski - 
novos...@rutgers.edu<mailto:novos...@rutgers.edu>
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ | Office of Advanced Research Computing - MSB C630, Newark
`'

On Sep 30, 2020, at 09:38, Luecht, Jeff A  wrote:

First off, I want to thank everyone for their input and suggestions.  They 
were very helpful and ultimately pointed me in the right direction.  I spent 
several hours playing around with various settings.

Some additional background. When the srun command is used to execute this job,  
we do not see this issue.  We only see it in SBATCH.

What I ultimately did was the following:

1 - Change the NodeName to add the specific parameters Sockets, Cores and 
Threads.
2 - Changed the DefMemPerCPU/MaxMemPerCPU to 16144/12228 instead of 6000/12000 
respectively

I tested jobs after the above changes and used 'scontrol --defaults job ' 
command.  The CPU allocation now works as expected.
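
As a rough sketch, the two changes above correspond to slurm.conf lines along
these lines (node names, socket/core counts and memory values here are
illustrative, not the actual hardware in question):

NodeName=node[01-04] Sockets=2 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=192000 State=UNKNOWN
DefMemPerCPU=4096
MaxMemPerCPU=12288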

I do have one question though - what is the benefit/recommendation of using 
srun to execute a process within SBATCH?  We are running primarily python jobs, 
but need to also support R jobs.

-Original Message-
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
Diego Zuccato
Sent: Wednesday, September 30, 2020 2:18 AM
To: Slurm User Community List ; Michael Di 
Domenico 
Subject: EXTERNAL: Re: [slurm-users] Memory per CPU


On 29/09/20 16:19, Michael Di Domenico wrote:

what leads you to believe that you're getting 2 CPU's instead of 1?
I think I saw that too, once, but thought it was related to hyperthreading.

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna V.le Berti-Pichat 6/2 - 40127 
Bologna - Italy
tel.: +39 051 20 95786








Re: [slurm-users] Slurm -- using GPU cards with NVLINK

2020-09-10 Thread Ryan Novosielski
I’m fairly sure that you set this up the same way you set up for a peer-to-peer 
setup. Here’s ours:

[root@cuda001 ~]# nvidia-smi topo --matrix
        GPU0    GPU1    GPU2    GPU3    mlx4_0  CPU Affinity
GPU0     X      PIX     SYS     SYS     PHB     0-11
GPU1    PIX      X      SYS     SYS     PHB     0-11
GPU2    SYS     SYS      X      PIX     SYS     12-23
GPU3    SYS     SYS     PIX      X      SYS     12-23
mlx4_0  PHB     PHB     SYS     SYS      X 

[root@cuda001 ~]# cat /etc/slurm/gres.conf 

…

# 2 x K80 (perceval)
NodeName=cuda[001-008] Name=gpu File=/dev/nvidia[0-1] CPUs=0-11
NodeName=cuda[001-008] Name=gpu File=/dev/nvidia[2-3] CPUs=12-23

This also seems to be related:

https://slurm.schedmd.com/SLUG19/GPU_Scheduling_and_Cons_Tres.pdf
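
One hedged option, assuming Slurm 19.05+ built against the NVIDIA NVML library,
is to let Slurm discover the GPUs and their NVLink topology itself (node name
and GPU count below are taken from the question and may need adjusting):

# gres.conf
AutoDetect=nvml

# slurm.conf
GresTypes=gpu
NodeName=alpha51 Gres=gpu:4 ...

Users then request GPUs as usual, e.g. sbatch --gres=gpu:2, and with the
cons_tres selector described in the slides above the scheduler should prefer
NVLink-connected pairs when it can.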

--

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'

> On Sep 10, 2020, at 11:00 AM, David Baker  wrote:
> 
> Hello,
> 
> We are installing a group of nodes which all contain 4 GPU cards. The GPUs 
> are paired together using NVLINK as described in the matrix below. 
> 
> We are familiar with using Slurm to schedule and run jobs on GPU cards, but 
> this is the first time we have dealt with NVLINK enabled GPUs. Could someone 
> please advise us how to configure Slurm so that we can submit jobs to the 
> cards and make use of the NVLINK? That is, what do we need to put in the 
> gres.conf or slurm.conf, and how should users use the sbatch command? I 
> presume, for example, that a user could make use of a GPU card, and 
> potentially make use of memory on the paired card.
> 
> Best regards,
> David
> 
> [root@alpha51 ~]# nvidia-smi topo --matrix
>         GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
> GPU0     X      NV2     SYS     SYS     0,2,4,6,8,10    0
> GPU1    NV2      X      SYS     SYS     0,2,4,6,8,10    0
> GPU2    SYS     SYS      X      NV2     1,3,5,7,9,11    1
> GPU3    SYS     SYS     NV2      X      1,3,5,7,9,11    1



Re: [slurm-users] is there a way to delay the scheduling.

2020-08-28 Thread Ryan Novosielski
Sounds like you’re sort of the poster-child for this section of the 
documentation:

https://slurm.schedmd.com/high_throughput.html — note that it’s possible for 
this to be version specific, so look for this file in the “archive” section of 
the website if you need other than 20.02.
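
A few of the knobs that guide covers, as a hedged starting point only (the
values are illustrative and version-dependent):

SchedulerParameters=defer,batch_sched_delay=10,sched_min_interval=2000000,max_rpc_cnt=150
MessageTimeout=30

"defer" and batch_sched_delay stop slurmctld from trying to schedule on every
single submission, which is usually what hurts with floods of 1-2 second jobs.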

--

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'

> On Aug 28, 2020, at 6:30 AM, navin srivastava  wrote:
> 
> Hi Team,
> 
> facing one issue. several users submitting 2 job in a single batch job 
> which is very short jobs( says 1-2 sec). so while submitting more job 
> slurmctld become unresponsive and started giving message
> 
> ending job 6e508a88155d9bec40d752c8331d7ae8 to queue.
> sbatch: error: Batch job submission failed: Unable to contact slurm 
> controller (connect failure)
> Sending job 6e51ed0e322c87802b0f3a2f23a7967f to queue.
> sbatch: error: Batch job submission failed: Unable to contact slurm 
> controller (connect failure)
> Sending job 6e638939f90cd59e60c23b8450af9839 to queue.
> sbatch: error: Batch job submission failed: Unable to contact slurm 
> controller (connect failure)
> Sending job 6e6acf36bc7e1394a92155a95feb1c92 to queue.
> sbatch: error: Batch job submission failed: Unable to contact slurm 
> controller (connect failure)
> Sending job 6e6c646a29f0ad4e9df35001c367a9f5 to queue.
> sbatch: error: Batch job submission failed: Unable to contact slurm 
> controller (connect failure)
> Sending job 6ebcecb4c27d88f0f48d402e2b079c52 to queue.
> 
> even that time the load of cpu started consuming more than 100%  of slurmctld 
> process.
> I found that the node is not able to acknowledge immediately to server. it is 
> moving from comp to idle.
> so in my thought delay a scheduling cycle will help here. any idea how it can 
> be done.
> 
> so is there any other solution available for such issues.
> 
> Regards
> Navin.
> 
> 
> 



Re: [slurm-users] GRES Restrictions

2020-08-25 Thread Ryan Novosielski
Sorry about that. “NJT” should have read “but;” apparently my phone decided I 
was talking about our local transit authority. 

On Aug 25, 2020, at 10:30, Ryan Novosielski  wrote:

 I believe that’s done via a QoS on the partition. Have a look at the docs 
there, and I think “require” is a good key word to look for.

Cgroups should also help with this, NJT I’ve been troubleshooting a problem 
where that seems not to be working correctly.

--

|| \\UTGERS,   |---*O*---
||_// the State | Ryan Novosielski - 
novos...@rutgers.edu<mailto:novos...@rutgers.edu>
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ | Office of Advanced Research Computing - MSB C630, Newark
`'

On Aug 25, 2020, at 10:13, Willy Markuske  wrote:



Hello,

I'm trying to restrict access to gpu resources on a cluster I maintain for a 
research group. There are two nodes put into a partition with gres gpu 
resources defined. User can access these resources by submitting their job 
under the gpu partition and defining a gres=gpu.

When a user includes the flag --gres=gpu:# they are allocated the number of 
gpus and slurm properly allocates them. If a user requests only 1 gpu they only 
see CUDA_VISIBLE_DEVICES=1. However, if a user does not include the 
--gres=gpu:# flag they can still submit a job to the partition and are then 
able to see all the GPUs. This has led to some bad actors running jobs on all 
GPUs that other users have allocated and causing OOM errors on the gpus.

Is it possible, and where would I find the documentation on doing so, to 
require users to define a --gres=gpu:# to be able to submit to a partition? So 
far reading the gres documentation doesn't seem to have yielded any word on 
this issue specifically.

Regards,

--

Willy Markuske

HPC Systems Engineer



Research Data Services

P: (858) 246-5593


Re: [slurm-users] GRES Restrictions

2020-08-25 Thread Ryan Novosielski
I believe that’s done via a QoS on the partition. Have a look at the docs 
there, and I think “require” is a good key word to look for.

Cgroups should also help with this, NJT I’ve been troubleshooting a problem 
where that seems not to be working correctly.
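
A hedged sketch of that QOS route (names are placeholders; MinTRESPerJob needs
a reasonably recent Slurm and AccountingStorageEnforce must include limits/qos):

sacctmgr add qos gpuqos
sacctmgr modify qos gpuqos set MinTRESPerJob=gres/gpu=1 Flags=DenyOnLimit

# slurm.conf, attach the QOS to the GPU partition:
PartitionName=gpu Nodes=gpunode[01-02] QOS=gpuqos ...

With DenyOnLimit set, jobs submitted to that partition without at least one GPU
in their request should be rejected at submit time (behavior may vary by version).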

--

|| \\UTGERS,   |---*O*---
||_// the State | Ryan Novosielski - 
novos...@rutgers.edu<mailto:novos...@rutgers.edu>
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ | Office of Advanced Research Computing - MSB C630, Newark
`'

On Aug 25, 2020, at 10:13, Willy Markuske  wrote:



Hello,

I'm trying to restrict access to gpu resources on a cluster I maintain for a 
research group. There are two nodes put into a partition with gres gpu 
resources defined. User can access these resources by submitting their job 
under the gpu partition and defining a gres=gpu.

When a user includes the flag --gres=gpu:# they are allocated the number of 
gpus and slurm properly allocates them. If a user requests only 1 gpu they only 
see CUDA_VISIBLE_DEVICES=1. However, if a user does not include the 
--gres=gpu:# flag they can still submit a job to the partition and are then 
able to see all the GPUs. This has led to some bad actors running jobs on all 
GPUs that other users have allocated and causing OOM errors on the gpus.

Is it possible, and where would I find the documentation on doing so, to 
require users to define a --gres=gpu:# to be able to submit to a partition? So 
far reading the gres documentation doesn't seem to have yielded any word on 
this issue specifically.

Regards,

--

Willy Markuske

HPC Systems Engineer



Research Data Services

P: (858) 246-5593


Re: [slurm-users] Jobs killed by OOM-killer only on certain nodes.

2020-07-02 Thread Ryan Novosielski
Are you sure that the OOM killer is involved? I can get you specifics later, 
but if it’s that one line about OOM events, you may see it after successful 
jobs too. I just had a SLURM bug where this came up.
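
One hedged way to tell the difference is to compare the job's recorded state
and peak memory against what was requested, e.g. for the job above:

sacct -j 801777 -o JobID,State,ExitCode,ReqMem,MaxRSS,MaxVMSize

A step that genuinely hit the cgroup limit will normally show a failed or
OUT_OF_MEMORY state and a MaxRSS near the request, whereas the stray
"oom-kill event count" line on the extern step can also show up on jobs that
completed fine.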

--

|| \\UTGERS,   |---*O*---
||_// the State | Ryan Novosielski - 
novos...@rutgers.edu<mailto:novos...@rutgers.edu>
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ | Office of Advanced Research Computing - MSB C630, Newark
`'

On Jul 2, 2020, at 09:53, Prentice Bisbal  wrote:

I maintain a very heterogeneous cluster (different processors, different 
amounts of RAM, etc.) I have a user reporting the following problem.

He's running the same job multiple times with different input parameters. The 
jobs run fine unless they land on specific nodes. He's specifying --mem=2G in 
his sbatch files. On the nodes where the jobs fail, I see that the OOM killer 
is invoked, so I asked him to specify more RAM, so he did. He set --mem=4G, and 
still the jobs fail on these 2 nodes. However, they run just fine on other 
nodes with --mem=2G.

When I look at the slurm log file on the nodes, I see something like this for a 
failing job (in this case, --mem=4G was set)

[2020-07-01T16:19:06.222] _run_prolog: prolog with lock for job 801777 ran for 
0 seconds
[2020-07-01T16:19:06.479] [801777.extern] task/cgroup: 
/slurm/uid_40324/job_801777: alloc=4096MB mem.limit=4096MB memsw.limit=unlimited
[2020-07-01T16:19:06.483] [801777.extern] task/cgroup: 
/slurm/uid_40324/job_801777/step_extern: alloc=4096MB mem.limit=4096MB 
memsw.limit=unlimited
[2020-07-01T16:19:06.506] Launching batch job 801777 for UID 40324
[2020-07-01T16:19:06.621] [801777.batch] task/cgroup: 
/slurm/uid_40324/job_801777: alloc=4096MB mem.limit=4096MB memsw.limit=unlimited
[2020-07-01T16:19:06.623] [801777.batch] task/cgroup: 
/slurm/uid_40324/job_801777/step_batch: alloc=4096MB mem.limit=4096MB 
memsw.limit=unlimited
[2020-07-01T16:19:19.385] [801777.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, 
error:0 status:0
[2020-07-01T16:19:19.389] [801777.batch] done with job
[2020-07-01T16:19:19.463] [801777.extern] _oom_event_monitor: oom-kill event 
count: 1
[2020-07-01T16:19:19.508] [801777.extern] done with job

Any ideas why the jobs are failing on just these two nodes, while they run just 
fine on many other nodes?

For now, the user is excluding these two nodes using the -x option to sbatch, 
but I'd really like to understand what's going on here.

--

Prentice




Re: [slurm-users] slurmd -C showing incorrect core count

2020-03-13 Thread Ryan Novosielski
From what I know of how this works, no, it’s not getting it from a local file 
or the master node. I don’t believe it even makes a network connection, nor 
requires a slurm.conf in order to run. If you can run it fresh on a node with 
no config and that’s what it comes up with, it’s probably getting it from the 
VM somehow.
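
A hedged example of that strace approach, just to see which files and sockets
slurmd consults when printing its hardware view:

strace -f -e trace=open,openat,connect -o /tmp/slurmd-C.trace slurmd -C
grep -E 'slurm|cpuinfo|sys/devices' /tmp/slurmd-C.trace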

--

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'

> On Mar 11, 2020, at 10:26 AM, mike tie  wrote:
> 
> 
> Yep, slurmd -C is obviously getting the data from somewhere, either a local 
> file or from the master node.  hence my email to the group;  I was hoping 
> that someone would just say:  "yeah, modify file ".  But oh well. I'll 
> start playing with strace and gdb later this week;  looking through the 
> source might also be helpful.  
> 
> I'm not cloning existing virtual machines with slurm.  I have access to a 
> vmware system that from time to time isn't running at full capacity;  usage 
> is stable for blocks of a month or two at a time, so my thought/plan was to 
> spin up a slurm compute node  on it, and resize it appropriately every few 
> months (why not put it to work).  I started with 10 cores, and it looks like 
> I can up it to 16 cores for a while, and that's when I ran into the problem.
> 
> -mike
> 
> 
> 
> Michael Tie
> Technical Director
> Mathematics, Statistics, and Computer Science
> 
>  One North College Street  phn:  507-222-4067
>  Northfield, MN 55057   cel:952-212-8933
>  m...@carleton.edufax:507-222-4312
> 
> 
> 
> On Wed, Mar 11, 2020 at 1:15 AM Kirill 'kkm' Katsnelson  
> wrote:
> On Tue, Mar 10, 2020 at 1:41 PM mike tie  wrote:
> Here is the output of lstopo
> 
> $ lstopo -p
> Machine (63GB)
>   Package P#0 + L3 (16MB)
> L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#0 + PU P#0
> L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#1 + PU P#1
> L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#2 + PU P#2
> L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#3 + PU P#3
>   Package P#1 + L3 (16MB)
> L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#0 + PU P#4
> L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#1 + PU P#5
> L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#2 + PU P#6
> L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#3 + PU P#7
>   Package P#2 + L3 (16MB)
> L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#0 + PU P#8
> L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#1 + PU P#9
> L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#2 + PU P#10
> L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#3 + PU P#11
>   Package P#3 + L3 (16MB)
> L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#0 + PU P#12
> L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#1 + PU P#13
> L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#2 + PU P#14
> L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#3 + PU P#15
> 
> There is no sane way to derive the number 10 from this topology, obviously: 
> it has a prime factor of 5, but everything in the lstopo output is sized in 
> powers of 2 (4 packages, a.k.a.  sockets, 4 single-threaded CPU cores per). 
> 
> I responded yesterday but somehow managed to plop my signature into the 
> middle of it, so maybe you have missed inline replies?
> 
> It's very, very likely that the number is stored *somewhere*. First to 
> eliminate is the hypothesis that the number is acquired from the control 
> daemon. That's the simplest step and the largest landgrab in the 
> divide-and-conquer analysis plan. Then just look where it comes from on the 
> VM. strace(1) will reveal all files slurmd reads. 
> 
> You are not rolling out the VMs from an image, ain't you? I'm wondering why 
> do you need to tweak an existing VM that is already in a weird state. Is 
> simply setting its snapshot aside and creating a new one from an image 
> hard/impossible? I did not touch VMWare for more than 10 years, so I may be a 
> bit naive; in the platform I'm working now (GCE), create-use-drop pattern of 
> VM use is much more common and simpler than create and maintain it to either 
> *ad infinitum* or *ad nauseam*, whichever will have been reached the 
> earliest.  But I don't know anything about VMWare; maybe it's not possible or 
> feasible with it.
> 
>  -kkm
> 



Re: [slurm-users] Slurm 19.05 X11-forwarding

2020-02-25 Thread Ryan Novosielski
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

I seem to remember there being a config option to specify rewriting
the hostname as well. I thought it was part of X11Parameters, but I
only see one option there:

https://slurm.schedmd.com/archive/slurm-19.05-latest/slurm.conf.html

On 2/25/20 10:55 AM, Tina Friedrich wrote:
> I remember having issues when I set up X forwarding that had to do
> with how the host names were set on the nodes. I had them set
> (CentOS default) to the fully qualified hostname, and that didn't
> work - with an error message very similar to what you're getting,
> if memory serves right. 'Fixed' it by setting the hostnames on all
> my cluster nodes to the short version.
>
> What does 'xauth list' give you on your nodes (and the machine
> you're coming from)?
>
> This is/was SLURM 18.08 though, not sure if that makes a
> difference.
>
> Tina
>
> On 25/02/2020 04:55, Pär Lundö wrote:
>> Hi,
>>
>> Thank you for your reply Patrick. I´ve tried that but I still get
>> the error stating that the magic cookie could not be retrieved.
>> Reading Tim´s answer, this bug should have been fixed in a
>> release following 18.08, but I´m using 19.05 thus it should have
>> been taken care of?
>>
>> Best regards, Pär Lundö
>>
>> -Original Message- From: slurm-users
>>  On Behalf Of Patrick
>> Goetz Sent: den 24 februari 2020 21:38 To:
>> slurm-users@lists.schedmd.com Subject: Re: [slurm-users] Slurm
>> 19.05 X11-forwarding
>>
>> This bug report appears to address the issue you're seeing:
>>
>> https://bugs.schedmd.com/show_bug.cgi?id=5868
>>
>>
>>
>> On 2/24/20 4:46 AM, Pär Lundö wrote:
>>> Dear all,
>>>
>>> I started testing and evaluating Slurm roughly a year ago and
>>> used it succesfully with MPI-programs. I have now identified
>>> that I need to use X-forwarding in order to make use of an
>>> application needed to run a GUI. However I seem not to be able
>>> to use the "X11Forwarding" in Slurm.conf with any sucess. A
>>> simple test with srun as "srun - -x11 ",
>>> yields an error stating: "srun: error: x11_get_xauth: Could not
>>> retrieve magic cookie. Cannot use X11 forwarding." I'm guessing
>>> that there some additional setting I'm missing out on.
>>> Searching through documentation does not reveal anything
>>> extraordinary to be performed with the X11-forwarding. I have
>>> searched bug-reports and found similiar problems but fixes have
>>> been implemented in Slurm-versions prior to the one I am using.
>>> In addition users have been reporting that the implementation
>>> made by Slurm, now have fixed their issues. I have used the
>>> "export=DISPLAY, HOME" as an additional argument for srun but
>>> without any progress. Anyone with similiar problem who can aid
>>> or advice me on howto use the X11Forward feature? Any help is
>>> much appreciated.
>>>
>>> I am running Slurm 19.05 and Ubuntu 18.10.
>>>
>>> Best regards, Pär Lundö
>>>
>>> This message is from an external sender. Learn more about why
>>> this matters.
>>> <https://ut.service-now.com/sp?id=kb_article=KB0011401>
>>>
>>>
>>

- -- 
 
 || \\UTGERS, |--*O*
 ||_// the State  |Ryan Novosielski - novos...@rutgers.edu
 || \\ University | Sr. Technologist - 973/972.0922 ~*~ RBHS Campus
 ||  \\of NJ  | Office of Advanced Res. Comp. - MSB C630, Newark
  `'
-BEGIN PGP SIGNATURE-

iF0EARECAB0WIQST3OUUqPn4dxGCSm6Zv6Bp0RyxvgUCXlWCNgAKCRCZv6Bp0Ryx
vjTSAJ9S7WA/rYjw5ylOljq9IpPqwTsHKACfbn/nnGJvE+QHvJoJ2djkkCQjMok=
=Ju41
-END PGP SIGNATURE-



Re: [slurm-users] Node can't run simple job when STATUS is up and STATE is idle

2020-01-20 Thread Ryan Novosielski
The node is not getting the status from itself, it’s querying the slurmctld to 
ask for its status.
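
A hedged checklist for this situation (the node name is taken from the output
below; everything else assumes the usual defaults):

systemctl status slurmd                    # is slurmd actually running on the node?
scontrol show node liqidos-dean-node1 | grep -iE 'state|reason'
scontrol ping                              # from the node: can it reach slurmctld?
scontrol update nodename=liqidos-dean-node1 state=resume   # once connectivity is fixed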

--

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'

> On Jan 20, 2020, at 3:56 PM, Dean Schulze  wrote:
> 
> If I run sinfo on the node itself it shows an asterisk.  How can the node be 
> unreachable from itself?
> 
> On Mon, Jan 20, 2020 at 1:50 PM Carlos Fenoy  wrote:
> Hi,
> 
> The * next to the idle status in sinfo means that the node is unreachable/not 
> responding. Check the status of the slurmd on the node and check the 
> connectivity from the slurmctld host to the compute node (telnet may be 
> enough). You can also check the slurmctld logs for more information. 
> 
> Regards,
> Carlos
> 
> On Mon, 20 Jan 2020 at 21:04, Dean Schulze  wrote:
> I've got a node running on CentOS 7.7 built from the recent 20.02.0pre1 code 
> base.  Its behavior is strange, to say the least.
> 
> The controller was built from the same code base, but on Ubuntu 19.10.  The 
> controller reports the node's state with sinfo, but can't run a simple job 
> with srun because it thinks the node isn't available, even when it is idle.  
> (And squeue shows an empty queue.)
> 
> On the controller:
> $ srun -N 1 hostname
> srun: Required node not available (down, drained or reserved)
> srun: job 30 queued and waiting for resources
> ^Csrun: Job allocation 30 has been revoked
> srun: Force Terminated job 30
> $ sinfo
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST 
> debug*   up   infinite  1  idle* liqidos-dean-node1 
> $ squeue
>  JOBID  PARTITION  USER  STTIME   NODES 
> NODELIST(REASON) 
> 
> 
> When I try to run the simple job on the node I get:
> 
> [liqid@liqidos-dean-node1 ~]$ sinfo
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST 
> debug*   up   infinite  1  idle* liqidos-dean-node1 
> [liqid@liqidos-dean-node1 ~]$ srun -N 1 hostname
> srun: Required node not available (down, drained or reserved)
> srun: job 27 queued and waiting for resources
> ^Csrun: Job allocation 27 has been revoked
> [liqid@liqidos-dean-node1 ~]$ squeue
>  JOBID  PARTITION  USER  STTIME   NODES 
> NODELIST(REASON) 
> [liqid@liqidos-dean-node1 ~]$ srun -N 1 hostname
> srun: Required node not available (down, drained or reserved)
> srun: job 28 queued and waiting for resources
> ^Csrun: Job allocation 28 has been revoked
> [liqid@liqidos-dean-node1 ~]$ sinfo
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST 
> debug*   up   infinite  1  idle* liqidos-dean-node1 
> 
> Apparently slurm thinks there are a bunch of jobs queued, but shows an empty 
> queue.  How do I get rid of these?
> 
> If these zombie jobs aren't the problem, what else could be keeping this from 
> running?
> 
> Thanks.
> -- 
> --
> Carles Fenoy



Re: [slurm-users] Downgraded to slurm 19.05.4 and now slrumctld won't start because of incompatible state

2020-01-20 Thread Ryan Novosielski
Check slurm.conf for StateSaveLocation.

https://slurm.schedmd.com/slurm.conf.html
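
For example, something along these lines (the paths shown are common defaults
and may differ on your system; moving the directory aside discards queued and
running job state):

grep -i StateSaveLocation /etc/slurm/slurm.conf
# e.g. StateSaveLocation=/var/spool/slurmctld
mv /var/spool/slurmctld /var/spool/slurmctld.bad
mkdir -p /var/spool/slurmctld && chown slurm: /var/spool/slurmctld
systemctl start slurmctld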

--

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'

> On Jan 20, 2020, at 5:58 PM, Dean Schulze  wrote:
> 
> 
> This is what I get from systemctl status slurmctld:
> 
> fatal: Can not recover last_tres state, incompatible version, got 8960 need 
> >= 8192 <= 8704, start with '-i' to ignore this
> 
> Starting it with the -i option doesn't do anything.
> 
> Where does slurm store this state so I can get rid of it?
> 
> Thanks.
> 




[slurm-users] Array jobs vs. many jobs

2019-11-22 Thread Ryan Novosielski
Hi there,

Quick question that I'm not sure how to find the answer to otherwise: do array 
jobs have less impact on the scheduler in any way than a whole long list of 
jobs run the more traditional way? Less startup overhead, anything like that?

Thanks!

(we run 17.11 on CentOS 7, but I'm not sure it makes any difference here)

--

|| \\UTGERS,|---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'



Re: [slurm-users] Get GPU usage from sacct?

2019-11-14 Thread Ryan Novosielski
Do you mean akin to what some would consider "CPU efficiency" on a CPU job? 
"How much... used" is a little vague.


From: slurm-users  on behalf of Prentice 
Bisbal 
Sent: Thursday, November 14, 2019 13:41
To: Slurm User Community List
Subject: [slurm-users] Get GPU usage from sacct?

Is there any way to see how much a job used the GPU(s) on a cluster
using sacct or any other slurm command?

--
Prentice





Re: [slurm-users] Slurm node weights

2019-07-25 Thread Ryan Novosielski
My understanding is that the topology plug-in will overrule this, and that may 
or may not be a problem depending on your environment. I had a ticket in to 
SchedMD about this, because it looked like our nodes were getting allocated in 
the exact reverse order. I suspected this was because our higher weight 
equipment was on a switch with fewer nodes, and the scheduler was trying to 
keep workloads contiguous (opting to preserve larger blocks where possible). 
SchedMD was not able to duplicate this with my configuration, however, so it 
remains a suspicion of mine, and I’ve heard that there IS an interaction of 
some sort.
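
A quick, hedged way to confirm what the scheduler actually sees (node weights
and whether a topology plugin is configured):

sinfo -N -o '%N %w'                                        # node name and scheduling weight
scontrol show config | grep -iE 'TopologyPlugin|SchedulerType'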

--

|| \\UTGERS,   |---*O*---
||_// the State | Ryan Novosielski - 
novos...@rutgers.edu<mailto:novos...@rutgers.edu>
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ | Office of Advanced Research Computing - MSB C630, Newark
`'

On Jul 25, 2019, at 06:51, David Baker 
mailto:d.j.ba...@soton.ac.uk>> wrote:


Hello,


I'm experimenting with node weights and I'm very puzzled by what I see. Looking 
at the documentation I gathered that jobs will be allocated to the nodes with 
the lowest weight which satisfies their requirements. I have 3 nodes in a 
partition and I have defined the nodes like so..


NodeName=orange01 Procs=48 Sockets=8 CoresPerSocket=6 ThreadsPerCore=1 
RealMemory=1018990 State=UNKNOWN Weight=50
NodeName=orange[02-03] Procs=48 Sockets=8 CoresPerSocket=6 ThreadsPerCore=1 
RealMemory=1018990 State=UNKNOWN


So, given that the default weight is 1 I would expect jobs to be allocated to 
orange02 and orange03 first. I find, however, that my test job is always 
allocated to orange01 with the higher weight. Have I overlooked something? I 
would appreciate your advice, please.




Re: [slurm-users] Hide Filesystem From Slurm

2019-07-11 Thread Ryan Novosielski
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

There are some plugins of one sort or another (at least one is SPANK)
that can create a private temporary directory. We've been considering
implementing something like this but have not gotten around to it.
"systemd" also offers some of this functionality.

On 7/11/19 11:19 AM, Douglas Duckworth wrote:
> Hello
> 
> I am wondering if it's possible to hide a file system that's
> world writable on a compute node, logically within Slurm.  That way
> any job a user runs cannot possibly access this file system.
> 
> Essentially we define $TMPDIR as /scratch, which Slurm cleans up
> in epilogue scripts, but some users still keep writing to /tmp
> instead which we do not want.  We would use tmpwatch to clean up
> /tmp but I would rather just prevent people from writing to it
> within Slurm.
> 
> Thanks Doug
> 
> Thanks,
> 
> Douglas Duckworth, MSc, LFCS HPC System Administrator Scientific 
> Computing Unit <https://scu.med.cornell.edu> Weill Cornell
> Medicine 1300 York Avenue New York, NY 10065 E:
> d...@med.cornell.edu <mailto:d...@med.cornell.edu> O: 212-746-6305
> F: 212-746-8690

- -- 
 
 || \\UTGERS, |--*O*
 ||_// the State  |Ryan Novosielski - novos...@rutgers.edu
 || \\ University | Sr. Technologist - 973/972.0922 ~*~ RBHS Campus
 ||  \\of NJ  | Office of Advanced Res. Comp. - MSB C630, Newark
  `'
-BEGIN PGP SIGNATURE-

iF0EARECAB0WIQST3OUUqPn4dxGCSm6Zv6Bp0RyxvgUCXSdX8gAKCRCZv6Bp0Ryx
vmEwAKCQa6iG5fW+p/OQ+m2ebOlnlfcnGwCggwi6BXym2hjmtpDDfPqARiTwsws=
=NkJ0
-END PGP SIGNATURE-


Re: [slurm-users] ConstrainRAMSpace=yes and page cache?

2019-06-21 Thread Ryan Novosielski
I’ve suspected for some time that this matters in our environment, though we 
/do/ use GPFS. Maybe any use of local scratch (XFS, local drive) could figure 
in here?

Are there any tips for how to determine easily where the extra memory is coming 
from, for example when the user has specifically constrained the application to a 
certain amount of memory with its own flags, or – to put it another way – to prove 
that it’s not this sort of phenomenon happening?
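
One hedged way to see where the charge is coming from on a node is to read the
job's cgroup accounting directly, assuming cgroup v1 and the usual Slurm
hierarchy (the uid/jobid below are placeholders):

grep -E '^(cache|rss|mapped_file) ' /sys/fs/cgroup/memory/slurm/uid_1234/job_5678/memory.stat
cat /sys/fs/cgroup/memory/slurm/uid_1234/job_5678/memory.max_usage_in_bytes

A large "cache" value relative to "rss" points at page cache (file I/O) rather
than the application's own allocations.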

On Jun 21, 2019, at 13:04, Christopher Samuel 
mailto:ch...@csamuel.org>> wrote:

On 6/13/19 5:27 PM, Kilian Cavalotti wrote:

I would take a look at the various *KmemSpace options in cgroups.conf,
they can certainly help with this.

Specifically I think you'll want:

ConstrainKmemSpace=no

to fix this.  This happens for NFS and Lustre based systems, I don't think it's 
a problem for GPFS as mmfsd has its own pagepool separate to the processes 
address space.

All the best,
Chris
--
 Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA

--

|| \\UTGERS,   |---*O*---
||_// the State |     Ryan Novosielski - 
novos...@rutgers.edu<mailto:novos...@rutgers.edu>
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ | Office of Advanced Research Computing - MSB C630, Newark
`'


Re: [slurm-users] Increasing job priority based on resources requested.

2019-04-18 Thread Ryan Novosielski
This is not an official answer really, but I’ve always just considered this to 
be the way that the scheduler works. It wants to get work completed, so it will 
have a bias toward doing what is possible vs. not (can’t use 239GB of RAM on a 
128GB node). And really, is a higher priority what you want? I’m not so sure. 
How soon will someone figure out that they might get a higher priority based on 
requesting some feature they don’t need?

--

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'

> On Apr 18, 2019, at 5:20 PM, Prentice Bisbal  wrote:
> 
> Slurm-users,
> 
> Is there a way to increase a job's priority based on the resources or 
> constraints it has requested?
> 
> For example, we have a very heterogeneous cluster here: Some nodes only have 
> 1 Gb Ethernet, some have 10 Gb Ethernet, and others have DDR IB. In addition, 
> we have some large memory nodes with RAM amounts ranging from 128 GB up to 
> 512 GB. To allow a user to request IB, I have implemented that as a feature 
> in the node definition so users can request that as a constraint.
> 
> I would like to make it so that if a job requests IB, its priority will go up, 
> or if it requests a lot of memory (specifically memory-per-cpu), its 
> priority will go up proportionately to the amount of memory requested. Is 
> this possible? If so, how?
> 
> I have tried going through the documentation, and googling, but 'priority' is 
> used to discuss job priority so much, I couldn't find any search results 
> relevant to this.
> 
> -- 
> Prentice
> 
> 



Re: [slurm-users] X11 forwarding and VNC?

2019-03-25 Thread Ryan Novosielski
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

If the error message is accurate, the fix may be having the VNC server
not set DISPLAY equal to localhost:10.0 or similar as SSH normally
does these days, but to configure it to set DISPLAY to fqdn:10.0. We
had to do something similar with FastX.

On 3/22/19 9:44 AM, Loris Bennett wrote:
> Hi,
> 
> I'm using 18.08.6-2 and have got X11 forwarding working using the 
> in-built mechanism.  This works fine for users who log in with 'ssh
> -X' and then do 'srun --x11 --pty bash'.
> 
> However, I have users who start a VNC session on the login node and
> when they run the srun command above from an xterm within the VNC
> session, they get
> 
> srun: error: Cannot forward to local display. Can only use X11
> forwarding with network displays.
> 
> Does anyone have any ideas whether this can be made to work and, if
> so, how?

- -- 
 
 || \\UTGERS, |--*O*--------
 ||_// the State  |Ryan Novosielski - novos...@rutgers.edu
 || \\ University | Sr. Technologist - 973/972.0922 ~*~ RBHS Campus
 ||  \\of NJ  | Office of Advanced Res. Comp. - MSB C630, Newark
  `'


Re: [slurm-users] Database Tuning w/SLURM

2019-03-22 Thread Ryan Novosielski
> On Mar 22, 2019, at 4:22 AM, Ole Holm Nielsen  
> wrote:
> 
> On 3/21/19 6:56 PM, Ryan Novosielski wrote:
>>> On Mar 21, 2019, at 12:21 PM, Loris Bennett  
>>> wrote:
>>> 
>>>  Our last cluster only hit around 2.5 million jobs after
>>> around 6 years, so database conversion was never an issue.  For sites
>>> with a higher-throughput things may be different, but I would hope that
>>> at those places, the managers would know the importance of planned
>>> updates and testing.
>> I’d be curious about any database tuning you might have done, or anyone else 
>> here. SchedMD’s guidance is minimal.
>> I’ve ever been impressed with the performance on ours, and I’ve also seen 
>> other sites reporting >24 hour database conversion times.
> 
> Database tuning is actually documented by SchedMD, but you have to find the 
> appropriate pages first ;-)

Yeah, I’ve seen it, but there’s very little information provided (similar to 
what you’ve got listed). The major difference between theirs is the further 
mention of “you might want to increase innodb_buffer_pool_size quite a bit more 
than 1024MB.” In my conversations with SchedMD I more or less asked, “is that 
it? what if it’s still slow, does that mean look somewhere else or keep 
tweaking.” There is also other advice from SchedMD bugs (the one you mention on 
your site included), but many of them are for dramatically different versions 
of MySQL or SlurmDBD and it’s not always easy to tell what still applies. It 
does depend also on the type of access, the size of the DB, etc., but I don’t 
have any other size DB than the size I have; presumably the community knows how 
much is required for whatever kind, or how many years of X amount of jobs can be 
kept before you start to have problems with most tuning settings. I have taken 
some advice from mysqltuner.pl in some cases too, though I’m using basically 
the SchedMD recommendations right now (that thread_cache_size one was mine — 
can’t recall where I found it, but it seemed like a good idea for our workload):

[root@squid ~]# cat /etc/my.cnf.d/slurmdbd.cnf 
[mysqld]
innodb_buffer_pool_size=1G
thread_cache_size=4
innodb_log_file_size = 64M
innodb_lock_wait_timeout = 900

> I have collected Slurm database information in my Wiki page 
> https://wiki.fysik.dtu.dk/niflheim/Slurm_database.  You may want to focus on 
> these sections:
> 
> * MySQL configuration (Innodb configuration)
> 
> * Setting database purge parameters (prune unwanted old database entries)
> 
> * Backup and restore of database (hopefully everyone does this already)
> 
> * Upgrade of MySQL/MariaDB (MySQL versions)
> 
> * Migrate the slurmdbd service to another server (I decided to do that 
> recently)
> 
> I hope this sheds some light on what needs to be considered.

Thanks, it’s helpful to have more information, particularly on purging and the 
migration process (which doesn’t seem complicated, but it’s nice to simply rip 
off the steps as opposed to having to write them :-D).
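
As a sketch of the purge side, slurmdbd.conf takes entries along these lines 
(retention values here are invented; pick what suits your site and make sure 
backups exist before the first purge runs):

ArchiveJobs=yes
ArchiveDir=/var/spool/slurm/archive
PurgeJobAfter=12months
PurgeStepAfter=12months
PurgeEventAfter=6months
PurgeResvAfter=6months
PurgeSuspendAfter=6months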

The tug-of-war on our system comes from SlurmDBD often needing quite a bit of 
memory itself for certain operations, and it sits on the MySQL server. I 
sometimes wonder whether it might not be better to colocate SlurmDBD with 
slurmctld, separating them both from the MySQL server.

PS: mainly for Prentice, Ole’s site has the thread from this list that 
mentioned the very large DB upgrade time:
https://lists.schedmd.com/pipermail/slurm-users/2018-February/000612.html — we 
tested the DB upgrade first independently because of that risk.

--

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'



[slurm-users] Database Tuning w/SLURM (was: Re: SLURM heterogeneous jobs, a little help needed plz)

2019-03-21 Thread Ryan Novosielski
> On Mar 21, 2019, at 12:21 PM, Loris Bennett  
> wrote:
> 
>  Our last cluster only hit around 2.5 million jobs after
> around 6 years, so database conversion was never an issue.  For sites
> with a higher-throughput things may be different, but I would hope that
> at those places, the managers would know the importance of planned
> updates and testing.

I’d be curious about any database tuning you might have done, or anyone else 
here. SchedMD’s guidance is minimal.

I’ve ever been impressed with the performance on ours, and I’ve also seen other 
sites reporting >24 hour database conversion times.

--

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'





Re: [slurm-users] SLURM heterogeneous jobs, a little help needed plz

2019-03-21 Thread Ryan Novosielski
> On Mar 21, 2019, at 11:26 AM, Prentice Bisbal  wrote:
> On 3/20/19 1:58 PM, Christopher Samuel wrote:
>> On 3/20/19 4:20 AM, Frava wrote:
>> 
>>> Hi Chris, thank you for the reply.
>>> The team that manages that cluster is not very fond of upgrading SLURM, 
>>> which I understand.
> 
> As a system admin who manages clusters myself, I don't understand this. Our 
> job is to provide and maintain resources for our users. Part of that 
> maintenance is to provide updates for security, performance, and 
> functionality (new features) reasons. HPC has always been a leading-edge kind 
> if field, so I feel this is even more important for HPC admins.
> 
> Yes, there can be issues caused by updates, but those can be with proper 
> planning: Have a plan to do the actual upgrade, have a plan to test for 
> issues, and have a plan to revert to an earlier version if issues are 
> discovered. This is work, but it's really not all that much work, and this is 
> exactly the work we are being paid to do as cluster admins.
> 
> From my own experience, I find *not* updating in a timely manner is actually 
> more problematic and more work than keep on top of updates. For example, 
> where I work now, we still haven't upgraded to CentOS 7, and as a result, 
> many basic libraries are older than what many of the open-source apps my 
> users need require. As a result, I don't just have to install application X, 
> I often have to install up-to-date versions of basic libraries like 
> libreadline, libcurses, zlib, etc. And then there are the security concerns...
> 
> Okay, rant over. I'm sorry. It just bothers me when I hear fellow system 
> admins aren't "very fond" of things that I think are a core responsbility of 
> our jobs. I take a lot of pride on my job.

All of those things take time, depending on where you work (not necessarily 
speaking about my current employer/employment situation), you may be ordered to 
do something else with that time. If so, all bets are off. Planned updates 
where sufficient testing time is not allotted moves the associated work from 
planned work to unplanned emergency (something broken, etc.), and in some cases 
from business hours to off hours, generate lots of support queries, etc.

I’ve never seen a paycheck signed by “Best Practices”.

--

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'




Re: [slurm-users] Topology configuration questions:

2019-01-22 Thread Ryan Novosielski
Prentice (and others) — if the NodeWeight/topology plugin interaction bothers 
you, feel free to tack onto bug 6384.

https://bugs.schedmd.com/show_bug.cgi?id=6384

> On Jan 22, 2019, at 1:15 PM, Prentice Bisbal  wrote:
> 
> Killian,
> 
> Thanks for the input. Unfortunately, all of this information from you, Ryan 
> and others, is really ruining my plans, since it makes it look like my plan 
> to fix a problem wit my cluster will not be as easy to fix as I'd hoped. One 
> of the issues with my "Frankencluster" is that I'd like to assign jobs to 
> different nodes based on the network they're on (1 GbE, 10 GbE, IB), along 
> with other criteria, such as features requested.
> 
> I think it might be best if I write a longer e-mail to this list describing 
> my cluster architecture, the problems I'm trying to address, and different 
> possible approaches, and then get this list's feedback.
> 
> Prentice
> 
> On 1/18/19 11:53 AM, Kilian Cavalotti wrote:
>> On Fri, Jan 18, 2019 at 6:31 AM Prentice Bisbal  wrote:
>>>> Note that if you care about node weights (eg. NodeName=whatever001 
>>>> Weight=2, etc. in slurm.conf), using the topology function will disable 
>>>> it. I believe I was promised a warning about that in the future in a 
>>>> conversation with SchedMD.
>>> Well, that's going to be a big problem for me. One of the goals of me
>>> overhauling our Slurm config is to take advantage of the node weighting
>>> function to prioritize certain hardware over others in our very
>>> heterogeneous cluster.
>> I've heard that too (that enabling the Topology plugin would disable
>> node weighting), but I don't think it's accurate, both from the
>> documentation and from observation.
>> 
>> The doc actually says (https://slurm.schedmd.com/topology.html)
>> 
>> """
>> NOTE:Slurm first identifies the network switches which provide the
>> best fit for pending jobs and then selectes the nodes with the lowest
>> "weight" within those switches. If optimizing resource selection by
>> node weight is more important than optimizing network topology then do
>> NOT use the topology/tree plugin.
>> """
>> 
>> So the Topology plugin does take precedence over the weighting
>> algorithm, but it doesn't disable it, AFAIK. And for sites using
>> disjoint networks, as we do, this is a sane behavior.
>> 
>> Cheers,
> 

--

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'



Re: [slurm-users] Topology configuration questions:

2019-01-18 Thread Ryan Novosielski
The documentation indicates you need it everywhere:

https://slurm.schedmd.com/topology.conf.html

"Changes to the configuration file take effect upon restart of Slurm daemons, 
daemon receipt of the SIGHUP signal, or execution of the command "scontrol 
reconfigure" unless otherwise noted."

I have vague memories of not being able to schedule any jobs if it’s missing, 
but it’s been awhile now.

> On Jan 17, 2019, at 4:52 PM, Prentice Bisbal  wrote:
> 
> And a follow-up question: Does topology.conf need to be on all the nodes, or 
> just the slurm controller? It's not clear from that web page. I would assume 
> only the controller needs it.
> 
> Prentice
> 
> On 1/17/19 4:49 PM, Prentice Bisbal wrote:
>> From https://slurm.schedmd.com/topology.html:
>> 
>>> Note that compute nodes on switches that lack a common parent switch can be 
>>> used, but no job will span leaf switches without a common parent (unless 
>>> the TopologyParam=TopoOptional option is used). For example, it is legal to 
>>> remove the line "SwitchName=s4 Switches=s[0-3]" from the above 
>>> topology.conf file. In that case, no job will span more than four compute 
>>> nodes on any single leaf switch. This configuration can be useful if one 
>>> wants to schedule multiple physical clusters as a single logical cluster 
>>> under the control of a single slurmctld daemon.
>> 
>> My current environment falls into the category of multiple physical clusters 
>> being treated as a single logical cluster under the control of a single 
>> slurmctld daemon. At least, that's my goal.
>> 
>> In my environment, I have 2 "clusters" connected by their own separate IB 
>> fabrics, and one "cluster" connected with 10 GbE. I have a fourth cluster 
>> connected with only 1GbE. For this 4th cluster, we don't want jobs to span 
>> nodes, due to the slow performance of 1 GbE. (This cluster is intended for 
>> serial and low-core count parallel jobs) If I just leave those nodes out of 
>> the topology.conf file, will that have the desired effect of not allocating 
>> multi-node jobs to those nodes, or will it result in an error of some sort?

--

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'



Re: [slurm-users] Topology configuration questions:

2019-01-18 Thread Ryan Novosielski


> On Jan 18, 2019, at 11:53 AM, Kilian Cavalotti 
>  wrote:
> 
> On Fri, Jan 18, 2019 at 6:31 AM Prentice Bisbal  wrote:
>>> Note that if you care about node weights (eg. NodeName=whatever001 
>>> Weight=2, etc. in slurm.conf), using the topology function will disable it. 
>>> I believe I was promised a warning about that in the future in a 
>>> conversation with SchedMD.
>> 
>> Well, that's going to be a big problem for me. One of the goals of me
>> overhauling our Slurm config is to take advantage of the node weighting
>> function to prioritize certain hardware over others in our very
>> heterogeneous cluster.
> 
> I've heard that too (that enabling the Topology plugin would disable
> node weighting), but I don't think it's accurate, both from the
> documentation and from observation.
> 
> The doc actually says (https://slurm.schedmd.com/topology.html)
> 
> """
> NOTE: Slurm first identifies the network switches which provide the
> best fit for pending jobs and then selects the nodes with the lowest
> "weight" within those switches. If optimizing resource selection by
> node weight is more important than optimizing network topology then do
> NOT use the topology/tree plugin.
> """
> 
> So the Topology plugin does take precedence over the weighting
> algorithm, but it doesn't disable it, AFAIK. And for sites using
> disjoint networks, as we do, this is a sane behavior.

I’m not sure if that’s a change, or whether that was always the behavior, but 
as a practical matter, it still really defeats the node weight. We have a fully 
defined topology for two different clusters, and it happens that the switch 
with the smallest number of connected nodes has the most specialized equipment 
(usually the login nodes, a couple of high memory nodes, and a few CUDA nodes). 
If someone runs a single node job, the job will favor that switch. I can think 
of a few ways to work around that, I guess, but by default, the behavior seems 
to be roughly the inverse of the node weights.

--

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'




Re: [slurm-users] Topology configuration questions:

2019-01-17 Thread Ryan Novosielski
I don’t actually know the answer to this one, but we have it provisioned to all 
nodes.

Note that if you care about node weights (eg. NodeName=whatever001 Weight=2, 
etc. in slurm.conf), using the topology function will disable it. I believe I 
was promised a warning about that in the future in a conversation with SchedMD.

> On Jan 17, 2019, at 4:52 PM, Prentice Bisbal  wrote:
> 
> And a follow-up question: Does topology.conf need to be on all the nodes, or 
> just the slurm controller? It's not clear from that web page. I would assume 
> only the controller needs it.
> 
> Prentice
> 
> On 1/17/19 4:49 PM, Prentice Bisbal wrote:
>> From https://slurm.schedmd.com/topology.html:
>> 
>>> Note that compute nodes on switches that lack a common parent switch can be 
>>> used, but no job will span leaf switches without a common parent (unless 
>>> the TopologyParam=TopoOptional option is used). For example, it is legal to 
>>> remove the line "SwitchName=s4 Switches=s[0-3]" from the above 
>>> topology.conf file. In that case, no job will span more than four compute 
>>> nodes on any single leaf switch. This configuration can be useful if one 
>>> wants to schedule multiple physical clusters as a single logical cluster 
>>> under the control of a single slurmctld daemon.
>> 
>> My current environment falls into the category of multiple physical clusters 
>> being treated as a single logical cluster under the control of a single 
>> slurmctld daemon. At least, that's my goal.
>> 
>> In my environment, I have 2 "clusters" connected by their own separate IB 
>> fabrics, and one "cluster" connected with 10 GbE. I have a fourth cluster 
>> connected with only 1GbE. For this 4th cluster, we don't want jobs to span 
>> nodes, due to the slow performance of 1 GbE. (This cluster is intended for 
>> serial and low-core count parallel jobs) If I just leave those nodes out of 
>> the topology.conf file, will that have the desired effect of not allocating 
>> multi-node jobs to those nodes, or will it result in an error of some sort?
>> 
> 





Re: [slurm-users] Topology configuration questions:

2019-01-17 Thread Ryan Novosielski
> On Jan 17, 2019, at 4:49 PM, Prentice Bisbal  wrote:
> 
> From https://slurm.schedmd.com/topology.html:
> 
>> Note that compute nodes on switches that lack a common parent switch can be 
>> used, but no job will span leaf switches without a common parent (unless the 
>> TopologyParam=TopoOptional option is used). For example, it is legal to 
>> remove the line "SwitchName=s4 Switches=s[0-3]" from the above topology.conf 
>> file. In that case, no job will span more than four compute nodes on any 
>> single leaf switch. This configuration can be useful if one wants to 
>> schedule multiple physical clusters as a single logical cluster under the 
>> control of a single slurmctld daemon.
> 
> My current environment falls into the category of multiple physical clusters 
> being treated as a single logical cluster under the control of a single 
> slurmctld daemon. At least, that's my goal.
> 
> In my environment, I have 2 "clusters" connected by their own separate IB 
> fabrics, and one "cluster" connected with 10 GbE. I have a fourth cluster 
> connected with only 1GbE. For this 4th cluster, we don't want jobs to span 
> nodes, due to the slow performance of 1 GbE. (This cluster is intended for 
> serial and low-core count parallel jobs) If I just leave those nodes out of 
> the topology.conf file, will that have the desired effect of not allocating 
> multi-node jobs to those nodes, or will it result in an error of some sort?

It will print a warning:

[2019-01-10T12:41:32.457] TOPOLOGY: warning -- no switch can reach all nodes 
through its descendants.Do not use route/topology

…which sort of makes it sound like it’s going to ignore the topology plugin, 
but I believe it works (and the documentation sure indicates it does).


--

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'





Re: [slurm-users] salloc with bash scripts problem

2019-01-02 Thread Ryan Novosielski
> On Jan 2, 2019, at 3:49 PM, Mark Hahn  wrote:
> 
>> [mahmood@rocks7 ~]$ salloc -n1 hostname
>> salloc: Granted job allocation 278
>> rocks7.jupiterclusterscu.com
>> salloc: Relinquishing job allocation 278
>> salloc: Job allocation 278 has been revoked.
>> [mahmood@rocks7 ~]$
>> 
>> As you can see whenever I run salloc, I see the rocks7 prompt which is the
>> login node.
> 
> this is precisely as expected.  salloc allocates; srun runs.
> 
> to get a compute node do this instead:
> salloc srun hostname
> 
> if you actually want to srun an interactive shell each time,
> why are you not using SallocDefaultCommand as others have suggested?
> 
> you earlier mentioned wanting to run an X-requiring script.  why not just:
> salloc --x11 srun ./whateveryourscriptwas

For that matter, however, what’s the advantage of “salloc --x11 srun” vs. just 
"srun --x11”?

--

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'



Re: [slurm-users] salloc with bash scripts problem

2019-01-02 Thread Ryan Novosielski
I don’t think that’s true (and others have shared documentation regarding 
interactive jobs and the S commands). There was documentation shared for how 
this works, and it seems as if it has been ignored.

[novosirj@amarel2 ~]$ salloc -n1
salloc: Pending job allocation 83053985
salloc: job 83053985 queued and waiting for resources
salloc: job 83053985 has been allocated resources
salloc: Granted job allocation 83053985
salloc: Waiting for resource configuration
salloc: Nodes slepner012 are ready for job

This is the behavior I’ve always seen. If I include a command at the end of the 
line, it appears to simply run it in the “new” shell that is created by salloc 
(which you’ll notice you can exit via CTRL-D or exit).

[novosirj@amarel2 ~]$ salloc -n1 hostname
salloc: Pending job allocation 83054458
salloc: job 83054458 queued and waiting for resources
salloc: job 83054458 has been allocated resources
salloc: Granted job allocation 83054458
salloc: Waiting for resource configuration
salloc: Nodes slepner012 are ready for job
amarel2.amarel.rutgers.edu
salloc: Relinquishing job allocation 83054458

You can, however, tell it to srun something in that shell instead:

[novosirj@amarel2 ~]$ salloc -n1 srun hostname
salloc: Pending job allocation 83054462
salloc: job 83054462 queued and waiting for resources
salloc: job 83054462 has been allocated resources
salloc: Granted job allocation 83054462
salloc: Waiting for resource configuration
salloc: Nodes node073 are ready for job
node073.perceval.rutgers.edu
salloc: Relinquishing job allocation 83054462

When you use salloc, it starts an allocation and sets up the environment:

[novosirj@amarel2 ~]$ env | grep SLURM
SLURM_NODELIST=slepner012
SLURM_JOB_NAME=bash
SLURM_NODE_ALIASES=(null)
SLURM_MEM_PER_CPU=4096
SLURM_NNODES=1
SLURM_JOBID=83053985
SLURM_NTASKS=1
SLURM_TASKS_PER_NODE=1
SLURM_JOB_ID=83053985
SLURM_SUBMIT_DIR=/cache/home/novosirj
SLURM_NPROCS=1
SLURM_JOB_NODELIST=slepner012
SLURM_CLUSTER_NAME=amarel
SLURM_JOB_CPUS_PER_NODE=1
SLURM_SUBMIT_HOST=amarel2.amarel.rutgers.edu
SLURM_JOB_PARTITION=main
SLURM_JOB_NUM_NODES=1

If you run “srun” subsequently, it will run on the compute node, but a regular 
command will run right where you are:

[novosirj@amarel2 ~]$ srun hostname
slepner012.amarel.rutgers.edu

[novosirj@amarel2 ~]$ hostname
amarel2.amarel.rutgers.edu

Again, I’d advise Mahmood to read the documentation that was already provided. 
It really doesn’t matter what behavior is requested — that’s not what this 
command does. If one wants to run a script on a compute node, the correct 
command is sbatch. I’m not sure what advantage salloc with srun has. I assume 
it’s so you can open an allocation and then occasionally send srun commands 
over to it.

--

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'

> On Jan 2, 2019, at 12:20 PM, Terry Jones  wrote:
> 
> I know very little about how SLURM works, but this sounds like it's a 
> configuration issue - that it hasn't been configured in a way that indicates 
> the login nodes cannot also be used as compute nodes. When I run salloc on 
> the cluster I use, I *always* get a shell on a compute node, never on the 
> login node that I ran salloc on.
> 
> Terry
> 
> 
> On Wed, Jan 2, 2019 at 4:56 PM Mahmood Naderan  wrote:
> Currently, users run "salloc --spankx11 ./qemu.sh" where qemu.sh is a script 
> to run a qemu-system-x86_64 command.
> When user (1) runs that command, the qemu is run on the login node since the 
> user is accessing the login node. When user (2) runs that command, his qemu 
> process is also running on the login node and so on.
> 
> That is not what I want!
> I expected slurm to dispatch the jobs on compute nodes.
> 
> 
> Regards,
> Mahmood
> 
> 
> 
> 
> On Wed, Jan 2, 2019 at 7:39 PM Renfro, Michael  wrote:
> Not sure what the reasons behind “have to manually ssh to a node”, but salloc 
> and srun can be used to allocate resources and run commands on the allocated 
> resources:
> 
> Before allocation, regular commands run locally, and no Slurm-related 
> variables are present:
> 
> =
> 
> [renfro@login ~]$ hostname
> login
> [renfro@login ~]$ echo $SLURM_TASKS_PER_NODE
> 
> 
> =
> 
> After allocation, regular commands still run locally, Slurm-related variables 
> are present, and srun runs commands on the allocated node (my prompt change 
> inside a job is a local thing, not done by default):
> 
> =
> 
> [renfro@login ~]$ salloc
> salloc: Granted job allocation 147867
> [renfro@login(job 147867) ~]$ hostname
> login
> [renfro@login(job 147867) ~]

Re: [slurm-users] Wedged nodes from cgroups, OOM killer, and D state process

2018-12-07 Thread Ryan Novosielski
This is only somewhat relevant, but the scenario presents itself similarly. This is 
not in a scheduler environment, but we have an interactive server that would 
have PS hangs on certain tasks (top -bn1 is a way around that, BTW, if it’s 
hard to even find out what the process is). For us, it appeared to be a process 
that was using a lot of memory that khugepaged was attempting to manipulate.

https://access.redhat.com/solutions/46111

I have never seen this happen on 7.x that I can recall. On our 6.x machine 
where we’ve seen it happen, all we did was this:

echo "madvise" > /sys/kernel/mm/redhat_transparent_hugepage/defrag

…in /etc/rc.local (which I hate, but I’m not sure where else that can go — 
maybe on the boot command line). This prevented nearly 100% of our problems.
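
If you'd rather not rely on rc.local, the same thing can presumably be done on 
the kernel command line, e.g. appended to the kernel line in /boot/grub/grub.conf 
on EL6 (untested on our side):

transparent_hugepage=madvise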

No idea if that has anything to do with your situation.

> On Nov 29, 2018, at 1:27 PM, Christopher Benjamin Coffey 
>  wrote:
> 
> Hi,
> 
> We've been noticing an issue with nodes from time to time that become 
> "wedged", or unusable. This is a state where ps, and w hang. We've been 
> looking into this for a while when we get time and finally put some more 
> effort into it yesterday. We came across this blog which describes almost the 
> exact scenario:
> 
> https://rachelbythebay.com/w/2014/10/27/ps/
> 
> It has nothing to do with Slurm, but it does have to do with cgroups which we 
> have enabled. It appears that processes that have hit their ceiling for 
> memory and should be killed by oom-killer, and are in D state at the same 
> time, cause the system to become wedged. For each node wedged, I've found a 
> job out in:
> 
> /cgroup/memory/slurm/uid_3665/job_15363106/step_batch
> - memory.max_usage_in_bytes
> - memory.limit_in_bytes
> 
> The two files are the same bytes, which I'd think would be a candidate for 
> oom-killer. But memory.oom_control says:
> 
> oom_kill_disable 0
> under_oom 0
> 
> My feeling is that the process was in D state, the oom-killer tried to be 
> invoked, but then didn't and the system became wedged.
> 
> Has anyone run into this? If so, whats the fix? Apologies if this has been 
> discussed before, I haven't noticed it on the group.
> 
> I wonder if it’s a bug in the oom-killer? Maybe it's been patched in a more 
> recent kernel but looking at the kernels in the 6.10 series it doesn't look 
> like a newer one would have a patch for a oom-killer bug.
> 
> Our setup is:
> 
> Centos 6.10
> 2.6.32-642.6.2.el6.x86_64
> Slurm 17.11.12
> 
> And /etc/slurm/cgroup.conf
> ConstrainCores=yes
> ConstrainRAMSpace=yes
> ConstrainSwapSpace=yes
> 
> Cheers,
> Chris
> 
> —
> Christopher Coffey
> High-Performance Computing
> Northern Arizona University
> 928-523-1167
> 
> 

--

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'



Re: [slurm-users] How to check the percent cpu of a job?

2018-11-21 Thread Ryan Novosielski
Olm’s “pestat” script does allow you to get similar information, but I’m 
interested to see if indeed there’s a better answer. I’ve used his script for 
more or less the same reason, to see if the jobs are using the resources 
they’re allocated. They show at a node level though, and then you have to look 
closer. For example:

Print only nodes that are flagged by * (RED nodes)
Hostname   Partition    Node    Num_CPU  CPUload  Memsize  Freemem  Joblist
                        State   Use/Tot             (MB)     (MB)   JobId User ...

 gpu003    oarc         drng*     8  12   58.06*    64000    24507  82565618 yc567
...
 hal0027   kopp_1       alloc    28  28    8.64*   128000   115610  82591085 mes373 82595703 aek119

You can see, both of the above are examples of jobs that have allocated CPU 
numbers that are very different from the ultimate CPU load (the first one using 
way more than allocated, though they’re in a cgroup so theoretically isolated 
from the other users on the machine), and the second one asking for all 28 CPUs 
but only “using” ~8 of them.

If you’re using cgroups, it would seem to me that there must also be a way to 
see the output of “top” for just a group, or at least something similar. 
systemd-cgtop does more or less that, but doesn’t seem to show exactly what 
you’d want here:

Path                              Tasks   %CPU   Memory  Input/s  Output/s
/                                   306  900.6     9.8G        -         -
/slurm                                -      -     3.7G        -         -
/slurm/uid_140780                     -      -     3.0G        -         -
/slurm/uid_140780/job_82591085        -      -     3.0G        -         -
/slurm/uid_142473                     -      -   374.7M        -         -
/slurm/uid_142473/job_82595703        -      -   374.7M        -         -

…CPU only being shown as an aggregate at the top level (sorry about the 
formatting).
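
One rough workaround, if all you want is a per-job view rather than a live 
%CPU column: collect the PIDs under the job's cgroup and hand them to ps. A 
sketch, with the uid/job numbers taken from the output above and the cgroup v1 
memory hierarchy assumed to be mounted in the usual place:

pids=$(find /sys/fs/cgroup/memory/slurm/uid_140780/job_82591085 \
         -name cgroup.procs -exec cat {} + | sort -un | paste -sd, -)
ps -o pid,pcpu,rss,comm -p "$pids"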

> On Nov 21, 2018, at 1:27 PM, 宋亚磊  wrote:
> 
> Hi Jing, thank you! 
> 
> The following command shows us the CPU load of the node,
> 
> $ scontrol show node   | grep CPULoad
> 
> but I want the percent CPU of the job, like top or ps.
> For example, a job is allocated 10 CPUs but just uses 2, so the percent
> CPU should be 200%, not 1000%. I want to know this.
> 
> Anyway, thank you again, Jing.
> 
> Best regards,
> Yalei
> 
>> -----Original Message-----
>> From: "Jing Gong" 
>> Sent: 2018-11-22 02:04:59 (Thursday)
>> To: "Slurm User Community List" 
>> Cc: 
>> Subject: Re: [slurm-users] How to check the percent cpu of a job?
>> 
>> Hi,
>> 
>>> How to check the percent cpu of a job in slurm? 
>> 
>> We use command "scontrol" likes
>> 
>> $ scontrol show node   | grep CPULoad
>> ...
>>   CPUAlloc=48 CPUErr=0 CPUTot=48 CPULoad=25.32
>> ...
>> 
>> Regards, Jing 
>> 
>> 
>> From: slurm-users  on behalf of 宋亚磊 
>> 
>> Sent: Wednesday, November 21, 2018 18:51
>> To: slurm-users@lists.schedmd.com
>> Subject: [slurm-users] How to check the percent cpu of a job?
>> 
>> Hello everyone,
>> 
>> How to check the percent CPU of a job in slurm? I tried sacct, sstat, and 
>> squeue, but I can't find how to check.
>> Can someone help me?
>> 
>> Best regards,
>> Yalei
>> 

--

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'



Re: [slurm-users] Updated Slurm tool "pestat" (Processor Element status)

2018-11-21 Thread Ryan Novosielski
Thanks Ole! I am quite fond of your utilities — thank you for providing them. 

Sent from my iPhone

> On Nov 21, 2018, at 08:51, Ole Holm Nielsen  
> wrote:
> 
> Dear Slurm users,
> 
> The Slurm tool "pestat" (Processor Element status) has been enhanced due to a 
> user request.  Now pestat will display an additional available GRES column 
> for the nodes if the -G flag is used.  This is useful if your nodes have GPUs 
> installed.
> 
> The pestat tool prints a Slurm cluster nodes status with 1 line per node and 
> job info.  Download pestat from 
> https://github.com/OleHolmNielsen/Slurm_tools/tree/master/pestat
> 
> The pestat tool has many options:
> 
> # pestat -h
> Usage: pestat [-p partition(s)] [-u username] [-g groupname]
>[-q qoslist] [-s statelist] [-n/-w hostlist] [-j joblist] [-G] [-N]
>[-f | -F | -m free_mem | -M free_mem ] [-1|-2] [-d] [-E] [-C|-c] [-V] [-h]
> where:
>-p partition: Select only partion 
>-u username: Print only user 
>-g groupname: Print only users in UNIX group 
>-q qoslist: Print only QOS in the qoslist 
>-R reservationlist: Print only node reservations 
>-s statelist: Print only nodes with state in 
>-n/-w hostlist: Print only nodes in hostlist
>-j joblist: Print only nodes in job 
>-G: Print GRES (Generic Resources) in addition to JobId
>-N: Print JobName in addition to JobId
>-f: Print only nodes that are flagged by * (unexpected load etc.)
>-F: Like -f, but only nodes flagged in RED are printed.
>-m free_mem: Print only nodes with free memory LESS than free_mem MB
>-M free_mem: Print only nodes with free memory GREATER than free_mem MB 
> (under-utilized)
>-d: Omit nodes with states: down drained
>-1: Default: Only 1 line per node (unique nodes in multiple partitions are 
> printed once only)
>-2: 2..N lines per node which participates in multiple partitions
>-E: Job EndTime is printed after each jobid/user
>-C: Color output is forced ON
>-c: Color output is forced OFF
>-h: Print this help information
>-V: Version information
> 
> My monitoring of jobs is usually done simply with "pestat -F", and also with 
> "pestat -s mix".
> 
> /Ole
> 
> 
> -- 
> Ole Holm Nielsen
> PhD, Senior HPC Officer
> Department of Physics, Technical University of Denmark
> 


Re: [slurm-users] Job allocating more CPUs than requested

2018-09-21 Thread Ryan Novosielski

On 09/21/2018 11:22 PM, Chris Samuel wrote:
> On Saturday, 22 September 2018 2:53:58 AM AEST Nicolas Bock wrote:
> 
>> shows as requesting 1 CPU when in queue, but then allocates all 
>> CPU cores once running. Why is that?
> 
> Do you mean that Slurm expands the cores requested to all the cores
> on the node or allocates the node in exclusive mode, or do you mean
> that the code inside the job uses all the cores on the node instead
> of what was requested?
> 
> The latter is often the case for badly behaved codes and that's why
> using cgroups to contain applications is so important.

I apologize for potentially thread hijacking here, but it's in the
spirit of the original question I guess.

We constrain using cgroups, and occasionally someone will request 1
core (-n1 -c1) and then run something that asks for way more
cores/threads, or that tries to use the whole machine. They won't
succeed obviously. Is this any sort of problem? It seems to me that
trying to run 24 threads on a single core might generate some sort of
overhead, and that I/O could be increased, but I'm not sure. What I do
know is that if someone does this -- let's say in the extreme by
running something -n24 that itself tries to run 24 threads in each
task -- and someone uses the other 23 cores, you'll end up with a load
average near 24*24+23. Does this make any difference? We have NHC set
to offline such nodes, but that affects job preemption. What sort of
choices do others make in this area?

- -- 
 
 || \\UTGERS, |--*O*--------
 ||_// the State  |Ryan Novosielski - novos...@rutgers.edu
 || \\ University | Sr. Technologist - 973/972.0922 ~*~ RBHS Campus
 ||  \\of NJ  | Office of Advanced Res. Comp. - MSB C630, Newark
  `'


[slurm-users] "Owner" field in scontrol show node?

2018-08-08 Thread Ryan Novosielski
Does anyone have any idea or a pointer to documentation about what the node 
“owner” field is in “scontrol show node ” like the below (set out by 
*s):

[root@hal0099 ~]# scontrol show node hal0097
NodeName=hal0097 Arch=x86_64 CoresPerSocket=16
   CPUAlloc=0 CPUErr=0 CPUTot=32 CPULoad=0.01
   AvailableFeatures=hal,skylake,edr
   ActiveFeatures=hal,skylake,edr
   Gres=(null)
   NodeAddr=hal0097 NodeHostName=hal0097 Version=17.11
   OS=Linux 3.10.0-693.21.1.el7.x86_64 #1 SMP Wed Mar 7 19:03:37 UTC 2018
   RealMemory=192000 AllocMem=0 FreeMem=184792 Sockets=2 Boards=1
   State=DOWN+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=20 *Owner=N/A* 
MCS_label=N/A
   Partitions=main,bg,oarc,test
   BootTime=2018-08-07T16:59:02 SlurmdStartTime=2018-08-07T17:00:33
   CfgTRES=cpu=32,mem=187.50G,billing=32
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=HDRT #1019681 [root@2018-08-06T12:14:44]

Thanks!

--

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'





Re: [slurm-users] "fatal: can't stat gres.conf"

2018-07-23 Thread Ryan Novosielski
> On Jul 23, 2018, at 10:31 PM, Ian Mortimer  wrote:
> 
> On Tue, 2018-07-24 at 02:19 +0000, Ryan Novosielski wrote:
> 
>> Best off running nvidia-persistenced. Handles all of this stuff as a
>> side effect, and also enables persistence mode, provided you don’t
>> configure it otherwise. 
> 
> Yes.  But you have to ensure it starts before slurmd.

While true, I don’t find I need to take any special precaution on my machines. 
Probably prudent to set a systemd dependency though.
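
If you do want to make it explicit, a systemd drop-in for slurmd is probably 
the cleanest way (sketch; assumes the stock unit names):

# /etc/systemd/system/slurmd.service.d/nvidia.conf
[Unit]
Wants=nvidia-persistenced.service
After=nvidia-persistenced.service

# then: systemctl daemon-reload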

--

|| \\UTGERS, |---*O*-------
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'



Re: [slurm-users] Is It Possible to change the node order for different partition

2018-06-27 Thread Ryan Novosielski
Bear in mind that at least, last I knew, using a topology plugin will cause all 
of this to be ignored and, at least in our case, for SLURM to do roughly the 
opposite of what you want regarding weights (for us, this is because the 
highest weighted equipment was connected to the switch with the least connected 
equipment, so it tended to aim any job that would fit there to those nodes).
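
For context, the weights in question are just the Weight= field on the node 
definitions in slurm.conf, e.g. (names and numbers invented; lower weight is 
allocated first):

NodeName=std[001-100]   CPUs=32 RealMemory=192000 Weight=1
NodeName=himem[001-004] CPUs=32 RealMemory=768000 Weight=10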

> On Jun 27, 2018, at 12:42 AM, Brian Andrus  wrote:
> 
> Yes, if you put a weight parameter to the nodes.
> 
> From the manpage:
> 
> Weight
> 
> The priority of the node for scheduling purposes. All things being equal, 
> jobs will be allocated the nodes with the lowest weight which satisfies their 
> requirements. For example, a heterogeneous collection of nodes might be 
> placed into a single partition for greater system utilization, responsiveness 
> and capability. It would be preferable to allocate smaller memory nodes 
> rather than larger memory nodes if either will satisfy a job's requirements. 
> The units of weight are arbitrary, but larger weights should be assigned to 
> nodes with more processors, memory, disk space, higher processor speed, etc. 
> Note that if a job allocation request can not be satisfied using the nodes 
> with the lowest weight, the set of nodes with the next lowest weight is added 
> to the set of nodes under consideration for use (repeat as needed for higher 
> weight values). If you absolutely want to minimize the number of higher 
> weight nodes allocated to a job (at a cost of higher scheduling overhead), 
> give each node a distinct Weight value and they will be added to the pool of 
> nodes being considered for scheduling individually. The default value is 1.
> 
> Brian Andrus
> 
> 
> On 6/26/2018 3:06 PM, Bill wrote:
>> Hi Everyone,
>> 
>> For example, I have two partitions, high,low each has same nodes node[1-10], 
>> When we submit job to high partition the nodes order is 
>> node1,node2..node10,  when we submit job to low partition, the order is 
>> node10,node9..node1.
>> 
>> Is it possible to do that?
>> 
>> Thanks in advance,
>> Bill
> 





Re: [slurm-users] Areas for improvement on our site's cluster scheduling

2018-05-07 Thread Ryan Novosielski
One of these TRES-related ones in a QOS ought to do it:

https://slurm.schedmd.com/resource_limits.html

Your problem there, though, is that you will eventually have stuff waiting to run 
even when the system is idle. We had the same circumstance and the same eventual 
outcome.
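
As a concrete (made-up) example of the TRES approach, a per-user cap attached 
to a QOS would look something like:

sacctmgr modify qos normal set MaxTRESPerUser=cpu=256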

--

|| \\UTGERS,   |---*O*---
||_// the State | Ryan Novosielski - 
novos...@rutgers.edu<mailto:novos...@rutgers.edu>
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ | Office of Advanced Research Computing - MSB C630, Newark
`'

On May 8, 2018, at 00:43, Jonathon A Anderson 
<jonathon.ander...@colorado.edu<mailto:jonathon.ander...@colorado.edu>> wrote:

We have two main issues with our scheduling policy right now. The first is an 
issue that we call "queue stuffing." The second is an issue with interactive 
job availability. We aren't confused about why these issues exist, but we 
aren't sure the best way to address them.

I'd love to hear any suggestions on how other sites address these issues. 
Thanks for any advice!


## Queue stuffing

We use multifactor scheduling to provide account-based fairshare scheduling as 
well as standard fifo-style job aging. In general, this works pretty well, and 
accounts meet their scheduling targets; however, every now and again, we have a 
user who has a relatively high-throughput (not HPC) workload that they're 
willing to wait a significant period of time for. They're low-priority work, 
but they put a few thousand jobs into the queue, and just sit and wait. 
Eventually the job aging makes the jobs so high-priority, compared to the 
fairshare, that they all _as a set_ become higher-priority than the rest of the 
work on the cluster. Since they continue to age as the other jobs continue to 
age, these jobs end up monopolizing the cluster for days at a time, as their 
high volume of relatively small jobs use up a greater and greater percentage of 
the machine.

In Moab I'd address this by limiting the number of jobs the user could have 
*eligible* at any given time; but it appears that the only option for slurm is 
limiting the number of jobs a user can *submit*, which isn't as nice a user 
experience and can lead to some pathological user behaviors (like users running 
cron jobs that wake repeatedly and submit more jobs automatically).
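
For concreteness, the submit-side limit referred to here is something like the 
following (numbers arbitrary):

sacctmgr modify user someuser set MaxSubmitJobs=500
sacctmgr modify qos normal set MaxSubmitJobsPerUser=500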


## Interactive job availability

I'm becoming increasingly convinced that holding some portion of our resource 
aside as dedicated for relatively short, small, interactive jobs is a unique 
good; but I'm not sure how best to implement it. My immediate thought was to 
use a reservation with the DAILY and REPLACE flags. I particularly like the 
idea of using the REPLACE flag here as we could keep a flexible amount of 
resources available irrespective of how much was actually being used for the 
purpose at any given time; but it doesn't appear that there's any way to limit 
the per-user use of resources *within* a reservation; so if we created such a 
reservation and granted all users access to it, any individual user would be 
capable of consuming all resources in the reservation anyway. I'd have a 
dedicated "interactive" qos or similar to put such restrictions on; but there 
doesn't appear to be a way to then limit the use of the reservation to only 
jobs with that qos. (Aside from job_submit scripts or similar. Please correct 
me if I'm wrong.)

In lieu of that, I'm leaning towards having a dedicated interactive partition 
that we'd manually move some resources to; but that's a bit less flexible.


Re: [slurm-users] ReqNodeNotAvail, but none of nodes in partition are listed.

2018-05-07 Thread Ryan Novosielski
Fewer. ;)

I think rumor had it that there were plans for some improvement in this area 
(you might check the bugs or this mailing list — I can’t remember where I saw 
it, but it was awhile back now), because ReqNodeNotAvail almost never means 
something useful, and reservations don’t actually generate any message 
whatsoever that would indicate that they are there. Almost 100% of the time we 
see questions about this at our site, it’s a reservation doing it, and 
sometimes even the person who set the reservation doesn’t figure it out.

> On May 7, 2018, at 5:32 PM, Prentice Bisbal <pbis...@pppl.gov> wrote:
> 
> Dang it. That's it. I recently changed the default time limit on some of my 
> partitions, to only 48 hours. I have a reservation that starts on Friday at 5 
> PM. These jobs are all assigned to partitions that still have longer time 
> limits. I forgot that not all partitions have the new 48-hour limit.
> 
> Still, Slurm should provide a better error message for that situation, since 
> I'm sure it's not that uncommon for this to happen. It would certainly result 
> in a lot less tickets being sent to me.
> 
> Prentice Bisbal
> Lead Software Engineer
> Princeton Plasma Physics Laboratory
> http://www.pppl.gov
> 
> On 05/07/2018 05:11 PM, Ryan Novosielski wrote:
>> In my experience, it may say that even if it has nothing to do with the 
>> reason the job isn’t running, if there are nodes on the system that aren’t 
>> available.
>> 
>> I assume you’ve checked for reservations?
>> 
>>> On May 7, 2018, at 5:06 PM, Prentice Bisbal <pbis...@pppl.gov> wrote:
>>> 
>>> Dear Slurm Users,
>>> 
>>> On my cluster, I have several partitions, each with their own QOS, time 
>>> limits, etc.
>>> 
>>> Several times today, I've received complaints from users that they 
>>> submitted jobs to a partition with available nodes, but jobs are stuck in 
>>> the PD state. I have spent the majority of my day investigating this, but 
>>> haven't turned up anything meaningful. Both jobs show the "ReqNodeNotAvail" 
>>> reason, but none of the nodes listed at not available are even in the 
>>> partition these jobs are submitted to. Neither job has requested a specific 
>>> node, either.
>>> 
>>> I have checked slurmctld.log on the server, and have not been able to find 
>>> any clues. Any where else I should look? Any ideas what could be causing 
>>> this?
>> --
>> 
>> || \\UTGERS,  
>> |---*O*---
>> ||_// the State   | Ryan Novosielski - novos...@rutgers.edu
>> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
>> ||  \\of NJ   | Office of Advanced Research Computing - MSB C630, 
>> Newark
>>  `'
>> 
> 
> 

--

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'





Re: [slurm-users] ReqNodeNotAvail, but none of nodes in partition are listed.

2018-05-07 Thread Ryan Novosielski
In my experience, it may say that even if it has nothing to do with the reason 
the job isn’t running, if there are nodes on the system that aren’t available.

I assume you’ve checked for reservations?

> On May 7, 2018, at 5:06 PM, Prentice Bisbal <pbis...@pppl.gov> wrote:
> 
> Dear Slurm Users,
> 
> On my cluster, I have several partitions, each with their own QOS, time 
> limits, etc.
> 
> Several times today, I've received complaints from users that they submitted 
> jobs to a partition with available nodes, but jobs are stuck in the PD state. 
> I have spent the majority of my day investigating this, but haven't turned up 
> anything meaningful. Both jobs show the "ReqNodeNotAvail" reason, but none of 
> the nodes listed at not available are even in the partition these jobs are 
> submitted to. Neither job has requested a specific node, either.
> 
> I have checked slurmctld.log on the server, and have not been able to find 
> any clues. Any where else I should look? Any ideas what could be causing this?

--

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'





Re: [slurm-users] Finding / compiling "pam_slurm.so" for Ubuntu 16.04

2018-05-04 Thread Ryan Novosielski
No problem.

For anyone else reading this thread, this will be the case for whichever 
package requires MySQL support, which I think may be a companion package to 
SlurmDBD rather than the package itself (you'll need the appropriate -dev/-devel 
package), and I believe one or two others.
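
Once pam_slurm.so is built and packaged, it typically gets wired into sshd's 
PAM stack on the compute nodes along these lines (placement is illustrative; 
check where your packaging actually installs the module, which I believe is 
/lib/x86_64-linux-gnu/security on Ubuntu 16.04):

# /etc/pam.d/sshd -- deny SSH to users without a running job on the node
account    required     pam_slurm.so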

--

|| \\UTGERS,   |---*O*---
||_// the State | Ryan Novosielski - 
novos...@rutgers.edu<mailto:novos...@rutgers.edu>
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ | Office of Advanced Research Computing - MSB C630, Newark
`'

On May 4, 2018, at 18:24, Will Dennis 
<wden...@nec-labs.com<mailto:wden...@nec-labs.com>> wrote:

Yes! That was it. I needed to install ‘libpam0g-dev’ (pkg description: 
Development files for PAM)

Then after running “./configure, make, make contrib” again  –

pkgbuilder@mlbuild02:~/test-build/slurm-16.05.4$ find . -name "pam_slurm.so" 
-print
./contribs/pam/.libs/pam_slurm.so
pkgbuilder@mlbuild02:~/test-build/slurm-16.05.4$ file 
./contribs/pam/.libs/pam_slurm.so
./contribs/pam/.libs/pam_slurm.so: ELF 64-bit LSB shared object, x86-64, 
version 1 (SYSV), dynamically linked, 
BuildID[sha1]=9f00a1ca513188adad900980a832dbb8a0b9ddcb, not stripped

Should be able to re-package and distribute now.

Thanks so much!

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
Ryan Novosielski
Sent: Friday, May 04, 2018 5:52 PM
To: Slurm User Community List
Subject: Re: [slurm-users] Finding / compiling "pam_slurm.so" for Ubuntu 16.04

On RedHat-based distributions, there were packages that would not be produced 
if the required libraries/headers were not available. So you will possibly need 
a package to be installed on the host where you are building this SLURM 
packages that is called something like some libpam-dev — I don’t quite remember 
the naming convention on Debian-based systems. I think the build process might 
work the same way though: don’t build what can’t be built/is missing 
dependencies.

You might also be able to install the required build dependencies by finding 
out what the currently installed PAM package is and doing apt build-dep 
.

Sorry, I’m not near a computer right now, but this might help you search for 
info.

--

|| \\UTGERS,   
|---*O*---
||_// the State | Ryan Novosielski - 
novos...@rutgers.edu<mailto:novos...@rutgers.edu>
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ | Office of Advanced Research Computing - MSB C630, Newark



Re: [slurm-users] Slurm overhead

2018-04-24 Thread Ryan Novosielski
I would likely crank up the debugging on the slurmd process and look at the log 
files to see what’s going on in that time. You could also watch the job via top 
or other means (on Linux, you can press “1” to see line-by-line for each CPU 
core), or use strace on the process itself. Presumably something is happening 
that’s either eating up 4 minutes, or the job is running 4 minutes more slowly 
and you’ll need to figure out why. I know that our jobs run via the scheduler 
perform about on par for the hardware, and that jobs start fairly immediately.

> On Apr 22, 2018, at 2:06 AM, Mahmood Naderan <mahmood...@gmail.com> wrote:
> 
> I ran some other tests and got nearly the same results. That 4
> minutes in my previous post means about 50% overhead. So, 24000
> minutes on direct run is about 35000 minutes via slurm. I will post
> with details later. the methodology I used is
> 
> 1- Submit a job to a specific node (compute-0-0) via slurm on the
> frontend and get te elapsed run time (or add  time command in the
> script)
> 2- ssh to the specific node (compute-0-0) and directly run the program
> with time command.
> 
> So, the hardware is the same. I have to say that the frontend has
> little differences from compute-0-0, but that is not important because,
> as I said before, the program is installed in /usr and not on the shared
> file system.
> 
> 
> I think the slurm process which queries the node to collect runtime
> information is not negligible. For example, squeue updates the runtime
> every second. How can I tell slurm not to query so often? For
> example, update the node information every 10 seconds. Though I am not
> sure how much effect that has.
> 
> 
> Regards,
> Mahmood
> 
> On Fri, Apr 20, 2018 at 10:39 AM, Loris Bennett
> <loris.benn...@fu-berlin.de> wrote:
>> Hi Mahmood,
>> Rather than the overhead being 50%, maybe it is just 4 minutes.  If
>> another job runs for a week, that might not be a problem.  In addition,
>> you just have one data point, so it is rather difficult to draw any
>> conclusion.
>> 
>> However, I think that it is unlikely that Slurm is responsible for
>> this difference.  What can happen is that, if a node is powered down
>> before the job starts, then the clock starts ticking as soon as the job
>> is assigned to the node.  This means that the elapsed time also includes
>> the time for the node to be provisioned.  If this is not relevant in
>> your case, then you are probably just not comparing like with like,
>> e.g. is the hardware underlying /tmp identical in both cases?

--

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'





Re: [slurm-users] Automatically migrating jobs to different partitions?

2018-03-22 Thread Ryan Novosielski
If I’m not mistaken, you may submit with multiple partitions specified and it 
will run on the one that makes the most sense.
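
That is, something like this (partition names invented):

sbatch --partition=partA,partB,partC job.sh

or, in the job script itself:

#SBATCH --partition=partA,partB,partC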

> On Mar 22, 2018, at 5:29 PM, Alexander John Mamach 
> <alex.mam...@northwestern.edu> wrote:
> 
> Hi all,
> 
> I’ve been looking into a way to automatically migrate queued jobs from one 
> partition to another. For example, if someone submits in partition A and must 
> wait for resources, move their job request to partition B and try to run, and 
> if they must still wait, then try partition C, etc?

--

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'





Re: [slurm-users] maxim number of pending jobs

2018-03-08 Thread Ryan Novosielski

On 03/08/2018 10:29 AM, Ole Holm Nielsen wrote:
> On 03/08/2018 04:00 PM, Renat Yakupov wrote:
>> is there a limit to a maximum number of jobs that can be queued
>> in pending state? If so, how can I find it out?
> 
> Maybe this answers your question?
> 
> scontrol show config | grep MaxJobCount
> 
> If this was your question, you may find my script warn_maxjobs
> useful, see
> https://github.com/OleHolmNielsen/Slurm_tools/tree/master/jobs

What do you do with these? Run them by hand? Schedule them up? Run
from Nagios or similar?

Thanks! Would be better than our current scenario that lets the users
notify us. :-|

- -- 
 
 || \\UTGERS, |--*O*--------
 ||_// the State  |Ryan Novosielski - novos...@rutgers.edu
 || \\ University | Sr. Technologist - 973/972.0922 ~*~ RBHS Campus
 ||  \\of NJ  | Office of Advanced Res. Comp. - MSB C630, Newark
  `'



Re: [slurm-users] Automatically setting OMP_NUM_THREADS=SLURM_CPUS_PER_TASK?

2018-03-06 Thread Ryan Novosielski
Thanks again! I’d seen the second one but not the first one.

> On Mar 6, 2018, at 6:28 PM, Martin Cuma <martin.c...@utah.edu> wrote:
> 
> MKL is trying to be flexible as it has different potential levels of 
> parallelism inside. Having MKL_ and OMP_NUM_THREADS can be beneficial in 
> programs where you may want to use your own OpenMP but restrict MKL's or vice 
> versa.
> 
> A good article on the different options that MKL provides is here:
> http://www.diracprogram.org/doc/release-12/installation/mkl.html
> 
> Official Intel guide is here:
> https://software.intel.com/en-us/articles/recommended-settings-for-calling-intel-mkl-routines-from-multi-threaded-applications
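
To the original subject line: the usual pattern is simply to copy the Slurm 
value into the environment in the batch script, e.g. (a sketch; the application 
name is a placeholder):

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
# fall back to 1 thread if --cpus-per-task was not requested
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
./my_threaded_app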

--

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'




