[slurm-users] Re: Redirect jobs submitted to old partition to new

2024-04-16 Thread Williams, Jenny Avis via slurm-users
For jobs already in default_queue: squeue -t pd -h --Format=jobID | xargs -L1 -I{} scontrol update jobID={} partition=queue1. What version of slurm are you running? In slurm 23.02.5, man slurm.conf under PARTITION CONFIGURATION: Alternate Partition name of alternate parti
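A minimal sketch of that bulk move, keeping the partition names from the quote (the node list is hypothetical); Alternate is the slurm.conf option the quote refers to:

  # slurm.conf: route new submissions away from the old partition
  PartitionName=default_queue State=INACTIVE Alternate=queue1 Nodes=n[001-010]
  # move jobs that are already pending in the old partition
  squeue -p default_queue -t pd -h --Format=jobid | xargs -L1 -I{} scontrol update jobid={} partition=queue1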

[slurm-users] Re: Slurmd enabled crash with CgroupV2

2024-04-11 Thread Williams, Jenny Avis via slurm-users
te, I can not see the bug description. So perhaps with Slurm 24.xx release we'll see something new. cheers josef On 11. 04. 24 19:53, Williams, Jenny Avis wrote: There needs to be a slurmstepd infinity process running before slurmd starts. This doc goes into it: https://slurm.sched

[slurm-users] Re: Slurmd enabled crash with CgroupV2

2024-04-11 Thread Williams, Jenny Avis via slurm-users
There needs to be a slurmstepd infinity process running before slurmd starts. This doc goes into it: https://slurm.schedmd.com/cgroup_v2.html Probably a better way to do this, but this is what we do to deal with that: :: files/slurm-cgrepair.service :: [Unit] Before=slurmd
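The preview cuts off before the unit file body; below is only a sketch of that kind of unit, assuming its job is to delegate the cgroup v2 controllers slurmd needs before the daemon starts. The unit name comes from the quote; the path, body, and controller list are illustrative guesses, not the original file.

  # /etc/systemd/system/slurm-cgrepair.service (sketch)
  [Unit]
  Description=Enable cgroup v2 controllers needed by slurmd
  Before=slurmd.service

  [Service]
  Type=oneshot
  ExecStart=/bin/sh -c 'echo +cpu +cpuset +memory > /sys/fs/cgroup/cgroup.subtree_control'

  [Install]
  WantedBy=multi-user.target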

[slurm-users] Re: Avoiding fragmentation

2024-04-10 Thread Williams, Jenny Avis via slurm-users
Various options might help reduce job fragmentation. Turn up debugging on slurmctld and add DebugFlags such as TraceJobs, SelectType, and Steps. With debugging set high enough one can see a good bit of the logic behind node selection. CR_LLN Schedule
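A sketch of the knobs mentioned above, with illustrative values only; CR_LLN, as quoted, is a SelectTypeParameters option, and the flags can also be toggled on a live controller:

  # slurm.conf (example values)
  SlurmctldDebug=debug2
  DebugFlags=TraceJobs,SelectType,Steps
  SelectType=select/cons_tres
  SelectTypeParameters=CR_Core_Memory,CR_LLN
  # or flip a flag at runtime:
  scontrol setdebugflags +SelectType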

[slurm-users] Re: How to reinstall / reconfigure Slurm?

2024-04-03 Thread Williams, Jenny Avis via slurm-users
Slurm source code should be downloaded and recompiled including the configuration flag --with-nvml. As an example, using the rpmbuild mechanism for recompiling and generating rpms is our current method. Be aware that the compile works only if it finds the prerequisites needed for a given op
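A hedged sketch of both routes. The tarball version, CUDA path, and the assumption that your slurm.spec exposes an nvml build conditional are placeholders to verify against your own source tree; the NVIDIA driver and NVML headers must already be installed so configure can find them:

  rpmbuild -ta slurm-23.11.4.tar.bz2 --with nvml
  # or a plain autotools build:
  ./configure --with-nvml=/usr/local/cuda && make -j && sudo make install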

[slurm-users] Re: Slurm suspend preemption not working

2024-03-15 Thread Williams, Jenny Avis via slurm-users
CPUs are released, but memory is not released on suspend. Try looking at this output and compare allocated Memory before and after suspending a job on a node: sinfo -N -n yourNode --Format=weight:8,nodelist:15,cpusstate:12,memory:8,allocmem:8 From: Verma, Nischey (HPC ENG,RAL,LSCI) via slurm-u
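A quick way to see the behavior the note describes, with a hypothetical job id and node name; AllocMem should stay unchanged after the suspend even though the CPUs show as idle:

  sinfo -N -n c0101 --Format=weight:8,nodelist:15,cpusstate:12,memory:8,allocmem:8
  scontrol suspend 1234567
  sinfo -N -n c0101 --Format=weight:8,nodelist:15,cpusstate:12,memory:8,allocmem:8
  scontrol resume 1234567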

[slurm-users] Re: Jobs being denied for GrpCpuLimit despite having enough resource

2024-03-14 Thread Williams, Jenny Avis via slurm-users
Also -- scontrol show nodes -Original Message- From: Williams, Jenny Avis Sent: Thursday, March 14, 2024 6:46 PM To: Ole Holm Nielsen ; slurm-users@lists.schedmd.com Subject: RE: [slurm-users] Re: Jobs being denied for GrpCpuLimit despite having enough resource I use an alias slist

[slurm-users] Re: Jobs being denied for GrpCpuLimit despite having enough resource

2024-03-14 Thread Williams, Jenny Avis via slurm-users
I use an alias slist = `sed 's/ /\n/g' | sort | uniq` (do not copy/paste lines containing "--"; mail clients may mangle the two hyphens intended). The examples below are for slurm 23.02.7. These commands assume administrator access. This is a generalized set of areas I use to find why things just are not moving
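A runnable form of that helper plus one hypothetical use; GNU sed is assumed so that \n in the replacement becomes a newline:

  alias slist="sed 's/ /\n/g' | sort | uniq"
  # e.g. flatten a job record into one attribute per line:
  scontrol show job 1234567 | slist | grep -i tres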

[slurm-users] Re: RHEL 8.9+SLURM-23.11.3+MLNX_OFED_LINUX-23.10-1.1.9.0+ OpenMPI-5.0.2

2024-02-20 Thread Williams, Jenny Avis via slurm-users
How was your binary compiled? If it is dynamically linked, please reply with the ldd listing of the binary ( ldd binary ) Jenny From: S L via slurm-users Sent: Tuesday, February 20, 2024 10:55 AM To: slurm-users@lists.schedmd.com Subject: [slurm-users] RHEL 8.9+SLURM-23.11.3+MLNX_OFED_LINUX-2
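What that check might look like, with a hypothetical binary name; the point is to confirm which MPI/PMIx/UCX libraries the executable actually resolves:

  ldd ./my_openmpi_app
  # look for libmpi, libpmix, libucx lines and make sure none say "not found"
  ldd ./my_openmpi_app | grep -Ei 'mpi|pmix|ucx'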

Re: [slurm-users] sbatch does not work with Debian image

2023-08-26 Thread Williams, Jenny Avis
Two of likely several possibilities: The slurm master host name does not resolve. The rights on /etc/slurm are such that the user running the command cannot read /etc/slurm/slurm.conf Jenny Williams UNC Chapel Hill From: slurm-users On Behalf Of Sorin Draga Sent: Wednesday, March 15, 2023 4:13
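Two quick checks matching those possibilities, with hypothetical host and user names:

  getent hosts slurm-master                 # does the controller hostname resolve?
  ls -ld /etc/slurm /etc/slurm/slurm.conf   # readable by the submitting user?
  sudo -u someuser sbatch --wrap=hostname   # reproduce as the affected user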

Re: [slurm-users] cgroupv2 + slurmd - external cgroup changes needed to get daemon to start

2023-07-14 Thread Williams, Jenny Avis
me settings as we do helps in your case? Please be aware that you should change JobAcctGatherType only when there are no running job steps! Regards, Hermann On 7/12/23 16:50, Williams, Jenny Avis wrote: > The systems have only cgroup/v2 enabled > # mount |egrep cgroup >

Re: [slurm-users] cgroupv2 + slurmd - external cgroup changes needed to get daemon to start

2023-07-12 Thread Williams, Jenny Avis
Linux distribution are you using? And which kernel version? What is the output of mount | grep cgroup What if you do not restrict the cgroup-version Slurm can use to cgroup/v2 but omit "CgroupPlugin=..." from your cgroup.conf? Regards, Hermann On 7/11/23 19:41, Williams, Jenny

Re: [slurm-users] cgroupv2 + slurmd - external cgroup changes needed to get daemon to start

2023-07-11 Thread Williams, Jenny Avis
Additional configuration information -- /etc/slurm/cgroup.conf CgroupAutomount=yes ConstrainCores=yes ConstrainRAMSpace=yes CgroupPlugin=cgroup/v2 AllowedSwapSpace=1 ConstrainSwapSpace=yes ConstrainDevices=yes From: Williams, Jenny Avis Sent: Tuesday, July 11, 2023 10:47 AM To: slurm-us

[slurm-users] cgroupv2 + slurmd - external cgroup changes needed to get daemon to start

2023-07-11 Thread Williams, Jenny Avis
Progress on getting slurmd to start under cgroupv2. Issue: slurmd 22.05.6 will not start when using cgroupv2. Expected result: even after reboot slurmd will start up without needing to manually add lines to /sys/fs/cgroup files. When started as a service the error is: # systemctl status slurmd * s
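The kind of manual /sys/fs/cgroup edit the post alludes to, shown only as an illustration of what to inspect; whether these exact controllers are the missing ones on a given node is an assumption:

  cat /sys/fs/cgroup/cgroup.controllers        # controllers the kernel offers
  cat /sys/fs/cgroup/cgroup.subtree_control    # controllers delegated to children
  echo "+cpu +cpuset +memory" > /sys/fs/cgroup/cgroup.subtree_control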

[slurm-users] 21.08.6 srun fails with error "Invalid job credential" ; sbatch is fine.

2022-05-13 Thread Williams, Jenny Avis
Yesterday I upgraded slurmdbd and slurmctld nodes from RHEL7 / Slurm v. 20.11.8 to RHEL8.5 / Slurm v. 21.08.6 on our production cluster. I also updated slurm on the rhel7 login nodes to 21.08.6 Sbatch jobs run fine. Srun, however, fails from the updated login node with invalid job credential er

[slurm-users] export qos

2021-12-17 Thread Williams, Jenny Avis
Sacctmgr dump gets the user listings, but I do not see how to dump qos settings. Does anyone know of a quick way to export qos settings for import to a new sched box? Jenny
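One possible workaround, with hypothetical QOS names and limits: dump the QOS table in parseable form on the old server, then replay each entry with sacctmgr on the new one.

  sacctmgr show qos --parsable2 > qos_dump.txt
  # then, per QOS on the new dbd host:
  sacctmgr -i add qos interactive
  sacctmgr -i modify qos interactive set MaxWall=08:00:00 MaxTRESPerUser=cpu=64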

Re: [slurm-users] v.20.11.2 - nodes stay in maint after maintenance reservation done

2020-12-31 Thread Williams, Jenny Avis
Looked closer - figured this one out. Just taking longer on that phase. From: Williams, Jenny Avis Sent: Wednesday, December 30, 2020 9:18 AM To: Slurm User Community List Subject: v.20.11.2 - nodes stay in maint after maintenance reservation done We use a maintenance reservation to process

[slurm-users] v.20.11.2 - nodes stay in maint after maintenance reservation done

2020-12-30 Thread Williams, Jenny Avis
We use a maintenance reservation to process node slurm updates from v.20.02.[3|6] to v. 20.11.2 The last step within the job is to set the node state to drain with reason slurm-updated. Once the job is done the node reboots and the reservation terminates. After the node reboots we check that th
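A sketch of the pieces described, with hypothetical reservation names, times, and node lists; the drain reason is the one quoted:

  # create the maintenance reservation
  scontrol create reservation reservationname=slurm_update starttime=now \
      duration=04:00:00 nodes=ALL flags=maint,ignore_jobs users=root
  # last step inside the per-node update job
  scontrol update nodename=$(hostname -s) state=drain reason=slurm-updated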

[slurm-users] v. 20.11.2 - error: xcpuinfo_abs_to_mac: failed

2020-12-30 Thread Williams, Jenny Avis
Since updating to 20.11.2 from 20.02.3 or 20.02.6, and not before, we are seeing this error for every job in slurmd.log files; there has been no other change besides the slurm version update, including the node configurations. [2020-12-30T07:56:40.692] [9540590.batch] error: xcpuinfo_abs_to_mac

[slurm-users] Slurm v. 20.11.2 AllocNodes behavior change

2020-12-29 Thread Williams, Jenny Avis
When upgrading from 20.02.6 to 20.11.2 a partition that used AllocNodes as the short hostname had to be updated to the FQDN of the node instead. A submit to the partition results in the error $ sbatch -p webportal t.sl sbatch: error: Batch job submission failed: Access/permission denied Expecte
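For reference, the kind of change that resolved it; the FQDN is a placeholder, and the point is simply that after 20.11 the AllocNodes value had to match the fully qualified name:

  scontrol update partitionname=webportal allocnodes=portal01.example.edu
  scontrol show partition webportal | grep -i allocnodes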

Re: [slurm-users] Extremely sluggish squeue -p partition

2020-12-10 Thread Williams, Jenny Avis
We do have one partition that uses AllowGroups instead of AllowAccounts. Testing with that partition closed did not change things. This started Dec 2nd or 3rd - I noticed it on the 3rd. From: slurm-users On Behalf Of Williams, Jenny Avis Sent: Monday, December 7, 2020 11:43 PM To: Slurm User

[slurm-users] Extremely sluggish squeue -p partition

2020-12-07 Thread Williams, Jenny Avis
I have an interesting condition that has been going on for a few days that could use the feedback of those more familiar with the way slurm works under the hood. Conditions: Slurm v20.02.3 The cluster is relatively quiet given the time of year, and the commands are running on the host on which

[slurm-users] 2 topics: segregating patron accounting, and FIFO in a multifactor setup

2020-09-02 Thread Williams, Jenny Avis
Hi all - There are some cases where our researchers wish for behaviors different from our main cluster's configuration for their set of machines. For the first request, where our cluster is set to use PriorityType=priority/multifactor, some wish to have a set of resources that could behave in a

Re: [slurm-users] Slurmstepd errors

2020-08-06 Thread Williams, Jenny Avis
We ran into a similar error -- a response from schedmd: https://bugs.schedmd.com/show_bug.cgi?id=3890 Remediating steps until updates got us past this particular issue: Check for "xcgroup_instantiate" errors and close nodes that show this in the messages log. From the nodes listed here we close com
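A sketch of those remediation steps, with a hypothetical log path and node name:

  # find affected nodes
  grep xcgroup_instantiate /var/log/messages
  # close each node that shows the error until it can be cleaned up
  scontrol update nodename=c0123 state=drain reason=xcgroup_instantiate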

Re: [slurm-users] slurm only looking in "default" partition during scheduling

2020-07-06 Thread Williams, Jenny Avis
You cannot have two default partitions. The slurm.conf is picking up the last of the entries flagged as Default; because the first srun has no partition specified it is being sent to that default, so the first srun is being submitted to the compute partition, and that partitio
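An illustrative slurm.conf fragment showing the rule (partition and node names hypothetical); only one partition should carry Default=YES, and jobs meant for the other partition must ask for it explicitly:

  PartitionName=debug   Nodes=n[001-004] Default=YES MaxTime=01:00:00 State=UP
  PartitionName=compute Nodes=n[005-064] MaxTime=INFINITE State=UP
  srun -p compute hostname    # be explicit when the default is not what you want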

Re: [slurm-users] QOS cutting off users before CPU limit is reached

2020-05-14 Thread Williams, Jenny Avis
Try suspending and resuming the user's pending jobs to force a re-evaluation. If the user is not in the zone of jobs that is evaluated, i.e. if enough higher-priority jobs have dropped in ahead, then this job may not have been evaluated for scheduling since a point in time when the user was indeed p
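One way to force that re-evaluation on pending jobs; this sketch uses hold/release, since scontrol suspend/resume applies to running jobs, and the user name is hypothetical:

  squeue -u someuser -t pd -h -o %i | xargs -r -L1 scontrol hold
  squeue -u someuser -t pd -h -o %i | xargs -r -L1 scontrol release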

Re: [slurm-users] slurmstepd: error: Exceeded job memory limit at some point.

2018-02-15 Thread Williams, Jenny Avis
Here we see this. There is a difference in behavior depending on whether the program runs out of the "standard" NFS or the GPFS filesystem. If the I/O is from NFS, there can be conditions where we see this with some frequency on a given problem. It will not be every time, but it can be reproduced.

Re: [slurm-users] Slurm not starting

2018-01-15 Thread Williams, Jenny Avis
Elisabetta - Start by focusing on slurmctld; slurmd is not happy without it. Start it manually in the foreground, as in /usr/sbin/slurmctld -D -vvv. This assumes slurm.conf is in the default location. Pardon brevity; on my phone. Jenny Williams Sent from Nine

Re: [slurm-users] fail when trying to set up selection=con_res

2017-11-28 Thread Williams, Jenny Avis
We run in that manner using this config on kernel 3.10.0-693.5.2.el7.x86_64. This is slurm 17.02.4. Do your compute nodes have hyperthreading enabled? AuthType=auth/munge CryptoType=crypto/munge AccountingStorageEnforce=limits,qos,safe AccountingStoragePort=ANumber AccountingStorageType=ac
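For context, a minimal consumable-resource sketch of the settings that matter here; node names and counts are hypothetical, and ThreadsPerCore must match whether hyperthreading is actually enabled:

  SelectType=select/cons_res
  SelectTypeParameters=CR_Core_Memory
  NodeName=c[001-010] Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=192000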