[slurm-users] Re: Redirect jobs submitted to old partition to new

2024-04-16 Thread Williams, Jenny Avis via slurm-users
For jobs already in default_queue: squeue -t pd -h --Format=jobID | xargs -L1 -I{} scontrol update jobID={} partition=queue1. What version of slurm are you running? In slurm 23.02.5, man slurm.conf under PARTITION CONFIGURATION: Alternate Partition name of alternate parti
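A minimal sketch of that bulk move, keeping the partition names from the quote (the node list is hypothetical); Alternate is the slurm.conf option the quote refers to:

  # slurm.conf: route new submissions away from the old partition
  PartitionName=default_queue State=INACTIVE Alternate=queue1 Nodes=n[001-010]
  # move jobs that are already pending in the old partition
  squeue -p default_queue -t pd -h --Format=jobid | xargs -L1 -I{} scontrol update jobid={} partition=queue1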

[slurm-users] Re: Slurmd enabled crash with CgroupV2

2024-04-11 Thread Williams, Jenny Avis via slurm-users
te, I can not see the bug description. So perhaps with Slurm 24.xx release we'll see something new. cheers josef On 11. 04. 24 19:53, Williams, Jenny Avis wrote: There needs to be a slurmstepd infinity process running before slurmd starts. This doc goes into it: https://slurm.sched

[slurm-users] Re: Slurmd enabled crash with CgroupV2

2024-04-11 Thread Williams, Jenny Avis via slurm-users
There needs to be a slurmstepd infinity process running before slurmd starts. This doc goes into it: https://slurm.schedmd.com/cgroup_v2.html Probably a better way to do this, but this is what we do to deal with that: :: files/slurm-cgrepair.service :: [Unit] Before=slurmd
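The preview cuts off before the unit file body; below is only a sketch of that kind of unit, assuming its job is to delegate the cgroup v2 controllers slurmd needs before the daemon starts. The unit name comes from the quote; the path, body, and controller list are illustrative guesses, not the original file.

  # /etc/systemd/system/slurm-cgrepair.service (sketch)
  [Unit]
  Description=Enable cgroup v2 controllers needed by slurmd
  Before=slurmd.service

  [Service]
  Type=oneshot
  ExecStart=/bin/sh -c 'echo +cpu +cpuset +memory > /sys/fs/cgroup/cgroup.subtree_control'

  [Install]
  WantedBy=multi-user.target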

[slurm-users] Re: Avoiding fragmentation

2024-04-10 Thread Williams, Jenny Avis via slurm-users
Various options might help reduce job fragmentation. Turn up debugging on slurmctld and add DebugFlags such as TraceJobs, SelectType, and Steps. With debugging set high enough one can see a good bit of the logic behind node selection. CR_LLN Schedule
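A sketch of the knobs mentioned above, with illustrative values only; CR_LLN, as quoted, is a SelectTypeParameters option, and the flags can also be toggled on a live controller:

  # slurm.conf (example values)
  SlurmctldDebug=debug2
  DebugFlags=TraceJobs,SelectType,Steps
  SelectType=select/cons_tres
  SelectTypeParameters=CR_Core_Memory,CR_LLN
  # or flip a flag at runtime:
  scontrol setdebugflags +SelectType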

[slurm-users] Re: How to reinstall / reconfigure Slurm?

2024-04-03 Thread Williams, Jenny Avis via slurm-users
Slurm source code should be downloaded and recompiled including the configuration flag --with-nvml. As an example, using the rpmbuild mechanism for recompiling and generating rpms is our current method. Be aware that the compile works only if it finds the prerequisites needed for a given op
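A hedged sketch of both routes. The tarball version, CUDA path, and the assumption that your slurm.spec exposes an nvml build conditional are placeholders to verify against your own source tree; the NVIDIA driver and NVML headers must already be installed so configure can find them:

  rpmbuild -ta slurm-23.11.4.tar.bz2 --with nvml
  # or a plain autotools build:
  ./configure --with-nvml=/usr/local/cuda && make -j && sudo make install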

[slurm-users] Re: Slurm suspend preemption not working

2024-03-15 Thread Williams, Jenny Avis via slurm-users
CPUs are released, but memory is not released on suspend. Try looking at this output and compare allocated Memory before and after suspending a job on a node: sinfo -N -n yourNode --Format=weight:8,nodelist:15,cpusstate:12,memory:8,allocmem:8 From: Verma, Nischey (HPC ENG,RAL,LSCI) via slurm-u
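A quick way to see the behavior the note describes, with a hypothetical job id and node name; AllocMem should stay unchanged after the suspend even though the CPUs show as idle:

  sinfo -N -n c0101 --Format=weight:8,nodelist:15,cpusstate:12,memory:8,allocmem:8
  scontrol suspend 1234567
  sinfo -N -n c0101 --Format=weight:8,nodelist:15,cpusstate:12,memory:8,allocmem:8
  scontrol resume 1234567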

[slurm-users] Re: Jobs being denied for GrpCpuLimit despite having enough resource

2024-03-14 Thread Williams, Jenny Avis via slurm-users
Also -- scontrol show nodes -Original Message- From: Williams, Jenny Avis Sent: Thursday, March 14, 2024 6:46 PM To: Ole Holm Nielsen ; slurm-users@lists.schedmd.com Subject: RE: [slurm-users] Re: Jobs being denied for GrpCpuLimit despite having enough resource I use an alias slist

[slurm-users] Re: Jobs being denied for GrpCpuLimit despite having enough resource

2024-03-14 Thread Williams, Jenny Avis via slurm-users
I use an alias slist = `sed 's/ /\n/g' | sort | uniq` (do not copy/paste lines containing "--"; mail clients may mangle the two hyphens intended). The examples below are for slurm 23.02.7. These commands assume administrator access. This is a generalized set of areas I use to find why things just are not moving
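A runnable form of that helper plus one hypothetical use; GNU sed is assumed so that \n in the replacement becomes a newline:

  alias slist="sed 's/ /\n/g' | sort | uniq"
  # e.g. flatten a job record into one attribute per line:
  scontrol show job 1234567 | slist | grep -i tres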

[slurm-users] Re: RHEL 8.9+SLURM-23.11.3+MLNX_OFED_LINUX-23.10-1.1.9.0+ OpenMPI-5.0.2

2024-02-20 Thread Williams, Jenny Avis via slurm-users
How was your binary compiled? If it is dynamically linked, please reply with the ldd listing of the binary ( ldd binary ) Jenny From: S L via slurm-users Sent: Tuesday, February 20, 2024 10:55 AM To: slurm-users@lists.schedmd.com Subject: [slurm-users] RHEL 8.9+SLURM-23.11.3+MLNX_OFED_LINUX-2
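What that check might look like, with a hypothetical binary name; the point is to confirm which MPI/PMIx/UCX libraries the executable actually resolves:

  ldd ./my_openmpi_app
  # look for libmpi, libpmix, libucx lines and make sure none say "not found"
  ldd ./my_openmpi_app | grep -Ei 'mpi|pmix|ucx'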

Re: [slurm-users] sbatch does not work with Debian image

2023-08-26 Thread Williams, Jenny Avis
Two of likely several possibilities: The slurm master host name does not resolve. The rights on /etc/slurm are such that the user running the command cannot read /etc/slurm/slurm.conf Jenny Williams UNC Chapel Hill From: slurm-users On Behalf Of Sorin Draga Sent: Wednesday, March 15, 2023 4:13
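Two quick checks matching those possibilities, with hypothetical host and user names:

  getent hosts slurm-master                 # does the controller hostname resolve?
  ls -ld /etc/slurm /etc/slurm/slurm.conf   # readable by the submitting user?
  sudo -u someuser sbatch --wrap=hostname   # reproduce as the affected user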

Re: [slurm-users] cgroupv2 + slurmd - external cgroup changes needed to get daemon to start

2023-07-14 Thread Williams, Jenny Avis
me settings as we do helps in your case? Please be aware that you should change JobAcctGatherType only when there are no running job steps! Regards, Hermann On 7/12/23 16:50, Williams, Jenny Avis wrote: > The systems have only cgroup/v2 enabled > # mount |egrep cgroup >

Re: [slurm-users] cgroupv2 + slurmd - external cgroup changes needed to get daemon to start

2023-07-12 Thread Williams, Jenny Avis
Linux distribution are you using? And which kernel version? What is the output of mount | grep cgroup What if you do not restrict the cgroup-version Slurm can use to cgroup/v2 but omit "CgroupPlugin=..." from your cgroup.conf? Regards, Hermann On 7/11/23 19:41, Williams, Jenny

Re: [slurm-users] cgroupv2 + slurmd - external cgroup changes needed to get daemon to start

2023-07-11 Thread Williams, Jenny Avis
Additional configuration information -- /etc/slurm/cgroup.conf CgroupAutomount=yes ConstrainCores=yes ConstrainRAMSpace=yes CgroupPlugin=cgroup/v2 AllowedSwapSpace=1 ConstrainSwapSpace=yes ConstrainDevices=yes From: Williams, Jenny Avis Sent: Tuesday, July 11, 2023 10:47 AM To: slurm-us

[slurm-users] cgroupv2 + slurmd - external cgroup changes needed to get daemon to start

2023-07-11 Thread Williams, Jenny Avis
Progress on getting slurmd to start under cgroupv2. Issue: slurmd 22.05.6 will not start when using cgroupv2. Expected result: even after reboot slurmd will start up without needing to manually add lines to /sys/fs/cgroup files. When started as a service the error is: # systemctl status slurmd * s
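The kind of manual /sys/fs/cgroup edit the post alludes to, shown only as an illustration of what to inspect; whether these exact controllers are the missing ones on a given node is an assumption:

  cat /sys/fs/cgroup/cgroup.controllers        # controllers the kernel offers
  cat /sys/fs/cgroup/cgroup.subtree_control    # controllers delegated to children
  echo "+cpu +cpuset +memory" > /sys/fs/cgroup/cgroup.subtree_control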

[slurm-users] 21.08.6 srun fails with error "Invalid job credential" ; sbatch is fine.

2022-05-13 Thread Williams, Jenny Avis
Yesterday I upgraded slurmdbd and slurmctld nodes from RHEL7 / Slurm v. 20.11.8 to RHEL8.5 / Slurm v. 21.08.6 on our production cluster. I also updated slurm on the rhel7 login nodes to 21.08.6 Sbatch jobs run fine. Srun, however, fails from the updated login node with invalid job credential er

[slurm-users] export qos

2021-12-17 Thread Williams, Jenny Avis
Sacctmgr dump gets the user listings, but I do not see how to dump qos settings. Does anyone know of a quick way to export qos settings for import to a new sched box? Jenny
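One possible workaround, with hypothetical QOS names and limits: dump the QOS table in parseable form on the old server, then replay each entry with sacctmgr on the new one.

  sacctmgr show qos --parsable2 > qos_dump.txt
  # then, per QOS on the new dbd host:
  sacctmgr -i add qos interactive
  sacctmgr -i modify qos interactive set MaxWall=08:00:00 MaxTRESPerUser=cpu=64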

Re: [slurm-users] v.20.11.2 - nodes stay in maint after maintenance reservation done

2020-12-31 Thread Williams, Jenny Avis
Looked closer - figured this one out. Just taking longer on that phase. From: Williams, Jenny Avis Sent: Wednesday, December 30, 2020 9:18 AM To: Slurm User Community List Subject: v.20.11.2 - nodes stay in maint after maintenance reservation done We use a maintenance reservation to process

[slurm-users] v.20.11.2 - nodes stay in maint after maintenance reservation done

2020-12-30 Thread Williams, Jenny Avis
We use a maintenance reservation to process node slurm updates from v.20.02.[3|6] to v. 20.11.2 The last step within the job is to set the node state to drain with reason slurm-updated. Once the job is done the node reboots and the reservation terminates. After the node reboots we check that th
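A sketch of the pieces described, with hypothetical reservation names, times, and node lists; the drain reason is the one quoted:

  # create the maintenance reservation
  scontrol create reservation reservationname=slurm_update starttime=now \
      duration=04:00:00 nodes=ALL flags=maint,ignore_jobs users=root
  # last step inside the per-node update job
  scontrol update nodename=$(hostname -s) state=drain reason=slurm-updated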

[slurm-users] v. 20.11.2 - error: xcpuinfo_abs_to_mac: failed

2020-12-30 Thread Williams, Jenny Avis
Since updating to 20.11.2 from 20.02.3 or 20.02.6, and not before, we are seeing this error for every job in slurmd.log files; there has been no other change besides the slurm version update, including the node configurations. [2020-12-30T07:56:40.692] [9540590.batch] error: xcpuinfo_abs_to_mac

[slurm-users] Slurm v. 20.11.2 AllocNodes behavior change

2020-12-29 Thread Williams, Jenny Avis
When upgrading from 20.02.6 to 20.11.2 a partition that used AllocNodes as the short hostname had to be updated to the FQDN of the node instead. A submit to the partition results in the error $ sbatch -p webportal t.sl sbatch: error: Batch job submission failed: Access/permission denied Expecte
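For reference, the kind of change that resolved it; the FQDN is a placeholder, and the point is simply that after 20.11 the AllocNodes value had to match the fully qualified name:

  scontrol update partitionname=webportal allocnodes=portal01.example.edu
  scontrol show partition webportal | grep -i allocnodes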

Re: [slurm-users] Extremely sluggish squeue -p partition

2020-12-10 Thread Williams, Jenny Avis
We do have one partition that uses AllowGroups instead of AllowAccounts. Testing with that partition closed did not change things. This started Dec 2nd or 3rd - I noticed it on the 3rd. From: slurm-users On Behalf Of Williams, Jenny Avis Sent: Monday, December 7, 2020 11:43 PM To: Slurm User

[slurm-users] Extremely sluggish squeue -p partition

2020-12-07 Thread Williams, Jenny Avis
I have an interesting condition that has been going on for a few days that could use the feedback of those more familiar with the way slurm works under the hood. Conditions: Slurm v20.02.3 The cluster is relatively quiet given the time of year, and the commands are running on the host on which

[slurm-users] 2 topics: segregating patron accounting, and FIFO in a multifactor setup

2020-09-02 Thread Williams, Jenny Avis
Hi all - There are some cases where our researchers wish for behaviors different from our main cluster's configuration for their set of machines. For the first request, where our cluster is set to use PriorityType=priority/multifactor, some wish to have a set of resources that could behave in a

Re: [slurm-users] Slurmstepd errors

2020-08-06 Thread Williams, Jenny Avis
We ran into a similar error -- a response from schedmd: https://bugs.schedmd.com/show_bug.cgi?id=3890 Remediating steps until updates got us past this particular issue: Check for "xcgroup_instantiate" errors and close nodes that show this in the messages log. From the nodes listed here we close com
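A sketch of those remediation steps, with a hypothetical log path and node name:

  # find affected nodes
  grep xcgroup_instantiate /var/log/messages
  # close each node that shows the error until it can be cleaned up
  scontrol update nodename=c0123 state=drain reason=xcgroup_instantiate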

Re: [slurm-users] slurm only looking in "default" partition during scheduling

2020-07-06 Thread Williams, Jenny Avis
You cannot have two default partitions. The slurm.conf is picking up the last of the entries flagged as Default; because the first srun has no partition specified it is being sent to that default, so the first srun is being submitted to the compute partition, and that partitio
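An illustrative slurm.conf fragment showing the rule (partition and node names hypothetical); only one partition should carry Default=YES, and jobs meant for the other partition must ask for it explicitly:

  PartitionName=debug   Nodes=n[001-004] Default=YES MaxTime=01:00:00 State=UP
  PartitionName=compute Nodes=n[005-064] MaxTime=INFINITE State=UP
  srun -p compute hostname    # be explicit when the default is not what you want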

Re: [slurm-users] QOS cutting off users before CPU limit is reached

2020-05-14 Thread Williams, Jenny Avis
Try suspending and resuming the user's pending jobs to force a re-evaluation. If the user is not in the zone of jobs that is evaluated, i.e. if enough higher-priority jobs have dropped in ahead, then this job may not have been evaluated for scheduling since a point in time when the user was indeed p
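One way to force that re-evaluation on pending jobs; this sketch uses hold/release, since scontrol suspend/resume applies to running jobs, and the user name is hypothetical:

  squeue -u someuser -t pd -h -o %i | xargs -r -L1 scontrol hold
  squeue -u someuser -t pd -h -o %i | xargs -r -L1 scontrol release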

Re: [slurm-users] slurmstepd: error: Exceeded job memory limit at some point.

2018-02-15 Thread Williams, Jenny Avis
Here we see this. There is a difference in behavior depending on whether the program runs out of the "standard" NFS or the GPFS filesystem. If the I/O is from NFS, there can be conditions where we see this with some frequency on a given problem. It will not be every time, but it can be reproduced.

Re: [slurm-users] Slurm not starting

2018-01-15 Thread Williams, Jenny Avis
Elisabetta - Start by focusing on slurmctld; slurmd is not happy without it. Start it manually in the foreground, as in /usr/sbin/slurmctld -D -vvv. This assumes slurm.conf is in the default location. Pardon brevity; on my phone. Jenny Williams Sent from Nine

Re: [slurm-users] fail when trying to set up selection=con_res

2017-11-28 Thread Williams, Jenny Avis
We run in that manner using this config on kernel 3.10.0-693.5.2.el7.x86_64. This is slurm 17.02.4. Do your compute nodes have hyperthreading enabled? AuthType=auth/munge CryptoType=crypto/munge AccountingStorageEnforce=limits,qos,safe AccountingStoragePort=ANumber AccountingStorageType=ac
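For context, a minimal consumable-resource sketch of the settings that matter here; node names and counts are hypothetical, and ThreadsPerCore must match whether hyperthreading is actually enabled:

  SelectType=select/cons_res
  SelectTypeParameters=CR_Core_Memory
  NodeName=c[001-010] Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=192000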