I am new to SLURM and am trying to configure it for a new cluster.
I have 4 nodes, each with 14 cores. I want to share nodes so that every core can run independently (i.e., node001 can run 14 independent serial jobs at the same time), but no core should ever run more than one job. Going through the documentation, I figured I needed to set

----
SelectType=select/cons_res
SelectTypeParameters=CR_Core
----

So I did that in slurm.conf and restarted slurmctld. But now when I submit a job, I either get an error that the requested node configuration is not available, or the job ends up stuck in the CG (completing) state.

Example 1:

---
[sr@clstr mpitests]$ cat newHello.slrm
#!/bin/sh
#SBATCH --time=00:01:00
#SBATCH -N 1
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=4

module add shared openmpi/gcc/64 slurm
module load somesh/scripts/1.0

mpirun helloMPIf90
---

leads to:

---
[sr@clstr mpitests]$ sbatch -v newHello.slrm
sbatch: defined options for program `sbatch'
sbatch: ----------------- ---------------------
sbatch: user              : `sr'
sbatch: uid               : 1003
sbatch: gid               : 1003
sbatch: cwd               : /home/sr/clusterTests/mpitests
sbatch: ntasks            : 4 (set)
sbatch: nodes             : 1-1
sbatch: jobid             : 4294967294 (default)
sbatch: partition         : default
sbatch: profile           : `NotSet'
sbatch: job name          : `newHello.slrm'
sbatch: reservation       : `(null)'
sbatch: wckey             : `(null)'
sbatch: distribution      : unknown
sbatch: verbose           : 1
sbatch: immediate         : false
sbatch: overcommit        : false
sbatch: time_limit        : 1
sbatch: nice              : -2
sbatch: account           : (null)
sbatch: comment           : (null)
sbatch: dependency        : (null)
sbatch: qos               : (null)
sbatch: constraints       :
sbatch: geometry          : (null)
sbatch: reboot            : yes
sbatch: rotate            : no
sbatch: network           : (null)
sbatch: array             : N/A
sbatch: cpu_freq_min      : 4294967294
sbatch: cpu_freq_max      : 4294967294
sbatch: cpu_freq_gov      : 4294967294
sbatch: mail_type         : NONE
sbatch: mail_user         : (null)
sbatch: sockets-per-node  : -2
sbatch: cores-per-socket  : -2
sbatch: threads-per-core  : -2
sbatch: ntasks-per-node   : 4
sbatch: ntasks-per-socket : -2
sbatch: ntasks-per-core   : -2
sbatch: mem_bind          : default
sbatch: plane_size        : 4294967294
sbatch: propagate         : NONE
sbatch: switches          : -1
sbatch: wait-for-switches : -1
sbatch: core-spec         : NA
sbatch: burst_buffer      : `(null)'
sbatch: remote command    : `/home/sr/clusterTests/mpitests/newHello.slrm'
sbatch: power             :
sbatch: wait              : yes
sbatch: Consumable Resources (CR) Node Selection plugin loaded with argument 4
sbatch: Cray node selection plugin loaded
sbatch: Linear node selection plugin loaded with argument 4
sbatch: Serial Job Resource Selection plugin loaded with argument 4
sbatch: error: Batch job submission failed: Requested node configuration is not available
---

Example 2:

---
[sr@clstr mpitests]$ cat newHello.slrm
#!/bin/sh
#SBATCH --time=00:01:00
#SBATCH -N 1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1

module add shared openmpi/gcc/64 slurm
module load somesh/scripts/1.0

helloMPIf90
---

leads to:

---
[sr@clstr mpitests]$ sbatch -v newHello.slrm
[... same verbose output as in Example 1, except ntasks : 1 (set) and
ntasks-per-node : 1, then ...]
Submitted batch job 108

[sr@clstr mpitests]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               108      defq newHello       sr CG       0:01      1 node001

[sr@clstr mpitests]$ scontrol show job=108
JobId=108 JobName=newHello.slrm
   UserId=sr(1003) GroupId=sr(1003) MCS_label=N/A
   Priority=4294901756 Nice=0 Account=(null) QOS=normal
   JobState=COMPLETING Reason=NonZeroExitCode Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=1:0
   RunTime=00:00:01 TimeLimit=00:01:00 TimeMin=N/A
   SubmitTime=2017-03-03T18:25:51 EligibleTime=2017-03-03T18:25:51
   StartTime=2017-03-03T18:26:01 EndTime=2017-03-03T18:26:02 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=defq AllocNode:Sid=clstr:20260
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=node001 BatchHost=node001
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,node=1
   Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/sr/clusterTests/mpitests/newHello.slrm
   WorkDir=/home/sr/clusterTests/mpitests
   StdErr=/home/sr/clusterTests/mpitests/slurm-108.out
   StdIn=/dev/null
   StdOut=/home/sr/clusterTests/mpitests/slurm-108.out
   Power=
---

In the second example, the job stays in the CG state until I reset the node.
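For completeness, the relevant part of my slurm.conf now looks roughly like this. Treat it as a sketch: the `Sockets`/`CoresPerSocket` split below is my assumption about the hardware layout, and I am reproducing the node and partition lines from memory.

```
# slurm.conf (excerpt) -- sketch, exact socket/core split from memory
SelectType=select/cons_res
SelectTypeParameters=CR_Core

# 4 nodes with 14 cores each
NodeName=node[001-004] Sockets=1 CoresPerSocket=14 ThreadsPerCore=1 State=UNKNOWN
PartitionName=defq Nodes=node[001-004] Default=YES MaxTime=INFINITE State=UP
```

My understanding is that with CR_Core the scheduler allocates individual cores, so the NodeName line has to describe the socket/core layout accurately; if it disagrees with the real hardware, that could perhaps explain the "Requested node configuration is not available" error, but I am not sure.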
If I set slurm.conf back to SelectType=select/linear, everything behaves normally again. I am at a loss as to where I am making a mistake: is it the slurm configuration, my job submission script, or something else entirely?

Also, what do the following settings mean, and why do they show up as negative?

---
sbatch: sockets-per-node  : -2
sbatch: cores-per-socket  : -2
sbatch: threads-per-core  : -2
sbatch: ntasks-per-socket : -2
sbatch: ntasks-per-core   : -2
---

If anyone can point me in the right direction, that would be very helpful.

Thanks in advance,
Somesh