Hi Jake,

First, which Slurm version and which OS are you running?

Next, try simplifying things by removing the OverSubscribe configuration. Read the slurm.conf manual page's section on OverSubscribe; it is a bit tricky.
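As a rough sketch only (based on the partition lines you posted, not a tested config), dropping OverSubscribe would leave the partitions looking like:

PartitionName=DEFAULT State=UP
PartitionName=interactive Nodes=compute002 MaxTime=INFINITE
PartitionName=simulation Nodes=compute001 MaxTime=30

Since you already use SelectType=select/cons_tres with SelectTypeParameters=CR_Core_Memory, several jobs can share a node anyway as long as enough unallocated cores and memory remain, so OverSubscribe is not needed just to get more than one job per node.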

RealMemory=1000 is extremely low and might prevent jobs from starting! Run "slurmd -C" on the nodes to get the appropriate node parameters for slurm.conf.
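The output of "slurmd -C" includes a NodeName= line that can be copied almost verbatim into slurm.conf. For illustration only (the numbers below are made up; use whatever your nodes actually report):

$ slurmd -C
NodeName=compute001 CPUs=32 Boards=1 SocketsPerBoard=1 CoresPerSocket=32 ThreadsPerCore=1 RealMemory=64000

You would then give each node its real memory size instead of the 1000 MB default, for example:

NodeName=compute001 CPUs=32 RealMemory=64000
NodeName=compute002 CPUs=2 RealMemory=4000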

I hope this helps.

/Ole


On 26-05-2022 21:12, Jake Jellinek wrote:
Hi

I am just building my first Slurm setup and have got everything running – well, almost.

I have a two-node configuration. My entire setup runs on a single Hyper-V server, and I have divided up its resources to create the VMs.

One node I will use for heavy-duty work; this is called compute001.

One node I will use for normal work; this is called compute002.

My compute node specification in slurm.conf is

NodeName=DEFAULT CPUs=1 RealMemory=1000 State=UNKNOWN

NodeName=compute001 CPUs=32

NodeName=compute002 CPUs=2

The partition specification is

PartitionName=DEFAULT State=UP

PartitionName=interactive Nodes=compute002 MaxTime=INFINITE OverSubscribe=FORCE

PartitionName=simulation Nodes=compute001 MaxTime=30 OverSubscribe=FORCE

I have added the OverSubscribe=FORCE option as I want more than one job to be able to land on my interactive/simulation queues.

All of the nodes and the cluster master start up fine and talk to each other, but no matter what I do, I cannot get my cluster to accept more than one job per node.

Can you help me determine where I am going wrong?

Thanks a lot

Jake

The entire slurm.conf is pasted below

# slurm.conf file generated by configurator.html.

ClusterName=pm-slurm

SlurmctldHost=slurm-master

MpiDefault=none

ProctrackType=proctrack/cgroup

ReturnToService=2

SlurmctldPidFile=/var/run/slurmctld.pid

SlurmctldPort=6817

SlurmdPidFile=/var/run/slurmd.pid

SlurmdPort=6818

SlurmdSpoolDir=/var/spool/slurmd

SlurmUser=slurm

StateSaveLocation=/home/slurm/var/spool/slurmctld

SwitchType=switch/none

TaskPlugin=task/cgroup

#

# TIMERS

InactiveLimit=0

KillWait=30

MinJobAge=300

SlurmctldTimeout=120

SlurmdTimeout=300

Waittime=0

#

# SCHEDULING

SchedulerType=sched/backfill

SelectType=select/cons_tres

SelectTypeParameters=CR_Core_Memory

#

# LOGGING AND ACCOUNTING

JobAcctGatherFrequency=30

JobAcctGatherType=jobacct_gather/cgroup

SlurmctldDebug=info

SlurmctldLogFile=/var/log/slurmctld.log

SlurmdDebug=info

SlurmdLogFile=/var/log/slurmd.log

# COMPUTE NODES

NodeName=DEFAULT CPUs=1 RealMemory=1000 State=UNKNOWN

NodeName=compute001 CPUs=32

NodeName=compute002 CPUs=2

PartitionName=DEFAULT State=UP

PartitionName=interactive Nodes=compute002 MaxTime=INFINITE OverSubscribe=FORCE

PartitionName=simulation Nodes=compute001 MaxTime=30 OverSubscribe=FORCE


