I am not sure if this is the correct place to share this, but maybe someone can point me in the right direction. I recently set up a CentOS 7 based Slurm cluster, but my nodes continuously show either a down or a drained state. The reason given for the drained state is "Low socket*core*thread count". The nodes each have two quad-core Xeon processors without hyperthreading, and the conf file has them configured with 2 sockets, 4 cores per socket and 1 thread per core. Below is the node information (as reported by scontrol show node) and the slurm.conf file.

NodeName=dragonsdenN3 Arch=x86_64 CoresPerSocket=4
   CPUAlloc=0 CPUErr=0 CPUTot=32 CPULoad=0.23
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=192.168.0.7 NodeHostName=dragonsdenN3 Version=16.05
   OS=Linux RealMemory=30000 AllocMem=0 FreeMem=31205 Sockets=2 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=4 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   BootTime=2016-10-04T10:25:42 SlurmdStartTime=2016-10-04T10:26:16
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low socket*core*thread count, Low CPUs [slurm@2016-10-04T10:13:23]
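(The node information above is the output of scontrol show node dragonsdenN3.) Would the right approach be to compare what slurm.conf declares against what slurmd itself detects on the node, and to return the node to service once the two agree? A rough sketch of what I mean, run on the compute node:

   slurmd -C    # prints the detected Sockets/CoresPerSocket/ThreadsPerCore in slurm.conf format
   lscpu        # cross-check against the physical CPU topology

and then, from the controller, after correcting slurm.conf and restarting the daemons:

   scontrol update NodeName=dragonsdenN3 State=RESUME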

ControlMachine=dragonsden
ControlAddr=192.168.0.1
#
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/spool
SwitchType=switch/none
TaskPlugin=task/none
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
#SchedulerPort=7321
SelectType=select/linear
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=dragonsden
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
#SlurmctldDebug=3
#SlurmctldLogFile=
#SlurmdDebug=3
#SlurmdLogFile=
#
#
# COMPUTE NODES
NodeName=dragonsden NodeAddr=192.168.0.1 RealMemory=20000 Sockets=2 CoresPerSocket=4 ThreadsPerCore=4 State=UNKNOWN
NodeName=dragonsdenN1 NodeAddr=192.168.0.5 RealMemory=30000 Sockets=2 CoresPerSocket=4 ThreadsPerCore=4 State=UNKNOWN
NodeName=dragonsdenN2 NodeAddr=192.168.0.6 RealMemory=30000 Sockets=2 CoresPerSocket=4 ThreadsPerCore=4 State=UNKNOWN
NodeName=dragonsdenN3 NodeAddr=192.168.0.7 RealMemory=30000 Sockets=2 CoresPerSocket=4 ThreadsPerCore=4 State=UNKNOWN
NodeName=dragonsdenN4 NodeAddr=192.168.0.8 RealMemory=30000 Sockets=2 CoresPerSocket=4 ThreadsPerCore=4 State=UNKNOWN
NodeName=dragonsdenN5 NodeAddr=192.168.0.10 RealMemory=30000 Sockets=2 CoresPerSocket=4 ThreadsPerCore=4 State=UNKNOWN
PartitionName=debug Nodes=dragonsdenN[1-5] Default=YES MaxTime=INFINITE State=UP
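For what it is worth, my understanding is that a node definition matching the hardware described at the top (2 sockets x 4 cores x 1 thread per core, i.e. 8 CPUs and no hyperthreading) would read roughly as below; this is just a sketch reusing dragonsdenN3's address and memory from above, not what is currently deployed:

   NodeName=dragonsdenN3 NodeAddr=192.168.0.7 RealMemory=30000 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 State=UNKNOWN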
