You may run "sinfo -R" to see the reason each node was set down. ReturnToService=1 cannot recover all down nodes.
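For example, the workflow looks something like this (the node list and reason string below are only illustrative; the columns are sinfo's defaults for -R, and whatever your cluster prints is what slurmctld recorded when it set the node down):

    $ sinfo -R
    REASON               USER      TIMESTAMP           NODELIST
    Not responding       slurm     2013-09-10T10:15:02 VM-[669-671]

    # once the underlying cause is fixed, return the nodes to service:
    $ scontrol update NodeName=VM-[669-671] State=RESUME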
-----Original Message-----
From: "Sivasangari Nandy" <sivasangari.na...@irisa.fr>
Sent: 2013-09-10 20:38:43 (Tuesday)
To: slurm-dev <slurm-dev@schedmd.com>
Cc: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Re: nodes often down

No, it was like that at first, so by default. And yes, I've restarted Slurm, but no change: after 10 minutes the nodes are all down again. Here is my conf file if needed:

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=VM-667
ControlAddr=192.168.2.26
#BackupController=
#BackupAddr=
#
AuthType=auth/munge
CacheGroups=0
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#PrologSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
JobCheckpointDir=/var/lib/slurm-llnl/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#Licenses=foo*4,bar
#MailProg=/usr/bin/mail
#MaxJobCount=5000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/pgid
#Prolog=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=slurm
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFs=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerRootFilter=1
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/linear
#SelectTypeParameters=
#
#
# JOB PRIORITY
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
#DebugFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=VM-[669-671] CPUs=2 Sockets=2 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN
PartitionName=SLURM-debug Nodes=VM-[669-671] Default=YES MaxTime=INFINITE State=UP

From: "Alan V. Cowles" <alan.cow...@duke.edu>
To: "slurm-dev" <slurm-dev@schedmd.com>
Cc: "Sivasangari Nandy" <sivasangari.na...@irisa.fr>
Sent: Tuesday, September 10, 2013 13:17:36
Subject: Re: [slurm-dev] nodes often down

Siva,

Was that the default setting you had in place with your original config, or a change you made recently to combat the downed-nodes problem? And did you restart Slurm or do a reconfigure to re-read the slurm.conf file? I've found some changes don't take effect with a reconfigure, and you have to restart.

AC

On 09/10/2013 04:01 AM, Sivasangari Nandy wrote:

Hello,

My nodes are often in the "down" state, so I have to open sview and manually set all the nodes back to "idle" in order to run jobs. I've seen in the FAQ that I can change the slurm.conf file:

"The configuration parameter ReturnToService in slurm.conf controls how DOWN nodes are handled. Set its value to one in order for DOWN nodes to automatically be returned to service once the slurmd daemon registers with a valid node configuration."

However, in my file ReturnToService is already set to 1.

Thanks in advance,
Siva

--
Sivasangari NANDY - Plate-forme GenOuest
IRISA-INRIA, Campus de Beaulieu
263 Avenue du Général Leclerc
35042 Rennes cedex, France
Tel: +33 (0) 2 99 84 25 69
Office: D152