In short, sacct reports "NODE_FAIL" for jobs that were running when the
Slurm control node failed. Apologies if this has been fixed recently; I'm
still running Slurm 14.11.3 on RHEL 6.5.
In testing what happens when the control node fails and then recovers,
it appears that slurmctld marks a node that had a job running as
non-responsive before actually checking whether that is the case.
In my simple test case, I run
$ srun -N2 sleep 60
I simulate the failure of the control node in another window with
# killall -9 slurmctld slurmdbd
If I restart slurmctld and slurmdbd after the user job completes, sacct
first reports that the job is still running, and then later reports that
the job died with a node failure.
If I restart slurmctld and slurmdbd while the user job is still running,
the user job is terminated with the message
slurmstepd: *** STEP 18743.0 CANCELLED AT 2015-06-02T10:34:46 DUE TO
NODE node09 FAILURE ***
In both cases, the slurmctld.log entries are about the same (job 18745
is the only one that was active when slurmctld went down):
[2015-06-02T10:37:07.940] layouts: loading entities/relations information
[2015-06-02T10:37:07.940] Recovered state of 12 nodes
[2015-06-02T10:37:07.940] Recovered JobID=18741 State=0x7 NodeCnt=0 Assoc=0
[2015-06-02T10:37:07.940] Recovered JobID=18743 State=0x7 NodeCnt=0 Assoc=0
[2015-06-02T10:37:07.940] Recovered JobID=18745 State=0x1 NodeCnt=0 Assoc=0
[2015-06-02T10:37:07.940] Recovered information about 3 jobs
[2015-06-02T10:37:07.940] Killing job 18745 on DOWN node node09
[2015-06-02T10:37:07.940] _sync_nodes_to_jobs updated state of 1 nodes
[2015-06-02T10:37:07.940] init_requeue_policy: kill_invalid_depend is
set to 0
[2015-06-02T10:37:07.940] _sync_nodes_to_comp_job: Job 18745 in
completing state
[2015-06-02T10:37:07.940] _sync_nodes_to_comp_job: completing 1 jobs
[2015-06-02T10:37:07.940] Recovered state of 0 reservations
[2015-06-02T10:37:07.940] read_slurm_conf: backup_controller not specified.
[2015-06-02T10:37:07.940] Running as primary controller
[2015-06-02T10:37:07.940] Registering slurmctld at port 6817 with slurmdbd.
[2015-06-02T10:37:07.945] node_did_resp: node node10 returned to service
[2015-06-02T10:37:07.945] node_did_resp: node node09 returned to service
[2015-06-02T10:37:07.999] Registering slurmctld at port 6817 with slurmdbd.
[2015-06-02T10:37:10.565]
SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=4,partition_job_depth=0
[2015-06-02T10:37:10.945] node node02 returned to service
[2015-06-02T10:37:10.945] node node04 returned to service
At least in the case where the user job is still running, slurmctld
should recognize that and let the job proceed, I would think.
In the case where the job terminated while slurmctld was down, the
correct action is less clear, but calling it a compute node failure
isn't likely what was intended. slurm.conf and topology.conf are listed
below...
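For what it's worth, the "backup_controller not specified" line in the
log suggests one mitigation: a backup controller would at least narrow
the window during which no slurmctld is up. A minimal sketch of the
relevant slurm.conf lines, assuming a hypothetical spare host named
"node-backup" and a state directory visible to both hosts:

```
# Hypothetical slurm.conf fragment -- "node-backup" is a placeholder
# hostname; both controllers must see the same StateSaveLocation.
BackupController=node-backup
StateSaveLocation=/shared/slurm/state
```

That wouldn't explain the spurious NODE_FAIL on recovery, of course;
the recovery path should still handle a job that is alive and well.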
Andy
--- topology.conf ---
# Simulate a more complex switching arrangement than we actually have
SwitchName=s1 Nodes=node[01-08]
SwitchName=s2 Nodes=node[09-12]
SwitchName=s3 Nodes=node[13-16]
SwitchName=top Switches=s[1-3]
--- slurm.conf ---
ClusterName=node
ControlMachine=node
#ControlAddr=
#BackupController=
#BackupAddr=
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/tmp
SlurmdSpoolDir=/tmp/slurmd
SwitchType=switch/none
TopologyPlugin=topology/tree
MpiDefault=pmi2
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid
#PluginDir= /opt/local/slurm/14.03.10/lib64/slurm
CacheGroups=0
#FirstJobId=
ReturnToService=2
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
Epilog=/home/slurm/epilog.sh
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
TaskPlugin=task/affinity
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
#SchedulerPort=
#SchedulerRootFilter=
SelectType=select/linear
## SelectType=select/cons_res
## SelectTypeParameters=CR_CPU
FastSchedule=1
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
JobCompType=jobcomp/none
#JobCompLoc=
#
# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
#JobAcctGatherFrequency=30
#
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=node
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStorageUser=slurm
#
# COMPUTE NODES
#NodeName=node[01-16] Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 State=DOWN
NodeName=node[01-12] Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 State=DOWN
#
# Partitions
#
#PartitionName=all Nodes=node[01-16] Default=yes Shared=Exclusive MaxTime=620 State=UP
PartitionName=all Nodes=node[01-12] Default=yes Shared=Exclusive MaxTime=620 State=UP
--
Andy Riebs
Hewlett-Packard Company
High Performance Computing
+1 404 648 9024
My opinions are not necessarily those of HP