In short, sacct reports "NODE_FAIL" for jobs that were running when the
Slurm control node failed. Apologies if this has been fixed recently; I'm
still running Slurm 14.11.3 on RHEL 6.5.
In testing what happens when the control node fails and then recovers,
it appears that slurmctld marks a node that had a job running as
non-responsive before actually checking whether that is the case.
In my simple test case, I run
$ srun -N2 sleep 60
I simulate the failure of the control node in another window with
# killall -9 slurmctld slurmdbd
If I restart slurmctld and slurmdbd after the user job completes, sacct
first reports that the job is still running, and then later reports that
the job died with a node failure.
If I restart slurmctld and slurmdbd while the user job is still running,
the user job is terminated with the message
slurmstepd: *** STEP 18743.0 CANCELLED AT 2015-06-02T10:34:46 DUE TO
NODE node09 FAILURE ***
In both cases, the slurmctld.log entries are about the same (job 18745
is the only one that was active when slurmctld went down):
[2015-06-02T10:37:07.940] layouts: loading entities/relations information
[2015-06-02T10:37:07.940] Recovered state of 12 nodes
[2015-06-02T10:37:07.940] Recovered JobID=18741 State=0x7 NodeCnt=0 Assoc=0
[2015-06-02T10:37:07.940] Recovered JobID=18743 State=0x7 NodeCnt=0 Assoc=0
[2015-06-02T10:37:07.940] Recovered JobID=18745 State=0x1 NodeCnt=0 Assoc=0
[2015-06-02T10:37:07.940] Recovered information about 3 jobs
[2015-06-02T10:37:07.940] Killing job 18745 on DOWN node node09
[2015-06-02T10:37:07.940] _sync_nodes_to_jobs updated state of 1 nodes
[2015-06-02T10:37:07.940] init_requeue_policy: kill_invalid_depend is
set to 0
[2015-06-02T10:37:07.940] _sync_nodes_to_comp_job: Job 18745 in
completing state
[2015-06-02T10:37:07.940] _sync_nodes_to_comp_job: completing 1 jobs
[2015-06-02T10:37:07.940] Recovered state of 0 reservations
[2015-06-02T10:37:07.940] read_slurm_conf: backup_controller not specified.
[2015-06-02T10:37:07.940] Running as primary controller
[2015-06-02T10:37:07.940] Registering slurmctld at port 6817 with slurmdbd.
[2015-06-02T10:37:07.945] node_did_resp: node node10 returned to service
[2015-06-02T10:37:07.945] node_did_resp: node node09 returned to service
[2015-06-02T10:37:07.999] Registering slurmctld at port 6817 with slurmdbd.
[2015-06-02T10:37:10.565]
SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=4,partition_job_depth=0
[2015-06-02T10:37:10.945] node node02 returned to service
[2015-06-02T10:37:10.945] node node04 returned to service
At least in the case where the user job is still running, slurmctld
should recognize that and let the job proceed, I would think.
In the case where the job terminated while slurmctld was down, the
correct action is less clear, but calling it a compute node failure
isn't likely what was intended. slurm.conf and topology.conf are listed
below...
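For what it's worth, the "backup_controller not specified" line in the
log suggests one mitigation: a backup controller would at least narrow
the window during which no slurmctld is up. A minimal sketch of the
relevant slurm.conf lines, assuming a hypothetical spare host named
"node-backup" and a state directory visible to both hosts:

```
# Hypothetical slurm.conf fragment -- "node-backup" is a placeholder
# hostname; both controllers must see the same StateSaveLocation.
BackupController=node-backup
StateSaveLocation=/shared/slurm/state
```

That wouldn't explain the spurious NODE_FAIL on recovery, of course;
the recovery path should still handle a job that is alive and well.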
Andy
--- topology.conf ---
# Simulate a more complex switching arrangement than we actually have
SwitchName=s1 Nodes=node[01-08]
SwitchName=s2 Nodes=node[09-12]
SwitchName=s3 Nodes=node[13-16]
SwitchName=top Switches=s[1-3]
--- slurm.conf ---
ClusterName=node
ControlMachine=node
#ControlAddr=
#BackupController=
#BackupAddr=
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/tmp
SlurmdSpoolDir=/tmp/slurmd
SwitchType=switch/none
TopologyPlugin=topology/tree
MpiDefault=pmi2
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid
#PluginDir= /opt/local/slurm/14.03.10/lib64/slurm
CacheGroups=0
#FirstJobId=
ReturnToService=2
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
Epilog=/home/slurm/epilog.sh
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
TaskPlugin=task/affinity
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
#SchedulerPort=
#SchedulerRootFilter=
SelectType=select/linear
## SelectType=select/cons_res
## SelectTypeParameters=CR_CPU
FastSchedule=1
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
JobCompType=jobcomp/none
#JobCompLoc=
#
# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
#JobAcctGatherFrequency=30
#
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=node
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStorageUser=slurm
#
# COMPUTE NODES
#NodeName=node[01-16] Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 State=DOWN
NodeName=node[01-12] Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 State=DOWN
#
# Partitions
#
#PartitionName=all Nodes=node[01-16] Default=yes Shared=Exclusive MaxTime=620 State=UP
PartitionName=all Nodes=node[01-12] Default=yes Shared=Exclusive MaxTime=620 State=UP
--
Andy Riebs
Hewlett-Packard Company
High Performance Computing
+1 404 648 9024
My opinions are not necessarily those of HP