Hello everybody,

I'm trying to understand an issue I see on two SLURM installations, both on Ubuntu 14.04 64-bit with Slurm 2.6.9 compiled from source. In each case a single computer runs slurmctld, slurmd and slurmdbd. Everything works very well for any job that doesn't last more than 2 days.
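For context, the relevant parts of my slurm.conf look roughly like this (a sketch, not a verbatim copy: the hostname, paths, partition name and GPU count are illustrative; the CPU count and memory match what slurmd reports in the logs below):

    # slurm.conf (excerpt) -- illustrative values, one node runs everything
    ClusterName=graph
    ControlMachine=graph        # same host runs slurmctld, slurmd and slurmdbd
    SlurmctldPidFile=/var/run/slurmctld.pid
    SlurmdPidFile=/var/run/slurmd.pid
    StateSaveLocation=/var/spool/slurm.state  # job state kept across restarts
    ReturnToService=1
    # accounting goes through slurmdbd
    AccountingStorageType=accounting_storage/slurmdbd
    JobAcctGatherType=jobacct_gather/linux
    # GPU support
    GresTypes=gpu
    NodeName=graph CPUs=22 RealMemory=145060 Gres=gpu:1 State=UNKNOWN
    PartitionName=main Nodes=graph Default=YES MaxTime=INFINITE State=UP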
But every 3 days, the slurmctld process restarts by itself, as you can see in slurmctld.log:

[2016-09-26T08:01:42.682] debug2: Performing purge of old job records
[2016-09-26T08:01:42.683] debug: sched: Running job scheduler
[2016-09-26T08:01:42.683] debug2: Performing full system state save
[2016-09-26T08:01:44.003] debug: slurmdbd: DBD_RC is -1 from DBD_FLUSH_JOBS(1408): (null)
[2016-09-26T08:01:44.751] debug2: Sending cpu count of 22 for cluster
[2016-09-26T08:01:44.792] debug: slurmdbd: Issue with call DBD_CLUSTER_CPUS(1407): 4294967295(This cluster hasn't been added to accounting yet)
[2016-09-26T08:01:46.792] debug2: Testing job time limits and checkpoints
[2016-09-26T08:01:47.005] debug3: Writing job id 840 to header record of job_state file
[2016-09-26T08:02:03.003] debug: slurmdbd: DBD_RC is -1 from DBD_FLUSH_JOBS(1408): (null)
[2016-09-26T08:02:16.582] Terminate signal (SIGINT or SIGTERM) received
[2016-09-26T08:02:16.582] debug: sched: slurmctld terminating
[2016-09-26T08:02:16.583] debug3: _slurmctld_rpc_mgr shutting down
[2016-09-26T08:02:16.798] Saving all slurm state
[2016-09-26T08:02:16.801] debug3: Writing job id 840 to header record of job_state file
[2016-09-26T08:02:16.881] debug3: _slurmctld_background shutting down
[2016-09-26T08:02:16.952] slurmdbd: saved 1846 pending RPCs
[2016-09-26T08:02:17.024] Unable to remove pidfile '/var/run/slurmctld.pid': Permission denied
[2016-09-26T08:02:16.582] killing old slurmctld[20343]
[2016-09-26T08:02:17.028] Job accounting information stored, but details not gathered
[2016-09-26T08:02:17.028] slurmctld version 2.6.9 started on cluster graph
[2016-09-26T08:02:17.029] debug3: Trying to load plugin /usr/local/lib/slurm/crypto_munge.so
[2016-09-26T08:02:17.052] Munge cryptographic signature plugin loaded
[2016-09-26T08:02:17.052] debug3: Success.
[2016-09-26T08:02:17.052] debug3: Trying to load plugin /usr/local/lib/slurm/gres_gpu.so
[2016-09-26T08:02:17.060] debug: init: Gres GPU plugin loaded
[2016-09-26T08:02:17.060] debug3: Success.

There is no crontab or anything like that set up; it just restarts by itself.
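Since the log shows a SIGTERM arriving from outside, I tried to rule out an external trigger and found nothing. The checks I ran were along these lines (stock Ubuntu paths assumed; "slurm" as the daemon user is an assumption):

    # crontabs for root and the slurm user
    sudo crontab -l
    sudo crontab -u slurm -l
    ls /etc/cron.d/ /etc/cron.daily/ /etc/cron.weekly/
    # logrotate configs can restart daemons from a postrotate script
    grep -r slurm /etc/logrotate.conf /etc/logrotate.d/ 2>/dev/null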
And the thing is, jobs that run for more than 3 days are killed by this restart (even though Slurm is normally capable of resuming jobs after a controller restart):

[2016-09-26T08:02:01.282] [830] profile signalling type Task
[2016-09-26T08:02:20.632] debug3: in the service_connection
[2016-09-26T08:02:20.634] debug2: got this type of message 1001
[2016-09-26T08:02:20.634] debug2: Processing RPC: REQUEST_NODE_REGISTRATION_STATUS
[2016-09-26T08:02:20.635] debug3: CPUs=22 Boards=1 Sockets=22 Cores=1 Threads=1 Memory=145060 TmpDisk=78723 Uptime=4210729
[2016-09-26T08:02:20.636] debug4: found jobid = 830, stepid = 4294967294
[2016-09-26T08:02:20.637] [830] Called _msg_socket_accept
[2016-09-26T08:02:20.637] [830] Leaving _msg_socket_accept
[2016-09-26T08:02:20.637] [830] eio: handling events for 1 objects
[2016-09-26T08:02:20.637] [830] Called _msg_socket_readable
[2016-09-26T08:02:20.637] [830] Entering _handle_accept (new thread)
[2016-09-26T08:02:20.638] [830] Identity: uid=0, gid=0
[2016-09-26T08:02:20.638] [830] Entering _handle_request
[2016-09-26T08:02:20.639] [830] Got request 5
[2016-09-26T08:02:20.639] [830] Handling REQUEST_STATE
[2016-09-26T08:02:20.639] [830] Leaving _handle_request: SLURM_SUCCESS
[2016-09-26T08:02:20.639] [830] Entering _handle_request
[2016-09-26T08:02:20.639] debug: found apparently running job 830
[2016-09-26T08:02:20.640] [830] Leaving _handle_accept
[2016-09-26T08:02:20.692] debug3: in the service_connection
[2016-09-26T08:02:20.693] debug2: got this type of message 6013
[2016-09-26T08:02:20.693] debug2: Processing RPC: REQUEST_ABORT_JOB
[2016-09-26T08:02:20.694] debug: _rpc_abort_job, uid = 64030
[2016-09-26T08:02:20.694] debug: task_slurmd_release_resources: 830
[2016-09-26T08:02:20.694] debug3: state for jobid 830: ctime:1474633733 revoked:0 expires:0
[2016-09-26T08:02:20.694] debug3: state for jobid 834: ctime:1474839186 revoked:1474840473 expires:1474840473
[2016-09-26T08:02:20.694] debug3: destroying job 834 state
[2016-09-26T08:02:20.694] debug3: state for jobid 835: ctime:1474840474 revoked:1474840567 expires:1474840567
[2016-09-26T08:02:20.694] debug3: destroying job 835 state
[2016-09-26T08:02:20.694] debug3: state for jobid 836: ctime:1474840570 revoked:1474840617 expires:1474840617
[2016-09-26T08:02:20.694] debug3: destroying job 836 state
[2016-09-26T08:02:20.694] debug3: state for jobid 837: ctime:1474840618 revoked:1474840672 expires:1474840672
[2016-09-26T08:02:20.694] debug3: destroying job 837 state
[2016-09-26T08:02:20.694] debug3: state for jobid 838: ctime:1474840672 revoked:1474840747 expires:1474840747
[2016-09-26T08:02:20.694] debug3: destroying job 838 state
[2016-09-26T08:02:20.694] debug3: state for jobid 839: ctime:1474840748 revoked:1474840855 expires:1474840855
[2016-09-26T08:02:20.694] debug3: destroying job 839 state
[2016-09-26T08:02:20.694] debug3: state for jobid 840: ctime:1474840883 revoked:1474840969 expires:1474840969
[2016-09-26T08:02:20.694] debug3: destroying job 840 state
[2016-09-26T08:02:20.695] debug: credential for job 830 revoked
[2016-09-26T08:02:20.696] debug4: found jobid = 830, stepid = 4294967294
[2016-09-26T08:02:20.699] [830] Called _msg_socket_accept
[2016-09-26T08:02:20.699] [830] Leaving _msg_socket_accept
[2016-09-26T08:02:20.699] [830] eio: handling events for 1 objects
[2016-09-26T08:02:20.699] [830] Called _msg_socket_readable
[2016-09-26T08:02:20.700] [830] Entering _handle_accept (new thread)
[2016-09-26T08:02:20.700] [830] Identity: uid=0, gid=0
[2016-09-26T08:02:20.701] [830] Entering _handle_request
[2016-09-26T08:02:20.701] debug2: container signal 997 to job 830.4294967294
[2016-09-26T08:02:20.701] [830] Got request 4
[2016-09-26T08:02:20.701] [830] Handling REQUEST_SIGNAL_CONTAINER
[2016-09-26T08:02:20.701] [830] _handle_signal_container for step=830.4294967294 uid=0 signal=997
[2016-09-26T08:02:20.701] [830] Myname in build_hashtbl: (slurmstepd)
[2016-09-26T08:02:20.723] [830] Sending signal 9 to pid 649 (python)
[2016-09-26T08:02:20.723] [830] Sending signal 9 to pid 648 (slurm_script)
[2016-09-26T08:02:20.723] [830] Sent signal 9 to 830.4294967294
[2016-09-26T08:02:20.724] debug4: found jobid = 830, stepid = 4294967294
[2016-09-26T08:02:20.724] [830] task 0 (648) exited. Killed by signal 9.
[2016-09-26T08:02:20.725] [830] Called _msg_socket_accept
[2016-09-26T08:02:20.725] [830] task_post_term: 830.4294967294, task 0
[2016-09-26T08:02:20.725] [830] Aggregated 1 task exit messages
[2016-09-26T08:02:20.725] [830] sending task exit msg for 1 tasks
[2016-09-26T08:02:20.725] [830] Leaving _msg_socket_accept
[2016-09-26T08:02:20.725] [830] eio: handling events for 1 objects
[2016-09-26T08:02:20.725] [830] Called _msg_socket_readable
[2016-09-26T08:02:20.725] [830] Leaving _handle_request: SLURM_SUCCESS
[2016-09-26T08:02:20.726] [830] Entering _handle_request
[2016-09-26T08:02:20.726] [830] Leaving _handle_accept
[2016-09-26T08:02:20.726] [830] Entering _handle_accept (new thread)
[2016-09-26T08:02:20.727] [830] Myname in build_hashtbl: (slurmstepd)
[2016-09-26T08:02:20.727] [830] Identity: uid=0, gid=0
[2016-09-26T08:02:20.727] [830] Entering _handle_request
[2016-09-26T08:02:20.728] [830] Got request 5
[2016-09-26T08:02:20.728] [830] Handling REQUEST_STATE
[2016-09-26T08:02:20.728] [830] Leaving _handle_request: SLURM_SUCCESS
[2016-09-26T08:02:20.728] [830] Entering _handle_request
[2016-09-26T08:02:20.728] [830] Leaving _handle_accept
[2016-09-26T08:02:20.744] [830] cpu_freq_reset: #cpus reset = 0
[2016-09-26T08:02:20.744] [830] Before call to spank_fini()
[2016-09-26T08:02:20.744] [830] After call to spank_fini()
[2016-09-26T08:02:20.744] [830] eio: handling events for 1 objects
[2016-09-26T08:02:20.744] [830] Called _msg_socket_readable
[2016-09-26T08:02:20.744] [830] false, shutdown
[2016-09-26T08:02:20.744] [830] Message thread exited
[2016-09-26T08:02:20.745] [830] step 830.4294967294 abort completed
[2016-09-26T08:02:20.745] [830] done with job

So, if anybody can help me with this issue, my Slurm will work perfectly with GPUs & CPUs :)

Thank you very much.

Regards,
Philippe
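P.S. I also notice the "Unable to remove pidfile '/var/run/slurmctld.pid': Permission denied" line in the first log, which suggests slurmctld is not running as the user that owns the pidfile. A quick way to compare the two (just a sketch, assuming the pidfile path from my config):

    ls -l /var/run/slurmctld.pid                    # who owns the pidfile
    ps -o user= -p "$(cat /var/run/slurmctld.pid)"  # who the daemon runs as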