
Philippe,

> But every 3 days, the slurmctld process restarts by itself, as you
> can see in slurmctld.log:

... SNIP ...

> No crontab is set, nothing else; it just restarts by itself. And the
> thing is, when I have jobs running for more than 3 days, they are
> killed by this restart (even though Slurm is normally capable of
> resuming jobs)

Do you have log rotation enabled that is stopping and restarting the
ctld?
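
For example, a rotation rule along these lines would send slurmctld a
TERM on every rotation.  This is purely an illustration of the pattern
to look for; the path, schedule and service name on your box will
differ:

    # hypothetical /etc/logrotate.d/slurm
    /var/log/slurm/slurmctld.log {
        weekly
        rotate 4
        compress
        missingok
        postrotate
            service slurm restart   # a full restart here would explain the periodic SIGTERM
        endscript
    }

Anything in /etc/logrotate.d/ or the cron.* directories that restarts
the daemon would line up nicely with the 'Terminate signal (SIGINT or
SIGTERM) received' entry in your log.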

As far as the lost jobs go, check your 'SlurmctldTimeout' and see if
it's set too low (a quick way to check it is below).  We've never lost
any jobs due to:

* ctld restarts
* typos in slurm.conf (!!)
* upgrades

I've been especially guilty of typos, and FWIW SLURM has been
extremely forgiving.
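
If you want to double check the timeout quickly, something like the
following is what I'd look at.  The 300s values are only placeholders,
not a recommendation, and 'man slurm.conf' has the exact semantics of
each timeout:

    # what the running ctld actually has
    scontrol show config | grep -i timeout

    # the corresponding lines in slurm.conf, e.g.:
    SlurmctldTimeout=300
    SlurmdTimeout=300

If you end up changing either one, push it out with 'scontrol
reconfigure' or a clean restart of the daemons.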

HTH,
John DeSantis



On 09/26/2016 03:46 AM, Philippe wrote:
> Hello everybody, I'm trying to understand an issue with 2 SLURM
> installations on Ubuntu 14.04 64-bit, with Slurm 2.6.9 compiled from
> source. Only one computer runs slurmctld/slurmd/slurmdbd. It works
> very well for any job that doesn't last more than 2 days.
> 
> But every 3 days, the slurmctld process restarts by itself, as you
> can see in slurmctld.log:
> 
> [2016-09-26T08:01:42.682] debug2: Performing purge of old job records
> [2016-09-26T08:01:42.683] debug:  sched: Running job scheduler
> [2016-09-26T08:01:42.683] debug2: Performing full system state save
> [2016-09-26T08:01:44.003] debug:  slurmdbd: DBD_RC is -1 from DBD_FLUSH_JOBS(1408): (null)
> [2016-09-26T08:01:44.751] debug2: Sending cpu count of 22 for cluster
> [2016-09-26T08:01:44.792] debug:  slurmdbd: Issue with call DBD_CLUSTER_CPUS(1407): 4294967295(This cluster hasn't been added to accounting yet)
> [2016-09-26T08:01:46.792] debug2: Testing job time limits and checkpoints
> [2016-09-26T08:01:47.005] debug3: Writing job id 840 to header record of job_state file
> [2016-09-26T08:02:03.003] debug:  slurmdbd: DBD_RC is -1 from DBD_FLUSH_JOBS(1408): (null)
> [2016-09-26T08:02:16.582] Terminate signal (SIGINT or SIGTERM) received
> [2016-09-26T08:02:16.582] debug:  sched: slurmctld terminating
> [2016-09-26T08:02:16.583] debug3: _slurmctld_rpc_mgr shutting down
> [2016-09-26T08:02:16.798] Saving all slurm state
> [2016-09-26T08:02:16.801] debug3: Writing job id 840 to header record of job_state file
> [2016-09-26T08:02:16.881] debug3: _slurmctld_background shutting down
> [2016-09-26T08:02:16.952] slurmdbd: saved 1846 pending RPCs
> [2016-09-26T08:02:17.024] Unable to remove pidfile '/var/run/slurmctld.pid': Permission denied
> [2016-09-26T08:02:16.582] killing old slurmctld[20343]
> [2016-09-26T08:02:17.028] Job accounting information stored, but details not gathered
> [2016-09-26T08:02:17.028] slurmctld version 2.6.9 started on cluster graph
> [2016-09-26T08:02:17.029] debug3: Trying to load plugin /usr/local/lib/slurm/crypto_munge.so
> [2016-09-26T08:02:17.052] Munge cryptographic signature plugin loaded
> [2016-09-26T08:02:17.052] debug3: Success.
> [2016-09-26T08:02:17.052] debug3: Trying to load plugin /usr/local/lib/slurm/gres_gpu.so
> [2016-09-26T08:02:17.060] debug:  init: Gres GPU plugin loaded
> [2016-09-26T08:02:17.060] debug3: Success.
> 
> No crontab is set, nothing else; it just restarts by itself. And the
> thing is, when I have jobs running for more than 3 days, they are
> killed by this restart (even though Slurm is normally capable of
> resuming jobs):
> 
> [2016-09-26T08:02:01.282] [830] profile signalling type Task
> [2016-09-26T08:02:20.632] debug3: in the service_connection
> [2016-09-26T08:02:20.634] debug2: got this type of message 1001
> [2016-09-26T08:02:20.634] debug2: Processing RPC: REQUEST_NODE_REGISTRATION_STATUS
> [2016-09-26T08:02:20.635] debug3: CPUs=22 Boards=1 Sockets=22 Cores=1 Threads=1 Memory=145060 TmpDisk=78723 Uptime=4210729
> [2016-09-26T08:02:20.636] debug4: found jobid = 830, stepid = 4294967294
> [2016-09-26T08:02:20.637] [830] Called _msg_socket_accept
> [2016-09-26T08:02:20.637] [830] Leaving _msg_socket_accept
> [2016-09-26T08:02:20.637] [830] eio: handling events for 1 objects
> [2016-09-26T08:02:20.637] [830] Called _msg_socket_readable
> [2016-09-26T08:02:20.637] [830] Entering _handle_accept (new thread)
> [2016-09-26T08:02:20.638] [830]   Identity: uid=0, gid=0
> [2016-09-26T08:02:20.638] [830] Entering _handle_request
> [2016-09-26T08:02:20.639] [830] Got request 5
> [2016-09-26T08:02:20.639] [830] Handling REQUEST_STATE
> [2016-09-26T08:02:20.639] [830] Leaving  _handle_request: SLURM_SUCCESS
> [2016-09-26T08:02:20.639] [830] Entering _handle_request
> [2016-09-26T08:02:20.639] debug:  found apparently running job 830
> [2016-09-26T08:02:20.640] [830] Leaving _handle_accept
> [2016-09-26T08:02:20.692] debug3: in the service_connection
> [2016-09-26T08:02:20.693] debug2: got this type of message 6013
> [2016-09-26T08:02:20.693] debug2: Processing RPC: REQUEST_ABORT_JOB
> [2016-09-26T08:02:20.694] debug:  _rpc_abort_job, uid = 64030
> [2016-09-26T08:02:20.694] debug:  task_slurmd_release_resources: 830
> [2016-09-26T08:02:20.694] debug3: state for jobid 830: ctime:1474633733 revoked:0 expires:0
> [2016-09-26T08:02:20.694] debug3: state for jobid 834: ctime:1474839186 revoked:1474840473 expires:1474840473
> [2016-09-26T08:02:20.694] debug3: destroying job 834 state
> [2016-09-26T08:02:20.694] debug3: state for jobid 835: ctime:1474840474 revoked:1474840567 expires:1474840567
> [2016-09-26T08:02:20.694] debug3: destroying job 835 state
> [2016-09-26T08:02:20.694] debug3: state for jobid 836: ctime:1474840570 revoked:1474840617 expires:1474840617
> [2016-09-26T08:02:20.694] debug3: destroying job 836 state
> [2016-09-26T08:02:20.694] debug3: state for jobid 837: ctime:1474840618 revoked:1474840672 expires:1474840672
> [2016-09-26T08:02:20.694] debug3: destroying job 837 state
> [2016-09-26T08:02:20.694] debug3: state for jobid 838: ctime:1474840672 revoked:1474840747 expires:1474840747
> [2016-09-26T08:02:20.694] debug3: destroying job 838 state
> [2016-09-26T08:02:20.694] debug3: state for jobid 839: ctime:1474840748 revoked:1474840855 expires:1474840855
> [2016-09-26T08:02:20.694] debug3: destroying job 839 state
> [2016-09-26T08:02:20.694] debug3: state for jobid 840: ctime:1474840883 revoked:1474840969 expires:1474840969
> [2016-09-26T08:02:20.694] debug3: destroying job 840 state
> [2016-09-26T08:02:20.695] debug:  credential for job 830 revoked
> [2016-09-26T08:02:20.696] debug4: found jobid = 830, stepid = 4294967294
> [2016-09-26T08:02:20.699] [830] Called _msg_socket_accept
> [2016-09-26T08:02:20.699] [830] Leaving _msg_socket_accept
> [2016-09-26T08:02:20.699] [830] eio: handling events for 1 objects
> [2016-09-26T08:02:20.699] [830] Called _msg_socket_readable
> [2016-09-26T08:02:20.700] [830] Entering _handle_accept (new thread)
> [2016-09-26T08:02:20.700] [830] Identity: uid=0, gid=0
> [2016-09-26T08:02:20.701] [830] Entering _handle_request
> [2016-09-26T08:02:20.701] debug2: container signal 997 to job 830.4294967294
> [2016-09-26T08:02:20.701] [830] Got request 4
> [2016-09-26T08:02:20.701] [830] Handling REQUEST_SIGNAL_CONTAINER
> [2016-09-26T08:02:20.701] [830] _handle_signal_container for step=830.4294967294 uid=0 signal=997
> [2016-09-26T08:02:20.701] [830] Myname in build_hashtbl: (slurmstepd)
> [2016-09-26T08:02:20.723] [830] Sending signal 9 to pid 649 (python)
> [2016-09-26T08:02:20.723] [830] Sending signal 9 to pid 648 (slurm_script)
> [2016-09-26T08:02:20.723] [830] Sent signal 9 to 830.4294967294
> [2016-09-26T08:02:20.724] debug4: found jobid = 830, stepid = 4294967294
> [2016-09-26T08:02:20.724] [830] task 0 (648) exited. Killed by signal 9.
> [2016-09-26T08:02:20.725] [830] Called _msg_socket_accept
> [2016-09-26T08:02:20.725] [830] task_post_term: 830.4294967294, task 0
> [2016-09-26T08:02:20.725] [830] Aggregated 1 task exit messages
> [2016-09-26T08:02:20.725] [830] sending task exit msg for 1 tasks
> [2016-09-26T08:02:20.725] [830] Leaving _msg_socket_accept
> [2016-09-26T08:02:20.725] [830] eio: handling events for 1 objects
> [2016-09-26T08:02:20.725] [830] Called _msg_socket_readable
> [2016-09-26T08:02:20.725] [830] Leaving _handle_request: SLURM_SUCCESS
> [2016-09-26T08:02:20.726] [830] Entering _handle_request
> [2016-09-26T08:02:20.726] [830] Leaving _handle_accept
> [2016-09-26T08:02:20.726] [830] Entering _handle_accept (new thread)
> [2016-09-26T08:02:20.727] [830] Myname in build_hashtbl: (slurmstepd)
> [2016-09-26T08:02:20.727] [830] Identity: uid=0, gid=0
> [2016-09-26T08:02:20.727] [830] Entering _handle_request
> [2016-09-26T08:02:20.728] [830] Got request 5
> [2016-09-26T08:02:20.728] [830] Handling REQUEST_STATE
> [2016-09-26T08:02:20.728] [830] Leaving  _handle_request: SLURM_SUCCESS
> [2016-09-26T08:02:20.728] [830] Entering _handle_request
> [2016-09-26T08:02:20.728] [830] Leaving _handle_accept
> [2016-09-26T08:02:20.744] [830] cpu_freq_reset: #cpus reset = 0
> [2016-09-26T08:02:20.744] [830] Before call to spank_fini()
> [2016-09-26T08:02:20.744] [830] After call to spank_fini()
> [2016-09-26T08:02:20.744] [830] eio: handling events for 1 objects
> [2016-09-26T08:02:20.744] [830] Called _msg_socket_readable
> [2016-09-26T08:02:20.744] [830]   false, shutdown
> [2016-09-26T08:02:20.744] [830] Message thread exited
> [2016-09-26T08:02:20.745] [830] step 830.4294967294 abort completed
> [2016-09-26T08:02:20.745] [830] done with job
> 
> 
> 
> So, if anybody can help me with this issue, my Slurm will work
> perfectly with GPUs & CPUs :)
> 
> Thank you very much
> 
> Regards, Philippe
> 
