Hello everybody,

I'm trying to understand an issue I see on two SLURM installations, both on Ubuntu 14.04 64-bit with Slurm 2.6.9 compiled from source. In each case a single computer runs slurmctld, slurmd and slurmdbd. Everything works very well for any job that doesn't last more than 2 days.
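For context, the relevant parts of my slurm.conf look roughly like this (a sketch, not a verbatim copy: the hostname, paths, partition name and GPU count are illustrative; the CPU count and memory match what slurmd reports in the logs below):

    # slurm.conf (excerpt) -- illustrative values, one node runs everything
    ClusterName=graph
    ControlMachine=graph        # same host runs slurmctld, slurmd and slurmdbd
    SlurmctldPidFile=/var/run/slurmctld.pid
    SlurmdPidFile=/var/run/slurmd.pid
    StateSaveLocation=/var/spool/slurm.state  # job state kept across restarts
    ReturnToService=1
    # accounting goes through slurmdbd
    AccountingStorageType=accounting_storage/slurmdbd
    JobAcctGatherType=jobacct_gather/linux
    # GPU support
    GresTypes=gpu
    NodeName=graph CPUs=22 RealMemory=145060 Gres=gpu:1 State=UNKNOWN
    PartitionName=main Nodes=graph Default=YES MaxTime=INFINITE State=UP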
But every 3 days, the slurmctld process restarts by itself, as you can see in slurmctld.log:

[2016-09-26T08:01:42.682] debug2: Performing purge of old job records
[2016-09-26T08:01:42.683] debug: sched: Running job scheduler
[2016-09-26T08:01:42.683] debug2: Performing full system state save
[2016-09-26T08:01:44.003] debug: slurmdbd: DBD_RC is -1 from DBD_FLUSH_JOBS(1408): (null)
[2016-09-26T08:01:44.751] debug2: Sending cpu count of 22 for cluster
[2016-09-26T08:01:44.792] debug: slurmdbd: Issue with call DBD_CLUSTER_CPUS(1407): 4294967295(This cluster hasn't been added to accounting yet)
[2016-09-26T08:01:46.792] debug2: Testing job time limits and checkpoints
[2016-09-26T08:01:47.005] debug3: Writing job id 840 to header record of job_state file
[2016-09-26T08:02:03.003] debug: slurmdbd: DBD_RC is -1 from DBD_FLUSH_JOBS(1408): (null)
[2016-09-26T08:02:16.582] Terminate signal (SIGINT or SIGTERM) received
[2016-09-26T08:02:16.582] debug: sched: slurmctld terminating
[2016-09-26T08:02:16.583] debug3: _slurmctld_rpc_mgr shutting down
[2016-09-26T08:02:16.798] Saving all slurm state
[2016-09-26T08:02:16.801] debug3: Writing job id 840 to header record of job_state file
[2016-09-26T08:02:16.881] debug3: _slurmctld_background shutting down
[2016-09-26T08:02:16.952] slurmdbd: saved 1846 pending RPCs
[2016-09-26T08:02:17.024] Unable to remove pidfile '/var/run/slurmctld.pid': Permission denied
[2016-09-26T08:02:16.582] killing old slurmctld[20343]
[2016-09-26T08:02:17.028] Job accounting information stored, but details not gathered
[2016-09-26T08:02:17.028] slurmctld version 2.6.9 started on cluster graph
[2016-09-26T08:02:17.029] debug3: Trying to load plugin /usr/local/lib/slurm/crypto_munge.so
[2016-09-26T08:02:17.052] Munge cryptographic signature plugin loaded
[2016-09-26T08:02:17.052] debug3: Success.
[2016-09-26T08:02:17.052] debug3: Trying to load plugin /usr/local/lib/slurm/gres_gpu.so
[2016-09-26T08:02:17.060] debug: init: Gres GPU plugin loaded
[2016-09-26T08:02:17.060] debug3: Success.

There is no crontab or anything like that set up; it just restarts by itself.
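Since the log shows a SIGTERM arriving from outside, I tried to rule out an external trigger and found nothing. The checks I ran were along these lines (stock Ubuntu paths assumed; "slurm" as the daemon user is an assumption):

    # crontabs for root and the slurm user
    sudo crontab -l
    sudo crontab -u slurm -l
    ls /etc/cron.d/ /etc/cron.daily/ /etc/cron.weekly/
    # logrotate configs can restart daemons from a postrotate script
    grep -r slurm /etc/logrotate.conf /etc/logrotate.d/ 2>/dev/null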
And the thing is, jobs that run for more than 3 days are killed by this restart (even though Slurm is normally capable of resuming jobs after a controller restart):

[2016-09-26T08:02:01.282] [830] profile signalling type Task
[2016-09-26T08:02:20.632] debug3: in the service_connection
[2016-09-26T08:02:20.634] debug2: got this type of message 1001
[2016-09-26T08:02:20.634] debug2: Processing RPC: REQUEST_NODE_REGISTRATION_STATUS
[2016-09-26T08:02:20.635] debug3: CPUs=22 Boards=1 Sockets=22 Cores=1 Threads=1 Memory=145060 TmpDisk=78723 Uptime=4210729
[2016-09-26T08:02:20.636] debug4: found jobid = 830, stepid = 4294967294
[2016-09-26T08:02:20.637] [830] Called _msg_socket_accept
[2016-09-26T08:02:20.637] [830] Leaving _msg_socket_accept
[2016-09-26T08:02:20.637] [830] eio: handling events for 1 objects
[2016-09-26T08:02:20.637] [830] Called _msg_socket_readable
[2016-09-26T08:02:20.637] [830] Entering _handle_accept (new thread)
[2016-09-26T08:02:20.638] [830] Identity: uid=0, gid=0
[2016-09-26T08:02:20.638] [830] Entering _handle_request
[2016-09-26T08:02:20.639] [830] Got request 5
[2016-09-26T08:02:20.639] [830] Handling REQUEST_STATE
[2016-09-26T08:02:20.639] [830] Leaving _handle_request: SLURM_SUCCESS
[2016-09-26T08:02:20.639] [830] Entering _handle_request
[2016-09-26T08:02:20.639] debug: found apparently running job 830
[2016-09-26T08:02:20.640] [830] Leaving _handle_accept
[2016-09-26T08:02:20.692] debug3: in the service_connection
[2016-09-26T08:02:20.693] debug2: got this type of message 6013
[2016-09-26T08:02:20.693] debug2: Processing RPC: REQUEST_ABORT_JOB
[2016-09-26T08:02:20.694] debug: _rpc_abort_job, uid = 64030
[2016-09-26T08:02:20.694] debug: task_slurmd_release_resources: 830
[2016-09-26T08:02:20.694] debug3: state for jobid 830: ctime:1474633733 revoked:0 expires:0
[2016-09-26T08:02:20.694] debug3: state for jobid 834: ctime:1474839186 revoked:1474840473 expires:1474840473
[2016-09-26T08:02:20.694] debug3: destroying job 834 state
[2016-09-26T08:02:20.694] debug3: state for jobid 835: ctime:1474840474 revoked:1474840567 expires:1474840567
[2016-09-26T08:02:20.694] debug3: destroying job 835 state
[2016-09-26T08:02:20.694] debug3: state for jobid 836: ctime:1474840570 revoked:1474840617 expires:1474840617
[2016-09-26T08:02:20.694] debug3: destroying job 836 state
[2016-09-26T08:02:20.694] debug3: state for jobid 837: ctime:1474840618 revoked:1474840672 expires:1474840672
[2016-09-26T08:02:20.694] debug3: destroying job 837 state
[2016-09-26T08:02:20.694] debug3: state for jobid 838: ctime:1474840672 revoked:1474840747 expires:1474840747
[2016-09-26T08:02:20.694] debug3: destroying job 838 state
[2016-09-26T08:02:20.694] debug3: state for jobid 839: ctime:1474840748 revoked:1474840855 expires:1474840855
[2016-09-26T08:02:20.694] debug3: destroying job 839 state
[2016-09-26T08:02:20.694] debug3: state for jobid 840: ctime:1474840883 revoked:1474840969 expires:1474840969
[2016-09-26T08:02:20.694] debug3: destroying job 840 state
[2016-09-26T08:02:20.695] debug: credential for job 830 revoked
[2016-09-26T08:02:20.696] debug4: found jobid = 830, stepid = 4294967294
[2016-09-26T08:02:20.699] [830] Called _msg_socket_accept
[2016-09-26T08:02:20.699] [830] Leaving _msg_socket_accept
[2016-09-26T08:02:20.699] [830] eio: handling events for 1 objects
[2016-09-26T08:02:20.699] [830] Called _msg_socket_readable
[2016-09-26T08:02:20.700] [830] Entering _handle_accept (new thread)
[2016-09-26T08:02:20.700] [830] Identity: uid=0, gid=0
[2016-09-26T08:02:20.701] [830] Entering _handle_request
[2016-09-26T08:02:20.701] debug2: container signal 997 to job 830.4294967294
[2016-09-26T08:02:20.701] [830] Got request 4
[2016-09-26T08:02:20.701] [830] Handling REQUEST_SIGNAL_CONTAINER
[2016-09-26T08:02:20.701] [830] _handle_signal_container for step=830.4294967294 uid=0 signal=997
[2016-09-26T08:02:20.701] [830] Myname in build_hashtbl: (slurmstepd)
[2016-09-26T08:02:20.723] [830] Sending signal 9 to pid 649 (python)
[2016-09-26T08:02:20.723] [830] Sending signal 9 to pid 648 (slurm_script)
[2016-09-26T08:02:20.723] [830] Sent signal 9 to 830.4294967294
[2016-09-26T08:02:20.724] debug4: found jobid = 830, stepid = 4294967294
[2016-09-26T08:02:20.724] [830] task 0 (648) exited. Killed by signal 9.
[2016-09-26T08:02:20.725] [830] Called _msg_socket_accept
[2016-09-26T08:02:20.725] [830] task_post_term: 830.4294967294, task 0
[2016-09-26T08:02:20.725] [830] Aggregated 1 task exit messages
[2016-09-26T08:02:20.725] [830] sending task exit msg for 1 tasks
[2016-09-26T08:02:20.725] [830] Leaving _msg_socket_accept
[2016-09-26T08:02:20.725] [830] eio: handling events for 1 objects
[2016-09-26T08:02:20.725] [830] Called _msg_socket_readable
[2016-09-26T08:02:20.725] [830] Leaving _handle_request: SLURM_SUCCESS
[2016-09-26T08:02:20.726] [830] Entering _handle_request
[2016-09-26T08:02:20.726] [830] Leaving _handle_accept
[2016-09-26T08:02:20.726] [830] Entering _handle_accept (new thread)
[2016-09-26T08:02:20.727] [830] Myname in build_hashtbl: (slurmstepd)
[2016-09-26T08:02:20.727] [830] Identity: uid=0, gid=0
[2016-09-26T08:02:20.727] [830] Entering _handle_request
[2016-09-26T08:02:20.728] [830] Got request 5
[2016-09-26T08:02:20.728] [830] Handling REQUEST_STATE
[2016-09-26T08:02:20.728] [830] Leaving _handle_request: SLURM_SUCCESS
[2016-09-26T08:02:20.728] [830] Entering _handle_request
[2016-09-26T08:02:20.728] [830] Leaving _handle_accept
[2016-09-26T08:02:20.744] [830] cpu_freq_reset: #cpus reset = 0
[2016-09-26T08:02:20.744] [830] Before call to spank_fini()
[2016-09-26T08:02:20.744] [830] After call to spank_fini()
[2016-09-26T08:02:20.744] [830] eio: handling events for 1 objects
[2016-09-26T08:02:20.744] [830] Called _msg_socket_readable
[2016-09-26T08:02:20.744] [830] false, shutdown
[2016-09-26T08:02:20.744] [830] Message thread exited
[2016-09-26T08:02:20.745] [830] step 830.4294967294 abort completed
[2016-09-26T08:02:20.745] [830] done with job

So, if anybody can help me with this issue, my Slurm will work perfectly with GPUs & CPUs :)

Thank you very much.

Regards,
Philippe
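P.S. I also notice the "Unable to remove pidfile '/var/run/slurmctld.pid': Permission denied" line in the first log, which suggests slurmctld is not running as the user that owns the pidfile. A quick way to compare the two (just a sketch, assuming the pidfile path from my config):

    ls -l /var/run/slurmctld.pid                    # who owns the pidfile
    ps -o user= -p "$(cat /var/run/slurmctld.pid)"  # who the daemon runs as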