Dear slurm-user list,

I got this error:

Unable to start service slurmctld: Job for slurmctld.service failed
because the control process exited with error code.
See "systemctl status slurmctld.service" and "journalctl -xeu
slurmctld.service" for details.

but the output of "systemctl status slurmctld.service" shows nothing
suspicious:

slurmctld.service - Slurm controller daemon
     Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled;
vendor preset: enabled)
    Drop-In: /etc/systemd/system/slurmctld.service.d
             └─override.conf
     Active: active (running) since Wed 2024-02-07 15:50:56 UTC; 19min ago
   Main PID: 51552 (slurmctld)
      Tasks: 21 (limit: 9363)
     Memory: 10.4M
        CPU: 1min 16.088s
     CGroup: /system.slice/slurmctld.service
             ├─51552 /usr/sbin/slurmctld --systemd
             └─51553 "slurmctld: slurmscriptd" "" "" "" "" "" ""

Feb 07 15:58:21 cluster-master-2vt2bqh7ahec04c slurmctld[51552]:
slurmctld: sched: _slurm_rpc_allocate_resources JobId=3 NodeList=(null)
usec=959
Feb 07 15:58:23 cluster-master-2vt2bqh7ahec04c slurmctld[51552]:
slurmctld: _job_complete: JobId=3 WTERMSIG 2
Feb 07 15:58:23 cluster-master-2vt2bqh7ahec04c slurmctld[51552]:
slurmctld: _job_complete: JobId=3 cancelled by interactive user
Feb 07 15:58:23 cluster-master-2vt2bqh7ahec04c slurmctld[51552]:
slurmctld: _job_complete: JobId=3 done
Feb 07 15:58:23 cluster-master-2vt2bqh7ahec04c slurmctld[51552]:
slurmctld: _slurm_rpc_complete_job_allocation: JobId=3 error Job/step
already completing or completed
Feb 07 15:58:42 cluster-master-2vt2bqh7ahec04c slurmctld[51552]:
slurmctld: sched: _slurm_rpc_allocate_resources JobId=4
NodeList=cluster-master-2vt2bqh7ahec04c,cluster-worker-2vt2bqh7ahec04c-2
usec=512
Feb 07 16:06:04 cluster-master-2vt2bqh7ahec04c slurmctld[51553]:
slurmctld: error: _run_script: JobId=0 resumeprog exit status 1:0
Feb 07 16:09:33 cluster-master-2vt2bqh7ahec04c slurmctld[51552]:
slurmctld: _job_complete: JobId=4 WTERMSIG 2
Feb 07 16:09:33 cluster-master-2vt2bqh7ahec04c slurmctld[51552]:
slurmctld: _job_complete: JobId=4 done
Feb 07 16:09:33 cluster-master-2vt2bqh7ahec04c slurmctld[51552]:
slurmctld: _slurm_rpc_complete_job_allocation: JobId=4 error Job/step
already completing or completed

I am unsure how to debug this further. It might be coming from a
previous problem I tried to fix (basically a few deprecated keys in the
configuration).

I will try to restart the entire cluster with the added changes to rule
out any follow-up errors, but maybe it's something obvious that a fellow
list user can spot.

Best regards,
Xaver


--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
