Hi all,

I was looking at the running jobs on one group's cluster and saw an 
implausibly large number of "running" jobs when I ran sacct -X -s R. I then 
checked the output of squeue and found a much more reasonable number:

root@slurm-controller1:/ # sacct -X -p -s R | wc -l
8895
root@slurm-controller1:/ # squeue | wc -l
43
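
To see which job IDs differ rather than just the counts, something like the 
following rough sketch could be used. The function name stale_running_jobs is 
made up for illustration, and it assumes sacct and squeue print job IDs in the 
same format. The -n and -h options drop the header lines that wc -l includes 
in the counts above.

```shell
#!/bin/sh
# Hypothetical helper: print job IDs that the accounting database reports as
# running but that the controller does not list in its queue at all.
stale_running_jobs() {
    acct=$(mktemp) || return 1
    sched=$(mktemp) || { rm -f "$acct"; return 1; }

    # Job allocations that accounting believes are running right now.
    sacct -X -n -P -s R -o JobID | sort -u > "$acct"

    # Every job the controller currently knows about, in any state.
    squeue -h -o '%i' | sort -u > "$sched"

    # Keep only the lines that appear in the accounting list alone.
    comm -23 "$acct" "$sched"

    rm -f "$acct" "$sched"
}
```

Running stale_running_jobs | wc -l would then give the number of records 
that look stale, without the header lines.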

Looking for the cause, I see a large number of errors like the following in 
the slurmctld.log file:

[2019-07-16T09:36:51.464] error: slurmdbd: agent queue is full (20140), discarding DBD_STEP_START:1442 request
[2019-07-16T09:40:27.515] error: slurmdbd: agent queue filling (20140), RESTART SLURMDBD NOW
[2019-07-16T09:40:27.515] error: slurmdbd: agent queue is full (20140), discarding DBD_JOB_COMPLETE:1424 request
[2019-07-16T09:40:27.515] error: slurmdbd: agent queue is full (20140), discarding DBD_STEP_COMPLETE:1441 request
[2019-07-16T09:42:40.766] error: slurmdbd: agent queue filling (20140), RESTART SLURMDBD NOW
[2019-07-16T09:42:40.766] error: slurmdbd: agent queue is full (20140), discarding DBD_STEP_START:1442 request
[2019-07-16T09:46:05.905] error: slurmdbd: agent queue filling (20140), RESTART SLURMDBD NOW
[2019-07-16T09:46:05.905] error: slurmdbd: agent queue is full (20140), discarding DBD_STEP_COMPLETE:1441 request
[2019-07-16T09:46:05.905] error: slurmdbd: agent queue is full (20140), discarding DBD_JOB_COMPLETE:1424 request
[2019-07-16T09:48:42.616] error: slurmdbd: agent queue filling (20140), RESTART SLURMDBD NOW
[2019-07-16T09:48:42.616] error: slurmdbd: agent queue is full (20140), discarding DBD_JOB_COMPLETE:1424 request
[2019-07-16T09:48:42.616] error: slurmdbd: agent queue is full (20140), discarding DBD_STEP_COMPLETE:1441 request
[2019-07-16T09:53:00.188] error: slurmdbd: agent queue filling (20140), RESTART SLURMDBD NOW
[2019-07-16T09:53:00.188] error: slurmdbd: agent queue is full (20140), discarding DBD_JOB_COMPLETE:1424 request
[2019-07-16T09:53:00.189] error: slurmdbd: agent queue is full (20140), discarding DBD_STEP_COMPLETE:1441 request

What might be causing this? And is there any way to correct the accounting 
records in the database now?

Thanks,
Will
