We’re running SLURM 15.08.8 and are having problems with slurmctld and slurmdbd
aborting. It started with slurmctld aborting often enough that a cron job was
added to check whether slurmctld was running and restart it if not. Over time I
think this must have corrupted data, as slurmdbd is now aborting as well. The
core dumps from slurmctld show the following backtrace in gdb:
Core was generated by `/opt/slurm/sbin/slurmctld'.
Program terminated with signal 11, Segmentation fault.
#0 0x0000000000444150 in _job_fits_in_active_row (job_ptr=0x7f7028323820,
p_ptr=0x27fcfd0) at gang.c:432
432 gang.c: No such file or directory.
in gang.c
The slurmdb.log file contains the following kinds of error messages:
[2016-06-21T10:00:02.916] error: We have more allocated time than is possible
(24742237 > 23274000) for cluster dscr(6465) from 2016-06-20T18:00:00 -
2016-06-20T19:00:00 tres 1
[2016-06-21T10:00:02.916] error: We have more time than is possible
(23274000+259200+0)(23533200) > 23274000 for cluster dscr(6465) from
2016-06-20T18:00:00 - 2016-06-20T19:00:00 tres 1
[2016-06-21T10:00:04.348] error: We have more allocated time than is possible
(24642666 > 23274000) for cluster dscr(6465) from 2016-06-20T19:00:00 -
2016-06-20T20:00:00 tres 1
[2016-06-21T10:00:04.348] error: We have more time than is possible
(23274000+259200+0)(23533200) > 23274000 for cluster dscr(6465) from
2016-06-20T19:00:00 - 2016-06-20T20:00:00 tres 1
[2016-06-21T10:00:05.766] error: We have more allocated time than is possible
(24319743 > 23274000) for cluster dscr(6465) from 2016-06-20T20:00:00 -
2016-06-20T21:00:00 tres 1
Searching around, I found a script (lost.pl) attached to a SchedMD ticket that
reports “lost jobs.” I ran it and it returned 4942 lines.
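As far as I can tell, a “lost job” here is one that the accounting database
still shows as running but that slurmctld no longer knows anything about. I
believe the real lost.pl works against the database directly, but conceptually
it seems to amount to something like this rough sketch of my own (not the
actual script):

    #!/usr/bin/env python
    # My own rough cross-check, NOT the lost.pl from the ticket: job IDs the
    # accounting DB still shows as RUNNING minus the job IDs slurmctld
    # currently knows about.
    import subprocess

    def ids(cmd):
        out = subprocess.check_output(cmd)
        return set(l.strip() for l in out.splitlines() if l.strip())

    # Jobs the database thinks are still running (old start date to catch everything).
    db_running = ids(["sacct", "-a", "-X", "-n", "-P", "-o", "JobID",
                      "-S", "2015-01-01", "-s", "RUNNING"])

    # Jobs slurmctld actually has right now.
    ctld_jobs = ids(["squeue", "-h", "-o", "%i"])

    print("%d lost jobs" % len(db_running - ctld_jobs))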
Is there some way to clean up the database so that slurmdbd stops producing
these errors and no longer aborts?
Thanks!