Hi Lucas,

Lucas Vuotto <l.vuott...@gmail.com> writes:
> Hi all,
>
> sreport was showing that a user was using more CPU hours per week
> than available. After checking the output of sacct, we found that some
> jobs from an array didn't end:
>
> $ sacct -j 69204 -o jobid%-14,state%6,start,elapsed,end
>
> JobID           State               Start     Elapsed                 End
> -------------- ------ ------------------- ---------- -------------------
> 69204_[1-1000] FAILED 2016-11-09T17:46:50    00:00:00 2016-11-09T17:46:50
> 69204_1        FAILED 2016-11-09T17:46:44 71-20:25:55             Unknown
> 69204_2        FAILED 2016-11-09T17:46:44 71-20:25:55             Unknown
> [...]
> 69204_295      FAILED 2016-11-09T17:46:46 71-20:25:53             Unknown
> 69204_296      FAILED 2016-11-09T17:46:46 71-20:25:53             Unknown
> 69204_297      FAILED 2016-11-09T17:46:46    00:00:00 2016-11-09T17:46:46
> [...]
> 69204_999      FAILED 2016-11-09T17:46:50    00:00:00 2016-11-09T17:46:50
>
> It seems that somehow those jobs got stuck (~72 days after
> 2016-11-09 is today, 2017-01-20, which explains the wrong reports).
> scancel says that 69204 is an invalid job id.
>
> Any idea on how to fix this? We're thinking about deleting the entries
> for those jobs in the DB. Is it safe to run "arbitrary" commands in the
> DB, bypassing slurmdbd?
>
> Thanks in advance.

The following might also be useful:

https://groups.google.com/d/msg/slurm-devel/nf7JxV91F40/KUsS1AmyWRYJ

The code heuristically decides how to deal with inconsistencies in the
database and produces an SQL script to fix them, as well as a second
script to roll back the changes.

Cheers,

Loris

--
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin         Email loris.benn...@fu-berlin.de
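As a quick sanity check on the quoted sacct output: because End is "Unknown" for the stuck tasks, their Elapsed value keeps growing, so adding the reported Elapsed to a task's Start should land on the moment the sacct query was run. A small sketch using only the values quoted above:

```python
from datetime import datetime, timedelta

# Values taken from the sacct output for task 69204_1.
start = datetime(2016, 11, 9, 17, 46, 44)
elapsed = timedelta(days=71, hours=20, minutes=25, seconds=55)

# With no recorded end time, Elapsed accrues until the query time,
# so Start + Elapsed recovers when sacct was run.
print(start + elapsed)  # → 2017-01-20 14:12:39
```

This matches the "~72 days after 2016-11-09 is today, 2017-01-20" observation, and is exactly why sreport showed more CPU hours per week than physically available.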