Hi Lucas,

Lucas Vuotto <l.vuott...@gmail.com> writes:

> Hi all,
> sreport was showing that a user was using more CPU hours per week
> than were available. After checking the output of sacct, we found that
> some jobs from an array never ended:
>
> $ sacct -j 69204 -o jobid%-14,state%6,start,elapsed,end
>
>          JobID  State               Start    Elapsed                 End
>
> -------------- ------ ------------------- ---------- -------------------
> 69204_[1-1000] FAILED 2016-11-09T17:46:50   00:00:00 2016-11-09T17:46:50
> 69204_1        FAILED 2016-11-09T17:46:44 71-20:25:55            Unknown
> 69204_2        FAILED 2016-11-09T17:46:44 71-20:25:55            Unknown
> [...]
> 69204_295      FAILED 2016-11-09T17:46:46 71-20:25:53            Unknown
> 69204_296      FAILED 2016-11-09T17:46:46 71-20:25:53            Unknown
> 69204_297      FAILED 2016-11-09T17:46:46   00:00:00 2016-11-09T17:46:46
> [...]
> 69204_999      FAILED 2016-11-09T17:46:50   00:00:00 2016-11-09T17:46:50
>
> It seems that somehow those jobs got stuck (~72 days after
> 2016-11-09 is today, 2017-01-20, which explains the wrong reports).
> scancel says that 69204 is an invalid job id.
>
> Any idea how to fix this? We're thinking about deleting the entries
> for those jobs in the DB. Is it safe to run "arbitrary" commands
> against the DB, bypassing slurmdbd?
>
> Thanks in advance.

The following might also be useful:

https://groups.google.com/d/msg/slurm-devel/nf7JxV91F40/KUsS1AmyWRYJ

The code heuristically decides how to deal with inconsistencies in the
database and produces an SQL script to fix them, as well as a second
script to roll back the changes.
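For reference, a generated fix-up for stuck records like the ones above
typically amounts to statements of this shape. This is only a sketch: it
assumes the default MySQL accounting schema (a per-cluster job table named
<cluster>_job_table, here with a hypothetical cluster name "mycluster"),
where time_end = 0 marks a job that never received an end time. Verify the
table and column names against your own schema, stop nothing, and take a
database dump before running anything like this:

```sql
-- Sketch only: close out array-task records that never got an end time.
-- "mycluster" is a placeholder for your cluster's name; back up first.
UPDATE mycluster_job_table
   SET time_end = time_start      -- record a zero elapsed time
 WHERE id_job = 69204
   AND time_end = 0;              -- 0 means "no end time recorded"
```

Setting time_end rather than deleting the rows keeps the accounting
history intact while stopping sreport from counting the jobs as still
running; a rollback script would simply set time_end back to 0 for the
same rows.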

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin         Email loris.benn...@fu-berlin.de
