I don't really have an answer for you; I'm mostly replying to make your message
stand out from the "flood" of other topics we've had since you posted.

On our cluster we configure preemption to cancel jobs, because that makes more
sense for our situation, so I have no first-hand experience with jobs resuming
after being suspended. Still, I can think of two possible reasons for what
you're seeing:

- one is memory: have you checked your memory logs to see whether there is a
correlation between node memory occupation and jobs failing to resume
correctly? (a rough way to sanity-check that idea is sketched after this list)
- the second one is some resource disappearing while the job is suspended
(temp files? maybe in some circumstances Slurm wipes out /tmp when the second
job runs on the node -- if so, that would be a Slurm bug, obviously)
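
On the memory point, this is the kind of quick-and-dirty check I have in mind
(untested sketch, not anything Slurm provides; the job IDs and the node RAM
figure are placeholders, and it assumes accounting is enabled so sacct reports
MaxRSS for both the suspended job and the one that preempted it):

#!/usr/bin/env python3
# Rough check of the memory hypothesis: add up the peak RSS of the suspended
# job and the job that preempted it, and compare against the node's RAM.
# Placeholders: the two job IDs and NODE_RAM_BYTES.
import subprocess

NODE_RAM_BYTES = 256 * 1024**3          # placeholder: 256 GiB nodes
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def peak_rss_bytes(job_id):
    """Largest MaxRSS sacct reports across the job's steps, in bytes."""
    out = subprocess.run(
        ["sacct", "-n", "-P", "-j", job_id, "-o", "MaxRSS"],
        capture_output=True, text=True, check=True,
    ).stdout
    peak = 0
    for field in (line.strip() for line in out.splitlines()):
        if not field:
            continue
        if field[-1].upper() in UNITS:
            peak = max(peak, int(float(field[:-1]) * UNITS[field[-1].upper()]))
        elif field.isdigit():
            peak = max(peak, int(field))
    return peak


if __name__ == "__main__":
    suspended_job = "123456"    # placeholder: the job that failed to resume
    preempting_job = "123789"   # placeholder: the higher-priority job
    combined = peak_rss_bytes(suspended_job) + peak_rss_bytes(preempting_job)
    print(f"combined peak RSS: {combined / 1024**3:.1f} GiB of "
          f"{NODE_RAM_BYTES / 1024**3:.0f} GiB node RAM")
    if combined > NODE_RAM_BYTES:
        print("memory pressure looks like a plausible cause")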

Assuming you're stuck without a root cause you can actually address, I guess
it depends on what "doesn't recover" means. It's one thing if the job crashes
immediately. It's another if it just stalls without making progress while
Slurm still thinks it's running and the users keep being charged for the
allocation -- even worse if your cluster does not enforce a wallclock limit
(or has a very long one). Depending on how often this happens, the size of
your cluster and other conditions, you may want to consider writing a watchdog
script that looks for these stalled jobs and cancels them.
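
For what it's worth, here is a very rough sketch of the kind of watchdog I
mean (untested, written off the top of my head). The assumptions are all
mine: that a job which failed to recover shows no CPU-time progress between
two polls, that sstat's AveCPU is a usable progress signal, and the polling
interval and threshold are made-up numbers. IO-bound jobs could trip it, so
I'd only print (not scancel) until you trust the heuristic:

#!/usr/bin/env python3
# Watchdog sketch: flag RUNNING jobs whose CPU time does not advance
# between polls. POLL_SECONDS and MIN_CPU_PROGRESS are placeholders.
import subprocess
import time

POLL_SECONDS = 600        # placeholder: how long to wait between polls
MIN_CPU_PROGRESS = 1.0    # placeholder: CPU-seconds of progress expected per poll


def running_job_ids():
    """IDs of all jobs Slurm currently considers RUNNING."""
    out = subprocess.run(
        ["squeue", "-h", "-t", "RUNNING", "-o", "%i"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]


def cpu_seconds(job_id):
    """Rough CPU-time figure for a job: sum of sstat's AveCPU over its steps."""
    out = subprocess.run(
        ["sstat", "-a", "-n", "-P", "-j", job_id, "-o", "AveCPU"],
        capture_output=True, text=True,
    ).stdout
    return sum(parse_time(f.strip()) for f in out.splitlines() if f.strip())


def parse_time(value):
    """Convert a [DD-][HH:]MM:SS(.fff) style value into seconds."""
    days = 0
    if "-" in value:
        d, value = value.split("-", 1)
        days = int(d)
    parts = [float(p) for p in value.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0.0)
    h, m, s = parts
    return days * 86400 + h * 3600 + m * 60 + s


def main():
    last_seen = {}
    while True:
        for job_id in running_job_ids():
            cpu = cpu_seconds(job_id)
            if job_id in last_seen and cpu - last_seen[job_id] < MIN_CPU_PROGRESS:
                # No measurable progress since the last poll: report it.
                # Once you trust the heuristic you could scancel instead.
                print(f"job {job_id} looks stalled (consider: scancel {job_id})")
            last_seen[job_id] = cpu
        time.sleep(POLL_SECONDS)


if __name__ == "__main__":
    main()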

As I said, not really an answer, just my $0.02 (or even less).

On Wed, May 15, 2024 at 1:54 AM Paul Jones via slurm-users <
slurm-users@lists.schedmd.com> wrote:

> Hi,
>
> We use PreemptMode and PriorityTier within Slurm to suspend low priority
> jobs when more urgent work needs to be done. This generally works well, but
> on occasion resumed jobs fail to restart - which is to say Slurm sets the
> job status to running but the actual code doesn't recover from being
> suspended.
>
> Technically everything is working as expected, but I wondered if there was
> any best practice to pass onto users about how to cope with this state?
> Obviously not a direct Slurm question, but wondered if others had
> experience with this and any advice on how best to limit the impact?
>
> Thanks,
> Paul
>
-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
