Re: [slurm-users] Troubleshooting job stuck in Pending state

2023-12-11 Thread Davide DelVento
By "stuck", do you mean the job stays PENDING forever, or does it
eventually run? I've seen the latter (and I agree with you: I wish
Slurm would log things like "I looked at this job and I am not starting it
yet because ...") but not the former.
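
If it's any help, one thing that sometimes comes close (just a sketch, I
haven't checked it against your Slurm version) is turning on the
backfill-specific debug flag for a short window instead of a blanket debug3:

    # log backfill decisions in the slurmctld log for a short while
    scontrol setdebugflags +Backfill
    # ...wait a couple of scheduling cycles, then turn it back off
    scontrol setdebugflags -Backfill

That should be much less noisy than raising the overall debug level.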

On Fri, Dec 8, 2023 at 9:00 AM Pacey, Mike  wrote:

> Hi folks,
>
>
>
> I’m looking for some advice on how to troubleshoot jobs we occasionally
> see on our cluster that are stuck in a pending state despite sufficient
> matching resources being free. In the case I’m trying to troubleshoot, the
> Reason field lists (Priority), but I can’t find any way to get the scheduler
> to tell me which higher-priority job is actually blocking it.
>
>
>
>- I tried setting the scheduler log level to debug3 for 5 minutes at
>one point, but my logfile ballooned from 0.5G to 1.5G and didn’t offer any
>useful info for this case.
>- I’ve tried ‘scontrol schedloglevel 1’ but it returns the error:
>‘slurm_set_schedlog_level error: Requested operation is presently disabled’
>
>
>
> I’m aware that the backfill scheduler will occasionally hold on to free
> resources in order to schedule a larger job with higher priority, but in
> this case I can’t find any pending job that might fit the bill.
>
>
>
> And to possibly complicate matters, this is on a large partition that has
> no maximum time limit and most pending jobs have no time limits either. (We
> use backfill/fairshare as we have smaller partitions of rarer resources
> that benefit from it, plus we’re aiming to use fairshare even on the
> no-time-limits partitions to help balance out usage).
>
>
>
> Hoping someone can provide pointers.
>
>
>
> Regards,
>
> Mike
>


[slurm-users] Troubleshooting job stuck in Pending state

2023-12-08 Thread Pacey, Mike
Hi folks,

I'm looking for some advice on how to troubleshoot jobs we occasionally see on 
our cluster that are stuck in a pending state despite sufficient matching 
resources being free. In the case I'm trying to troubleshoot, the Reason field 
lists (Priority), but I can't find any way to get the scheduler to tell me 
which higher-priority job is actually blocking it.
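
For context, the basic queries I know of look like the following (the job ID 
is just a placeholder), but as far as I can tell none of them name the 
blocking job:

    # what the scheduler currently reports for the stuck job
    scontrol show job 123456
    # its priority, pending reason and any estimated start time
    squeue -j 123456 -O JobID,Priority,Reason,StartTime
    # breakdown of the priority factors for the job
    sprio -j 123456 -l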


  *   I tried setting the scheduler log level to debug3 for 5 minutes at one 
point, but my logfile ballooned from 0.5G to 1.5G and didn't offer any useful 
info for this case.
  *   I've tried 'scontrol schedloglevel 1' but it returns the error: 
'slurm_set_schedlog_level error: Requested operation is presently disabled' 
(see the config sketch just below this list)
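
My guess (unconfirmed) is that this error just means the separate scheduler 
log isn't configured at all, in which case enabling it would presumably need 
something like the following in slurm.conf plus a reconfigure/restart of 
slurmctld (the path is only an example):

    # slurm.conf: enable the separate scheduler log
    SlurmSchedLogFile=/var/log/slurm/sched.log
    SlurmSchedLogLevel=1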

I'm aware that the backfill scheduler will occasionally hold on to free 
resources in order to schedule a larger job with higher priority, but in this 
case I can't find any pending job that might fit the bill.
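
For reference, the kind of listing I've been checking is along these lines 
(the partition name is a placeholder), looking for anything that sits above 
the stuck job in priority order:

    # all pending jobs on the partition, highest priority first
    squeue -t PENDING -p bigpart --sort=-p -O JobID,Priority,Reason,NumNodes,NumCPUs
    # expected start times the backfill scheduler has computed, if any
    squeue --start -p bigpart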

And to possibly complicate matters, this is on a large partition that has no 
maximum time limit and most pending jobs have no time limits either. (We use 
backfill/fairshare as we have smaller partitions of rarer resources that 
benefit from it, plus we're aiming to use fairshare even on the no-time-limits 
partitions to help balance out usage).
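
In case it's relevant, the fair-share standing we're aiming to balance can be 
inspected per account/user with:

    # long listing of fair-share factors for all associations
    sshare -a -l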

Hoping someone can provide pointers.

Regards,
Mike