We saw something that sounds similar to this. See this bug report: 
https://bugs.schedmd.com/show_bug.cgi?id=10196

SchedMD never found the root cause. They thought it might have something to do 
with a timing problem on Prolog scripts, but the thing that fixed it for us was 
to set GraceTime=0 on our preemptable QoS.
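
For anyone who wants to try the same workaround, here is a minimal sketch of how the change might be applied with sacctmgr. The QoS name "preemptable" is a placeholder and not taken from any real configuration, and the QOS_NAME and RUN variables and the grace_cmd helper are made up for this example. By default the script only prints the command, so it can be reviewed before anything is changed.

```shell
#!/bin/sh
# Hypothetical QoS name: substitute the name of your own preemptable QoS.
QOS_NAME="${QOS_NAME:-preemptable}"

# Build the sacctmgr command that sets GraceTime to 0 on the given QoS.
# -i (immediate) skips the interactive confirmation prompt.
grace_cmd() {
    printf 'sacctmgr -i modify qos %s set GraceTime=0\n' "$1"
}

# Dry run by default: only print the command. Set RUN=1 to execute it.
if [ "${RUN:-0}" = "1" ]; then
    eval "$(grace_cmd "$QOS_NAME")"
else
    grace_cmd "$QOS_NAME"
fi
```

Afterwards, something like "sacctmgr show qos format=Name,GraceTime" should show whether the new value took effect.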

 

Mike Robbert

Cyberinfrastructure Specialist, Cyberinfrastructure and Advanced Research 
Computing

Information and Technology Solutions (ITS)

303-273-3786 | mrobb...@mines.edu  

Our values: Trust | Integrity | Respect | Responsibility

 

From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Prentice 
Bisbal <pbis...@pppl.gov>
Reply-To: Slurm User Community List <slurm-users@lists.schedmd.com>
Date: Friday, February 26, 2021 at 12:38
To: "slurm-users@lists.schedmd.com" <slurm-users@lists.schedmd.com>
Subject: [External] [slurm-users] Preemption not working in 20.11

 


We recently upgraded from Slurm 19.05.8 to 20.11.3. In our configuration, we 
have an interruptible partition named 'interruptible' for long-running, 
low-priority jobs that use checkpoint/restart. Jobs that are preempted would be 
killed and requeued rather than suspended. This configuration had been working 
without issue for 2+ years. 

After the upgrade, this has stopped working. Preempted jobs are killed and not 
requeued. My slurm.conf file is configured to requeue preempted jobs:

$ grep -i requeue /etc/slurm/slurm.conf 
#JobRequeue=1
PreemptMode=Requeue

And the user's sbatch script included the --requeue option. 

The user reports that the error output from his preempted jobs now says:

slurmstepd: error: *** STEP 1075117.0 ON greene002 CANCELLED AT 2021-02-25T16:07:48 ***

In the past it would say PREEMPTED instead of CANCELLED. 


Any ideas what would cause this? I've reported this to Slurm support, and 
haven't gotten anything back yet, so I figured I'd ask here, too. If this is a 
bug, I can't be the only one who has experienced this. 

-- 
Prentice 
