Thanks for the info and link to your bug report. Unfortunately, my GraceTime is already set to zero for that QOS:

$ sacctmgr show qos interruptible format=Name,gracetime
      Name  GraceTime
---------- ----------
interrupt+   00:00:00


On 2/26/21 3:58 PM, Michael Robbert wrote:

We saw something that sounds similar to this. See this bug report: https://bugs.schedmd.com/show_bug.cgi?id=10196

SchedMD never found the root cause. They thought it might have something to do with a timing problem on Prolog scripts, but the thing that fixed it for us was to set GraceTime=0 on our preemptable QoS.
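(If it helps, clearing the grace time on a QoS is a one-liner with sacctmgr; the QoS name below is just the one from this thread:)

$ sacctmgr modify qos interruptible set GraceTime=0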

*Mike Robbert*

*Cyberinfrastructure Specialist, Cyberinfrastructure and Advanced Research Computing*

Information and Technology Solutions (ITS)

303-273-3786 | mrobb...@mines.edu

*Our values:* Trust | Integrity | Respect | Responsibility

*From: *slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Prentice Bisbal <pbis...@pppl.gov>
*Reply-To: *Slurm User Community List <slurm-users@lists.schedmd.com>
*Date: *Friday, February 26, 2021 at 12:38
*To: *"slurm-users@lists.schedmd.com" <slurm-users@lists.schedmd.com>
*Subject: *[External] [slurm-users] Preemption not working in 20.11

We recently upgraded from Slurm 19.05.8 to 20.11.3. In our configuration, we have a partition named 'interruptible' for long-running, low-priority jobs that use checkpoint/restart; jobs that are preempted are killed and requeued rather than suspended. This configuration had been working without issue for over two years.

After the upgrade, this has stopped working. Preempted jobs are killed and not requeued. My slurm.conf file is configured to requeue preempted jobs:

$ grep -i requeue /etc/slurm/slurm.conf
#JobRequeue=1
PreemptMode=Requeue
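(For reference, a typical requeue-on-preemption setup pairs those lines with a PreemptType, roughly like this; the values below are illustrative rather than a copy of our full slurm.conf:)

PreemptType=preempt/qos
PreemptMode=REQUEUE
JobRequeue=1

With preempt/qos, which QOS is allowed to preempt which is set on the QOS itself via sacctmgr's Preempt= list rather than in slurm.conf.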

The user's sbatch script also includes the --requeue option.
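(Roughly, the relevant part of the job script looks like this; the QOS name is the one from this thread and the application line is a placeholder:)

#!/bin/bash
#SBATCH --qos=interruptible
#SBATCH --requeue
srun ./checkpointable_app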

The user reports that the stderr output from his preempted jobs now says:

slurmstepd: error: *** STEP 1075117.0 ON greene002 CANCELLED AT 2021-02-25T16:07:48 ***

In the past it would have said PREEMPTED instead of CANCELLED.

Any idea what could cause this? I've reported it to Slurm support but haven't gotten anything back yet, so I figured I'd ask here, too. If this is a bug, I can't be the only one who has run into it.

--
Prentice

--
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov
