Not quite.
The user's job script in question checks the exit status of the program it ran 
while the job is still running. If the program fails, the running job wants to 
exclude the machine it is currently running on and requeue itself, in case the 
failure was due to a local machine issue that the scheduler has not flagged as 
a problem.

The current goal is to have a running job step in an array job add the current 
host to its exclude list and requeue itself when it detects a problem. I can't 
modify the exclude list while the job is running, but once the task has been 
requeued and is back in the queue it is no longer running, so it can't modify 
its own exclude list either.

I.e., put something like the following into an sbatch script so each task can 
run it against itself.

if ! $runprogram $args ; then
  # $ExcNodeList is assumed to hold the job's current excluded node list;
  # append this host so the requeued job avoids it
  NewExcNodeList="${ExcNodeList},${HOSTNAME}"
  scontrol update job ${SLURM_JOB_ID} ExcNodeList=${NewExcNodeList}
  scontrol requeue ${SLURM_JOB_ID}
  sleep 10
fi



From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Rodrigo 
Santibáñez
Sent: Thursday, June 4, 2020 4:16 PM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: [EXT] Re: [slurm-users] Change ExcNodeList on a running job

Hello,

Jobs can be requeued if something goes wrong, and the node with the failure can 
be excluded by the controller.

--requeue
Specifies that the batch job should be eligible for requeuing. The job may be 
requeued explicitly by a system administrator, after node failure, or upon 
preemption by a higher priority job. When a job is requeued, the batch script 
is initiated from its beginning. Also see the --no-requeue option. The 
JobRequeue configuration parameter controls the default behavior on the cluster.
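
For example, a batch script can opt in to requeueing directly (a minimal 
sketch; $runprogram and $args are placeholders for the real workload):

  #!/bin/bash
  #SBATCH --requeue                 # make this job eligible for requeueing
  #SBATCH --job-name=retry-example  # hypothetical job name

  $runprogram $args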

Also, jobs can be run on specific nodes, or with certain nodes excluded:

-w, --nodelist=<node name list>
Request a specific list of hosts. The job will contain all of these hosts and 
possibly additional hosts as needed to satisfy resource requirements. The list 
may be specified as a comma-separated list of hosts, a range of hosts 
(host[1-5,7,...] for example), or a filename. The host list will be assumed to 
be a filename if it contains a "/" character. If you specify a minimum node or 
processor count larger than can be satisfied by the supplied host list, 
additional resources will be allocated on other nodes as needed. Duplicate node 
names in the list will be ignored. The order of the node names in the list is 
not important; the node names will be sorted by Slurm.

-x, --exclude=<node name list>
Explicitly exclude certain nodes from the resources granted to the job.
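
For example (the node names below are hypothetical):

  # run only on specific hosts
  sbatch --nodelist=node001,node002 job.sh

  # avoid hosts known to be flaky
  sbatch --exclude=node017,node023 job.sh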

Does this help?

On Thu, Jun 4, 2020 at 4:03 PM, Ransom, Geoffrey M. 
(geoffrey.ran...@jhuapl.edu) wrote:

Hello
   We are moving from Univa (SGE) to Slurm, and one of our users has jobs that, 
if they detect a failure on the current machine, add that machine to their 
exclude list and requeue themselves. The user wants to emulate that behavior in 
Slurm.

It seems like "scontrol update job ${SLURM_JOB_ID} ExcNodeList=$NewExcNodeList" 
won't work on a running job, but it does work on a job pending in the queue. 
This means the job can't do this step and then requeue itself to avoid running 
on the same host as before.
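
For reference, the same update does appear to succeed when issued from outside 
the job while it is pending, e.g. (job ID and node name are hypothetical):

  # job 12345 is pending (PD), not running
  scontrol update job 12345 ExcNodeList=node042
  scontrol show job 12345 | grep ExcNodeList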

Our user wants his jobs to be able to exclude the current node and requeue 
themselves.
Is there some way to accomplish this in Slurm?
Is there a requeue counter of some sort so a job can see if it has requeued 
itself more than X times and give up?

Thanks.
