"Golpayegani, Navid (GSFC-6190)" <navid.golpayeg...@nasa.gov> writes:

> Hi all,
>   Is there a way to submit maintenance jobs in a rolling fashion? What
> I’m thinking is the ability to run a job on every node in a slurm
> cluster/queue in exclusive mode but X at a time.

We do this for rolling upgrades.  Basically, we submit X copies of a
jobscript that asks for exclusive access to any node with a feature
"fixme" (actually, we use "vaskmeg" :).  The jobs are run as root and
specify --nice=-10000 to get highest priority.  They do their job,
remove the "fixme" feature from the node, and then requeue themselves.
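
A minimal sketch of those last two steps, written in Python and not
taken from the actual jobscript.  The helper names are hypothetical.
It assumes the job runs as root, that SLURMD_NODENAME and SLURM_JOB_ID
are set in the batch environment, and that the node's current feature
list has already been looked up.  Features= replaces the whole list on
the node, so the other features are passed back in.

```python
import os
import subprocess

FEATURE = "fixme"  # placeholder feature name taken from the message


def clear_feature_cmd(node, current_features, feature=FEATURE):
    """Build the scontrol call that drops `feature` from one node.

    Features= replaces the node's whole list, so every other feature
    is passed back in.  Assumption: an empty value clears the list;
    check this against your Slurm version.
    """
    kept = [f for f in current_features if f != feature]
    return ["scontrol", "update", f"NodeName={node}",
            "Features=" + ",".join(kept)]


def requeue_cmd(job_id):
    """Build the scontrol call that puts this job back in the queue."""
    return ["scontrol", "requeue", str(job_id)]


def finish(current_features):
    """Run both steps from inside the job (hypothetical helper)."""
    node = os.environ["SLURMD_NODENAME"]  # node running the batch script
    job_id = os.environ["SLURM_JOB_ID"]
    subprocess.run(clear_feature_cmd(node, current_features), check=True)
    subprocess.run(requeue_cmd(job_id), check=True)
```

Once requeued, the job is pending again and can only start on another
node that still carries the feature.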

Prior to submitting the jobs, we add the "fixme" feature to all nodes
needing maintenance.
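
The submit side could be sketched roughly as follows, again in Python
and only as an illustration.  The function names and the script name
maint.sh are made up, and the caller is assumed to know each node's
existing features, since Features= overwrites the list.  A negative
--nice value needs a privileged user, which matches running as root.

```python
def tag_cmd(node, current_features, feature="fixme"):
    """Build the scontrol call that adds `feature` to one node."""
    wanted = list(current_features)
    if feature not in wanted:
        wanted.append(feature)
    return ["scontrol", "update", f"NodeName={node}",
            "Features=" + ",".join(wanted)]


def submit_cmds(script, copies, feature="fixme", nice=-10000):
    """Build `copies` identical sbatch calls, one per concurrent slot."""
    one = ["sbatch", "--nodes=1", "--exclusive",
           f"--constraint={feature}", f"--nice={nice}", script]
    return [list(one) for _ in range(copies)]


# Example: allow three nodes to be under maintenance at a time.
commands = submit_cmds("maint.sh", 3)
```

Each command list can be handed to subprocess.run(cmd, check=True).
The number of copies is the X from the question: it caps how many
nodes are being serviced at the same time.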

(In reality, our setup is a little more complex, since it includes
reinstalling the OS on the nodes, but the principle is the same.)

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
