I'm looking for some help with, or at least a better understanding of, how 
Prolog scripts work in Slurm.

I have a Slurm 14.03.7 installation on a cluster that I am administering. We 
want to add a process to check that a node has enough disk space on a 
particular device (/dev/shm in this case) and if not, then set that node to 
DRAIN with a Reason of "Insufficient diskspace on /dev/shm".  For simplicity, 
imagine I currently have only one node, "n0" in the cluster.
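
For reference, this is roughly the kind of check I have in mind for the real 
prolog; the 95% threshold, the df-based check, and the assumption that the 
Slurm NodeName matches `hostname -s` are all just placeholders:

#!/bin/bash
# Rough sketch: drain the node if /dev/shm is nearly full.
MOUNT=/dev/shm
THRESHOLD=95   # percent used; placeholder value

# Field 5 of "df -P" is the "Use%" column, e.g. "97%"
usage=$(df -P "$MOUNT" | awk 'NR==2 {gsub(/%/, "", $5); print $5}')

if [ "$usage" -ge "$THRESHOLD" ]; then
    scontrol update NodeName="$(hostname -s)" State=drain \
        Reason="Insufficient diskspace on $MOUNT"
    exit 10
fi
exit 0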

Preparing for this, I wrote a simple test prolog script:

#!/bin/bash
# Always fail, to simulate a failing prolog
exit 10

As I understand it, when I then call srun / sbatch, the prolog script will 
return a non-zero exit code (10), the node where this failed will go into the 
"drain" state, and the job should get requeued on a different node.  
However, this does not seem to be happening.

$ srun hostname
n0
# All nodes are idle, srun prints n0


I then added the explicit command I would use to drain the node to my script, 
just before it exits:

#!/bin/bash
scontrol update NodeName=n0 State=drain Reason="Insufficient diskspace on /dev/shm"
exit 10

Then, when I call `srun hostname`, the job still runs and prints the hostname, 
and the node ends up in the drain state.  

$ srun hostname
n0
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1   drain n0

So I know that the prolog script is running, and if I try to run again, the job 
cannot start and just waits in the queue, as I would expect:

$ srun hostname
srun: Required node not available (down or drained)
srun: job 235 queued and waiting for resources
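
In case it helps, the reason the queued job is pending can be checked with 
something along these lines (235 being the job id above; %T prints the job 
state and %r the pending reason):

$ squeue -j 235 -o "%i %T %r"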

Relevant settings from my slurm.conf file:

Prolog=/etc/slurm/slurm.prolog.sh
PrologFlags=Alloc
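
For what it's worth, the Prolog settings that the daemons actually picked up 
can be double-checked with:

$ scontrol show config | grep -i prolog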


My real question is: after the Prolog script fails (exit 10), why does the job 
continue along anyway?

Alternatively, how would I configure Slurm to do what I really want, which is 
to drain a node when the disk space on a particular device is insufficient?

Thank you,


~ Ian Lee
Lawrence Livermore National Laboratory
(W) 925-423-4941
