Looking for some help with, and understanding of, how Prolog scripts work in Slurm.
I have a Slurm 14.03.7 installation on a cluster that I am administering. We want to add a process that checks whether a node has enough disk space on a particular device (/dev/shm in this case) and, if not, sets that node to DRAIN with a Reason of "Insufficient diskspace on /dev/shm". For simplicity, imagine I currently have only one node, "n0", in the cluster.

Preparing for this, I wrote a simple test prolog script:

    #! /bin/bash
    exit 10

As I understand it, when I then call srun / sbatch, the prolog script will return a non-zero exit code (10), the node where this failed will go into the "drain" state, and the job should get rescheduled on a different node. However, this does not seem to be happening:

    $ srun hostname
    n0

(All nodes are idle, and srun prints n0.)

I added the explicit command I would want to run to drain the node to my script before it exits:

    #! /bin/bash
    scontrol update NodeName=n0 State=drain Reason="Insufficient diskspace on /dev/shm"
    exit 10

And then when I call `srun hostname`, the node prints out the hostname, and the node ends up in the drain state:

    $ srun hostname
    n0
    $ sinfo
    PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
    debug*       up   infinite      1  drain  n0

So I know that the prolog script is running, and if I try to run again, the job fails to queue, as I would expect:

    $ srun hostname
    srun: Required node not available (down or drained)
    srun: job 235 queued and waiting for resources

Relevant settings from my slurm.conf file:

    Prolog=/etc/slurm/slurm.prolog.sh
    PrologFlags=Alloc

My question really is: after the prolog script fails (exit 10), why does the job continue along? Alternatively, how would I configure Slurm so that I can do what I really want to do, which is to drain a node if the disk space on a particular device is insufficient?

Thank you,

~ Ian Lee
Lawrence Livermore National Laboratory
(W) 925-423-4941
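For reference, here is a rough sketch of the check I want the prolog to eventually perform. The `check_free` helper, the 1 GiB threshold, and the use of `$SLURMD_NODENAME` are placeholders of my own, not anything taken from the Slurm documentation:

```shell
#! /bin/bash
# check_free DEVICE MIN_KB -> succeeds (exit 0) if DEVICE has at least
# MIN_KB kilobytes available, fails otherwise.
check_free() {
    local device=$1 min_kb=$2 avail_kb
    # df -Pk prints a POSIX-format table; field 4 of line 2 is available KB
    avail_kb=$(df -Pk "$device" | awk 'NR==2 {print $4}')
    [ "$avail_kb" -ge "$min_kb" ]
}

# Intended prolog body (assumes slurmd sets SLURMD_NODENAME; threshold
# of 1048576 KB = 1 GiB is just a placeholder):
# if ! check_free /dev/shm 1048576; then
#     scontrol update NodeName="$SLURMD_NODENAME" State=DRAIN \
#         Reason="Insufficient diskspace on /dev/shm"
#     exit 10
# fi
```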