Davide DelVento <davide.quan...@gmail.com> writes:

> 2. How to debug the issue?

I'd try capturing all stdout and stderr from the script into a file on
the compute node, for instance like this:

exec &> /root/prolog_slurmd.$$
set -x # To print out all commands

before any other commands in the script.  The "prolog_slurmd.<pid>" file
will then contain a trace of every command executed in the script, along
with all output (stdout and stderr).  If there is no "prolog_slurmd.<pid>"
file after the job has been scheduled, then, as others have pointed out,
Slurm wasn't able to exec the prolog at all.
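
For completeness, here is a minimal sketch of what the instrumented top of
a bash prolog could look like.  The log path, the echo line and the exact
environment variables are only illustrative; SLURM_JOB_ID and SLURM_JOB_UID
are set by slurmd when it runs the prolog:

#!/bin/bash
# Sketch only: assumes a bash prolog and that /root is writable on the node.
exec &> "/root/prolog_slurmd.$$"   # capture stdout and stderr of everything below
set -x                             # print each command before it is executed

echo "prolog start: job=${SLURM_JOB_ID} uid=${SLURM_JOB_UID} at $(date)"

# ... the real prolog logic goes here ...

# slurmd treats any non-zero exit status as a prolog failure and drains
# the node, so end with an explicit status once everything has succeeded.
exit 0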

> Even increasing the debug level the
> slurmctld.log contains simply a "error: validate_node_specs: Prolog or
> job env setup failure on node xxx, draining the node" message, without
> even a line number or anything.

Slurm only executes the prolog script; it doesn't parse or evaluate it
itself, so it has no way of knowing what fails inside the script.  All
slurmd sees is the prolog's exit status, hence the generic message in
slurmctld.log.

> 3. And more generally, how to debug a prolog (and epilog) script
> without disrupting all production jobs? Unfortunately we can't have
> another slurm install for testing, is there a sbatch option to force
> utilizing a prolog script which would not be executed for all the
> other jobs? Or perhaps making a dedicated queue?

I tend to reserve a node, install the updated prolog scripts there, and
run test jobs that request that reservation.  (Alternatively, one could
set up a small cluster of VMs and use that for simpler testing.)
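
For reference, that reservation workflow looks roughly like this; the node
name, reservation name and user are placeholders, and the scontrol commands
need to be run as a Slurm administrator:

# Carve out one node for prolog testing:
scontrol create reservation ReservationName=prologtest Nodes=node001 \
    Users=someuser StartTime=now Duration=infinite

# Test jobs land only on the reserved node:
sbatch --reservation=prologtest --wrap="hostname"

# Clean up when finished:
scontrol delete ReservationName=prologtest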

-- 
B/H
