Re: [slurm-users] How to debug a prolog script?

2022-10-29 Thread Davide DelVento
I finally found some time when I could do this work without
disrupting my users.

It turned out to be both the permissions issue discussed here and the
fact that slurm.conf needs the fully qualified path of the prolog
script.
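
For reference, the relevant slurm.conf line now looks roughly like this
(the path is illustrative, not my exact one):

Prolog=/etc/slurm/prolog.sh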

So that part is solved, but sadly my actual problem is not, as I will
describe in another thread.

On Sun, Sep 18, 2022 at 11:57 PM Bjørn-Helge Mevik
 wrote:
>
> Davide DelVento  writes:
>
> >> I'm curious: What kind of disruption did it cause for your production
> >> jobs?
> >
> > All jobs failed and went into pending/held with "launch failed requeued
> > held" status, and all the nodes where the jobs were scheduled started draining.
> >
> > The logs only said "error: validate_node_specs: Prolog or job env
> > setup failure on node , draining the node". I guess if they said
> > "-bash: /path/to/prolog: Permission denied" I would have caught the
> > problem myself.
>
> But that is not a problem caused by having things like
>
> exec &> /root/prolog_slurmd.$$
>
> in the script, as you indicated.  It is a problem caused by the prolog
> script file not being executable.
>
> > In hindsight it is obvious, but I don't think even the documentation
> > mentions that, does it? After all, you can execute a non-executable
> > file with "sh filename", so I made the incorrect assumption that
> > Slurm would invoke the prolog that way.
>
> Slurm prologs can be written in any language - we used to have perl
> prolog scripts. :)
>
> --
> Regards,
> Bjørn-Helge Mevik, dr. scient,
> Department for Research Computing, University of Oslo
>



Re: [slurm-users] How to debug a prolog script?

2022-09-18 Thread Bjørn-Helge Mevik
Davide DelVento  writes:

>> I'm curious: What kind of disruption did it cause for your production
>> jobs?
>
> All jobs failed and went into pending/held with "launch failed requeued
> held" status, and all the nodes where the jobs were scheduled started draining.
>
> The logs only said "error: validate_node_specs: Prolog or job env
> setup failure on node , draining the node". I guess if they said
> "-bash: /path/to/prolog: Permission denied" I would have caught the
> problem myself.

But that is not a problem caused by having things like

exec &> /root/prolog_slurmd.$$

in the script, as you indicated.  It is a problem caused by the prolog
script file not being executable.

> In hindsight it is obvious, but I don't think even the documentation
> mentions that, does it? After all, you can execute a non-executable
> file with "sh filename", so I made the incorrect assumption that
> Slurm would invoke the prolog that way.

Slurm prologs can be written in any language - we used to have perl
prolog scripts. :)

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo




Re: [slurm-users] How to debug a prolog script?

2022-09-16 Thread Davide DelVento
Thanks a lot.

> > Does it need the execution permission? Is root alone sufficient?
>
> slurmd runs as root, so it only needs exec perms for root.

Perfect. That must have been it then, since my script (like the example
one) did not have the execution permission set.
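
For anyone hitting the same thing, the check and fix look roughly like
this (the path is illustrative):

# verify the prolog is executable by root on the compute node
ls -l /etc/slurm/prolog.sh
# if it is not, e.g.:
chmod 700 /etc/slurm/prolog.sh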

> I'm curious: What kind of disruption did it cause for your production
> jobs?

All jobs failed and went into pending/held with "launch failed requeued
held" status, and all the nodes where the jobs were scheduled started draining.

The logs only said "error: validate_node_specs: Prolog or job env
setup failure on node , draining the node". I guess if they said
"-bash: /path/to/prolog: Permission denied" I would have caught the
problem myself.

In hindsight it is obvious, but I don't think even the documentation
mentions that, does it? After all, you can execute a non-executable
file with "sh filename", so I made the incorrect assumption that
Slurm would invoke the prolog that way.

Thanks!



Re: [slurm-users] How to debug a prolog script?

2022-09-16 Thread Bjørn-Helge Mevik
Davide DelVento  writes:

> Does it need the execution permission? Is root alone sufficient?

slurmd runs as root, so it only needs exec perms for root.

>> > 2. How to debug the issue?
>> I'd try capturing all stdout and stderr from the script into a file
>> on the compute node, for instance like this:
>>
>> exec &> /root/prolog_slurmd.$$
>> set -x # To print out all commands
>
> Do you mean INSIDE the prologue script itself?

Yes, inside the prolog script itself.

> Yes, this is what I'd have done if it weren't so disruptive to all my
> production jobs, hence I had to turn it off before it wreaked too much
> havoc.

I'm curious: What kind of disruption did it cause for your production
jobs?

We use this in our slurmd prologs (and similar in epilogs) on all our
production clusters, and have not seen any disruption due to it.  (We do
have things like

## Remove log file if we got this far:
rm -f /root/prolog_slurmd.$$

at the bottom of the scripts, though, so as to remove the log file when
the prolog succeeded.)
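
Put together, a minimal sketch of such a self-logging prolog (the log
file name and the placeholder work are illustrative, not our exact
scripts):

#!/bin/bash
# Send everything this prolog does to a per-invocation log on the node.
exec &> /root/prolog_slurmd.$$
set -x   # print each command as it is executed

# ... whatever node preparation the prolog actually does goes here ...

## Remove log file if we got this far:
rm -f /root/prolog_slurmd.$$
exit 0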

> Sure, but even when "just executing" there are stdout and stderr, which
> could be captured and logged rather than thrown away, forcing one to do
> the above.

True.  But slurmd doesn't, so...

> How do you "install the prolog scripts there"? Isn't the prolog
> setting in slurm.conf global?

I just overwrite the prolog script file itself on the node.  We
don't have them on a shared file system, though.  If you have the
prologs on a shared file system, you'd have to override the slurm config
on the compute node itself.  This can be done in several ways, for
instance by starting slurmd with the "-f "
option.
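
Concretely, that could look something like this on the test node (the
config path is just an example; slurmd's -f flag points it at an
alternate configuration file):

slurmd -f /etc/slurm/slurm.conf.test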

>> (Otherwise one could always
>> set up a small cluster of VMs and use that for simpler testing.)
>
> Yes, but I would need to request that cluster of VMs from IT, have the
> same OS installed and configured (and, to be 100% identical, it would
> need to be RHEL, so licenses paid for), and have everything sync'ed with
> the actual cluster. I know it'd be very useful, but sadly we don't have
> the resources to do that, so unfortunately this is not an option for me.

I totally agree that VMs instead of a physical test cluster are never
going to be 100% the same, but some things can be tested even though
the setups are not exactly the same (for instance, in my experience,
CentOS and Rocky are close enough to RHEL for most slurm-related
things).  One takes what one has. :)

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo




Re: [slurm-users] How to debug a prolog script?

2022-09-16 Thread Davide DelVento
Thanks to both of you.

> Permissions on the file itself (and the directories in the path to it)

Does it need the execution permission? Is root alone sufficient?

> Existence of the script on the nodes (prologue is run on the nodes, not the 
> head)

Yes, it's in a shared filesystem.

> Not sure your error is the prologue script itself. Does everything run fine 
> with no prologue configured?

Yes, everything has been working fine for months and still does as
soon as I take the prolog out of slurm.conf.

> > 2. How to debug the issue?
> I'd try capturing all stdout and stderr from the script into a file
> on the compute node, for instance like this:
>
> exec &> /root/prolog_slurmd.$$
> set -x # To print out all commands

Do you mean INSIDE the prologue script itself? Yes, this is what I'd
have done if it weren't so disruptive to all my production jobs, hence
I had to turn it off before it wreaked too much havoc.


> > Even after increasing the debug level, the
> > slurmctld.log contains simply an "error: validate_node_specs: Prolog or
> > job env setup failure on node xxx, draining the node" message, without
> > even a line number or anything.
>
> Slurm only executes the prolog script.  It doesn't parse it or evaluate
> it itself, so it has no way of knowing what fails inside the script.

Sure, but even when "just executing" there are stdout and stderr, which
could be captured and logged rather than thrown away, forcing one to do
the above.

> > 3. And more generally, how to debug a prolog (and epilog) script
> > without disrupting all production jobs? Unfortunately we can't have
> > another slurm install for testing. Is there an sbatch option to force
> > the use of a prolog script that would not be executed for all the
> > other jobs? Or perhaps making a dedicated queue?
>
> I tend to reserve a node, install the updated prolog scripts there, and
> run test jobs asking for that reservation.

How do you "install the prolog scripts there"? Isn't the prolog
setting in slurm.conf global?

> (Otherwise one could always
> set up a small cluster of VMs and use that for simpler testing.)

Yes, but I would need to request that cluster of VMs from IT, have the
same OS installed and configured (and, to be 100% identical, it would
need to be RHEL, so licenses paid for), and have everything sync'ed with
the actual cluster. I know it'd be very useful, but sadly we don't have
the resources to do that, so unfortunately this is not an option for me.

Thanks again.



Re: [slurm-users] How to debug a prolog script?

2022-09-16 Thread Bjørn-Helge Mevik
Davide DelVento  writes:

> 2. How to debug the issue?

I'd try capturing all stdout and stderr from the script into a file
on the compute node, for instance like this:

exec &> /root/prolog_slurmd.$$
set -x # To print out all commands

before any other commands in the script.  The "prolog_slurmd.$$" file will
then contain a log of all commands executed in the script, along with
all output (stdout and stderr).  If there is no "prolog_slurmd.$$"
file after the job has been scheduled, then, as has been pointed out by
others, slurm wasn't able to exec the prolog at all.

> Even after increasing the debug level, the
> slurmctld.log contains simply an "error: validate_node_specs: Prolog or
> job env setup failure on node xxx, draining the node" message, without
> even a line number or anything.

Slurm only executes the prolog script.  It doesn't parse it or evaluate
it itself, so it has no way of knowing what fails inside the script.

> 3. And more generally, how to debug a prolog (and epilog) script
> without disrupting all production jobs? Unfortunately we can't have
> another slurm install for testing. Is there an sbatch option to force
> the use of a prolog script that would not be executed for all the
> other jobs? Or perhaps making a dedicated queue?

I tend to reserve a node, install the updated prolog scripts there, and
run test jobs asking for that reservation.  (Otherwise one could always
set up a small cluster of VMs and use that for simpler testing.)
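
As a sketch of that workflow (the node name, times and reservation name
are placeholders):

# reserve a node for testing
scontrol create reservation reservationname=prologtest users=$USER \
    nodes=node001 starttime=now duration=120
# run a trivial job against it to exercise the prolog
sbatch --reservation=prologtest --wrap="hostname"
# clean up afterwards
scontrol delete reservationname=prologtest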

-- 
B/H



Re: [slurm-users] How to debug a prolog script?

2022-09-15 Thread Brian Andrus

Davide,

Quick things to check:

 * Permissions on the file itself (and the directories in the path to it)
 * Existence of the script on the nodes (prologue is run on the nodes,
   not the head)

Not sure your error is the prologue script itself. Does everything run 
fine with no prologue configured?


Brian Andrus

On 9/15/2022 2:49 PM, Davide DelVento wrote:

I have a super simple prolog script, as follows (very similar to the
example one)

#!/bin/bash

if [[ $VAR == 1 ]]; then
    echo "True"
fi

exit 0

This fails (and obviously causes great disruption to my production
jobs). So I have three questions:

1. Why does it fail? It does so regardless of the value of the
variable, so it cannot be a matter of echo not being in the PATH (note
that [[ is a shell keyword). I understand that the echo output will go
into a black hole and that I should use "print ..." (not sure about its
syntax, and the documentation is very cryptic, but I digress) or perhaps
logger (as the example does), and I tried some of them with no luck.

2. How to debug the issue? Even after increasing the debug level, the
slurmctld.log contains simply an "error: validate_node_specs: Prolog or
job env setup failure on node xxx, draining the node" message, without
even a line number or anything. Google does not return anything useful
about this message.

3. And more generally, how to debug a prolog (and epilog) script
without disrupting all production jobs? Unfortunately we can't have
another slurm install for testing. Is there an sbatch option to force
the use of a prolog script that would not be executed for all the
other jobs? Or perhaps making a dedicated queue?


[slurm-users] How to debug a prolog script?

2022-09-15 Thread Davide DelVento
I have a super simple prolog script, as follows (very similar to the
example one)

#!/bin/bash

if [[ $VAR == 1 ]]; then
    echo "True"
fi

exit 0

This fails (and obviously causes great disruption to my production
jobs). So I have three questions:

1. Why does it fail? It does so regardless of the value of the
variable, so it cannot be a matter of echo not being in the PATH (note
that [[ is a shell keyword). I understand that the echo output will go
into a black hole and that I should use "print ..." (not sure about its
syntax, and the documentation is very cryptic, but I digress) or perhaps
logger (as the example does), and I tried some of them with no luck.
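
For concreteness, a logger call of the kind I mean would look roughly
like this (tag and message are just illustrative):

logger -t slurm-prolog "prolog running for job ${SLURM_JOB_ID:-unknown} on $(hostname)"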

2. How to debug the issue? Even after increasing the debug level, the
slurmctld.log contains simply an "error: validate_node_specs: Prolog or
job env setup failure on node xxx, draining the node" message, without
even a line number or anything. Google does not return anything useful
about this message.

3. And more generally, how to debug a prolog (and epilog) script
without disrupting all production jobs? Unfortunately we can't have
another slurm install for testing. Is there an sbatch option to force
the use of a prolog script that would not be executed for all the
other jobs? Or perhaps making a dedicated queue?