Re: [slurm-users] How to debug a prolog script?
Davide,

Quick things to check:

* Permissions on the file itself (and the directories in the path to it)
* Existence of the script on the nodes (prologue is run on the nodes, not the head)

Not sure your error is the prologue script itself. Does everything run fine with no prologue configured?

Brian Andrus

On 9/15/2022 2:49 PM, Davide DelVento wrote:
> I have a super simple prolog script, as follows (very similar to the
> example one)
>
>     #!/bin/bash
>     if [[ $VAR == 1 ]]; then
>         echo "True"
>     fi
>     exit 0
>
> This fails (and obviously causes great disruption to my production
> jobs). So I have three questions:
>
> 1. Why does it fail? It does so regardless of the value of the
> variable, so the cause cannot be echo missing from the PATH (note
> that [[ is a shell keyword). I understand that the output of echo
> will go into a black hole and I should use "print ..." (not sure about
> its syntax, and the documentation is very cryptic, but I digress) or
> perhaps logger (as the example does), and I tried some of them with no
> luck.
>
> 2. How to debug the issue? Even after increasing the debug level, the
> slurmctld.log contains simply an "error: validate_node_specs: Prolog or
> job env setup failure on node xxx, draining the node" message, without
> even a line number or anything. Google does not return anything useful
> about this message.
>
> 3. And more generally, how to debug a prolog (and epilog) script
> without disrupting all production jobs? Unfortunately we can't have
> another slurm install for testing. Is there an sbatch option to force
> utilizing a prolog script which would not be executed for all the
> other jobs? Or perhaps making a dedicated queue?
Re: [slurm-users] How to debug a prolog script?
Davide DelVento writes:

> 2. How to debug the issue?

I'd try capturing all stdout and stderr from the script into a file on the compute node, for instance like this:

    exec &> /root/prolog_slurmd.$$
    set -x # To print out all commands

before any other commands in the script. The "prolog_slurmd." file will then contain a log of all commands executed in the script, along with all output (stdout and stderr). If there is no "prolog_slurmd." file after the job has been scheduled, then as has been pointed out by others, slurm wasn't able to exec the prolog at all.

> Even increasing the debug level the
> slurmctld.log contains simply a "error: validate_node_specs: Prolog or
> job env setup failure on node xxx, draining the node" message, without
> even a line number or anything.

Slurm only executes the prolog script. It doesn't parse it or evaluate it itself, so it has no way of knowing what fails inside the script.

> 3. And more generally, how to debug a prolog (and epilog) script
> without disrupting all production jobs? Unfortunately we can't have
> another slurm install for testing, is there a sbatch option to force
> utilizing a prolog script which would not be executed for all the
> other jobs? Or perhaps making a dedicated queue?

I tend to reserve a node, install the updated prolog scripts there, and run test jobs asking for that reservation. (Otherwise one could always set up a small cluster of VMs and use that for simpler testing.)

--
B/H
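The logging approach described above can be tried by hand, outside Slurm, before touching slurm.conf. The following is a minimal sketch under stated assumptions: it writes the thread's example prolog to a temporary directory, adds the two logging lines at the top, and runs it directly. LOG_DIR and demo_dir are hypothetical names introduced only so the demo can run without root; the advice in the thread writes to /root directly.

```shell
#!/bin/bash
# Demo only: build a throwaway copy of the example prolog and run it by hand.
demo_dir=$(mktemp -d)

cat > "$demo_dir/prolog.sh" <<'EOF'
#!/bin/bash
# LOG_DIR is a hypothetical override for this demo; the thread uses /root.
exec &> "${LOG_DIR:-/root}/prolog_slurmd.$$"   # send stdout and stderr to a per-PID file
set -x                                         # trace every command into that file

if [[ $VAR == 1 ]]; then
    echo "True"
fi
exit 0
EOF
chmod 700 "$demo_dir/prolog.sh"

# Run the prolog the way a direct exec would, with the variable set.
VAR=1 LOG_DIR="$demo_dir" "$demo_dir/prolog.sh"
prolog_status=$?

echo "exit status: $prolog_status"
cat "$demo_dir"/prolog_slurmd.*
```

The log should show each traced command prefixed with "+" together with the script's own output, which is the information slurmctld.log does not provide.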
Re: [slurm-users] How to debug a prolog script?
Thanks to both of you.

> Permissions on the file itself (and the directories in the path to it)

Does it need the execute permission? Is it sufficient for root alone?

> Existence of the script on the nodes (prologue is run on the nodes, not the
> head)

Yes, it's in a shared filesystem.

> Not sure your error is the prologue script itself. Does everything run fine
> with no prologue configured?

Yes, everything has been working fine for months and still does as soon as I take the prolog out of slurm.conf.

> > 2. How to debug the issue?
> I'd try capturing all stdout and stderr from the script into a file on the
> compute node, for instance like this:
>
> exec &> /root/prolog_slurmd.$$
> set -x # To print out all commands

Do you mean INSIDE the prologue script itself? Yes, this is what I'd have done, if it weren't so disruptive of all my production jobs, hence I had to turn it off before wreaking too much havoc.

> > Even increasing the debug level the
> > slurmctld.log contains simply a "error: validate_node_specs: Prolog or
> > job env setup failure on node xxx, draining the node" message, without
> > even a line number or anything.
>
> Slurm only executes the prolog script. It doesn't parse it or evaluate
> it itself, so it has no way of knowing what fails inside the script.

Sure, but even when "just executing" there are stdout and stderr, which could be captured and logged rather than thrown away, forcing one to do the above.

> > 3. And more generally, how to debug a prolog (and epilog) script
> > without disrupting all production jobs? Unfortunately we can't have
> > another slurm install for testing, is there a sbatch option to force
> > utilizing a prolog script which would not be executed for all the
> > other jobs? Or perhaps making a dedicated queue?
>
> I tend to reserve a node, install the updated prolog scripts there, and
> run test jobs asking for that reservation.

How do you "install the prolog scripts there"? Isn't the prolog setting in slurm.conf global?

> (Otherwise one could always
> set up a small cluster of VMs and use that for simpler testing.)

Yes, but I need to request that cluster of VMs from IT, have the same OS installed and configured (and to be 100% identical, it needs to be RHEL, so a paid license), and keep everything synced with the actual cluster. I know it'd be very useful, but sadly we don't have the resources to do that, so unfortunately this is not an option for me.

Thanks again.
Re: [slurm-users] How to debug a prolog script?
Davide DelVento writes:

> Does it need the execution permission? For root alone sufficient?

slurmd runs as root, so it only needs exec perms for root.

>> > 2. How to debug the issue?
>> I'd try capturing all stdout and stderr from the script into a file on the
>> compute node, for instance like this:
>>
>> exec &> /root/prolog_slurmd.$$
>> set -x # To print out all commands
>
> Do you mean INSIDE the prologue script itself?

Yes, inside the prolog script itself.

> Yes, this is what I'd have done, if it weren't so disruptive of all my
> production jobs, hence I had to turn it off before wrecking havoc too
> much.

I'm curious: What kind of disruption did it cause for your production jobs?

We use this in our slurmd prologs (and similar in epilogs) on all our production clusters, and have not seen any disruption due to it. (We do have things like

    ## Remove log file if we got this far:
    rm -f /root/prolog_slurmd.$$

at the bottom of the scripts, though, so as to remove the log file when the prolog succeeded.)

> Sure, but even "just executing" there is stdout and stderr which could
> be captured and logged rather than thrown away and force one to do the
> above.

True. But slurmd doesn't, so...

> How do you "install the prolog scripts there"? Isn't the prolog
> setting in slurm.conf global?

I just overwrite the prolog script file itself on the node. We don't have them on a shared file system, though. If you have the prologs on a shared file system, you'd have to override the slurm config on the compute node itself. This can be done in several ways, for instance by starting slurmd with the "-f " option.

>> (Otherwise one could always
>> set up a small cluster of VMs and use that for simpler testing.)
>
> Yes, but I need to request that cluster of VM to IT, have the same OS
> installed and configured (and to be 100% identical, it needs to be
> RHEL so license paid), and everything sync'ed with the actual
> cluster I know it'd be very useful, but sadly we don't have the
> resources to do that, so unfortunately this is not an option for me.

I totally agree that VMs instead of a physical test cluster are never going to be 100 % the same, but some things can be tested even though the setups are not exactly the same (for instance, in my experience, CentOS and Rocky are close enough to RHEL for most slurm-related things). One takes what one has. :)

--
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
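The "remove the log file if we got this far" pattern described in this message can be sketched as follows. This is an illustration under stated assumptions, not the Oslo prolog itself: FAIL_DEMO is a hypothetical switch standing in for a real setup step that might fail, and LOG_DIR and demo_dir are hypothetical names that let the demo run without root.

```shell
#!/bin/bash
# Demo only: a prolog that keeps its log on failure and removes it on success.
demo_dir=$(mktemp -d)

cat > "$demo_dir/prolog.sh" <<'EOF'
#!/bin/bash
# LOG_DIR is a hypothetical override for this demo; the thread uses /root.
log="${LOG_DIR:-/root}/prolog_slurmd.$$"
exec &> "$log"
set -x

# FAIL_DEMO is a hypothetical stand-in for a setup step that can fail.
if [[ $FAIL_DEMO == 1 ]]; then
    exit 1
fi

## Remove log file if we got this far:
rm -f "$log"
exit 0
EOF
chmod 700 "$demo_dir/prolog.sh"

# Successful run: no log should be left behind.
ok_status=0
LOG_DIR="$demo_dir" "$demo_dir/prolog.sh" || ok_status=$?
logs_after_success=$(find "$demo_dir" -name 'prolog_slurmd.*' | wc -l)

# Failing run: the log stays on disk for inspection.
fail_status=0
FAIL_DEMO=1 LOG_DIR="$demo_dir" "$demo_dir/prolog.sh" || fail_status=$?
logs_after_failure=$(find "$demo_dir" -name 'prolog_slurmd.*' | wc -l)

echo "success: status=$ok_status logs=$logs_after_success"
echo "failure: status=$fail_status logs=$logs_after_failure"
```

With this layout, a leftover "prolog_slurmd." file on a node is itself a signal that a prolog run did not reach the end of the script.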
Re: [slurm-users] How to debug a prolog script?
Thanks a lot.

> > Does it need the execution permission? For root alone sufficient?
>
> slurmd runs as root, so it only need exec perms for root.

Perfect. That must have been it, then, since my script (like the example one) did not have the execute permission set.

> I'm curious: What kind of disruption did it cause for your production
> jobs?

All jobs failed and went into pending/held with "launch failed requeued held" status, and all nodes where the jobs were scheduled went draining.

The logs only said "error: validate_node_specs: Prolog or job env setup failure on node , draining the node". I guess if they had said "-bash: /path/to/prolog: Permission denied" I would have caught the problem myself.

In hindsight it is obvious, but I don't think even the documentation mentions that, does it? After all, you can execute a non-executable file with "sh filename", so I made the incorrect assumption that slurm would have invoked the prolog that way.

Thanks!
Re: [slurm-users] How to debug a prolog script?
Davide DelVento writes:

>> I'm curious: What kind of disruption did it cause for your production
>> jobs?
>
> All jobs failed and went in pending/held with "launch failed requeued
> held" status, all nodes where the jobs were scheduled went draining.
>
> The logs only said "error: validate_node_specs: Prolog or job env
> setup failure on node , draining the node". I guess if they said
> "-bash: /path/to/prolog: Permission denied" I would have caught the
> problem myself.

But that is not a problem caused by having things like

    exec &> /root/prolog_slurmd.$$

in the script, as you indicated. It is a problem caused by the prolog script file not being executable.

> In hindsight it is obvious, but I don't think even the documentation
> mentions that, does it? After all you can execute a file with a
> non-executable with with "sh filename", so I made the incorrect
> assumption that slurm would have invoked the prolog that way.

Slurm prologs can be written in any language - we used to have perl prolog scripts. :)

--
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
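The difference between "sh filename" and executing the file directly can be reproduced in a plain shell, with no Slurm involved. The sketch below only demonstrates shell and kernel behaviour; that slurmd launches the prolog by direct execution is inferred from this thread, not shown by the demo. The temporary path and file name are hypothetical.

```shell
#!/bin/bash
# Demo only: the same script run three ways.
demo_dir=$(mktemp -d)
script="$demo_dir/prolog.sh"
printf '#!/bin/bash\necho "prolog ran"\n' > "$script"
chmod 644 "$script"                 # readable, but no execute bit for anyone

# 1. Passing the file to an interpreter only needs read permission.
via_sh=$(sh "$script")

# 2. Executing it directly fails with "Permission denied".
direct_status=0
"$script" 2>/dev/null || direct_status=$?

# 3. Once root (the owner here) has the execute bit, direct execution works.
chmod 700 "$script"
after_chmod=$("$script")

echo "sh filename:      $via_sh"
echo "direct, mode 644: exit status $direct_status"
echo "direct, mode 700: $after_chmod"
```

Direct execution is also what lets the shebang line pick the interpreter, which fits the remark that prologs can be written in any language.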
Re: [slurm-users] How to debug a prolog script?
Finally I found some time available when I could do the job without disrupting my users. It turned out to be both the permissions issue as discussed here, and the fact that the slurm.conf needs the fully qualified path of the prolog script.

So that is solved, but sadly my problem is not solved, as I will describe in another thread.

On Sun, Sep 18, 2022 at 11:57 PM Bjørn-Helge Mevik wrote:
>
> Davide DelVento writes:
>
> >> I'm curious: What kind of disruption did it cause for your production
> >> jobs?
> >
> > All jobs failed and went in pending/held with "launch failed requeued
> > held" status, all nodes where the jobs were scheduled went draining.
> >
> > The logs only said "error: validate_node_specs: Prolog or job env
> > setup failure on node , draining the node". I guess if they said
> > "-bash: /path/to/prolog: Permission denied" I would have caught the
> > problem myself.
>
> But that is not a problem caused by having things like
>
> exec &> /root/prolog_slurmd.$$
>
> in the script, as you indicated. It is a problem caused by the prolog
> script file not being executable.
>
> > In hindsight it is obvious, but I don't think even the documentation
> > mentions that, does it? After all you can execute a file with a
> > non-executable with with "sh filename", so I made the incorrect
> > assumption that slurm would have invoked the prolog that way.
>
> Slurm prologs can be written in any language - we used to have perl
> prolog scripts. :)
>
> --
> Regards,
> Bjørn-Helge Mevik, dr. scient,
> Department for Research Computing, University of Oslo
>
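The two causes found in this thread (a path in slurm.conf that is not fully qualified, and a missing execute bit) can be checked on a compute node before the prolog is enabled. The helper below is a hypothetical sketch; check_prolog is not part of Slurm, and it does not inspect the permissions of the directories leading to the file.

```shell
#!/bin/bash
# check_prolog is a hypothetical helper, not a Slurm command.
# Run it on the compute node, as the user slurmd runs as (root in this thread).
check_prolog() {
    local path=$1
    # slurm.conf needed the fully qualified path in this thread.
    [[ $path == /* ]] || { echo "not an absolute path: $path"; return 1; }
    # The prolog runs on the nodes, so the file must exist there.
    [[ -f $path ]]    || { echo "missing on this node: $path"; return 1; }
    # The file is executed directly, so it needs the execute bit.
    [[ -x $path ]]    || { echo "not executable: $path"; return 1; }
    echo "ok: $path"
}
```

Calling it with the value configured in slurm.conf, for example check_prolog /path/to/prolog, prints the first failed check or "ok".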