Correction (copy/pasted wrong thing): It was the "JobSubmitPlugins=lua"
line in slurm.conf, not "job_submit.lua: initialized", that did the trick.

At least, I thought that was the end of the story. Now I'm getting odd
errors when reading job_desc and part_list that behave, in my estimation,
as if Lua is receiving a bad pointer to the underlying C data structure.

On Ubuntu, the unedited job_submit.lua provided with the sample code runs
without crashing, though it does not respect the --partition="foo" flag in
sbatch as the source code suggests it should. When edited to include
slurm.log_info("bar"), the script crashes with:
/etc/slurm/job_submit.lua:38: attempt to compare number with nil
The fact that behaviour changes based on the presence of unrelated code
makes me think that this is a pointer issue, but I don't know enough about
the compilation of Lua to bytecode to diagnose it.

On CentOS, with or without the log command, it crashes at the same point as
on Ubuntu.

On both:
When I comment out the example code so that it doesn't crash, then try to
print out values in job_desc, I get some really odd results. For example,
job_desc.min_nodes is 4294967294 (on both systems), regardless of what I
set with sbatch job.sh --nodes=X. At first I thought that slurm gave my lua
script a bad pointer to something that had already been garbage collected,
but then I discovered that if I hard code something in lua such as
job_desc.min_nodes=X, then slurm assigns X nodes to the job. So perhaps
slurm respects what lua populates job_desc with, but slurm initially fills
it with arbitrary values?
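
A possible explanation, offered as a sketch rather than a diagnosis:
4294967294 is 0xFFFFFFFE, which is the value slurm.h defines as NO_VAL, the
marker for a 32-bit field that was never set. Under that reading job_desc is
not full of garbage; min_nodes simply was not requested. Note too that sbatch
treats everything after the script name as arguments to the script, so in
"sbatch job.sh --nodes=X" the node count never reaches sbatch. The helper
name is_unset and the default of 1 below are hypothetical, and job_desc is a
plain table standing in for the real object so the snippet runs under a stock
Lua interpreter.

```lua
-- Sketch only: NO_VAL is copied from slurm.h (0xFFFFFFFE) and hard-coded
-- here; is_unset is a hypothetical helper, not part of the Slurm Lua API.
local NO_VAL = 4294967294

-- True when a 32-bit numeric field was never set by the submitter.
local function is_unset(value)
    return value == nil or value == NO_VAL
end

-- Stand-in for the job_desc that slurmctld would pass to slurm_job_submit.
local job_desc = { min_nodes = NO_VAL }

if is_unset(job_desc.min_nodes) then
    job_desc.min_nodes = 1  -- hypothetical site default
end
print(job_desc.min_nodes)
```

Inside a real slurm_job_submit the same check would be applied to the
job_desc argument before logging or comparing the field.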

Here's the lua script I used for the above experiments:
======== BEGIN job_submit.lua ========
function slurm_job_submit(job_desc, part_list, submit_uid)
    slurm.log_info(job_desc.min_nodes)
    job_desc.min_nodes=5
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end

slurm.log_info("initialized")
return slurm.SUCCESS
======== END job_submit.lua ========

As an aside, it looks like job_desc uses job_descriptor under the hood:
https://github.com/SchedMD/slurm/blob/master/slurm/slurm.h.in#L1373-L1553
As I wasn't positive, I experimented first using job_desc.qos, which
Nicholas indicated should be supported, but while it exhibited similar
behaviour to min_nodes, it didn't fail quite as spectacularly.
I couldn't figure out what structure backs part_list. The documentation at
https://slurm.schedmd.com/job_submit_plugins.html isn't clear when all it
says is that it's a "List of pointer to partitions which this user is
authorized to use." [sic]
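
For reference, "attempt to compare number with nil" is the error Lua raises
when one operand of a comparison such as > is a field that does not exist. A
defensive way to walk part_list is sketched below. It assumes, without
confirmation from the plugin source, that part_list can be iterated with
pairs() and that entries may lack a numeric priority field; pick_partition
and the sample data are hypothetical stand-ins so the snippet runs on its
own.

```lua
-- Sketch only: assumes part_list can be walked with pairs() and that an
-- entry may or may not expose a numeric "priority" field.
local function pick_partition(part_list)
    local best_name, best_priority = nil, -1
    for name, part in pairs(part_list) do
        -- Fall back to 0 instead of comparing a number with nil.
        local priority = part.priority or 0
        if priority > best_priority then
            best_name, best_priority = name, priority
        end
    end
    return best_name
end

-- Stand-in data; real entries would come from slurmctld.
local part_list = {
    debug = { name = "debug" },                 -- no priority field at all
    batch = { name = "batch", priority = 10 },
}
print(pick_partition(part_list))
```

The fallback value is a design choice: treating a missing field as 0 keeps
the loop running, whereas logging and skipping the entry would make the gap
visible in slurmctld's log.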

I'm still using slurm 17.02.5. On Ubuntu I'm using lua5.2, and on CentOS
it's lua5.1. In both cases, Lua (both the interpreter and the dev
libraries) was installed from the repositories, and slurm was built from
source.

It seems like I filled an email with a whole lot of complaints and no real
questions. So, is this a configuration error on my end? Should I suck it up
and write my plugin in C, even though I don't need full access to
slurmctld? Should I switch to using slurm-wlm? Should I open a bug report?

Thanks,
Nathan

On 27 June 2017 at 17:07, Nicholas McCollum <nmccol...@asc.edu> wrote:

> Nathan,
>
> I have very much appreciated the job_submit.lua plugin for helping
> educate users on what is an acceptable job.  It is one of my favorite
> features about SLURM and has been invaluable in assisting students in
> submitting valid job requirements.
>
> If a user specifies some absurd amount of memory, or some other sbatch
> or srun parameter... or does not choose a parameter, I like to notify
> the user what they have done wrong.  For example I require all users to
> specify a QoS when they submit a job.
>
> ====== BEGIN EXAMPLE job_submit.lua ======
>
> function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
>   return slurm.SUCCESS
> end
>
> function slurm_job_submit(job_desc, part_list, submit_uid)
>
>     --[[ Start with an error count of 0 ]]--
>   local asc_error = 0
>   local asc_error_verbose = ""
>
>   --[[ Pretend if statement ]]--
>     asc_error = asc_error + 1
>     asc_error_verbose = string.format(
>       "%s\nERROR: Job requested something we don't like.\n",
>       asc_error_verbose)
>   --[[ End Pretend if statement ]]--
>
>   --[[ Pretend if statement ]]--
>     asc_error = asc_error + 1
>     asc_error_verbose = string.format(
>       "%s\nERROR: More bad stuff.\n",
>       asc_error_verbose)
>   --[[ End Pretend if statement ]]--
>
>   if asc_error > 0 then
>     slurm.log_user("\n%s", asc_error_verbose)
>     return slurm.ERROR
>   end
>
>   --[[ Want to return slurm.SUCCESS if the entire script runs to end
> ]]--
>   return slurm.SUCCESS
> end
>
> ====== END EXAMPLE job_submit.lua =======
>
> This is the method that I worked out, where it collects all of the
> errors inside asc_error_verbose and dumps them out at the end with
> return slurm.ERROR.   If you use the file above as is, it will reject
> every job with those errors.  This would be a great way to check that
> job_submit.lua is working on your system.  If you have any current jobs
> though, it will kill them all... so use this on a development
> environment for testing.
>
> My example for making a user specify a QoS:
>
>   local asc_qos = job_desc.qos
>   if asc_qos == nil then
>     asc_error = asc_error + 1
>     asc_error_verbose = string.format(
>       "%s\nJob must request a QoS using the --qos= flag.\n",
>       asc_error_verbose)
>     asc_qos = "invalid"
>   end
>
>
> I'd be more than happy to share my job_submit.lua if anyone is
> interested.  I only ask that you share yours back.
>
> --
> Nicholas McCollum
> HPC Systems Administrator
> Alabama Supercomputer Authority
>
> On Tue, 2017-06-27 at 14:30 -0600, Nathan Vance wrote:
> > Darby,
> >
> > The "job_submit.lua: initialized" line in slurm.conf was indeed the
> > issue. When compiling slurm I only got the "yes lua" line without the
> > flags, but that seems to be just a difference in OS's.
> >
> > Now that I have debugging feedback I should be good to go!
> >
> > Thanks,
> > Nathan
> >
> > On 27 June 2017 at 16:13, Vicker, Darby (JSC-EG311)
> > <darby.vicker-1@nasa.gov> wrote:
> > > We recently started using a lua job submit plugin as well.  You
> > > have to have the lua-devel package installed when you compile
> > > slurm.  It looks like you do (we use RHEL, where the package name
> > > is lua-devel), but confirm that you see something like these in
> > > config.log:
> > >
> > > configure:24784: result: yes lua
> > > pkg_cv_lua_LIBS='-llua -lm -ldl  '
> > > lua_CFLAGS='  -DLUA_COMPAT_ALL'
> > > lua_LIBS='-llua -lm -ldl  '
> > >
> > > Do you have this in your slurm.conf?
> > >
> > > JobSubmitPlugins=lua
> > >
> > > I'm guessing not given you don't see anything in the logs. Before I
> > > got all the errors worked out, I would see errors like this in
> > > slurmctld_log:
> > >
> > > error: Couldn't find the specified plugin name for job_submit/lua
> > > looking at all files
> > > error: cannot find job_submit plugin for job_submit/lua
> > > error: cannot create job_submit context for job_submit/lua
> > > failed to initialize job_submit plugin
> > >
> > >
> > > After getting everything working, you should see this:
> > >
> > > job_submit.lua: initialized
> > >
> > > As well as any other slurm.log_info messages you put in your lua
> > > script.
> > >
> > >
> > > From: Nathan Vance <naterva...@gmail.com>
> > > Reply-To: slurm-dev <slurm-dev@schedmd.com>
> > > Date: Tuesday, June 27, 2017 at 12:15 PM
> > > To: slurm-dev <slurm-dev@schedmd.com>
> > > Subject: [slurm-dev] Job Submit Lua Plugin
> > >
> > > Hello all!
> > >
> > > I've been working on getting off the ground with Lua plugins. The
> > > goal is to implement Torque's routing queues for SLURM, but so far
> > > I have been unable to get SLURM to even call my plugin.
> > >
> > > What I have tried:
> > > 1) Copied contrib/lua/job_submit.lua to /etc/slurm/ (the same
> > > directory as slurm.conf)
> > > 2) Restarted slurmctld and verified that no functionality was
> > > broken
> > > 3) Added slurm.log_info("I got here") to several points in the
> > > script. After restarting slurmctld and submitting a job, grep "I
> > > got here" -R /var/log found no results.
> > > 4) In case there was a problem with the log file, I added
> > > os.execute("touch /home/myUser/slurm_job_submitted") to the top of
> > > the slurm_job_submit method. Restarting slurmctld and submitting a
> > > job still produced no evidence that my plugin was called.
> > > 5) In case there were permission issues, I made job_submit.lua
> > > executable. Nothing. Even grep "job_submit" -R /var/log (in case
> > > there was an error calling the script) comes up dry.
> > >
> > > Relevant information:
> > > OS: Ubuntu 16.04
> > > Lua: lua5.2 and liblua5.2-dev (I can use Lua interactively)
> > > SLURM version: 17.02.5, compiled from source (after installing Lua)
> > > using ./configure --prefix=/usr --sysconfdir=/etc/slurm
> > >
> > > Any guidance to get me up and running would be greatly appreciated!
> > >
> > > Thanks,
> > > Nathan
> >
> >
>
