[slurm-users] Job completed but child process still running

2020-01-13 Thread Youssef Eldakar
In an sbatch script, a user calls a shell script that starts a Java
background process. The job completes immediately, but the child Java
process is still running on the compute node.

Is there a way to prevent this from happening?

Thanks in advance for any pointers.

Youssef Eldakar
Bibliotheca Alexandrina


Re: [slurm-users] Job completed but child process still running

2020-01-13 Thread Chris Samuel

On 1/13/20 5:55 am, Youssef Eldakar wrote:

> In an sbatch script, a user calls a shell script that starts a Java
> background process. The job completes immediately, but the child Java
> process is still running on the compute node.
>
> Is there a way to prevent this from happening?


What I would recommend is to use Slurm's cgroups support so that 
processes that put themselves into the background this way are tracked 
as part of the job and cleaned up when the job exits.


https://slurm.schedmd.com/cgroups.html
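
As a rough sketch, enabling cgroup tracking involves something like the
following in slurm.conf and cgroup.conf (exact options depend on your
Slurm version and site setup, so treat this as a starting point rather
than a recipe):

  # slurm.conf -- track and contain job processes with cgroups
  ProctrackType=proctrack/cgroup
  TaskPlugin=task/cgroup

  # cgroup.conf -- confine jobs to their allocated resources
  CgroupAutomount=yes
  ConstrainCores=yes
  ConstrainRAMSpace=yes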

Depending on how the Java process puts itself into the background, you 
could try adding a "wait" command at the end of the shell script so that 
it doesn't exit immediately (it's not guaranteed to work, though).
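
For example, a minimal sketch of that shell script (the jar name is made
up, and this only helps if the Java process stays a direct child of the
script rather than daemonizing itself):

  #!/bin/bash
  # start the Java process in the background (jar name is made up)
  java -jar myapp.jar &

  # block until all background children of this shell have exited,
  # so the script (and hence the job) doesn't finish while Java runs
  wait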


With cgroups, the Slurm script could also check the processes in your 
cgroup to monitor the existence of the Java process, sleeping for a 
while between checks, and exit when it's no longer found.  For instance, 
once you've got the PID of the Java process you can use "kill -0 $PID" 
to check whether it's still there (rather than using ps).
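
A rough sketch of that polling approach, assuming the script starts the
Java process itself and can capture its PID via $!:

  # start Java in the background and remember its PID (jar name made up)
  java -jar myapp.jar &
  PID=$!

  # kill -0 sends no signal; it only tests whether the PID still exists
  while kill -0 "$PID" 2>/dev/null; do
      sleep 60
  done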


All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



Re: [slurm-users] Job completed but child process still running

2020-01-13 Thread Juergen Salk
* Chris Samuel  [200113 07:30]:
> On 1/13/20 5:55 am, Youssef Eldakar wrote:
> 
> > In an sbatch script, a user calls a shell script that starts a Java
> > background process. The job completes immediately, but the child Java
> > process is still running on the compute node.
> > 
> > Is there a way to prevent this from happening?
> 
> What I would recommend is to use Slurm's cgroups support so that processes
> that put themselves into the background this way are tracked as part of the
> job and cleaned up when the job exits.
> 
> https://slurm.schedmd.com/cgroups.html

Hi,

I don't intend to hijack this thread, but may I add a
question here, just to be 100% sure.

Are you saying that there is absolutely no need to take care 
of potential leftover/stray processes in the epilog script any
more with proctrack/cgroup enabled?

I do have ProctrackType=proctrack/cgroup in slurm.conf but still also
have a cleanup routine in the epilog script to kill potential leftover
processes owned by the user (along with leftover semaphores, shared
memory and message queues by means of ipcrm). Is that totally
pointless when using the proctrack/cgroup plugin for process tracking
in Slurm?
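
For what it's worth, a bare-bones version of such a cleanup routine
might look roughly like this (a sketch only; SLURM_JOB_USER is provided
by slurmd in the epilog environment, and a real epilog should first
check that the user has no other jobs left on the node before killing
anything):

  #!/bin/bash
  # skip if the user is unknown or a system account
  [ -z "$SLURM_JOB_USER" ] && exit 0
  [ "$SLURM_JOB_USER" = "root" ] && exit 0

  # kill any processes still owned by the job's user
  pkill -9 -u "$SLURM_JOB_USER"

  # remove leftover SysV IPC objects owned by the user:
  # in ipcs output, column 2 is the id and column 3 the owner
  ipcs -s | awk -v u="$SLURM_JOB_USER" '$3 == u {print $2}' | xargs -r -n1 ipcrm -s
  ipcs -m | awk -v u="$SLURM_JOB_USER" '$3 == u {print $2}' | xargs -r -n1 ipcrm -m
  ipcs -q | awk -v u="$SLURM_JOB_USER" '$3 == u {print $2}' | xargs -r -n1 ipcrm -q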

Best regards
Jürgen

-- 
Jürgen Salk
Scientific Software & Compute Services (SSCS)
Kommunikations- und Informationszentrum (kiz)
Universität Ulm
Phone: +49 (0)731 50-22478
Fax: +49 (0)731 50-22471



[slurm-users] Adding to / upgrading slurm with a new plugin

2020-01-13 Thread Dean Schulze
I'm writing a select plugin for Slurm, and when I do an srun command with
the new plugin as the SelectType in slurm.conf, I get this error:

srun: error: Task launch for 32.0 failed on node slurmnode2: Header
lengths are longer than data received

The new plugin is just a copy of the cons_res plugin directory with the
files renamed.  I've changed plugin_id, plugin_type[], and plugin_name[],
but those are the only code changes.

I've added the new plugin to the controller by copying the
select_new_cons_res.so, .a, and .la files to /usr/local/lib/slurm (where
the rest of the plugin files are), modifying SelectType in
/etc/slurm/slurm.conf, and restarting slurmctld.
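
For reference, the steps described above amount to roughly the following
on the controller (assuming the renamed plugin maps to
SelectType=select/new_cons_res; adjust the names to match your
plugin_type[]):

  # on the controller host
  cp select_new_cons_res.{so,a,la} /usr/local/lib/slurm/

  # in /etc/slurm/slurm.conf
  SelectType=select/new_cons_res

  # pick up the new configuration
  systemctl restart slurmctld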

It looks like the errors are coming from the nodes.

To use a new plugin, do I need to do a complete reinstall of both the
controller and the nodes?

Thanks.