Hello Eugene,

This is just a wild guess, but could you try "srun --mpi=pmi2" instead
of "mpirun"? (You said you built OMPI with PMI support.)
srun is built into Slurm and is, I think, the preferred way of launching
parallel processes. Maybe scontrol is able to suspend the job when it is
started this way.
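
Something like the following is what I have in mind. It is an untested
sketch based on the batch script you posted: the partition name, node
count and tasks per node are taken from your mail, and I am assuming
that srun picks up the 10 x 16 = 160 tasks from the allocation, so no
explicit task count is passed.

```shell
#!/bin/bash
#SBATCH --partition=standard
#SBATCH -N 10
#SBATCH --ntasks-per-node=16

# Launch the ranks as a Slurm job step instead of through mpirun.
# Assumes Open MPI was built with PMI support, as described in the thread.
srun --mpi=pmi2 xhpl | tee LOG
```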

Regards,
Dennis

On 10.07.2017 at 22:20, Eugene Dedits wrote:
> Hello SLURM-DEV
>
>
> I have a problem with slurm, openmpi, and “scontrol suspend”. 
>
> My setup is:
> 96-node cluster with IB, running rhel 6.8
> slurm 17.02.1
> openmpi 2.0.0 (built using Intel 2016 compiler)
>
>
> I am running an application (hpl in this particular case) using a batch
> script similar to:
> -----------------------------
> #!/bin/bash
> #SBATCH --partition=standard
> #SBATCH -N 10
> #SBATCH --ntasks-per-node=16
>
> mpirun -np 160 xhpl | tee LOG
> -----------------------------
>
> So I am running it on 160 cores, 10 nodes.
>
> Once the job has been submitted and is running, I suspend it using
> ~# scontrol suspend JOBID
>
> I see that my job has indeed stopped producing output. I then go to each
> of the 10 nodes that were assigned to my job and check whether the xhpl
> processes are still running there with:
>
> ~# for i in {10..19}; do ssh node$i "top -b -n 1 | head -n 50 | grep xhpl | wc -l"; done
>
> I expect this little script to return 0 from every node (because suspend
> sent SIGSTOP and the processes shouldn’t show up in top). However, I see
> that the processes are reliably suspended only on node10. I get:
> 0
> 16
> 16
> …
> 16
>
> So 9 out of 10 nodes still have 16 MPI processes of my xhpl application
> running at 100% CPU.
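
A side note on the check itself: top still lists stopped processes (with
state T), so the count above really reflects which ranks rank high enough
by CPU usage to land in the first 50 lines. Looking at the process state
directly may be less ambiguous. Below is a minimal sketch that reads
Linux /proc; the function name count_not_stopped is made up for this
example, and it assumes bash is available on the nodes.

```shell
#!/bin/bash
# Count processes whose command name is $1 and whose state is not T (stopped).
# Parses /proc/<pid>/stat, whose layout is "pid (comm) state ...".
count_not_stopped() {
    local name=$1 n=0 f stat comm rest
    for f in /proc/[0-9]*/stat; do
        # A process may exit during the scan; skip entries we cannot read.
        stat=$(cat "$f" 2>/dev/null) || continue
        comm="${stat#*(}"        # drop everything up to the first "("
        comm="${comm%)*}"        # drop everything from the last ")"
        [ "$comm" = "$name" ] || continue
        rest="${stat##*) }"      # fields after "pid (comm) "
        # The first remaining field is the one-letter process state.
        [ "${rest%% *}" = "T" ] || n=$((n + 1))
    done
    echo "$n"
}
```

Running "count_not_stopped xhpl" on a node prints 0 when every xhpl
process there is stopped (or when none exist), and the number of
still-active ranks otherwise.
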
>
> If I run “scontrol resume JOBID” and then suspend it again, I see that
> (sometimes) more nodes have “xhpl” processes properly suspended. Every
> time I resume and suspend the job, I see different nodes returning 0 in
> my “ssh-run-top” script.
>
> So altogether it looks like the suspend mechanism doesn’t work properly
> in SLURM with OpenMPI. I’ve tried compiling OpenMPI with
> “--with-slurm --with-pmi=/path/to/my/slurm” and observed the same
> behavior.
>
> I would appreciate any help.   
>
>
> Thanks,
> Eugene. 

-- 
Dennis Tants
Trainee: IT specialist for system integration

ZARM - Zentrum für angewandte Raumfahrttechnologie und Mikrogravitation
ZARM - Center of Applied Space Technology and Microgravity

Universität Bremen
Am Fallturm
28359 Bremen, Germany

Phone: 0421 218 57940
E-Mail: ta...@zarm.uni-bremen.de

www.zarm.uni-bremen.de
