Hello Eugene,

it is just a wild guess, but could you try "srun --mpi=pmi2" instead of "mpirun"? (You said you built OMPI with PMI support.) srun is built into Slurm and, I think, the preferred way of launching parallel processes. Maybe scontrol is able to suspend the job when it is started this way.
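To make the suggestion concrete, here is a rough sketch of how the batch script could look with srun as the launcher. It is untested on your cluster and assumes the rest of your script stays as you posted it; the file name xhpl_job.sh is only a placeholder I made up, and "-n 160" simply mirrors your "-np 160".

```shell
# Write a job script that launches xhpl through srun instead of mpirun.
# Assumption: partition name, node count and tasks per node are taken
# from the original script; xhpl_job.sh is a hypothetical file name.
cat > xhpl_job.sh <<'EOF'
#!/bin/bash
#SBATCH --partition=standard
#SBATCH -N 10
#SBATCH --ntasks-per-node=16

# srun starts the 160 tasks itself, so slurmd on every node
# is the direct parent of the xhpl processes.
srun -n 160 --mpi=pmi2 xhpl | tee LOG
EOF

# Submit it afterwards with: sbatch xhpl_job.sh
```

If the suspend then reaches all ten nodes, that would point at the mpirun launch path rather than at scontrol itself.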
Regards,
Dennis

On 10.07.2017 at 22:20, Eugene Dedits wrote:
> Hello SLURM-DEV
>
> I have a problem with slurm, openmpi, and "scontrol suspend".
>
> My setup is:
> 96-node cluster with IB, running rhel 6.8
> slurm 17.02.1
> openmpi 2.0.0 (built using Intel 2016 compiler)
>
> I am running some application (hpl in this particular case) using a batch
> script similar to:
> -----------------------------
> #!/bin/bash
> #SBATCH --partition=standard
> #SBATCH -N 10
> #SBATCH --ntasks-per-node=16
>
> mpirun -np 160 xhpl | tee LOG
> -----------------------------
>
> So I am running it on 160 cores, 10 nodes.
>
> Once the job is submitted to the queue and is running, I suspend it using
> ~# scontrol suspend JOBID
>
> I see that indeed my job stopped producing output. I go to each of the 10
> nodes that were assigned to my job and check whether the xhpl processes are
> running there with:
>
> ~# for i in {10..19}; do ssh node$i "top -b -n 1 | head -n 50 | grep xhpl | wc -l"; done
>
> I expect this little script to return 0 from every node (because suspend
> sent the SIGSTOP and they shouldn't show up in top). However, I see that
> processes are reliably suspended only on node10. I get:
> 0
> 16
> 16
> …
> 16
>
> So 9 out of 10 nodes still have 16 MPI threads of my xhpl application
> running at 100%.
>
> If I run "scontrol resume JOBID" and then suspend it again, I see that
> (sometimes) more nodes have "xhpl" processes properly suspended. Every time
> I resume and suspend the job, I see different nodes returning 0 in my
> "ssh-run-top" script.
>
> So altogether it looks like the suspend mechanism doesn't work properly in
> SLURM with OpenMPI. I've tried compiling OpenMPI with
> "--with-slurm --with-pmi=/path/to/my/slurm". I've observed the same behavior.
>
> I would appreciate any help.
>
> Thanks,
> Eugene.

--
Dennis Tants
Trainee: IT specialist for system integration
ZARM - Zentrum für angewandte Raumfahrttechnologie und Mikrogravitation
ZARM - Center of Applied Space Technology and Microgravity
Universität Bremen
Am Fallturm
28359 Bremen, Germany
Phone: 0421 218 57940
E-Mail: ta...@zarm.uni-bremen.de
www.zarm.uni-bremen.de