I’ve just tried 3.0.0rc1 and the problem still persists there…

Thanks,
E.
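[For reference, the ess_base_forward_signals MCA parameter mentioned later in this thread could be exercised roughly as in the sketch below, once a release containing the fix is installed. The comma-separated signal-list value format is an assumption here, not confirmed in the thread; check ompi_info on your build for the exact syntax.]

```shell
# Sketch only: ask mpirun/ORTE to forward stop/continue signals to all
# remote ranks, so that "scontrol suspend" halts xhpl on every node,
# not just the node where mpirun itself runs.
# The value format is an assumption -- verify with:
#   ompi_info --param ess base --level 9
mpirun --mca ess_base_forward_signals "SIGTSTP,SIGCONT" \
       -np 160 xhpl | tee LOG
```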
> On Jul 11, 2017, at 10:20 AM, r...@open-mpi.org wrote:
>
> Just checked the planning board and saw that my PR to bring that change to
> 2.1.2 is pending and not yet in the release branch. I’ll try to make that
> happen soon.
>
> Sent from my iPad
>
>> On Jul 11, 2017, at 8:03 AM, "r...@open-mpi.org" <r...@open-mpi.org> wrote:
>>
>> There is an MCA param, ess_base_forward_signals, that controls which signals
>> to forward. However, I just looked at the source code and see that it wasn't
>> backported. Sigh.
>>
>> You could try the 3.0.0 branch, as it is in release candidate and should go
>> out within a week. I'd suggest just cloning that branch of the OMPI repo to
>> get the latest state. The fix is definitely there.
>>
>> Sent from my iPad
>>
>>> On Jul 11, 2017, at 7:45 AM, Eugene Dedits <eugene.ded...@gmail.com> wrote:
>>>
>>> Hi Ralph,
>>>
>>> thanks for the reply. I’ve just tried upgrading to OMPI 2.1.1. The same
>>> problem… :-\
>>> Could you point me to some discussion of this?
>>>
>>> Thanks,
>>> Eugene.
>>>
>>>> On Jul 11, 2017, at 6:17 AM, r...@open-mpi.org wrote:
>>>>
>>>> There is an issue with how the signal is forwarded. This has been fixed in
>>>> the latest OMPI release, so you might want to upgrade.
>>>>
>>>> Ralph
>>>>
>>>> Sent from my iPad
>>>>
>>>>> On Jul 11, 2017, at 2:53 AM, Dennis Tants
>>>>> <dennis.ta...@zarm.uni-bremen.de> wrote:
>>>>>
>>>>> Hello Eugene,
>>>>>
>>>>> it is just a wild guess, but could you try "srun --mpi=pmi2" (you said
>>>>> you built OMPI with PMI support) instead of "mpirun"?
>>>>> srun is built in and, I think, the preferred way of launching parallel
>>>>> processes. Maybe scontrol is able to suspend it this way.
>>>>>
>>>>> Regards,
>>>>> Dennis
>>>>>
>>>>>> On 10.07.2017 at 22:20, Eugene Dedits wrote:
>>>>>> Hello SLURM-DEV,
>>>>>>
>>>>>> I have a problem with SLURM, OpenMPI, and “scontrol suspend”.
>>>>>>
>>>>>> My setup is:
>>>>>> - 96-node cluster with IB, running RHEL 6.8
>>>>>> - SLURM 17.02.1
>>>>>> - OpenMPI 2.0.0 (built using the Intel 2016 compiler)
>>>>>>
>>>>>> I am running an application (HPL in this particular case) using a batch
>>>>>> script similar to:
>>>>>> -----------------------------
>>>>>> #!/bin/bash
>>>>>> #SBATCH --partition=standard
>>>>>> #SBATCH -N 10
>>>>>> #SBATCH --ntasks-per-node=16
>>>>>>
>>>>>> mpirun -np 160 xhpl | tee LOG
>>>>>> -----------------------------
>>>>>>
>>>>>> So I am running it on 160 cores, 10 nodes.
>>>>>>
>>>>>> Once the job is submitted to the queue and running, I suspend it with:
>>>>>> ~# scontrol suspend JOBID
>>>>>>
>>>>>> I see that my job indeed stops producing output. I go to each of the 10
>>>>>> nodes assigned to my job and check whether the xhpl processes are still
>>>>>> running there with:
>>>>>>
>>>>>> ~# for i in {10..19}; do ssh node$i "top -b -n 1 | head -n 50 | grep xhpl | wc -l"; done
>>>>>>
>>>>>> I expect this little script to return 0 from every node (because suspend
>>>>>> sent SIGSTOP and the processes shouldn’t show up in top). However, I see
>>>>>> that the processes are reliably suspended only on node10. I get:
>>>>>> 0
>>>>>> 16
>>>>>> 16
>>>>>> …
>>>>>> 16
>>>>>>
>>>>>> So 9 out of 10 nodes still have 16 MPI processes of my xhpl application
>>>>>> running at 100%.
>>>>>>
>>>>>> If I run “scontrol resume JOBID” and then suspend again, I see that
>>>>>> (sometimes) more nodes have their xhpl processes properly suspended.
>>>>>> Every time I resume and suspend the job, different nodes return 0 in my
>>>>>> “ssh-run-top” script.
>>>>>>
>>>>>> All together, it looks like the suspend mechanism doesn’t work properly
>>>>>> in SLURM with OpenMPI. I’ve tried compiling OpenMPI with "--with-slurm
>>>>>> --with-pmi=/path/to/my/slurm" and observed the same behavior.
>>>>>>
>>>>>> I would appreciate any help.
>>>>>>
>>>>>> Thanks,
>>>>>> Eugene.
>>>>>
>>>>> --
>>>>> Dennis Tants
>>>>> Auszubildender: Fachinformatiker für Systemintegration
>>>>>
>>>>> ZARM - Zentrum für angewandte Raumfahrttechnologie und Mikrogravitation
>>>>> ZARM - Center of Applied Space Technology and Microgravity
>>>>>
>>>>> Universität Bremen
>>>>> Am Fallturm
>>>>> 28359 Bremen, Germany
>>>>>
>>>>> Telefon: 0421 218 57940
>>>>> E-Mail: ta...@zarm.uni-bremen.de
>>>>>
>>>>> www.zarm.uni-bremen.de
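[Dennis's srun suggestion above would turn the batch script from the thread into something like the sketch below. With srun, slurmstepd launches each rank directly, so "scontrol suspend" can deliver SIGSTOP to every task itself rather than relying on mpirun to forward the signal across nodes. The partition name and task counts are taken from the original script; an OpenMPI build with PMI support (--with-pmi) is assumed.]

```shell
#!/bin/bash
#SBATCH --partition=standard
#SBATCH -N 10
#SBATCH --ntasks-per-node=16

# Launch through SLURM's own task launcher instead of mpirun, so that
# slurmstepd owns every rank and suspend/resume signals reach all nodes.
# Requires OpenMPI built with PMI support (--with-pmi=/path/to/slurm).
srun --mpi=pmi2 -n 160 xhpl | tee LOG
```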