Okay, it has been committed, so you can grab a tarball tomorrow if you like.


> On Jul 11, 2017, at 9:20 AM, "r...@open-mpi.org" <r...@open-mpi.org> wrote:
> 
> I just checked the planning board and saw that my PR to bring that change to 
> 2.1.2 is pending and not yet in the release branch. I’ll try to make that 
> happen soon.
> 
> 
>> On Jul 11, 2017, at 8:03 AM, "r...@open-mpi.org" <r...@open-mpi.org> wrote:
>> 
>> 
>> There is an MCA param, ess_base_forward_signals, that controls which signals 
>> to forward. However, I just looked at the source code and see that it wasn't 
>> backported. Sigh.
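>> 
>> For reference, a sketch only (the accepted value syntax may differ between 
>> releases, so check "ompi_info --all" for the param's description) -- an MCA 
>> param is set on the mpirun command line like this:
>> 
>>    # Assumed value format: a comma-separated list of signal names to forward.
>>    mpirun --mca ess_base_forward_signals "SIGTSTP,SIGCONT" -np 160 xhpl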
>> 
>> You could try the 3.0.0 branch, as it is at the release-candidate stage and 
>> should go out within a week. I'd suggest just cloning that branch of the OMPI 
>> repo to get the latest state. The fix is definitely there.
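>> 
>> Something like this should do it (assuming the release branch follows the 
>> repo's usual v<major>.<minor>.x naming, i.e. v3.0.x; a git checkout also 
>> needs autogen.pl before configure, unlike a release tarball):
>> 
>>    git clone -b v3.0.x https://github.com/open-mpi/ompi.git
>>    cd ompi && ./autogen.pl
>>    ./configure --with-slurm --with-pmi=/path/to/my/slurm
>>    make -j && make install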
>> 
>> 
>>> On Jul 11, 2017, at 7:45 AM, Eugene Dedits <eugene.ded...@gmail.com> wrote:
>>> 
>>> 
>>> Hi Ralph, 
>>> 
>>> 
>>> thanks for the reply. I’ve just tried upgrading to OMPI 2.1.1. The same 
>>> problem… :-\
>>> Could you point me to some discussion of this? 
>>> 
>>> Thanks,
>>> Eugene. 
>>> 
>>>> On Jul 11, 2017, at 6:17 AM, r...@open-mpi.org wrote:
>>>> 
>>>> 
>>>> There is an issue with how the signal is forwarded. This has been fixed in 
>>>> the latest OMPI release, so you might want to upgrade.
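>>>> 
>>>> A quick sanity check to confirm which version you are actually picking up:
>>>> 
>>>>    mpirun --version
>>>>    ompi_info | head -n 5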
>>>> 
>>>> Ralph
>>>> 
>>>> 
>>>>> On Jul 11, 2017, at 2:53 AM, Dennis Tants 
>>>>> <dennis.ta...@zarm.uni-bremen.de> wrote:
>>>>> 
>>>>> 
>>>>> Hello Eugene,
>>>>> 
>>>>> it is just a wild guess, but could you try "srun --mpi=pmi2" (you said
>>>>> you built OMPI with PMI support) instead of "mpirun"?
>>>>> srun is built-in and, I think, the preferred way of running parallel
>>>>> processes. Maybe scontrol is able to suspend the job this way.
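>>>>> 
>>>>> As a sketch (assuming your OMPI really was built against SLURM's PMI2 and
>>>>> the allocation stays the same), the batch script would become:
>>>>> -----------------------------
>>>>> #!/bin/bash
>>>>> #SBATCH --partition=standard
>>>>> #SBATCH -N 10
>>>>> #SBATCH --ntasks-per-node=16
>>>>> 
>>>>> srun --mpi=pmi2 xhpl | tee LOG
>>>>> -----------------------------
>>>>> srun inherits the task count (10 x 16 = 160) from the allocation, so no
>>>>> explicit -n is needed.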
>>>>> 
>>>>> Regards,
>>>>> Dennis
>>>>> 
>>>>>> On Jul 10, 2017, at 10:20 PM, Eugene Dedits wrote:
>>>>>> Hello SLURM-DEV
>>>>>> 
>>>>>> 
>>>>>> I have a problem with slurm, openmpi, and “scontrol suspend”. 
>>>>>> 
>>>>>> My setup is:
>>>>>> 96-node cluster with IB, running RHEL 6.8
>>>>>> slurm 17.02.1
>>>>>> openmpi 2.0.0 (built with the Intel 2016 compiler)
>>>>>> 
>>>>>> 
>>>>>> I am running an application (HPL in this particular case) using a batch 
>>>>>> script similar to:
>>>>>> -----------------------------
>>>>>> #!/bin/bash
>>>>>> #SBATCH --partition=standard
>>>>>> #SBATCH -N 10
>>>>>> #SBATCH --ntasks-per-node=16
>>>>>> 
>>>>>> mpirun -np 160 xhpl | tee LOG
>>>>>> -----------------------------
>>>>>> 
>>>>>> So I am running it on 160 cores across 10 nodes. 
>>>>>> 
>>>>>> Once the job is submitted to the queue and running, I suspend it using:
>>>>>> ~# scontrol suspend JOBID
>>>>>> 
>>>>>> I see that indeed my job stopped producing output. I go to each of the 10
>>>>>> nodes that were assigned to my job and check whether the xhpl processes are 
>>>>>> still running there with:
>>>>>> 
>>>>>> ~# for i in {10..19}; do ssh node$i "top -b -n 1 | head -n 50 | grep xhpl 
>>>>>> | wc -l"; done
>>>>>> 
>>>>>> I expect this little script to return 0 from every node (because suspend 
>>>>>> sent SIGSTOP and suspended processes shouldn’t show up in top). However, 
>>>>>> I see that processes are reliably suspended only on node10. I get:
>>>>>> 0
>>>>>> 16
>>>>>> 16
>>>>>> …
>>>>>> 16
>>>>>> 
>>>>>> So 9 out of 10 nodes still have 16 MPI processes of my xhpl application 
>>>>>> running at 100%. 
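>>>>>> 
>>>>>> A more direct check (a sketch; it assumes the processes are literally
>>>>>> named xhpl and relies on ps reporting stopped processes with state "T")
>>>>>> counts the xhpl processes per node that are NOT stopped:
>>>>>> 
>>>>>> ~# for i in {10..19}; do ssh node$i 'ps -C xhpl -o stat= | grep -vc "^T"'; done
>>>>>> 
>>>>>> Every node should report 0 once the job is fully suspended.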
>>>>>> 
>>>>>> If I run “scontrol resume JOBID” and then suspend it again, I see that 
>>>>>> (sometimes) more nodes have “xhpl” processes properly suspended. Every 
>>>>>> time I resume and suspend the job, I see different nodes returning 0 in 
>>>>>> my “ssh-run-top” script. 
>>>>>> 
>>>>>> Altogether, it looks like the suspend mechanism doesn’t work properly in 
>>>>>> SLURM with OpenMPI. I’ve tried compiling OpenMPI with "--with-slurm 
>>>>>> --with-pmi=/path/to/my/slurm" and observed the same behavior. 
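>>>>>> 
>>>>>> (A quick way to confirm that SLURM support actually made it into such a 
>>>>>> build, as a sketch, since ompi_info lists the compiled-in MCA components:
>>>>>> 
>>>>>> ~# ompi_info | grep -i slurm )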
>>>>>> 
>>>>>> I would appreciate any help.   
>>>>>> 
>>>>>> 
>>>>>> Thanks,
>>>>>> Eugene. 
>>>>>> 
>>>>> 
>>>>> -- 
>>>>> Dennis Tants
>>>>> Trainee: IT Specialist for System Integration (Fachinformatiker für Systemintegration)
>>>>> 
>>>>> ZARM - Zentrum für angewandte Raumfahrttechnologie und Mikrogravitation
>>>>> ZARM - Center of Applied Space Technology and Microgravity
>>>>> 
>>>>> Universität Bremen
>>>>> Am Fallturm
>>>>> 28359 Bremen, Germany
>>>>> 
>>>>> Phone: 0421 218 57940
>>>>> E-Mail: ta...@zarm.uni-bremen.de
>>>>> 
>>>>> www.zarm.uni-bremen.de
