Just looking at it today...
> On Jul 18, 2017, at 7:25 AM, Eugene Dedits <eugene.ded...@gmail.com> wrote:
>
> Hi Ralph,
>
>
> did you have a chance to take a look at this problem?
>
> Thanks!
> Eugene.
>
>
>
>
> On Tue, Jul 11, 2017 at 12:51 PM, Eugene Dedits <eugene.ded...@gmail.com> wrote:
> Thanks! I really appreciate your help.
> In the meantime I’ve tried experimenting with 1.8.3. Here is what I’ve noticed.
>
> 1. Running the job with “sbatch ./my_script” where my script calls
> mpirun -np 160 -mca orte_forward_job_control 1 ./xhpl
>
> and then suspending the job with “scontrol suspend JOBID”
> does not work. Of the 10 nodes assigned to my job, 4 are still running
> 16 MPI processes of xhpl.
>
> 2. Running exactly the same job and then sending TSTP to the mpirun process
> does work: all 10 nodes show that the xhpl processes are stopped. Resuming
> them with -CONT also works.
>
> Again, this is with OpenMPI 1.8.3
>
> Once again, thank you for all the help.
>
> Cheers,
> Eugene.
>
>
>
>
>> On Jul 11, 2017, at 12:08 PM, r...@open-mpi.org wrote:
>>
>> Very odd - let me explore when I get back. Sorry for the delay.
>>
>> Sent from my iPad
>>
>> On Jul 11, 2017, at 10:59 AM, Eugene Dedits <eugene.ded...@gmail.com> wrote:
>>
>>> Ralph,
>>>
>>>
>>> Are you suggesting doing something similar to this:
>>> https://www.open-mpi.org/faq/?category=sge#sge-suspend-resume
>>>
>>> If yes, here is what I’ve done:
>>> - start a job using slurm and "mpirun -mca orte_forward_job_control 1 -np
>>> 160 xhpl”
>>> - ssh to the node where mpirun is launched
>>> - “kill -STOP PID” where PID is mpirun pid
>>> - “kill -TSTP PID”
>>>
>>> In both cases (STOP and TSTP) I observed that there were 16 MPI processes
>>> running at 100% on all 10 nodes where the job was started.
>>>
>>> Thanks,
>>> Eugene.
>>>
>>>
>>>
>>>
>>>
>>>
>>>> On Jul 11, 2017, at 10:35 AM, r...@open-mpi.org wrote:
>>>>
>>>>
>>>> Odd - I'm on travel this week but can look at it next week. One
>>>> possibility - have you tried hitting it with SIGTSTP instead of SIGSTOP?
>>>> There is a difference in the ability to trap and forward them.
>>>>
>>>> Sent from my iPad
>>>>
>>>>> On Jul 11, 2017, at 9:29 AM, Eugene Dedits <eugene.ded...@gmail.com> wrote:
>>>>>
>>>>>
>>>>> I’ve just tried 3.0.0rc1 and the problem still persists there…
>>>>>
>>>>> Thanks,
>>>>> E.
>>>>>
>>>>>
>>>>>
>>>>>> On Jul 11, 2017, at 10:20 AM, r...@open-mpi.org wrote:
>>>>>>
>>>>>>
>>>>>> Just checked the planning board and saw that my PR to bring that change
>>>>>> to 2.1.2 is pending and not yet in the release branch. I’ll try to make
>>>>>> that happen soon
>>>>>>
>>>>>> Sent from my iPad
>>>>>>
>>>>>>> On Jul 11, 2017, at 8:03 AM, r...@open-mpi.org wrote:
>>>>>>>
>>>>>>>
>>>>>>> There is an MCA param, ess_base_forward_signals, that controls which
>>>>>>> signals to forward. However, I just looked at the source code and see
>>>>>>> that it wasn't backported. Sigh.
>>>>>>>
>>>>>>> You could try the 3.0.0 branch, as it is at release-candidate stage and
>>>>>>> should go out within a week. I'd suggest just cloning that branch of the
>>>>>>> OMPI repo to get the latest state. The fix is definitely there.
>>>>>>>
>>>>>>> Sent from my iPad
>>>>>>>
>>>>>>>> On Jul 11, 2017, at 7:45 AM, Eugene Dedits <eugene.ded...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Ralph,
>>>>>>>>
>>>>>>>>
>>>>>>>> thanks for the reply. I’ve just tried upgrading to OMPI 2.1.1. The same
>>>>>>>> problem… :-\
>>>>>>>> Could you point me to some discussion of this?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Eugene.
>>>>>>>>
>>>>>>>>> On Jul 11, 2017, at 6:17 AM, r...@open-mpi.org wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> There is an issue with how the signal is forwarded. This has been
>>>>>>>>> fixed in the latest OMPI release so you might want to upgrade
>>>>>>>>>
>>>>>>>>> Ralph
>>>>>>>>>
>>>>>>>>> Sent from my iPad
>>>>>>>>>
>>>>>>>>>> On Jul 11, 2017, at 2:53 AM, Dennis Tants <dennis.ta...@zarm.uni-bremen.de> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hello Eugene,
>>>>>>>>>>
>>>>>>>>>> it is just a wild guess, but could you try "srun --mpi=pmi2" (you said
>>>>>>>>>> you built OMPI with PMI support) instead of "mpirun"?
>>>>>>>>>> srun is built-in and, I think, the preferred way of running parallel
>>>>>>>>>> processes. Maybe scontrol is able to suspend it this way.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Dennis
>>>>>>>>>>
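A sketch of the batch script from the original report rewritten to use srun, as Dennis suggests. The partition name, node counts, and binary are taken from the thread; the `--mpi=pmi2` flag assumes Slurm's PMI2 plugin is available on the cluster:

```shell
#!/bin/bash
#SBATCH --partition=standard
#SBATCH -N 10
#SBATCH --ntasks-per-node=16

# Launch through Slurm's own launcher so slurmd manages (and can signal)
# every task directly, instead of going through mpirun's daemons.
srun --mpi=pmi2 -n 160 ./xhpl | tee LOG
```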
>>>>>>>>>>> On 10.07.2017 at 22:20, Eugene Dedits wrote:
>>>>>>>>>>> Hello SLURM-DEV
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I have a problem with slurm, openmpi, and “scontrol suspend”.
>>>>>>>>>>>
>>>>>>>>>>> My setup is:
>>>>>>>>>>> 96-node cluster with IB, running rhel 6.8
>>>>>>>>>>> slurm 17.02.1
>>>>>>>>>>> openmpi 2.0.0 (built using Intel 2016 compiler)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I am running some application (hpl in this particular case) using
>>>>>>>>>>> batch script similar to:
>>>>>>>>>>> -----------------------------
>>>>>>>>>>> #!/bin/bash
>>>>>>>>>>> #SBATCH --partition=standard
>>>>>>>>>>> #SBATCH -N 10
>>>>>>>>>>> #SBATCH --ntasks-per-node=16
>>>>>>>>>>>
>>>>>>>>>>> mpirun -np 160 xhpl | tee LOG
>>>>>>>>>>> -----------------------------
>>>>>>>>>>>
>>>>>>>>>>> So I am running it on 160 cores across 10 nodes.
>>>>>>>>>>>
>>>>>>>>>>> Once job is submitted to the queue and is running I suspend it using
>>>>>>>>>>> ~# scontrol suspend JOBID
>>>>>>>>>>>
>>>>>>>>>>> I see that indeed my job stopped producing output. I go to each of
>>>>>>>>>>> the 10
>>>>>>>>>>> nodes that were assigned for my job and see if the xhpl processes
>>>>>>>>>>> are running
>>>>>>>>>>> there with :
>>>>>>>>>>>
>>>>>>>>>>> ~# for i in {10..19}; do ssh node$i “top -b -n 1 | head -n 50 |
>>>>>>>>>>> grep xhpl | wc -l”; done
>>>>>>>>>>>
>>>>>>>>>>> I expect this little script to return 0 from every node (because
>>>>>>>>>>> suspend sent SIGSTOP and they shouldn’t show up in top). However,
>>>>>>>>>>> I see that processes are reliably suspended only on node10. I get:
>>>>>>>>>>> 0
>>>>>>>>>>> 16
>>>>>>>>>>> 16
>>>>>>>>>>> …
>>>>>>>>>>> 16
>>>>>>>>>>>
>>>>>>>>>>> So 9 out of 10 nodes still have 16 MPI processes of my xhpl
>>>>>>>>>>> application running at 100%.
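A more direct way to count still-running xhpl processes is to look at the ps state field, where a leading 'T' marks a stopped process; the node names and the xhpl process name are taken from the report above:

```shell
# For each node, count xhpl processes whose state is NOT 'T' (stopped).
# A fully suspended node should print 0.
for i in $(seq 10 19); do
  printf 'node%s: ' "$i"
  ssh "node$i" 'ps -C xhpl -o stat= | grep -cv "^T"'
done
```

Unlike scraping the first 50 lines of top output, this counts every xhpl process regardless of where it sorts by CPU usage.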
>>>>>>>>>>>
>>>>>>>>>>> If I run “scontrol resume JOBID” and then suspend it again I see
>>>>>>>>>>> that (sometimes) more
>>>>>>>>>>> nodes have “xhpl” processes properly suspended. Every time I resume
>>>>>>>>>>> and suspend the
>>>>>>>>>>> job, I see different nodes returning 0 in my “ssh-run-top” script.
>>>>>>>>>>>
>>>>>>>>>>> So altogether it looks like the suspend mechanism doesn’t
>>>>>>>>>>> work properly in SLURM with
>>>>>>>>>>> OpenMPI. I’ve tried compiling OpenMPI with “--with-slurm
>>>>>>>>>>> --with-pmi=/path/to/my/slurm”.
>>>>>>>>>>> I’ve observed the same behavior.
>>>>>>>>>>>
>>>>>>>>>>> I would appreciate any help.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Eugene.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Dennis Tants
>>>>>>>>>> Trainee: IT specialist for system integration
>>>>>>>>>>
>>>>>>>>>> ZARM - Zentrum für angewandte Raumfahrttechnologie und
>>>>>>>>>> Mikrogravitation
>>>>>>>>>> ZARM - Center of Applied Space Technology and Microgravity
>>>>>>>>>>
>>>>>>>>>> Universität Bremen
>>>>>>>>>> Am Fallturm
>>>>>>>>>> 28359 Bremen, Germany
>>>>>>>>>>
>>>>>>>>>> Telefon: 0421 218 57940
>>>>>>>>>> E-Mail: ta...@zarm.uni-bremen.de
>>>>>>>>>>
>>>>>>>>>> www.zarm.uni-bremen.de
>>>
>
>