Hmm… I’m unable to replicate the problem on my machines. What fabric are you 
using? Does the problem go away if you add “-mca btl tcp,sm,self” to the mpirun 
command line?
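
For example, something along these lines (using the loop_spawn binary from the 
quoted output below; adjust the path and process count to match your setup):

  mpiexec -np 5 -mca btl tcp,sm,self ./loop_spawn

If you’re not sure which fabric/BTLs your build supports, the output of 
“ompi_info | grep btl” would also be useful.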

> On Jun 14, 2016, at 11:15 AM, Jason Maldonis <maldo...@wisc.edu> wrote:
> 
> Hi Ralph et al.,
> 
> Great, thank you for the help. I downloaded the MPI loop spawn test directly 
> from what I think is the master repo on GitHub: 
> https://github.com/open-mpi/ompi/blob/master/orte/test/mpi/loop_spawn.c
> I am still running it against Open MPI 1.10.2, however.
> 
> Is that test updated with the correct code? If so, I am still getting the 
> same "too many retries sending message to 0x0184:0x00001d27, giving up" 
> errors. I also just downloaded the June 14 nightly tarball (7.79 MB) from 
> https://www.open-mpi.org/nightly/v2.x/ and I get the same error.
> 
> Could you please point me to the correct code?
> 
> If you need me to provide more information please let me know.
> 
> Thank you,
> Jason
> 
> Jason Maldonis
> Research Assistant of Professor Paul Voyles
> Materials Science Grad Student
> University of Wisconsin, Madison
> 1509 University Ave, Rm M142
> Madison, WI 53706
> maldo...@wisc.edu
> 608-295-5532
> 
> On Tue, Jun 14, 2016 at 10:59 AM, Ralph Castain <r...@open-mpi.org> wrote:
> I dug into this a bit (with some help from others) and found that the spawn 
> code appears to be working correctly - it is the test in orte/test that is 
> wrong. The test has been correctly updated in the 2.x and master repos, but 
> we failed to backport it to the 1.10 series. I have done so this morning, and 
> it will be in the upcoming 1.10.3 release (out very soon).
> 
> 
>> On Jun 13, 2016, at 3:53 PM, Ralph Castain <r...@open-mpi.org> wrote:
>> 
>> No, that PR has nothing to do with loop_spawn. I’ll try to take a look at 
>> the problem.
>> 
>>> On Jun 13, 2016, at 3:47 PM, Jason Maldonis <maldo...@wisc.edu> wrote:
>>> 
>>> Hello,
>>> 
>>> I am using Open MPI 1.10.2 compiled with the Intel compilers. I am trying to 
>>> get the spawn functionality to work inside a for loop, but I keep getting the 
>>> error "too many retries sending message to <addr>, giving up" somewhere down 
>>> the line in the loop, seemingly because the spawned processes are not being 
>>> fully freed when they disconnect/finish. I found the 
>>> orte/test/mpi/loop_spawn.c example/test 
>>> (https://github.com/open-mpi/ompi/blob/master/orte/test/mpi/loop_spawn.c), 
>>> and it has exactly the same problem. I also found this mailing list post from 
>>> about a month and a half ago: 
>>> https://www.open-mpi.org/community/lists/devel/2016/04/18814.php
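>>> 
>>> In case it helps, the pattern I'm using is roughly the sketch below (heavily 
>>> simplified; "./child" and the iteration count are placeholders, not my real 
>>> code):
>>> 
>>> #include <mpi.h>
>>> 
>>> int main(int argc, char **argv)
>>> {
>>>     MPI_Init(&argc, &argv);
>>> 
>>>     for (int i = 0; i < 1000; i++) {
>>>         MPI_Comm intercomm, merged;
>>>         /* Spawn one child per iteration; "./child" is a placeholder. */
>>>         MPI_Comm_spawn("./child", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
>>>                        0, MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
>>>         /* Merge with the child, do some work, then tear it all down. */
>>>         MPI_Intercomm_merge(intercomm, 0, &merged);
>>>         MPI_Comm_free(&merged);
>>>         MPI_Comm_disconnect(&intercomm);
>>>     }
>>> 
>>>     MPI_Finalize();
>>>     return 0;
>>> }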
>>> 
>>> Is this PR (https://github.com/open-mpi/ompi/pull/1473) about the same issue 
>>> I am having (i.e., the loop_spawn example not working)? If so, do you know if 
>>> we can downgrade to e.g. 1.10.1 or another version? Or is there another way 
>>> to work around this bug until you get a new release out (or is one coming 
>>> shortly that fixes this)?
>>> 
>>> Below is the output of the loop_spawn test on our university's cluster. I 
>>> know very little about its architecture, but the team that manages the 
>>> cluster is very good, so I can get more information if it would be helpful.
>>> 
>>> Thanks for your time.
>>> 
>>> Jason
>>> 
>>> mpiexec -np 5 loop_spawn
>>> parent*******************************
>>> parent: Launching MPI*
>>> parent*******************************
>>> parent: Launching MPI*
>>> parent*******************************
>>> parent: Launching MPI*
>>> parent*******************************
>>> parent: Launching MPI*
>>> parent*******************************
>>> parent: Launching MPI*
>>> parent: MPI_Comm_spawn #0 return : 0
>>> parent: MPI_Comm_spawn #0 return : 0
>>> parent: MPI_Comm_spawn #0 return : 0
>>> parent: MPI_Comm_spawn #0 return : 0
>>> parent: MPI_Comm_spawn #0 return : 0
>>> Child: launch
>>> Child merged rank = 5, size = 6
>>> parent: MPI_Comm_spawn #0 rank 4, size 6
>>> parent: MPI_Comm_spawn #0 rank 0, size 6
>>> parent: MPI_Comm_spawn #0 rank 2, size 6
>>> parent: MPI_Comm_spawn #0 rank 3, size 6
>>> parent: MPI_Comm_spawn #0 rank 1, size 6
>>> Child 329941: exiting
>>> parent: MPI_Comm_spawn #1 return : 0
>>> parent: MPI_Comm_spawn #1 return : 0
>>> parent: MPI_Comm_spawn #1 return : 0
>>> parent: MPI_Comm_spawn #1 return : 0
>>> parent: MPI_Comm_spawn #1 return : 0
>>> Child: launch
>>> parent: MPI_Comm_spawn #1 rank 0, size 6
>>> parent: MPI_Comm_spawn #1 rank 2, size 6
>>> parent: MPI_Comm_spawn #1 rank 1, size 6
>>> parent: MPI_Comm_spawn #1 rank 3, size 6
>>> Child merged rank = 5, size = 6
>>> parent: MPI_Comm_spawn #1 rank 4, size 6
>>> Child 329945: exiting
>>> parent: MPI_Comm_spawn #2 return : 0
>>> parent: MPI_Comm_spawn #2 return : 0
>>> parent: MPI_Comm_spawn #2 return : 0
>>> parent: MPI_Comm_spawn #2 return : 0
>>> parent: MPI_Comm_spawn #2 return : 0
>>> Child: launch
>>> parent: MPI_Comm_spawn #2 rank 3, size 6
>>> parent: MPI_Comm_spawn #2 rank 0, size 6
>>> parent: MPI_Comm_spawn #2 rank 2, size 6
>>> Child merged rank = 5, size = 6
>>> parent: MPI_Comm_spawn #2 rank 1, size 6
>>> parent: MPI_Comm_spawn #2 rank 4, size 6
>>> Child 329949: exiting
>>> parent: MPI_Comm_spawn #3 return : 0
>>> parent: MPI_Comm_spawn #3 return : 0
>>> parent: MPI_Comm_spawn #3 return : 0
>>> parent: MPI_Comm_spawn #3 return : 0
>>> parent: MPI_Comm_spawn #3 return : 0
>>> Child: launch
>>> [node:port?] too many retries sending message to <addr>, giving up
>>> -------------------------------------------------------
>>> Child job 5 terminated normally, but 1 process returned
>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>> -------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> mpiexec detected that one or more processes exited with non-zero status, 
>>> thus causing
>>> the job to be terminated. The first process to do so was:
>>> 
>>>   Process name: [[...],0]
>>>   Exit code:    255
>>> --------------------------------------------------------------------------