That message is coming from udcm in the openib btl. It indicates some sort of 
failure in the connection mechanism. It can happen if the listening thread no 
longer exists or is taking too long to process messages.

-Nathan


On Jun 14, 2016, at 12:20 PM, Ralph Castain <r...@open-mpi.org> wrote:

Hmm…I’m unable to replicate a problem on my machines. What fabric are you 
using? Does the problem go away if you add “-mca btl tcp,sm,self” to the mpirun 
cmd line?
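For example, the suggested command might look like the following (a sketch only, assuming the test binary is named `loop_spawn`; not output from an actual run):

```shell
# Restrict Open MPI to the TCP, shared-memory, and self transports,
# bypassing the openib/InfiniBand path whose udcm component emits the error:
mpirun -np 5 -mca btl tcp,sm,self ./loop_spawn
```

If the error disappears with this setting, that points at the openib BTL rather than the spawn logic itself.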

On Jun 14, 2016, at 11:15 AM, Jason Maldonis <maldo...@wisc.edu> wrote:
Hi Ralph et al.,

Great, thank you for the help. I downloaded the MPI loop_spawn test directly 
from what I think is the master repo on GitHub: 
https://github.com/open-mpi/ompi/blob/master/orte/test/mpi/loop_spawn.c
I am still using the MPI code from 1.10.2, however.

Is that test updated with the correct code? If so, I am still getting the same "too 
many retries sending message to 0x0184:0x00001d27, giving up" errors. I also just 
downloaded the June 14 nightly tarball (7.79MB) from: 
https://www.open-mpi.org/nightly/v2.x/ and I get the same error.

Could you please point me to the correct code?

If you need me to provide more information please let me know.

Thank you,
Jason

Jason Maldonis
Research Assistant of Professor Paul Voyles
Materials Science Grad Student
University of Wisconsin, Madison
1509 University Ave, Rm M142
Madison, WI 53706
maldo...@wisc.edu
608-295-5532

On Tue, Jun 14, 2016 at 10:59 AM, Ralph Castain <r...@open-mpi.org> wrote:
I dug into this a bit (with some help from others) and found that the spawn 
code appears to be working correctly - it is the test in orte/test that is 
wrong. The test has been correctly updated in the 2.x and master repos, but we 
failed to backport it to the 1.10 series. I have done so this morning, and it 
will be in the upcoming 1.10.3 release (out very soon).


On Jun 13, 2016, at 3:53 PM, Ralph Castain <r...@open-mpi.org> wrote:

No, that PR has nothing to do with loop_spawn. I’ll try to take a look at the 
problem.

On Jun 13, 2016, at 3:47 PM, Jason Maldonis <maldo...@wisc.edu> wrote:

Hello,

I am using Open MPI 1.10.2 compiled with Intel. I am trying to get the spawn 
functionality to work inside a for loop, but I keep getting the error "too many 
retries sending message to <addr>, giving up" somewhere down the line in the 
loop, seemingly because the processes are not being fully freed when 
disconnecting/finishing. I found the orte/test/mpi/loop_spawn.c example/test, 
and it has exactly the same problem. I also found this mailing list post from 
about a month and a half ago.
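The pattern that triggers the failure can be sketched as a minimal parent program in the spirit of the loop_spawn.c test (a hypothetical sketch, not the actual test source; "./child" is a placeholder worker executable, error handling omitted):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    for (int i = 0; i < 1000; i++) {
        MPI_Comm intercomm;

        /* Spawn one child per iteration. */
        MPI_Comm_spawn("./child", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);

        /* ... communicate with the child here ... */

        /* Disconnect so the runtime can reclaim the child's resources.
         * The reported behavior suggests this cleanup does not fully
         * complete, so after enough iterations a spawn fails with
         * "too many retries sending message to <addr>, giving up". */
        MPI_Comm_disconnect(&intercomm);
    }

    MPI_Finalize();
    return 0;
}
```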

Is this PR (https://github.com/open-mpi/ompi/pull/1473) about the same issue I 
am having (i.e., the loop_spawn example not working)? If so, do you know whether 
we can downgrade to e.g. 1.10.1 or another version? Or is there another way to 
work around this bug until a new release is out (or is one coming shortly to 
fix this)?

Below is the output of the loop_spawn test on our university's cluster. I know 
very little about its architecture, but I can get that information if it's 
helpful; the large group of people who manage the cluster is very good.

Thanks for your time.

Jason

mpiexec -np 5 loop_spawn
parent*******************************
parent: Launching MPI*
parent*******************************
parent: Launching MPI*
parent*******************************
parent: Launching MPI*
parent*******************************
parent: Launching MPI*
parent*******************************
parent: Launching MPI*
parent: MPI_Comm_spawn #0 return : 0
parent: MPI_Comm_spawn #0 return : 0
parent: MPI_Comm_spawn #0 return : 0
parent: MPI_Comm_spawn #0 return : 0
parent: MPI_Comm_spawn #0 return : 0
Child: launch
Child merged rank = 5, size = 6
parent: MPI_Comm_spawn #0 rank 4, size 6
parent: MPI_Comm_spawn #0 rank 0, size 6
parent: MPI_Comm_spawn #0 rank 2, size 6
parent: MPI_Comm_spawn #0 rank 3, size 6
parent: MPI_Comm_spawn #0 rank 1, size 6
Child 329941: exiting
parent: MPI_Comm_spawn #1 return : 0
parent: MPI_Comm_spawn #1 return : 0
parent: MPI_Comm_spawn #1 return : 0
parent: MPI_Comm_spawn #1 return : 0
parent: MPI_Comm_spawn #1 return : 0
Child: launch
parent: MPI_Comm_spawn #1 rank 0, size 6
parent: MPI_Comm_spawn #1 rank 2, size 6
parent: MPI_Comm_spawn #1 rank 1, size 6
parent: MPI_Comm_spawn #1 rank 3, size 6
Child merged rank = 5, size = 6
parent: MPI_Comm_spawn #1 rank 4, size 6
Child 329945: exiting
parent: MPI_Comm_spawn #2 return : 0
parent: MPI_Comm_spawn #2 return : 0
parent: MPI_Comm_spawn #2 return : 0
parent: MPI_Comm_spawn #2 return : 0
parent: MPI_Comm_spawn #2 return : 0
Child: launch
parent: MPI_Comm_spawn #2 rank 3, size 6
parent: MPI_Comm_spawn #2 rank 0, size 6
parent: MPI_Comm_spawn #2 rank 2, size 6
Child merged rank = 5, size = 6
parent: MPI_Comm_spawn #2 rank 1, size 6
parent: MPI_Comm_spawn #2 rank 4, size 6
Child 329949: exiting
parent: MPI_Comm_spawn #3 return : 0
parent: MPI_Comm_spawn #3 return : 0
parent: MPI_Comm_spawn #3 return : 0
parent: MPI_Comm_spawn #3 return : 0
parent: MPI_Comm_spawn #3 return : 0
Child: launch
[node:port?] too many retries sending message to <addr>, giving up
-------------------------------------------------------
Child job 5 terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus 
causing
the job to be terminated. The first process to do so was:

 Process name: [[...],0]
 Exit code:    255
--------------------------------------------------------------------------
_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2016/06/29425.php





