That message is coming from udcm (the UD connection manager) in the openib BTL. It indicates some sort of failure in the connection mechanism. It can happen if the listening thread no longer exists or is taking too long to process messages.
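One way to check whether the openib BTL (and hence udcm) is at fault is to restrict Open MPI to other transports, as suggested further down the thread. A hypothetical invocation for the loop_spawn test follows; the binary name and -np value are illustrative, not from the thread:

```shell
# Force the TCP, shared-memory, and self BTLs, bypassing openib/udcm entirely.
# Binary name and process count are illustrative placeholders.
mpirun -np 5 --mca btl tcp,sm,self ./loop_spawn
```

The same selection can be made persistent by putting `btl = tcp,sm,self` in `$HOME/.openmpi/mca-params.conf`.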
-Nathan

On Jun 14, 2016, at 12:20 PM, Ralph Castain <r...@open-mpi.org> wrote:

Hmm... I'm unable to replicate a problem on my machines. What fabric are you using? Does the problem go away if you add "-mca btl tcp,sm,self" to the mpirun command line?

On Jun 14, 2016, at 11:15 AM, Jason Maldonis <maldo...@wisc.edu> wrote:

Hi Ralph et al.,

Great, thank you for the help. I downloaded the MPI loop_spawn test directly from what I think is the master repo on GitHub:
https://github.com/open-mpi/ompi/blob/master/orte/test/mpi/loop_spawn.c

I am still using the MPI code from 1.10.2, however. Is that test updated with the correct code? If so, I am still getting the same "too many retries sending message to 0x0184:0x00001d27, giving up" errors. I also just downloaded the June 14 nightly tarball (7.79 MB) from:
https://www.open-mpi.org/nightly/v2.x/
and I get the same error. Could you please point me to the correct code? If you need me to provide more information, please let me know.

Thank you,
Jason

Jason Maldonis
Research Assistant of Professor Paul Voyles
Materials Science Grad Student
University of Wisconsin, Madison
1509 University Ave, Rm M142
Madison, WI 53706
maldo...@wisc.edu
608-295-5532

On Tue, Jun 14, 2016 at 10:59 AM, Ralph Castain <r...@open-mpi.org> wrote:

I dug into this a bit (with some help from others) and found that the spawn code appears to be working correctly; it is the test in orte/test that is wrong. The test has been correctly updated in the 2.x and master repos, but we failed to backport it to the 1.10 series. I have done so this morning, and it will be in the upcoming 1.10.3 release (out very soon).

On Jun 13, 2016, at 3:53 PM, Ralph Castain <r...@open-mpi.org> wrote:

No, that PR has nothing to do with loop_spawn. I'll try to take a look at the problem.

On Jun 13, 2016, at 3:47 PM, Jason Maldonis <maldo...@wisc.edu> wrote:

Hello,

I am using Open MPI 1.10.2 compiled with Intel.
I am trying to get the spawn functionality to work inside a for loop, but somewhere down the line in the loop I keep getting the error "too many retries sending message to <addr>, giving up", seemingly because the processors are not being fully freed when disconnecting/finishing. I found the orte/test/mpi/loop_spawn.c example/test, and it has exactly the same problem.

I also found this mailing list post from about a month and a half ago. Is this PR (https://github.com/open-mpi/ompi/pull/1473) about the same issue I am having (i.e., the loop_spawn example not working)? If so, do you know whether we can downgrade to, e.g., 1.10.1 or another version? Or is there another way to work around this bug until you get a new release out (or is one coming shortly that fixes it, maybe)?

Below is the output of the loop_spawn test on our university's cluster. I know very little about its architecture, but I can get information if that would be helpful; the large group of people who manage the cluster are very good.

Thanks for your time.
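For context, the spawn-in-a-loop pattern described above looks roughly like the following minimal sketch. This is not the actual loop_spawn.c; the child executable name and iteration count are placeholders:

```c
/* Minimal sketch of spawning a child job repeatedly inside a loop and
 * disconnecting after each iteration -- the pattern that triggers the
 * reported "too many retries" failure. The "./worker" binary and the
 * iteration count are hypothetical, not taken from loop_spawn.c. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    for (int i = 0; i < 100; i++) {
        MPI_Comm child;

        /* Spawn one child process running a hypothetical worker binary;
         * "child" becomes an intercommunicator to the new job. */
        MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &child, MPI_ERRCODES_IGNORE);

        /* ... communicate with the child over the intercommunicator ... */

        /* Disconnect so the child job's resources can be released before
         * the next spawn. The complaint in this thread is that, on the
         * affected releases, this cleanup does not appear to complete,
         * and later iterations eventually fail. */
        MPI_Comm_disconnect(&child);
    }

    MPI_Finalize();
    return 0;
}
```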
Jason

mpiexec -np 5 loop_spawn
parent*******************************
parent: Launching MPI*
parent*******************************
parent: Launching MPI*
parent*******************************
parent: Launching MPI*
parent*******************************
parent: Launching MPI*
parent*******************************
parent: Launching MPI*
parent: MPI_Comm_spawn #0 return : 0
parent: MPI_Comm_spawn #0 return : 0
parent: MPI_Comm_spawn #0 return : 0
parent: MPI_Comm_spawn #0 return : 0
parent: MPI_Comm_spawn #0 return : 0
Child: launch
Child merged rank = 5, size = 6
parent: MPI_Comm_spawn #0 rank 4, size 6
parent: MPI_Comm_spawn #0 rank 0, size 6
parent: MPI_Comm_spawn #0 rank 2, size 6
parent: MPI_Comm_spawn #0 rank 3, size 6
parent: MPI_Comm_spawn #0 rank 1, size 6
Child 329941: exiting
parent: MPI_Comm_spawn #1 return : 0
parent: MPI_Comm_spawn #1 return : 0
parent: MPI_Comm_spawn #1 return : 0
parent: MPI_Comm_spawn #1 return : 0
parent: MPI_Comm_spawn #1 return : 0
Child: launch
parent: MPI_Comm_spawn #1 rank 0, size 6
parent: MPI_Comm_spawn #1 rank 2, size 6
parent: MPI_Comm_spawn #1 rank 1, size 6
parent: MPI_Comm_spawn #1 rank 3, size 6
Child merged rank = 5, size = 6
parent: MPI_Comm_spawn #1 rank 4, size 6
Child 329945: exiting
parent: MPI_Comm_spawn #2 return : 0
parent: MPI_Comm_spawn #2 return : 0
parent: MPI_Comm_spawn #2 return : 0
parent: MPI_Comm_spawn #2 return : 0
parent: MPI_Comm_spawn #2 return : 0
Child: launch
parent: MPI_Comm_spawn #2 rank 3, size 6
parent: MPI_Comm_spawn #2 rank 0, size 6
parent: MPI_Comm_spawn #2 rank 2, size 6
Child merged rank = 5, size = 6
parent: MPI_Comm_spawn #2 rank 1, size 6
parent: MPI_Comm_spawn #2 rank 4, size 6
Child 329949: exiting
parent: MPI_Comm_spawn #3 return : 0
parent: MPI_Comm_spawn #3 return : 0
parent: MPI_Comm_spawn #3 return : 0
parent: MPI_Comm_spawn #3 return : 0
parent: MPI_Comm_spawn #3 return : 0
Child: launch
[node:port?]
too many retries sending message to <addr>, giving up
-------------------------------------------------------
Child job 5 terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

  Process name: [[...],0]
  Exit code: 255
--------------------------------------------------------------------------

_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2016/06/29425.php

_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2016/06/29435.php

_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2016/06/29438.php

_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2016/06/29439.php