Hmm… I’m unable to replicate the problem on my machines. What fabric are you using? Does the problem go away if you add “-mca btl tcp,sm,self” to the mpirun command line?
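In case it helps, the full invocation with that MCA parameter would look like the sketch below (the binary name and process count are taken from the mpiexec line quoted later in the thread; adjust paths for your setup):

```shell
# Restrict Open MPI to the TCP, shared-memory, and self BTLs,
# bypassing whatever fabric-specific transport may be failing.
mpirun -np 5 -mca btl tcp,sm,self ./loop_spawn
```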
> On Jun 14, 2016, at 11:15 AM, Jason Maldonis <maldo...@wisc.edu> wrote:
>
> Hi Ralph et al.,
>
> Great, thank you for the help. I downloaded the MPI loop spawn test directly
> from what I think is the master repo on GitHub:
> https://github.com/open-mpi/ompi/blob/master/orte/test/mpi/loop_spawn.c
> I am still using the MPI code from 1.10.2, however.
>
> Is that test updated with the correct code? If so, I am still getting the
> same "too many retries sending message to 0x0184:0x00001d27, giving up"
> errors. I also just downloaded the June 14 nightly tarball (7.79 MB) from
> https://www.open-mpi.org/nightly/v2.x/ and I get the same error.
>
> Could you please point me to the correct code?
>
> If you need me to provide more information, please let me know.
>
> Thank you,
> Jason
>
> Jason Maldonis
> Research Assistant of Professor Paul Voyles
> Materials Science Grad Student
> University of Wisconsin, Madison
> 1509 University Ave, Rm M142
> Madison, WI 53706
> maldo...@wisc.edu
> 608-295-5532
>
> On Tue, Jun 14, 2016 at 10:59 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
> I dug into this a bit (with some help from others) and found that the spawn
> code appears to be working correctly - it is the test in orte/test that is
> wrong. The test has been correctly updated in the 2.x and master repos, but
> we failed to backport it to the 1.10 series. I have done so this morning, and
> it will be in the upcoming 1.10.3 release (out very soon).
>
>> On Jun 13, 2016, at 3:53 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> No, that PR has nothing to do with loop_spawn. I’ll try to take a look at
>> the problem.
>>
>>> On Jun 13, 2016, at 3:47 PM, Jason Maldonis <maldo...@wisc.edu> wrote:
>>>
>>> Hello,
>>>
>>> I am using Open MPI 1.10.2 compiled with Intel. I am trying to get the
>>> spawn functionality to work inside a for loop, but I keep getting the
>>> error "too many retries sending message to <addr>, giving up" somewhere
>>> down the line in the loop, seemingly because the processors are not being
>>> fully freed when disconnecting/finishing. I found the
>>> orte/test/mpi/loop_spawn.c example/test
>>> (https://github.com/open-mpi/ompi/blob/master/orte/test/mpi/loop_spawn.c),
>>> and it has the exact same problem. I also found this mailing list post
>>> from about a month and a half ago:
>>> https://www.open-mpi.org/community/lists/devel/2016/04/18814.php
>>>
>>> Is this PR (https://github.com/open-mpi/ompi/pull/1473) about the same
>>> issue I am having (i.e., the loop_spawn example not working)? If so, do
>>> you know if we can downgrade to, e.g., 1.10.1 or another version? Or is
>>> there another way to work around this bug until you get a new release out
>>> (or is one coming shortly to fix this, maybe)?
>>>
>>> Below is the output of the loop_spawn test on our university's cluster. I
>>> know very little about its architecture, but I can get that information if
>>> it's helpful; the large group of people who manage this cluster are very
>>> good.
>>>
>>> Thanks for your time.
>>>
>>> Jason
>>>
>>> mpiexec -np 5 loop_spawn
>>> parent*******************************
>>> parent: Launching MPI*
>>> parent*******************************
>>> parent: Launching MPI*
>>> parent*******************************
>>> parent: Launching MPI*
>>> parent*******************************
>>> parent: Launching MPI*
>>> parent*******************************
>>> parent: Launching MPI*
>>> parent: MPI_Comm_spawn #0 return : 0
>>> parent: MPI_Comm_spawn #0 return : 0
>>> parent: MPI_Comm_spawn #0 return : 0
>>> parent: MPI_Comm_spawn #0 return : 0
>>> parent: MPI_Comm_spawn #0 return : 0
>>> Child: launch
>>> Child merged rank = 5, size = 6
>>> parent: MPI_Comm_spawn #0 rank 4, size 6
>>> parent: MPI_Comm_spawn #0 rank 0, size 6
>>> parent: MPI_Comm_spawn #0 rank 2, size 6
>>> parent: MPI_Comm_spawn #0 rank 3, size 6
>>> parent: MPI_Comm_spawn #0 rank 1, size 6
>>> Child 329941: exiting
>>> parent: MPI_Comm_spawn #1 return : 0
>>> parent: MPI_Comm_spawn #1 return : 0
>>> parent: MPI_Comm_spawn #1 return : 0
>>> parent: MPI_Comm_spawn #1 return : 0
>>> parent: MPI_Comm_spawn #1 return : 0
>>> Child: launch
>>> parent: MPI_Comm_spawn #1 rank 0, size 6
>>> parent: MPI_Comm_spawn #1 rank 2, size 6
>>> parent: MPI_Comm_spawn #1 rank 1, size 6
>>> parent: MPI_Comm_spawn #1 rank 3, size 6
>>> Child merged rank = 5, size = 6
>>> parent: MPI_Comm_spawn #1 rank 4, size 6
>>> Child 329945: exiting
>>> parent: MPI_Comm_spawn #2 return : 0
>>> parent: MPI_Comm_spawn #2 return : 0
>>> parent: MPI_Comm_spawn #2 return : 0
>>> parent: MPI_Comm_spawn #2 return : 0
>>> parent: MPI_Comm_spawn #2 return : 0
>>> Child: launch
>>> parent: MPI_Comm_spawn #2 rank 3, size 6
>>> parent: MPI_Comm_spawn #2 rank 0, size 6
>>> parent: MPI_Comm_spawn #2 rank 2, size 6
>>> Child merged rank = 5, size = 6
>>> parent: MPI_Comm_spawn #2 rank 1, size 6
>>> parent: MPI_Comm_spawn #2 rank 4, size 6
>>> Child 329949: exiting
>>> parent: MPI_Comm_spawn #3 return : 0
>>> parent: MPI_Comm_spawn #3 return : 0
>>> parent: MPI_Comm_spawn #3 return : 0
>>> parent: MPI_Comm_spawn #3 return : 0
>>> parent: MPI_Comm_spawn #3 return : 0
>>> Child: launch
>>> [node:port?] too many retries sending message to <addr>, giving up
>>> -------------------------------------------------------
>>> Child job 5 terminated normally, but 1 process returned
>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>> -------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> mpiexec detected that one or more processes exited with non-zero status,
>>> thus causing the job to be terminated. The first process to do so was:
>>>
>>> Process name: [[...],0]
>>> Exit code: 255
>>> --------------------------------------------------------------------------
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post: http://www.open-mpi.org/community/lists/users/2016/06/29425.php
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2016/06/29435.php

_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2016/06/29438.php
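For anyone following the thread, the pattern under discussion is "MPI_Comm_spawn inside a for loop." Below is a minimal sketch of that pattern; it illustrates the API usage only and is not the actual orte/test/mpi/loop_spawn.c source, and the "./child" binary name is a placeholder:

```c
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    for (int i = 0; i < 100; i++) {
        MPI_Comm child, merged;

        /* Spawn one child process ("./child" is a placeholder binary). */
        MPI_Comm_spawn("./child", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &child, MPI_ERRCODES_IGNORE);

        /* Merge the intercommunicator so parent and child share one
         * intracommunicator (high = 0 orders the parent group first). */
        MPI_Intercomm_merge(child, 0, &merged);

        /* ... do work on the merged communicator ... */

        /* Release resources each iteration; the failure reported above
         * suggests something is not being fully freed at this point. */
        MPI_Comm_free(&merged);
        MPI_Comm_disconnect(&child);
    }

    MPI_Finalize();
    return 0;
}
```

If the disconnect/free pair does not fully release the child job's resources, repeated iterations eventually exhaust them, which would match the "too many retries ... giving up" failure appearing only a few iterations in.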