Hi Ralph et al.,

Great, thank you for the help. I downloaded the loop_spawn test directly
from what I think is the master repo on GitHub:
https://github.com/open-mpi/ompi/blob/master/orte/test/mpi/loop_spawn.c
I am still building against the MPI library from 1.10.2, however.

Is that the updated test? If so, I am still getting the same
"too many retries sending message to 0x0184:0x00001d27, giving up"
errors. I also just downloaded the June 14 nightly tarball (7.79 MB) from
https://www.open-mpi.org/nightly/v2.x/ and I get the same error there.

Could you please point me to the correct code?

If you need me to provide more information please let me know.

Thank you,
Jason

Jason Maldonis
Research Assistant to Professor Paul Voyles
Materials Science Grad Student
University of Wisconsin, Madison
1509 University Ave, Rm M142
Madison, WI 53706
maldo...@wisc.edu
608-295-5532

On Tue, Jun 14, 2016 at 10:59 AM, Ralph Castain <r...@open-mpi.org> wrote:

> I dug into this a bit (with some help from others) and found that the
> spawn code appears to be working correctly - it is the test in orte/test
> that is wrong. The test has been correctly updated in the 2.x and master
> repos, but we failed to backport it to the 1.10 series. I have done so this
> morning, and it will be in the upcoming 1.10.3 release (out very soon).
>
>
> On Jun 13, 2016, at 3:53 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
> No, that PR has nothing to do with loop_spawn. I’ll try to take a look at
> the problem.
>
> On Jun 13, 2016, at 3:47 PM, Jason Maldonis <maldo...@wisc.edu> wrote:
>
> Hello,
>
> I am using Open MPI 1.10.2 compiled with the Intel compilers. I am trying
> to get the spawn functionality to work inside a for loop, but I keep
> getting the error "too many retries sending message to <addr>, giving up"
> somewhere down the line in the loop, seemingly because the spawned
> processes are not being fully released when they disconnect/finish. I
> found the orte/test/mpi/loop_spawn.c example/test
> (https://github.com/open-mpi/ompi/blob/master/orte/test/mpi/loop_spawn.c),
> and it has the exact same problem. I also found this mailing list post
> from about a month and a half ago:
> https://www.open-mpi.org/community/lists/devel/2016/04/18814.php
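>
> For context, here is a minimal sketch of the pattern the test exercises,
> as I understand it (simplified by me from loop_spawn.c; the executable
> name "child", the iteration count, and the exact call sequence are
> placeholders, not the actual test's values):
>
>   /* parent.c -- spawn a child job on each loop iteration, then disconnect */
>   #include <mpi.h>
>   #include <stdio.h>
>
>   int main(int argc, char **argv)
>   {
>       MPI_Init(&argc, &argv);
>       for (int i = 0; i < 100; i++) {   /* placeholder iteration count */
>           MPI_Comm intercomm, merged;
>           int err = MPI_Comm_spawn("child", MPI_ARGV_NULL, 1,
>                                    MPI_INFO_NULL, 0, MPI_COMM_WORLD,
>                                    &intercomm, MPI_ERRCODES_IGNORE);
>           printf("parent: MPI_Comm_spawn #%d return : %d\n", i, err);
>           MPI_Intercomm_merge(intercomm, 0, &merged);  /* parents rank low */
>           MPI_Comm_free(&merged);
>           MPI_Comm_disconnect(&intercomm);  /* should release the child job */
>       }
>       MPI_Finalize();
>       return 0;
>   }
>
> and the corresponding child side:
>
>   /* child.c -- connect back to the parent, merge, then disconnect */
>   #include <mpi.h>
>
>   int main(int argc, char **argv)
>   {
>       MPI_Comm parent, merged;
>       MPI_Init(&argc, &argv);
>       MPI_Comm_get_parent(&parent);
>       MPI_Intercomm_merge(parent, 1, &merged);  /* children rank high */
>       MPI_Comm_free(&merged);
>       MPI_Comm_disconnect(&parent);
>       MPI_Finalize();
>       return 0;
>   }
>
> My understanding is that MPI_Comm_disconnect on both sides should let the
> runtime reclaim the child job's resources before the next spawn, but in
> practice the "too many retries" error shows up after a few iterations
> anyway.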
>
> Is this PR (https://github.com/open-mpi/ompi/pull/1473) about the same
> issue I am having (i.e., the loop_spawn example not working)? If so, do
> you know whether we could downgrade to, e.g., 1.10.1 or another version?
> Or is there another way to work around this bug until a new release is
> out (or is one coming shortly that fixes it)?
>
> Below is the output of the loop_spawn test on our university's cluster. I
> know very little about the cluster's architecture, but the group that
> manages it is very good, and I can get more information if it would be
> helpful.
>
> Thanks for your time.
>
> Jason
>
> mpiexec -np 5 loop_spawn
> parent*******************************
> parent: Launching MPI*
> parent*******************************
> parent: Launching MPI*
> parent*******************************
> parent: Launching MPI*
> parent*******************************
> parent: Launching MPI*
> parent*******************************
> parent: Launching MPI*
> parent: MPI_Comm_spawn #0 return : 0
> parent: MPI_Comm_spawn #0 return : 0
> parent: MPI_Comm_spawn #0 return : 0
> parent: MPI_Comm_spawn #0 return : 0
> parent: MPI_Comm_spawn #0 return : 0
> Child: launch
> Child merged rank = 5, size = 6
> parent: MPI_Comm_spawn #0 rank 4, size 6
> parent: MPI_Comm_spawn #0 rank 0, size 6
> parent: MPI_Comm_spawn #0 rank 2, size 6
> parent: MPI_Comm_spawn #0 rank 3, size 6
> parent: MPI_Comm_spawn #0 rank 1, size 6
> Child 329941: exiting
> parent: MPI_Comm_spawn #1 return : 0
> parent: MPI_Comm_spawn #1 return : 0
> parent: MPI_Comm_spawn #1 return : 0
> parent: MPI_Comm_spawn #1 return : 0
> parent: MPI_Comm_spawn #1 return : 0
> Child: launch
> parent: MPI_Comm_spawn #1 rank 0, size 6
> parent: MPI_Comm_spawn #1 rank 2, size 6
> parent: MPI_Comm_spawn #1 rank 1, size 6
> parent: MPI_Comm_spawn #1 rank 3, size 6
> Child merged rank = 5, size = 6
> parent: MPI_Comm_spawn #1 rank 4, size 6
> Child 329945: exiting
> parent: MPI_Comm_spawn #2 return : 0
> parent: MPI_Comm_spawn #2 return : 0
> parent: MPI_Comm_spawn #2 return : 0
> parent: MPI_Comm_spawn #2 return : 0
> parent: MPI_Comm_spawn #2 return : 0
> Child: launch
> parent: MPI_Comm_spawn #2 rank 3, size 6
> parent: MPI_Comm_spawn #2 rank 0, size 6
> parent: MPI_Comm_spawn #2 rank 2, size 6
> Child merged rank = 5, size = 6
> parent: MPI_Comm_spawn #2 rank 1, size 6
> parent: MPI_Comm_spawn #2 rank 4, size 6
> Child 329949: exiting
> parent: MPI_Comm_spawn #3 return : 0
> parent: MPI_Comm_spawn #3 return : 0
> parent: MPI_Comm_spawn #3 return : 0
> parent: MPI_Comm_spawn #3 return : 0
> parent: MPI_Comm_spawn #3 return : 0
> Child: launch
> [node:port?] too many retries sending message to <addr>, giving up
> -------------------------------------------------------
> Child job 5 terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> --------------------------------------------------------------------------
> mpiexec detected that one or more processes exited with non-zero status, thus causing
> the job to be terminated. The first process to do so was:
>
>   Process name: [[...],0]
>   Exit code:    255
> --------------------------------------------------------------------------