Ralph,

The problem *does* go away if I add "-mca btl tcp,sm,self" to the
mpiexec command line. (By the way, I am using mpiexec rather than
mpirun; do you recommend one over the other?)

Can you tell me what this means for me? For example, should I always
append these arguments to mpiexec for my non-test jobs as well?

Unfortunately, I do not know what you mean by "fabric", but I can give
you some system information (see the end of this email). I am not a
system admin, so I do not have sudo rights. Just let me know if there
is something more specific I can tell you, and I will get it.
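For reference, my runs with those arguments look like the line below
(the application name and process count are placeholders for my real
job):

  mpiexec -mca btl tcp,sm,self -np 20 ./my_app

As I understand it, exporting the matching MCA environment variable
before the run should be equivalent:

  export OMPI_MCA_btl=tcp,sm,self
  mpiexec -np 20 ./my_app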
Nathan,

Thank you for your response. Unfortunately, I have no idea what that
means :( I can forward it to our cluster managers, but I do not know
whether it is enough information for them to understand what they might
need to do to help me with this issue.

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                20
On-line CPU(s) list:   0-19
Thread(s) per core:    1
Core(s) per socket:    10
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Stepping:              2
CPU MHz:               2594.159
BogoMIPS:              5187.59
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19

Thanks,
Jason

Jason Maldonis
Research Assistant of Professor Paul Voyles
Materials Science Grad Student
University of Wisconsin, Madison
1509 University Ave, Rm M142
Madison, WI 53706
maldo...@wisc.edu
608-295-5532

On Tue, Jun 14, 2016 at 1:27 PM, Nathan Hjelm <hje...@me.com> wrote:

> That message is coming from udcm in the openib btl. It indicates some
> sort of failure in the connection mechanism. It can happen if the
> listening thread no longer exists or is taking too long to process
> messages.
>
> -Nathan
>
> On Jun 14, 2016, at 12:20 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
> Hmm… I'm unable to replicate the problem on my machines. What fabric
> are you using? Does the problem go away if you add "-mca btl
> tcp,sm,self" to the mpirun command line?
>
> On Jun 14, 2016, at 11:15 AM, Jason Maldonis <maldo...@wisc.edu> wrote:
>
> Hi Ralph et al.,
>
> Great, thank you for the help. I downloaded the MPI loop_spawn test
> directly from what I think is the master repo on GitHub:
> https://github.com/open-mpi/ompi/blob/master/orte/test/mpi/loop_spawn.c
> I am still using the Open MPI code from 1.10.2, however.
>
> Is that test updated with the correct code? If so, I am still getting
> the same "too many retries sending message to 0x0184:0x00001d27,
> giving up" errors. I also just downloaded the June 14 nightly tarball
> (7.79 MB) from https://www.open-mpi.org/nightly/v2.x/ and I get the
> same error.
>
> Could you please point me to the correct code?
>
> If you need me to provide more information, please let me know.
>
> Thank you,
> Jason
>
> On Tue, Jun 14, 2016 at 10:59 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> I dug into this a bit (with some help from others) and found that the
>> spawn code appears to be working correctly - it is the test in
>> orte/test that is wrong. The test has been correctly updated in the
>> 2.x and master repos, but we failed to backport it to the 1.10
>> series. I have done so this morning, and it will be in the upcoming
>> 1.10.3 release (out very soon).
>>
>> On Jun 13, 2016, at 3:53 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> No, that PR has nothing to do with loop_spawn. I'll try to take a
>> look at the problem.
>>
>> On Jun 13, 2016, at 3:47 PM, Jason Maldonis <maldo...@wisc.edu> wrote:
>>
>> Hello,
>>
>> I am using Open MPI 1.10.2 compiled with the Intel compilers. I am
>> trying to get the spawn functionality to work inside a for loop, but
>> I continue to get the error "too many retries sending message to
>> <addr>, giving up" somewhere down the line in the for loop, seemingly
>> because the processors are not being fully freed when
>> disconnecting/finishing. I found the orte/test/mpi/loop_spawn.c
>> example/test
>> (https://github.com/open-mpi/ompi/blob/master/orte/test/mpi/loop_spawn.c),
>> and it has the exact same problem. I also found this mailing list
>> post from about a month and a half ago:
>> https://www.open-mpi.org/community/lists/devel/2016/04/18814.php
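>>
>> In case it is useful, my loop follows essentially the same pattern as
>> the loop_spawn test. Here is a simplified sketch (the "./worker"
>> executable name and the iteration count are placeholders, and I have
>> trimmed error handling):
>>
>> #include <mpi.h>
>> #include <stdio.h>
>>
>> int main(int argc, char **argv)
>> {
>>     MPI_Comm child, merged;
>>     int i, rank, size;
>>
>>     MPI_Init(&argc, &argv);
>>     for (i = 0; i < 1000; i++) {
>>         /* all parent ranks participate in the collective spawn */
>>         MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
>>                        0, MPI_COMM_WORLD, &child, MPI_ERRCODES_IGNORE);
>>         /* merge the intercommunicator so parents + child share ranks */
>>         MPI_Intercomm_merge(child, 0, &merged);
>>         MPI_Comm_rank(merged, &rank);
>>         MPI_Comm_size(merged, &size);
>>         printf("parent: spawn #%d rank %d, size %d\n", i, rank, size);
>>         /* disconnecting should release the children each iteration */
>>         MPI_Comm_free(&merged);
>>         MPI_Comm_disconnect(&child);
>>     }
>>     MPI_Finalize();
>>     return 0;
>> }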
>>
>> Is this PR (https://github.com/open-mpi/ompi/pull/1473) about the
>> same issue I am having (i.e., the loop_spawn example not working)? If
>> so, do you know whether we can downgrade to e.g. 1.10.1 or another
>> version? Or is there another way to work around this bug until you
>> get a new release out (or is one coming shortly that fixes this,
>> maybe)?
>>
>> Below is the output of the loop_spawn test on our university's
>> cluster. I know very little about the cluster in terms of
>> architecture, but I can get information if it would be helpful; the
>> large group of people who manage this cluster are very good.
>>
>> Thanks for your time.
>>
>> Jason
>>
>> mpiexec -np 5 loop_spawn
>> parent*******************************
>> parent: Launching MPI*
>> parent*******************************
>> parent: Launching MPI*
>> parent*******************************
>> parent: Launching MPI*
>> parent*******************************
>> parent: Launching MPI*
>> parent*******************************
>> parent: Launching MPI*
>> parent: MPI_Comm_spawn #0 return : 0
>> parent: MPI_Comm_spawn #0 return : 0
>> parent: MPI_Comm_spawn #0 return : 0
>> parent: MPI_Comm_spawn #0 return : 0
>> parent: MPI_Comm_spawn #0 return : 0
>> Child: launch
>> Child merged rank = 5, size = 6
>> parent: MPI_Comm_spawn #0 rank 4, size 6
>> parent: MPI_Comm_spawn #0 rank 0, size 6
>> parent: MPI_Comm_spawn #0 rank 2, size 6
>> parent: MPI_Comm_spawn #0 rank 3, size 6
>> parent: MPI_Comm_spawn #0 rank 1, size 6
>> Child 329941: exiting
>> parent: MPI_Comm_spawn #1 return : 0
>> parent: MPI_Comm_spawn #1 return : 0
>> parent: MPI_Comm_spawn #1 return : 0
>> parent: MPI_Comm_spawn #1 return : 0
>> parent: MPI_Comm_spawn #1 return : 0
>> Child: launch
>> parent: MPI_Comm_spawn #1 rank 0, size 6
>> parent: MPI_Comm_spawn #1 rank 2, size 6
>> parent: MPI_Comm_spawn #1 rank 1, size 6
>> parent: MPI_Comm_spawn #1 rank 3, size 6
>> Child merged rank = 5, size = 6
>> parent: MPI_Comm_spawn #1 rank 4, size 6
>> Child 329945: exiting
>> parent: MPI_Comm_spawn #2 return : 0
>> parent: MPI_Comm_spawn #2 return : 0
>> parent: MPI_Comm_spawn #2 return : 0
>> parent: MPI_Comm_spawn #2 return : 0
>> parent: MPI_Comm_spawn #2 return : 0
>> Child: launch
>> parent: MPI_Comm_spawn #2 rank 3, size 6
>> parent: MPI_Comm_spawn #2 rank 0, size 6
>> parent: MPI_Comm_spawn #2 rank 2, size 6
>> Child merged rank = 5, size = 6
>> parent: MPI_Comm_spawn #2 rank 1, size 6
>> parent: MPI_Comm_spawn #2 rank 4, size 6
>> Child 329949: exiting
>> parent: MPI_Comm_spawn #3 return : 0
>> parent: MPI_Comm_spawn #3 return : 0
>> parent: MPI_Comm_spawn #3 return : 0
>> parent: MPI_Comm_spawn #3 return : 0
>> parent: MPI_Comm_spawn #3 return : 0
>> Child: launch
>> [node:port?] too many retries sending message to <addr>, giving up
>> -------------------------------------------------------
>> Child job 5 terminated normally, but 1 process returned
>> a non-zero exit code.
>> Per user-direction, the job has been aborted.
>> -------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpiexec detected that one or more processes exited with non-zero
>> status, thus causing the job to be terminated. The first process to
>> do so was:
>>
>>   Process name: [[...],0]
>>   Exit code:    255
>> --------------------------------------------------------------------------