You don’t want to use those options all the time, as your performance will take a hit: trading InfiniBand for TCP is not a good deal. Sadly, this is something we need someone like Nathan to address, as it is a bug in the code base, and in an area I’m not familiar with.

For now, just use TCP so you can move forward.
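For example, assuming your application binary is named my_app (just a sketch; substitute your own command line):

  mpiexec -mca btl tcp,sm,self -np 5 ./my_app

As for mpiexec vs. mpirun: in Open MPI they are the same program, so use whichever you prefer.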
> On Jun 14, 2016, at 2:14 PM, Jason Maldonis <maldo...@wisc.edu> wrote:
>
> Ralph,
>
> The problem *does* go away if I add "-mca btl tcp,sm,self" to the mpiexec
> cmd line. (By the way, I am using mpiexec rather than mpirun; do you
> recommend one over the other?) Will you tell me what this means for me? For
> example, should I always append these arguments to mpiexec for my non-test
> jobs as well? Unfortunately I do not know what you mean by "fabric," but I
> can give you some system information (see the end of this email). I am not
> a system admin, so I do not have sudo rights, but just let me know if I can
> tell you something more specific and I will get it.
>
> Nathan,
>
> Thank you for your response. Unfortunately I have no idea what that means
> :( I can forward it to our cluster managers, but I do not know if it is
> enough information for them to understand what they might need to do to
> help me with this issue.
>
> $ lscpu
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                20
> On-line CPU(s) list:   0-19
> Thread(s) per core:    1
> Core(s) per socket:    10
> Socket(s):             2
> NUMA node(s):          2
> Vendor ID:             GenuineIntel
> CPU family:            6
> Model:                 63
> Stepping:              2
> CPU MHz:               2594.159
> BogoMIPS:              5187.59
> Virtualization:        VT-x
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              256K
> L3 cache:              25600K
> NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18
> NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19
>
> Thanks,
> Jason
>
> Jason Maldonis
> Research Assistant of Professor Paul Voyles
> Materials Science Grad Student
> University of Wisconsin, Madison
> 1509 University Ave, Rm M142
> Madison, WI 53706
> maldo...@wisc.edu
> 608-295-5532
>
> On Tue, Jun 14, 2016 at 1:27 PM, Nathan Hjelm <hje...@me.com> wrote:
>
> That message is coming from udcm in the openib btl. It indicates some sort
> of failure in the connection mechanism. It can happen if the listening
> thread no longer exists or is taking too long to process messages.
>
> -Nathan
>
>> On Jun 14, 2016, at 12:20 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> Hmm… I’m unable to replicate the problem on my machines. What fabric are
>> you using? Does the problem go away if you add “-mca btl tcp,sm,self” to
>> the mpirun cmd line?
>>
>>> On Jun 14, 2016, at 11:15 AM, Jason Maldonis <maldo...@wisc.edu> wrote:
>>>
>>> Hi Ralph et al.,
>>>
>>> Great, thank you for the help. I downloaded the MPI loop_spawn test
>>> directly from what I think is the master repo on GitHub:
>>> https://github.com/open-mpi/ompi/blob/master/orte/test/mpi/loop_spawn.c
>>> I am still using the MPI code from 1.10.2, however.
>>>
>>> Is that test updated with the correct code? If so, I am still getting the
>>> same "too many retries sending message to 0x0184:0x00001d27, giving up"
>>> errors. I also just downloaded the June 14 nightly tarball (7.79 MB) from
>>> https://www.open-mpi.org/nightly/v2.x/ and I get the same error.
>>>
>>> Could you please point me to the correct code?
>>>
>>> If you need me to provide more information, please let me know.
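>>>
>>> (For reference, I built the test with something like "mpicc loop_spawn.c
>>> -o loop_spawn"; the exact compiler wrapper on our cluster may differ.)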
>>> Thank you,
>>> Jason
>>>
>>> Jason Maldonis
>>> Research Assistant of Professor Paul Voyles
>>> Materials Science Grad Student
>>> University of Wisconsin, Madison
>>> 1509 University Ave, Rm M142
>>> Madison, WI 53706
>>> maldo...@wisc.edu
>>> 608-295-5532
>>>
>>> On Tue, Jun 14, 2016 at 10:59 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>> I dug into this a bit (with some help from others) and found that the
>>> spawn code appears to be working correctly; it is the test in orte/test
>>> that is wrong. The test has been correctly updated in the 2.x and master
>>> repos, but we failed to backport it to the 1.10 series. I have done so
>>> this morning, and it will be in the upcoming 1.10.3 release (out very
>>> soon).
>>>
>>>> On Jun 13, 2016, at 3:53 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>
>>>> No, that PR has nothing to do with loop_spawn. I’ll try to take a look
>>>> at the problem.
>>>>
>>>>> On Jun 13, 2016, at 3:47 PM, Jason Maldonis <maldo...@wisc.edu> wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> I am using Open MPI 1.10.2 compiled with Intel. I am trying to get the
>>>>> spawn functionality to work inside a for loop, but I keep getting the
>>>>> error "too many retries sending message to <addr>, giving up" somewhere
>>>>> down the line in the for loop, seemingly because the spawned processes
>>>>> are not being fully freed when disconnecting/finishing. I found the
>>>>> orte/test/mpi/loop_spawn.c example/test, and it has the exact same
>>>>> problem.
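>>>>>
>>>>> The pattern I'm using looks roughly like this (a minimal sketch, not my
>>>>> actual code; it assumes the same binary respawns itself, and the parent
>>>>> merges and disconnects the intercommunicator every iteration):
>>>>>
>>>>> #include <mpi.h>
>>>>>
>>>>> int main(int argc, char **argv)
>>>>> {
>>>>>     MPI_Comm parent, child, merged;
>>>>>     int i;
>>>>>
>>>>>     MPI_Init(&argc, &argv);
>>>>>     MPI_Comm_get_parent(&parent);
>>>>>
>>>>>     if (parent == MPI_COMM_NULL) {       /* parent side */
>>>>>         for (i = 0; i < 1000; i++) {
>>>>>             /* respawn this same binary as a single child process */
>>>>>             MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 1, MPI_INFO_NULL,
>>>>>                            0, MPI_COMM_WORLD, &child,
>>>>>                            MPI_ERRCODES_IGNORE);
>>>>>             MPI_Intercomm_merge(child, 0, &merged);
>>>>>             MPI_Comm_free(&merged);
>>>>>             /* child resources should be released here */
>>>>>             MPI_Comm_disconnect(&child);
>>>>>         }
>>>>>     } else {                             /* spawned child side */
>>>>>         MPI_Intercomm_merge(parent, 1, &merged);
>>>>>         MPI_Comm_free(&merged);
>>>>>         MPI_Comm_disconnect(&parent);
>>>>>     }
>>>>>
>>>>>     MPI_Finalize();
>>>>>     return 0;
>>>>> }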
>>>>> I also found this mailing list post from about a month and a half ago:
>>>>> https://www.open-mpi.org/community/lists/devel/2016/04/18814.php
>>>>>
>>>>> Is this PR (https://github.com/open-mpi/ompi/pull/1473) about the same
>>>>> issue I am having (i.e., the loop_spawn example not working)? If so, do
>>>>> you know if we can downgrade to, e.g., 1.10.1 or another version? Or is
>>>>> there another way to work around this bug until you get a new release
>>>>> out (or is one coming shortly to fix this, maybe)?
>>>>>
>>>>> Below is the output of the loop_spawn test on our university's cluster.
>>>>> I know very little about its architecture, but I can get information if
>>>>> that would be helpful. The large group of people who manage this
>>>>> cluster are very good.
>>>>>
>>>>> Thanks for your time.
>>>>>
>>>>> Jason
>>>>>
>>>>> $ mpiexec -np 5 loop_spawn
>>>>> parent*******************************
>>>>> parent: Launching MPI*
>>>>> parent*******************************
>>>>> parent: Launching MPI*
>>>>> parent*******************************
>>>>> parent: Launching MPI*
>>>>> parent*******************************
>>>>> parent: Launching MPI*
>>>>> parent*******************************
>>>>> parent: Launching MPI*
>>>>> parent: MPI_Comm_spawn #0 return : 0
>>>>> parent: MPI_Comm_spawn #0 return : 0
>>>>> parent: MPI_Comm_spawn #0 return : 0
>>>>> parent: MPI_Comm_spawn #0 return : 0
>>>>> parent: MPI_Comm_spawn #0 return : 0
>>>>> Child: launch
>>>>> Child merged rank = 5, size = 6
>>>>> parent: MPI_Comm_spawn #0 rank 4, size 6
>>>>> parent: MPI_Comm_spawn #0 rank 0, size 6
>>>>> parent: MPI_Comm_spawn #0 rank 2, size 6
>>>>> parent: MPI_Comm_spawn #0 rank 3, size 6
>>>>> parent: MPI_Comm_spawn #0 rank 1, size 6
>>>>> Child 329941: exiting
>>>>> parent: MPI_Comm_spawn #1 return : 0
>>>>> parent: MPI_Comm_spawn #1 return : 0
>>>>> parent: MPI_Comm_spawn #1 return : 0
>>>>> parent: MPI_Comm_spawn #1 return : 0
>>>>> parent: MPI_Comm_spawn #1 return : 0
>>>>> Child: launch
>>>>> parent: MPI_Comm_spawn #1 rank 0, size 6
>>>>> parent: MPI_Comm_spawn #1 rank 2, size 6
>>>>> parent: MPI_Comm_spawn #1 rank 1, size 6
>>>>> parent: MPI_Comm_spawn #1 rank 3, size 6
>>>>> Child merged rank = 5, size = 6
>>>>> parent: MPI_Comm_spawn #1 rank 4, size 6
>>>>> Child 329945: exiting
>>>>> parent: MPI_Comm_spawn #2 return : 0
>>>>> parent: MPI_Comm_spawn #2 return : 0
>>>>> parent: MPI_Comm_spawn #2 return : 0
>>>>> parent: MPI_Comm_spawn #2 return : 0
>>>>> parent: MPI_Comm_spawn #2 return : 0
>>>>> Child: launch
>>>>> parent: MPI_Comm_spawn #2 rank 3, size 6
>>>>> parent: MPI_Comm_spawn #2 rank 0, size 6
>>>>> parent: MPI_Comm_spawn #2 rank 2, size 6
>>>>> Child merged rank = 5, size = 6
>>>>> parent: MPI_Comm_spawn #2 rank 1, size 6
>>>>> parent: MPI_Comm_spawn #2 rank 4, size 6
>>>>> Child 329949: exiting
>>>>> parent: MPI_Comm_spawn #3 return : 0
>>>>> parent: MPI_Comm_spawn #3 return : 0
>>>>> parent: MPI_Comm_spawn #3 return : 0
>>>>> parent: MPI_Comm_spawn #3 return : 0
>>>>> parent: MPI_Comm_spawn #3 return : 0
>>>>> Child: launch
>>>>> [node:port?] too many retries sending message to <addr>, giving up
>>>>> -------------------------------------------------------
>>>>> Child job 5 terminated normally, but 1 process returned
>>>>> a non-zero exit code. Per user-direction, the job has been aborted.
>>>>> -------------------------------------------------------
>>>>> --------------------------------------------------------------------------
>>>>> mpiexec detected that one or more processes exited with non-zero status,
>>>>> thus causing the job to be terminated.
>>>>> The first process to do so was:
>>>>>
>>>>>   Process name: [[...],0]
>>>>>   Exit code:    255
>>>>> --------------------------------------------------------------------------