You don’t want to use those options all the time, as your performance will take a hit: trading InfiniBand for TCP is not a good deal. Sadly, this is something we need someone like Nathan to address, as it is a bug in the code base, and in an area I’m not familiar with.

For now, just use TCP so you can move forward.
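For example, assuming your application binary is named my_app (just a sketch; substitute your own command line):

  mpiexec -mca btl tcp,sm,self -np 5 ./my_app

As for mpiexec vs. mpirun: in Open MPI they are the same program, so use whichever you prefer.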
> On Jun 14, 2016, at 2:14 PM, Jason Maldonis <maldo...@wisc.edu> wrote:
>
> Ralph,
>
> The problem *does* go away if I add "-mca btl tcp,sm,self" to the mpiexec
> cmd line. (By the way, I am using mpiexec rather than mpirun; do you
> recommend one over the other?) Will you tell me what this means for me? For
> example, should I always append these arguments to mpiexec for my non-test
> jobs as well? Unfortunately I do not know what you mean by "fabric," but I
> can give you some system information (see the end of this email). I am not
> a system admin, so I do not have sudo rights, but just let me know if I can
> tell you something more specific and I will get it.
>
> Nathan,
>
> Thank you for your response. Unfortunately I have no idea what that means
> :( I can forward it to our cluster managers, but I do not know if it is
> enough information for them to understand what they might need to do to
> help me with this issue.
>
> $ lscpu
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                20
> On-line CPU(s) list:   0-19
> Thread(s) per core:    1
> Core(s) per socket:    10
> Socket(s):             2
> NUMA node(s):          2
> Vendor ID:             GenuineIntel
> CPU family:            6
> Model:                 63
> Stepping:              2
> CPU MHz:               2594.159
> BogoMIPS:              5187.59
> Virtualization:        VT-x
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              256K
> L3 cache:              25600K
> NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18
> NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19
>
> Thanks,
> Jason
>
> Jason Maldonis
> Research Assistant of Professor Paul Voyles
> Materials Science Grad Student
> University of Wisconsin, Madison
> 1509 University Ave, Rm M142
> Madison, WI 53706
> maldo...@wisc.edu
> 608-295-5532
>
> On Tue, Jun 14, 2016 at 1:27 PM, Nathan Hjelm <hje...@me.com> wrote:
>
> That message is coming from udcm in the openib btl. It indicates some sort
> of failure in the connection mechanism. It can happen if the listening
> thread no longer exists or is taking too long to process messages.
>
> -Nathan
>
>> On Jun 14, 2016, at 12:20 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> Hmm… I’m unable to replicate the problem on my machines. What fabric are
>> you using? Does the problem go away if you add “-mca btl tcp,sm,self” to
>> the mpirun cmd line?
>>
>>> On Jun 14, 2016, at 11:15 AM, Jason Maldonis <maldo...@wisc.edu> wrote:
>>>
>>> Hi Ralph et al.,
>>>
>>> Great, thank you for the help. I downloaded the MPI loop_spawn test
>>> directly from what I think is the master repo on GitHub:
>>> https://github.com/open-mpi/ompi/blob/master/orte/test/mpi/loop_spawn.c
>>> I am still using the MPI code from 1.10.2, however.
>>>
>>> Is that test updated with the correct code? If so, I am still getting the
>>> same "too many retries sending message to 0x0184:0x00001d27, giving up"
>>> errors. I also just downloaded the June 14 nightly tarball (7.79 MB) from
>>> https://www.open-mpi.org/nightly/v2.x/ and I get the same error.
>>>
>>> Could you please point me to the correct code?
>>>
>>> If you need me to provide more information, please let me know.
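>>>
>>> (For reference, I built the test with something like "mpicc loop_spawn.c
>>> -o loop_spawn"; the exact compiler wrapper on our cluster may differ.)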
>>> Thank you,
>>> Jason
>>>
>>> Jason Maldonis
>>> Research Assistant of Professor Paul Voyles
>>> Materials Science Grad Student
>>> University of Wisconsin, Madison
>>> 1509 University Ave, Rm M142
>>> Madison, WI 53706
>>> maldo...@wisc.edu
>>> 608-295-5532
>>>
>>> On Tue, Jun 14, 2016 at 10:59 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>> I dug into this a bit (with some help from others) and found that the
>>> spawn code appears to be working correctly; it is the test in orte/test
>>> that is wrong. The test has been correctly updated in the 2.x and master
>>> repos, but we failed to backport it to the 1.10 series. I have done so
>>> this morning, and it will be in the upcoming 1.10.3 release (out very
>>> soon).
>>>
>>>> On Jun 13, 2016, at 3:53 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>
>>>> No, that PR has nothing to do with loop_spawn. I’ll try to take a look
>>>> at the problem.
>>>>
>>>>> On Jun 13, 2016, at 3:47 PM, Jason Maldonis <maldo...@wisc.edu> wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> I am using Open MPI 1.10.2 compiled with Intel. I am trying to get the
>>>>> spawn functionality to work inside a for loop, but I keep getting the
>>>>> error "too many retries sending message to <addr>, giving up" somewhere
>>>>> down the line in the for loop, seemingly because the spawned processes
>>>>> are not being fully freed when disconnecting/finishing. I found the
>>>>> orte/test/mpi/loop_spawn.c example/test, and it has the exact same
>>>>> problem.
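>>>>>
>>>>> The pattern I'm using looks roughly like this (a minimal sketch, not my
>>>>> actual code; it assumes the same binary respawns itself, and the parent
>>>>> merges and disconnects the intercommunicator every iteration):
>>>>>
>>>>> #include <mpi.h>
>>>>>
>>>>> int main(int argc, char **argv)
>>>>> {
>>>>>     MPI_Comm parent, child, merged;
>>>>>     int i;
>>>>>
>>>>>     MPI_Init(&argc, &argv);
>>>>>     MPI_Comm_get_parent(&parent);
>>>>>
>>>>>     if (parent == MPI_COMM_NULL) {       /* parent side */
>>>>>         for (i = 0; i < 1000; i++) {
>>>>>             /* respawn this same binary as a single child process */
>>>>>             MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 1, MPI_INFO_NULL,
>>>>>                            0, MPI_COMM_WORLD, &child,
>>>>>                            MPI_ERRCODES_IGNORE);
>>>>>             MPI_Intercomm_merge(child, 0, &merged);
>>>>>             MPI_Comm_free(&merged);
>>>>>             /* child resources should be released here */
>>>>>             MPI_Comm_disconnect(&child);
>>>>>         }
>>>>>     } else {                             /* spawned child side */
>>>>>         MPI_Intercomm_merge(parent, 1, &merged);
>>>>>         MPI_Comm_free(&merged);
>>>>>         MPI_Comm_disconnect(&parent);
>>>>>     }
>>>>>
>>>>>     MPI_Finalize();
>>>>>     return 0;
>>>>> }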
>>>>> I also found this mailing list post from about a month and a half ago:
>>>>> https://www.open-mpi.org/community/lists/devel/2016/04/18814.php
>>>>>
>>>>> Is this PR (https://github.com/open-mpi/ompi/pull/1473) about the same
>>>>> issue I am having (i.e., the loop_spawn example not working)? If so, do
>>>>> you know if we can downgrade to, e.g., 1.10.1 or another version? Or is
>>>>> there another way to work around this bug until you get a new release
>>>>> out (or is one coming shortly to fix this, maybe)?
>>>>>
>>>>> Below is the output of the loop_spawn test on our university's cluster.
>>>>> I know very little about its architecture, but I can get information if
>>>>> that would be helpful. The large group of people who manage this
>>>>> cluster are very good.
>>>>>
>>>>> Thanks for your time.
>>>>>
>>>>> Jason
>>>>>
>>>>> $ mpiexec -np 5 loop_spawn
>>>>> parent*******************************
>>>>> parent: Launching MPI*
>>>>> parent*******************************
>>>>> parent: Launching MPI*
>>>>> parent*******************************
>>>>> parent: Launching MPI*
>>>>> parent*******************************
>>>>> parent: Launching MPI*
>>>>> parent*******************************
>>>>> parent: Launching MPI*
>>>>> parent: MPI_Comm_spawn #0 return : 0
>>>>> parent: MPI_Comm_spawn #0 return : 0
>>>>> parent: MPI_Comm_spawn #0 return : 0
>>>>> parent: MPI_Comm_spawn #0 return : 0
>>>>> parent: MPI_Comm_spawn #0 return : 0
>>>>> Child: launch
>>>>> Child merged rank = 5, size = 6
>>>>> parent: MPI_Comm_spawn #0 rank 4, size 6
>>>>> parent: MPI_Comm_spawn #0 rank 0, size 6
>>>>> parent: MPI_Comm_spawn #0 rank 2, size 6
>>>>> parent: MPI_Comm_spawn #0 rank 3, size 6
>>>>> parent: MPI_Comm_spawn #0 rank 1, size 6
>>>>> Child 329941: exiting
>>>>> parent: MPI_Comm_spawn #1 return : 0
>>>>> parent: MPI_Comm_spawn #1 return : 0
>>>>> parent: MPI_Comm_spawn #1 return : 0
>>>>> parent: MPI_Comm_spawn #1 return : 0
>>>>> parent: MPI_Comm_spawn #1 return : 0
>>>>> Child: launch
>>>>> parent: MPI_Comm_spawn #1 rank 0, size 6
>>>>> parent: MPI_Comm_spawn #1 rank 2, size 6
>>>>> parent: MPI_Comm_spawn #1 rank 1, size 6
>>>>> parent: MPI_Comm_spawn #1 rank 3, size 6
>>>>> Child merged rank = 5, size = 6
>>>>> parent: MPI_Comm_spawn #1 rank 4, size 6
>>>>> Child 329945: exiting
>>>>> parent: MPI_Comm_spawn #2 return : 0
>>>>> parent: MPI_Comm_spawn #2 return : 0
>>>>> parent: MPI_Comm_spawn #2 return : 0
>>>>> parent: MPI_Comm_spawn #2 return : 0
>>>>> parent: MPI_Comm_spawn #2 return : 0
>>>>> Child: launch
>>>>> parent: MPI_Comm_spawn #2 rank 3, size 6
>>>>> parent: MPI_Comm_spawn #2 rank 0, size 6
>>>>> parent: MPI_Comm_spawn #2 rank 2, size 6
>>>>> Child merged rank = 5, size = 6
>>>>> parent: MPI_Comm_spawn #2 rank 1, size 6
>>>>> parent: MPI_Comm_spawn #2 rank 4, size 6
>>>>> Child 329949: exiting
>>>>> parent: MPI_Comm_spawn #3 return : 0
>>>>> parent: MPI_Comm_spawn #3 return : 0
>>>>> parent: MPI_Comm_spawn #3 return : 0
>>>>> parent: MPI_Comm_spawn #3 return : 0
>>>>> parent: MPI_Comm_spawn #3 return : 0
>>>>> Child: launch
>>>>> [node:port?] too many retries sending message to <addr>, giving up
>>>>> -------------------------------------------------------
>>>>> Child job 5 terminated normally, but 1 process returned
>>>>> a non-zero exit code. Per user-direction, the job has been aborted.
>>>>> -------------------------------------------------------
>>>>> --------------------------------------------------------------------------
>>>>> mpiexec detected that one or more processes exited with non-zero status,
>>>>> thus causing the job to be terminated.
>>>>> The first process to do so was:
>>>>>
>>>>>   Process name: [[...],0]
>>>>>   Exit code:    255
>>>>> --------------------------------------------------------------------------