Ralph, the problem *does* go away if I add "-mca btl tcp,sm,self" to the
mpiexec command line. (By the way, I am using mpiexec rather than mpirun;
do you recommend one over the other?) Could you tell me what this means for
me in practice? For example, should I always append these arguments to
mpiexec for my non-test jobs as well? I'm afraid I do not know what you
mean by "fabric", but I can give you some system information (see the end
of this email). Unfortunately I am not a system admin, so I do not have
sudo rights. Just let me know if there is anything more specific I can tell
you and I will get it.
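
In case it is useful, the full command I ran looked roughly like this (I am
reconstructing it from the loop_spawn invocation quoted further down in this
thread, so treat the precise form as approximate):

mpiexec -mca btl tcp,sm,self -np 5 loop_spawn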

Nathan, thank you for your response. Unfortunately I have no idea what that
means :( I can forward it to our cluster managers, but I do not know whether
it is enough information for them to understand what they might need to do
to help me with this issue.

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                20
On-line CPU(s) list:   0-19
Thread(s) per core:    1
Core(s) per socket:    10
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Stepping:              2
CPU MHz:               2594.159
BogoMIPS:              5187.59
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19

Thanks,
Jason

Jason Maldonis
Research Assistant of Professor Paul Voyles
Materials Science Grad Student
University of Wisconsin, Madison
1509 University Ave, Rm M142
Madison, WI 53706
maldo...@wisc.edu
608-295-5532

On Tue, Jun 14, 2016 at 1:27 PM, Nathan Hjelm <hje...@me.com> wrote:

> That message is coming from udcm in the openib btl. It indicates some sort
> of failure in the connection mechanism. It can happen if the listening
> thread no longer exists or is taking too long to process messages.
>
> -Nathan
>
>
> On Jun 14, 2016, at 12:20 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
> Hmm…I’m unable to replicate a problem on my machines. What fabric are you
> using? Does the problem go away if you add “-mca btl tcp,sm,self” to the
> mpirun cmd line?
>
> On Jun 14, 2016, at 11:15 AM, Jason Maldonis <maldo...@wisc.edu> wrote:
> Hi Ralph et al.,
>
> Great, thank you for the help. I downloaded the MPI loop spawn test
> directly from what I think is the master repo on GitHub:
> https://github.com/open-mpi/ompi/blob/master/orte/test/mpi/loop_spawn.c
> I am still running the Open MPI 1.10.2 library itself, however.
>
> Is that test updated with the correct code? If so, I am still getting the
> same "too many retries sending message to 0x0184:0x00001d27, giving up"
> errors. I also just downloaded the June 14 nightly tarball (7.79MB) from:
> https://www.open-mpi.org/nightly/v2.x/ and I get the same error.
>
> Could you please point me to the correct code?
>
> If you need me to provide more information please let me know.
>
> Thank you,
> Jason
>
> On Tue, Jun 14, 2016 at 10:59 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> I dug into this a bit (with some help from others) and found that the
>> spawn code appears to be working correctly - it is the test in orte/test
>> that is wrong. The test has been correctly updated in the 2.x and master
>> repos, but we failed to backport it to the 1.10 series. I have done so this
>> morning, and it will be in the upcoming 1.10.3 release (out very soon).
>>
>>
>> On Jun 13, 2016, at 3:53 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> No, that PR has nothing to do with loop_spawn. I’ll try to take a look at
>> the problem.
>>
>> On Jun 13, 2016, at 3:47 PM, Jason Maldonis <maldo...@wisc.edu> wrote:
>>
>> Hello,
>>
>> I am using Open MPI 1.10.2 compiled with the Intel compilers. I am trying
>> to get the spawn functionality to work inside a for loop, but I keep
>> getting the error "too many retries sending message to <addr>, giving up"
>> somewhere down the line in the for loop, seemingly because resources are
>> not being fully freed when the spawned processes disconnect/finish. I
>> found the orte/test/mpi/loop_spawn.c
>> <https://github.com/open-mpi/ompi/blob/master/orte/test/mpi/loop_spawn.c>
>> example/test, and it has the exact same problem. I also found this
>> <https://www.open-mpi.org/community/lists/devel/2016/04/18814.php> mailing
>> list post from about a month and a half ago.
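>>
>> In case it helps, the pattern I am using (and, as far as I can tell, the
>> pattern in the loop_spawn test) is essentially the simplified sketch
>> below. This is not the actual test source, and "./child" is just a
>> placeholder executable name:
>>
>> #include <mpi.h>
>> #include <stdio.h>
>>
>> int main(int argc, char **argv)
>> {
>>     MPI_Comm child;
>>     int i, err;
>>
>>     MPI_Init(&argc, &argv);
>>     for (i = 0; i < 1000; i++) {
>>         /* Collectively spawn a new child job each iteration. */
>>         err = MPI_Comm_spawn("./child", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
>>                              0, MPI_COMM_WORLD, &child,
>>                              MPI_ERRCODES_IGNORE);
>>         printf("parent: MPI_Comm_spawn #%d return : %d\n", i, err);
>>         /* Disconnect so the child's resources can be released before
>>            the next spawn; this is where things seem not to be freed. */
>>         MPI_Comm_disconnect(&child);
>>     }
>>     MPI_Finalize();
>>     return 0;
>> }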
>>
>> Is this PR (https://github.com/open-mpi/ompi/pull/1473) about the same
>> issue I am having (i.e., the loop_spawn example not working)? If so, do
>> you know whether we could downgrade to, e.g., 1.10.1 or another version?
>> Or is there another way to work around this bug until you get a new
>> release out (or is one coming shortly that fixes this, maybe)?
>>
>> Below is the output of the loop_spawn test on our university's cluster,
>> which I know very little about in terms of architecture but can get
>> information if it's helpful. The large group of people who manage this
>> cluster are very good.
>>
>> Thanks for your time.
>>
>> Jason
>>
>> mpiexec -np 5 loop_spawn
>> parent*******************************
>> parent: Launching MPI*
>> parent*******************************
>> parent: Launching MPI*
>> parent*******************************
>> parent: Launching MPI*
>> parent*******************************
>> parent: Launching MPI*
>> parent*******************************
>> parent: Launching MPI*
>> parent: MPI_Comm_spawn #0 return : 0
>> parent: MPI_Comm_spawn #0 return : 0
>> parent: MPI_Comm_spawn #0 return : 0
>> parent: MPI_Comm_spawn #0 return : 0
>> parent: MPI_Comm_spawn #0 return : 0
>> Child: launch
>> Child merged rank = 5, size = 6
>> parent: MPI_Comm_spawn #0 rank 4, size 6
>> parent: MPI_Comm_spawn #0 rank 0, size 6
>> parent: MPI_Comm_spawn #0 rank 2, size 6
>> parent: MPI_Comm_spawn #0 rank 3, size 6
>> parent: MPI_Comm_spawn #0 rank 1, size 6
>> Child 329941: exiting
>> parent: MPI_Comm_spawn #1 return : 0
>> parent: MPI_Comm_spawn #1 return : 0
>> parent: MPI_Comm_spawn #1 return : 0
>> parent: MPI_Comm_spawn #1 return : 0
>> parent: MPI_Comm_spawn #1 return : 0
>> Child: launch
>> parent: MPI_Comm_spawn #1 rank 0, size 6
>> parent: MPI_Comm_spawn #1 rank 2, size 6
>> parent: MPI_Comm_spawn #1 rank 1, size 6
>> parent: MPI_Comm_spawn #1 rank 3, size 6
>> Child merged rank = 5, size = 6
>> parent: MPI_Comm_spawn #1 rank 4, size 6
>> Child 329945: exiting
>> parent: MPI_Comm_spawn #2 return : 0
>> parent: MPI_Comm_spawn #2 return : 0
>> parent: MPI_Comm_spawn #2 return : 0
>> parent: MPI_Comm_spawn #2 return : 0
>> parent: MPI_Comm_spawn #2 return : 0
>> Child: launch
>> parent: MPI_Comm_spawn #2 rank 3, size 6
>> parent: MPI_Comm_spawn #2 rank 0, size 6
>> parent: MPI_Comm_spawn #2 rank 2, size 6
>> Child merged rank = 5, size = 6
>> parent: MPI_Comm_spawn #2 rank 1, size 6
>> parent: MPI_Comm_spawn #2 rank 4, size 6
>> Child 329949: exiting
>> parent: MPI_Comm_spawn #3 return : 0
>> parent: MPI_Comm_spawn #3 return : 0
>> parent: MPI_Comm_spawn #3 return : 0
>> parent: MPI_Comm_spawn #3 return : 0
>> parent: MPI_Comm_spawn #3 return : 0
>> Child: launch
>> [node:port?] too many retries sending message to <addr>, giving up
>> -------------------------------------------------------
>> Child job 5 terminated normally, but 1 process returned
>> a non-zero exit code.. Per user-direction, the job has been aborted.
>> -------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpiexec detected that one or more processes exited with non-zero status, 
>> thus causing
>> the job to be terminated. The first process to do so was:
>>
>>   Process name: [[...],0]
>>   Exit code:    255
>> --------------------------------------------------------------------------
>>