Thanks Josh,
Then I guess I will solve it internally ☺

Lenny Verkhovsky
SW Engineer,  Mellanox Technologies
www.mellanox.com

Office:    +972 74 712 9244
Mobile:  +972 54 554 0233
Fax:        +972 72 257 9400

From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Joshua Ladd
Sent: Wednesday, August 13, 2014 7:37 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] [OMPI users] OpenMPI fails with np > 65

Ah, I see. That change didn't make it into the release branch (I don't know if 
it was never CMRed or what, I have a vague recollection of it passing through.) 
If you need that change, then I recommend checking out the trunk at r30875. 
This was back when the trunk was in a more stable state.

Best,

Josh

On Wed, Aug 13, 2014 at 9:29 AM, Lenny Verkhovsky <len...@mellanox.com> wrote:
Hi,
I needed the following commit

r30875 | vasily | 2014-02-27 13:29:47 +0200 (Thu, 27 Feb 2014) | 3 lines
OPENIB BTL/CONNECT: Add support for AF_IB addressing in rdmacm.

Following Gilles's mail about the known issue #4857, I updated and can now run 
with more than 65 hosts.
(Thanks, Gilles.)

Since I am facing another problem, I probably should try 1.8rc as you suggested.
Thanks.
Lenny Verkhovsky

From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Joshua Ladd
Sent: Wednesday, August 13, 2014 4:20 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] [OMPI users] OpenMPI fails with np > 65

Lenny,
Is there any particular reason that you're using the trunk? I ask because the 
trunk is in an unusually high state of flux at the moment, with a major move 
underway. If you're trying to use OMPI for production-grade runs, I would 
strongly advise picking up one of the stable releases in the 1.8.x series. At 
this time, 1.8.1 is the most current stable release. The 1.8.2rc3 release 
candidate is also available:

http://www.open-mpi.org/software/ompi/v1.8/
Best,
Josh



On Wed, Aug 13, 2014 at 5:19 AM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
Lenny,

that looks related to #4857 which has been fixed in trunk since r32517

could you please update your openmpi library and try again ?

Gilles


On 2014/08/13 17:00, Lenny Verkhovsky wrote:

Following Jeff's suggestion, I am adding the devel mailing list.



Hi All,

I am currently facing a strange situation: I can't run OMPI on more than 65 
nodes.

It seems to be an environmental issue that prevents me from opening more 
connections.

Any ideas?

The log is attached; more info is below in this mail.



Running OMPI from trunk

[node-119.ssauniversal.ssa.kodiak.nx:02996] [[56978,0],65] ORTE_ERROR_LOG: Error in file base/ess_base_std_orted.c at line 288



Thanks.

Lenny Verkhovsky


From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lenny Verkhovsky
Sent: Tuesday, August 12, 2014 1:13 PM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI fails with np > 65





Hi,



Config:

./configure --enable-openib-rdmacm-ibaddr --prefix /home/sources/ompi-bin 
--enable-mpirun-prefix-by-default --with-openib=/usr/local --enable-debug 
--disable-openib-connectx-xrc



Run:

/home/sources/ompi-bin/bin/mpirun -np 65 --host 
ko0067,ko0069,ko0070,ko0074,ko0076,ko0079,ko0080,ko0082,ko0085,ko0087,ko0088,ko0090,ko0096,ko0098,ko0099,ko0101,ko0103,ko0107,ko0111,ko0114,ko0116,ko0125,ko0128,ko0134,ko0141,ko0144,ko0145,ko0148,ko0149,ko0150,ko0152,ko0154,ko0156,ko0157,ko0158,ko0162,ko0164,ko0166,ko0168,ko0170,ko0174,ko0178,ko0181,ko0185,ko0190,ko0192,ko0195,ko0197,ko0200,ko0203,ko0205,ko0207,ko0209,ko0210,ko0211,ko0213,ko0214,ko0217,ko0218,ko0223,ko0228,ko0229,ko0231,ko0235,ko0237
 --mca btl openib,self  --mca btl_openib_cpc_include rdmacm --mca pml ob1 --mca 
btl_openib_if_include mthca0:1 --mca plm_base_verbose 5 --debug-daemons 
hostname 2>&1|tee > /tmp/mpi.log



Environment:

     According to the attached log, it's an rsh environment.





Output attached



Notes:

The problem is always with the last node: 64 connections work, 65 connections 
fail.

node-119.ssauniversal.ssa.kodiak.nx == ko0237



mpi.log line 1034:

--------------------------------------------------------------------------

An invalid value was supplied for an enum variable.

  Variable     : orte_debug_daemons

  Value        : 1,1

  Valid values : 0: f|false|disabled, 1: t|true|enabled

--------------------------------------------------------------------------



mpi.log line 1059:

[node-119.ssauniversal.ssa.kodiak.nx:02996] [[56978,0],65] ORTE_ERROR_LOG: Error in file base/ess_base_std_orted.c at line 288
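
For a log of this size, it can help to pull out every ORTE_ERROR_LOG line and 
see which daemon reported it. The following is a minimal sketch, not part of 
Open MPI: the function name and the field labels ("job_family", "job", "vpid") 
are my own labels for the three numbers in the bracketed process name, and the 
pattern is based only on the two log lines quoted in this thread.

```python
import re

# Matches lines shaped like the ones quoted in this thread:
#   [host:pid] [[a,b],c] ORTE_ERROR_LOG: <message> in file <path> at line <n>
# The group names are illustrative labels, not official Open MPI terms.
ORTE_ERROR_RE = re.compile(
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+)\]\s+"
    r"\[\[(?P<job_family>\d+),(?P<job>\d+)\],(?P<vpid>\d+)\]\s+"
    r"ORTE_ERROR_LOG:\s+(?P<message>.+?)\s+in file\s+(?P<file>\S+)"
    r"\s+at line\s+(?P<line>\d+)"
)


def parse_orte_errors(text):
    """Return one dict per ORTE_ERROR_LOG entry found in text."""
    # Mail clients wrap long log lines, so re-join them before matching.
    joined = re.sub(r"\s*\n\s*", " ", text)
    errors = []
    for match in ORTE_ERROR_RE.finditer(joined):
        entry = match.groupdict()
        # Convert the numeric fields so they can be sorted or compared.
        for key in ("pid", "job_family", "job", "vpid", "line"):
            entry[key] = int(entry[key])
        errors.append(entry)
    return errors


if __name__ == "__main__":
    with open("/tmp/mpi.log") as log:  # path taken from the Run command above
        for err in parse_orte_errors(log.read()):
            print(err["host"], err["vpid"], err["file"], err["line"])
```

If every reported entry carries the same last number in the process name 
regardless of which host is listed last, that would point away from a 
host-specific problem.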







Lenny Verkhovsky


From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Monday, August 11, 2014 4:53 PM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI fails with np > 65



Okay, let's start with the basics :-)



How was this configured? What environment are you running in (rsh, slurm, ??)? 
If you configured --enable-debug, then please run it with



--mca plm_base_verbose 5 --debug-daemons



and send the output





On Aug 11, 2014, at 12:07 AM, Lenny Verkhovsky <len...@mellanox.com> wrote:



I don't think so.

It's always the 66th node, even if I swap the 65th and 66th.

I also get the same error when setting np=66 while having only 65 hosts in the 
hostfile.

(I am using only the tcp btl.)
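
The experiments described above (swapping the last two hosts, and raising np 
past the number of hosts) can be generated systematically. This is only a 
sketch under stated assumptions: the helper names are hypothetical, and the 
command line reuses the options already shown elsewhere in this thread (-np, 
--host, --mca btl tcp,self, hostname). It builds argument lists but does not 
run them.

```python
def swap_last_two(hosts):
    """Return a copy of hosts with the final two entries exchanged."""
    if len(hosts) < 2:
        return list(hosts)
    return list(hosts[:-2]) + [hosts[-1], hosts[-2]]


def mpirun_argv(hosts, np, mpirun="mpirun"):
    """Build an mpirun argument list that runs `hostname` over the tcp btl."""
    return [
        mpirun,
        "-np", str(np),
        "--host", ",".join(hosts),
        "--mca", "btl", "tcp,self",
        "hostname",
    ]


def experiment_variants(hosts):
    """Return labelled argument lists for the cases tried in this thread."""
    n = len(hosts)
    return {
        "all hosts, original order": mpirun_argv(hosts, n),
        "last two hosts swapped": mpirun_argv(swap_last_two(hosts), n),
        "one host fewer": mpirun_argv(hosts[:-1], n - 1),
        "np one more than hosts": mpirun_argv(hosts, n + 1),
    }
```

Each returned list can be passed to subprocess.run() on the launch node. If 
the failure follows the position rather than the host name, the node itself is 
less likely to be at fault.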





Lenny Verkhovsky


From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Monday, August 11, 2014 1:07 AM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI fails with np > 65



Looks to me like your 65th host is missing the dstore library - is it possible 
you don't have your paths set correctly on all hosts in your hostfile?
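
One way to test the path hypothesis is to ask every host in the hostfile what 
a non-interactive remote shell actually sees. The sketch below is an 
illustration only: the function names are hypothetical, the hostfile parsing 
assumes the usual one-host-per-line layout with optional "slots=N" and "#" 
comments, and the remote command just reports where orted resolves and what 
LD_LIBRARY_PATH contains. It does not look for any specific library file.

```python
# Shell snippet run on each remote host (an assumption about what is useful
# to inspect; adjust as needed).
REMOTE_CHECK = "command -v orted; echo LD_LIBRARY_PATH=$LD_LIBRARY_PATH"


def read_hostfile(text):
    """Extract host names from hostfile text, ignoring comments and blanks."""
    hosts = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if line:
            # The first token is the host; later tokens (e.g. slots=N) are skipped.
            hosts.append(line.split()[0])
    return hosts


def check_commands(hosts, remote_shell="ssh"):
    """Build one remote-shell argument list per host."""
    return [[remote_shell, host, REMOTE_CHECK] for host in hosts]
```

Passing each list to subprocess.run(..., capture_output=True, text=True) and 
comparing the outputs should show quickly whether one host differs from the 
rest. Set remote_shell="rsh" for an rsh environment.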





On Aug 10, 2014, at 1:13 PM, Lenny Verkhovsky <len...@mellanox.com> wrote:





Hi all,



Trying to run OpenMPI (trunk revision 32428), I ran into a problem running 
OMPI with more than 65 procs.

It looks like MPI fails to open the 66th connection, even when just running 
`hostname` over tcp.

It also seems to be unrelated to any specific host.

All hosts are Ubuntu 12.04.1 LTS.



mpirun -np 66 --hostfile /proj/SSA/Mellanox/tmp//20140810_070156_hostfile.txt 
--mca btl tcp,self hostname

[nodename] [[4452,0],65] ORTE_ERROR_LOG: Error in file base/ess_base_std_orted.c at line 288



.......................................

It looks like an environment issue, but I can't find any related limit.

Any ideas ?
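
To rule the limit hypothesis in or out, the per-process resource limits can be 
printed on the launch node and on a remote node. This is a minimal sketch that 
uses Python's standard resource module. The choice of which limits to print is 
an assumption on my part, not a diagnosis of this failure.

```python
import resource

# Limits that are commonly checked when a launch fails at a fixed count.
# Whether either one matters for this failure is an assumption.
LIMITS = {
    "open files": resource.RLIMIT_NOFILE,
    "processes": resource.RLIMIT_NPROC,
}


def fmt(value):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def read_limits():
    """Return {label: (soft, hard)} for each limit in LIMITS."""
    return {name: resource.getrlimit(res) for name, res in LIMITS.items()}


if __name__ == "__main__":
    for name, (soft, hard) in read_limits().items():
        print(f"{name}: soft={fmt(soft)} hard={fmt(hard)}")
```

Limits seen by a non-interactive remote shell can differ from those of an 
interactive login, so it is worth running this through the same launcher that 
mpirun uses.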

Thanks.

Lenny Verkhovsky


_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/24961.php









_______________________________________________
devel mailing list
de...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
Link to this post: http://www.open-mpi.org/community/lists/devel/2014/08/15626.php


