Hi,

I needed the following commit:

r30875 | vasily | 2014-02-27 13:29:47 +0200 (Thu, 27 Feb 2014) | 3 lines

OPENIB BTL/CONNECT: Add support for AF_IB addressing in rdmacm.

Following Gilles's mail about the known issue #4857, I updated and can now run with more than 65 hosts (thanks, Gilles).
Since I am facing another problem, I should probably try 1.8rc as you suggested.

Thanks.
Lenny Verkhovsky
SW Engineer, Mellanox Technologies
www.mellanox.com
Office: +972 74 712 9244
Mobile: +972 54 554 0233
Fax: +972 72 257 9400

From: devel [mailto:[email protected]] On Behalf Of Joshua Ladd
Sent: Wednesday, August 13, 2014 4:20 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] [OMPI users] OpenMPI fails with np > 65

Lenny,

Is there any particular reason that you're using the trunk? The reason I ask is because the trunk is in an unusually high state of flux at the moment with a major move underway. If you're trying to use OMPI for production grade runs, I would strongly advise picking up one of the stable releases in the 1.8.x series. At this time, 1.8.1 is available as the most current stable release. The 1.8.2rc3 prerelease candidate is also available:

http://www.open-mpi.org/software/ompi/v1.8/

Best,
Josh

On Wed, Aug 13, 2014 at 5:19 AM, Gilles Gouaillardet <[email protected]> wrote:

Lenny,

that looks related to #4857, which has been fixed in trunk since r32517.
Could you please update your openmpi library and try again?

Gilles

On 2014/08/13 17:00, Lenny Verkhovsky wrote:

Following Jeff's suggestion, adding the devel mailing list.

Hi All,

I am currently facing a strange situation: I can't run OMPI on more than 65 nodes.
It seems like an environmental issue that does not allow me to open more connections.
Any ideas?
Log attached, more info below in the mail.

Running OMPI from trunk

[node-119.ssauniversal.ssa.kodiak.nx:02996] [[56978,0],65] ORTE_ERROR_LOG: Error in file base/ess_base_std_orted.c at line 288

Thanks.
Lenny Verkhovsky

From: users [mailto:[email protected]] On Behalf Of Lenny Verkhovsky
Sent: Tuesday, August 12, 2014 1:13 PM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI fails with np > 65

Hi,

Config:
./configure --enable-openib-rdmacm-ibaddr --prefix /home/sources/ompi-bin --enable-mpirun-prefix-by-default --with-openib=/usr/local --enable-debug --disable-openib-connectx-xrc

Run:
/home/sources/ompi-bin/bin/mpirun -np 65 --host ko0067,ko0069,ko0070,ko0074,ko0076,ko0079,ko0080,ko0082,ko0085,ko0087,ko0088,ko0090,ko0096,ko0098,ko0099,ko0101,ko0103,ko0107,ko0111,ko0114,ko0116,ko0125,ko0128,ko0134,ko0141,ko0144,ko0145,ko0148,ko0149,ko0150,ko0152,ko0154,ko0156,ko0157,ko0158,ko0162,ko0164,ko0166,ko0168,ko0170,ko0174,ko0178,ko0181,ko0185,ko0190,ko0192,ko0195,ko0197,ko0200,ko0203,ko0205,ko0207,ko0209,ko0210,ko0211,ko0213,ko0214,ko0217,ko0218,ko0223,ko0228,ko0229,ko0231,ko0235,ko0237 --mca btl openib,self --mca btl_openib_cpc_include rdmacm --mca pml ob1 --mca btl_openib_if_include mthca0:1 --mca plm_base_verbose 5 --debug-daemons hostname 2>&1|tee > /tmp/mpi.log

Environment:
According to the attached log, it's an rsh environment.

Output attached

Notes:
The problem is always with the last node: 64 connections work, 65 connections fail.
node-119.ssauniversal.ssa.kodiak.nx == ko0237

mpi.log line 1034:
--------------------------------------------------------------------------
An invalid value was supplied for an enum variable.
Variable     : orte_debug_daemons
Value        : 1,1
Valid values : 0: f|false|disabled, 1: t|true|enabled
--------------------------------------------------------------------------

mpi.log line 1059:
[node-119.ssauniversal.ssa.kodiak.nx:02996] [[56978,0],65] ORTE_ERROR_LOG: Error in file base/ess_base_std_orted.c at line 288

Lenny Verkhovsky

From: users [mailto:[email protected]] On Behalf Of Ralph Castain
Sent: Monday, August 11, 2014 4:53 PM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI fails with np > 65

Okay, let's start with the basics :-)

How was this configured? What environment are you running in (rsh, slurm, ??)? If you configured --enable-debug, then please run it with

--mca plm_base_verbose 5 --debug-daemons

and send the output

On Aug 11, 2014, at 12:07 AM, Lenny Verkhovsky <[email protected]> wrote:

I don't think so. It's always the 66th node, even if I swap the 65th and 66th.
I also get the same error when setting np=66 while having only 65 hosts in the hostfile.
(I am using only the tcp btl.)

Lenny Verkhovsky

From: users [mailto:[email protected]] On Behalf Of Ralph Castain
Sent: Monday, August 11, 2014 1:07 AM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI fails with np > 65

Looks to me like your 65th host is missing the dstore library - is it possible you don't have your paths set correctly on all hosts
in your hostfile?

On Aug 10, 2014, at 1:13 PM, Lenny Verkhovsky <[email protected]> wrote:

Hi all,

Trying to run OpenMPI (trunk Revision: 32428), I ran into a problem running OMPI with more than 65 procs.
It looks like MPI fails to open the 66th connection, even when just running `hostname` over tcp.
It also seems to be unrelated to any specific host.
All hosts are Ubuntu 12.04.1 LTS.

mpirun -np 66 --hostfile /proj/SSA/Mellanox/tmp//20140810_070156_hostfile.txt --mca btl tcp,self hostname
[nodename] [[4452,0],65] ORTE_ERROR_LOG: Error in file base/ess_base_std_orted.c at line 288
.......................................

It looks like an environment issue, but I can't find any related limit.
Any ideas?

Thanks.
Lenny Verkhovsky

_______________________________________________
users mailing list
[email protected]
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/24961.php

_______________________________________________
users mailing list
[email protected]
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/24964.php

_______________________________________________
devel mailing list
[email protected]
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
Link to this post: http://www.open-mpi.org/community/lists/devel/2014/08/15626.php

_______________________________________________
devel mailing list
[email protected]
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
Link to this post: http://www.open-mpi.org/community/lists/devel/2014/08/15627.php
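
Editor's sketch, not part of the original thread: the reports above locate the failure by reading ORTE_ERROR_LOG lines out of mpi.log by hand. Assuming the log lines keep the shape quoted in the thread, and assuming the trailing number in the bracketed process name is the daemon's rank, a small Python helper could pull out the host, rank, source file, and line for every such entry. The names `parse_orte_errors` and `failing_ranks` are hypothetical and are not part of Open MPI.

```python
import re
from collections import Counter

# Matches lines shaped like the ones quoted in the thread, e.g.
# [host:pid] [[jobfamily,job],rank] ORTE_ERROR_LOG: Error in file <file> at line <n>
# The ":pid" part is optional because one quoted line shows only "[nodename]".
ERROR_RE = re.compile(
    r"\[(?P<host>[^\]:]+)(?::(?P<pid>\d+))?\]\s+"
    r"\[\[(?P<jobfam>\d+),(?P<job>\d+)\],(?P<rank>\d+)\]\s+"
    r"ORTE_ERROR_LOG:\s+(?P<msg>.*?)\s+in file\s+(?P<file>\S+)"
    r"\s+at line\s+(?P<line>\d+)"
)


def parse_orte_errors(lines):
    """Return one dict per ORTE_ERROR_LOG line found in an iterable of strings."""
    found = []
    for text in lines:
        match = ERROR_RE.search(text)
        if match is None:
            continue  # not an ORTE_ERROR_LOG line in the assumed format
        found.append({
            "host": match.group("host"),
            "rank": int(match.group("rank")),  # assumed to be the daemon rank
            "file": match.group("file"),
            "line": int(match.group("line")),
        })
    return found


def failing_ranks(errors):
    """Count how many error lines each rank produced."""
    return Counter(entry["rank"] for entry in errors)
```

To use it on the log from the thread, one would pass an open file, for example `parse_orte_errors(open("/tmp/mpi.log"))`, and then check whether `failing_ranks` reports only the highest rank, which is what the "always the last node" observation suggests.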
