Re: [OMPI devel] [1.8.2rc4] build failure with --enable-osx-builtin-atomics

2014-08-13 Thread Ralph Castain
Thanks Paul - fixed in r32530



On Wed, Aug 13, 2014 at 2:42 PM, Paul Hargrove  wrote:

> When configured with --enable-osx-builtin-atomics, the build fails:
>
> Making all in asm
>   CC   asm.lo
> In file included from
> /Users/Paul/OMPI/openmpi-1.8.2rc4-macos10.8-x86-clang-atomics/openmpi-1.8.2rc4/opal/asm/asm.c:21:
> /Users/Paul/OMPI/openmpi-1.8.2rc4-macos10.8-x86-clang-atomics/openmpi-1.8.2rc4/opal/include/opal/sys/atomic.h:145:10:
> fatal error: 'opal/sys/osx/atomic.h' file not found
> #include "opal/sys/osx/atomic.h"
>  ^
> 1 error generated.
>
> I reported the same issue to George in the trunk last week.
> So, I am 95% certain one just needs to cmr r32390 (commit msg == 'Dont
> miss the Os X atomics on "make dist"')
>
>
> -Paul
>
> --
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/08/15642.php
>


[OMPI devel] [1.8.2rc4] OSHMEM fortran bindings with bad compilers

2014-08-13 Thread Paul Hargrove
The following is NOT a bug report.
This is just an observation that may deserve some text in the README.

I've reported issues in the past with some Fortran compilers (mostly older
XLC and PGI) which either cannot build the "use mpi_f08" module, or cannot
correctly link to it (and sometimes this fails only if configured with
--enable-debug).

Testing the OSHMEM Fortran bindings (enabled by default on Linux) I have
found several compilers which fail to link the examples (hello_oshmemfh and
ring_oshmemfh).  I reported one specific instance (with xlc-11/xlf-13) back
in February: http://www.open-mpi.org/community/lists/devel/2014/02/14057.php

So far I have these failures only on platforms where the Fortran compiler
is *known* to be broken for the MPI f90 and/or f08 bindings.  Specifically,
all the failing platforms are ones on which either:
+ Configure determines (without my help) that FC cannot build the F90
and/or F08 modules.
OR
+ I must pass --enable-mpi-fortran=usempi or --enable-mpi-fortran=mpifh for
cases configure cannot detect.

So, I do *not* believe there is anything wrong with the OSHMEM code, which
is why I started this post with "The following is NOT a bug report".
 However, I have two recommendations to make:

1) Documentation

The README says just:

--disable-oshmem-fortran
  Disable building only the Fortran OSHMEM bindings.

So, I recommend adding a sentence there referencing the "Compiler Notes"
section of the README which has details on some known bad Fortran compilers.

2) Configure:

As I noted above, at least some of the failures are on platforms where
configure has determined it cannot build the f08 MPI bindings.  So, maybe
there is something that could be done at configure time to disqualify some
Fortran compilers from building the OSHMEM Fortran bindings, too.

-Paul

-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


[OMPI devel] [1.8.2rc4] build failure with --enable-osx-builtin-atomics

2014-08-13 Thread Paul Hargrove
When configured with --enable-osx-builtin-atomics, the build fails:

Making all in asm
  CC   asm.lo
In file included from
/Users/Paul/OMPI/openmpi-1.8.2rc4-macos10.8-x86-clang-atomics/openmpi-1.8.2rc4/opal/asm/asm.c:21:
/Users/Paul/OMPI/openmpi-1.8.2rc4-macos10.8-x86-clang-atomics/openmpi-1.8.2rc4/opal/include/opal/sys/atomic.h:145:10:
fatal error: 'opal/sys/osx/atomic.h' file not found
#include "opal/sys/osx/atomic.h"
 ^
1 error generated.

I reported the same issue to George in the trunk last week.
So, I am 95% certain one just needs to cmr r32390 (commit msg == 'Dont miss
the Os X atomics on "make dist"')
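For context, --enable-osx-builtin-atomics selects the OS X built-in atomics
instead of OPAL's own assembly. A minimal standalone sketch of the underlying
libkern/OSAtomic.h API follows; this is an illustration only and is not the
contents of the missing opal/sys/osx/atomic.h header:

/* Illustration only: the libkern OSAtomic primitives that the
 * --enable-osx-builtin-atomics path presumably builds on; NOT the
 * contents of opal/sys/osx/atomic.h. */
#include <stdio.h>
#include <libkern/OSAtomic.h>

int main(void)
{
    volatile int32_t counter = 0;

    /* atomic fetch-and-add with a full memory barrier */
    OSAtomicAdd32Barrier(1, &counter);

    /* atomic compare-and-swap: set counter to 42 only if it is still 1 */
    if (OSAtomicCompareAndSwap32Barrier(1, 42, &counter))
        printf("CAS succeeded, counter = %d\n", counter);

    return 0;
}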


-Paul

-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] trunk hang when nodes have similar but private network

2014-08-13 Thread George Bosilca
The trunk is [almost] right. It has nice error handling, and a bunch of
other features.

However, part of this bug report is troubling. We might want to check why
it doesn't exhaust all possible addresses before giving up on an endpoint.

  George.


PS: I'm not saying that we should back-port this to 1.8 ...


On Wed, Aug 13, 2014 at 3:33 PM, Jeff Squyres (jsquyres)  wrote:

> On Aug 13, 2014, at 12:52 PM, George Bosilca  wrote:
>
> > There are many differences between the trunk and 1.8 regarding the TCP
> BTL. The major I remember about is that the TCP in the trunk is reporting
> errors to the upper level via the callbacks attached to fragments, while
> the 1.8 TCP BTL doesn't.
> >
> > So, I guess that once a connection to a particular endpoint fails, the
> trunk is getting the errors reported via the cb and then takes some drastic
> measure. In the 1.8 we might fallback and try another IP address before
> giving up.
>
> Does that have any effect on performance?
>
> I.e., should we bring this change to v1.8?
>
> Or, put simply: which way is Right?
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/08/15638.php
>


[OMPI devel] 1.8.2rc4 is out

2014-08-13 Thread Jeff Squyres (jsquyres)
Please test!  Ralph would like to release after the teleconf next Tuesday:

http://www.open-mpi.org/software/ompi/v1.8/

Changes since last rc:

- Fix cascading/over-quoting in some cases with the rsh/ssh-based
  launcher.  Thanks to multiple users for raising the issue.
- Properly add support for gfortran 4.9 ignore TKR pragma (it was
  erroneously only partially added in v1.7.5).  Thanks to Marcus
  Daniels for raising the issue.
- Update/improve help messages in the usnic BTL.
- Resolve a race condition in MPI_Abort.
- Fix obscure cases where static linking from wrapper compilers would
  fail.
- Clarify the configure --help message about when OpenSHMEM is
  enabled/disabled by default.  Thanks to Paul Hargrove for the
  suggestion.
- Align pages properly where relevant.  Thanks to Paul Hargrove for
  identifying the issue.
- Various compiler warning and minor fixes for OpenBSD, FreeBSD, and
  Solaris/SPARC.  Thanks to Paul Hargrove for the patches.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] trunk hang when nodes have similar but private network

2014-08-13 Thread Jeff Squyres (jsquyres)
Paul: I think this is a slippery slope.

As I understand it, these private/on-host IP addresses are generated somewhat 
randomly (e.g., for on-host VM networking -- I don't know if the IP's for Phi 
on-host networking are pseudo-random or [effectively] fixed).  So you might end 
up in a situation like this:

server A: has br0 on-host IP address 10.0.0.23/8 ***same as server C
server B: has br0 on-host IP address 10.0.0.25/8
server C: has br0 on-host IP address 10.0.0.23/8 ***same as server A
server D: has br0 on-host IP address 10.0.0.107/8

In this case, servers A and C will detect that they have the same IP.  "Ah ha!" 
they say. "I'll just not use br0, because clearly this is erroneous".

But how will servers B and D know this?

You'll likely get the same "hang" behavior that we currently have, because B 
may try to send to A on 10.0.0.23/8.

Hence, the additional logic may not actually solve the problem.

I'm thinking that this is a human-configuration issue -- there may not be a 
good way to detect this automatically.

...unless there's a bit in Linux interfaces that says "this is an on-host 
network".  Does that exist?  Because that would be a better way to disqualify 
Linux IP interfaces.
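For reference, a minimal sketch of the per-interface information Linux actually
exposes (illustration only, not BTL code): getifaddrs() reports flags such as
IFF_LOOPBACK and IFF_POINTOPOINT, but a bridge like br0 carries no flag marking
it as host-local, which is why it cannot be disqualified automatically:

/* Illustration only (not Open MPI code): list IPv4 interfaces and the flags
 * available for disqualifying them.  A bridge such as br0 shows the same
 * IFF_* bits as a real Ethernet interface. */
#include <stdio.h>
#include <sys/socket.h>
#include <ifaddrs.h>
#include <net/if.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
    struct ifaddrs *ifap, *ifa;
    char addr[INET_ADDRSTRLEN];

    if (getifaddrs(&ifap) != 0) {
        perror("getifaddrs");
        return 1;
    }
    for (ifa = ifap; ifa != NULL; ifa = ifa->ifa_next) {
        if (ifa->ifa_addr == NULL || ifa->ifa_addr->sa_family != AF_INET)
            continue;
        inet_ntop(AF_INET, &((struct sockaddr_in *)ifa->ifa_addr)->sin_addr,
                  addr, sizeof(addr));
        printf("%-8s %-15s%s%s\n", ifa->ifa_name, addr,
               (ifa->ifa_flags & IFF_LOOPBACK)    ? " LOOPBACK"    : "",
               (ifa->ifa_flags & IFF_POINTOPOINT) ? " POINTOPOINT" : "");
    }
    freeifaddrs(ifap);
    return 0;
}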


On Aug 13, 2014, at 1:57 PM, Paul Hargrove  wrote:

> I think that in this case one *could* add logic that would disqualify the 
> subnet because every compute node in the job has the SAME address.  In fact, 
> any subnet on which two or more compute nodes have the same address must be 
> suspect.  If this logic were introduced, the 127.0.0.1 loopback address 
> wouldn't need to be a special case.
> 
> This is just an observation, not a feature request (at least not on my part).
> 
> -Paul
> 
> 
> On Wed, Aug 13, 2014 at 7:55 AM, Jeff Squyres (jsquyres)  
> wrote:
> I think this is expected behavior.
> 
> If you have networks that you need Open MPI to ignore (e.g., a private 
> network that *looks* reachable between multiple servers -- because the 
> interfaces are on the same subnet -- but actually *isn't*), then the 
> include/exclude mechanism is the right way to exclude them.
> 
> That being said, I'm not sure why the behavior is different between trunk and 
> v1.8.
> 
> 
> On Aug 13, 2014, at 1:41 AM, Gilles Gouaillardet 
>  wrote:
> 
> > Folks,
> >
> > i noticed mpirun (trunk) hangs when running any mpi program on two nodes
> > *and* each node has a private network with the same ip
> > (in my case, each node has a private network to a MIC)
> >
> > in order to reproduce the problem, you can simply run (as root) on the
> > two compute nodes
> > brctl addbr br0
> > ifconfig br0 192.168.255.1 netmask 255.255.255.0
> >
> > mpirun will hang
> >
> > a workaround is to add --mca btl_tcp_if_include eth0
> >
> > v1.8 does not hang in this case
> >
> > Cheers,
> >
> > Gilles
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post: 
> > http://www.open-mpi.org/community/lists/devel/2014/08/15623.php
> 
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/08/15631.php
> 
> 
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/08/15636.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] trunk hang when nodes have similar but private network

2014-08-13 Thread Jeff Squyres (jsquyres)
On Aug 13, 2014, at 12:52 PM, George Bosilca  wrote:

> There are many differences between the trunk and 1.8 regarding the TCP BTL. 
> The major I remember about is that the TCP in the trunk is reporting errors 
> to the upper level via the callbacks attached to fragments, while the 1.8 TCP 
> BTL doesn't.
> 
> So, I guess that once a connection to a particular endpoint fails, the trunk 
> is getting the errors reported via the cb and then takes some drastic 
> measure. In the 1.8 we might fallback and try another IP address before 
> giving up.

Does that have any effect on performance?

I.e., should we bring this change to v1.8?

Or, put simply: which way is Right?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] [OMPI users] OpenMPI fails with np > 65

2014-08-13 Thread Lenny Verkhovsky
Thanks, Josh.
Then I guess I will solve it internally ☺


Lenny Verkhovsky
SW Engineer,  Mellanox Technologies
www.mellanox.com

Office:+972 74 712 9244
Mobile:  +972 54 554 0233
Fax:+972 72 257 9400

From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Joshua Ladd
Sent: Wednesday, August 13, 2014 7:37 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] [OMPI users] OpenMPI fails with np > 65

Ah, I see. That change didn't make it into the release branch (I don't know if 
it was never CMRed or what, I have a vague recollection of it passing through.) 
If you need that change, then I recommend checking out the trunk at r30875. 
This was back when the trunk was in a more stable state.

Best,

Josh

On Wed, Aug 13, 2014 at 9:29 AM, Lenny Verkhovsky 
> wrote:
Hi,
I needed the following commit

r30875 | vasily | 2014-02-27 13:29:47 +0200 (Thu, 27 Feb 2014) | 3 lines
OPENIB BTL/CONNECT: Add support for AF_IB addressing in rdmacm.

Following Gilles's mail about the known #4857 issue I got the update, and now I can run
with more than 65 hosts.
( thanks,  Gilles )

Since I am facing another problem, I probably should try 1.8rc as you suggested.
Thanks.
Lenny Verkhovsky
SW Engineer,  Mellanox Technologies
www.mellanox.com

Office:+972 74 712 9244
Mobile:  +972 54 554 0233
Fax:+972 72 257 9400

From: devel 
[mailto:devel-boun...@open-mpi.org] On 
Behalf Of Joshua Ladd
Sent: Wednesday, August 13, 2014 4:20 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] [OMPI users] OpenMPI fails with np > 65

Lenny,
Is there any particular reason that you're using the trunk? The reason I ask is 
because the trunk is in an unusually high state of flux at the moment with a 
major move underway. If you're trying to use OMPI for production grade runs, I 
would strongly advise picking up one of the stable releases in the 1.8.x 
series. At this time, 1.8.1 is available as the most current stable release. The 
1.8.2rc3 prerelease candidate is also available:

http://www.open-mpi.org/software/ompi/v1.8/
Best,
Josh



On Wed, Aug 13, 2014 at 5:19 AM, Gilles Gouaillardet 
> wrote:
Lenny,

that looks related to #4857 which has been fixed in trunk since r32517

could you please update your openmpi library and try again ?

Gilles


On 2014/08/13 17:00, Lenny Verkhovsky wrote:

Following Jeff's suggestion adding devel mailing list.



Hi All,

I am currently facing a strange situation where I can't run OMPI on more than 65 
nodes.

It seems like an environmental issue that does not allow me to open more 
connections.

Any ideas ?

Log attached, more info below in the mail.



Running OMPI from trunk

[node-119.ssauniversal.ssa.kodiak.nx:02996] [[56978,0],65] ORTE_ERROR_LOG: 
Error in file base/ess_base_std_orted.c at line 288



Thanks.

Lenny Verkhovsky

SW Engineer,  Mellanox Technologies

www.mellanox.com





Office:+972 74 712 9244

Mobile:  +972 54 554 0233

Fax:+972 72 257 9400



From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lenny Verkhovsky

Sent: Tuesday, August 12, 2014 1:13 PM

To: Open MPI Users

Subject: Re: [OMPI users] OpenMPI fails with np > 65





Hi,



Config:

./configure --enable-openib-rdmacm-ibaddr --prefix /home/sources/ompi-bin 
--enable-mpirun-prefix-by-default --with-openib=/usr/local --enable-debug 
--disable-openib-connectx-xrc



Run:

/home/sources/ompi-bin/bin/mpirun -np 65 --host 
ko0067,ko0069,ko0070,ko0074,ko0076,ko0079,ko0080,ko0082,ko0085,ko0087,ko0088,ko0090,ko0096,ko0098,ko0099,ko0101,ko0103,ko0107,ko0111,ko0114,ko0116,ko0125,ko0128,ko0134,ko0141,ko0144,ko0145,ko0148,ko0149,ko0150,ko0152,ko0154,ko0156,ko0157,ko0158,ko0162,ko0164,ko0166,ko0168,ko0170,ko0174,ko0178,ko0181,ko0185,ko0190,ko0192,ko0195,ko0197,ko0200,ko0203,ko0205,ko0207,ko0209,ko0210,ko0211,ko0213,ko0214,ko0217,ko0218,ko0223,ko0228,ko0229,ko0231,ko0235,ko0237
 --mca btl openib,self  --mca btl_openib_cpc_include rdmacm --mca pml ob1 --mca 
btl_openib_if_include mthca0:1 --mca plm_base_verbose 5 --debug-daemons 
hostname 2>&1|tee > /tmp/mpi.log



Environment:

 According to the attached log it's rsh environment





Output attached



Notes:

The problem is always with the last node, 64 connections work, 65 connections 
fail.

node-119.ssauniversal.ssa.kodiak.nx == ko0237



mpi.log line 1034:

--

An invalid value was supplied for an enum variable.

  Variable : orte_debug_daemons

  Value: 1,1

  Valid values : 0: f|false|disabled, 1: t|true|enabled


Re: [OMPI devel] trunk hang when nodes have similar but private network

2014-08-13 Thread Paul Hargrove
I think that in this case one *could* add logic that would disqualify the
subnet because every compute node in the job has the SAME address.  In
fact, any subnet on which two or more compute nodes have the same address
must be suspect.  If this logic were introduced, the 127.0.0.1 loopback
address wouldn't need to be a special case.
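A minimal sketch of the test described above (illustration only, nothing like
this exists in the code base): given the IPv4 address each peer advertises on a
subnet, the subnet becomes suspect as soon as two peers report the same address:

/* Illustration only: mark a subnet suspect if two or more peers advertise
 * the same IPv4 address on it. */
#include <stdio.h>
#include <arpa/inet.h>

struct peer_addr { int rank; struct in_addr addr; };

static int subnet_is_suspect(const struct peer_addr *p, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if (p[i].addr.s_addr == p[j].addr.s_addr)
                return 1;   /* same address on two nodes: do not use it */
    return 0;
}

int main(void)
{
    struct peer_addr peers[2] = { { 0 }, { 1 } };
    inet_pton(AF_INET, "192.168.255.1", &peers[0].addr);   /* node A's br0 */
    inet_pton(AF_INET, "192.168.255.1", &peers[1].addr);   /* node B's br0 */
    printf("subnet suspect: %d\n", subnet_is_suspect(peers, 2));
    return 0;
}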

This is just an observation, not a feature request (at least not on my
part).

-Paul


On Wed, Aug 13, 2014 at 7:55 AM, Jeff Squyres (jsquyres)  wrote:

> I think this is expected behavior.
>
> If you have networks that you need Open MPI to ignore (e.g., a private
> network that *looks* reachable between multiple servers -- because the
> interfaces are on the same subnet -- but actually *isn't*), then the
> include/exclude mechanism is the right way to exclude them.
>
> That being said, I'm not sure why the behavior is different between trunk
> and v1.8.
>
>
> On Aug 13, 2014, at 1:41 AM, Gilles Gouaillardet <
> gilles.gouaillar...@iferc.org> wrote:
>
> > Folks,
> >
> > i noticed mpirun (trunk) hangs when running any mpi program on two nodes
> > *and* each node has a private network with the same ip
> > (in my case, each node has a private network to a MIC)
> >
> > in order to reproduce the problem, you can simply run (as root) on the
> > two compute nodes
> > brctl addbr br0
> > ifconfig br0 192.168.255.1 netmask 255.255.255.0
> >
> > mpirun will hang
> >
> > a workaround is to add --mca btl_tcp_if_include eth0
> >
> > v1.8 does not hang in this case
> >
> > Cheers,
> >
> > Gilles
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/08/15623.php
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/08/15631.php
>



-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] trunk hang when nodes have similar but private network

2014-08-13 Thread George Bosilca
There are many differences between the trunk and 1.8 regarding the TCP BTL.
The major I remember about is that the TCP in the trunk is reporting errors
to the upper level via the callbacks attached to fragments, while the 1.8
TCP BTL doesn't.

So, I guess that once a connection to a particular endpoint fails, the
trunk is getting the errors reported via the cb and then takes some drastic
measure. In the 1.8 we might fallback and try another IP address before
giving up.
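For readers not familiar with the layering, a generic sketch of the pattern
George describes (the names and types here are invented for illustration; this
is not the actual BTL interface):

#include <stdio.h>

/* Illustration only: a completion callback attached to each fragment lets the
 * BTL report a failed send upward instead of silently retrying. */
typedef void (*frag_cb_t)(void *frag, int status);

struct fragment {
    frag_cb_t cb;        /* completion callback installed by the upper layer */
    void     *payload;
};

static void upper_layer_cb(void *frag, int status)
{
    if (status != 0)
        printf("upper layer notified of send failure (%d); it can now fail over\n",
               status);
}

static void btl_send(struct fragment *f, int io_result)
{
    /* ...attempt the TCP send; io_result < 0 models a broken connection... */
    f->cb(f, io_result < 0 ? -1 : 0);   /* report the outcome upward */
}

int main(void)
{
    struct fragment f = { upper_layer_cb, NULL };
    btl_send(&f, -1);   /* simulate a failed endpoint */
    return 0;
}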

  George.



On Wed, Aug 13, 2014 at 10:55 AM, Jeff Squyres (jsquyres) <
jsquy...@cisco.com> wrote:

> I think this is expected behavior.
>
> If you have networks that you need Open MPI to ignore (e.g., a private
> network that *looks* reachable between multiple servers -- because the
> interfaces are on the same subnet -- but actually *isn't*), then the
> include/exclude mechanism is the right way to exclude them.
>
> That being said, I'm not sure why the behavior is different between trunk
> and v1.8.
>
>
> On Aug 13, 2014, at 1:41 AM, Gilles Gouaillardet <
> gilles.gouaillar...@iferc.org> wrote:
>
> > Folks,
> >
> > i noticed mpirun (trunk) hangs when running any mpi program on two nodes
> > *and* each node has a private network with the same ip
> > (in my case, each node has a private network to a MIC)
> >
> > in order to reproduce the problem, you can simply run (as root) on the
> > two compute nodes
> > brctl addbr br0
> > ifconfig br0 192.168.255.1 netmask 255.255.255.0
> >
> > mpirun will hang
> >
> > a workaround is to add --mca btl_tcp_if_include eth0
> >
> > v1.8 does not hang in this case
> >
> > Cheers,
> >
> > Gilles
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/08/15623.php
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/08/15631.php
>


Re: [OMPI devel] [OMPI users] OpenMPI fails with np > 65

2014-08-13 Thread Joshua Ladd
Ah, I see. That change didn't make it into the release branch (I don't know
if it was never CMRed or what, I have a vague recollection of it passing
through.) If you need that change, then I recommend checking out the trunk
at r30875. This was back when the trunk was in a more stable state.


Best,

Josh


On Wed, Aug 13, 2014 at 9:29 AM, Lenny Verkhovsky 
wrote:

>  Hi,
>
> I needed the following commit
>
>
>
> r30875 | vasily | 2014-02-27 13:29:47 +0200 (Thu, 27 Feb 2014) | 3 lines
>
> OPENIB BTL/CONNECT: Add support for AF_IB addressing in rdmacm.
>
>
>
> Following Gilles’s  mail about known #4857 issue I got update and now I
> can run with more than 65 hosts.
>
> ( thanks,  Gilles )
>
>
>
> Since I am facing another problem, I probably should try 1.8rc as you
> suggested.
>
> Thanks.
>
> *Lenny Verkhovsky*
>
> SW Engineer,  Mellanox Technologies
>
> www.mellanox.com
>
>
>
> Office:+972 74 712 9244
>
> Mobile:  +972 54 554 0233
>
> Fax:+972 72 257 9400
>
>
>
> *From:* devel [mailto:devel-boun...@open-mpi.org] *On Behalf Of *Joshua
> Ladd
> *Sent:* Wednesday, August 13, 2014 4:20 PM
> *To:* Open MPI Developers
> *Subject:* Re: [OMPI devel] [OMPI users] OpenMPI fails with np > 65
>
>
>
> Lenny,
>
> Is there any particular reason that you're using the trunk? The reason I
> ask is because the trunk is in an unusually high state of flux at the
> moment with a major move underway. If you're trying to use OMPI for
> production grade runs, I would strongly advise picking up one of the stable
> releases in the 1.8.x series. At this time, 1.8.1 is available as the most
> current stable release. The 1.8.2rc3 prerelease candidate is also available:
>
> http://www.open-mpi.org/software/ompi/v1.8/
>
> Best,
>
> Josh
>
>
>
>
>
>
> On Wed, Aug 13, 2014 at 5:19 AM, Gilles Gouaillardet <
> gilles.gouaillar...@iferc.org> wrote:
>
> Lenny,
>
> that looks related to #4857 which has been fixed in trunk since r32517
>
> could you please update your openmpi library and try again ?
>
> Gilles
>
>
>
> On 2014/08/13 17:00, Lenny Verkhovsky wrote:
>
>  Following Jeff's suggestion adding devel mailing list.
>
>
>
> Hi All,
>
> I am currently facing strange situation that I can't run OMPI on more than 65 
> nodes.
>
> It seems like environmental issue that does not allow me to open more 
> connections.
>
> Any ideas ?
>
> Log attached, more info below in the mail.
>
>
>
> Running OMPI from trunk
>
> [node-119.ssauniversal.ssa.kodiak.nx:02996] [[56978,0],65] ORTE_ERROR_LOG: 
> Error in file base/ess_base_std_orted.c at line 288
>
>
>
> Thanks.
>
> Lenny Verkhovsky
>
> SW Engineer,  Mellanox Technologies
>
>  www.mellanox.com 
>
>
>
>
>
> Office:+972 74 712 9244
>
> Mobile:  +972 54 554 0233
>
> Fax:+972 72 257 9400
>
>
>
> From: users [mailto:users-boun...@open-mpi.org ] 
> On Behalf Of Lenny Verkhovsky
>
> Sent: Tuesday, August 12, 2014 1:13 PM
>
> To: Open MPI Users
>
> Subject: Re: [OMPI users] OpenMPI fails with np > 65
>
>
>
>
>
> Hi,
>
>
>
> Config:
>
> ./configure --enable-openib-rdmacm-ibaddr --prefix /home/sources/ompi-bin 
> --enable-mpirun-prefix-by-default --with-openib=/usr/local --enable-debug 
> --disable-openib-connectx-xrc
>
>
>
> Run:
>
> /home/sources/ompi-bin/bin/mpirun -np 65 --host 
> ko0067,ko0069,ko0070,ko0074,ko0076,ko0079,ko0080,ko0082,ko0085,ko0087,ko0088,ko0090,ko0096,ko0098,ko0099,ko0101,ko0103,ko0107,ko0111,ko0114,ko0116,ko0125,ko0128,ko0134,ko0141,ko0144,ko0145,ko0148,ko0149,ko0150,ko0152,ko0154,ko0156,ko0157,ko0158,ko0162,ko0164,ko0166,ko0168,ko0170,ko0174,ko0178,ko0181,ko0185,ko0190,ko0192,ko0195,ko0197,ko0200,ko0203,ko0205,ko0207,ko0209,ko0210,ko0211,ko0213,ko0214,ko0217,ko0218,ko0223,ko0228,ko0229,ko0231,ko0235,ko0237
>  --mca btl openib,self  --mca btl_openib_cpc_include rdmacm --mca pml ob1 
> --mca btl_openib_if_include mthca0:1 --mca plm_base_verbose 5 --debug-daemons 
> hostname 2>&1|tee > /tmp/mpi.log
>
>
>
> Environment:
>
>  According to the attached log it's rsh environment
>
>
>
>
>
> Output attached
>
>
>
> Notes:
>
> The problem is always with the last node, 64 connections work, 65 connections 
> fail.
>
> node-119.ssauniversal.ssa.kodiak.nx == ko0237
>
>
>
> mpi.log line 1034:
>
> --
>
> An invalid value was supplied for an enum variable.
>
>   Variable : orte_debug_daemons
>
>   Value: 1,1
>
>   Valid values : 0: f|false|disabled, 1: t|true|enabled
>
> --
>
>
>
> mpi.log line 1059:
>
> [node-119.ssauniversal.ssa.kodiak.nx:02996] [[56978,0],65] ORTE_ERROR_LOG: 
> Error in file base/ess_base_std_orted.c at line 288
>
>
>
>
>
>
>
> Lenny Verkhovsky
>
> SW Engineer,  Mellanox Technologies
>
>  www.mellanox.com 
>
>
>
>
>
> Office:+972 74 712 9244
>
> Mobile:  

Re: [OMPI devel] Errors on aborting programs on 1.8 r32515

2014-08-13 Thread Ralph Castain
Fixed - just a lingering free that should have been removed
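For context, a generic stand-alone illustration of the failure mode Rolf reports
below and of the usual guard (this is demo code, not the actual
odls_base_default_fns.c logic):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    char *abortfile = malloc(64);            /* stands in for the abort-file path */
    snprintf(abortfile, 64, "/tmp/aborted");

    free(abortfile);                         /* first owner releases the buffer */
    /* free(abortfile); */                   /* a second, lingering free corrupts the
                                                heap: glibc aborts with "double free
                                                or corruption" */
    abortfile = NULL;                        /* NULLing the pointer makes any later
                                                free harmless... */
    free(abortfile);                         /* ...because free(NULL) is a no-op */
    return 0;
}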



On Wed, Aug 13, 2014 at 8:21 AM, Rolf vandeVaart 
wrote:

> I noticed MTT failures from last night and then reproduced this morning on
> 1.8 branch.  Looks like maybe a double free.  I assume it is related to
> fixes for aborting programs. Maybe related to
> https://svn.open-mpi.org/trac/ompi/changeset/32508 but not sure.
>
> [rvandevaart@drossetti-ivy0 environment]$ pwd
> /ivylogin/home/rvandevaart/tests/ompi-tests/trunk/ibm/environment
> [rvandevaart@drossetti-ivy0 environment]$ mpirun --mca odls_base_verbose
> 20 -np 2 abort
> [...stuff deleted...]
> [drossetti-ivy0.nvidia.com:14953] [[58714,0],0] odls: sending message to
> tag 30 on child [[58714,1],0]
> [drossetti-ivy0.nvidia.com:14953] [[58714,0],0] odls: sending message to
> tag 30 on child [[58714,1],1]
> [drossetti-ivy0.nvidia.com:14953] [[58714,0],0] odls: sending message to
> tag 30 on child [[58714,1],0]
> [drossetti-ivy0.nvidia.com:14953] [[58714,0],0] odls: sending message to
> tag 30 on child [[58714,1],1]
> **
> This program tests MPI_ABORT and generates error messages
> ERRORS ARE EXPECTED AND NORMAL IN THIS PROGRAM!!
> **
> --
> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> with errorcode 3.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --
> [drossetti-ivy0.nvidia.com:14953] [[58714,0],0] odls:wait_local_proc
> child process [[58714,1],0] pid 14955 terminated
> [drossetti-ivy0.nvidia.com:14953] [[58714,0],0] odls:waitpid_fired child
> [[58714,1],0] exit code 3
> [drossetti-ivy0.nvidia.com:14953] [[58714,0],0] odls:waitpid_fired
> checking abort file 
> /tmp/openmpi-sessions-rvandevaart@drossetti-ivy0_0/58714/1/0/aborted
> for child [[58714,1],0]
> [drossetti-ivy0.nvidia.com:14953] [[58714,0],0] odls:waitpid_fired child
> [[58714,1],0] died by call to abort
> *** glibc detected *** mpirun: double free or corruption (fasttop):
> 0x0130e210 ***
>
> From gdb:
> gdb) where
> #0  0x7f75ede138e5 in raise () from /lib64/libc.so.6
> #1  0x7f75ede1504d in abort () from /lib64/libc.so.6
> #2  0x7f75ede517f7 in __libc_message () from /lib64/libc.so.6
> #3  0x7f75ede57126 in malloc_printerr () from /lib64/libc.so.6
> #4  0x7f75eef9eac4 in odls_base_default_wait_local_proc (pid=14955,
> status=768, cbdata=0x0)
> at ../../../../orte/mca/odls/base/odls_base_default_fns.c:2007
> #5  0x7f75eef60a78 in do_waitall (options=0) at
> ../../orte/runtime/orte_wait.c:554
> #6  0x7f75eef60712 in orte_wait_signal_callback (fd=17, event=8,
> arg=0x7f75ef201400) at ../../orte/runtime/orte_wait.c:421
> #7  0x7f75eecaecbe in event_signal_closure (base=0x1278370,
> ev=0x7f75ef201400)
> at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1081
> #8  0x7f75eecaf7e0 in event_process_active_single_queue
> (base=0x1278370, activeq=0x12788f0)
> at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1359
> #9  0x7f75eecafaca in event_process_active (base=0x1278370)
> at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1437
> #10 0x7f75eecb0148 in opal_libevent2021_event_base_loop
> (base=0x1278370, flags=1)
> at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1645
> #11 0x00405572 in orterun (argc=7, argv=0x7fffbdf1dd08) at
> ../../../../orte/tools/orterun/orterun.c:1078
> #12 0x00403904 in main (argc=7, argv=0x7fffbdf1dd08) at
> ../../../../orte/tools/orterun/main.c:13
> (gdb) up
> #1  0x7f75ede1504d in abort () from /lib64/libc.so.6
> (gdb) up
> #2  0x7f75ede517f7 in __libc_message () from /lib64/libc.so.6
> (gdb) up
> #3  0x7f75ede57126 in malloc_printerr () from /lib64/libc.so.6
> (gdb) up
> #4  0x7f75eef9eac4 in odls_base_default_wait_local_proc (pid=14955,
> status=768, cbdata=0x0)
> at ../../../../orte/mca/odls/base/odls_base_default_fns.c:2007
> 2007free(abortfile);
> (gdb) print abortfile
> $1 = 0x130e210 ""
> (gdb)
>
> ---
> This email message is for the sole use of the intended recipient(s) and
> may contain
> confidential information.  Any unauthorized review, use, disclosure or
> distribution
> is prohibited.  If you are not the intended recipient, please contact the
> sender by
> reply email and destroy all copies of the original message.
>
> ---
> ___
> devel mailing list
> 

[OMPI devel] Errors on aborting programs on 1.8 r32515

2014-08-13 Thread Rolf vandeVaart
I noticed MTT failures from last night and then reproduced this morning on 1.8 
branch.  Looks like maybe a double free.  I assume it is related to fixes for 
aborting programs. Maybe related to 
https://svn.open-mpi.org/trac/ompi/changeset/32508 but not sure.

[rvandevaart@drossetti-ivy0 environment]$ pwd
/ivylogin/home/rvandevaart/tests/ompi-tests/trunk/ibm/environment
[rvandevaart@drossetti-ivy0 environment]$ mpirun --mca odls_base_verbose 20 -np 
2 abort
[...stuff deleted...]
[drossetti-ivy0.nvidia.com:14953] [[58714,0],0] odls: sending message to tag 30 
on child [[58714,1],0]
[drossetti-ivy0.nvidia.com:14953] [[58714,0],0] odls: sending message to tag 30 
on child [[58714,1],1]
[drossetti-ivy0.nvidia.com:14953] [[58714,0],0] odls: sending message to tag 30 
on child [[58714,1],0]
[drossetti-ivy0.nvidia.com:14953] [[58714,0],0] odls: sending message to tag 30 
on child [[58714,1],1]
**
This program tests MPI_ABORT and generates error messages
ERRORS ARE EXPECTED AND NORMAL IN THIS PROGRAM!!
**
--
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
with errorcode 3.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--
[drossetti-ivy0.nvidia.com:14953] [[58714,0],0] odls:wait_local_proc child 
process [[58714,1],0] pid 14955 terminated
[drossetti-ivy0.nvidia.com:14953] [[58714,0],0] odls:waitpid_fired child 
[[58714,1],0] exit code 3
[drossetti-ivy0.nvidia.com:14953] [[58714,0],0] odls:waitpid_fired checking 
abort file /tmp/openmpi-sessions-rvandevaart@drossetti-ivy0_0/58714/1/0/aborted 
for child [[58714,1],0]
[drossetti-ivy0.nvidia.com:14953] [[58714,0],0] odls:waitpid_fired child 
[[58714,1],0] died by call to abort
*** glibc detected *** mpirun: double free or corruption (fasttop): 
0x0130e210 ***

From gdb:
gdb) where
#0  0x7f75ede138e5 in raise () from /lib64/libc.so.6
#1  0x7f75ede1504d in abort () from /lib64/libc.so.6
#2  0x7f75ede517f7 in __libc_message () from /lib64/libc.so.6
#3  0x7f75ede57126 in malloc_printerr () from /lib64/libc.so.6
#4  0x7f75eef9eac4 in odls_base_default_wait_local_proc (pid=14955, 
status=768, cbdata=0x0)
at ../../../../orte/mca/odls/base/odls_base_default_fns.c:2007
#5  0x7f75eef60a78 in do_waitall (options=0) at 
../../orte/runtime/orte_wait.c:554
#6  0x7f75eef60712 in orte_wait_signal_callback (fd=17, event=8, 
arg=0x7f75ef201400) at ../../orte/runtime/orte_wait.c:421
#7  0x7f75eecaecbe in event_signal_closure (base=0x1278370, 
ev=0x7f75ef201400)
at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1081
#8  0x7f75eecaf7e0 in event_process_active_single_queue (base=0x1278370, 
activeq=0x12788f0)
at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1359
#9  0x7f75eecafaca in event_process_active (base=0x1278370)
at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1437
#10 0x7f75eecb0148 in opal_libevent2021_event_base_loop (base=0x1278370, 
flags=1)
at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1645
#11 0x00405572 in orterun (argc=7, argv=0x7fffbdf1dd08) at 
../../../../orte/tools/orterun/orterun.c:1078
#12 0x00403904 in main (argc=7, argv=0x7fffbdf1dd08) at 
../../../../orte/tools/orterun/main.c:13
(gdb) up
#1  0x7f75ede1504d in abort () from /lib64/libc.so.6
(gdb) up
#2  0x7f75ede517f7 in __libc_message () from /lib64/libc.so.6
(gdb) up
#3  0x7f75ede57126 in malloc_printerr () from /lib64/libc.so.6
(gdb) up
#4  0x7f75eef9eac4 in odls_base_default_wait_local_proc (pid=14955, 
status=768, cbdata=0x0)
at ../../../../orte/mca/odls/base/odls_base_default_fns.c:2007
2007free(abortfile);
(gdb) print abortfile
$1 = 0x130e210 ""
(gdb) 
---
This email message is for the sole use of the intended recipient(s) and may 
contain
confidential information.  Any unauthorized review, use, disclosure or 
distribution
is prohibited.  If you are not the intended recipient, please contact the 
sender by
reply email and destroy all copies of the original message.
---


Re: [OMPI devel] trunk hang when nodes have similar but private network

2014-08-13 Thread Jeff Squyres (jsquyres)
I think this is expected behavior.

If you have networks that you need Open MPI to ignore (e.g., a private network 
that *looks* reachable between multiple servers -- because the interfaces are 
on the same subnet -- but actually *isn't*), then the include/exclude mechanism 
is the right way to exclude them.

That being said, I'm not sure why the behavior is different between trunk and 
v1.8.


On Aug 13, 2014, at 1:41 AM, Gilles Gouaillardet 
 wrote:

> Folks,
> 
> i noticed mpirun (trunk) hangs when running any mpi program on two nodes
> *and* each node has a private network with the same ip
> (in my case, each node has a private network to a MIC)
> 
> in order to reproduce the problem, you can simply run (as root) on the
> two compute nodes
> brctl addbr br0
> ifconfig br0 192.168.255.1 netmask 255.255.255.0
> 
> mpirun will hang
> 
> a workaround is to add --mca btl_tcp_if_include eth0
> 
> v1.8 does not hang in this case
> 
> Cheers,
> 
> Gilles
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/08/15623.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] [OMPI users] OpenMPI fails with np > 65

2014-08-13 Thread Lenny Verkhovsky
Hi,
I needed the following commit

r30875 | vasily | 2014-02-27 13:29:47 +0200 (Thu, 27 Feb 2014) | 3 lines
OPENIB BTL/CONNECT: Add support for AF_IB addressing in rdmacm.

Following Gilles's mail about the known #4857 issue I got the update, and now I can run
with more than 65 hosts.
( thanks,  Gilles )

Since I am facing another problem, I probably should try 1.8rc as you suggested.
Thanks.
Lenny Verkhovsky
SW Engineer,  Mellanox Technologies
www.mellanox.com

Office:+972 74 712 9244
Mobile:  +972 54 554 0233
Fax:+972 72 257 9400

From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Joshua Ladd
Sent: Wednesday, August 13, 2014 4:20 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] [OMPI users] OpenMPI fails with np > 65

Lenny,
Is there any particular reason that you're using the trunk? The reason I ask is 
because the trunk is in an unusually high state of flux at the moment with a 
major move underway. If you're trying to use OMPI for production grade runs, I 
would strongly advise picking up one of the stable releases in the 1.8.x 
series. At this time, 1.8.1 is available as the most current stable release. The 
1.8.2rc3 prerelease candidate is also available:

http://www.open-mpi.org/software/ompi/v1.8/
Best,
Josh




On Wed, Aug 13, 2014 at 5:19 AM, Gilles Gouaillardet 
> wrote:
Lenny,

that looks related to #4857 which has been fixed in trunk since r32517

could you please update your openmpi library and try again ?

Gilles


On 2014/08/13 17:00, Lenny Verkhovsky wrote:

Following Jeff's suggestion adding devel mailing list.



Hi All,

I am currently facing a strange situation where I can't run OMPI on more than 65 
nodes.

It seems like an environmental issue that does not allow me to open more 
connections.

Any ideas ?

Log attached, more info below in the mail.



Running OMPI from trunk

[node-119.ssauniversal.ssa.kodiak.nx:02996] [[56978,0],65] ORTE_ERROR_LOG: 
Error in file base/ess_base_std_orted.c at line 288



Thanks.

Lenny Verkhovsky

SW Engineer,  Mellanox Technologies

www.mellanox.com





Office:+972 74 712 9244

Mobile:  +972 54 554 0233

Fax:+972 72 257 9400



From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lenny Verkhovsky

Sent: Tuesday, August 12, 2014 1:13 PM

To: Open MPI Users

Subject: Re: [OMPI users] OpenMPI fails with np > 65





Hi,



Config:

./configure --enable-openib-rdmacm-ibaddr --prefix /home/sources/ompi-bin 
--enable-mpirun-prefix-by-default --with-openib=/usr/local --enable-debug 
--disable-openib-connectx-xrc



Run:

/home/sources/ompi-bin/bin/mpirun -np 65 --host 
ko0067,ko0069,ko0070,ko0074,ko0076,ko0079,ko0080,ko0082,ko0085,ko0087,ko0088,ko0090,ko0096,ko0098,ko0099,ko0101,ko0103,ko0107,ko0111,ko0114,ko0116,ko0125,ko0128,ko0134,ko0141,ko0144,ko0145,ko0148,ko0149,ko0150,ko0152,ko0154,ko0156,ko0157,ko0158,ko0162,ko0164,ko0166,ko0168,ko0170,ko0174,ko0178,ko0181,ko0185,ko0190,ko0192,ko0195,ko0197,ko0200,ko0203,ko0205,ko0207,ko0209,ko0210,ko0211,ko0213,ko0214,ko0217,ko0218,ko0223,ko0228,ko0229,ko0231,ko0235,ko0237
 --mca btl openib,self  --mca btl_openib_cpc_include rdmacm --mca pml ob1 --mca 
btl_openib_if_include mthca0:1 --mca plm_base_verbose 5 --debug-daemons 
hostname 2>&1|tee > /tmp/mpi.log



Environment:

 According to the attached log it's rsh environment





Output attached



Notes:

The problem is always with the last node, 64 connections work, 65 connections 
fail.

node-119.ssauniversal.ssa.kodiak.nx == ko0237



mpi.log line 1034:

--

An invalid value was supplied for an enum variable.

  Variable : orte_debug_daemons

  Value: 1,1

  Valid values : 0: f|false|disabled, 1: t|true|enabled

--



mpi.log line 1059:

[node-119.ssauniversal.ssa.kodiak.nx:02996] [[56978,0],65] ORTE_ERROR_LOG: 
Error in file base/ess_base_std_orted.c at line 288







Lenny Verkhovsky

SW Engineer,  Mellanox Technologies

www.mellanox.com





Office:+972 74 712 9244

Mobile:  +972 54 554 0233

Fax:+972 72 257 9400



From: users [mailto:users-boun...@open-mpi.org

] On Behalf Of Ralph Castain

Sent: Monday, August 11, 2014 4:53 PM

To: Open MPI Users

Subject: Re: [OMPI users] OpenMPI fails with np > 65



Okay, let's start with the basics :-)



How was this configured? What environment are you running in (rsh, slurm, ??)? 
If you configured --enable-debug, then please run it with



--mca plm_base_verbose 5 --debug-daemons



and send the 

Re: [OMPI devel] [OMPI users] OpenMPI fails with np > 65

2014-08-13 Thread Joshua Ladd
Lenny,

Is there any particular reason that you're using the trunk? The reason I
ask is because the trunk is in an unusually high state of flux at the
moment with a major move underway. If you're trying to use OMPI for
production grade runs, I would strongly advise picking up one of the stable
releases in the 1.8.x series. At this time, 1.8.1 is available as the most
current stable release. The 1.8.2rc3 prerelease candidate is also available:

http://www.open-mpi.org/software/ompi/v1.8/

Best,

Josh






On Wed, Aug 13, 2014 at 5:19 AM, Gilles Gouaillardet <
gilles.gouaillar...@iferc.org> wrote:

>  Lenny,
>
> that looks related to #4857 which has been fixed in trunk since r32517
>
> could you please update your openmpi library and try again ?
>
> Gilles
>
>
> On 2014/08/13 17:00, Lenny Verkhovsky wrote:
>
> Following Jeff's suggestion adding devel mailing list.
>
> Hi All,
> I am currently facing strange situation that I can't run OMPI on more than 65 
> nodes.
> It seems like environmental issue that does not allow me to open more 
> connections.
> Any ideas ?
> Log attached, more info below in the mail.
>
> Running OMPI from trunk
> [node-119.ssauniversal.ssa.kodiak.nx:02996] [[56978,0],65] ORTE_ERROR_LOG: 
> Error in file base/ess_base_std_orted.c at line 288
>
> Thanks.
> Lenny Verkhovsky
> SW Engineer,  Mellanox Technologies
> www.mellanox.com 
>
>
> Office:+972 74 712 9244
> Mobile:  +972 54 554 0233
> Fax:+972 72 257 9400
>
> From: users [mailto:users-boun...@open-mpi.org ] 
> On Behalf Of Lenny Verkhovsky
> Sent: Tuesday, August 12, 2014 1:13 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI fails with np > 65
>
>
> Hi,
>
> Config:
> ./configure --enable-openib-rdmacm-ibaddr --prefix /home/sources/ompi-bin 
> --enable-mpirun-prefix-by-default --with-openib=/usr/local --enable-debug 
> --disable-openib-connectx-xrc
>
> Run:
> /home/sources/ompi-bin/bin/mpirun -np 65 --host 
> ko0067,ko0069,ko0070,ko0074,ko0076,ko0079,ko0080,ko0082,ko0085,ko0087,ko0088,ko0090,ko0096,ko0098,ko0099,ko0101,ko0103,ko0107,ko0111,ko0114,ko0116,ko0125,ko0128,ko0134,ko0141,ko0144,ko0145,ko0148,ko0149,ko0150,ko0152,ko0154,ko0156,ko0157,ko0158,ko0162,ko0164,ko0166,ko0168,ko0170,ko0174,ko0178,ko0181,ko0185,ko0190,ko0192,ko0195,ko0197,ko0200,ko0203,ko0205,ko0207,ko0209,ko0210,ko0211,ko0213,ko0214,ko0217,ko0218,ko0223,ko0228,ko0229,ko0231,ko0235,ko0237
>  --mca btl openib,self  --mca btl_openib_cpc_include rdmacm --mca pml ob1 
> --mca btl_openib_if_include mthca0:1 --mca plm_base_verbose 5 --debug-daemons 
> hostname 2>&1|tee > /tmp/mpi.log
>
> Environment:
>  According to the attached log it's rsh environment
>
>
> Output attached
>
> Notes:
> The problem is always with the last node, 64 connections work, 65 connections 
> fail.
> node-119.ssauniversal.ssa.kodiak.nx == ko0237
>
> mpi.log line 1034:
> --
> An invalid value was supplied for an enum variable.
>   Variable : orte_debug_daemons
>   Value: 1,1
>   Valid values : 0: f|false|disabled, 1: t|true|enabled
> --
>
> mpi.log line 1059:
> [node-119.ssauniversal.ssa.kodiak.nx:02996] [[56978,0],65] ORTE_ERROR_LOG: 
> Error in file base/ess_base_std_orted.c at line 288
>
>
>
> Lenny Verkhovsky
> SW Engineer,  Mellanox Technologies
> www.mellanox.com 
>
>
> Office:+972 74 712 9244
> Mobile:  +972 54 554 0233
> Fax:+972 72 257 9400
>
> From: users [mailto:users-boun...@open-mpi.org 
> ] On Behalf Of Ralph Castain
> Sent: Monday, August 11, 2014 4:53 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI fails with np > 65
>
> Okay, let's start with the basics :-)
>
> How was this configured? What environment are you running in (rsh, slurm, 
> ??)? If you configured --enable-debug, then please run it with
>
> --mca plm_base_verbose 5 --debug-daemons
>
> and send the output
>
>
> On Aug 11, 2014, at 12:07 AM, Lenny Verkhovsky 
>  > wrote:
>
> I don't think so,
> It's always the 66th node, even if I swap between 65th and 66th
> I also get the same error when setting np=66, while having only 65 hosts in 
> hostfile
> (I am using only tcp btl )
>
>
> Lenny Verkhovsky
> SW Engineer,  Mellanox Technologieswww.mellanox.com 
> 
>
>
> Office:+972 74 712 9244
> Mobile:  +972 54 554 0233
> Fax:+972 72 257 9400
>
> From: users [mailto:users-boun...@open-mpi.org 
> ] On Behalf Of Ralph Castain
> Sent: Monday, August 11, 2014 1:07 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI fails with np > 65
>
> Looks to me like your 65th host is missing the dstore library - is it 
> 

Re: [OMPI devel] trunk hang when nodes have similar but private network

2014-08-13 Thread Ralph Castain
Afraid I can't get to this until next week, but will look at it then



On Tue, Aug 12, 2014 at 10:41 PM, Gilles Gouaillardet <
gilles.gouaillar...@iferc.org> wrote:

> Folks,
>
> i noticed mpirun (trunk) hangs when running any mpi program on two nodes
> *and* each node has a private network with the same ip
> (in my case, each node has a private network to a MIC)
>
> in order to reproduce the problem, you can simply run (as root) on the
> two compute nodes
> brctl addbr br0
> ifconfig br0 192.168.255.1 netmask 255.255.255.0
>
> mpirun will hang
>
> a workaround is to add --mca btl_tcp_if_include eth0
>
> v1.8 does not hang in this case
>
> Cheers,
>
> Gilles
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/08/15623.php
>


Re: [hwloc-devel] patch to execute command when using hwloc-bind --get

2014-08-13 Thread Jeff Squyres (jsquyres)
How about displaying a warning if --get is specified but a command to execute 
is also specified?

Sent from my phone. No type good. 

> On Aug 13, 2014, at 5:22 AM, "John Donners"  wrote:
> 
> Hi Brice,
> 
>> On 13-08-14 10:46, Brice Goglin wrote:
>> Hello,
>> 
>> Can you elaborate how you would use this?
>> 
>> The intent of the current behavior is:
>> 1) if the target task already runs, use "hwloc-bind --pid <pid> --get"
>> without any command since you have pid already
> this behaviour stays the same with the patch.
>> 2) you just want to check whether the upcoming binding works, so you
>> just do something like "mpirun  hwloc-bind --get" or "srun ...
>> hwloc-bind --get"
>> 
>> Do you want a mode that creates a new task and displays its binding?
> indeed.
>> Looks similar to passing "hwloc-bind --get ; newtask" to srun or mpirun ?
> the syntax would then be something like:
> 
> mpirun -n 2 bash -c "hwloc-bind --get ; newtask"
> 
> it's possible, but quite ugly.
> 
> With regards,
> John
>> 
>> Brice
>> 
>> 
>> 
>> Le 13/08/2014 10:38, John Donners a écrit :
>>> Hi,
>>> 
>>> I was somewhat surprised to notice that hwloc-bind doesn't
>>> execute the command if the --get option is used. This could
>>> come in handy to check the binding set by other programs,
>>> e.g. SLURM, mpirun or taskset. The following patch would
>>> change this.
>>> 
>>> --- hwloc-1.9/utils/hwloc-bind.c2014-03-17 16:42:36.0 +0100
>>> +++ hwloc-1.9/utils/hwloc-bind.c.getproc2014-08-13
>>> 10:24:17.832682576 +0200
>>> @@ -301,7 +301,9 @@
>>>  else
>>>printf("%s\n", s);
>>>  free(s);
>>> -return EXIT_SUCCESS;
>>> +if (get_last_cpu_location) {
>>> +  return EXIT_SUCCESS;
>>> +}
>>>}
>>>  if (got_membind) {
>>> 
>>> Please consider this change for the next release of hwloc.
>>> 
>>> With regards,
>>> John
>>> 
>>> 
>>> | John Donners | Senior adviseur | Operations, Support & Development |
>>> SURFsara | Science Park 140 | 1098 XG Amsterdam | Nederland |
>>> T (31)6 19039023 | john.donn...@surfsara.nl | www.surfsara.nl |
>>> 
> Present on | Mon | Tue | Wed | Thu | Fri
>>> 
>>> ___
>>> hwloc-devel mailing list
>>> hwloc-de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/hwloc-devel/2014/08/4171.php
>> ___
>> hwloc-devel mailing list
>> hwloc-de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/hwloc-devel/2014/08/4172.php
> 
> 
> -- 
> Try the SURFsara app! Available for iOS and Android.
> 
> | John Donners | Senior adviseur | Operations, Support & Development | 
> SURFsara | Science Park 140 | 1098 XG Amsterdam | Nederland |
> T (31)6 19039023 | john.donn...@surfsara.nl | www.surfsara.nl |
> 
> Present on | Mon | Tue | Wed | Thu | Fri
> 
> ___
> hwloc-devel mailing list
> hwloc-de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/hwloc-devel/2014/08/4172.php


Re: [OMPI devel] [OMPI users] OpenMPI fails with np > 65

2014-08-13 Thread Gilles Gouaillardet
Lenny,

that looks related to #4857 which has been fixed in trunk since r32517

could you please update your openmpi library and try again ?

Gilles

On 2014/08/13 17:00, Lenny Verkhovsky wrote:
> Following Jeff's suggestion adding devel mailing list.
>
> Hi All,
> I am currently facing strange situation that I can't run OMPI on more than 65 
> nodes.
> It seems like environmental issue that does not allow me to open more 
> connections.
> Any ideas ?
> Log attached, more info below in the mail.
>
> Running OMPI from trunk
> [node-119.ssauniversal.ssa.kodiak.nx:02996] [[56978,0],65] ORTE_ERROR_LOG: 
> Error in file base/ess_base_std_orted.c at line 288
>
> Thanks.
> Lenny Verkhovsky
> SW Engineer,  Mellanox Technologies
> www.mellanox.com
>
> Office:+972 74 712 9244
> Mobile:  +972 54 554 0233
> Fax:+972 72 257 9400
>
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lenny Verkhovsky
> Sent: Tuesday, August 12, 2014 1:13 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI fails with np > 65
>
>
> Hi,
>
> Config:
> ./configure --enable-openib-rdmacm-ibaddr --prefix /home/sources/ompi-bin 
> --enable-mpirun-prefix-by-default --with-openib=/usr/local --enable-debug 
> --disable-openib-connectx-xrc
>
> Run:
> /home/sources/ompi-bin/bin/mpirun -np 65 --host 
> ko0067,ko0069,ko0070,ko0074,ko0076,ko0079,ko0080,ko0082,ko0085,ko0087,ko0088,ko0090,ko0096,ko0098,ko0099,ko0101,ko0103,ko0107,ko0111,ko0114,ko0116,ko0125,ko0128,ko0134,ko0141,ko0144,ko0145,ko0148,ko0149,ko0150,ko0152,ko0154,ko0156,ko0157,ko0158,ko0162,ko0164,ko0166,ko0168,ko0170,ko0174,ko0178,ko0181,ko0185,ko0190,ko0192,ko0195,ko0197,ko0200,ko0203,ko0205,ko0207,ko0209,ko0210,ko0211,ko0213,ko0214,ko0217,ko0218,ko0223,ko0228,ko0229,ko0231,ko0235,ko0237
>  --mca btl openib,self  --mca btl_openib_cpc_include rdmacm --mca pml ob1 
> --mca btl_openib_if_include mthca0:1 --mca plm_base_verbose 5 --debug-daemons 
> hostname 2>&1|tee > /tmp/mpi.log
>
> Environment:
>  According to the attached log it's rsh environment
>
>
> Output attached
>
> Notes:
> The problem is always with the last node, 64 connections work, 65 connections 
> fail.
> node-119.ssauniversal.ssa.kodiak.nx == ko0237
>
> mpi.log line 1034:
> --
> An invalid value was supplied for an enum variable.
>   Variable : orte_debug_daemons
>   Value: 1,1
>   Valid values : 0: f|false|disabled, 1: t|true|enabled
> --
>
> mpi.log line 1059:
> [node-119.ssauniversal.ssa.kodiak.nx:02996] [[56978,0],65] ORTE_ERROR_LOG: 
> Error in file base/ess_base_std_orted.c at line 288
>
>
>
> Lenny Verkhovsky
> SW Engineer,  Mellanox Technologies
> www.mellanox.com
>
> Office:+972 74 712 9244
> Mobile:  +972 54 554 0233
> Fax:+972 72 257 9400
>
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: Monday, August 11, 2014 4:53 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI fails with np > 65
>
> Okay, let's start with the basics :-)
>
> How was this configured? What environment are you running in (rsh, slurm, 
> ??)? If you configured --enable-debug, then please run it with
>
> --mca plm_base_verbose 5 --debug-daemons
>
> and send the output
>
>
> On Aug 11, 2014, at 12:07 AM, Lenny Verkhovsky 
> > wrote:
>
> I don't think so,
> It's always the 66th node, even if I swap between 65th and 66th
> I also get the same error when setting np=66, while having only 65 hosts in 
> hostfile
> (I am using only tcp btl )
>
>
> Lenny Verkhovsky
> SW Engineer,  Mellanox Technologies
> www.mellanox.com
>
> Office:+972 74 712 9244
> Mobile:  +972 54 554 0233
> Fax:+972 72 257 9400
>
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: Monday, August 11, 2014 1:07 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI fails with np > 65
>
> Looks to me like your 65th host is missing the dstore library - is it 
> possible you don't have your paths set correctly on all hosts in your 
> hostfile?
>
>
> On Aug 10, 2014, at 1:13 PM, Lenny Verkhovsky 
> > wrote:
>
>
> Hi all,
>
> Trying to run OpenMPI ( trunk Revision: 32428 ) I faced the problem running 
> OMPI with more than 65 procs.
> It looks like MPI fails to open the 66th connection, even when running `hostname` 
> over tcp.
> It also seems to be unrelated to a specific host.
> All hosts are Ubuntu 12.04.1 LTS
>
> mpirun -np 66 --hostfile /proj/SSA/Mellanox/tmp//20140810_070156_hostfile.txt 
> --mca btl tcp,self hostname
> [nodename] [[4452,0],65] ORTE_ERROR_LOG: Error in file 
> base/ess_base_std_orted.c at line 288
>
> ...
> It looks like an environment issue, but I can't find any related limit.
> Any 

Re: [hwloc-devel] patch to execute command when using hwloc-bind --get

2014-08-13 Thread John Donners

Hi Brice,

On 13-08-14 10:46, Brice Goglin wrote:

Hello,

Can you elaborate how you would use this?

The intent of the current behavior is:
1) if the target task already runs, use "hwloc-bind --pid <pid> --get"
without any command since you have pid already

this behaviour stays the same with the patch.

2) you just want to check whether the upcoming binding works, so you
just do something like "mpirun  hwloc-bind --get" or "srun ...
hwloc-bind --get"

Do you want a mode that creates a new task and displays its binding?

indeed.

Looks similar to passing "hwloc-bind --get ; newtask" to srun or mpirun ?

the syntax would then be something like:

mpirun -n 2 bash -c "hwloc-bind --get ; newtask"

it's possible, but quite ugly.

With regards,
John


Brice



Le 13/08/2014 10:38, John Donners a écrit :

Hi,

I was somewhat surprised to notice that hwloc-bind doesn't
execute the command if the --get option is used. This could
come in handy to check the binding set by other programs,
e.g. SLURM, mpirun or taskset. The following patch would
change this.

--- hwloc-1.9/utils/hwloc-bind.c2014-03-17 16:42:36.0 +0100
+++ hwloc-1.9/utils/hwloc-bind.c.getproc2014-08-13
10:24:17.832682576 +0200
@@ -301,7 +301,9 @@
  else
printf("%s\n", s);
  free(s);
-return EXIT_SUCCESS;
+if (get_last_cpu_location) {
+  return EXIT_SUCCESS;
+}
}
  
if (got_membind) {


Please consider this change for the next release of hwloc.

With regards,
John


| John Donners | Senior adviseur | Operations, Support & Development |
SURFsara | Science Park 140 | 1098 XG Amsterdam | Nederland |
T (31)6 19039023 | john.donn...@surfsara.nl | www.surfsara.nl |

Present on | Mon | Tue | Wed | Thu | Fri

___
hwloc-devel mailing list
hwloc-de...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel
Link to this post:
http://www.open-mpi.org/community/lists/hwloc-devel/2014/08/4171.php

___
hwloc-devel mailing list
hwloc-de...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel
Link to this post: 
http://www.open-mpi.org/community/lists/hwloc-devel/2014/08/4172.php



--
Try the SURFsara app! Available for iOS and Android.

| John Donners | Senior adviseur | Operations, Support & Development | SURFsara 
| Science Park 140 | 1098 XG Amsterdam | Nederland |
T (31)6 19039023 | john.donn...@surfsara.nl | www.surfsara.nl |

Present on | Mon | Tue | Wed | Thu | Fri



Re: [hwloc-devel] patch to execute command when using hwloc-bind --get

2014-08-13 Thread Brice Goglin
Hello,

Can you elaborate how you would use this?

The intent of the current behavior is:
1) if the target task already runs, use "hwloc-bind --pid <pid> --get"
without any command since you have pid already
2) you just want to check whether the upcoming binding works, so you
just do something like "mpirun  hwloc-bind --get" or "srun ...
hwloc-bind --get"

Do you want a mode that creates a new task and displays its binding?
Looks similar to passing "hwloc-bind --get ; newtask" to srun or mpirun ?

Brice



On 13/08/2014 10:38, John Donners wrote:
> Hi,
>
> I was somewhat surprised to notice that hwloc-bind doesn't
> execute the command if the --get option is used. This could
> come in handy to check the binding set by other programs,
> e.g. SLURM, mpirun or taskset. The following patch would
> change this.
>
> --- hwloc-1.9/utils/hwloc-bind.c    2014-03-17 16:42:36.0 +0100
> +++ hwloc-1.9/utils/hwloc-bind.c.getproc    2014-08-13 10:24:17.832682576 +0200
> @@ -301,7 +301,9 @@
>      else
>        printf("%s\n", s);
>      free(s);
> -    return EXIT_SUCCESS;
> +    if (get_last_cpu_location) {
> +      return EXIT_SUCCESS;
> +    }
>    }
>  
>    if (got_membind) {
>
> Please consider this change for the next release of hwloc.
>
> With regards,
> John
>
>
> | John Donners | Senior adviseur | Operations, Support & Development |
> SURFsara | Science Park 140 | 1098 XG Amsterdam | Nederland |
> T (31)6 19039023 | john.donn...@surfsara.nl | www.surfsara.nl |
>
> Present on | Mon | Tue | Wed | Thu | Fri
>
> ___
> hwloc-devel mailing list
> hwloc-de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel
> Link to this post:
> http://www.open-mpi.org/community/lists/hwloc-devel/2014/08/4171.php



[hwloc-devel] patch to execute command when using hwloc-bind --get

2014-08-13 Thread John Donners

Hi,

I was somewhat surprised to notice that hwloc-bind doesn't
execute the command if the --get option is used. This could
come in handy to check the binding set by other programs,
e.g. SLURM, mpirun or taskset. The following patch would
change this.

--- hwloc-1.9/utils/hwloc-bind.c    2014-03-17 16:42:36.0 +0100
+++ hwloc-1.9/utils/hwloc-bind.c.getproc    2014-08-13 10:24:17.832682576 +0200
@@ -301,7 +301,9 @@
     else
       printf("%s\n", s);
     free(s);
-    return EXIT_SUCCESS;
+    if (get_last_cpu_location) {
+      return EXIT_SUCCESS;
+    }
   }
 
   if (got_membind) {
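
For illustration (the program name is made up), the patched tool could then be
used to verify a binding applied by, say, taskset while still launching the
application in the same step:

taskset -c 2,3 hwloc-bind --get ./my_app

which would print the inherited cpuset before exec'ing ./my_app.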

Please consider this change for the next release of hwloc.

With regards,
John


| John Donners | Senior adviseur | Operations, Support & Development | SURFsara 
| Science Park 140 | 1098 XG Amsterdam | Nederland |
T (31)6 19039023 | john.donn...@surfsara.nl | www.surfsara.nl |

Present on | Mon | Tue | Wed | Thu | Fri



Re: [OMPI devel] [OMPI users] OpenMPI fails with np > 65

2014-08-13 Thread Lenny Verkhovsky
Following Jeff's suggestion, adding the devel mailing list.

Hi All,
I am currently facing a strange situation where I can't run OMPI on more than 65
nodes.
It seems like an environmental issue that does not allow me to open more
connections.
Any ideas ?
Log attached, more info below in the mail.

Running OMPI from trunk
[node-119.ssauniversal.ssa.kodiak.nx:02996] [[56978,0],65] ORTE_ERROR_LOG: 
Error in file base/ess_base_std_orted.c at line 288

Thanks.
Lenny Verkhovsky
SW Engineer,  Mellanox Technologies
www.mellanox.com

Office: +972 74 712 9244
Mobile: +972 54 554 0233
Fax: +972 72 257 9400

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lenny Verkhovsky
Sent: Tuesday, August 12, 2014 1:13 PM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI fails with np > 65


Hi,

Config:
./configure --enable-openib-rdmacm-ibaddr --prefix /home/sources/ompi-bin 
--enable-mpirun-prefix-by-default --with-openib=/usr/local --enable-debug 
--disable-openib-connectx-xrc

Run:
/home/sources/ompi-bin/bin/mpirun -np 65 --host 
ko0067,ko0069,ko0070,ko0074,ko0076,ko0079,ko0080,ko0082,ko0085,ko0087,ko0088,ko0090,ko0096,ko0098,ko0099,ko0101,ko0103,ko0107,ko0111,ko0114,ko0116,ko0125,ko0128,ko0134,ko0141,ko0144,ko0145,ko0148,ko0149,ko0150,ko0152,ko0154,ko0156,ko0157,ko0158,ko0162,ko0164,ko0166,ko0168,ko0170,ko0174,ko0178,ko0181,ko0185,ko0190,ko0192,ko0195,ko0197,ko0200,ko0203,ko0205,ko0207,ko0209,ko0210,ko0211,ko0213,ko0214,ko0217,ko0218,ko0223,ko0228,ko0229,ko0231,ko0235,ko0237
 --mca btl openib,self  --mca btl_openib_cpc_include rdmacm --mca pml ob1 --mca 
btl_openib_if_include mthca0:1 --mca plm_base_verbose 5 --debug-daemons 
hostname 2>&1|tee > /tmp/mpi.log

Environment:
 According to the attached log, it's an rsh environment.
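
As a generic aside (none of the commands below are taken from the attached
log): with an rsh/ssh launcher, mpirun opens one remote session per daemon, so
limits along these lines are worth ruling out:

ulimit -n     # open file descriptors available to the shell running mpirun
ulimit -u     # maximum number of user processes
ompi_info --param plm rsh --level 9 | grep num_concurrent   # concurrent ssh sessions OMPI will start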


Output attached

Notes:
The problem is always with the last node: 64 connections work, 65 connections
fail.
node-119.ssauniversal.ssa.kodiak.nx == ko0237

mpi.log line 1034:
--
An invalid value was supplied for an enum variable.
  Variable : orte_debug_daemons
  Value: 1,1
  Valid values : 0: f|false|disabled, 1: t|true|enabled
--

mpi.log line 1059:
[node-119.ssauniversal.ssa.kodiak.nx:02996] [[56978,0],65] ORTE_ERROR_LOG: 
Error in file base/ess_base_std_orted.c at line 288



Lenny Verkhovsky
SW Engineer,  Mellanox Technologies
www.mellanox.com

Office: +972 74 712 9244
Mobile: +972 54 554 0233
Fax: +972 72 257 9400

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Monday, August 11, 2014 4:53 PM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI fails with np > 65

Okay, let's start with the basics :-)

How was this configured? What environment are you running in (rsh, slurm, ??)? 
If you configured --enable-debug, then please run it with

--mca plm_base_verbose 5 --debug-daemons

and send the output


On Aug 11, 2014, at 12:07 AM, Lenny Verkhovsky wrote:

I don't think so.
It's always the 66th node, even if I swap the 65th and 66th hosts.
I also get the same error when setting np=66 while having only 65 hosts in the
hostfile.
(I am using only the tcp btl.)


Lenny Verkhovsky
SW Engineer,  Mellanox Technologies
www.mellanox.com

Office: +972 74 712 9244
Mobile: +972 54 554 0233
Fax: +972 72 257 9400

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Monday, August 11, 2014 1:07 AM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI fails with np > 65

Looks to me like your 65th host is missing the dstore library - is it possible 
you don't have your paths set correctly on all hosts in your hostfile?
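
A quick way to check that (hostfile name illustrative) is to confirm every host
resolves the same orted and library path over a non-interactive shell, e.g.:

for h in $(cat hostfile.txt); do ssh $h 'hostname; which orted; echo $LD_LIBRARY_PATH'; done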


On Aug 10, 2014, at 1:13 PM, Lenny Verkhovsky wrote:


Hi all,

Trying to run OpenMPI (trunk Revision: 32428), I ran into a problem running
OMPI with more than 65 procs.
It looks like MPI fails to open the 66th connection, even when running `hostname`
over tcp.
It also seems to be unrelated to a specific host.
All hosts are Ubuntu 12.04.1 LTS

mpirun -np 66 --hostfile /proj/SSA/Mellanox/tmp//20140810_070156_hostfile.txt 
--mca btl tcp,self hostname
[nodename] [[4452,0],65] ORTE_ERROR_LOG: Error in file 
base/ess_base_std_orted.c at line 288

...
It looks like an environment issue, but I can't find any related limit.
Any ideas ?
Thanks.
Lenny Verkhovsky
SW Engineer,  Mellanox Technologies
www.mellanox.com

Office: +972 74 712 9244
Mobile: +972 54 554 0233
Fax: +972 72 257 9400

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 

Re: [OMPI devel] Grammar error in git master: 'You job will now abort'

2014-08-13 Thread Gilles Gouaillardet
Thanks Christopher,

this has been fixed in the trunk with r32520

Cheers,

Gilles

On 2014/08/13 14:49, Christopher Samuel wrote:
> Hi all,
>
> We spotted this in 1.6.5 and git grep shows it's fixed in the
> v1.8 branch but in master it's still there:
>
> samuel@haswell:~/Code/OMPI/ompi-svn-mirror$ git grep -n 'You job will now 
> abort'
> orte/tools/orterun/help-orterun.txt:679:You job will now abort.
> samuel@haswell:~/Code/OMPI/ompi-svn-mirror$ 
>
> I'm using https://github.com/open-mpi/ompi-svn-mirror.git so
> let me know if I should be using something else now.
>
> cheers,
> Chris



[OMPI devel] Grammar error in git master: 'You job will now abort'

2014-08-13 Thread Christopher Samuel
Hi all,

We spotted this in 1.6.5 and git grep shows it's fixed in the
v1.8 branch but in master it's still there:

samuel@haswell:~/Code/OMPI/ompi-svn-mirror$ git grep -n 'You job will now abort'
orte/tools/orterun/help-orterun.txt:679:You job will now abort.
samuel@haswell:~/Code/OMPI/ompi-svn-mirror$ 

I'm using https://github.com/open-mpi/ompi-svn-mirror.git so
let me know if I should be using something else now.

cheers,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci



[OMPI devel] trunk hang when nodes have similar but private network

2014-08-13 Thread Gilles Gouaillardet
Folks,

I noticed mpirun (trunk) hangs when running any MPI program on two nodes
*and* each node has a private network with the same IP
(in my case, each node has a private network to a MIC).

In order to reproduce the problem, you can simply run (as root) on the
two compute nodes:
brctl addbr br0
ifconfig br0 192.168.255.1 netmask 255.255.255.0

mpirun will hang

A workaround is to add --mca btl_tcp_if_include eth0.
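
An alternative along the same lines (untested here) is to exclude the offending
bridge interface instead of whitelisting eth0, either on the command line or
once per user in the MCA parameter file:

mpirun --mca btl_tcp_if_exclude br0,lo -np 2 ./a.out
echo 'btl_tcp_if_exclude = br0,lo' >> $HOME/.openmpi/mca-params.conf

(lo is listed explicitly because setting btl_tcp_if_exclude replaces the
default, which normally excludes the loopback interface; ./a.out stands in for
the actual MPI program.)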

v1.8 does not hang in this case

Cheers,

Gilles