Re: [OMPI devel] [1.8.2rc4] build failure with --enable-osx-builtin-atomics

2014-08-13 Thread Ralph Castain
Thanks Paul - fixed in r32530 On Wed, Aug 13, 2014 at 2:42 PM, Paul Hargrove wrote: > When configured with --enable-osx-builtin-atomics > > Making all in asm > CC asm.lo > In file included from >

[OMPI devel] [1.8.2rc4] OSHMEM fortran bindings with bad compilers

2014-08-13 Thread Paul Hargrove
The following is NOT a bug report. This is just an observation that may deserve some text in the README. I've reported issues in the past with some Fortran compilers (mostly older XLC and PGI) which either cannot build the "use mpi_f08" module, or cannot correctly link to it (and sometimes this

[OMPI devel] [1.8.2rc4] build failure with --enable-osx-builtin-atomics

2014-08-13 Thread Paul Hargrove
When configured with --enable-osx-builtin-atomics Making all in asm CC asm.lo In file included from /Users/Paul/OMPI/openmpi-1.8.2rc4-macos10.8-x86-clang-atomics/openmpi-1.8.2rc4/opal/asm/asm.c:21:

Re: [OMPI devel] trunk hang when nodes have similar but private network

2014-08-13 Thread George Bosilca
The trunk is [almost] right. It has nice error handling, and a bunch of other features. However, part of this bug report is troubling. We might want to check why it doesn't exhaust all possible addresses before giving up on an endpoint. George. PS: I'm not saying that we should back-port

[OMPI devel] 1.8.4rc4 is out

2014-08-13 Thread Jeff Squyres (jsquyres)
Please test! Ralph would like to release after the teleconf next Tuesday: http://www.open-mpi.org/software/ompi/v1.8/ Changes since last rc: - Fix cascading/over-quoting in some cases with the rsh/ssh-based launcher. Thanks to multiple users for raising the issue. - Properly add support

Re: [OMPI devel] trunk hang when nodes have similar but private network

2014-08-13 Thread Jeff Squyres (jsquyres)
Paul: I think this is a slippery slope. As I understand it, these private/on-host IP addresses are generated somewhat randomly (e.g., for on-host VM networking -- I don't know if the IP's for Phi on-host networking are pseudo-random or [effectively] fixed). So you might end up in a situation

Re: [OMPI devel] trunk hang when nodes have similar but private network

2014-08-13 Thread Jeff Squyres (jsquyres)
On Aug 13, 2014, at 12:52 PM, George Bosilca wrote: > There are many differences between the trunk and 1.8 regarding the TCP BTL. > The major one I remember is that the TCP in the trunk is reporting errors > to the upper level via the callbacks attached to fragments,

Re: [OMPI devel] [OMPI users] OpenMPI fails with np > 65

2014-08-13 Thread Lenny Verkhovsky
Thanks Josh, Then I guess I will solve it internally ☺ Lenny Verkhovsky SW Engineer, Mellanox Technologies www.mellanox.com Office:+972 74 712 9244 Mobile: +972 54 554 0233 Fax:+972 72 257 9400 From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of

Re: [OMPI devel] trunk hang when nodes have similar but private network

2014-08-13 Thread Paul Hargrove
I think that in this case one *could* add logic that would disqualify the subnet because every compute node in the job has the SAME address. In fact, any subnet on which two or more compute nodes have the same address must be suspect. If this logic were introduced, the 127.0.0.1 loopback address
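The check described above can be sketched as follows. This is only an illustration of the idea, not Open MPI code: the function name `suspect_subnets`, the input layout (hostname mapped to a list of `address/prefix` strings), and the sample addresses are all assumptions made for the example.

```python
from collections import defaultdict
import ipaddress


def suspect_subnets(node_interfaces):
    """Return the subnets on which two or more nodes report the same address.

    node_interfaces is a hypothetical mapping: hostname -> list of
    "address/prefix" strings, one per local interface.
    """
    # (subnet, address) -> set of hostnames that claim that address
    owners = defaultdict(set)
    for host, interfaces in node_interfaces.items():
        for cidr in interfaces:
            iface = ipaddress.ip_interface(cidr)
            owners[(iface.network, iface.ip)].add(host)

    # A subnet is suspect as soon as any address on it has more than one owner.
    return {net for (net, _addr), hosts in owners.items() if len(hosts) >= 2}


if __name__ == "__main__":
    # Made-up example: both nodes carry the same private bridge address.
    nodes = {
        "node0": ["10.0.0.1/24", "192.168.255.1/24", "127.0.0.1/8"],
        "node1": ["10.0.0.2/24", "192.168.255.1/24", "127.0.0.1/8"],
    }
    for net in sorted(suspect_subnets(nodes), key=str):
        print("suspect subnet:", net)
```

Under this rule the loopback network is flagged in the same way as the duplicated private network, since every node reports 127.0.0.1.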

Re: [OMPI devel] trunk hang when nodes have similar but private network

2014-08-13 Thread George Bosilca
There are many differences between the trunk and 1.8 regarding the TCP BTL. The major one I remember is that the TCP BTL in the trunk reports errors to the upper level via the callbacks attached to fragments, while the 1.8 TCP BTL doesn't. So, I guess that once a connection to a particular

Re: [OMPI devel] [OMPI users] OpenMPI fails with np > 65

2014-08-13 Thread Joshua Ladd
Ah, I see. That change didn't make it into the release branch (I don't know if it was never CMRed or what, I have a vague recollection of it passing through.) If you need that change, then I recommend checking out the trunk at r30875. This was back when the trunk was in a more stable state.

Re: [OMPI devel] Errors on aborting programs on 1.8 r32515

2014-08-13 Thread Ralph Castain
Fixed - just a lingering free that should have been removed On Wed, Aug 13, 2014 at 8:21 AM, Rolf vandeVaart wrote: > I noticed MTT failures from last night and then reproduced this morning on > 1.8 branch. Looks like maybe a double free. I assume it is related to >

[OMPI devel] Errors on aborting programs on 1.8 r32515

2014-08-13 Thread Rolf vandeVaart
I noticed MTT failures from last night and then reproduced this morning on 1.8 branch. Looks like maybe a double free. I assume it is related to fixes for aborting programs. Maybe related to https://svn.open-mpi.org/trac/ompi/changeset/32508 but not sure. [rvandevaart@drossetti-ivy0

Re: [OMPI devel] trunk hang when nodes have similar but private network

2014-08-13 Thread Jeff Squyres (jsquyres)
I think this is expected behavior. If you have networks that you need Open MPI to ignore (e.g., a private network that *looks* reachable between multiple servers -- because the interfaces are on the same subnet -- but actually *isn't*), then the include/exclude mechanism is the right way to

Re: [OMPI devel] [OMPI users] OpenMPI fails with np > 65

2014-08-13 Thread Lenny Verkhovsky
Hi, I needed the following commit r30875 | vasily | 2014-02-27 13:29:47 +0200 (Thu, 27 Feb 2014) | 3 lines OPENIB BTL/CONNECT: Add support for AF_IB addressing in rdmacm. Following Gilles’s mail about the known #4857 issue, I updated and now I can run with more than 65 hosts. (Thanks, Gilles.)

Re: [OMPI devel] [OMPI users] OpenMPI fails with np > 65

2014-08-13 Thread Joshua Ladd
Lenny, Is there any particular reason that you're using the trunk? The reason I ask is because the trunk is in an unusually high state of flux at the moment with a major move underway. If you're trying to use OMPI for production grade runs, I would strongly advise picking up one of the stable

Re: [OMPI devel] trunk hang when nodes have similar but private network

2014-08-13 Thread Ralph Castain
Afraid I can't get to this until next week, but will look at it then On Tue, Aug 12, 2014 at 10:41 PM, Gilles Gouaillardet < gilles.gouaillar...@iferc.org> wrote: > Folks, > > i noticed mpirun (trunk) hangs when running any mpi program on two nodes > *and* each node has a private network with

Re: [hwloc-devel] patch to execute command when using hwloc-bind --get

2014-08-13 Thread Jeff Squyres (jsquyres)
How about displaying a warning if --get is specified but a command to execute is also specified? Sent from my phone. No type good. > On Aug 13, 2014, at 5:22 AM, "John Donners" wrote: > > Hi Brice, > >> On 13-08-14 10:46, Brice Goglin wrote: >> Hello, >> >> Can

Re: [OMPI devel] [OMPI users] OpenMPI fails with np > 65

2014-08-13 Thread Gilles Gouaillardet
Lenny, that looks related to #4857, which has been fixed in trunk since r32517. Could you please update your openmpi library and try again? Gilles On 2014/08/13 17:00, Lenny Verkhovsky wrote: > Following Jeff's suggestion adding devel mailing list. > > Hi All, > I am currently facing strange

Re: [hwloc-devel] patch to execute command when using hwloc-bind --get

2014-08-13 Thread John Donners
Hi Brice, On 13-08-14 10:46, Brice Goglin wrote: Hello, Can you elaborate how you would use this? The intent of the current behavior is: 1) if the target task already runs, use "hwloc-bind --pid --get" without any command since you have pid already this behaviour stays the same with the

Re: [hwloc-devel] patch to execute command when using hwloc-bind --get

2014-08-13 Thread Brice Goglin
Hello, Can you elaborate how you would use this? The intent of the current behavior is: 1) if the target task already runs, use "hwloc-bind --pid --get" without any command since you have pid already 2) you just want to check whether the upcoming binding works, so you just do something like

[hwloc-devel] patch to execute command when using hwloc-bind --get

2014-08-13 Thread John Donners
Hi, I was somewhat surprised to notice that hwloc-bind doesn't execute the command if the --get option is used. This could come in handy to check the binding set by other programs, e.g. SLURM, mpirun or taskset. The following patch would change this. --- hwloc-1.9/utils/hwloc-bind.c

Re: [OMPI devel] [OMPI users] OpenMPI fails with np > 65

2014-08-13 Thread Lenny Verkhovsky
Following Jeff's suggestion, adding the devel mailing list. Hi All, I am currently facing a strange situation: I can't run OMPI on more than 65 nodes. It seems like an environmental issue that does not allow me to open more connections. Any ideas? Log attached, more info below in the mail. Running

Re: [OMPI devel] Grammar error in git master: 'You job will now abort'

2014-08-13 Thread Gilles Gouaillardet
Thanks Christopher, this has been fixed in the trunk with r32520 Cheers, Gilles On 2014/08/13 14:49, Christopher Samuel wrote: > Hi all, > > We spotted this in 1.6.5 and git grep shows it's fixed in the > v1.8 branch but in master it's still there: > >

[OMPI devel] Grammar error in git master: 'You job will now abort'

2014-08-13 Thread Christopher Samuel
Hi all, We spotted this in 1.6.5 and git grep shows it's fixed in the v1.8 branch but in master it's still there: samuel@haswell:~/Code/OMPI/ompi-svn-mirror$ git grep -n 'You job will now abort' orte/tools/orterun/help-orterun.txt:679:You job will now abort.

[OMPI devel] trunk hang when nodes have similar but private network

2014-08-13 Thread Gilles Gouaillardet
Folks, i noticed mpirun (trunk) hangs when running any mpi program on two nodes *and* each node has a private network with the same ip (in my case, each node has a private network to a MIC) in order to reproduce the problem, you can simply run (as root) on the two compute nodes brctl addbr br0