Re: [OMPI devel] hwloc missing NUMANode object

2017-01-05 Thread Brice Goglin


On 05/01/2017 07:07, Gilles Gouaillardet wrote:
> Brice,
>
> things would be much easier if there were an HWLOC_OBJ_NODE object in
> the topology.
>
> could you please consider backporting the relevant changes from master
> into the v1.11 branch ?
>
> Cheers,
>
> Gilles

Hello
Unfortunately, I can't backport this to 1.x. This is very intrusive and
would break other things.
However, what problem are you actually seeing? There is no NUMA node in
hwloc 1.x when the machine isn't NUMA (or when there's no NUMA support
in the operating system, but that's very unlikely). hwloc master would
show a single NUMA node equivalent to the entire machine, so binding
would be a no-op.
Regards
Brice

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel


Re: [OMPI devel] hwloc missing NUMANode object

2017-01-05 Thread Gilles Gouaillardet
Thanks Brice,

Right now, the user-facing issue is that NUMA binding is requested and there
is no NUMA node, so mpirun aborts.

But you have a good point: we could simply not bind at all in this case instead
of aborting, since the NUMA node would have been the full machine, which would
have made binding a no-op.

Fwiw,
- the default binding policy was changed from socket to numa (for better
out-of-the-box performance on KNL, iirc)
- in btl/sm we call malloc(0) when there is no NUMA node, which causes some
memory corruption. The fix is trivial and I will push it tomorrow.
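
For illustration, a minimal C sketch of the hazard and the trivial guard
(this is not the actual btl/sm code, just the pattern):

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative sketch only -- not the actual btl/sm code. When the
 * topology reports zero NUMA nodes, a size computed from that count is 0,
 * and malloc(0) may return NULL or a pointer that must not be written
 * through. Clamping the count to 1 (treating the whole machine as one
 * "node") sidesteps both problems. */
static int *alloc_per_numa_counts(size_t n_numa)
{
    if (n_numa == 0)
        n_numa = 1;                      /* machine == one NUMA node */
    int *counts = malloc(n_numa * sizeof *counts);
    if (counts != NULL)
        memset(counts, 0, n_numa * sizeof *counts);
    return counts;
}
```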

Cheers,

Gilles


Re: [OMPI devel] hwloc missing NUMANode object

2017-01-05 Thread r...@open-mpi.org
I can add a check to see if we have NUMA; if not, we can fall back to socket
(if present) or just “none”.
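
That fallback chain could look something like this sketch (a hypothetical
helper, not existing Open MPI code; in practice the two counts would come
from hwloc, e.g. hwloc_get_nbobjs_by_type with HWLOC_OBJ_NODE and
HWLOC_OBJ_SOCKET in hwloc 1.x):

```c
/* Hypothetical sketch of the proposed fallback: bind to NUMA node if the
 * topology has any, else to socket if present, else do not bind at all.
 * The counts would come from hwloc_get_nbobjs_by_type(). */
static const char *pick_binding_level(int n_numa_nodes, int n_sockets)
{
    if (n_numa_nodes > 0)
        return "numa";
    if (n_sockets > 0)
        return "socket";
    return "none";
}
```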


Re: [OMPI devel] [2.0.2rc2] FreeBSD-11 run failure

2017-01-05 Thread Howard Pritchard
HI Paul,

I opened

https://github.com/open-mpi/ompi/issues/2665

to track this.

Thanks for reporting this.

Howard



2017-01-04 14:43 GMT-07:00 Paul Hargrove :

> With the 2.0.2rc2 tarball on FreeBSD-11 (i386 or amd64) I am configuring
> with:
>  --prefix=... CC=clang CXX=clang++ --disable-mpi-fortran
>
> I get a failure running ring_c:
>
> mpirun -mca btl sm,self -np 2 examples/ring_c
> --
> It looks like opal_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during opal_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   opal_shmem_base_select failed
>   --> Returned value -1 instead of OPAL_SUCCESS
> --
> + exit 1
>
> When I configure with either "--disable-dlopen" OR "--enable-static
> --disable-shared" the problem vanishes.
> So, I suspect a dlopen-related issue.
>
> I will
>
> --
> Paul H. Hargrove  phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>

Re: [OMPI devel] rdmacm and udcm for 2.0.1 and RoCE

2017-01-05 Thread Howard Pritchard
Hi Dave,

Sorry for the delayed response.

Anyway, you have to use rdmacm for connection management when using RoCE.
However, with 2.0.1 and later, you have to specify the per-peer QP info
manually on the mpirun command line.

Could you try rerunning with

mpirun --mca btl_openib_receive_queues P,128,64,32,32,32:S,2048,1024,128,32:S,12288,1024,128,32:S,65536,1024,128,32 (all the rest of the command line args)

and see if it then works?

Howard


2017-01-04 16:37 GMT-07:00 Dave Turner :

> --
> No OpenFabrics connection schemes reported that they were able to be
> used on a specific port.  As such, the openib BTL (OpenFabrics
> support) will be disabled for this port.
>
>   Local host:   elf22
>   Local device: mlx4_2
>   Local port:   1
>   CPCs attempted:   rdmacm, udcm
> --
>
> I posted this to the user list but got no answer so I'm reposting to
> the devel list.
>
> We recently upgraded to OpenMPI 2.0.1.  Everything works fine
> on our QDR connections but we get the error above for our
> 40 GbE connections running RoCE.  I traced through the code and
> it looks like udcm cannot be used with RoCE.  I've also read that
> there are currently some problems with rdmacm under 2.0.1, which
> would mean 2.0.1 does not currently work on RoCE.  We've tested
> 10.4 using rdmacm and that works fine so I don't think we have anything
> configured wrong on the RoCE side.
>  Could someone please verify whether this information is correct: that
> RoCE requires rdmacm only (not udcm), and that rdmacm is currently not
> working? If so, is it being worked on?
>
>  Dave
>
>
> --
> Work: davetur...@ksu.edu (785) 532-7791
>  2219 Engineering Hall, Manhattan KS  66506
> Home:drdavetur...@gmail.com
>   cell: (785) 770-5929
>

Re: [OMPI devel] [2.0.2rc2] opal_fifo hang w/ --enable-osx-builtin-atomics

2017-01-05 Thread Howard Pritchard
Hi Paul,

I opened issue 2666 to track this.

Howard

2017-01-05 0:23 GMT-07:00 Paul Hargrove :

> On Macs running Yosemite (OS X 10.10 w/ Xcode 7.1) and El Capitan (OS X
> 10.11 w/ Xcode 8.1) I have configured with
> CC=cc CXX=c++ FC=/sw/bin/gfortran --prefix=...
> --enable-osx-builtin-atomics
>
> Upon running "make check", the test "opal_fifo" hangs on both systems.
> Without the --enable-osx-builtin-atomics things are fine.
>
> I don't have data for Sierra (10.12).
>
> -Paul
>
> --
> Paul H. Hargrove  phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>

[OMPI devel] v2.0.2rc3 posted

2017-01-05 Thread Jeff Squyres (jsquyres)
In the usual place:

https://www.open-mpi.org/software/ompi/v2.0/

The main driver for rc3 is that we think rc2 may have accidentally been made 
with older versions of the GNU Autotools, which may have led to 
https://github.com/open-mpi/ompi/issues/2665.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



[OMPI devel] [2.0.2rc3] build failure ppc64/-m32 and builtin-atomics

2017-01-05 Thread Paul Hargrove
I have a standard Linux/ppc64 system with gcc-4.8.3
I have configured the 2.0.2rc3 tarball with

--prefix=... --enable-builtin-atomics \
CFLAGS=-m32 --with-wrapper-cflags=-m32 \
CXXFLAGS=-m32 --with-wrapper-cxxflags=-m32 \
FCFLAGS=-m32 --with-wrapper-fcflags=-m32 --disable-mpi-fortran

(Yes, I know the FCFLAGS are unnecessary).

I get a "make check" failure:

make[3]: Entering directory
`/home/phargrov/OMPI/openmpi-2.0.2rc3-linux-ppc32-gcc/BLD/test/asm'
  CC   atomic_barrier.o
  CCLD atomic_barrier
  CC   atomic_barrier_noinline-atomic_barrier_noinline.o
  CCLD atomic_barrier_noinline
  CC   atomic_spinlock.o
  CCLD atomic_spinlock
  CC   atomic_spinlock_noinline-atomic_spinlock_noinline.o
  CCLD atomic_spinlock_noinline
  CC   atomic_math.o
  CCLD atomic_math
atomic_math.o: In function `atomic_math_test':
atomic_math.c:(.text+0x78): undefined reference to `__sync_add_and_fetch_8'
collect2: error: ld returned 1 exit status
make[3]: *** [atomic_math] Error 1


It looks like there is an (incorrect) assumption that 8-byte atomics are
available.
Removing --enable-builtin-atomics resolves this issue.
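
As a rough illustration of where the undefined reference comes from
(assuming gcc and a 32-bit ppc target without native 64-bit atomics): gcc
cannot expand the 8-byte variant of the __sync builtins inline there, so it
emits a call to a library function that nothing provides. A configure probe
for builtin atomics therefore has to link, not merely compile, code like:

```c
#include <stdint.h>

/* Minimal reproduction of the link failure: on 32-bit ppc, gcc turns the
 * 8-byte variant of __sync_add_and_fetch into a call to the out-of-line
 * helper __sync_add_and_fetch_8, which no library supplies -- hence the
 * "undefined reference". On targets with native 64-bit atomics this
 * compiles, links, and behaves as a normal atomic increment. */
static uint64_t atomic_incr64(uint64_t *p)
{
    return __sync_add_and_fetch(p, 1);
}
```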

-Paul

-- 
Paul H. Hargrove  phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900