Re: [OMPI devel] trunk build failed when configured with --disable-hwloc

2012-02-14 Thread Paul H. Hargrove



On 2/14/2012 5:10 PM, Paul H. Hargrove wrote:
I have configured the ompi-trunk (from last night's tarball: 
1.7a1r25913) with --without-hwloc.

Having done so, I see the following failure at build time:


  CC rmaps_rank_file_component.lo
/home/hargrove/OMPI/openmpi-trunk-linux-mips64el//openmpi-trunk/orte/mca/rmaps/rank_file/rmaps_rank_file_component.c: In function 'orte_rmaps_rank_file_open':
/home/hargrove/OMPI/openmpi-trunk-linux-mips64el//openmpi-trunk/orte/mca/rmaps/rank_file/rmaps_rank_file_component.c:111: error: 'opal_hwloc_binding_policy' undeclared (first use in this function)
/home/hargrove/OMPI/openmpi-trunk-linux-mips64el//openmpi-trunk/orte/mca/rmaps/rank_file/rmaps_rank_file_component.c:111: error: (Each undeclared identifier is reported only once
/home/hargrove/OMPI/openmpi-trunk-linux-mips64el//openmpi-trunk/orte/mca/rmaps/rank_file/rmaps_rank_file_component.c:111: error: for each function it appears in.)
/home/hargrove/OMPI/openmpi-trunk-linux-mips64el//openmpi-trunk/orte/mca/rmaps/rank_file/rmaps_rank_file_component.c:111: error: 'OPAL_BIND_TO_CPUSET' undeclared (first use in this function)


Looks like this code is not "aware" that hwloc has been configured out.
This failure is not present in the 1.5 branch when configured with identical 
arguments.


-Paul



The following appears to "fix" that, but I am uncertain if this is the 
desired fix.
--- orte/mca/rmaps/rank_file/rmaps_rank_file_component.c~	2012-02-14 17:25:07.653483222 -0800
+++ orte/mca/rmaps/rank_file/rmaps_rank_file_component.c	2012-02-14 17:25:28.803483261 -0800
@@ -107,8 +107,10 @@
     }
     ORTE_SET_MAPPING_POLICY(orte_rmaps_base.mapping, ORTE_MAPPING_BYUSER);
     ORTE_SET_MAPPING_DIRECTIVE(orte_rmaps_base.mapping, ORTE_MAPPING_GIVEN);
+#if OPAL_HAVE_HWLOC
     /* we are going to bind to cpuset since the user is specifying the cpus */
     OPAL_SET_BINDING_POLICY(opal_hwloc_binding_policy, OPAL_BIND_TO_CPUSET);
+#endif
     /* make us first */
     my_priority = 1;
 }



HOWEVER, I am now also seeing the following occurring ONLY when 
configured with --disable-hwloc:
make[1]: Entering directory `/home/phargrov/openmpi-1.7a1r25913/BLD2/opal/mca/event/libevent2013'

  CC libevent2013_module.lo
../../../../../opal/mca/event/libevent2013/libevent2013_module.c:7:20: error: config.h: No such file or directory
../../../../../opal/mca/event/libevent2013/libevent2013_module.c: In function 'opal_event_init':
../../../../../opal/mca/event/libevent2013/libevent2013_module.c:243: warning: ignoring return value of 'asprintf', declared with attribute warn_unused_result

make[1]: *** [libevent2013_module.lo] Error 1


It seems VERY odd to me that disabling hwloc should have that effect.
Looking deeper, it appears that the '#include "config.h"' in 
libevent2013_module.c has been including the config.h from HWLOC, 
instead of the one from libevent2013.  If one examines the -I options 
carefully, one sees that $(builddir)/libevent is NOT in the include 
path, yet that is the location of the config.h generated by libevent's 
configure script!


-Paul

--
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
HPC Research Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900



[OMPI devel] trunk build failed when configured with --disable-hwloc

2012-02-14 Thread Paul H. Hargrove
I have configured the ompi-trunk (from last night's tarball: 
1.7a1r25913) with --without-hwloc.

Having done so, I see the following failure at build time:


  CC rmaps_rank_file_component.lo
/home/hargrove/OMPI/openmpi-trunk-linux-mips64el//openmpi-trunk/orte/mca/rmaps/rank_file/rmaps_rank_file_component.c: In function 'orte_rmaps_rank_file_open':
/home/hargrove/OMPI/openmpi-trunk-linux-mips64el//openmpi-trunk/orte/mca/rmaps/rank_file/rmaps_rank_file_component.c:111: error: 'opal_hwloc_binding_policy' undeclared (first use in this function)
/home/hargrove/OMPI/openmpi-trunk-linux-mips64el//openmpi-trunk/orte/mca/rmaps/rank_file/rmaps_rank_file_component.c:111: error: (Each undeclared identifier is reported only once
/home/hargrove/OMPI/openmpi-trunk-linux-mips64el//openmpi-trunk/orte/mca/rmaps/rank_file/rmaps_rank_file_component.c:111: error: for each function it appears in.)
/home/hargrove/OMPI/openmpi-trunk-linux-mips64el//openmpi-trunk/orte/mca/rmaps/rank_file/rmaps_rank_file_component.c:111: error: 'OPAL_BIND_TO_CPUSET' undeclared (first use in this function)


Looks like this code is not "aware" that hwloc has been configured out.
This failure is not present in the 1.5 branch when configured with identical arguments.

-Paul

--
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
HPC Research Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900



[OMPI devel] the dangers of configure probing argument counts

2012-02-14 Thread Paul H. Hargrove
There was recently a fair amount of work done in hwloc to get configure 
to work correctly for a probe that was intended to determine how many 
arguments appear in a specific function prototype.  The "issue" was that 
the C spec doesn't require that the C compiler issue an error for either 
too-many or too-few arguments.  While gcc and most other compilers make 
both cases an error, there are two compilers of non-trivial importance 
which do NOT:
+  By default the IBM (xlc) C compiler warns for the case of too many 
arguments.
+  By default the Intel (icc) C compiler warns for the case of too few 
arguments.


This renders configure-time tests that want to check argument counts 
unreliable unless one takes special care to add something "special" to 
CFLAGS.  While hacking on hwloc we determined that it was NOT safe for 
configure to add to CFLAGS in general, nor to ask the user to do so.  It 
was only safe to /temporarily/ add to CFLAGS for the duration of the 
argument-count probe.
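
To make the failure mode concrete, the kind of test program such a probe 
compiles looks roughly like this (a sketch only; probe_func is an invented 
stand-in, not the actual ibv_create_cq prototype):

/* Illustrative conftest only -- NOT the code generated by
 * ompi_check_openib.m4, and probe_func() is a made-up stand-in prototype.
 * The probe's idea: compile a call with N arguments and treat "it
 * compiled" as proof that the prototype has N parameters. */
int probe_func(int a, int b, int c);

int main(void)
{
    /* Deliberately one argument short.  The C spec only requires a
     * diagnostic here, not a hard error: gcc rejects this, but a compiler
     * that merely warns (e.g. icc by default for too-few arguments, per
     * the above) lets the probe "succeed", and configure then records the
     * wrong argument count. */
    return probe_func(1, 2);
}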


So, WHY am I telling you all this?
Because of the following in 
openmpi-1.7a1r25865/ompi/config/ompi_check_openib.m4:

  [AC_CACHE_CHECK(
  [number of arguments to ibv_create_cq],

which performs exactly the sort of test I am warning against.

So, I would encourage somebody to make the effort to reuse the configure 
logic Jeff and I developed for hwloc.
In particular, look for the setting and use of HWLOC_STRICT_ARGS_CFLAGS in 
config/hwloc.m4.


-Paul

--
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
HPC Research Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900



Re: [OMPI devel] MVAPICH2 vs Open-MPI

2012-02-14 Thread Rolf vandeVaart
There are several things going on here that make their library perform better.

With respect to inter-node performance, both MVAPICH2 and Open MPI copy the GPU 
memory into host memory first.  However, they are using special host buffers 
and a code path that allow them to copy the data asynchronously, and therefore 
do a better job of pipelining than Open MPI.  I believe their host buffers are 
also bigger, which works better at larger messages.  Open MPI just piggybacks 
on the existing host buffers in the Open MPI openib BTL.  Open MPI also uses 
only synchronous copies.  (There is hope to improve that.)
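
For the archives, the pipelining idea is roughly the following (a sketch only; 
the chunk size, the double buffering, and the net_send() callback are 
illustrative assumptions, not the actual Open MPI or MVAPICH2 code path):

/* Sketch of the pinned-buffer pipelining idea described above. */
#include <cuda_runtime.h>
#include <stddef.h>

#define CHUNK (512 * 1024)   /* hypothetical staging-buffer size */

void send_gpu_buffer(const void *d_src, size_t len,
                     void (*net_send)(const void *buf, size_t n))
{
    void *stage[2];
    cudaStream_t stream[2];
    size_t nchunks = (len + CHUNK - 1) / CHUNK;

    for (int b = 0; b < 2; ++b) {
        /* Page-locked host memory is what makes cudaMemcpyAsync truly
         * asynchronous, so a device->host copy can overlap a network send. */
        cudaHostAlloc(&stage[b], CHUNK, cudaHostAllocDefault);
        cudaStreamCreate(&stream[b]);
    }

    /* Prime the pipeline with the first chunk. */
    cudaMemcpyAsync(stage[0], d_src, (len < CHUNK) ? len : CHUNK,
                    cudaMemcpyDeviceToHost, stream[0]);

    for (size_t i = 0; i < nchunks; ++i) {
        /* Start staging chunk i+1 ... */
        if (i + 1 < nchunks) {
            size_t off = (i + 1) * CHUNK;
            size_t next_n = (len - off < CHUNK) ? len - off : CHUNK;
            cudaMemcpyAsync(stage[(i + 1) & 1], (const char *)d_src + off,
                            next_n, cudaMemcpyDeviceToHost,
                            stream[(i + 1) & 1]);
        }
        /* ... then wait only for chunk i and hand it to the network layer,
         * so the copy of the next chunk overlaps this send. */
        cudaStreamSynchronize(stream[i & 1]);
        size_t n = ((i + 1) * CHUNK > len) ? len - i * CHUNK : CHUNK;
        net_send(stage[i & 1], n);
    }

    for (int b = 0; b < 2; ++b) {
        cudaFreeHost(stage[b]);
        cudaStreamDestroy(stream[b]);
    }
}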

Secondly, with respect to intra-node performance, they are using the 
Inter-Process Communication (IPC) feature of CUDA, which means that within a 
node one can move GPU memory directly from one GPU to another.  We have an RFC 
from December to add this into Open MPI as well, but do not have approval yet.  
Hopefully sometime soon.
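
For reference, the underlying CUDA IPC mechanism looks roughly like this (raw 
CUDA runtime calls, assuming CUDA >= 4.1; again a sketch, not code from either 
MPI implementation):

/* Exporting process: publish a handle for a device allocation.  The handle
 * is a small opaque blob that can be passed to another process on the same
 * node (e.g. over a pipe or shared memory). */
#include <cuda_runtime.h>
#include <stddef.h>

void export_buffer(void *d_buf, cudaIpcMemHandle_t *handle)
{
    cudaIpcGetMemHandle(handle, d_buf);
}

/* Importing process: map the peer's allocation and copy from it directly,
 * GPU to GPU, without staging through host memory. */
void import_and_copy(cudaIpcMemHandle_t handle, void *d_dst, size_t len)
{
    void *d_peer = NULL;
    cudaIpcOpenMemHandle(&d_peer, handle, cudaIpcMemLazyEnablePeerAccess);
    cudaMemcpy(d_dst, d_peer, len, cudaMemcpyDeviceToDevice);
    cudaIpcCloseMemHandle(d_peer);
}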

Rolf

>-Original Message-
>From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org]
>On Behalf Of Rayson Ho
>Sent: Tuesday, February 14, 2012 4:16 PM.
>To: Open MPI Developers
>Subject: [OMPI devel] MVAPICH2 vs Open-MPI
>
>See pp. 38-40: MVAPICH2 outperforms Open MPI in every test.  Is there
>something they are doing to optimize for CUDA & GPUs that is not in OMPI,
>or did they specifically tune MVAPICH2 to make it shine?
>
>http://hpcadvisorycouncil.com/events/2012/Israel-Workshop/Presentations/7_OSU.pdf
>
>The benchmark package: http://mvapich.cse.ohio-state.edu/benchmarks/
>
>Rayson
>
>=
>Open Grid Scheduler / Grid Engine
>http://gridscheduler.sourceforge.net/
>
>Scalable Grid Engine Support Program
>http://www.scalablelogic.com/
>___
>devel mailing list
>de...@open-mpi.org
>http://www.open-mpi.org/mailman/listinfo.cgi/devel



[OMPI devel] MVAPICH2 vs Open-MPI

2012-02-14 Thread Rayson Ho
See pp. 38-40: MVAPICH2 outperforms Open MPI in every test.  Is there
something they are doing to optimize for CUDA & GPUs that is not in OMPI,
or did they specifically tune MVAPICH2 to make it shine?

http://hpcadvisorycouncil.com/events/2012/Israel-Workshop/Presentations/7_OSU.pdf

The benchmark package: http://mvapich.cse.ohio-state.edu/benchmarks/

Rayson

=
Open Grid Scheduler / Grid Engine
http://gridscheduler.sourceforge.net/

Scalable Grid Engine Support Program
http://www.scalablelogic.com/


Re: [OMPI devel] poor btl sm latency

2012-02-14 Thread Matthias Jurenz
I've built Open MPI 1.5.5rc1 (tarball from the web) with CFLAGS=-O3. 
Unfortunately, this also had no effect.

Here are some results with binding reports enabled:

$ mpirun *--bind-to-core* --report-bindings -np 2 ./all2all_ompi1.5.5
[n043:61313] [[56788,0],0] odls:default:fork binding child [[56788,1],1] to cpus 0002
[n043:61313] [[56788,0],0] odls:default:fork binding child [[56788,1],0] to cpus 0001
latency: 1.415us

$ mpirun *-mca maffinity hwloc --bind-to-core* --report-bindings -np 2 ./all2all_ompi1.5.5
[n043:61469] [[49736,0],0] odls:default:fork binding child [[49736,1],1] to cpus 0002
[n043:61469] [[49736,0],0] odls:default:fork binding child [[49736,1],0] to cpus 0001
latency: 1.4us

$ mpirun *-mca maffinity first_use --bind-to-core* --report-bindings -np 2 ./all2all_ompi1.5.5
[n043:61508] [[49681,0],0] odls:default:fork binding child [[49681,1],1] to cpus 0002
[n043:61508] [[49681,0],0] odls:default:fork binding child [[49681,1],0] to cpus 0001
latency: 1.4us


$ mpirun *--bind-to-socket* --report-bindings -np 2 ./all2all_ompi1.5.5
[n043:61337] [[56780,0],0] odls:default:fork binding child [[56780,1],1] to socket 0 cpus 0001
[n043:61337] [[56780,0],0] odls:default:fork binding child [[56780,1],0] to socket 0 cpus 0001
latency: 4.0us

$ mpirun *-mca maffinity hwloc --bind-to-socket* --report-bindings -np 2 ./all2all_ompi1.5.5
[n043:61615] [[49914,0],0] odls:default:fork binding child [[49914,1],1] to socket 0 cpus 0001
[n043:61615] [[49914,0],0] odls:default:fork binding child [[49914,1],0] to socket 0 cpus 0001
latency: 4.0us

$ mpirun *-mca maffinity first_use --bind-to-socket* --report-bindings -np 2 ./all2all_ompi1.5.5
[n043:61639] [[49810,0],0] odls:default:fork binding child [[49810,1],1] to socket 0 cpus 0001
[n043:61639] [[49810,0],0] odls:default:fork binding child [[49810,1],0] to socket 0 cpus 0001
latency: 4.0us


If socket binding is enabled, it seems that all ranks are bound to the very first 
core of one and the same socket. Is that intended? I expected each rank to 
get its own socket (i.e., 2 ranks -> 2 sockets)...

Matthias

On Monday 13 February 2012 22:36:50 Jeff Squyres wrote:
> Also, double check that you have an optimized build, not a debugging build.
> 
> SVN and HG checkouts default to debugging builds, which add in lots of
> latency.
> 
> On Feb 13, 2012, at 10:22 AM, Ralph Castain wrote:
> > Few thoughts
> > 
> > 1. Bind to socket is broken in 1.5.4 - fixed in next release
> > 
> > 2. Add --report-bindings to cmd line and see where it thinks the procs
> > are bound
> > 
> > 3. Sounds like memory may not be local - might be worth checking mem
> > binding.
> > 
> > Sent from my iPad
> > 
> > On Feb 13, 2012, at 7:07 AM, Matthias Jurenz  wrote:
> >> Hi Sylvain,
> >> 
> >> thanks for the quick response!
> >> 
> >> Here some results with enabled process binding. I hope I used the
> >> parameters correctly...
> >> 
> >> bind two ranks to one socket:
> >> $ mpirun -np 2 --bind-to-core ./all2all
> >> $ mpirun -np 2 -mca mpi_paffinity_alone 1 ./all2all
> >> 
> >> bind two ranks to two different sockets:
> >> $ mpirun -np 2 --bind-to-socket ./all2all
> >> 
> >> All three runs resulted in similar bad latencies (~1.4us).
> >> 
> >> :-(
> >> 
> >> Matthias
> >> 
> >> On Monday 13 February 2012 12:43:22 sylvain.jeau...@bull.net wrote:
> >>> Hi Matthias,
> >>> 
> >>> You might want to play with process binding to see if your problem is
> >>> related to bad memory affinity.
> >>> 
> >>> Try to launch pingpong on two CPUs of the same socket, then on
> >>> different sockets (i.e. bind each process to a core, and try different
> >>> configurations).
> >>> 
> >>> Sylvain
> >>> 
> >>> 
> >>> 
> >>> De :Matthias Jurenz 
> >>> A : Open MPI Developers 
> >>> Date :  13/02/2012 12:12
> >>> Objet : [OMPI devel] poor btl sm latency
> >>> Envoyé par :devel-boun...@open-mpi.org
> >>> 
> >>> 
> >>> 
> >>> Hello all,
> >>> 
> >>> on our new AMD cluster (AMD Opteron 6274, 2.2 GHz) we get very bad
> >>> latencies
> >>> (~1.5us) when performing 0-byte p2p communication on one single node
> >>> using the
> >>> Open MPI sm BTL. When using Platform MPI we get ~0.5us latencies which
> >>> is pretty good. The bandwidth results are similar for both MPI
> >>> implementations
> >>> (~3.3 GB/s) - this is okay.
> >>> 
> >>> One node has 64 cores and 64 GB RAM; it doesn't matter how many
> >>> ranks are allocated by the application. We get similar results with
> >>> different numbers of ranks.
> >>> 
> >>> We are using Open MPI 1.5.4 which is built by gcc 4.3.4 without any
> >>> special
> >>> configure options except the installation prefix and the location of
> >>> the LSF
> >>> stuff.
> >>> 
> >>> As mentioned at http://www.open-mpi.org/faq/?category=sm we tried to
> >>> use /dev/shm instead of /tmp for the session directory, but it had no
> >>> effect. Furthermore, we tried the current release candidate 

Re: [OMPI devel] Question about opal/mca/memory/linux licensing

2012-02-14 Thread Denis Nagorny
2012/2/14 Jeff Squyres 

> On Feb 14, 2012, at 6:09 AM, Denis Nagorny wrote:
>
> I assume you're referring to the ptmalloc implementation under
> opal/mca/memory/linux, right?
>
Yes, you are right.


> Specifically, see opal/mca/memory/linux/README-ptmalloc.txt
>

It seems that I was misled by the copyright notices in the source files, which
have no such exclusion. BTW, are you going to make the copyright notices in the
source files more consistent with the one in the README file?

Denis


Re: [OMPI devel] Question about opal/mca/memory/linux licensing

2012-02-14 Thread Jeff Squyres
On Feb 14, 2012, at 6:09 AM, Denis Nagorny wrote:

> Investigating the memory management implementation in OpenMPI, I found that 
> opal's memory module is licensed under Lesser GPL terms.

I assume you're referring to the ptmalloc implementation under 
opal/mca/memory/linux, right?

If so, please read its licensing terms a little more closely.  

It specifically allows us to include it under the same terms as the overall 
package of Open MPI, meaning the modified-BSD-like license in the top-level 
LICENSE file.  

Specifically, see opal/mca/memory/linux/README-ptmalloc.txt (for the web 
archives: 
https://svn.open-mpi.org/trac/ompi/browser/trunk/opal/mca/memory/linux/README-ptmalloc2.txt),
 where it says:

-
As part of the GNU C library, the source files are available under the
GNU Library General Public License (see the comments in the files).
But as part of this stand-alone package, the code is also available
under the (probably less restrictive) conditions described in the file
'COPYRIGHT'.  In any case, there is no warranty whatsoever for this
package.
-

And for the web archives, the COPYRIGHT file says 
(https://svn.open-mpi.org/trac/ompi/browser/trunk/opal/mca/memory/linux/COPYRIGHT-ptmalloc2.txt):

-
Copyright (c) 2001-2004 Wolfram Gloger

Permission to use, copy, modify, distribute, and sell this software
and its documentation for any purpose is hereby granted without fee,
provided that (i) the above copyright notices and this permission
notice appear in all copies of the software and related documentation,
and (ii) the name of Wolfram Gloger may not be used in any advertising
or publicity relating to the software.

THE SOFTWARE IS PROVIDED "AS-IS" AND WITHOUT WARRANTY OF ANY KIND,
EXPRESS, IMPLIED OR OTHERWISE, INCLUDING WITHOUT LIMITATION, ANY
WARRANTY OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

IN NO EVENT SHALL WOLFRAM GLOGER BE LIABLE FOR ANY SPECIAL,
INCIDENTAL, INDIRECT OR CONSEQUENTIAL DAMAGES OF ANY KIND, OR ANY
DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS,
WHETHER OR NOT ADVISED OF THE POSSIBILITY OF DAMAGE, AND ON ANY THEORY
OF LIABILITY, ARISING OUT OF OR IN CONNECTION WITH THE USE OR
PERFORMANCE OF THIS SOFTWARE.
-

Which, to my understanding and that of the lawyers who have reviewed this, is 
compatible with Open MPI's BSD-like licensing.

Obvious disclaimer: IANAL and this is not legal advice.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




[OMPI devel] Question about opal/mca/memory/linux licensing

2012-02-14 Thread Denis Nagorny
Hello,

Investigating the memory management implementation in OpenMPI, I found that
opal's memory module is licensed under Lesser GPL terms. This subsystem is
linked into the OpenMPI library. As far as I know, this should force the
Lesser GPL license onto libopen-rte.so and libopen-pal.so. Could anybody
please explain to me how it is possible that OpenMPI has a BSD license?

Denis


Re: [hwloc-devel] hwloc 1.3.2rc2 released

2012-02-14 Thread Paul H. Hargrove


On 2/13/2012 1:30 PM, Jeff Squyres wrote:

Due to the volume of off-list emails, I'm kinda expecting this rc to be good / 
final.  However, please do at least some cursory testing so that we can be sure.


I disregarded the "cursory" and ran on 61 arch/os/compiler combinations.
I can see only 2 problems at this point:
+ known libnuma issues on a "weird" virtual node - NOT expected to be fixed 
in 1.3.x
+ "make check" failure w/ icc-8.0 on x86/Linux - BUT icc-9.0 and gcc are 
both fine on the same node (so probably a compiler bug).


So, I agree this looks "final" to me.

-Paul

--
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
HPC Research Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900