[OMPI devel] poor btl sm latency

2012-02-13 Thread Matthias Jurenz
Hello all,

on our new AMD cluster (AMD Opteron 6274, 2.2 GHz) we get very bad latencies 
(~1.5 us) when performing 0-byte point-to-point communication on a single node 
using the Open MPI sm BTL. When using Platform MPI we get ~0.5 us latencies, 
which is pretty good. The bandwidth results are similar for both MPI 
implementations (~3.3 GB/s) - this is okay.
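
For reference, a minimal sketch of the kind of 0-byte ping-pong loop behind 
this measurement (an illustration only, not our actual benchmark code):

/* minimal 0-byte ping-pong latency sketch (illustration only) */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, i, iters = 100000;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(NULL, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(NULL, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    /* one-way latency = round-trip time / 2, averaged over all iterations */
    if (rank == 0)
        printf("0-byte latency: %.3f us\n", (t1 - t0) * 1e6 / (2.0 * iters));

    MPI_Finalize();
    return 0;
}

Run with two ranks restricted to the shared-memory BTL, e.g.:

$ mpirun -np 2 -mca btl sm,self ./pingpong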

One node has 64 cores and 64 GB RAM. It doesn't matter how many ranks are 
allocated by the application - we get similar results with different numbers 
of ranks.

We are using Open MPI 1.5.4, built with GCC 4.3.4 and without any special 
configure options except the installation prefix and the location of the LSF 
stuff.

As mentioned at http://www.open-mpi.org/faq/?category=sm we tried to use 
/dev/shm instead of /tmp for the session directory, but it had no effect. 
Furthermore, we tried the current release candidate 1.5.5rc1 of Open MPI, which 
provides an option to use SysV shared memory (-mca shmem sysv) - this also 
results in similarly poor latencies.

Do you have any idea? Please help!

Thanks,
Matthias


Re: [OMPI devel] poor btl sm latency

2012-02-13 Thread sylvain . jeaugey
Hi Matthias,

You might want to play with process binding to see if your problem is 
related to bad memory affinity.

Try to launch pingpong on two CPUs of the same socket, then on different 
sockets (i.e. bind each process to a core, and try different 
configurations).
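
For example (Open MPI 1.5.x option names and rankfile syntax assumed here - 
please double-check against your installation), a rankfile lets you pin the 
two ranks explicitly:

# rankfile: both ranks on socket 0 (slot = socket:core)
rank 0=localhost slot=0:0
rank 1=localhost slot=0:1

$ mpirun -np 2 -rf ./rankfile --report-bindings ./pingpong

To compare, change the second line to "rank 1=localhost slot=1:0" so that the 
ranks sit on different sockets, and rerun.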

Sylvain






Re: [OMPI devel] poor btl sm latency

2012-02-13 Thread Matthias Jurenz
Hi Sylvain,

thanks for the quick response!

Here are some results with process binding enabled. I hope I used the 
parameters correctly...

bind two ranks to one socket:
$ mpirun -np 2 --bind-to-core ./all2all
$ mpirun -np 2 -mca mpi_paffinity_alone 1 ./all2all

bind two ranks to two different sockets:
$ mpirun -np 2 --bind-to-socket ./all2all

All three runs resulted in similarly bad latencies (~1.4 us).
:-(


Matthias




Re: [OMPI devel] poor btl sm latency

2012-02-13 Thread Ralph Castain
A few thoughts:

1. Bind-to-socket is broken in 1.5.4 - fixed in the next release.

2. Add --report-bindings to the cmd line and see where it thinks the procs are bound.

3. Sounds like memory may not be local - might be worth checking the memory binding (a quick sanity check is sketched below).
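
Something along these lines (assuming numactl is installed on the node; 
adjust to taste):

$ mpirun -np 2 --bind-to-core --report-bindings ./all2all
  (the reported masks should show the two procs on the cores you expect)

$ numactl --hardware
  (shows the NUMA nodes and per-node free memory)

$ cat /proc/<pid>/numa_maps
  (shows on which NUMA nodes a running rank's pages actually live)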

Sent from my iPad




Re: [OMPI devel] poor btl sm latency

2012-02-13 Thread Jeff Squyres
Also, double check that you have an optimized build, not a debugging build.

SVN and HG checkouts default to debugging builds, which add a lot of latency.
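
One quick way to check is something like this (the exact ompi_info wording may 
vary between versions):

$ ompi_info | grep -i debug
  Internal debug support: no
  Memory debugging support: no

If either line says "yes", reconfigure with --disable-debug (and 
--disable-mem-debug), or build from a release tarball.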




-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] 1.5.5rc2r25906 test results

2012-02-13 Thread Jeff Squyres
On Feb 12, 2012, at 4:52 AM, Paul Hargrove wrote:

> I just tried tonight's nightly tarball for the 1.5 branch (1.5.5rc2r25906).
> I found the following issues, which I had previously reported against 
> 1.5.5rc1, for which I did NOT find a corresponding ticket in "report/15".  My 
> apologies if I've missed a ticket, or if any of these were deferred to 1.6.x 
> (as was Lion+PGI, for instance).

Many thanks for being persistent.  OMPI 1.4.5 is just about done, and we've 
pretty much stuck a fork in hwloc 1.3.2, so my sights are now turning back to 
getting OMPI 1.5.5 out the door (it's still blocking on hwloc 1.3.2, but that's 
darn close).

> + GNU Make required for "make clean" due to use of non-standard $(RM)
> Reported in http://www.open-mpi.org/community/lists/devel/2011/12/10184.php
> 
> + ROMIO uses explicit MAKE=make, causing problems if one builds ompi w/ gmake
> Reported in http://www.open-mpi.org/community/lists/devel/2012/01/10300.php
> 
> + The 1.5 branch needs the same fixes to the -fvisibility probe that Jeff and 
> I have been discussing off-list for hwloc-1.3.2.  Basically this comes down 
> to the fact that the 1.4 branch of OMPI has a "stronger" configure probe for 
> -fvisibility than the 1.5 branch or trunk, and is thus known NOT to use 
> -fvisibility with broken icc compilers.  This may come down to a simple CMR, 
> if one could track down when the probe was strengthened.

I fixed all of these on the trunk and filed CMR 
https://svn.open-mpi.org/trac/ompi/ticket/3013 to bring them to the v1.5 branch.

> + MacOS 10.4 on ppc fails linking libvt-mpi.la (multiply defined symbols)
> Reported in http://www.open-mpi.org/community/lists/devel/2011/12/10090.php
> My MacOS 10.4/x86 machine is down, but I don't believe it had this problem w/ 
> rc1.

I pinged the VT guys about this, and filed 
https://svn.open-mpi.org/trac/ompi/ticket/3014 about it.  I won't be 
heartbroken if this slips to v1.6 (I actually filed it as a v1.6 bug).

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/