I've built Open MPI 1.5.5rc1 (tarball from the website) with CFLAGS=-O3.
Unfortunately, this also had no effect.
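
For completeness, the build was configured roughly like this (the prefix and
LSF paths below are just placeholders, not the real ones):

$ ./configure --prefix=/path/to/ompi-1.5.5rc1 --with-lsf=/path/to/lsf CFLAGS=-O3
$ make all install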

Here are some results with binding reports enabled:

$ mpirun *--bind-to-core* --report-bindings -np 2 ./all2all_ompi1.5.5
[n043:61313] [[56788,0],0] odls:default:fork binding child [[56788,1],1] to cpus 0002
[n043:61313] [[56788,0],0] odls:default:fork binding child [[56788,1],0] to cpus 0001
latency: 1.415us

$ mpirun *-mca maffinity hwloc --bind-to-core* --report-bindings -np 2 ./all2all_ompi1.5.5
[n043:61469] [[49736,0],0] odls:default:fork binding child [[49736,1],1] to cpus 0002
[n043:61469] [[49736,0],0] odls:default:fork binding child [[49736,1],0] to cpus 0001
latency: 1.4us

$ mpirun *-mca maffinity first_use --bind-to-core* --report-bindings -np 2 ./all2all_ompi1.5.5
[n043:61508] [[49681,0],0] odls:default:fork binding child [[49681,1],1] to cpus 0002
[n043:61508] [[49681,0],0] odls:default:fork binding child [[49681,1],0] to cpus 0001
latency: 1.4us


$ mpirun *--bind-to-socket* --report-bindings -np 2 ./all2all_ompi1.5.5
[n043:61337] [[56780,0],0] odls:default:fork binding child [[56780,1],1] to socket 0 cpus 0001
[n043:61337] [[56780,0],0] odls:default:fork binding child [[56780,1],0] to socket 0 cpus 0001
latency: 4.0us

$ mpirun *-mca maffinity hwloc --bind-to-socket* --report-bindings -np 2 ./all2all_ompi1.5.5
[n043:61615] [[49914,0],0] odls:default:fork binding child [[49914,1],1] to socket 0 cpus 0001
[n043:61615] [[49914,0],0] odls:default:fork binding child [[49914,1],0] to socket 0 cpus 0001
latency: 4.0us

$ mpirun *-mca maffinity first_use --bind-to-socket* --report-bindings -np 2 ./all2all_ompi1.5.5
[n043:61639] [[49810,0],0] odls:default:fork binding child [[49810,1],1] to socket 0 cpus 0001
[n043:61639] [[49810,0],0] odls:default:fork binding child [[49810,1],0] to socket 0 cpus 0001
latency: 4.0us


With socket binding enabled, it seems that all ranks are bound to the very
first core of one and the same socket. Is that intended? I expected each rank
to get its own socket (i.e. 2 ranks -> 2 sockets)...
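
To take mpirun's mapping logic out of the picture, I could also pin the two
ranks explicitly with a rankfile. A minimal sketch (assuming the slot syntax
socket:core and the hostname n043 from the output above; "myrankfile" is just
my name for the file) would be:

$ cat myrankfile
rank 0=n043 slot=0:0
rank 1=n043 slot=1:0
$ mpirun -np 2 -rf myrankfile --report-bindings ./all2all_ompi1.5.5

If --report-bindings then shows the ranks on cores of two different sockets,
any remaining latency difference should be purely NUMA-related. If I read the
man page correctly, "--bysocket --bind-to-core" should give a similar
round-robin placement across sockets.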

Matthias

On Monday 13 February 2012 22:36:50 Jeff Squyres wrote:
> Also, double check that you have an optimized build, not a debugging build.
> 
> SVN and HG checkouts default to debugging builds, which add in lots of
> latency.
> 
> On Feb 13, 2012, at 10:22 AM, Ralph Castain wrote:
> > Few thoughts
> > 
> > 1. Bind to socket is broken in 1.5.4 - fixed in next release
> > 
> > 2. Add --report-bindings to cmd line and see where it thinks the procs
> > are bound
> > 
> > 3. Sounds like memory may not be local - might be worth checking mem
> > binding.
> > 
> > Sent from my iPad
> > 
> > On Feb 13, 2012, at 7:07 AM, Matthias Jurenz <matthias.jurenz@tu-dresden.de> wrote:
> >> Hi Sylvain,
> >> 
> >> thanks for the quick response!
> >> 
> >> Here are some results with process binding enabled. I hope I used the
> >> parameters correctly...
> >> 
> >> bind two ranks to one socket:
> >> $ mpirun -np 2 --bind-to-core ./all2all
> >> $ mpirun -np 2 -mca mpi_paffinity_alone 1 ./all2all
> >> 
> >> bind two ranks to two different sockets:
> >> $ mpirun -np 2 --bind-to-socket ./all2all
> >> 
> >> All three runs resulted in similarly bad latencies (~1.4us).
> >> 
> >> :-(
> >> 
> >> Matthias
> >> 
> >> On Monday 13 February 2012 12:43:22 sylvain.jeau...@bull.net wrote:
> >>> Hi Matthias,
> >>> 
> >>> You might want to play with process binding to see if your problem is
> >>> related to bad memory affinity.
> >>> 
> >>> Try to launch pingpong on two CPUs of the same socket, then on
> >>> different sockets (i.e. bind each process to a core, and try different
> >>> configurations).
> >>> 
> >>> Sylvain
> >>> 
> >>> 
> >>> 
> >>> From:    Matthias Jurenz <matthias.jur...@tu-dresden.de>
> >>> To:      Open MPI Developers <de...@open-mpi.org>
> >>> Date:    13/02/2012 12:12
> >>> Subject: [OMPI devel] poor btl sm latency
> >>> Sent by: devel-boun...@open-mpi.org
> >>> 
> >>> 
> >>> 
> >>> Hello all,
> >>> 
> >>> on our new AMD cluster (AMD Opteron 6274, 2.2 GHz) we get very bad
> >>> latencies (~1.5us) when performing 0-byte p2p communication on a
> >>> single node using the Open MPI sm BTL. When using Platform MPI we get
> >>> ~0.5us latencies, which is pretty good. The bandwidth results are
> >>> similar for both MPI implementations (~3.3 GB/s) - this is okay.
> >>> 
> >>> One node has 64 cores and 64 GB RAM; it doesn't matter how many ranks
> >>> the application uses - we get similar results with different numbers
> >>> of ranks.
> >>> 
> >>> We are using Open MPI 1.5.4, built with GCC 4.3.4 without any special
> >>> configure options except the installation prefix and the location of
> >>> the LSF installation.
> >>> 
> >>> As mentioned at http://www.open-mpi.org/faq/?category=sm, we tried to
> >>> use /dev/shm instead of /tmp for the session directory, but it had no
> >>> effect. Furthermore, we tried the current release candidate 1.5.5rc1
> >>> of Open MPI, which provides an option to use SysV shared memory
> >>> (-mca shmem sysv) - but this also results in similarly poor latencies.
> >>> 
> >>> Do you have any idea? Please help!
> >>> 
> >>> Thanks,
> >>> Matthias