Yowza. With inconsistent results like that, it does sound like something is going on in the hardware. Unfortunately, I don't know much of anything about AMD hardware (Cisco is an Intel shop). :-\

Do you have (AMD's equivalent of) hyperthreading enabled, perchance?

In the latest 1.5.5 nightly tarball, I have just upgraded the included version of hwloc to 1.3.2. A good next step would be to download hwloc 1.3.2 yourself, build it, and verify that lstopo faithfully reports the actual topology of your system. Can you do that?
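Something along these lines should be all it takes - just a sketch, with an arbitrary install prefix, and assuming the hwloc-1.3.2.tar.gz tarball from the hwloc download page:

$ tar xzf hwloc-1.3.2.tar.gz
$ cd hwloc-1.3.2
$ ./configure --prefix=$HOME/hwloc-1.3.2 && make && make install
$ $HOME/hwloc-1.3.2/bin/lstopo

lstopo with no arguments prints the topology hwloc detects (sockets, NUMA nodes, caches, cores, PUs); I believe it will also take an output filename (e.g. "lstopo topo.txt") if you want something easy to send back to the list. Two things worth eyeballing: whether the output matches what you know the machine to be, and whether any core shows more than one PU - if so, some flavor of hardware threading is being reported, which would also answer the hyperthreading question above.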
On Feb 16, 2012, at 7:06 AM, Matthias Jurenz wrote:

> Jeff,
>
> sorry for the confusion - the all2all is a classic pingpong which uses
> MPI_Send/Recv with 0-byte messages.
>
> One thing I just noticed when using NetPIPE/MPI: Platform MPI gives almost
> constant latencies for small messages (~0.89us), though I don't know about
> process binding in Platform MPI - I just used the defaults.
> When using Open MPI (regardless of core/socket binding), the results differ
> from run to run:
>
> === FIRST RUN ===
> $ mpirun -np 2 --bind-to-socket ./NPmpi_ompi1.5.5 -S -u 12 -n 100000
> Using synchronous sends
> 1: n029
> Using synchronous sends
> 0: n029
> Now starting the main loop
> 0:  1 bytes 100000 times -->  4.66 Mbps in 1.64 usec
> 1:  2 bytes 100000 times -->  8.94 Mbps in 1.71 usec
> 2:  3 bytes 100000 times --> 13.65 Mbps in 1.68 usec
> 3:  4 bytes 100000 times --> 17.91 Mbps in 1.70 usec
> 4:  6 bytes 100000 times --> 29.04 Mbps in 1.58 usec
> 5:  8 bytes 100000 times --> 39.06 Mbps in 1.56 usec
> 6: 12 bytes 100000 times --> 57.58 Mbps in 1.59 usec
>
> === SECOND RUN (~3s after the previous run) ===
> $ mpirun -np 2 --bind-to-socket ./NPmpi_ompi1.5.5 -S -u 12 -n 100000
> Using synchronous sends
> 1: n029
> Using synchronous sends
> 0: n029
> Now starting the main loop
> 0:  1 bytes 100000 times -->  5.73 Mbps in 1.33 usec
> 1:  2 bytes 100000 times --> 11.45 Mbps in 1.33 usec
> 2:  3 bytes 100000 times --> 17.13 Mbps in 1.34 usec
> 3:  4 bytes 100000 times --> 22.94 Mbps in 1.33 usec
> 4:  6 bytes 100000 times --> 34.39 Mbps in 1.33 usec
> 5:  8 bytes 100000 times --> 46.40 Mbps in 1.32 usec
> 6: 12 bytes 100000 times --> 68.92 Mbps in 1.33 usec
>
> === THIRD RUN ===
> $ mpirun -np 2 --bind-to-socket ./NPmpi_ompi1.5.5 -S -u 12 -n 100000
> Using synchronous sends
> 0: n029
> Using synchronous sends
> 1: n029
> Now starting the main loop
> 0:  1 bytes 100000 times -->  3.50 Mbps in 2.18 usec
> 1:  2 bytes 100000 times -->  6.99 Mbps in 2.18 usec
> 2:  3 bytes 100000 times --> 10.48 Mbps in 2.18 usec
> 3:  4 bytes 100000 times --> 14.00 Mbps in 2.18 usec
> 4:  6 bytes 100000 times --> 20.98 Mbps in 2.18 usec
> 5:  8 bytes 100000 times --> 27.84 Mbps in 2.19 usec
> 6: 12 bytes 100000 times --> 41.99 Mbps in 2.18 usec
>
> At first glance, I assumed that some CPU power-saving feature was enabled.
> But the CPU frequency scaling governor is set to "performance" and there is
> only one available frequency (2.2 GHz).
>
> Any idea how this can happen?
>
>
> Matthias
>
> On Wednesday 15 February 2012 19:29:38 Jeff Squyres wrote:
>> Something is definitely wrong -- 1.4us is way too high for a 0- or 1-byte
>> HRT ping-pong. What is this all2all benchmark, btw? Is it measuring an
>> MPI_ALLTOALL, or a pingpong?
>>
>> FWIW, on an older Nehalem machine running NetPIPE/MPI, I'm getting about
>> 0.27us latencies for short messages over sm when binding to socket.
>>
>> On Feb 14, 2012, at 7:20 AM, Matthias Jurenz wrote:
>>> I've built Open MPI 1.5.5rc1 (tarball from the web) with CFLAGS=-O3.
>>> Unfortunately, also without any effect.
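(Side note from me, just to rule one thing out: a quick way to double-check that a given install is an optimized build - if I'm remembering the ompi_info output correctly - is something like

$ ompi_info | grep -i debug

which should show "Internal debug support: no" for a normal tarball build. The 1.5.5rc1 tarball defaults to an optimized build, so this is almost certainly fine; I only mention it because SVN/HG checkouts default the other way, as noted below.)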
>>>
>>> Here are some results with binding reports enabled:
>>>
>>> $ mpirun *--bind-to-core* --report-bindings -np 2 ./all2all_ompi1.5.5
>>> [n043:61313] [[56788,0],0] odls:default:fork binding child [[56788,1],1] to cpus 0002
>>> [n043:61313] [[56788,0],0] odls:default:fork binding child [[56788,1],0] to cpus 0001
>>> latency: 1.415us
>>>
>>> $ mpirun *-mca maffinity hwloc --bind-to-core* --report-bindings -np 2 ./all2all_ompi1.5.5
>>> [n043:61469] [[49736,0],0] odls:default:fork binding child [[49736,1],1] to cpus 0002
>>> [n043:61469] [[49736,0],0] odls:default:fork binding child [[49736,1],0] to cpus 0001
>>> latency: 1.4us
>>>
>>> $ mpirun *-mca maffinity first_use --bind-to-core* --report-bindings -np 2 ./all2all_ompi1.5.5
>>> [n043:61508] [[49681,0],0] odls:default:fork binding child [[49681,1],1] to cpus 0002
>>> [n043:61508] [[49681,0],0] odls:default:fork binding child [[49681,1],0] to cpus 0001
>>> latency: 1.4us
>>>
>>>
>>> $ mpirun *--bind-to-socket* --report-bindings -np 2 ./all2all_ompi1.5.5
>>> [n043:61337] [[56780,0],0] odls:default:fork binding child [[56780,1],1] to socket 0 cpus 0001
>>> [n043:61337] [[56780,0],0] odls:default:fork binding child [[56780,1],0] to socket 0 cpus 0001
>>> latency: 4.0us
>>>
>>> $ mpirun *-mca maffinity hwloc --bind-to-socket* --report-bindings -np 2 ./all2all_ompi1.5.5
>>> [n043:61615] [[49914,0],0] odls:default:fork binding child [[49914,1],1] to socket 0 cpus 0001
>>> [n043:61615] [[49914,0],0] odls:default:fork binding child [[49914,1],0] to socket 0 cpus 0001
>>> latency: 4.0us
>>>
>>> $ mpirun *-mca maffinity first_use --bind-to-socket* --report-bindings -np 2 ./all2all_ompi1.5.5
>>> [n043:61639] [[49810,0],0] odls:default:fork binding child [[49810,1],1] to socket 0 cpus 0001
>>> [n043:61639] [[49810,0],0] odls:default:fork binding child [[49810,1],0] to socket 0 cpus 0001
>>> latency: 4.0us
>>>
>>>
>>> If socket binding is enabled, it seems that all ranks are bound to the very
>>> first core of one and the same socket. Is that intended? I expected that
>>> each rank would get its own socket (i.e. 2 ranks -> 2 sockets)...
>>>
>>> Matthias
>>>
>>> On Monday 13 February 2012 22:36:50 Jeff Squyres wrote:
>>>> Also, double check that you have an optimized build, not a debugging
>>>> build.
>>>>
>>>> SVN and HG checkouts default to debugging builds, which add in lots of
>>>> latency.
>>>>
>>>> On Feb 13, 2012, at 10:22 AM, Ralph Castain wrote:
>>>>> Few thoughts:
>>>>>
>>>>> 1. Bind to socket is broken in 1.5.4 - fixed in the next release.
>>>>>
>>>>> 2. Add --report-bindings to the cmd line and see where it thinks the
>>>>> procs are bound.
>>>>>
>>>>> 3. Sounds like memory may not be local - might be worth checking memory
>>>>> binding.
>>>>>
>>>>> Sent from my iPad
>>>>>
>>>>> On Feb 13, 2012, at 7:07 AM, Matthias Jurenz <matthias.jurenz@tu-dresden.de> wrote:
>>>>>> Hi Sylvain,
>>>>>>
>>>>>> thanks for the quick response!
>>>>>>
>>>>>> Here are some results with process binding enabled. I hope I used the
>>>>>> parameters correctly...
>>>>>>
>>>>>> bind two ranks to one socket:
>>>>>> $ mpirun -np 2 --bind-to-core ./all2all
>>>>>> $ mpirun -np 2 -mca mpi_paffinity_alone 1 ./all2all
>>>>>>
>>>>>> bind two ranks to two different sockets:
>>>>>> $ mpirun -np 2 --bind-to-socket ./all2all
>>>>>>
>>>>>> All three runs resulted in similarly bad latencies (~1.4us).
>>>>>>
>>>>>> :-(
>>>>>>
>>>>>> Matthias
>>>>>>
>>>>>> On Monday 13 February 2012 12:43:22 sylvain.jeau...@bull.net wrote:
>>>>>>> Hi Matthias,
>>>>>>>
>>>>>>> You might want to play with process binding to see if your problem is
>>>>>>> related to bad memory affinity.
>>>>>>>
>>>>>>> Try to launch the pingpong on two CPUs of the same socket, then on
>>>>>>> different sockets (i.e. bind each process to a core, and try
>>>>>>> different configurations).
>>>>>>>
>>>>>>> Sylvain
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> From: Matthias Jurenz <matthias.jur...@tu-dresden.de>
>>>>>>> To: Open MPI Developers <de...@open-mpi.org>
>>>>>>> Date: 13/02/2012 12:12
>>>>>>> Subject: [OMPI devel] poor btl sm latency
>>>>>>> Sent by: devel-boun...@open-mpi.org
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Hello all,
>>>>>>>
>>>>>>> on our new AMD cluster (AMD Opteron 6274, 2.2 GHz) we get very bad
>>>>>>> latencies (~1.5us) when performing 0-byte p2p communication on one
>>>>>>> single node using the Open MPI sm BTL. When using Platform MPI we get
>>>>>>> ~0.5us latencies, which is pretty good. The bandwidth results are
>>>>>>> similar for both MPI implementations (~3.3 GB/s) - this is okay.
>>>>>>>
>>>>>>> One node has 64 cores and 64 GB RAM, and it doesn't matter how many
>>>>>>> ranks are allocated by the application - we get similar results with
>>>>>>> different numbers of ranks.
>>>>>>>
>>>>>>> We are using Open MPI 1.5.4, built with gcc 4.3.4 without any special
>>>>>>> configure options except the installation prefix and the location of
>>>>>>> the LSF stuff.
>>>>>>>
>>>>>>> As mentioned at http://www.open-mpi.org/faq/?category=sm we tried to
>>>>>>> use /dev/shm instead of /tmp for the session directory, but it had no
>>>>>>> effect. Furthermore, we tried the current release candidate 1.5.5rc1
>>>>>>> of Open MPI, which provides an option to use SysV shared memory
>>>>>>> (-mca shmem sysv) - but this also results in similarly poor latencies.
>>>>>>>
>>>>>>> Do you have any idea? Please help!
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Matthias
>>>>>>> _______________________________________________
>>>>>>> devel mailing list
>>>>>>> de...@open-mpi.org
>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>
>>>>>> _______________________________________________
>>>>>> devel mailing list
>>>>>> de...@open-mpi.org
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>
>>>>> _______________________________________________
>>>>> devel mailing list
>>>>> de...@open-mpi.org
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/