The inconsistent results disappear when using the option '--cpus-per-proc 2'. I assume this is because each core shares the L1 instruction cache and the L2 cache with its neighboring core (see http://upload.wikimedia.org/wikipedia/commons/e/ec/AMD_Bulldozer_block_diagram_%288_core_CPU%29.PNG).
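
To double-check that binding, --report-bindings (used further down in this thread) can be added to the same command line. This is only a minimal sketch reusing options that already appear in this thread; the reported mask for each rank should then cover two neighboring cores rather than a single one:

# Identical to the run below, only with the binding report enabled:
$ mpirun -np 2 --bind-to-core --cpus-per-proc 2 --report-bindings \
    ./NPmpi_ompi1.5.5 -S -u 12 -n 100000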

However, the latencies are now constant, but still too high:

$ mpirun -np 2 --bind-to-core --cpus-per-proc 2 ./NPmpi_ompi1.5.5 -S -u 12 -n 100000
Using synchronous sends
0: n029
1: n029
Using synchronous sends
Now starting the main loop
0: 1 bytes 100000 times --> 4.83 Mbps in 1.58 usec
1: 2 bytes 100000 times --> 9.64 Mbps in 1.58 usec
2: 3 bytes 100000 times --> 14.59 Mbps in 1.57 usec
3: 4 bytes 100000 times --> 19.44 Mbps in 1.57 usec
4: 6 bytes 100000 times --> 29.34 Mbps in 1.56 usec
5: 8 bytes 100000 times --> 38.95 Mbps in 1.57 usec
6: 12 bytes 100000 times --> 58.49 Mbps in 1.57 usec

I updated the Open MPI installation to version 1.5.5rc2r25939 to get hwloc 1.3.2.

$ ompi_info | grep hwloc
MCA paffinity: hwloc (MCA v2.0, API v2.0, Component v1.5.5)
MCA maffinity: hwloc (MCA v2.0, API v2.0, Component v1.5.5)
MCA hwloc: hwloc132 (MCA v2.0, API v2.0, Component v1.5.5)

Here is the output of lstopo from a single compute node. I'm surprised that the L1/L2 sharing isn't visible - not even in the graphical output...

$ lstopo --output-format console
Machine (64GB)
  Socket L#0 (16GB)
    NUMANode L#0 (P#0 8190MB) + L3 L#0 (6144KB)
      L2 L#0 (2048KB) + L1 L#0 (16KB) + Core L#0 + PU L#0 (P#0)
      L2 L#1 (2048KB) + L1 L#1 (16KB) + Core L#1 + PU L#1 (P#1)
      L2 L#2 (2048KB) + L1 L#2 (16KB) + Core L#2 + PU L#2 (P#2)
      L2 L#3 (2048KB) + L1 L#3 (16KB) + Core L#3 + PU L#3 (P#3)
      L2 L#4 (2048KB) + L1 L#4 (16KB) + Core L#4 + PU L#4 (P#4)
      L2 L#5 (2048KB) + L1 L#5 (16KB) + Core L#5 + PU L#5 (P#5)
      L2 L#6 (2048KB) + L1 L#6 (16KB) + Core L#6 + PU L#6 (P#6)
      L2 L#7 (2048KB) + L1 L#7 (16KB) + Core L#7 + PU L#7 (P#7)
    NUMANode L#1 (P#1 8192MB) + L3 L#1 (6144KB)
      L2 L#8 (2048KB) + L1 L#8 (16KB) + Core L#8 + PU L#8 (P#8)
      L2 L#9 (2048KB) + L1 L#9 (16KB) + Core L#9 + PU L#9 (P#9)
      L2 L#10 (2048KB) + L1 L#10 (16KB) + Core L#10 + PU L#10 (P#10)
      L2 L#11 (2048KB) + L1 L#11 (16KB) + Core L#11 + PU L#11 (P#11)
      L2 L#12 (2048KB) + L1 L#12 (16KB) + Core L#12 + PU L#12 (P#12)
      L2 L#13 (2048KB) + L1 L#13 (16KB) + Core L#13 + PU L#13 (P#13)
      L2 L#14 (2048KB) + L1 L#14 (16KB) + Core L#14 + PU L#14 (P#14)
      L2 L#15 (2048KB) + L1 L#15 (16KB) + Core L#15 + PU L#15 (P#15)
  Socket L#1 (16GB)
    NUMANode L#2 (P#2 8192MB) + L3 L#2 (6144KB)
      L2 L#16 (2048KB) + L1 L#16 (16KB) + Core L#16 + PU L#16 (P#16)
      L2 L#17 (2048KB) + L1 L#17 (16KB) + Core L#17 + PU L#17 (P#17)
      L2 L#18 (2048KB) + L1 L#18 (16KB) + Core L#18 + PU L#18 (P#18)
      L2 L#19 (2048KB) + L1 L#19 (16KB) + Core L#19 + PU L#19 (P#19)
      L2 L#20 (2048KB) + L1 L#20 (16KB) + Core L#20 + PU L#20 (P#20)
      L2 L#21 (2048KB) + L1 L#21 (16KB) + Core L#21 + PU L#21 (P#21)
      L2 L#22 (2048KB) + L1 L#22 (16KB) + Core L#22 + PU L#22 (P#22)
      L2 L#23 (2048KB) + L1 L#23 (16KB) + Core L#23 + PU L#23 (P#23)
    NUMANode L#3 (P#3 8192MB) + L3 L#3 (6144KB)
      L2 L#24 (2048KB) + L1 L#24 (16KB) + Core L#24 + PU L#24 (P#24)
      L2 L#25 (2048KB) + L1 L#25 (16KB) + Core L#25 + PU L#25 (P#25)
      L2 L#26 (2048KB) + L1 L#26 (16KB) + Core L#26 + PU L#26 (P#26)
      L2 L#27 (2048KB) + L1 L#27 (16KB) + Core L#27 + PU L#27 (P#27)
      L2 L#28 (2048KB) + L1 L#28 (16KB) + Core L#28 + PU L#28 (P#28)
      L2 L#29 (2048KB) + L1 L#29 (16KB) + Core L#29 + PU L#29 (P#29)
      L2 L#30 (2048KB) + L1 L#30 (16KB) + Core L#30 + PU L#30 (P#30)
      L2 L#31 (2048KB) + L1 L#31 (16KB) + Core L#31 + PU L#31 (P#31)
  Socket L#2 (16GB)
    NUMANode L#4 (P#4 8192MB) + L3 L#4 (6144KB)
      L2 L#32 (2048KB) + L1 L#32 (16KB) + Core L#32 + PU L#32 (P#32)
      L2 L#33 (2048KB) + L1 L#33 (16KB) + Core L#33 + PU L#33 (P#33)
      L2 L#34 (2048KB) + L1 L#34 (16KB) + Core L#34 + PU L#34 (P#34)
      L2 L#35 (2048KB) + L1 L#35 (16KB) + Core L#35 + PU L#35 (P#35)
      L2 L#36 (2048KB) + L1 L#36 (16KB) + Core L#36 + PU L#36 (P#36)
      L2 L#37 (2048KB) + L1 L#37 (16KB) + Core L#37 + PU L#37 (P#37)
      L2 L#38 (2048KB) + L1 L#38 (16KB) + Core L#38 + PU L#38 (P#38)
      L2 L#39 (2048KB) + L1 L#39 (16KB) + Core L#39 + PU L#39 (P#39)
    NUMANode L#5 (P#5 8192MB) + L3 L#5 (6144KB)
      L2 L#40 (2048KB) + L1 L#40 (16KB) + Core L#40 + PU L#40 (P#40)
      L2 L#41 (2048KB) + L1 L#41 (16KB) + Core L#41 + PU L#41 (P#41)
      L2 L#42 (2048KB) + L1 L#42 (16KB) + Core L#42 + PU L#42 (P#42)
      L2 L#43 (2048KB) + L1 L#43 (16KB) + Core L#43 + PU L#43 (P#43)
      L2 L#44 (2048KB) + L1 L#44 (16KB) + Core L#44 + PU L#44 (P#44)
      L2 L#45 (2048KB) + L1 L#45 (16KB) + Core L#45 + PU L#45 (P#45)
      L2 L#46 (2048KB) + L1 L#46 (16KB) + Core L#46 + PU L#46 (P#46)
      L2 L#47 (2048KB) + L1 L#47 (16KB) + Core L#47 + PU L#47 (P#47)
  Socket L#3 (16GB)
    NUMANode L#6 (P#6 8192MB) + L3 L#6 (6144KB)
      L2 L#48 (2048KB) + L1 L#48 (16KB) + Core L#48 + PU L#48 (P#48)
      L2 L#49 (2048KB) + L1 L#49 (16KB) + Core L#49 + PU L#49 (P#49)
      L2 L#50 (2048KB) + L1 L#50 (16KB) + Core L#50 + PU L#50 (P#50)
      L2 L#51 (2048KB) + L1 L#51 (16KB) + Core L#51 + PU L#51 (P#51)
      L2 L#52 (2048KB) + L1 L#52 (16KB) + Core L#52 + PU L#52 (P#52)
      L2 L#53 (2048KB) + L1 L#53 (16KB) + Core L#53 + PU L#53 (P#53)
      L2 L#54 (2048KB) + L1 L#54 (16KB) + Core L#54 + PU L#54 (P#54)
      L2 L#55 (2048KB) + L1 L#55 (16KB) + Core L#55 + PU L#55 (P#55)
    NUMANode L#7 (P#7 8192MB) + L3 L#7 (6144KB)
      L2 L#56 (2048KB) + L1 L#56 (16KB) + Core L#56 + PU L#56 (P#56)
      L2 L#57 (2048KB) + L1 L#57 (16KB) + Core L#57 + PU L#57 (P#57)
      L2 L#58 (2048KB) + L1 L#58 (16KB) + Core L#58 + PU L#58 (P#58)
      L2 L#59 (2048KB) + L1 L#59 (16KB) + Core L#59 + PU L#59 (P#59)
      L2 L#60 (2048KB) + L1 L#60 (16KB) + Core L#60 + PU L#60 (P#60)
      L2 L#61 (2048KB) + L1 L#61 (16KB) + Core L#61 + PU L#61 (P#61)
      L2 L#62 (2048KB) + L1 L#62 (16KB) + Core L#62 + PU L#62 (P#62)
      L2 L#63 (2048KB) + L1 L#63 (16KB) + Core L#63 + PU L#63 (P#63)
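
Independent of hwloc, the kernel's own view of the cache topology can be checked via sysfs. A small sketch, assuming a Linux kernel that populates these standard attributes (whether the shared L1i/L2 of a Bulldozer module shows up here depends on kernel and BIOS):

# Print level, type and sharing of every cache seen for core 0 (bash brace expansion):
$ grep . /sys/devices/system/cpu/cpu0/cache/index*/{level,type,shared_cpu_list}
# If the kernel knows about the module-shared caches, the L2 (and instruction L1)
# entries should list two CPUs (e.g. "0-1"); if every entry lists only a single
# CPU, hwloc has nothing to report either.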

Matthias

On Thursday 16 February 2012 13:33:16 Jeff Squyres wrote:
> Yowza. With inconsistent results like that, it does sound like something
> is going on in the hardware. Unfortunately, I don't know much/anything
> about AMDs (Cisco is an Intel shop). :-\
>
> Do you have (AMD's equivalent of) hyperthreading enabled, perchance?
>
> In the latest 1.5.5 nightly tarball, I have just upgraded the included
> version of hwloc to be 1.3.2. Maybe a good step would be to download
> hwloc 1.3.2 and verify that lstopo is faithfully reporting the actual
> topology of your system. Can you do that?
>
> On Feb 16, 2012, at 7:06 AM, Matthias Jurenz wrote:
> > Jeff,
> >
> > sorry for the confusion - the all2all is a classic pingpong which uses
> > MPI_Send/Recv with 0 byte messages.
> >
> > One thing I just noticed when using NetPIPE/MPI: Platform MPI results in
> > almost constant latencies for small messages (~0.89us), though I don't
> > know about process binding in Platform MPI - I just used the defaults.
> > When using Open MPI (regardless of core/socket-binding) the results
> > differ from run to run:
> >
> > === FIRST RUN ===
> > $ mpirun -np 2 --bind-to-socket ./NPmpi_ompi1.5.5 -S -u 12 -n 100000
> > Using synchronous sends
> > 1: n029
> > Using synchronous sends
> > 0: n029
> > Now starting the main loop
> >
> > 0: 1 bytes 100000 times --> 4.66 Mbps in 1.64 usec
> > 1: 2 bytes 100000 times --> 8.94 Mbps in 1.71 usec
> > 2: 3 bytes 100000 times --> 13.65 Mbps in 1.68 usec
> > 3: 4 bytes 100000 times --> 17.91 Mbps in 1.70 usec
> > 4: 6 bytes 100000 times --> 29.04 Mbps in 1.58 usec
> > 5: 8 bytes 100000 times --> 39.06 Mbps in 1.56 usec
> > 6: 12 bytes 100000 times --> 57.58 Mbps in 1.59 usec
> >
> > === SECOND RUN (~3s after the previous run) ===
> > $ mpirun -np 2 --bind-to-socket ./NPmpi_ompi1.5.5 -S -u 12 -n 100000
> > Using synchronous sends
> > 1: n029
> > Using synchronous sends
> > 0: n029
> > Now starting the main loop
> >
> > 0: 1 bytes 100000 times --> 5.73 Mbps in 1.33 usec
> > 1: 2 bytes 100000 times --> 11.45 Mbps in 1.33 usec
> > 2: 3 bytes 100000 times --> 17.13 Mbps in 1.34 usec
> > 3: 4 bytes 100000 times --> 22.94 Mbps in 1.33 usec
> > 4: 6 bytes 100000 times --> 34.39 Mbps in 1.33 usec
> > 5: 8 bytes 100000 times --> 46.40 Mbps in 1.32 usec
> > 6: 12 bytes 100000 times --> 68.92 Mbps in 1.33 usec
> >
> > === THIRD RUN ===
> > $ mpirun -np 2 --bind-to-socket ./NPmpi_ompi1.5.5 -S -u 12 -n 100000
> > Using synchronous sends
> > 0: n029
> > Using synchronous sends
> > 1: n029
> > Now starting the main loop
> >
> > 0: 1 bytes 100000 times --> 3.50 Mbps in 2.18 usec
> > 1: 2 bytes 100000 times --> 6.99 Mbps in 2.18 usec
> > 2: 3 bytes 100000 times --> 10.48 Mbps in 2.18 usec
> > 3: 4 bytes 100000 times --> 14.00 Mbps in 2.18 usec
> > 4: 6 bytes 100000 times --> 20.98 Mbps in 2.18 usec
> > 5: 8 bytes 100000 times --> 27.84 Mbps in 2.19 usec
> > 6: 12 bytes 100000 times --> 41.99 Mbps in 2.18 usec
> >
> > At first appearance, I assumed that some CPU power saving feature is
> > enabled. But the CPU frequency scaling is set to "performance" and there
> > is only one available frequency (2.2GHz).
> >
> > Any idea how this can happen?
> >
> >
> > Matthias
> >
> > On Wednesday 15 February 2012 19:29:38 Jeff Squyres wrote:
> >> Something is definitely wrong -- 1.4us is way too high for a 0 or 1 byte
> >> HRT ping pong. What is this all2all benchmark, btw? Is it measuring an
> >> MPI_ALLTOALL, or a pingpong?
> >>
> >> FWIW, on an older Nehalem machine running NetPIPE/MPI, I'm getting about
> >> .27us latencies for short messages over sm and binding to socket.
> >>
> >> On Feb 14, 2012, at 7:20 AM, Matthias Jurenz wrote:
> >>> I've built Open MPI 1.5.5rc1 (tarball from Web) with CFLAGS=-O3.
> >>> Unfortunately, also without any effect.
> >>>
> >>> Here are some results with binding reports enabled:
> >>>
> >>> $ mpirun *--bind-to-core* --report-bindings -np 2 ./all2all_ompi1.5.5
> >>> [n043:61313] [[56788,0],0] odls:default:fork binding child
> >>> [[56788,1],1] to cpus 0002
> >>> [n043:61313] [[56788,0],0] odls:default:fork binding child
> >>> [[56788,1],0] to cpus 0001
> >>> latency: 1.415us
> >>>
> >>> $ mpirun *-mca maffinity hwloc --bind-to-core* --report-bindings -np 2
> >>> ./all2all_ompi1.5.5
> >>> [n043:61469] [[49736,0],0] odls:default:fork binding child
> >>> [[49736,1],1] to cpus 0002
> >>> [n043:61469] [[49736,0],0] odls:default:fork binding child
> >>> [[49736,1],0] to cpus 0001
> >>> latency: 1.4us
> >>>
> >>> $ mpirun *-mca maffinity first_use --bind-to-core* --report-bindings
> >>> -np 2 ./all2all_ompi1.5.5
> >>> [n043:61508] [[49681,0],0] odls:default:fork binding child
> >>> [[49681,1],1] to cpus 0002
> >>> [n043:61508] [[49681,0],0] odls:default:fork binding child
> >>> [[49681,1],0] to cpus 0001
> >>> latency: 1.4us
> >>>
> >>>
> >>> $ mpirun *--bind-to-socket* --report-bindings -np 2 ./all2all_ompi1.5.5
> >>> [n043:61337] [[56780,0],0] odls:default:fork binding child
> >>> [[56780,1],1] to socket 0 cpus 0001
> >>> [n043:61337] [[56780,0],0] odls:default:fork binding child
> >>> [[56780,1],0] to socket 0 cpus 0001
> >>> latency: 4.0us
> >>>
> >>> $ mpirun *-mca maffinity hwloc --bind-to-socket* --report-bindings -np
> >>> 2 ./all2all_ompi1.5.5
> >>> [n043:61615] [[49914,0],0] odls:default:fork binding child
> >>> [[49914,1],1] to socket 0 cpus 0001
> >>> [n043:61615] [[49914,0],0] odls:default:fork binding child
> >>> [[49914,1],0] to socket 0 cpus 0001
> >>> latency: 4.0us
> >>>
> >>> $ mpirun *-mca maffinity first_use --bind-to-socket* --report-bindings
> >>> -np 2 ./all2all_ompi1.5.5
> >>> [n043:61639] [[49810,0],0] odls:default:fork binding child
> >>> [[49810,1],1] to socket 0 cpus 0001
> >>> [n043:61639] [[49810,0],0] odls:default:fork binding child
> >>> [[49810,1],0] to socket 0 cpus 0001
> >>> latency: 4.0us
> >>>
> >>>
> >>> If socket binding is enabled, it seems that all ranks are bound to the
> >>> very first core of one and the same socket. Is this intended? I expected
> >>> that each rank would get its own socket (i.e. 2 ranks -> 2 sockets)...
> >>>
> >>> Matthias
> >>>
> >>> On Monday 13 February 2012 22:36:50 Jeff Squyres wrote:
> >>>> Also, double check that you have an optimized build, not a debugging
> >>>> build.
> >>>>
> >>>> SVN and HG checkouts default to debugging builds, which add in lots of
> >>>> latency.
> >>>>
> >>>> On Feb 13, 2012, at 10:22 AM, Ralph Castain wrote:
> >>>>> Few thoughts
> >>>>>
> >>>>> 1. Bind to socket is broken in 1.5.4 - fixed in next release
> >>>>>
> >>>>> 2. Add --report-bindings to cmd line and see where it thinks the
> >>>>> procs are bound
> >>>>>
> >>>>> 3. Sounds like memory may not be local - might be worth checking mem
> >>>>> binding.
> >>>>>
> >>>>> Sent from my iPad
> >>>>>
> >>>>> On Feb 13, 2012, at 7:07 AM, Matthias Jurenz <matthias.jurenz@tu-dresden.de> wrote:
> >>>>>> Hi Sylvain,
> >>>>>>
> >>>>>> thanks for the quick response!
> >>>>>>
> >>>>>> Here are some results with process binding enabled. I hope I used the
> >>>>>> parameters correctly...
> >>>>>>
> >>>>>> bind two ranks to one socket:
> >>>>>> $ mpirun -np 2 --bind-to-core ./all2all
> >>>>>> $ mpirun -np 2 -mca mpi_paffinity_alone 1 ./all2all
> >>>>>>
> >>>>>> bind two ranks to two different sockets:
> >>>>>> $ mpirun -np 2 --bind-to-socket ./all2all
> >>>>>>
> >>>>>> All three runs resulted in similarly bad latencies (~1.4us).
> >>>>>>
> >>>>>> :-(
> >>>>>>
> >>>>>> Matthias
> >>>>>>
> >>>>>> On Monday 13 February 2012 12:43:22 sylvain.jeau...@bull.net wrote:
> >>>>>>> Hi Matthias,
> >>>>>>>
> >>>>>>> You might want to play with process binding to see if your problem
> >>>>>>> is related to bad memory affinity.
> >>>>>>>
> >>>>>>> Try to launch the pingpong on two CPUs of the same socket, then on
> >>>>>>> different sockets (i.e. bind each process to a core, and try
> >>>>>>> different configurations).
> >>>>>>>
> >>>>>>> Sylvain
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> From: Matthias Jurenz <matthias.jur...@tu-dresden.de>
> >>>>>>> To: Open MPI Developers <de...@open-mpi.org>
> >>>>>>> Date: 13/02/2012 12:12
> >>>>>>> Subject: [OMPI devel] poor btl sm latency
> >>>>>>> Sent by: devel-boun...@open-mpi.org
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Hello all,
> >>>>>>>
> >>>>>>> on our new AMD cluster (AMD Opteron 6274, 2.2GHz) we get very bad
> >>>>>>> latencies (~1.5us) when performing 0-byte p2p communication on one
> >>>>>>> single node using the Open MPI sm BTL. When using Platform MPI we
> >>>>>>> get ~0.5us latencies, which is pretty good. The bandwidth results
> >>>>>>> are similar for both MPI implementations (~3.3GB/s) - this is okay.
> >>>>>>>
> >>>>>>> One node has 64 cores and 64GB RAM, and it doesn't matter how many
> >>>>>>> ranks are allocated by the application. We get similar results with
> >>>>>>> different numbers of ranks.
> >>>>>>>
> >>>>>>> We are using Open MPI 1.5.4, built with gcc 4.3.4 without any
> >>>>>>> special configure options except the installation prefix and the
> >>>>>>> location of the LSF stuff.
> >>>>>>>
> >>>>>>> As mentioned at http://www.open-mpi.org/faq/?category=sm we tried
> >>>>>>> to use /dev/shm instead of /tmp for the session directory, but it
> >>>>>>> had no effect. Furthermore, we tried the current release candidate
> >>>>>>> 1.5.5rc1 of Open MPI, which provides an option to use SysV shared
> >>>>>>> memory (-mca shmem sysv) - this also results in similarly poor
> >>>>>>> latencies.
> >>>>>>>
> >>>>>>> Do you have any idea? Please help!
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Matthias
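
As a side note to the frequency-scaling remark quoted above (governor set to "performance", only one available frequency): on Linux this can be verified per core through sysfs. A minimal sketch, assuming the cpufreq driver is loaded on the compute nodes:

# Governor and frequency limits of core 0; loop over cpu* to check all cores.
$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq
$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
# With the "performance" governor and a single available frequency, min and max
# should both read 2200000 (kHz) on these 2.2 GHz Opterons.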