The inconsistent results disappear when using the option '--cpus-per-proc 2'. I assume this is because each core shares the L1 instruction cache and the L2 cache with its neighboring core (see http://upload.wikimedia.org/wikipedia/commons/e/ec/AMD_Bulldozer_block_diagram_%288_core_CPU%29.PNG).
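
To double-check that binding, --report-bindings (used further down in this thread) can be added to the same command line. This is only a minimal sketch reusing options that already appear in this thread; the reported mask for each rank should then cover two neighboring cores rather than a single one:

# Identical to the run below, only with the binding report enabled:
$ mpirun -np 2 --bind-to-core --cpus-per-proc 2 --report-bindings \
    ./NPmpi_ompi1.5.5 -S -u 12 -n 100000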

However, the latencies are now constant, but still too high:

$ mpirun -np 2 --bind-to-core --cpus-per-proc 2 ./NPmpi_ompi1.5.5 -S -u 12 -n 100000
Using synchronous sends
0: n029
1: n029
Using synchronous sends
Now starting the main loop
0: 1 bytes 100000 times --> 4.83 Mbps in 1.58 usec
1: 2 bytes 100000 times --> 9.64 Mbps in 1.58 usec
2: 3 bytes 100000 times --> 14.59 Mbps in 1.57 usec
3: 4 bytes 100000 times --> 19.44 Mbps in 1.57 usec
4: 6 bytes 100000 times --> 29.34 Mbps in 1.56 usec
5: 8 bytes 100000 times --> 38.95 Mbps in 1.57 usec
6: 12 bytes 100000 times --> 58.49 Mbps in 1.57 usec

I updated the Open MPI installation to version 1.5.5rc2r25939 to get hwloc 1.3.2.

$ ompi_info | grep hwloc
MCA paffinity: hwloc (MCA v2.0, API v2.0, Component v1.5.5)
MCA maffinity: hwloc (MCA v2.0, API v2.0, Component v1.5.5)
MCA hwloc: hwloc132 (MCA v2.0, API v2.0, Component v1.5.5)

Here is the output of lstopo from a single compute node. I'm surprised that the L1/L2 sharing isn't visible - not even in the graphical output...

$ lstopo --output-format console
Machine (64GB)
  Socket L#0 (16GB)
    NUMANode L#0 (P#0 8190MB) + L3 L#0 (6144KB)
      L2 L#0 (2048KB) + L1 L#0 (16KB) + Core L#0 + PU L#0 (P#0)
      L2 L#1 (2048KB) + L1 L#1 (16KB) + Core L#1 + PU L#1 (P#1)
      L2 L#2 (2048KB) + L1 L#2 (16KB) + Core L#2 + PU L#2 (P#2)
      L2 L#3 (2048KB) + L1 L#3 (16KB) + Core L#3 + PU L#3 (P#3)
      L2 L#4 (2048KB) + L1 L#4 (16KB) + Core L#4 + PU L#4 (P#4)
      L2 L#5 (2048KB) + L1 L#5 (16KB) + Core L#5 + PU L#5 (P#5)
      L2 L#6 (2048KB) + L1 L#6 (16KB) + Core L#6 + PU L#6 (P#6)
      L2 L#7 (2048KB) + L1 L#7 (16KB) + Core L#7 + PU L#7 (P#7)
    NUMANode L#1 (P#1 8192MB) + L3 L#1 (6144KB)
      L2 L#8 (2048KB) + L1 L#8 (16KB) + Core L#8 + PU L#8 (P#8)
      L2 L#9 (2048KB) + L1 L#9 (16KB) + Core L#9 + PU L#9 (P#9)
      L2 L#10 (2048KB) + L1 L#10 (16KB) + Core L#10 + PU L#10 (P#10)
      L2 L#11 (2048KB) + L1 L#11 (16KB) + Core L#11 + PU L#11 (P#11)
      L2 L#12 (2048KB) + L1 L#12 (16KB) + Core L#12 + PU L#12 (P#12)
      L2 L#13 (2048KB) + L1 L#13 (16KB) + Core L#13 + PU L#13 (P#13)
      L2 L#14 (2048KB) + L1 L#14 (16KB) + Core L#14 + PU L#14 (P#14)
      L2 L#15 (2048KB) + L1 L#15 (16KB) + Core L#15 + PU L#15 (P#15)
  Socket L#1 (16GB)
    NUMANode L#2 (P#2 8192MB) + L3 L#2 (6144KB)
      L2 L#16 (2048KB) + L1 L#16 (16KB) + Core L#16 + PU L#16 (P#16)
      L2 L#17 (2048KB) + L1 L#17 (16KB) + Core L#17 + PU L#17 (P#17)
      L2 L#18 (2048KB) + L1 L#18 (16KB) + Core L#18 + PU L#18 (P#18)
      L2 L#19 (2048KB) + L1 L#19 (16KB) + Core L#19 + PU L#19 (P#19)
      L2 L#20 (2048KB) + L1 L#20 (16KB) + Core L#20 + PU L#20 (P#20)
      L2 L#21 (2048KB) + L1 L#21 (16KB) + Core L#21 + PU L#21 (P#21)
      L2 L#22 (2048KB) + L1 L#22 (16KB) + Core L#22 + PU L#22 (P#22)
      L2 L#23 (2048KB) + L1 L#23 (16KB) + Core L#23 + PU L#23 (P#23)
    NUMANode L#3 (P#3 8192MB) + L3 L#3 (6144KB)
      L2 L#24 (2048KB) + L1 L#24 (16KB) + Core L#24 + PU L#24 (P#24)
      L2 L#25 (2048KB) + L1 L#25 (16KB) + Core L#25 + PU L#25 (P#25)
      L2 L#26 (2048KB) + L1 L#26 (16KB) + Core L#26 + PU L#26 (P#26)
      L2 L#27 (2048KB) + L1 L#27 (16KB) + Core L#27 + PU L#27 (P#27)
      L2 L#28 (2048KB) + L1 L#28 (16KB) + Core L#28 + PU L#28 (P#28)
      L2 L#29 (2048KB) + L1 L#29 (16KB) + Core L#29 + PU L#29 (P#29)
      L2 L#30 (2048KB) + L1 L#30 (16KB) + Core L#30 + PU L#30 (P#30)
      L2 L#31 (2048KB) + L1 L#31 (16KB) + Core L#31 + PU L#31 (P#31)
  Socket L#2 (16GB)
    NUMANode L#4 (P#4 8192MB) + L3 L#4 (6144KB)
      L2 L#32 (2048KB) + L1 L#32 (16KB) + Core L#32 + PU L#32 (P#32)
      L2 L#33 (2048KB) + L1 L#33 (16KB) + Core L#33 + PU L#33 (P#33)
      L2 L#34 (2048KB) + L1 L#34 (16KB) + Core L#34 + PU L#34 (P#34)
      L2 L#35 (2048KB) + L1 L#35 (16KB) + Core L#35 + PU L#35 (P#35)
      L2 L#36 (2048KB) + L1 L#36 (16KB) + Core L#36 + PU L#36 (P#36)
      L2 L#37 (2048KB) + L1 L#37 (16KB) + Core L#37 + PU L#37 (P#37)
      L2 L#38 (2048KB) + L1 L#38 (16KB) + Core L#38 + PU L#38 (P#38)
      L2 L#39 (2048KB) + L1 L#39 (16KB) + Core L#39 + PU L#39 (P#39)
    NUMANode L#5 (P#5 8192MB) + L3 L#5 (6144KB)
      L2 L#40 (2048KB) + L1 L#40 (16KB) + Core L#40 + PU L#40 (P#40)
      L2 L#41 (2048KB) + L1 L#41 (16KB) + Core L#41 + PU L#41 (P#41)
      L2 L#42 (2048KB) + L1 L#42 (16KB) + Core L#42 + PU L#42 (P#42)
      L2 L#43 (2048KB) + L1 L#43 (16KB) + Core L#43 + PU L#43 (P#43)
      L2 L#44 (2048KB) + L1 L#44 (16KB) + Core L#44 + PU L#44 (P#44)
      L2 L#45 (2048KB) + L1 L#45 (16KB) + Core L#45 + PU L#45 (P#45)
      L2 L#46 (2048KB) + L1 L#46 (16KB) + Core L#46 + PU L#46 (P#46)
      L2 L#47 (2048KB) + L1 L#47 (16KB) + Core L#47 + PU L#47 (P#47)
  Socket L#3 (16GB)
    NUMANode L#6 (P#6 8192MB) + L3 L#6 (6144KB)
      L2 L#48 (2048KB) + L1 L#48 (16KB) + Core L#48 + PU L#48 (P#48)
      L2 L#49 (2048KB) + L1 L#49 (16KB) + Core L#49 + PU L#49 (P#49)
      L2 L#50 (2048KB) + L1 L#50 (16KB) + Core L#50 + PU L#50 (P#50)
      L2 L#51 (2048KB) + L1 L#51 (16KB) + Core L#51 + PU L#51 (P#51)
      L2 L#52 (2048KB) + L1 L#52 (16KB) + Core L#52 + PU L#52 (P#52)
      L2 L#53 (2048KB) + L1 L#53 (16KB) + Core L#53 + PU L#53 (P#53)
      L2 L#54 (2048KB) + L1 L#54 (16KB) + Core L#54 + PU L#54 (P#54)
      L2 L#55 (2048KB) + L1 L#55 (16KB) + Core L#55 + PU L#55 (P#55)
    NUMANode L#7 (P#7 8192MB) + L3 L#7 (6144KB)
      L2 L#56 (2048KB) + L1 L#56 (16KB) + Core L#56 + PU L#56 (P#56)
      L2 L#57 (2048KB) + L1 L#57 (16KB) + Core L#57 + PU L#57 (P#57)
      L2 L#58 (2048KB) + L1 L#58 (16KB) + Core L#58 + PU L#58 (P#58)
      L2 L#59 (2048KB) + L1 L#59 (16KB) + Core L#59 + PU L#59 (P#59)
      L2 L#60 (2048KB) + L1 L#60 (16KB) + Core L#60 + PU L#60 (P#60)
      L2 L#61 (2048KB) + L1 L#61 (16KB) + Core L#61 + PU L#61 (P#61)
      L2 L#62 (2048KB) + L1 L#62 (16KB) + Core L#62 + PU L#62 (P#62)
      L2 L#63 (2048KB) + L1 L#63 (16KB) + Core L#63 + PU L#63 (P#63)
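
Independent of hwloc, the kernel's own view of the cache topology can be checked via sysfs. A small sketch, assuming a Linux kernel that populates these standard attributes (whether the shared L1i/L2 of a Bulldozer module shows up here depends on kernel and BIOS):

# Print level, type and sharing of every cache seen for core 0 (bash brace expansion):
$ grep . /sys/devices/system/cpu/cpu0/cache/index*/{level,type,shared_cpu_list}
# If the kernel knows about the module-shared caches, the L2 (and instruction L1)
# entries should list two CPUs (e.g. "0-1"); if every entry lists only a single
# CPU, hwloc has nothing to report either.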

Matthias

On Thursday 16 February 2012 13:33:16 Jeff Squyres wrote:
> Yowza. With inconsistent results like that, it does sound like something
> is going on in the hardware. Unfortunately, I don't know much/anything
> about AMDs (Cisco is an Intel shop). :-\
>
> Do you have (AMD's equivalent of) hyperthreading enabled, perchance?
>
> In the latest 1.5.5 nightly tarball, I have just upgraded the included
> version of hwloc to be 1.3.2. Maybe a good step would be to download
> hwloc 1.3.2 and verify that lstopo is faithfully reporting the actual
> topology of your system. Can you do that?
>
> On Feb 16, 2012, at 7:06 AM, Matthias Jurenz wrote:
> > Jeff,
> >
> > sorry for the confusion - the all2all is a classic pingpong which uses
> > MPI_Send/Recv with 0 byte messages.
> >
> > One thing I just noticed when using NetPIPE/MPI: Platform MPI results in
> > almost constant latencies for small messages (~0.89us), though I don't
> > know about process binding in Platform MPI - I just used the defaults.
> > When using Open MPI (regardless of core/socket-binding) the results
> > differ from run to run:
> >
> > === FIRST RUN ===
> > $ mpirun -np 2 --bind-to-socket ./NPmpi_ompi1.5.5 -S -u 12 -n 100000
> > Using synchronous sends
> > 1: n029
> > Using synchronous sends
> > 0: n029
> > Now starting the main loop
> >
> > 0: 1 bytes 100000 times --> 4.66 Mbps in 1.64 usec
> > 1: 2 bytes 100000 times --> 8.94 Mbps in 1.71 usec
> > 2: 3 bytes 100000 times --> 13.65 Mbps in 1.68 usec
> > 3: 4 bytes 100000 times --> 17.91 Mbps in 1.70 usec
> > 4: 6 bytes 100000 times --> 29.04 Mbps in 1.58 usec
> > 5: 8 bytes 100000 times --> 39.06 Mbps in 1.56 usec
> > 6: 12 bytes 100000 times --> 57.58 Mbps in 1.59 usec
> >
> > === SECOND RUN (~3s after the previous run) ===
> > $ mpirun -np 2 --bind-to-socket ./NPmpi_ompi1.5.5 -S -u 12 -n 100000
> > Using synchronous sends
> > 1: n029
> > Using synchronous sends
> > 0: n029
> > Now starting the main loop
> >
> > 0: 1 bytes 100000 times --> 5.73 Mbps in 1.33 usec
> > 1: 2 bytes 100000 times --> 11.45 Mbps in 1.33 usec
> > 2: 3 bytes 100000 times --> 17.13 Mbps in 1.34 usec
> > 3: 4 bytes 100000 times --> 22.94 Mbps in 1.33 usec
> > 4: 6 bytes 100000 times --> 34.39 Mbps in 1.33 usec
> > 5: 8 bytes 100000 times --> 46.40 Mbps in 1.32 usec
> > 6: 12 bytes 100000 times --> 68.92 Mbps in 1.33 usec
> >
> > === THIRD RUN ===
> > $ mpirun -np 2 --bind-to-socket ./NPmpi_ompi1.5.5 -S -u 12 -n 100000
> > Using synchronous sends
> > 0: n029
> > Using synchronous sends
> > 1: n029
> > Now starting the main loop
> >
> > 0: 1 bytes 100000 times --> 3.50 Mbps in 2.18 usec
> > 1: 2 bytes 100000 times --> 6.99 Mbps in 2.18 usec
> > 2: 3 bytes 100000 times --> 10.48 Mbps in 2.18 usec
> > 3: 4 bytes 100000 times --> 14.00 Mbps in 2.18 usec
> > 4: 6 bytes 100000 times --> 20.98 Mbps in 2.18 usec
> > 5: 8 bytes 100000 times --> 27.84 Mbps in 2.19 usec
> > 6: 12 bytes 100000 times --> 41.99 Mbps in 2.18 usec
> >
> > At first appearance, I assumed that some CPU power saving feature is
> > enabled. But the CPU frequency scaling is set to "performance" and there
> > is only one available frequency (2.2GHz).
> >
> > Any idea how this can happen?
> >
> >
> > Matthias
> >
> > On Wednesday 15 February 2012 19:29:38 Jeff Squyres wrote:
> >> Something is definitely wrong -- 1.4us is way too high for a 0 or 1 byte
> >> HRT ping pong. What is this all2all benchmark, btw? Is it measuring an
> >> MPI_ALLTOALL, or a pingpong?
> >>
> >> FWIW, on an older Nehalem machine running NetPIPE/MPI, I'm getting about
> >> .27us latencies for short messages over sm and binding to socket.
> >>
> >> On Feb 14, 2012, at 7:20 AM, Matthias Jurenz wrote:
> >>> I've built Open MPI 1.5.5rc1 (tarball from Web) with CFLAGS=-O3.
> >>> Unfortunately, also without any effect.
> >>>
> >>> Here are some results with binding reports enabled:
> >>>
> >>> $ mpirun *--bind-to-core* --report-bindings -np 2 ./all2all_ompi1.5.5
> >>> [n043:61313] [[56788,0],0] odls:default:fork binding child
> >>> [[56788,1],1] to cpus 0002
> >>> [n043:61313] [[56788,0],0] odls:default:fork binding child
> >>> [[56788,1],0] to cpus 0001
> >>> latency: 1.415us
> >>>
> >>> $ mpirun *-mca maffinity hwloc --bind-to-core* --report-bindings -np 2
> >>> ./all2all_ompi1.5.5
> >>> [n043:61469] [[49736,0],0] odls:default:fork binding child
> >>> [[49736,1],1] to cpus 0002
> >>> [n043:61469] [[49736,0],0] odls:default:fork binding child
> >>> [[49736,1],0] to cpus 0001
> >>> latency: 1.4us
> >>>
> >>> $ mpirun *-mca maffinity first_use --bind-to-core* --report-bindings
> >>> -np 2 ./all2all_ompi1.5.5
> >>> [n043:61508] [[49681,0],0] odls:default:fork binding child
> >>> [[49681,1],1] to cpus 0002
> >>> [n043:61508] [[49681,0],0] odls:default:fork binding child
> >>> [[49681,1],0] to cpus 0001
> >>> latency: 1.4us
> >>>
> >>>
> >>> $ mpirun *--bind-to-socket* --report-bindings -np 2 ./all2all_ompi1.5.5
> >>> [n043:61337] [[56780,0],0] odls:default:fork binding child
> >>> [[56780,1],1] to socket 0 cpus 0001
> >>> [n043:61337] [[56780,0],0] odls:default:fork binding child
> >>> [[56780,1],0] to socket 0 cpus 0001
> >>> latency: 4.0us
> >>>
> >>> $ mpirun *-mca maffinity hwloc --bind-to-socket* --report-bindings -np
> >>> 2 ./all2all_ompi1.5.5
> >>> [n043:61615] [[49914,0],0] odls:default:fork binding child
> >>> [[49914,1],1] to socket 0 cpus 0001
> >>> [n043:61615] [[49914,0],0] odls:default:fork binding child
> >>> [[49914,1],0] to socket 0 cpus 0001
> >>> latency: 4.0us
> >>>
> >>> $ mpirun *-mca maffinity first_use --bind-to-socket* --report-bindings
> >>> -np 2 ./all2all_ompi1.5.5
> >>> [n043:61639] [[49810,0],0] odls:default:fork binding child
> >>> [[49810,1],1] to socket 0 cpus 0001
> >>> [n043:61639] [[49810,0],0] odls:default:fork binding child
> >>> [[49810,1],0] to socket 0 cpus 0001
> >>> latency: 4.0us
> >>>
> >>>
> >>> If socket binding is enabled, it seems that all ranks are bound to the
> >>> very first core of one and the same socket. Is this intended? I expected
> >>> that each rank would get its own socket (i.e. 2 ranks -> 2 sockets)...
> >>>
> >>> Matthias
> >>>
> >>> On Monday 13 February 2012 22:36:50 Jeff Squyres wrote:
> >>>> Also, double check that you have an optimized build, not a debugging
> >>>> build.
> >>>>
> >>>> SVN and HG checkouts default to debugging builds, which add in lots of
> >>>> latency.
> >>>>
> >>>> On Feb 13, 2012, at 10:22 AM, Ralph Castain wrote:
> >>>>> Few thoughts
> >>>>>
> >>>>> 1. Bind to socket is broken in 1.5.4 - fixed in next release
> >>>>>
> >>>>> 2. Add --report-bindings to cmd line and see where it thinks the
> >>>>> procs are bound
> >>>>>
> >>>>> 3. Sounds like memory may not be local - might be worth checking mem
> >>>>> binding.
> >>>>>
> >>>>> Sent from my iPad
> >>>>>
> >>>>> On Feb 13, 2012, at 7:07 AM, Matthias Jurenz <matthias.jurenz@tu-dresden.de> wrote:
> >>>>>> Hi Sylvain,
> >>>>>>
> >>>>>> thanks for the quick response!
> >>>>>>
> >>>>>> Here are some results with process binding enabled. I hope I used the
> >>>>>> parameters correctly...
> >>>>>>
> >>>>>> bind two ranks to one socket:
> >>>>>> $ mpirun -np 2 --bind-to-core ./all2all
> >>>>>> $ mpirun -np 2 -mca mpi_paffinity_alone 1 ./all2all
> >>>>>>
> >>>>>> bind two ranks to two different sockets:
> >>>>>> $ mpirun -np 2 --bind-to-socket ./all2all
> >>>>>>
> >>>>>> All three runs resulted in similarly bad latencies (~1.4us).
> >>>>>>
> >>>>>> :-(
> >>>>>>
> >>>>>> Matthias
> >>>>>>
> >>>>>> On Monday 13 February 2012 12:43:22 sylvain.jeau...@bull.net wrote:
> >>>>>>> Hi Matthias,
> >>>>>>>
> >>>>>>> You might want to play with process binding to see if your problem
> >>>>>>> is related to bad memory affinity.
> >>>>>>>
> >>>>>>> Try to launch the pingpong on two CPUs of the same socket, then on
> >>>>>>> different sockets (i.e. bind each process to a core, and try
> >>>>>>> different configurations).
> >>>>>>>
> >>>>>>> Sylvain
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> From: Matthias Jurenz <matthias.jur...@tu-dresden.de>
> >>>>>>> To: Open MPI Developers <de...@open-mpi.org>
> >>>>>>> Date: 13/02/2012 12:12
> >>>>>>> Subject: [OMPI devel] poor btl sm latency
> >>>>>>> Sent by: devel-boun...@open-mpi.org
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Hello all,
> >>>>>>>
> >>>>>>> on our new AMD cluster (AMD Opteron 6274, 2.2GHz) we get very bad
> >>>>>>> latencies (~1.5us) when performing 0-byte p2p communication on one
> >>>>>>> single node using the Open MPI sm BTL. When using Platform MPI we
> >>>>>>> get ~0.5us latencies, which is pretty good. The bandwidth results
> >>>>>>> are similar for both MPI implementations (~3.3GB/s) - this is okay.
> >>>>>>>
> >>>>>>> One node has 64 cores and 64GB RAM, and it doesn't matter how many
> >>>>>>> ranks are allocated by the application. We get similar results with
> >>>>>>> different numbers of ranks.
> >>>>>>>
> >>>>>>> We are using Open MPI 1.5.4, built with gcc 4.3.4 without any
> >>>>>>> special configure options except the installation prefix and the
> >>>>>>> location of the LSF stuff.
> >>>>>>>
> >>>>>>> As mentioned at http://www.open-mpi.org/faq/?category=sm we tried
> >>>>>>> to use /dev/shm instead of /tmp for the session directory, but it
> >>>>>>> had no effect. Furthermore, we tried the current release candidate
> >>>>>>> 1.5.5rc1 of Open MPI, which provides an option to use SysV shared
> >>>>>>> memory (-mca shmem sysv) - this also results in similarly poor
> >>>>>>> latencies.
> >>>>>>>
> >>>>>>> Do you have any idea? Please help!
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Matthias
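
As a side note to the frequency-scaling remark quoted above (governor set to "performance", only one available frequency): on Linux this can be verified per core through sysfs. A minimal sketch, assuming the cpufreq driver is loaded on the compute nodes:

# Governor and frequency limits of core 0; loop over cpu* to check all cores.
$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq
$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
# With the "performance" governor and a single available frequency, min and max
# should both read 2200000 (kHz) on these 2.2 GHz Opterons.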