Yowza. With inconsistent results like that, it does sound like something is going on in the hardware. Unfortunately, I don't know much of anything about AMD hardware (Cisco is an Intel shop). :-\

Do you have (AMD's equivalent of) hyperthreading enabled, perchance?

In the latest 1.5.5 nightly tarball, I have just upgraded the included version of hwloc to 1.3.2. A good next step would be to download hwloc 1.3.2 yourself, build it, and verify that lstopo faithfully reports the actual topology of your system. Can you do that?
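Something along these lines should be all it takes - just a sketch, with an arbitrary install prefix, and assuming the hwloc-1.3.2.tar.gz tarball from the hwloc download page:

$ tar xzf hwloc-1.3.2.tar.gz
$ cd hwloc-1.3.2
$ ./configure --prefix=$HOME/hwloc-1.3.2 && make && make install
$ $HOME/hwloc-1.3.2/bin/lstopo

lstopo with no arguments prints the topology hwloc detects (sockets, NUMA nodes, caches, cores, PUs); I believe it will also take an output filename (e.g. "lstopo topo.txt") if you want something easy to send back to the list. Two things worth eyeballing: whether the output matches what you know the machine to be, and whether any core shows more than one PU - if so, some flavor of hardware threading is being reported, which would also answer the hyperthreading question above.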
On Feb 16, 2012, at 7:06 AM, Matthias Jurenz wrote:

> Jeff,
>
> sorry for the confusion - the all2all is a classic pingpong which uses
> MPI_Send/Recv with 0-byte messages.
>
> One thing I just noticed when using NetPIPE/MPI: Platform MPI gives almost
> constant latencies for small messages (~0.89us), though I don't know about
> process binding in Platform MPI - I just used the defaults.
> When using Open MPI (regardless of core/socket binding), the results differ
> from run to run:
>
> === FIRST RUN ===
> $ mpirun -np 2 --bind-to-socket ./NPmpi_ompi1.5.5 -S -u 12 -n 100000
> Using synchronous sends
> 1: n029
> Using synchronous sends
> 0: n029
> Now starting the main loop
> 0:  1 bytes 100000 times -->  4.66 Mbps in 1.64 usec
> 1:  2 bytes 100000 times -->  8.94 Mbps in 1.71 usec
> 2:  3 bytes 100000 times --> 13.65 Mbps in 1.68 usec
> 3:  4 bytes 100000 times --> 17.91 Mbps in 1.70 usec
> 4:  6 bytes 100000 times --> 29.04 Mbps in 1.58 usec
> 5:  8 bytes 100000 times --> 39.06 Mbps in 1.56 usec
> 6: 12 bytes 100000 times --> 57.58 Mbps in 1.59 usec
>
> === SECOND RUN (~3s after the previous run) ===
> $ mpirun -np 2 --bind-to-socket ./NPmpi_ompi1.5.5 -S -u 12 -n 100000
> Using synchronous sends
> 1: n029
> Using synchronous sends
> 0: n029
> Now starting the main loop
> 0:  1 bytes 100000 times -->  5.73 Mbps in 1.33 usec
> 1:  2 bytes 100000 times --> 11.45 Mbps in 1.33 usec
> 2:  3 bytes 100000 times --> 17.13 Mbps in 1.34 usec
> 3:  4 bytes 100000 times --> 22.94 Mbps in 1.33 usec
> 4:  6 bytes 100000 times --> 34.39 Mbps in 1.33 usec
> 5:  8 bytes 100000 times --> 46.40 Mbps in 1.32 usec
> 6: 12 bytes 100000 times --> 68.92 Mbps in 1.33 usec
>
> === THIRD RUN ===
> $ mpirun -np 2 --bind-to-socket ./NPmpi_ompi1.5.5 -S -u 12 -n 100000
> Using synchronous sends
> 0: n029
> Using synchronous sends
> 1: n029
> Now starting the main loop
> 0:  1 bytes 100000 times -->  3.50 Mbps in 2.18 usec
> 1:  2 bytes 100000 times -->  6.99 Mbps in 2.18 usec
> 2:  3 bytes 100000 times --> 10.48 Mbps in 2.18 usec
> 3:  4 bytes 100000 times --> 14.00 Mbps in 2.18 usec
> 4:  6 bytes 100000 times --> 20.98 Mbps in 2.18 usec
> 5:  8 bytes 100000 times --> 27.84 Mbps in 2.19 usec
> 6: 12 bytes 100000 times --> 41.99 Mbps in 2.18 usec
>
> At first glance, I assumed that some CPU power-saving feature was enabled.
> But the CPU frequency scaling governor is set to "performance" and there is
> only one available frequency (2.2 GHz).
>
> Any idea how this can happen?
>
>
> Matthias
>
> On Wednesday 15 February 2012 19:29:38 Jeff Squyres wrote:
>> Something is definitely wrong -- 1.4us is way too high for a 0- or 1-byte
>> HRT ping-pong. What is this all2all benchmark, btw? Is it measuring an
>> MPI_ALLTOALL, or a pingpong?
>>
>> FWIW, on an older Nehalem machine running NetPIPE/MPI, I'm getting about
>> 0.27us latencies for short messages over sm when binding to socket.
>>
>> On Feb 14, 2012, at 7:20 AM, Matthias Jurenz wrote:
>>> I've built Open MPI 1.5.5rc1 (tarball from the web) with CFLAGS=-O3.
>>> Unfortunately, also without any effect.
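(Side note from me, just to rule one thing out: a quick way to double-check that a given install is an optimized build - if I'm remembering the ompi_info output correctly - is something like

$ ompi_info | grep -i debug

which should show "Internal debug support: no" for a normal tarball build. The 1.5.5rc1 tarball defaults to an optimized build, so this is almost certainly fine; I only mention it because SVN/HG checkouts default the other way, as noted below.)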
>>>
>>> Here are some results with binding reports enabled:
>>>
>>> $ mpirun *--bind-to-core* --report-bindings -np 2 ./all2all_ompi1.5.5
>>> [n043:61313] [[56788,0],0] odls:default:fork binding child [[56788,1],1] to cpus 0002
>>> [n043:61313] [[56788,0],0] odls:default:fork binding child [[56788,1],0] to cpus 0001
>>> latency: 1.415us
>>>
>>> $ mpirun *-mca maffinity hwloc --bind-to-core* --report-bindings -np 2 ./all2all_ompi1.5.5
>>> [n043:61469] [[49736,0],0] odls:default:fork binding child [[49736,1],1] to cpus 0002
>>> [n043:61469] [[49736,0],0] odls:default:fork binding child [[49736,1],0] to cpus 0001
>>> latency: 1.4us
>>>
>>> $ mpirun *-mca maffinity first_use --bind-to-core* --report-bindings -np 2 ./all2all_ompi1.5.5
>>> [n043:61508] [[49681,0],0] odls:default:fork binding child [[49681,1],1] to cpus 0002
>>> [n043:61508] [[49681,0],0] odls:default:fork binding child [[49681,1],0] to cpus 0001
>>> latency: 1.4us
>>>
>>>
>>> $ mpirun *--bind-to-socket* --report-bindings -np 2 ./all2all_ompi1.5.5
>>> [n043:61337] [[56780,0],0] odls:default:fork binding child [[56780,1],1] to socket 0 cpus 0001
>>> [n043:61337] [[56780,0],0] odls:default:fork binding child [[56780,1],0] to socket 0 cpus 0001
>>> latency: 4.0us
>>>
>>> $ mpirun *-mca maffinity hwloc --bind-to-socket* --report-bindings -np 2 ./all2all_ompi1.5.5
>>> [n043:61615] [[49914,0],0] odls:default:fork binding child [[49914,1],1] to socket 0 cpus 0001
>>> [n043:61615] [[49914,0],0] odls:default:fork binding child [[49914,1],0] to socket 0 cpus 0001
>>> latency: 4.0us
>>>
>>> $ mpirun *-mca maffinity first_use --bind-to-socket* --report-bindings -np 2 ./all2all_ompi1.5.5
>>> [n043:61639] [[49810,0],0] odls:default:fork binding child [[49810,1],1] to socket 0 cpus 0001
>>> [n043:61639] [[49810,0],0] odls:default:fork binding child [[49810,1],0] to socket 0 cpus 0001
>>> latency: 4.0us
>>>
>>>
>>> If socket binding is enabled, it seems that all ranks are bound to the very
>>> first core of one and the same socket. Is that intended? I expected that
>>> each rank would get its own socket (i.e. 2 ranks -> 2 sockets)...
>>>
>>> Matthias
>>>
>>> On Monday 13 February 2012 22:36:50 Jeff Squyres wrote:
>>>> Also, double check that you have an optimized build, not a debugging
>>>> build.
>>>>
>>>> SVN and HG checkouts default to debugging builds, which add in lots of
>>>> latency.
>>>>
>>>> On Feb 13, 2012, at 10:22 AM, Ralph Castain wrote:
>>>>> Few thoughts:
>>>>>
>>>>> 1. Bind to socket is broken in 1.5.4 - fixed in the next release.
>>>>>
>>>>> 2. Add --report-bindings to the cmd line and see where it thinks the
>>>>> procs are bound.
>>>>>
>>>>> 3. Sounds like memory may not be local - might be worth checking memory
>>>>> binding.
>>>>>
>>>>> Sent from my iPad
>>>>>
>>>>> On Feb 13, 2012, at 7:07 AM, Matthias Jurenz <matthias.jurenz@tu-dresden.de> wrote:
>>>>>> Hi Sylvain,
>>>>>>
>>>>>> thanks for the quick response!
>>>>>>
>>>>>> Here are some results with process binding enabled. I hope I used the
>>>>>> parameters correctly...
>>>>>>
>>>>>> bind two ranks to one socket:
>>>>>> $ mpirun -np 2 --bind-to-core ./all2all
>>>>>> $ mpirun -np 2 -mca mpi_paffinity_alone 1 ./all2all
>>>>>>
>>>>>> bind two ranks to two different sockets:
>>>>>> $ mpirun -np 2 --bind-to-socket ./all2all
>>>>>>
>>>>>> All three runs resulted in similarly bad latencies (~1.4us).
>>>>>>
>>>>>> :-(
>>>>>>
>>>>>> Matthias
>>>>>>
>>>>>> On Monday 13 February 2012 12:43:22 sylvain.jeau...@bull.net wrote:
>>>>>>> Hi Matthias,
>>>>>>>
>>>>>>> You might want to play with process binding to see if your problem is
>>>>>>> related to bad memory affinity.
>>>>>>>
>>>>>>> Try to launch the pingpong on two CPUs of the same socket, then on
>>>>>>> different sockets (i.e. bind each process to a core, and try
>>>>>>> different configurations).
>>>>>>>
>>>>>>> Sylvain
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> From: Matthias Jurenz <matthias.jur...@tu-dresden.de>
>>>>>>> To: Open MPI Developers <de...@open-mpi.org>
>>>>>>> Date: 13/02/2012 12:12
>>>>>>> Subject: [OMPI devel] poor btl sm latency
>>>>>>> Sent by: devel-boun...@open-mpi.org
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Hello all,
>>>>>>>
>>>>>>> on our new AMD cluster (AMD Opteron 6274, 2.2 GHz) we get very bad
>>>>>>> latencies (~1.5us) when performing 0-byte p2p communication on one
>>>>>>> single node using the Open MPI sm BTL. When using Platform MPI we get
>>>>>>> ~0.5us latencies, which is pretty good. The bandwidth results are
>>>>>>> similar for both MPI implementations (~3.3 GB/s) - this is okay.
>>>>>>>
>>>>>>> One node has 64 cores and 64 GB RAM, and it doesn't matter how many
>>>>>>> ranks are allocated by the application - we get similar results with
>>>>>>> different numbers of ranks.
>>>>>>>
>>>>>>> We are using Open MPI 1.5.4, built with gcc 4.3.4 without any special
>>>>>>> configure options except the installation prefix and the location of
>>>>>>> the LSF stuff.
>>>>>>>
>>>>>>> As mentioned at http://www.open-mpi.org/faq/?category=sm we tried to
>>>>>>> use /dev/shm instead of /tmp for the session directory, but it had no
>>>>>>> effect. Furthermore, we tried the current release candidate 1.5.5rc1
>>>>>>> of Open MPI, which provides an option to use SysV shared memory
>>>>>>> (-mca shmem sysv) - but this also results in similarly poor latencies.
>>>>>>>
>>>>>>> Do you have any idea? Please help!
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Matthias
>>>>>>> _______________________________________________
>>>>>>> devel mailing list
>>>>>>> de...@open-mpi.org
>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>
>>>>>> _______________________________________________
>>>>>> devel mailing list
>>>>>> de...@open-mpi.org
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>
>>>>> _______________________________________________
>>>>> devel mailing list
>>>>> de...@open-mpi.org
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/