If the processes are bound so that they share an L2 cache (i.e. to the neighboring PUs pu:0 and pu:1), I get the *worst* latency results:
$ mpiexec -np 1 hwloc-bind pu:0 ./NPmpi -S -u 4 -n 100000 : -np 1 hwloc-bind pu:1 ./NPmpi -S -u 4 -n 100000
Using synchronous sends
Using synchronous sends
0: n023
1: n023
Now starting the main loop
0: 1 bytes 100000 times --> 3.54 Mbps in 2.16 usec
1: 2 bytes 100000 times --> 7.10 Mbps in 2.15 usec
2: 3 bytes 100000 times --> 10.68 Mbps in 2.14 usec
3: 4 bytes 100000 times --> 14.23 Mbps in 2.15 usec

As expected, I get the same result when using '-bind-to-core' *without* '--cpus-per-proc 2'.

When using two separate L2s (pu:0,pu:2 or '--cpus-per-proc 2') I get better results:

$ mpiexec -np 1 hwloc-bind pu:0 ./NPmpi -S -u 4 -n 100000 : -np 1 hwloc-bind pu:2 ./NPmpi -S -u 4 -n 100000
Using synchronous sends
0: n023
Using synchronous sends
1: n023
Now starting the main loop
0: 1 bytes 100000 times --> 5.15 Mbps in 1.48 usec
1: 2 bytes 100000 times --> 10.15 Mbps in 1.50 usec
2: 3 bytes 100000 times --> 15.26 Mbps in 1.50 usec
3: 4 bytes 100000 times --> 20.23 Mbps in 1.51 usec

So it seems that the process binding within Open MPI works and can be ruled out as the reason for the bad latency :-(

Matthias

On Thursday 16 February 2012 17:51:53 Brice Goglin wrote:
> On 16/02/2012 17:12, Matthias Jurenz wrote:
> > Thanks for the hint, Brice.
> > I'll forward this bug report to our cluster vendor.
> >
> > Could this be the reason for the bad latencies with Open MPI or does it
> > only affect hwloc/lstopo?
>
> It affects binding. So it may affect the performance you observed when
> using "high-level" binding policies that end up binding on wrong cores
> because of hwloc/kernel problems. If you specify binding manually, it
> shouldn't hurt.
>
> If the best latency case is supposed to be when L2 is shared, then try:
>   mpiexec -np 1 hwloc-bind pu:0 ./all2all : -np 1 hwloc-bind pu:1 ./all2all
> Then, we'll see if you can get the same result with one of the OMPI binding
> options.
>
> Brice
>
> > Matthias
> >
> > On Thursday 16 February 2012 15:46:46 Brice Goglin wrote:
> >> On 16/02/2012 15:39, Matthias Jurenz wrote:
> >>> Here is the output of lstopo from a single compute node. I'm wondering
> >>> why the L1/L2 sharing isn't visible - also not in the graphical
> >>> output...
> >>
> >> That's a kernel bug. We're waiting for AMD to tell the kernel that L1i
> >> and L2 are shared across dual-core modules. If you have some contact at
> >> AMD, please tell them to look at
> >> https://bugzilla.kernel.org/show_bug.cgi?id=42607
> >>
> >> Brice
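To double-check where each rank actually ends up before timing, a minimal sketch using the hwloc C API could be run under the same hwloc-bind/mpiexec lines as above. The file name check_bind.c is hypothetical and not part of this thread; it only assumes a system-installed libhwloc:

/* check_bind.c - hypothetical helper (not from this thread): print the
 * cpuset this process is currently bound to, so one can verify that
 * hwloc-bind or the OMPI binding options actually took effect.
 * Build with: cc check_bind.c -o check_bind -lhwloc */
#include <stdio.h>
#include <stdlib.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_bitmap_t set = hwloc_bitmap_alloc();
    char *str;

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* Query the current binding of the whole process (not a single thread). */
    if (hwloc_get_cpubind(topo, set, HWLOC_CPUBIND_PROCESS) < 0) {
        perror("hwloc_get_cpubind");
        return EXIT_FAILURE;
    }

    hwloc_bitmap_asprintf(&str, set);
    printf("bound to cpuset %s\n", str);

    free(str);
    hwloc_bitmap_free(set);
    hwloc_topology_destroy(topo);
    return EXIT_SUCCESS;
}

If the binding really sticks, each rank should report a single-PU mask (e.g. something like 0x00000001 for pu:0) rather than the full machine cpuset.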