Some supplements: I tried several compilers for building Open MPI with optimizations enabled for the AMD Bulldozer architecture:

* gcc 4.6.2 (-Ofast -mtune=bdver1 -march=bdver1)
* Open64 5.0 (-O3 -march=bdver1 -mtune=bdver1 -mso)
* Intel 12.1 (-O3 -msse4.2)

They all result in similar latencies (~1.4us).
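For reference, this is roughly how such a build can be set up, i.e. by handing the flags to Open MPI's configure (a minimal sketch for the gcc variant only; the installation prefix and make options are just placeholders, adjust to your installation):

$ ./configure CC=gcc CXX=g++ \
    CFLAGS="-Ofast -march=bdver1 -mtune=bdver1" \
    CXXFLAGS="-Ofast -march=bdver1 -mtune=bdver1" \
    --prefix=$HOME/openmpi-bdver1
$ make -j 8 && make install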
As I mentioned in my previous mail, I get the best results if the processes are bound so that L2 sharing is avoided (i.e. --bind-to-core --cpus-per-proc 2). Just see what happens when doing this for Platform MPI:

Without process binding:

$ mpirun -np 2 ./NPmpi_pcmpi -u 4 -n 100000
0: n091
1: n091
Now starting the main loop
0: 1 bytes 10000 times --> 16.89 Mbps in 0.45 usec
1: 2 bytes 10000 times --> 34.11 Mbps in 0.45 usec
2: 3 bytes 10000 times --> 51.01 Mbps in 0.45 usec
3: 4 bytes 10000 times --> 68.13 Mbps in 0.45 usec

With process binding using 'taskset':

$ mpirun -np 2 taskset -c 0,2 ./NPmpi_pcmpi -u 4 -n 10000
0: n051
1: n051
Now starting the main loop
0: 1 bytes 10000 times --> 29.33 Mbps in 0.26 usec
1: 2 bytes 10000 times --> 58.64 Mbps in 0.26 usec
2: 3 bytes 10000 times --> 88.05 Mbps in 0.26 usec
3: 4 bytes 10000 times --> 117.33 Mbps in 0.26 usec

I also tried to change some of the SM BTL parameters described here:
http://www.open-mpi.org/faq/?category=sm#sm-params - but without success.
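For completeness, this is the kind of invocation I mean by changing the SM BTL parameters (just a sketch - the parameter names are the ones listed on the FAQ page above, and the values are arbitrary examples rather than a recommendation):

$ mpirun -np 2 --bind-to-core --cpus-per-proc 2 \
    --mca btl sm,self \
    --mca btl_sm_eager_limit 8192 \
    --mca btl_sm_num_fifos 2 \
    ./NPmpi -u 4 -n 100000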
Do you have any further ideas?

Matthias

On Monday 20 February 2012 13:46:54 Matthias Jurenz wrote:
> If the processes are bound for L2 sharing (i.e. using the neighboring cores
> pu:0 and pu:1) I get the *worst* latency results:
>
> $ mpiexec -np 1 hwloc-bind pu:0 ./NPmpi -S -u 4 -n 100000 : -np 1 hwloc-bind pu:1 ./NPmpi -S -u 4 -n 100000
> Using synchronous sends
> Using synchronous sends
> 0: n023
> 1: n023
> Now starting the main loop
> 0: 1 bytes 100000 times --> 3.54 Mbps in 2.16 usec
> 1: 2 bytes 100000 times --> 7.10 Mbps in 2.15 usec
> 2: 3 bytes 100000 times --> 10.68 Mbps in 2.14 usec
> 3: 4 bytes 100000 times --> 14.23 Mbps in 2.15 usec
>
> As expected, I get the same result when using '-bind-to-core' *without*
> '--cpus-per-proc 2'.
>
> When using two separate L2s (pu:0,pu:2 or '--cpus-per-proc 2') I get
> better results:
>
> $ mpiexec -np 1 hwloc-bind pu:0 ./NPmpi -S -u 4 -n 100000 : -np 1 hwloc-bind pu:2 ./NPmpi -S -u 4 -n 100000
> Using synchronous sends
> 0: n023
> Using synchronous sends
> 1: n023
> Now starting the main loop
> 0: 1 bytes 100000 times --> 5.15 Mbps in 1.48 usec
> 1: 2 bytes 100000 times --> 10.15 Mbps in 1.50 usec
> 2: 3 bytes 100000 times --> 15.26 Mbps in 1.50 usec
> 3: 4 bytes 100000 times --> 20.23 Mbps in 1.51 usec
>
> So it seems that the process binding within Open MPI works and can be
> ruled out as the reason for the bad latency :-(
>
> Matthias
>
> On Thursday 16 February 2012 17:51:53 Brice Goglin wrote:
> > On 16/02/2012 17:12, Matthias Jurenz wrote:
> > > Thanks for the hint, Brice.
> > > I'll forward this bug report to our cluster vendor.
> > >
> > > Could this be the reason for the bad latencies with Open MPI, or does it
> > > only affect hwloc/lstopo?
> >
> > It affects binding. So it may affect the performance you observed when
> > using "high-level" binding policies that end up binding on wrong cores
> > because of hwloc/kernel problems. If you specify binding manually, it
> > shouldn't hurt.
> >
> > If the best latency case is supposed to be when L2 is shared, then try:
> > mpiexec -np 1 hwloc-bind pu:0 ./all2all : -np 1 hwloc-bind pu:1 ./all2all
> > Then, we'll see if you can get the same result with one of OMPI's binding
> > options.
> >
> > Brice
> >
> > > Matthias
> > >
> > > On Thursday 16 February 2012 15:46:46 Brice Goglin wrote:
> > >> On 16/02/2012 15:39, Matthias Jurenz wrote:
> > >>> Here is the output of lstopo from a single compute node. I'm wondering
> > >>> why the L1/L2 sharing isn't visible - not even in the graphical
> > >>> output...
> > >>
> > >> That's a kernel bug. We're waiting for AMD to tell the kernel that L1i
> > >> and L2 are shared across dual-core modules. If you have some contact
> > >> at AMD, please tell them to look at
> > >> https://bugzilla.kernel.org/show_bug.cgi?id=42607
> > >>
> > >> Brice