Some additional information:

I tried several compilers for building Open MPI with optimizations enabled for
the AMD Bulldozer architecture:

* gcc 4.6.2 (-Ofast -mtune=bdver1 -march=bdver1)
* Open64 5.0 (-O3 -march=bdver1 -mtune=bdver1 -mso)
* Intel 12.1 (-O3 -msse4.2)

They all result in similar latencies (~1.4us).
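
For reference, I pass such flags at configure time roughly like this (the
configure variables and the install prefix below are only placeholders, not my
exact command line):

$ ./configure CC=gcc CXX=g++ \
      CFLAGS="-Ofast -mtune=bdver1 -march=bdver1" \
      CXXFLAGS="-Ofast -mtune=bdver1 -march=bdver1" \
      --prefix=$HOME/openmpi-bdver1    # prefix is just an example
$ make -j && make install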

As I mentioned in my previous mail, I get the best results when the processes
are bound so that they do *not* share an L2 cache (i.e. --bind-to-core --cpus-per-proc 2).
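
With Open MPI that corresponds to something like the following (NPmpi is the
NetPIPE binary from the runs below; the exact option spelling may differ
between Open MPI versions):

$ mpirun -np 2 --bind-to-core --cpus-per-proc 2 ./NPmpi -u 4 -n 100000
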
Just see what happens when doing this for Platform MPI:

Without process binding:

$ mpirun -np 2 ./NPmpi_pcmpi -u 4 -n 100000
0: n091
1: n091
Now starting the main loop
  0:       1 bytes  10000 times -->     16.89 Mbps in       0.45 usec
  1:       2 bytes  10000 times -->     34.11 Mbps in       0.45 usec
  2:       3 bytes  10000 times -->     51.01 Mbps in       0.45 usec
  3:       4 bytes  10000 times -->     68.13 Mbps in       0.45 usec

With process binding using 'taskset':

$ mpirun -np 2 taskset -c 0,2 ./NPmpi_pcmpi -u 4 -n 10000
0: n051
1: n051
Now starting the main loop
  0:       1 bytes  10000 times -->     29.33 Mbps in       0.26 usec
  1:       2 bytes  10000 times -->     58.64 Mbps in       0.26 usec
  2:       3 bytes  10000 times -->     88.05 Mbps in       0.26 usec
  3:       4 bytes  10000 times -->    117.33 Mbps in       0.26 usec
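
As a sanity check, the binding that actually took effect can be inspected while
the benchmark is running, e.g. (the PID is of course just a placeholder):

$ taskset -cp <pid>    # prints the allowed CPU list of one running rank
$ hwloc-ps             # alternatively: lists bound processes as seen by hwloc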

I also tried changing some of the SM BTL parameters described here:
http://www.open-mpi.org/faq/?category=sm#sm-params - but again without success.
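
Such parameters can be set directly on the mpirun command line, e.g. (the
values below are arbitrary placeholders, not the ones I actually tried):

$ mpirun -np 2 --mca btl_sm_eager_limit 8192 \
               --mca btl_sm_num_fifos 8 \
               ./NPmpi -u 4 -n 100000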

Do you have any further ideas?

Matthias

On Monday 20 February 2012 13:46:54 Matthias Jurenz wrote:
> If the processes are bound so that they share an L2 cache (i.e. using the
> neighboring cores pu:0 and pu:1), I get the *worst* latency results:
> 
> $ mpiexec -np 1 hwloc-bind pu:0 ./NPmpi -S -u 4 -n 100000 : -np 1
> hwloc-bind pu:1 ./NPmpi -S -u 4 -n 100000
> Using synchronous sends
> Using synchronous sends
> 0: n023
> 1: n023
> Now starting the main loop
>   0:       1 bytes 100000 times -->      3.54 Mbps in       2.16 usec
>   1:       2 bytes 100000 times -->      7.10 Mbps in       2.15 usec
>   2:       3 bytes 100000 times -->     10.68 Mbps in       2.14 usec
>   3:       4 bytes 100000 times -->     14.23 Mbps in       2.15 usec
> 
> As it should be, I get the same result when using '-bind-to-core' *without*
> '--cpus-per-proc 2'.
> 
> When using two separate L2 caches (pu:0,pu:2 or '--cpus-per-proc 2') I get
> better results:
> 
> $ mpiexec -np 1 hwloc-bind pu:0 ./NPmpi -S -u 4 -n 100000 : -np 1
> hwloc-bind pu:2 ./NPmpi -S -u 4 -n 100000
> Using synchronous sends
> 0: n023
> Using synchronous sends
> 1: n023
> Now starting the main loop
>   0:       1 bytes 100000 times -->      5.15 Mbps in       1.48 usec
>   1:       2 bytes 100000 times -->     10.15 Mbps in       1.50 usec
>   2:       3 bytes 100000 times -->     15.26 Mbps in       1.50 usec
>   3:       4 bytes 100000 times -->     20.23 Mbps in       1.51 usec
> 
> So it seems that the process binding within Open MPI works and can be ruled
> out as the cause of the bad latency :-(
> 
> Matthias
> 
> On Thursday 16 February 2012 17:51:53 Brice Goglin wrote:
> > Le 16/02/2012 17:12, Matthias Jurenz a écrit :
> > > Thanks for the hint, Brice.
> > > I'll forward this bug report to our cluster vendor.
> > > 
> > > Could this be the reason for the bad latencies with Open MPI or does it
> > > only affect hwloc/lstopo?
> > 
> > It affects binding. So it may affect the performance you observed when
> > using "high-level" binding policies that end up binding to the wrong cores
> > because of the hwloc/kernel problem. If you specify the binding manually,
> > it shouldn't hurt.
> > 
> > If the best latency case is supposed to be when L2 is shared, then try:
> >     mpiexec -np 1 hwloc-bind pu:0 ./all2all : -np 1 hwloc-bind pu:1 ./all2all
> > 
> > Then, we'll see if you can get the same result with one of OMPI binding
> > options.
> > 
> > Brice
> > 
> > > Matthias
> > > 
> > > On Thursday 16 February 2012 15:46:46 Brice Goglin wrote:
> > >> Le 16/02/2012 15:39, Matthias Jurenz a écrit :
> > >>> Here is the output of lstopo from a single compute node. I'm wondering
> > >>> why the L1/L2 sharing isn't visible - not even in the graphical
> > >>> output...
> > >> 
> > >> That's a kernel bug. We're waiting for AMD to tell the kernel that L1i
> > >> and L2 are shared across dual-core modules. If you have some contact
> > >> at AMD, please tell them to look at
> > >> https://bugzilla.kernel.org/show_bug.cgi?id=42607
> > >> 
> > >> Brice
> > >> 