Minor update: I see some improvement when I set the MCA parameter mpi_yield_when_idle to 0 to enforce the "aggressive" performance mode:
$ mpirun -np 2 -mca mpi_yield_when_idle 0 -mca btl self,sm -bind-to-core -cpus-per-proc 2 ./NPmpi_ompi1.5.5 -u 4 -n 100000
0: n090
1: n090
Now starting the main loop
0: 1 bytes 100000 times --> 6.96 Mbps in 1.10 usec
1: 2 bytes 100000 times --> 14.00 Mbps in 1.09 usec
2: 3 bytes 100000 times --> 20.88 Mbps in 1.10 usec
3: 4 bytes 100000 times --> 27.65 Mbps in 1.10 usec

With the default behavior (mpi_yield_when_idle=-1, i.e. Open MPI decides which performance mode to use), Open MPI apparently assumes that it is running oversubscribed (more ranks than available cores) and therefore switches on the degraded performance mode (mpi_yield_when_idle=1).

When I use a hostfile to make the number of available cores explicit,

$ cat hostfile
localhost
localhost

I get results similar to those with mpi_yield_when_idle=0.

Perhaps this misbehavior is due to the kernel bug mentioned by Brice, which might cause Open MPI (hwloc) to see fewer cores than are actually available?
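For reference, the same setup can also be expressed through the environment and the hostfile. This is only a sketch assuming the usual Open MPI conventions (the OMPI_MCA_ environment prefix and the slots= keyword in hostfiles); I have not benchmarked this variant separately:

$ export OMPI_MCA_mpi_yield_when_idle=0    # same effect as "-mca mpi_yield_when_idle 0"
$ cat hostfile
localhost slots=2
$ mpirun -np 2 -hostfile hostfile -mca btl self,sm -bind-to-core -cpus-per-proc 2 ./NPmpi_ompi1.5.5 -u 4 -n 100000

Stating the slot count explicitly should keep Open MPI from guessing at oversubscription, just like listing 'localhost' twice does.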
Matthias

On Tuesday 21 February 2012 17:17:49 Matthias Jurenz wrote:
> Some supplements:
> 
> I tried several compilers for building Open MPI with optimizations for the
> AMD Bulldozer architecture enabled:
> 
> * gcc 4.6.2 (-Ofast -mtune=bdver1 -march=bdver1)
> * Open64 5.0 (-O3 -march=bdver1 -mtune=bdver1 -mso)
> * Intel 12.1 (-O3 -msse4.2)
> 
> They all result in similar latencies (~1.4us).
> 
> As I mentioned in my previous mail, I get the best results if the processes
> are bound so that L2 sharing is disabled (i.e. --bind-to-core --cpus-per-proc 2).
> Just see what happens when doing this with Platform MPI:
> 
> Without process binding:
> 
> $ mpirun -np 2 ./NPmpi_pcmpi -u 4 -n 100000
> 0: n091
> 1: n091
> Now starting the main loop
> 0: 1 bytes 10000 times --> 16.89 Mbps in 0.45 usec
> 1: 2 bytes 10000 times --> 34.11 Mbps in 0.45 usec
> 2: 3 bytes 10000 times --> 51.01 Mbps in 0.45 usec
> 3: 4 bytes 10000 times --> 68.13 Mbps in 0.45 usec
> 
> With process binding using 'taskset':
> 
> $ mpirun -np 2 taskset -c 0,2 ./NPmpi_pcmpi -u 4 -n 10000
> 0: n051
> 1: n051
> Now starting the main loop
> 0: 1 bytes 10000 times --> 29.33 Mbps in 0.26 usec
> 1: 2 bytes 10000 times --> 58.64 Mbps in 0.26 usec
> 2: 3 bytes 10000 times --> 88.05 Mbps in 0.26 usec
> 3: 4 bytes 10000 times --> 117.33 Mbps in 0.26 usec
> 
> I tried changing some of the SM BTL parameters described here:
> http://www.open-mpi.org/faq/?category=sm#sm-params - but also without
> success.
> 
> Do you have any further ideas?
> 
> Matthias
> 
> On Monday 20 February 2012 13:46:54 Matthias Jurenz wrote:
> > If the processes are bound so that they share an L2 cache (i.e. using the
> > neighboring cores pu:0 and pu:1), I get the *worst* latency results:
> > 
> > $ mpiexec -np 1 hwloc-bind pu:0 ./NPmpi -S -u 4 -n 100000 : -np 1
> > hwloc-bind pu:1 ./NPmpi -S -u 4 -n 100000
> > Using synchronous sends
> > Using synchronous sends
> > 0: n023
> > 1: n023
> > Now starting the main loop
> > 0: 1 bytes 100000 times --> 3.54 Mbps in 2.16 usec
> > 1: 2 bytes 100000 times --> 7.10 Mbps in 2.15 usec
> > 2: 3 bytes 100000 times --> 10.68 Mbps in 2.14 usec
> > 3: 4 bytes 100000 times --> 14.23 Mbps in 2.15 usec
> > 
> > As expected, I get the same result when using '-bind-to-core' *without*
> > '--cpus-per-proc 2'.
> > 
> > When using two separate L2 caches (pu:0,pu:2 or '--cpus-per-proc 2') I get
> > better results:
> > 
> > $ mpiexec -np 1 hwloc-bind pu:0 ./NPmpi -S -u 4 -n 100000 : -np 1
> > hwloc-bind pu:2 ./NPmpi -S -u 4 -n 100000
> > Using synchronous sends
> > 0: n023
> > Using synchronous sends
> > 1: n023
> > Now starting the main loop
> > 0: 1 bytes 100000 times --> 5.15 Mbps in 1.48 usec
> > 1: 2 bytes 100000 times --> 10.15 Mbps in 1.50 usec
> > 2: 3 bytes 100000 times --> 15.26 Mbps in 1.50 usec
> > 3: 4 bytes 100000 times --> 20.23 Mbps in 1.51 usec
> > 
> > So it seems that the process binding within Open MPI works and can be
> > ruled out as the reason for the bad latency :-(
> > 
> > Matthias
> > 
> > On Thursday 16 February 2012 17:51:53 Brice Goglin wrote:
> > > On 16/02/2012 17:12, Matthias Jurenz wrote:
> > > > Thanks for the hint, Brice.
> > > > I'll forward this bug report to our cluster vendor.
> > > > 
> > > > Could this be the reason for the bad latencies with Open MPI or does
> > > > it only affect hwloc/lstopo?
> > > 
> > > It affects binding. So it may affect the performance you observed when
> > > using "high-level" binding policies that end up binding to the wrong
> > > cores because of hwloc/kernel problems. If you specify the binding
> > > manually, it shouldn't hurt.
> > > 
> > > If the best latency case is supposed to be when L2 is shared, then try:
> > > mpiexec -np 1 hwloc-bind pu:0 ./all2all : -np 1 hwloc-bind pu:1 ./all2all
> > > Then we'll see if you can get the same result with one of the OMPI
> > > binding options.
> > > 
> > > Brice
> > > 
> > > > Matthias
> > > > 
> > > > On Thursday 16 February 2012 15:46:46 Brice Goglin wrote:
> > > >> On 16/02/2012 15:39, Matthias Jurenz wrote:
> > > >>> Here is the output of lstopo from a single compute node. I'm
> > > >>> surprised that the L1/L2 sharing is not visible - not even in the
> > > >>> graphical output...
> > > >> 
> > > >> That's a kernel bug. We're waiting for AMD to tell the kernel that
> > > >> L1i and L2 are shared across dual-core modules. If you have some
> > > >> contact at AMD, please tell them to look at
> > > >> https://bugzilla.kernel.org/show_bug.cgi?id=42607
> > > >> 
> > > >> Brice
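PS: As a cross-check for the binding question (again only a sketch using standard hwloc tools; the exact output format depends on the hwloc version), mpirun can simply launch hwloc-bind itself, so that each rank reports the CPU set it was actually bound to:

$ mpirun -np 2 -bind-to-core hwloc-bind --get
$ mpirun -np 2 -bind-to-core -cpus-per-proc 2 hwloc-bind --get

If the reported CPU sets do not match the intended pu:0/pu:1 and pu:0/pu:2 layouts, that would point to the hwloc/kernel numbering problem rather than to the SM BTL itself.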