If the processes are bound so that they share an L2 cache (i.e. to the 
neighboring PUs pu:0 and pu:1), I get the *worst* latency results:

$ mpiexec -np 1 hwloc-bind pu:0 ./NPmpi -S -u 4 -n 100000 : -np 1 hwloc-bind pu:1 ./NPmpi -S -u 4 -n 100000
Using synchronous sends
Using synchronous sends
0: n023
1: n023
Now starting the main loop
  0:       1 bytes 100000 times -->      3.54 Mbps in       2.16 usec
  1:       2 bytes 100000 times -->      7.10 Mbps in       2.15 usec
  2:       3 bytes 100000 times -->     10.68 Mbps in       2.14 usec
  3:       4 bytes 100000 times -->     14.23 Mbps in       2.15 usec
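
For reference, the binding that actually gets applied can be double-checked
with the standard hwloc tools, e.g. along these lines (just a quick sketch;
the exact output format depends on the hwloc version):

# 'hwloc-bind --get' reports the CPU binding of the current process, so
# wrapping it in an outer hwloc-bind shows the mask a rank would inherit:
$ hwloc-bind pu:0 -- hwloc-bind --get
$ hwloc-bind pu:1 -- hwloc-bind --get

# alternatively, list the bound processes on the node while NPmpi is running:
$ hwloc-ps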

As expected, I get the same result when using '-bind-to-core' *without* 
'--cpus-per-proc 2'.
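
For reference, that run looks roughly like the following (the exact option
spelling depends on the Open MPI version in use):

# let Open MPI bind the two ranks itself, one core each:
$ mpiexec -np 2 -bind-to-core ./NPmpi -S -u 4 -n 100000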

When using two separate L2 caches (pu:0 and pu:2, or '--cpus-per-proc 2'), I 
get better results:

$ mpiexec -np 1 hwloc-bind pu:0 ./NPmpi -S -u 4 -n 100000 : -np 1 hwloc-bind pu:2 ./NPmpi -S -u 4 -n 100000
Using synchronous sends
0: n023
Using synchronous sends
1: n023
Now starting the main loop
  0:       1 bytes 100000 times -->      5.15 Mbps in       1.48 usec
  1:       2 bytes 100000 times -->     10.15 Mbps in       1.50 usec
  2:       3 bytes 100000 times -->     15.26 Mbps in       1.50 usec
  3:       4 bytes 100000 times -->     20.23 Mbps in       1.51 usec
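
The corresponding Open MPI-level run is roughly the following (again, option
spelling and behavior may differ slightly between Open MPI versions):

# give each rank two cores, so the two ranks end up on separate
# dual-core modules and thus separate L2 caches:
$ mpiexec -np 2 -bind-to-core --cpus-per-proc 2 ./NPmpi -S -u 4 -n 100000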

So it seems that the process binding within Open MPI works correctly and can 
be ruled out as the cause of the bad latency :-(

Matthias

On Thursday 16 February 2012 17:51:53 Brice Goglin wrote:
> > On 16/02/2012 17:12, Matthias Jurenz wrote:
> > Thanks for the hint, Brice.
> > I'll forward this bug report to our cluster vendor.
> > 
> > Could this be the reason for the bad latencies with Open MPI or does it
> > only affect hwloc/lstopo?
> 
> It affects binding. So it may affect the performance you observed when
> using "high-level" binding policies that end up binding on wrong cores
> because of hwloc/kernel problems. If you specify binding manually, it
> shouldn't hurt.
> 
> If the best latency case is supposed to be when L2 is shared, then try:
>     mpiexec -np 1 hwloc-bind pu:0 ./all2all : -np 1 hwloc-bind pu:1 ./all2all
> Then we'll see if you can get the same result with one of the OMPI binding
> options.
> 
> Brice
> 
> > Matthias
> > 
> > On Thursday 16 February 2012 15:46:46 Brice Goglin wrote:
> >> On 16/02/2012 15:39, Matthias Jurenz wrote:
> >>> Here is the output of lstopo from a single compute node. I'm surprised
> >>> that the L1/L2 sharing isn't visible - not even in the graphical
> >>> output...
> >> 
> >> That's a kernel bug. We're waiting for AMD to tell the kernel that L1i
> >> and L2 are shared across dual-core modules. If you have some contact at
> >> AMD, please tell them to look at
> >> https://bugzilla.kernel.org/show_bug.cgi?id=42607
> >> 
> >> Brice
