"Unfortunately" also Platform MPI benefits from disabled ASLR:
shared L2/L1I caches (core 0 and 1) and
enabled ASLR:
$ mpirun -np 2 taskset -c 0,1 ./NPmpi_pcmpi -u 1 -n 100
Now starting the main loop
0: 1 bytes 100 times --> 17.07 Mbps in 0.45 usec
ASLR disabled:
$ mpirun -np 2 taskset -c 0,1 ./NPmpi_pcmpi -u 1 -n 100
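(A quick way to check and toggle ASLR on Linux, for anyone reproducing this - the sysctl knob and setarch flag are standard, only the benchmark command line is taken from the run above:)
$ cat /proc/sys/kernel/randomize_va_space        # 2 = full randomization, 0 = off
$ sysctl -w kernel.randomize_va_space=0          # system-wide, as root
$ setarch $(uname -m) -R mpirun -np 2 taskset -c 0,1 ./NPmpi_pcmpi -u 1 -n 100   # per-run only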
On Mar 15, 2012, at 8:06 AM, Matthias Jurenz wrote:
> We made a big step forward today!
>
> The used Kernel has a bug regarding to the shared L1 instruction cache in AMD
> Bulldozer processors:
> See
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=dfb09f9b7ab03
> -----Original Message-----
> > From: Matthias Jurenz [mailto:matthias.jur...@tu-dresden.de]
> > Sent: Friday, March 09, 2012 08:50 AM
> > To: Open MPI Developers
> > Cc: Mora, Joshua
> > Subject: Re: [OMPI devel] poor btl sm latency
> >
> > I just made an interesting observation:
> From: Matthias Jurenz [mailto:matthias.jur...@tu-dresden.de]
> Sent: Friday, March 09, 2012 08:50 AM
> To: Open MPI Developers
> Cc: Mora, Joshua
> Subject: Re: [OMPI devel] poor btl sm latency
>
> I just made an interesting observation:
>
> When binding the processes to two neighboring cores (L2 sharing) NetPIPE shows
> *sometimes* pretty good results: ~0.5us
I just made an interesting observation:
When binding the processes to two neighboring cores (L2 sharing) NetPIPE shows
*sometimes* pretty good results: ~0.5us
$ mpirun -mca btl sm,self -np 1 hwloc-bind -v core:0 ./NPmpi_ompi1.5.5 -u 4 -n
10 -p 0 : -np 1 hwloc-bind -v core:1 ./NPmpi_ompi1.5.5 -u 4 -n 10 -p 0
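(For comparison - a hypothetical invocation, no results measured here - pinning the second rank to a core of a different module, e.g. core:2 assuming the numbering above, avoids the shared L1i/L2:)
$ mpirun -mca btl sm,self -np 1 hwloc-bind -v core:0 ./NPmpi_ompi1.5.5 -u 4 -n 10 -p 0 : -np 1 hwloc-bind -v core:2 ./NPmpi_ompi1.5.5 -u 4 -n 10 -p 0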
Here are the SM BTL parameters:
$ ompi_info --param btl sm
MCA btl: parameter "btl_base_verbose" (current value: <0>, data source:
default value) Verbosity level of the BTL framework
MCA btl: parameter "btl" (current value: , data source: file
[/sw/atlas/libraries/openmpi/1.5.5rc3/x86_64/etc/openmpi
Please do a "ompi_info --param btl sm" on your environment. The lazy_free
direct the internals of the SM BTL not to release the memory fragments used to
communicate until the lazy limit is reached. The default value was deemed as
reasonable a while back when the number of default fragments was l
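(For reference, an MCA parameter like btl_sm_fifo_lazy_free can be set per run, per environment, or persistently - all standard Open MPI mechanisms; the value 1 is just the one found below:)
$ mpirun -mca btl_sm_fifo_lazy_free 1 -mca btl self,sm -np 2 ./NPmpi_ompi1.5.5 -u 4 -n 10
$ export OMPI_MCA_btl_sm_fifo_lazy_free=1
$ echo "btl_sm_fifo_lazy_free = 1" >> $HOME/.openmpi/mca-params.conf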
Thanks to the OTPO tool, I figured out that setting the MCA parameter
btl_sm_fifo_lazy_free to 1 (default is 120) improves the latency significantly:
0.88µs
But somehow I get the feeling that this doesn't eliminate the actual
problem...
Matthias
On Friday 02 March 2012 15:37:03 Matthias Jurenz wrote:
On Friday 02 March 2012 14:58:45 Jeffrey Squyres wrote:
> Ok. Good that there's no oversubscription bug, at least. :-)
>
> Did you see my off-list mail to you yesterday about building with an
> external copy of hwloc 1.4 to see if that helps?
Yes, I did - I answered as well. Our mail server seem
Hah! I just saw your ticket about how --with-hwloc=/path/to/install is broken
in 1.5.5. So -- let me go look into that...
On Mar 2, 2012, at 8:58 AM, Jeffrey Squyres wrote:
> Ok. Good that there's no oversubscription bug, at least. :-)
>
> Did you see my off-list mail to you yesterday about building with an external
> copy of hwloc 1.4 to see if that helps?
Ok. Good that there's no oversubscription bug, at least. :-)
Did you see my off-list mail to you yesterday about building with an external
copy of hwloc 1.4 to see if that helps?
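(For anyone trying the same thing: building against an external hwloc uses the configure switch reported broken in 1.5.5 above; the paths are placeholders, not the reporter's actual install:)
$ ./configure --prefix=/path/to/ompi-install --with-hwloc=/path/to/hwloc-1.4-install
$ make -j4 && make install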
On Mar 2, 2012, at 8:26 AM, Matthias Jurenz wrote:
> To exclude a possible bug within the LSF component, I rebuilt Open MPI without
> support for LSF (--without-lsf).
To exclude a possible bug within the LSF component, I rebuilt Open MPI without
support for LSF (--without-lsf).
-> It makes no difference - the latency is still bad: ~1.1us.
Matthias
On Friday 02 March 2012 13:50:13 Matthias Jurenz wrote:
> SORRY, it was obviously a big mistake by me. :-(
>
>
SORRY, it was obviously a big mistake by me. :-(
Open MPI 1.5.5 was built with LSF support, so when starting an LSF job it's
necessary to request at least the number of tasks/cores as used for the
subsequent mpirun command. That was not the case - I forgot bsub's '-n' option
to specify the number of tasks.
When using Open MPI v1.4.5 I get ~1.1us. That's the same result as I get with
Open MPI v1.5.x using mpi_yield_when_idle=0.
So I think there is a bug in Open MPI (v1.5.4 and v1.5.5rc2) regarding the
automatic performance mode selection.
When enabling the degraded performance mode for Open MPI
Minor update:
I see some improvement when I set the MCA parameter mpi_yield_when_idle to 0
to enforce the "Agressive" performance mode:
$ mpirun -np 2 -mca mpi_yield_when_idle 0 -mca btl self,sm -bind-to-core -
cpus-per-proc 2 ./NPmpi_ompi1.5.5 -u 4 -n 10
0: n090
1: n090
Now starting the main loop
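(The same parameter can also be set through the environment and inspected with ompi_info - standard mechanisms, shown as a sketch:)
$ export OMPI_MCA_mpi_yield_when_idle=0
$ ompi_info --param mpi all | grep -i yield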
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On 13/02/12 22:11, Matthias Jurenz wrote:
> Do you have any idea? Please help!
Do you see the same bad latency in the old branch (1.4.5) ?
cheers,
Chris
- --
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Some additions:
I tried several compilers for building Open MPI with optimizations enabled for
the AMD Bulldozer architecture:
* gcc 4.6.2 (-Ofast -mtune=bdver1 -march=bdver1)
* Open64 5.0 (-O3 -march=bdver1 -mtune=bdver1 -mso)
* Intel 12.1 (-O3 -msse4.2)
They all result in similar latencies
If the processes are bound for L2 sharing (i.e. using the neighboring PUs pu:0
and pu:1), I get the *worst* latency results:
$ mpiexec -np 1 hwloc-bind pu:0 ./NPmpi -S -u 4 -n 10 : -np 1 hwloc-bind
pu:1 ./NPmpi -S -u 4 -n 10
Using synchronous sends
Using synchronous sends
0: n023
1: n023
Now starting the main loop
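(To double-check where the two ranks actually end up while the benchmark runs, hwloc-ps or /proc can be consulted; <pid> is a placeholder:)
$ hwloc-ps | grep NPmpi
$ grep Cpus_allowed_list /proc/<pid>/status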
On 16/02/2012 17:12, Matthias Jurenz wrote:
> Thanks for the hint, Brice.
> I'll forward this bug report to our cluster vendor.
>
> Could this be the reason for the bad latencies with Open MPI or does it only
> affect hwloc/lstopo?
It affects binding. So it may affect the performance you observe.
Thanks for the hint, Brice.
I'll forward this bug report to our cluster vendor.
Could this be the reason for the bad latencies with Open MPI or does it only
affect hwloc/lstopo?
Matthias
On Thursday 16 February 2012 15:46:46 Brice Goglin wrote:
> On 16/02/2012 15:39, Matthias Jurenz wrote:
>
On Thursday 16 February 2012 16:50:55 Jeff Squyres wrote:
> On Feb 16, 2012, at 10:30 AM, Matthias Jurenz wrote:
> > $ mpirun -np 2 --bind-to-core --cpus-per-proc 2 hwloc-bind --get
> > 0x0003
> > 0x000c
>
> That seems right. From your prior email, 3 maps to 11 binary, which maps
> to:
Just a stupid question: in another sm-related thread the value of the
$TMPDIR variable was the problem - could this be the problem here as well?
Edgar
On 2/16/2012 9:30 AM, Matthias Jurenz wrote:
> On Thursday 16 February 2012 16:21:10 Jeff Squyres wrote:
>> On Feb 16, 2012, at 9:39 AM, Matthias Jurenz wrote:
On Feb 16, 2012, at 10:30 AM, Matthias Jurenz wrote:
> $ mpirun -np 2 --bind-to-core --cpus-per-proc 2 hwloc-bind --get
> 0x0003
> 0x000c
That seems right. From your prior email, 3 maps to 11 binary, which maps to:
Socket L#0 (16GB)
NUMANode L#0 (P#0 8190MB) + L3 L#0 (6144KB)
L
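(To decode such masks by hand: bit i set means PU P#i is part of the binding; converting with bc, for example:)
$ echo 'obase=2; ibase=16; 3' | bc    # 11   -> PUs 0 and 1
$ echo 'obase=2; ibase=16; C' | bc    # 1100 -> PUs 2 and 3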
On Thursday 16 February 2012 16:21:10 Jeff Squyres wrote:
> On Feb 16, 2012, at 9:39 AM, Matthias Jurenz wrote:
> > However, the latencies are constant now but still too high:
> >
> > $ mpirun -np 2 --bind-to-core --cpus-per-proc 2 ./NPmpi_ompi1.5.5 -S -u
> > 12 -n 10
>
> Can you run:
>
> mpirun -np 2 --bind-to-core --cpus-per-proc 2 hwloc-bind --get
On Feb 16, 2012, at 9:39 AM, Matthias Jurenz wrote:
> However, the latencies are constant now but still too high:
>
> $ mpirun -np 2 --bind-to-core --cpus-per-proc 2 ./NPmpi_ompi1.5.5 -S -u 12 -n
> 10
Can you run:
mpirun -np 2 --bind-to-core --cpus-per-proc 2 hwloc-bind --get
I want to verify where the processes are actually being bound.
On 16/02/2012 15:39, Matthias Jurenz wrote:
> Here is the output of lstopo from a single compute node. I'm surprised that the
> L1/L2 sharing isn't visible - also not in the graphical output...
That's a kernel bug. We're waiting for AMD to tell the kernel that L1i
and L2 are shared across the two cores of each module.
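(Independent of hwloc, the kernel's idea of cache sharing can be read straight from sysfs; with the fix above, cpu0 and cpu1 should list each other in shared_cpu_list for the L1i and L2 entries - the index numbering varies, so check the level/type files too:)
$ grep . /sys/devices/system/cpu/cpu0/cache/index*/level
$ grep . /sys/devices/system/cpu/cpu0/cache/index*/type
$ grep . /sys/devices/system/cpu/cpu0/cache/index*/shared_cpu_list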
The inconsistent results disappear when using the option '--cpus-per-proc 2'.
I assume it has to do with the fact that each core shares the L1 instruction
cache and the L2 data cache with its neighboring core (see
http://upload.wikimedia.org/wikipedia/commons/e/ec/AMD_Bulldozer_block_diagram_%288_core_CP
Yowza. With inconsistent results like that, it does sound like something is
going on in the hardware. Unfortunately, I don't know much/anything about AMDs
(Cisco is an Intel shop). :-\
Do you have (AMD's equivalent of) hyperthreading enabled, perchance?
In the latest 1.5.5 nightly tarball, I
Jeff,
sorry for the confusion - the all2all is a classic pingpong which uses
MPI_Send/MPI_Recv with 0-byte messages.
One thing I just noticed when using NetPIPE/MPI: Platform MPI results in
almost constant latencies for small messages (~0.89us), though I don't know
how process binding is handled in Platform MPI.
Something is definitely wrong -- 1.4us is way too high for a 0 or 1 byte HRT
ping pong. What is this all2all benchmark, btw? Is it measuring an
MPI_ALLTOALL, or a pingpong?
FWIW, on an older Nehalem machine running NetPIPE/MPI, I'm getting about 0.27us
latencies for short messages over sm and
I've built Open MPI 1.5.5rc1 (tarball from Web) with CFLAGS=-O3.
Unfortunately, also without any effect.
Here are some results with binding reports enabled:
$ mpirun *--bind-to-core* --report-bindings -np 2 ./all2all_ompi1.5.5
[n043:61313] [[56788,0],0] odls:default:fork binding child [[56788,1],1]
Also, double check that you have an optimized build, not a debugging build.
SVN and HG checkouts default to debugging builds, which add in lots of latency.
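(A quick check - ompi_info reports whether internal debugging was compiled in; an optimized build should show it disabled:)
$ ompi_info | grep -i debug    # expect 'Internal debug support: no'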
On Feb 13, 2012, at 10:22 AM, Ralph Castain wrote:
> Few thoughts
>
> 1. Bind to socket is broken in 1.5.4 - fixed in next release
>
> 2. Add --report-bindings to cmd line and see where it thinks the procs are bound
Few thoughts
1. Bind to socket is broken in 1.5.4 - fixed in next release
2. Add --report-bindings to cmd line and see where it thinks the procs are bound
3. Sounds like memory may not be local - might be worth checking mem binding.
Sent from my iPad
On Feb 13, 2012, at 7:07 AM, Matthias Jurenz wrote:
Hi Sylvain,
thanks for the quick response!
Here are some results with process binding enabled. I hope I used the parameters
correctly...
bind two ranks to one socket:
$ mpirun -np 2 --bind-to-core ./all2all
$ mpirun -np 2 -mca mpi_paffinity_alone 1 ./all2all
bind two ranks to two different sockets:
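(The message is cut off here; with the 1.5-series mpirun options mentioned elsewhere in the thread, a per-socket placement would look roughly like this - a sketch, not the poster's exact command, and note that bind-to-socket itself is reported broken in 1.5.4:)
$ mpirun -np 2 --bysocket --bind-to-core ./all2all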
Hi Matthias,
You might want to play with process binding to see if your problem is
related to bad memory affinity.
Try to launch pingpong on two CPUs of the same socket, then on different
sockets (i.e. bind each process to a core, and try different
configurations).
Sylvain
From: Matthias