Re: [OMPI devel] poor btl sm latency

2012-03-16 Thread Matthias Jurenz
"Unfortunately" also Platform MPI benefits from disabled ASLR:

shared L2/L1I caches (core 0 and 1) and
enabled ASLR:
$ mpirun -np 2 taskset -c 0,1 ./NPmpi_pcmpi -u 1 -n 100
Now starting the main loop
  0:   1 bytes 100 times --> 17.07 Mbps in   0.45 usec

disabled ASLR:
$ mpirun -np 2 taskset -c 0,1 ./NPmpi_pcmpi -u 1 -n 100
Now starting the main loop
  0:   1 bytes 100 times --> 55.03 Mbps in   0.14 usec

exclusive L2/L1I caches (core 0 and 2) and
enabled ASLR:
$ mpirun -np 2 taskset -c 0,2 ./NPmpi_pcmpi -u 1 -n 100
Now starting the main loop
  0:   1 bytes 100 times --> 30.45 Mbps in   0.27 usec

disabled ASLR:
$ mpirun -np 2 taskset -c 0,2 ./NPmpi_pcmpi -u 1 -n 100
Now starting the main loop
  0:   1 bytes 100 times --> 31.25 Mbps in   0.28 usec

As expected, disabling the ASLR makes no difference when binding for exclusive 
caches.
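For what it's worth, here is a quick way to double-check where the two NetPIPE ranks actually end up while such a run is in progress - a small sketch, where the rank PIDs are placeholders:

$ taskset -cp <pid-of-rank0>                     # affinity list the rank is allowed to run on, e.g. 0,1
$ ps -o pid,psr,comm -p <pid-of-rank0>,<pid-of-rank1>   # psr = CPU the task last ran on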

Matthias

On Thursday 15 March 2012 17:10:33 Jeffrey Squyres wrote:
> On Mar 15, 2012, at 8:06 AM, Matthias Jurenz wrote:
> > We made a big step forward today!
> > 
> > The used Kernel has a bug regarding to the shared L1 instruction cache in
> > AMD Bulldozer processors:
> > See
> > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit
> > diff;h=dfb09f9b7ab03fd367740e541a5caf830ed56726 and
> > http://developer.amd.com/Assets/SharedL1InstructionCacheonAMD15hCPU.pdf
> > 
> > Until the Kernel is patched we disable the address-space layout
> > randomization
> > 
> > (ASLR) as described in the above PDF:
> >   $ sudo /sbin/sysctl -w kernel.randomize_va_space=0
> > 
> > Therewith, NetPIPE results in ~0.5us latency when binding the processes
> > for L2/L1I cache sharing (i.e. -bind-to-core).
> 
> This is good!  I love it when the bug is not our fault.  :-)
> 
> > However, when binding the processes for exclusive L2/L1I caches (i.e.
> > -cpus- per-proc 2) we still get ~1.1us latency. I don't think that the
> > upcoming kernel patch will help for this kind of process binding...
> 
> Does this kind of thing happen with Platform MPI, too?  I.e., is this
> another kernel issue, or an OMPI-specific issue?


Re: [OMPI devel] poor btl sm latency

2012-03-15 Thread Jeffrey Squyres
On Mar 15, 2012, at 8:06 AM, Matthias Jurenz wrote:

> We made a big step forward today!
> 
> The used Kernel has a bug regarding to the shared L1 instruction cache in AMD 
> Bulldozer processors:
> See 
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=dfb09f9b7ab03fd367740e541a5caf830ed56726
>  
> and
> http://developer.amd.com/Assets/SharedL1InstructionCacheonAMD15hCPU.pdf
> 
> Until the Kernel is patched we disable the address-space layout randomization 
> (ASLR) as described in the above PDF:
> 
>   $ sudo /sbin/sysctl -w kernel.randomize_va_space=0
> 
> Therewith, NetPIPE results in ~0.5us latency when binding the processes for 
> L2/L1I cache sharing (i.e. -bind-to-core).

This is good!  I love it when the bug is not our fault.  :-)

> However, when binding the processes for exclusive L2/L1I caches (i.e. -cpus-
> per-proc 2) we still get ~1.1us latency. I don't think that the upcoming 
> kernel patch will help for this kind of process binding...

Does this kind of thing happen with Platform MPI, too?  I.e., is this another 
kernel issue, or an OMPI-specific issue?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] poor btl sm latency

2012-03-15 Thread Matthias Jurenz
We made a big step forward today!

The kernel we use has a bug related to the shared L1 instruction cache in AMD 
Bulldozer processors:
See 
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=dfb09f9b7ab03fd367740e541a5caf830ed56726
 
and
http://developer.amd.com/Assets/SharedL1InstructionCacheonAMD15hCPU.pdf

Until the kernel is patched, we disable address-space layout randomization 
(ASLR) as described in the above PDF:

   $ sudo /sbin/sysctl -w kernel.randomize_va_space=0
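For reference, the current ASLR setting can be checked before and after the change, and the change can be made persistent across reboots - a small sketch using the standard sysctl interfaces, nothing specific to this cluster:

   $ cat /proc/sys/kernel/randomize_va_space          # 2 = full randomization, 0 = disabled
   $ echo "kernel.randomize_va_space = 0" | sudo tee -a /etc/sysctl.conf   # persist across reboots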

With ASLR disabled, NetPIPE reports ~0.5us latency when binding the processes 
for L2/L1I cache sharing (i.e. -bind-to-core).

However, when binding the processes for exclusive L2/L1I caches (i.e. 
-cpus-per-proc 2) we still get ~1.1us latency. I don't think that the upcoming 
kernel patch will help for this kind of process binding...

Matthias

On Monday 12 March 2012 11:09:01 Matthias Jurenz wrote:
> It's a SUSE Linux Enterprise Server 11 Service Pack 1 with kernel version
> 2.6.32.49-0.3-default.
> 
> Matthias
> 
> On Friday 09 March 2012 16:36:41 you wrote:
> > What OS are you using ?
> > 
> > Joshua
> > 
> > - Original Message -
> > From: Matthias Jurenz [mailto:matthias.jur...@tu-dresden.de]
> > Sent: Friday, March 09, 2012 08:50 AM
> > To: Open MPI Developers 
> > Cc: Mora, Joshua
> > Subject: Re: [OMPI devel] poor btl sm latency
> > 
> > I just made an interesting observation:
> > 
> > When binding the processes to two neighboring cores (L2 sharing) NetPIPE
> > shows *sometimes* pretty good results: ~0.5us
> > 
> > $ mpirun -mca btl sm,self -np 1 hwloc-bind -v core:0 ./NPmpi_ompi1.5.5 -u
> > 4 -n 10 -p 0 : -np 1 hwloc-bind -v core:1 ./NPmpi_ompi1.5.5 -u 4 -n
> > 10 -p 0 using object #0 depth 6 below cpuset 0x,0x
> > using object #1 depth 6 below cpuset 0x,0x
> > adding 0x0001 to 0x0
> > adding 0x0001 to 0x0
> > assuming the command starts at ./NPmpi_ompi1.5.5
> > binding on cpu set 0x0001
> > adding 0x0002 to 0x0
> > adding 0x0002 to 0x0
> > assuming the command starts at ./NPmpi_ompi1.5.5
> > binding on cpu set 0x0002
> > Using no perturbations
> > 
> > 0: n035
> > Using no perturbations
> > 
> > 1: n035
> > Now starting the main loop
> > 
> >   0:   1 bytes 10 times -->  6.01 Mbps in   1.27 usec
> >   1:   2 bytes 10 times --> 12.04 Mbps in   1.27 usec
> >   2:   3 bytes 10 times --> 18.07 Mbps in   1.27 usec
> >   3:   4 bytes 10 times --> 24.13 Mbps in   1.26 usec
> > 
> > $ mpirun -mca btl sm,self -np 1 hwloc-bind -v core:0 ./NPmpi_ompi1.5.5 -u
> > 4 -n 10 -p 0 : -np 1 hwloc-bind -v core:1 ./NPmpi_ompi1.5.5 -u 4 -n
> > 10 -p 0 using object #0 depth 6 below cpuset 0x,0x
> > adding 0x0001 to 0x0
> > adding 0x0001 to 0x0
> > assuming the command starts at ./NPmpi_ompi1.5.5
> > binding on cpu set 0x0001
> > using object #1 depth 6 below cpuset 0x,0x
> > adding 0x0002 to 0x0
> > adding 0x0002 to 0x0
> > assuming the command starts at ./NPmpi_ompi1.5.5
> > binding on cpu set 0x0002
> > Using no perturbations
> > 
> > 0: n035
> > Using no perturbations
> > 
> > 1: n035
> > Now starting the main loop
> > 
> >   0:   1 bytes 10 times --> 12.96 Mbps in   0.59 usec
> >   1:   2 bytes 10 times --> 25.78 Mbps in   0.59 usec
> >   2:   3 bytes 10 times --> 38.62 Mbps in   0.59 usec
> >   3:   4 bytes 10 times --> 52.88 Mbps in   0.58 usec
> > 
> > I can reproduce that approximately every tenth run.
> > 
> > When binding the processes for exclusive L2 caches (e.g. core 0 and 2) I
> > get constant latencies ~1.1us
> > 
> > Matthias
> > 
> > On Monday 05 March 2012 09:52:39 Matthias Jurenz wrote:
> > > Here the SM BTL parameters:
> > > 
> > > $ ompi_info --param btl sm
> > > MCA btl: parameter "btl_base_verbose" (current value: <0>, data source:
> > > default value) Verbosity level of the BTL framework
> > > MCA btl: parameter "btl" (current value: , data source:
> > > file
> > > [/sw/atlas/libraries/openmpi/1.5.5rc3/x86_64/etc/openmpi-mca-params.con
> > > f] ) Default selection set of components for the btl framework (
> > > means use all components that can be found)
> > > MCA btl: information "btl_sm_have_knem_support" (value:

Re: [OMPI devel] poor btl sm latency

2012-03-12 Thread Matthias Jurenz
It's a SUSE Linux Enterprise Server 11 Service Pack 1 with kernel version 
2.6.32.49-0.3-default.

Matthias

On Friday 09 March 2012 16:36:41 you wrote:
> What OS are you using ?
> 
> Joshua
> 
> - Original Message -
> From: Matthias Jurenz [mailto:matthias.jur...@tu-dresden.de]
> Sent: Friday, March 09, 2012 08:50 AM
> To: Open MPI Developers 
> Cc: Mora, Joshua
> Subject: Re: [OMPI devel] poor btl sm latency
> 
> I just made an interesting observation:
> 
> When binding the processes to two neighboring cores (L2 sharing) NetPIPE
> shows *sometimes* pretty good results: ~0.5us
> 
> $ mpirun -mca btl sm,self -np 1 hwloc-bind -v core:0 ./NPmpi_ompi1.5.5 -u 4
> -n 10 -p 0 : -np 1 hwloc-bind -v core:1 ./NPmpi_ompi1.5.5 -u 4 -n
> 10 -p 0 using object #0 depth 6 below cpuset 0x,0x
> using object #1 depth 6 below cpuset 0x,0x
> adding 0x0001 to 0x0
> adding 0x0001 to 0x0
> assuming the command starts at ./NPmpi_ompi1.5.5
> binding on cpu set 0x0001
> adding 0x0002 to 0x0
> adding 0x0002 to 0x0
> assuming the command starts at ./NPmpi_ompi1.5.5
> binding on cpu set 0x0002
> Using no perturbations
> 
> 0: n035
> Using no perturbations
> 
> 1: n035
> Now starting the main loop
>   0:   1 bytes 10 times -->  6.01 Mbps in   1.27 usec
>   1:   2 bytes 10 times --> 12.04 Mbps in   1.27 usec
>   2:   3 bytes 10 times --> 18.07 Mbps in   1.27 usec
>   3:   4 bytes 10 times --> 24.13 Mbps in   1.26 usec
> 
> $ mpirun -mca btl sm,self -np 1 hwloc-bind -v core:0 ./NPmpi_ompi1.5.5 -u 4
> -n 10 -p 0 : -np 1 hwloc-bind -v core:1 ./NPmpi_ompi1.5.5 -u 4 -n
> 10 -p 0 using object #0 depth 6 below cpuset 0x,0x
> adding 0x0001 to 0x0
> adding 0x0001 to 0x0
> assuming the command starts at ./NPmpi_ompi1.5.5
> binding on cpu set 0x0001
> using object #1 depth 6 below cpuset 0x,0x
> adding 0x0002 to 0x0
> adding 0x0002 to 0x0
> assuming the command starts at ./NPmpi_ompi1.5.5
> binding on cpu set 0x0002
> Using no perturbations
> 
> 0: n035
> Using no perturbations
> 
> 1: n035
> Now starting the main loop
>   0:   1 bytes 10 times --> 12.96 Mbps in   0.59 usec
>   1:   2 bytes 10 times --> 25.78 Mbps in   0.59 usec
>   2:   3 bytes 10 times --> 38.62 Mbps in   0.59 usec
>   3:   4 bytes 10 times --> 52.88 Mbps in   0.58 usec
> 
> I can reproduce that approximately every tenth run.
> 
> When binding the processes for exclusive L2 caches (e.g. core 0 and 2) I
> get constant latencies ~1.1us
> 
> Matthias
> 
> On Monday 05 March 2012 09:52:39 Matthias Jurenz wrote:
> > Here the SM BTL parameters:
> > 
> > $ ompi_info --param btl sm
> > MCA btl: parameter "btl_base_verbose" (current value: <0>, data source:
> > default value) Verbosity level of the BTL framework
> > MCA btl: parameter "btl" (current value: , data source:
> > file
> > [/sw/atlas/libraries/openmpi/1.5.5rc3/x86_64/etc/openmpi-mca-params.conf]
> > ) Default selection set of components for the btl framework ( means
> > use all components that can be found)
> > MCA btl: information "btl_sm_have_knem_support" (value: <1>, data source:
> > default value) Whether this component supports the knem Linux kernel
> > module or not
> > MCA btl: parameter "btl_sm_use_knem" (current value: <-1>, data source:
> > default value) Whether knem support is desired or not (negative = try to
> > enable knem support, but continue even if it is not available, 0 = do not
> > enable knem support, positive = try to enable knem support and fail if it
> > is not available)
> > MCA btl: parameter "btl_sm_knem_dma_min" (current value: <0>, data
> > source: default value) Minimum message size (in bytes) to use the knem
> > DMA mode; ignored if knem does not support DMA mode (0 = do not use the
> > knem DMA mode) MCA btl: parameter "btl_sm_knem_max_simultaneous"
> > (current value: <0>, data source: default value) Max number of
> > simultaneous ongoing knem operations to support (0 = do everything
> > synchronously, which probably gives the best large message latency; >0
> > means to do all operations asynchronously, which supports better overlap
> > for simultaneous large message sends)
> > MCA btl: parameter "btl_sm_free_list_num" (current value: <8>, data
> > source: default value)
> > MCA btl: parameter "btl_sm

Re: [OMPI devel] poor btl sm latency

2012-03-09 Thread Matthias Jurenz
I just made an interesting observation:

When binding the processes to two neighboring cores (L2 sharing), NetPIPE 
*sometimes* shows pretty good results: ~0.5us

$ mpirun -mca btl sm,self -np 1 hwloc-bind -v core:0 ./NPmpi_ompi1.5.5 -u 4 -n 
10 -p 0 : -np 1 hwloc-bind -v core:1 ./NPmpi_ompi1.5.5 -u 4 -n 10 -p 0
using object #0 depth 6 below cpuset 0x,0x
using object #1 depth 6 below cpuset 0x,0x
adding 0x0001 to 0x0
adding 0x0001 to 0x0
assuming the command starts at ./NPmpi_ompi1.5.5
binding on cpu set 0x0001
adding 0x0002 to 0x0
adding 0x0002 to 0x0
assuming the command starts at ./NPmpi_ompi1.5.5
binding on cpu set 0x0002
Using no perturbations

0: n035
Using no perturbations

1: n035
Now starting the main loop
  0:   1 bytes 10 times -->  6.01 Mbps in   1.27 usec
  1:   2 bytes 10 times --> 12.04 Mbps in   1.27 usec
  2:   3 bytes 10 times --> 18.07 Mbps in   1.27 usec
  3:   4 bytes 10 times --> 24.13 Mbps in   1.26 usec

$ mpirun -mca btl sm,self -np 1 hwloc-bind -v core:0 ./NPmpi_ompi1.5.5 -u 4 -n 
10 -p 0 : -np 1 hwloc-bind -v core:1 ./NPmpi_ompi1.5.5 -u 4 -n 10 -p 0
using object #0 depth 6 below cpuset 0x,0x
adding 0x0001 to 0x0
adding 0x0001 to 0x0
assuming the command starts at ./NPmpi_ompi1.5.5
binding on cpu set 0x0001
using object #1 depth 6 below cpuset 0x,0x
adding 0x0002 to 0x0
adding 0x0002 to 0x0
assuming the command starts at ./NPmpi_ompi1.5.5
binding on cpu set 0x0002
Using no perturbations

0: n035
Using no perturbations

1: n035
Now starting the main loop
  0:   1 bytes 10 times --> 12.96 Mbps in   0.59 usec
  1:   2 bytes 10 times --> 25.78 Mbps in   0.59 usec
  2:   3 bytes 10 times --> 38.62 Mbps in   0.59 usec
  3:   4 bytes 10 times --> 52.88 Mbps in   0.58 usec

I can reproduce that approximately every tenth run.

When binding the processes to exclusive L2 caches (e.g. cores 0 and 2) I get 
constant latencies of ~1.1us.

Matthias

On Monday 05 March 2012 09:52:39 Matthias Jurenz wrote:
> Here the SM BTL parameters:
> 
> $ ompi_info --param btl sm
> MCA btl: parameter "btl_base_verbose" (current value: <0>, data source:
> default value) Verbosity level of the BTL framework
> MCA btl: parameter "btl" (current value: , data source:
> file
> [/sw/atlas/libraries/openmpi/1.5.5rc3/x86_64/etc/openmpi-mca-params.conf])
> Default selection set of components for the btl framework ( means
> use all components that can be found)
> MCA btl: information "btl_sm_have_knem_support" (value: <1>, data source:
> default value) Whether this component supports the knem Linux kernel module
> or not
> MCA btl: parameter "btl_sm_use_knem" (current value: <-1>, data source:
> default value) Whether knem support is desired or not (negative = try to
> enable knem support, but continue even if it is not available, 0 = do not
> enable knem support, positive = try to enable knem support and fail if it
> is not available)
> MCA btl: parameter "btl_sm_knem_dma_min" (current value: <0>, data source:
> default value) Minimum message size (in bytes) to use the knem DMA mode;
> ignored if knem does not support DMA mode (0 = do not use the knem DMA
> mode) MCA btl: parameter "btl_sm_knem_max_simultaneous" (current value:
> <0>, data source: default value) Max number of simultaneous ongoing knem
> operations to support (0 = do everything synchronously, which probably
> gives the best large message latency; >0 means to do all operations
> asynchronously, which supports better overlap for simultaneous large
> message sends)
> MCA btl: parameter "btl_sm_free_list_num" (current value: <8>, data source:
> default value)
> MCA btl: parameter "btl_sm_free_list_max" (current value: <-1>, data
> source: default value)
> MCA btl: parameter "btl_sm_free_list_inc" (current value: <64>, data
> source: default value)
> MCA btl: parameter "btl_sm_max_procs" (current value: <-1>, data source:
> default value)
> MCA btl: parameter "btl_sm_mpool" (current value: , data source:
> default value)
> MCA btl: parameter "btl_sm_fifo_size" (current value: <4096>, data source:
> default value)
> MCA btl: parameter "btl_sm_num_fifos" (current value: <1>, data source:
> default value)
> MCA btl: parameter "btl_sm_fifo_lazy_free" (current value: <120>, data
> source: default value)
> MCA btl: parameter "btl_sm_sm_extra_procs" (current value: <0>, data
> source: default value)
> MCA btl: parameter "btl_sm_exclusivity" (current value: <65535>, data
> source: default value) BTL exclusivity (must be >= 0)
> MCA btl: parameter "btl_sm_flags" (current value: <5>, data source: default
> value) BTL bit flags (general flags: SEND=1, PUT=2, GET=4, SEND_INPLACE=8,
> RDMA_MATCHED=64, HETEROGENEOUS_RDMA=256; flags only used by the "dr" PML
> (ignored by others): ACK=16, CHECKSUM=32, RDMA_COMPLETION=128; flags

Re: [OMPI devel] poor btl sm latency

2012-03-05 Thread Matthias Jurenz
Here are the SM BTL parameters:

$ ompi_info --param btl sm
MCA btl: parameter "btl_base_verbose" (current value: <0>, data source: 
default value) Verbosity level of the BTL framework
MCA btl: parameter "btl" (current value: , data source: file 
[/sw/atlas/libraries/openmpi/1.5.5rc3/x86_64/etc/openmpi-mca-params.conf]) 
Default selection set of components for the btl framework ( means use 
all components that can be found)
MCA btl: information "btl_sm_have_knem_support" (value: <1>, data source: 
default value) Whether this component supports the knem Linux kernel module or 
not
MCA btl: parameter "btl_sm_use_knem" (current value: <-1>, data source: 
default value) Whether knem support is desired or not (negative = try to 
enable knem support, but continue even if it is not available, 0 = do not 
enable knem support, positive = try to enable knem support and fail if it is 
not available)
MCA btl: parameter "btl_sm_knem_dma_min" (current value: <0>, data source: 
default value) Minimum message size (in bytes) to use the knem DMA mode; 
ignored if knem does not support DMA mode (0 = do not use the knem DMA mode)
MCA btl: parameter "btl_sm_knem_max_simultaneous" (current value: <0>, data 
source: default value) Max number of simultaneous ongoing knem operations to 
support (0 = do everything synchronously, which probably gives the best large 
message latency; >0 means to do all operations asynchronously, which supports 
better overlap for simultaneous large message sends)
MCA btl: parameter "btl_sm_free_list_num" (current value: <8>, data source: 
default value)
MCA btl: parameter "btl_sm_free_list_max" (current value: <-1>, data source: 
default value)
MCA btl: parameter "btl_sm_free_list_inc" (current value: <64>, data source: 
default value)
MCA btl: parameter "btl_sm_max_procs" (current value: <-1>, data source: 
default value)
MCA btl: parameter "btl_sm_mpool" (current value: , data source: default 
value)
MCA btl: parameter "btl_sm_fifo_size" (current value: <4096>, data source: 
default value)
MCA btl: parameter "btl_sm_num_fifos" (current value: <1>, data source: default 
value)
MCA btl: parameter "btl_sm_fifo_lazy_free" (current value: <120>, data source: 
default value)
MCA btl: parameter "btl_sm_sm_extra_procs" (current value: <0>, data source: 
default value)
MCA btl: parameter "btl_sm_exclusivity" (current value: <65535>, data source: 
default value) BTL exclusivity (must be >= 0)
MCA btl: parameter "btl_sm_flags" (current value: <5>, data source: default 
value) BTL bit flags (general flags: SEND=1, PUT=2, GET=4, SEND_INPLACE=8, 
RDMA_MATCHED=64, HETEROGENEOUS_RDMA=256; flags only used by the "dr" PML 
(ignored by others): ACK=16, CHECKSUM=32, RDMA_COMPLETION=128; flags only used 
by the "bfo" PML (ignored by others): FAILOVER_SUPPORT=512)
MCA btl: parameter "btl_sm_rndv_eager_limit" (current value: <4096>, data 
source: default value) Size (in bytes) of "phase 1" fragment sent for all 
large messages (must be >= 0 and <= eager_limit)
MCA btl: parameter "btl_sm_eager_limit" (current value: <4096>, data source: 
default value) Maximum size (in bytes) of "short" messages (must be >= 1).
MCA btl: parameter "btl_sm_max_send_size" (current value: <32768>, data 
source: default value) Maximum size (in bytes) of a single "phase 2" fragment 
of a long message when using the pipeline protocol (must be >= 1)
MCA btl: parameter "btl_sm_bandwidth" (current value: <9000>, data source: 
default value) Approximate maximum bandwidth of interconnect(0 = auto-detect 
value at run-time [not supported in all BTL modules], >= 1 = bandwidth in 
Mbps)
MCA btl: parameter "btl_sm_latency" (current value: <1>, data source: default 
value) Approximate latency of interconnect (must be >= 0)
MCA btl: parameter "btl_sm_priority" (current value: <0>, data source: default 
value)
MCA btl: parameter "btl_base_warn_component_unused" (current value: <1>, data 
source: default value) This parameter is used to turn on warning messages when 
certain NICs are not used
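Any of the parameters listed above can be overridden for testing, either per run on the mpirun command line or persistently in an MCA parameter file - a sketch with an arbitrarily chosen parameter and value:

$ mpirun -np 2 -mca btl self,sm -mca btl_sm_eager_limit 8192 ./NPmpi_ompi1.5.5
$ echo "btl_sm_eager_limit = 8192" >> $HOME/.openmpi/mca-params.conf   # per-user default

(The system-wide equivalent is the openmpi-mca-params.conf file referenced in the output above.)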

Matthias

On Friday 02 March 2012 16:23:32 George Bosilca wrote:
> Please do a "ompi_info --param btl sm" on your environment. The lazy_free
> direct the internals of the SM BTL not to release the memory fragments
> used to communicate until the lazy limit is reached. The default value was
> deemed as reasonable a while back when the number of default fragments was
> large. Lately there were some patches to reduce the memory footprint of
> the SM BTL and these might have lowered the available fragments to a limit
> where the default value for the lazy_free is now too large.
> 
>   george.
> 
> On Mar 2, 2012, at 10:08 , Matthias Jurenz wrote:
> > In thanks to the OTPO tool, I figured out that setting the MCA parameter
> > btl_sm_fifo_lazy_free to 1 (default is 120) improves the latency
> > significantly: 0,88µs
> > 
> > But somehow I get the feeling that this doesn't eliminate the actual
> > problem...
> > 
> > Matthias
> > 
> > On Friday 02 March 2012 15:37:03 Matthias J

Re: [OMPI devel] poor btl sm latency

2012-03-02 Thread George Bosilca
Please do a "ompi_info --param btl sm" on your environment. The lazy_free 
direct the internals of the SM BTL not to release the memory fragments used to 
communicate until the lazy limit is reached. The default value was deemed as 
reasonable a while back when the number of default fragments was large. Lately 
there were some patches to reduce the memory footprint of the SM BTL and these 
might have lowered the available fragments to a limit where the default value 
for the lazy_free is now too large.
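A rough manual sweep over that parameter, in the spirit of what OTPO automates (the binary name is a placeholder, NetPIPE options omitted, and the parameter values are just sample points):

$ for lf in 1 8 32 120; do mpirun -np 2 -mca btl self,sm -mca btl_sm_fifo_lazy_free $lf ./NPmpi_ompi1.5.5; done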

  george.

On Mar 2, 2012, at 10:08 , Matthias Jurenz wrote:

> In thanks to the OTPO tool, I figured out that setting the MCA parameter 
> btl_sm_fifo_lazy_free to 1 (default is 120) improves the latency 
> significantly: 
> 0,88µs
> 
> But somehow I get the feeling that this doesn't eliminate the actual 
> problem...
> 
> Matthias
> 
> On Friday 02 March 2012 15:37:03 Matthias Jurenz wrote:
>> On Friday 02 March 2012 14:58:45 Jeffrey Squyres wrote:
>>> Ok.  Good that there's no oversubscription bug, at least.  :-)
>>> 
>>> Did you see my off-list mail to you yesterday about building with an
>>> external copy of hwloc 1.4 to see if that helps?
>> 
>> Yes, I did - I answered as well. Our mail server seems to be something busy
>> today...
>> 
>> Just for the record: Using hwloc-1.4 makes no difference.
>> 
>> Matthias
>> 
>>> On Mar 2, 2012, at 8:26 AM, Matthias Jurenz wrote:
 To exclude a possible bug within the LSF component, I rebuilt Open MPI
 without support for LSF (--without-lsf).
 
 -> It makes no difference - the latency is still bad: ~1.1us.
 
 Matthias
 
 On Friday 02 March 2012 13:50:13 Matthias Jurenz wrote:
> SORRY, it was obviously a big mistake by me. :-(
> 
> Open MPI 1.5.5 was built with LSF support, so when starting an LSF job
> it's necessary to request at least the number of tasks/cores as used
> for the subsequent mpirun command. That was not the case - I forgot
> the bsub's '-n' option to specify the number of task, so only *one*
> task/core was requested.
> 
> Open MPI 1.4.5 was built *without* LSF support, so the supposed
> misbehavior could not happen with it.
> 
> In short, there is no bug in Open MPI 1.5.x regarding to the detection
> of oversubscription. Sorry for any confusion!
> 
> Matthias
> 
> On Tuesday 28 February 2012 13:36:56 Matthias Jurenz wrote:
>> When using Open MPI v1.4.5 I get ~1.1us. That's the same result as I
>> get with Open MPI v1.5.x using mpi_yield_when_idle=0.
>> So I think there is a bug in Open MPI (v1.5.4 and v1.5.5rc2)
>> regarding to the automatic performance mode selection.
>> 
>> When enabling the degraded performance mode for Open MPI 1.4.5
>> (mpi_yield_when_idle=1) I get ~1.8us latencies.
>> 
>> Matthias
>> 
>> On Tuesday 28 February 2012 06:20:28 Christopher Samuel wrote:
>>> On 13/02/12 22:11, Matthias Jurenz wrote:
 Do you have any idea? Please help!
>>> 
>>> Do you see the same bad latency in the old branch (1.4.5) ?
>>> 
>>> cheers,
>>> Chris
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
 
 ___
 devel mailing list
 de...@open-mpi.org
 http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] poor btl sm latency

2012-03-02 Thread Matthias Jurenz
Thanks to the OTPO tool, I figured out that setting the MCA parameter 
btl_sm_fifo_lazy_free to 1 (default is 120) improves the latency significantly: 
0.88µs

But somehow I get the feeling that this doesn't eliminate the actual 
problem...

Matthias

On Friday 02 March 2012 15:37:03 Matthias Jurenz wrote:
> On Friday 02 March 2012 14:58:45 Jeffrey Squyres wrote:
> > Ok.  Good that there's no oversubscription bug, at least.  :-)
> > 
> > Did you see my off-list mail to you yesterday about building with an
> > external copy of hwloc 1.4 to see if that helps?
> 
> Yes, I did - I answered as well. Our mail server seems to be something busy
> today...
> 
> Just for the record: Using hwloc-1.4 makes no difference.
> 
> Matthias
> 
> > On Mar 2, 2012, at 8:26 AM, Matthias Jurenz wrote:
> > > To exclude a possible bug within the LSF component, I rebuilt Open MPI
> > > without support for LSF (--without-lsf).
> > > 
> > > -> It makes no difference - the latency is still bad: ~1.1us.
> > > 
> > > Matthias
> > > 
> > > On Friday 02 March 2012 13:50:13 Matthias Jurenz wrote:
> > >> SORRY, it was obviously a big mistake by me. :-(
> > >> 
> > >> Open MPI 1.5.5 was built with LSF support, so when starting an LSF job
> > >> it's necessary to request at least the number of tasks/cores as used
> > >> for the subsequent mpirun command. That was not the case - I forgot
> > >> the bsub's '-n' option to specify the number of task, so only *one*
> > >> task/core was requested.
> > >> 
> > >> Open MPI 1.4.5 was built *without* LSF support, so the supposed
> > >> misbehavior could not happen with it.
> > >> 
> > >> In short, there is no bug in Open MPI 1.5.x regarding to the detection
> > >> of oversubscription. Sorry for any confusion!
> > >> 
> > >> Matthias
> > >> 
> > >> On Tuesday 28 February 2012 13:36:56 Matthias Jurenz wrote:
> > >>> When using Open MPI v1.4.5 I get ~1.1us. That's the same result as I
> > >>> get with Open MPI v1.5.x using mpi_yield_when_idle=0.
> > >>> So I think there is a bug in Open MPI (v1.5.4 and v1.5.5rc2)
> > >>> regarding to the automatic performance mode selection.
> > >>> 
> > >>> When enabling the degraded performance mode for Open MPI 1.4.5
> > >>> (mpi_yield_when_idle=1) I get ~1.8us latencies.
> > >>> 
> > >>> Matthias
> > >>> 
> > >>> On Tuesday 28 February 2012 06:20:28 Christopher Samuel wrote:
> >  On 13/02/12 22:11, Matthias Jurenz wrote:
> > > Do you have any idea? Please help!
> >  
> >  Do you see the same bad latency in the old branch (1.4.5) ?
> >  
> >  cheers,
> >  Chris
> > >>> 
> > >>> ___
> > >>> devel mailing list
> > >>> de...@open-mpi.org
> > >>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > >> 
> > >> ___
> > >> devel mailing list
> > >> de...@open-mpi.org
> > >> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > > 
> > > ___
> > > devel mailing list
> > > de...@open-mpi.org
> > > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] poor btl sm latency

2012-03-02 Thread Matthias Jurenz
On Friday 02 March 2012 14:58:45 Jeffrey Squyres wrote:
> Ok.  Good that there's no oversubscription bug, at least.  :-)
> 
> Did you see my off-list mail to you yesterday about building with an
> external copy of hwloc 1.4 to see if that helps?
Yes, I did - I answered as well. Our mail server seems to be somewhat busy 
today...

Just for the record: Using hwloc-1.4 makes no difference.

Matthias

> 
> On Mar 2, 2012, at 8:26 AM, Matthias Jurenz wrote:
> > To exclude a possible bug within the LSF component, I rebuilt Open MPI
> > without support for LSF (--without-lsf).
> > 
> > -> It makes no difference - the latency is still bad: ~1.1us.
> > 
> > Matthias
> > 
> > On Friday 02 March 2012 13:50:13 Matthias Jurenz wrote:
> >> SORRY, it was obviously a big mistake by me. :-(
> >> 
> >> Open MPI 1.5.5 was built with LSF support, so when starting an LSF job
> >> it's necessary to request at least the number of tasks/cores as used
> >> for the subsequent mpirun command. That was not the case - I forgot the
> >> bsub's '-n' option to specify the number of task, so only *one*
> >> task/core was requested.
> >> 
> >> Open MPI 1.4.5 was built *without* LSF support, so the supposed
> >> misbehavior could not happen with it.
> >> 
> >> In short, there is no bug in Open MPI 1.5.x regarding to the detection
> >> of oversubscription. Sorry for any confusion!
> >> 
> >> Matthias
> >> 
> >> On Tuesday 28 February 2012 13:36:56 Matthias Jurenz wrote:
> >>> When using Open MPI v1.4.5 I get ~1.1us. That's the same result as I
> >>> get with Open MPI v1.5.x using mpi_yield_when_idle=0.
> >>> So I think there is a bug in Open MPI (v1.5.4 and v1.5.5rc2) regarding
> >>> to the automatic performance mode selection.
> >>> 
> >>> When enabling the degraded performance mode for Open MPI 1.4.5
> >>> (mpi_yield_when_idle=1) I get ~1.8us latencies.
> >>> 
> >>> Matthias
> >>> 
> >>> On Tuesday 28 February 2012 06:20:28 Christopher Samuel wrote:
>  On 13/02/12 22:11, Matthias Jurenz wrote:
> > Do you have any idea? Please help!
>  
>  Do you see the same bad latency in the old branch (1.4.5) ?
>  
>  cheers,
>  Chris
> >>> 
> >>> ___
> >>> devel mailing list
> >>> de...@open-mpi.org
> >>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >> 
> >> ___
> >> devel mailing list
> >> de...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > 
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel


Re: [OMPI devel] poor btl sm latency

2012-03-02 Thread Jeffrey Squyres
Hah!  I just saw your ticket about how --with-hwloc=/path/to/install is broken 
in 1.5.5.  So -- let me go look into that...
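For reference, the configure invocation in question has the form below; the hwloc and install prefixes are placeholders, the point is only that an external hwloc 1.4 installation is passed via --with-hwloc:

$ ./configure --with-hwloc=/opt/hwloc-1.4 --prefix=$HOME/sw/openmpi-1.5.5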


On Mar 2, 2012, at 8:58 AM, Jeffrey Squyres wrote:

> Ok.  Good that there's no oversubscription bug, at least.  :-)
> 
> Did you see my off-list mail to you yesterday about building with an external 
> copy of hwloc 1.4 to see if that helps?
> 
> 
> On Mar 2, 2012, at 8:26 AM, Matthias Jurenz wrote:
> 
>> To exclude a possible bug within the LSF component, I rebuilt Open MPI 
>> without 
>> support for LSF (--without-lsf).
>> 
>> -> It makes no difference - the latency is still bad: ~1.1us.
>> 
>> Matthias
>> 
>> On Friday 02 March 2012 13:50:13 Matthias Jurenz wrote:
>>> SORRY, it was obviously a big mistake by me. :-(
>>> 
>>> Open MPI 1.5.5 was built with LSF support, so when starting an LSF job it's
>>> necessary to request at least the number of tasks/cores as used for the
>>> subsequent mpirun command. That was not the case - I forgot the bsub's '-n'
>>> option to specify the number of task, so only *one* task/core was
>>> requested.
>>> 
>>> Open MPI 1.4.5 was built *without* LSF support, so the supposed misbehavior
>>> could not happen with it.
>>> 
>>> In short, there is no bug in Open MPI 1.5.x regarding to the detection of
>>> oversubscription. Sorry for any confusion!
>>> 
>>> Matthias
>>> 
>>> On Tuesday 28 February 2012 13:36:56 Matthias Jurenz wrote:
 When using Open MPI v1.4.5 I get ~1.1us. That's the same result as I get
 with Open MPI v1.5.x using mpi_yield_when_idle=0.
 So I think there is a bug in Open MPI (v1.5.4 and v1.5.5rc2) regarding to
 the automatic performance mode selection.
 
 When enabling the degraded performance mode for Open MPI 1.4.5
 (mpi_yield_when_idle=1) I get ~1.8us latencies.
 
 Matthias
 
 On Tuesday 28 February 2012 06:20:28 Christopher Samuel wrote:
> On 13/02/12 22:11, Matthias Jurenz wrote:
>> Do you have any idea? Please help!
> 
> Do you see the same bad latency in the old branch (1.4.5) ?
> 
> cheers,
> Chris
 
 ___
 devel mailing list
 de...@open-mpi.org
 http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> 
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] poor btl sm latency

2012-03-02 Thread Jeffrey Squyres
Ok.  Good that there's no oversubscription bug, at least.  :-)

Did you see my off-list mail to you yesterday about building with an external 
copy of hwloc 1.4 to see if that helps?


On Mar 2, 2012, at 8:26 AM, Matthias Jurenz wrote:

> To exclude a possible bug within the LSF component, I rebuilt Open MPI 
> without 
> support for LSF (--without-lsf).
> 
> -> It makes no difference - the latency is still bad: ~1.1us.
> 
> Matthias
> 
> On Friday 02 March 2012 13:50:13 Matthias Jurenz wrote:
>> SORRY, it was obviously a big mistake by me. :-(
>> 
>> Open MPI 1.5.5 was built with LSF support, so when starting an LSF job it's
>> necessary to request at least the number of tasks/cores as used for the
>> subsequent mpirun command. That was not the case - I forgot the bsub's '-n'
>> option to specify the number of task, so only *one* task/core was
>> requested.
>> 
>> Open MPI 1.4.5 was built *without* LSF support, so the supposed misbehavior
>> could not happen with it.
>> 
>> In short, there is no bug in Open MPI 1.5.x regarding to the detection of
>> oversubscription. Sorry for any confusion!
>> 
>> Matthias
>> 
>> On Tuesday 28 February 2012 13:36:56 Matthias Jurenz wrote:
>>> When using Open MPI v1.4.5 I get ~1.1us. That's the same result as I get
>>> with Open MPI v1.5.x using mpi_yield_when_idle=0.
>>> So I think there is a bug in Open MPI (v1.5.4 and v1.5.5rc2) regarding to
>>> the automatic performance mode selection.
>>> 
>>> When enabling the degraded performance mode for Open MPI 1.4.5
>>> (mpi_yield_when_idle=1) I get ~1.8us latencies.
>>> 
>>> Matthias
>>> 
>>> On Tuesday 28 February 2012 06:20:28 Christopher Samuel wrote:
 On 13/02/12 22:11, Matthias Jurenz wrote:
> Do you have any idea? Please help!
 
 Do you see the same bad latency in the old branch (1.4.5) ?
 
 cheers,
 Chris
>>> 
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] poor btl sm latency

2012-03-02 Thread Matthias Jurenz
To exclude a possible bug within the LSF component, I rebuilt Open MPI without 
support for LSF (--without-lsf).

-> It makes no difference - the latency is still bad: ~1.1us.

Matthias

On Friday 02 March 2012 13:50:13 Matthias Jurenz wrote:
> SORRY, it was obviously a big mistake by me. :-(
> 
> Open MPI 1.5.5 was built with LSF support, so when starting an LSF job it's
> necessary to request at least the number of tasks/cores as used for the
> subsequent mpirun command. That was not the case - I forgot the bsub's '-n'
> option to specify the number of task, so only *one* task/core was
> requested.
> 
> Open MPI 1.4.5 was built *without* LSF support, so the supposed misbehavior
> could not happen with it.
> 
> In short, there is no bug in Open MPI 1.5.x regarding to the detection of
> oversubscription. Sorry for any confusion!
> 
> Matthias
> 
> On Tuesday 28 February 2012 13:36:56 Matthias Jurenz wrote:
> > When using Open MPI v1.4.5 I get ~1.1us. That's the same result as I get
> > with Open MPI v1.5.x using mpi_yield_when_idle=0.
> > So I think there is a bug in Open MPI (v1.5.4 and v1.5.5rc2) regarding to
> > the automatic performance mode selection.
> > 
> > When enabling the degraded performance mode for Open MPI 1.4.5
> > (mpi_yield_when_idle=1) I get ~1.8us latencies.
> > 
> > Matthias
> > 
> > On Tuesday 28 February 2012 06:20:28 Christopher Samuel wrote:
> > > On 13/02/12 22:11, Matthias Jurenz wrote:
> > > > Do you have any idea? Please help!
> > > 
> > > Do you see the same bad latency in the old branch (1.4.5) ?
> > > 
> > > cheers,
> > > Chris
> > 
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


Re: [OMPI devel] poor btl sm latency

2012-03-02 Thread Matthias Jurenz
SORRY, it was obviously a big mistake on my part. :-(

Open MPI 1.5.5 was built with LSF support, so when starting an LSF job it's 
necessary to request at least as many tasks/cores as are used for the 
subsequent mpirun command. That was not the case - I forgot bsub's '-n' 
option to specify the number of tasks, so only *one* task/core was requested.
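For the record, the corrected submission just has to request at least as many slots with bsub's '-n' as mpirun later starts - a minimal sketch (the output file name is a placeholder):

$ bsub -n 2 -o netpipe.%J.out mpirun -np 2 ./NPmpi_ompi1.5.5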

Open MPI 1.4.5 was built *without* LSF support, so the supposed misbehavior 
could not happen with it.

In short, there is no bug in Open MPI 1.5.x regarding the detection of 
oversubscription. Sorry for any confusion!

Matthias

On Tuesday 28 February 2012 13:36:56 Matthias Jurenz wrote:
> When using Open MPI v1.4.5 I get ~1.1us. That's the same result as I get
> with Open MPI v1.5.x using mpi_yield_when_idle=0.
> So I think there is a bug in Open MPI (v1.5.4 and v1.5.5rc2) regarding to
> the automatic performance mode selection.
> 
> When enabling the degraded performance mode for Open MPI 1.4.5
> (mpi_yield_when_idle=1) I get ~1.8us latencies.
> 
> Matthias
> 
> On Tuesday 28 February 2012 06:20:28 Christopher Samuel wrote:
> > On 13/02/12 22:11, Matthias Jurenz wrote:
> > > Do you have any idea? Please help!
> > 
> > Do you see the same bad latency in the old branch (1.4.5) ?
> > 
> > cheers,
> > Chris
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


Re: [OMPI devel] poor btl sm latency

2012-02-28 Thread Matthias Jurenz
When using Open MPI v1.4.5 I get ~1.1us. That's the same result as I get with 
Open MPI v1.5.x using mpi_yield_when_idle=0.
So I think there is a bug in Open MPI (v1.5.4 and v1.5.5rc2) regarding the 
automatic performance mode selection.

When enabling the degraded performance mode for Open MPI 1.4.5 
(mpi_yield_when_idle=1) I get ~1.8us latencies.

Matthias

On Tuesday 28 February 2012 06:20:28 Christopher Samuel wrote:
> On 13/02/12 22:11, Matthias Jurenz wrote:
> > Do you have any idea? Please help!
> 
> Do you see the same bad latency in the old branch (1.4.5) ?
> 
> cheers,
> Chris


Re: [OMPI devel] poor btl sm latency

2012-02-28 Thread Matthias Jurenz
Minor update:

I see some improvement when I set the MCA parameter mpi_yield_when_idle to 0 
to enforce the "Aggressive" performance mode:

$ mpirun -np 2 -mca mpi_yield_when_idle 0 -mca btl self,sm -bind-to-core -cpus-per-proc 2 ./NPmpi_ompi1.5.5 -u 4 -n 10
0: n090
1: n090
Now starting the main loop
  0:   1 bytes 10 times -->  6.96 Mbps in   1.10 usec
  1:   2 bytes 10 times --> 14.00 Mbps in   1.09 usec
  2:   3 bytes 10 times --> 20.88 Mbps in   1.10 usec
  3:   4 bytes 10 times --> 27.65 Mbps in   1.10 usec

When using the default behavior (mpi_yield_when_idle=-1, i.e. Open MPI decides 
which performance mode to use), Open MPI seems to think that it is running 
oversubscribed (more ranks than available cores), so it turns on the degraded 
performance mode (mpi_yield_when_idle=1).

When using a hostfile to enforce the number of available cores
$ cat hostfile
localhost
localhost
I get similar results as when using mpi_yield_when_idle=0.
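The same override can also be set through the environment instead of the command line or a hostfile - a small sketch (NetPIPE options omitted):

$ export OMPI_MCA_mpi_yield_when_idle=0
$ mpirun -np 2 -mca btl self,sm -bind-to-core -cpus-per-proc 2 ./NPmpi_ompi1.5.5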

Perhaps this misbehavior is due to the kernel bug (mentioned by Brice), which 
might cause Open MPI (hwloc) to see fewer cores than are actually available?
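One way to see directly what the kernel reports about cache sharing, independently of hwloc, is the sysfs cache topology - a sketch (the index numbering can vary; on x86 index1 is typically the L1i cache and index2 the L2, and on a fixed kernel both should list the two cores of a Bulldozer module, e.g. 0-1):

$ cat /sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list
$ cat /sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list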

Matthias


On Tuesday 21 February 2012 17:17:49 Matthias Jurenz wrote:
> Some supplements:
> 
> I tried several compilers for building Open MPI with enabled optimizations
> for the AMD Bulldozer architecture:
> 
> * gcc 4.6.2 (-Ofast -mtune=bdver1 -march=bdver1)
> * Open64 5.0 (-O3 -march=bgver1 -mtune=bdver1 -mso)
> * Intel 12.1 (-O3 -msse4.2)
> 
> They all result in similar latencies (~1.4us).
> 
> As I mentioned in my previous mail, I get the best results if the processes
> are bound for disabling L2 sharing (i.e. --bind-to-core --cpus-per-proc 2).
> Just see what happens when doing this for Platform MPI:
> 
> Without process binding:
> 
> $ mpirun -np 2 ./NPmpi_pcmpi -u 4 -n 10
> 0: n091
> 1: n091
> Now starting the main loop
>   0:   1 bytes  1 times --> 16.89 Mbps in   0.45 usec
>   1:   2 bytes  1 times --> 34.11 Mbps in   0.45 usec
>   2:   3 bytes  1 times --> 51.01 Mbps in   0.45 usec
>   3:   4 bytes  1 times --> 68.13 Mbps in   0.45 usec
> 
> With process binding using 'taskset':
> 
> $ mpirun -np 2 taskset -c 0,2 ./NPmpi_pcmpi -u 4 -n 1
> 0: n051
> 1: n051
> Now starting the main loop
>   0:   1 bytes  1 times --> 29.33 Mbps in   0.26 usec
>   1:   2 bytes  1 times --> 58.64 Mbps in   0.26 usec
>   2:   3 bytes  1 times --> 88.05 Mbps in   0.26 usec
>   3:   4 bytes  1 times -->117.33 Mbps in   0.26 usec
> 
> I tried to change some parameters of the SM BTL described here:
> http://www.open-mpi.org/faq/?category=sm#sm-params - but also without
> success.
> 
> Do you have any further ideas?
> 
> Matthias
> 
> On Monday 20 February 2012 13:46:54 Matthias Jurenz wrote:
> > If the processes are bound for L2 sharing (i.e. using neighboring cores
> > pu:0 and pu:1) I get the *worst* latency results:
> > 
> > $ mpiexec -np 1 hwloc-bind pu:0 ./NPmpi -S -u 4 -n 10 : -np 1
> > hwloc-bind pu:1 ./NPmpi -S -u 4 -n 10
> > Using synchronous sends
> > Using synchronous sends
> > 0: n023
> > 1: n023
> > Now starting the main loop
> > 
> >   0:   1 bytes 10 times -->  3.54 Mbps in   2.16 usec
> >   1:   2 bytes 10 times -->  7.10 Mbps in   2.15 usec
> >   2:   3 bytes 10 times --> 10.68 Mbps in   2.14 usec
> >   3:   4 bytes 10 times --> 14.23 Mbps in   2.15 usec
> > 
> > As it should, I get the same result when using '-bind-to-core' *without*
> > '-- cpus-per-proc 2'.
> > 
> > When using two separate L2's (pu:0,pu:2 or '--cpus-per-proc 2') I get
> > better results:
> > 
> > $ mpiexec -np 1 hwloc-bind pu:0 ./NPmpi -S -u 4 -n 10 : -np 1
> > hwloc-bind pu:2 ./NPmpi -S -u 4 -n 10
> > Using synchronous sends
> > 0: n023
> > Using synchronous sends
> > 1: n023
> > Now starting the main loop
> > 
> >   0:   1 bytes 10 times -->  5.15 Mbps in   1.48 usec
> >   1:   2 bytes 10 times --> 10.15 Mbps in   1.50 usec
> >   2:   3 bytes 10 times --> 15.26 Mbps in   1.50 usec
> >   3:   4 bytes 10 times --> 20.23 Mbps in   1.51 usec
> > 
> > So it seems that the process binding within Open MPI works and retires as
> > reason for the bad latency :-(
> > 
> > Matthias
> > 
> > On Thursday 16 February 2012 17:51:53 Brice Goglin wrote:
> > > > On 16/02/2012 17:12, Matthias Jurenz wrote:
> > > > Thanks for the hint, Brice.
> > > > I'll forward this bug report to our cluster vendor.
> > > > 
> > > > Could this be the reason for the bad latencies with Open MPI or does
> > > > it only affect hwloc/lstopo?
> > > 
> > > It affects binding. So it may affect the performance you observed when
> > > using "high-level" binding policies that end up binding on wrong cores
> > > because of hwloc/kern

Re: [OMPI devel] poor btl sm latency

2012-02-28 Thread Christopher Samuel

On 13/02/12 22:11, Matthias Jurenz wrote:

> Do you have any idea? Please help!

Do you see the same bad latency in the old branch (1.4.5) ?

cheers,
Chris
-- 
Christopher Samuel - Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.unimelb.edu.au/



Re: [OMPI devel] poor btl sm latency

2012-02-21 Thread Matthias Jurenz
Some supplements:

I tried several compilers for building Open MPI with optimizations enabled for 
the AMD Bulldozer architecture:

* gcc 4.6.2 (-Ofast -mtune=bdver1 -march=bdver1)
* Open64 5.0 (-O3 -march=bdver1 -mtune=bdver1 -mso)
* Intel 12.1 (-O3 -msse4.2)

They all result in similar latencies (~1.4us).
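For completeness, this is roughly how such flags are passed when building Open MPI (the compilers and install prefix shown here are placeholders; the flag sets are the ones listed above):

$ ./configure CC=gcc CXX=g++ CFLAGS="-Ofast -mtune=bdver1 -march=bdver1" \
      CXXFLAGS="-Ofast -mtune=bdver1 -march=bdver1" --prefix=$HOME/sw/openmpi-1.5.5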

As I mentioned in my previous mail, I get the best results if the processes 
are bound so that L2 sharing is avoided (i.e. --bind-to-core --cpus-per-proc 2).
Just see what happens when doing this for Platform MPI:

Without process binding:

$ mpirun -np 2 ./NPmpi_pcmpi -u 4 -n 10
0: n091
1: n091
Now starting the main loop
  0:   1 bytes  1 times --> 16.89 Mbps in   0.45 usec
  1:   2 bytes  1 times --> 34.11 Mbps in   0.45 usec
  2:   3 bytes  1 times --> 51.01 Mbps in   0.45 usec
  3:   4 bytes  1 times --> 68.13 Mbps in   0.45 usec

With process binding using 'taskset':

$ mpirun -np 2 taskset -c 0,2 ./NPmpi_pcmpi -u 4 -n 1
0: n051
1: n051
Now starting the main loop
  0:   1 bytes  1 times --> 29.33 Mbps in   0.26 usec
  1:   2 bytes  1 times --> 58.64 Mbps in   0.26 usec
  2:   3 bytes  1 times --> 88.05 Mbps in   0.26 usec
  3:   4 bytes  1 times -->117.33 Mbps in   0.26 usec

I tried changing some of the SM BTL parameters described here: 
http://www.open-mpi.org/faq/?category=sm#sm-params - but without success.

Do you have any further ideas?

Matthias

On Monday 20 February 2012 13:46:54 Matthias Jurenz wrote:
> If the processes are bound for L2 sharing (i.e. using neighboring cores
> pu:0 and pu:1) I get the *worst* latency results:
> 
> $ mpiexec -np 1 hwloc-bind pu:0 ./NPmpi -S -u 4 -n 10 : -np 1
> hwloc-bind pu:1 ./NPmpi -S -u 4 -n 10
> Using synchronous sends
> Using synchronous sends
> 0: n023
> 1: n023
> Now starting the main loop
>   0:   1 bytes 10 times -->  3.54 Mbps in   2.16 usec
>   1:   2 bytes 10 times -->  7.10 Mbps in   2.15 usec
>   2:   3 bytes 10 times --> 10.68 Mbps in   2.14 usec
>   3:   4 bytes 10 times --> 14.23 Mbps in   2.15 usec
> 
> As it should, I get the same result when using '-bind-to-core' *without*
> '-- cpus-per-proc 2'.
> 
> When using two separate L2's (pu:0,pu:2 or '--cpus-per-proc 2') I get
> better results:
> 
> $ mpiexec -np 1 hwloc-bind pu:0 ./NPmpi -S -u 4 -n 10 : -np 1
> hwloc-bind pu:2 ./NPmpi -S -u 4 -n 10
> Using synchronous sends
> 0: n023
> Using synchronous sends
> 1: n023
> Now starting the main loop
>   0:   1 bytes 10 times -->  5.15 Mbps in   1.48 usec
>   1:   2 bytes 10 times --> 10.15 Mbps in   1.50 usec
>   2:   3 bytes 10 times --> 15.26 Mbps in   1.50 usec
>   3:   4 bytes 10 times --> 20.23 Mbps in   1.51 usec
> 
> So it seems that the process binding within Open MPI works and retires as
> reason for the bad latency :-(
> 
> Matthias
> 
> On Thursday 16 February 2012 17:51:53 Brice Goglin wrote:
> > > On 16/02/2012 17:12, Matthias Jurenz wrote:
> > > Thanks for the hint, Brice.
> > > I'll forward this bug report to our cluster vendor.
> > > 
> > > Could this be the reason for the bad latencies with Open MPI or does it
> > > only affect hwloc/lstopo?
> > 
> > It affects binding. So it may affect the performance you observed when
> > using "high-level" binding policies that end up binding on wrong cores
> > because of hwloc/kernel problems. If you specify binding manually, it
> > shouldn't hurt.
> > 
> > If the best latency case is supposed to be when L2 is shared, then try:
> > mpiexec -np 1 hwloc-bind pu:0 ./all2all : -np 1 hwloc-bind pu:1
> > 
> > ./all2all
> > Then, we'll see if you can get the same result with one of OMPI binding
> > options.
> > 
> > Brice
> > 
> > > Matthias
> > > 
> > > On Thursday 16 February 2012 15:46:46 Brice Goglin wrote:
> > >> On 16/02/2012 15:39, Matthias Jurenz wrote:
> > >>> Here the output of lstopo from a single compute node. I'm wondering
> > >>> that the fact of L1/L2 sharing isn't visible - also not in the
> > >>> graphical output...
> > >> 
> > >> That's a kernel bug. We're waiting for AMD to tell the kernel that L1i
> > >> and L2 are shared across dual-core modules. If you have some contact
> > >> at AMD, please tell them to look at
> > >> https://bugzilla.kernel.org/show_bug.cgi?id=42607
> > >> 
> > >> Brice
> > >> 
> > >> ___
> > >> devel mailing list
> > >> de...@open-mpi.org
> > >> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > > 
> > > ___
> > > devel mailing list
> > > de...@open-mpi.org
> > > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > 
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 

Re: [OMPI devel] poor btl sm latency

2012-02-20 Thread Matthias Jurenz
If the processes are bound for L2 sharing (i.e. using neighboring cores pu:0 
and pu:1) I get the *worst* latency results:

$ mpiexec -np 1 hwloc-bind pu:0 ./NPmpi -S -u 4 -n 10 : -np 1 hwloc-bind 
pu:1 ./NPmpi -S -u 4 -n 10
Using synchronous sends
Using synchronous sends
0: n023
1: n023
Now starting the main loop
  0:   1 bytes 10 times -->  3.54 Mbps in   2.16 usec
  1:   2 bytes 10 times -->  7.10 Mbps in   2.15 usec
  2:   3 bytes 10 times --> 10.68 Mbps in   2.14 usec
  3:   4 bytes 10 times --> 14.23 Mbps in   2.15 usec

As expected, I get the same result when using '-bind-to-core' *without* 
'--cpus-per-proc 2'.

When using two separate L2's (pu:0,pu:2 or '--cpus-per-proc 2') I get better 
results:

$ mpiexec -np 1 hwloc-bind pu:0 ./NPmpi -S -u 4 -n 10 : -np 1 hwloc-bind 
pu:2 ./NPmpi -S -u 4 -n 10
Using synchronous sends
0: n023
Using synchronous sends
1: n023
Now starting the main loop
  0:   1 bytes 10 times -->  5.15 Mbps in   1.48 usec
  1:   2 bytes 10 times --> 10.15 Mbps in   1.50 usec
  2:   3 bytes 10 times --> 15.26 Mbps in   1.50 usec
  3:   4 bytes 10 times --> 20.23 Mbps in   1.51 usec

So it seems that the process binding within Open MPI works and can be ruled out 
as the reason for the bad latency :-(

Matthias

On Thursday 16 February 2012 17:51:53 Brice Goglin wrote:
> On 16/02/2012 17:12, Matthias Jurenz wrote:
> > Thanks for the hint, Brice.
> > I'll forward this bug report to our cluster vendor.
> > 
> > Could this be the reason for the bad latencies with Open MPI or does it
> > only affect hwloc/lstopo?
> 
> It affects binding. So it may affect the performance you observed when
> using "high-level" binding policies that end up binding on wrong cores
> because of hwloc/kernel problems. If you specify binding manually, it
> shouldn't hurt.
> 
> If the best latency case is supposed to be when L2 is shared, then try:
> mpiexec -np 1 hwloc-bind pu:0 ./all2all : -np 1 hwloc-bind pu:1
> ./all2all
> Then, we'll see if you can get the same result with one of OMPI binding
> options.
> 
> Brice
> 
> > Matthias
> > 
> > On Thursday 16 February 2012 15:46:46 Brice Goglin wrote:
> >> On 16/02/2012 15:39, Matthias Jurenz wrote:
> >>> Here the output of lstopo from a single compute node. I'm wondering
> >>> that the fact of L1/L2 sharing isn't visible - also not in the
> >>> graphical output...
> >> 
> >> That's a kernel bug. We're waiting for AMD to tell the kernel that L1i
> >> and L2 are shared across dual-core modules. If you have some contact at
> >> AMD, please tell them to look at
> >> https://bugzilla.kernel.org/show_bug.cgi?id=42607
> >> 
> >> Brice
> >> 
> >> ___
> >> devel mailing list
> >> de...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > 
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] poor btl sm latency

2012-02-16 Thread Brice Goglin
On 16/02/2012 17:12, Matthias Jurenz wrote:
> Thanks for the hint, Brice.
> I'll forward this bug report to our cluster vendor.
>
> Could this be the reason for the bad latencies with Open MPI or does it only 
> affect hwloc/lstopo?

It affects binding. So it may affect the performance you observed when
using "high-level" binding policies that end up binding on wrong cores
because of hwloc/kernel problems. If you specify binding manually, it
shouldn't hurt.

If the best latency case is supposed to be when L2 is shared, then try:
mpiexec -np 1 hwloc-bind pu:0 ./all2all : -np 1 hwloc-bind pu:1
./all2all
Then, we'll see if you can get the same result with one of OMPI binding
options.

Brice





> Matthias
>
> On Thursday 16 February 2012 15:46:46 Brice Goglin wrote:
>> On 16/02/2012 15:39, Matthias Jurenz wrote:
>>> Here the output of lstopo from a single compute node. I'm wondering that
>>> the fact of L1/L2 sharing isn't visible - also not in the graphical
>>> output...
>> That's a kernel bug. We're waiting for AMD to tell the kernel that L1i
>> and L2 are shared across dual-core modules. If you have some contact at
>> AMD, please tell them to look at
>> https://bugzilla.kernel.org/show_bug.cgi?id=42607
>>
>> Brice
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] poor btl sm latency

2012-02-16 Thread Matthias Jurenz
Thanks for the hint, Brice.
I'll forward this bug report to our cluster vendor.

Could this be the reason for the bad latencies with Open MPI or does it only 
affect hwloc/lstopo?

Matthias

On Thursday 16 February 2012 15:46:46 Brice Goglin wrote:
> On 16/02/2012 15:39, Matthias Jurenz wrote:
> > Here the output of lstopo from a single compute node. I'm wondering that
> > the fact of L1/L2 sharing isn't visible - also not in the graphical
> > output...
> 
> That's a kernel bug. We're waiting for AMD to tell the kernel that L1i
> and L2 are shared across dual-core modules. If you have some contact at
> AMD, please tell them to look at
> https://bugzilla.kernel.org/show_bug.cgi?id=42607
> 
> Brice
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] poor btl sm latency

2012-02-16 Thread Matthias Jurenz


On Thursday 16 February 2012 16:50:55 Jeff Squyres wrote:
> On Feb 16, 2012, at 10:30 AM, Matthias Jurenz wrote:
> > $ mpirun -np 2 --bind-to-core --cpus-per-proc 2 hwloc-bind --get
> > 0x0003
> > 0x000c
> 
> That seems right.  From your prior email, 3 maps to 11 binary, which maps
> to:
> 
>  Socket L#0 (16GB)
>NUMANode L#0 (P#0 8190MB) + L3 L#0 (6144KB)
>  L2 L#0 (2048KB) + L1 L#0 (16KB) + Core L#0 + PU L#0 (P#0)
>  L2 L#1 (2048KB) + L1 L#1 (16KB) + Core L#1 + PU L#1 (P#1)
> 
> And c maps to 1100 binary, which maps to PU's P#2 and P#3 on the same
> socket:
> 
>  Socket L#0 (16GB)
>NUMANode L#0 (P#0 8190MB) + L3 L#0 (6144KB)
>  L2 L#2 (2048KB) + L1 L#2 (16KB) + Core L#2 + PU L#2 (P#2)
>  L2 L#3 (2048KB) + L1 L#3 (16KB) + Core L#3 + PU L#3 (P#3)
> 
> Let me ask two more dumb questions:
> 
> 1. Run "ompi_info | grep debug".  All the debugging is set to "no", right?
>
Yes: 
$ ompi_info | grep debug
Internal debug support: no
Memory debugging support: no

> 2. Your /tmp is not a network filesystem, is it?  (i.e., is OMPI putting
> the shared memory backing files on NFS?)  I *think* you said in a prior
> mail that you tried all the shared memory methods and got the same results
> (i.e., not just mmap), right?

Yes, in addition to mmap on /tmp and /dev/shm I tried sysv and posix and I got 
the same results.

Matthias


Re: [OMPI devel] poor btl sm latency

2012-02-16 Thread Edgar Gabriel
Just a stupid question: in another sm-related thread the value of the
$TMPDIR variable was the problem - could this be the problem here as well?

Edgar

On 2/16/2012 9:30 AM, Matthias Jurenz wrote:
> On Thursday 16 February 2012 16:21:10 Jeff Squyres wrote:
>> On Feb 16, 2012, at 9:39 AM, Matthias Jurenz wrote:
>>> However, the latencies are constant now but still too high:
>>>
>>> $ mpirun -np 2 --bind-to-core --cpus-per-proc 2 ./NPmpi_ompi1.5.5 -S -u
>>> 12 -n 10
>>
>> Can you run:
>>
>> mpirun -np 2 --bind-to-core --cpus-per-proc 2 hwloc-bind --get
>>
>> I want to verify that the processes are actually getting bound to the
>> correct places.
> 
> Sure:
> $ mpirun -np 2 --bind-to-core --cpus-per-proc 2 hwloc-bind --get
> 0x0003
> 0x000c
> 
> Matthias
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- 
Edgar Gabriel
Associate Professor
Parallel Software Technologies Lab  http://pstl.cs.uh.edu
Department of Computer Science  University of Houston
Philip G. Hoffman Hall, Room 524Houston, TX-77204, USA
Tel: +1 (713) 743-3857  Fax: +1 (713) 743-3335





Re: [OMPI devel] poor btl sm latency

2012-02-16 Thread Jeff Squyres
On Feb 16, 2012, at 10:30 AM, Matthias Jurenz wrote:

> $ mpirun -np 2 --bind-to-core --cpus-per-proc 2 hwloc-bind --get
> 0x0003
> 0x000c

That seems right.  From your prior email, 3 maps to 11 binary, which maps to:

 Socket L#0 (16GB)
   NUMANode L#0 (P#0 8190MB) + L3 L#0 (6144KB)
 L2 L#0 (2048KB) + L1 L#0 (16KB) + Core L#0 + PU L#0 (P#0)
 L2 L#1 (2048KB) + L1 L#1 (16KB) + Core L#1 + PU L#1 (P#1)

And c maps to 1100 binary, which maps to PU's P#2 and P#3 on the same socket:

 Socket L#0 (16GB)
   NUMANode L#0 (P#0 8190MB) + L3 L#0 (6144KB)
 L2 L#2 (2048KB) + L1 L#2 (16KB) + Core L#2 + PU L#2 (P#2)
 L2 L#3 (2048KB) + L1 L#3 (16KB) + Core L#3 + PU L#3 (P#3)
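
As a cross-check, hwloc's hwloc-calc utility can translate such bitmasks into
PU indexes directly (a sketch; the exact option spelling and output format may
differ between hwloc versions):

$ hwloc-calc --intersect pu 0x0003    # should list PUs 0,1
$ hwloc-calc --intersect pu 0x000c    # should list PUs 2,3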

Let me ask two more dumb questions:

1. Run "ompi_info | grep debug".  All the debugging is set to "no", right?

2. Your /tmp is not a network filesystem, is it?  (i.e., is OMPI putting the 
shared memory backing files on NFS?)  I *think* you said in a prior mail that 
you tried all the shared memory methods and got the same results (i.e., not 
just mmap), right?
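
(For what it's worth, a quick way to check where the shared-memory backing
files would land and whether /tmp is network-mounted - a sketch using standard
tools:)

$ df -T /tmp /dev/shm     # the filesystem type column should not say nfs
$ stat -f -c %T /tmp      # e.g. ext2/ext3 or tmpfs; nfs would be suspicious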

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] poor btl sm latency

2012-02-16 Thread Matthias Jurenz
On Thursday 16 February 2012 16:21:10 Jeff Squyres wrote:
> On Feb 16, 2012, at 9:39 AM, Matthias Jurenz wrote:
> > However, the latencies are constant now but still too high:
> > 
> > $ mpirun -np 2 --bind-to-core --cpus-per-proc 2 ./NPmpi_ompi1.5.5 -S -u
> > 12 -n 10
> 
> Can you run:
> 
> mpirun -np 2 --bind-to-core --cpus-per-proc 2 hwloc-bind --get
> 
> I want to verify that the processes are actually getting bound to the
> correct places.

Sure:
$ mpirun -np 2 --bind-to-core --cpus-per-proc 2 hwloc-bind --get
0x0003
0x000c

Matthias


Re: [OMPI devel] poor btl sm latency

2012-02-16 Thread Jeff Squyres
On Feb 16, 2012, at 9:39 AM, Matthias Jurenz wrote:

> However, the latencies are constant now but still too high:
> 
> $ mpirun -np 2 --bind-to-core --cpus-per-proc 2 ./NPmpi_ompi1.5.5 -S -u 12 -n 
> 10

Can you run:

mpirun -np 2 --bind-to-core --cpus-per-proc 2 hwloc-bind --get

I want to verify that the processes are actually getting bound to the correct 
places.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] poor btl sm latency

2012-02-16 Thread Brice Goglin
On 16/02/2012 15:39, Matthias Jurenz wrote:
> Here is the output of lstopo from a single compute node. I'm wondering why the
> L1/L2 sharing isn't visible - not even in the graphical output...

That's a kernel bug. We're waiting for AMD to tell the kernel that L1i
and L2 are shared across dual-core modules. If you have some contact at
AMD, please tell them to look at
https://bugzilla.kernel.org/show_bug.cgi?id=42607

Brice



Re: [OMPI devel] poor btl sm latency

2012-02-16 Thread Matthias Jurenz
The inconsistent results disappear when using the option '--cpus-per-proc 2'. 
I assume it has to do with the fact that each core shares the L1-instruction 
and L2-data cache with its neighboring core (see 
http://upload.wikimedia.org/wikipedia/commons/e/ec/AMD_Bulldozer_block_diagram_%288_core_CPU%29.PNG).
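
Once a fixed kernel is in place, that sharing should also become visible in
sysfs (a sketch; with the buggy kernel discussed elsewhere in this thread the
lists may still show only one CPU per L2, and the index numbering below is the
usual layout, not guaranteed):

$ grep . /sys/devices/system/cpu/cpu0/cache/index*/level
$ cat /sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list   # index2 is typically the L2
$ cat /sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_list
# both lists should contain the same core pair (e.g. 0-1) for a shared L2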

However, the latencies are constant now but still too high:

$ mpirun -np 2 --bind-to-core --cpus-per-proc 2 ./NPmpi_ompi1.5.5 -S -u 12 -n 
10
Using synchronous sends
0: n029
1: n029
Using synchronous sends
Now starting the main loop
  0:   1 bytes 10 times -->  4.83 Mbps in   1.58 usec
  1:   2 bytes 10 times -->  9.64 Mbps in   1.58 usec
  2:   3 bytes 10 times --> 14.59 Mbps in   1.57 usec
  3:   4 bytes 10 times --> 19.44 Mbps in   1.57 usec
  4:   6 bytes 10 times --> 29.34 Mbps in   1.56 usec
  5:   8 bytes 10 times --> 38.95 Mbps in   1.57 usec
  6:  12 bytes 10 times --> 58.49 Mbps in   1.57 usec

I updated the Open MPI installation to version 1.5.5rc2r25939 to have hwloc 
1.3.2.

$ ompi_info | grep hwloc
MCA paffinity: hwloc (MCA v2.0, API v2.0, Component v1.5.5)
MCA maffinity: hwloc (MCA v2.0, API v2.0, Component v1.5.5)
MCA hwloc: hwloc132 (MCA v2.0, API v2.0, Component v1.5.5)


Here is the output of lstopo from a single compute node. I'm wondering why the
L1/L2 sharing isn't visible - not even in the graphical output...

$ lstopo --output-format console
Machine (64GB)
  Socket L#0 (16GB)
NUMANode L#0 (P#0 8190MB) + L3 L#0 (6144KB)
  L2 L#0 (2048KB) + L1 L#0 (16KB) + Core L#0 + PU L#0 (P#0)
  L2 L#1 (2048KB) + L1 L#1 (16KB) + Core L#1 + PU L#1 (P#1)
  L2 L#2 (2048KB) + L1 L#2 (16KB) + Core L#2 + PU L#2 (P#2)
  L2 L#3 (2048KB) + L1 L#3 (16KB) + Core L#3 + PU L#3 (P#3)
  L2 L#4 (2048KB) + L1 L#4 (16KB) + Core L#4 + PU L#4 (P#4)
  L2 L#5 (2048KB) + L1 L#5 (16KB) + Core L#5 + PU L#5 (P#5)
  L2 L#6 (2048KB) + L1 L#6 (16KB) + Core L#6 + PU L#6 (P#6)
  L2 L#7 (2048KB) + L1 L#7 (16KB) + Core L#7 + PU L#7 (P#7)
NUMANode L#1 (P#1 8192MB) + L3 L#1 (6144KB)
  L2 L#8 (2048KB) + L1 L#8 (16KB) + Core L#8 + PU L#8 (P#8)
  L2 L#9 (2048KB) + L1 L#9 (16KB) + Core L#9 + PU L#9 (P#9)
  L2 L#10 (2048KB) + L1 L#10 (16KB) + Core L#10 + PU L#10 (P#10)
  L2 L#11 (2048KB) + L1 L#11 (16KB) + Core L#11 + PU L#11 (P#11)
  L2 L#12 (2048KB) + L1 L#12 (16KB) + Core L#12 + PU L#12 (P#12)
  L2 L#13 (2048KB) + L1 L#13 (16KB) + Core L#13 + PU L#13 (P#13)
  L2 L#14 (2048KB) + L1 L#14 (16KB) + Core L#14 + PU L#14 (P#14)
  L2 L#15 (2048KB) + L1 L#15 (16KB) + Core L#15 + PU L#15 (P#15)
  Socket L#1 (16GB)
NUMANode L#2 (P#2 8192MB) + L3 L#2 (6144KB)
  L2 L#16 (2048KB) + L1 L#16 (16KB) + Core L#16 + PU L#16 (P#16)
  L2 L#17 (2048KB) + L1 L#17 (16KB) + Core L#17 + PU L#17 (P#17)
  L2 L#18 (2048KB) + L1 L#18 (16KB) + Core L#18 + PU L#18 (P#18)
  L2 L#19 (2048KB) + L1 L#19 (16KB) + Core L#19 + PU L#19 (P#19)
  L2 L#20 (2048KB) + L1 L#20 (16KB) + Core L#20 + PU L#20 (P#20)
  L2 L#21 (2048KB) + L1 L#21 (16KB) + Core L#21 + PU L#21 (P#21)
  L2 L#22 (2048KB) + L1 L#22 (16KB) + Core L#22 + PU L#22 (P#22)
  L2 L#23 (2048KB) + L1 L#23 (16KB) + Core L#23 + PU L#23 (P#23)
NUMANode L#3 (P#3 8192MB) + L3 L#3 (6144KB)
  L2 L#24 (2048KB) + L1 L#24 (16KB) + Core L#24 + PU L#24 (P#24)
  L2 L#25 (2048KB) + L1 L#25 (16KB) + Core L#25 + PU L#25 (P#25)
  L2 L#26 (2048KB) + L1 L#26 (16KB) + Core L#26 + PU L#26 (P#26)
  L2 L#27 (2048KB) + L1 L#27 (16KB) + Core L#27 + PU L#27 (P#27)
  L2 L#28 (2048KB) + L1 L#28 (16KB) + Core L#28 + PU L#28 (P#28)
  L2 L#29 (2048KB) + L1 L#29 (16KB) + Core L#29 + PU L#29 (P#29)
  L2 L#30 (2048KB) + L1 L#30 (16KB) + Core L#30 + PU L#30 (P#30)
  L2 L#31 (2048KB) + L1 L#31 (16KB) + Core L#31 + PU L#31 (P#31)
  Socket L#2 (16GB)
NUMANode L#4 (P#4 8192MB) + L3 L#4 (6144KB)
  L2 L#32 (2048KB) + L1 L#32 (16KB) + Core L#32 + PU L#32 (P#32)
  L2 L#33 (2048KB) + L1 L#33 (16KB) + Core L#33 + PU L#33 (P#33)
  L2 L#34 (2048KB) + L1 L#34 (16KB) + Core L#34 + PU L#34 (P#34)
  L2 L#35 (2048KB) + L1 L#35 (16KB) + Core L#35 + PU L#35 (P#35)
  L2 L#36 (2048KB) + L1 L#36 (16KB) + Core L#36 + PU L#36 (P#36)
  L2 L#37 (2048KB) + L1 L#37 (16KB) + Core L#37 + PU L#37 (P#37)
  L2 L#38 (2048KB) + L1 L#38 (16KB) + Core L#38 + PU L#38 (P#38)
  L2 L#39 (2048KB) + L1 L#39 (16KB) + Core L#39 + PU L#39 (P#39)
NUMANode L#5 (P#5 8192MB) + L3 L#5 (6144KB)
  L2 L#40 (2048KB) + L1 L#40 (16KB) + Core L#40 + PU L#40 (P#40)
  L2 L#41 (2048KB) + L1 L#41 (16KB) + Core L#41 + PU L#41 (P#41)
  L2 L#42 (2048KB) + L1 L#42 (16KB) + Core L#42 + PU L#42 (P#42)
  L2 L#43 (2048KB) + L1 L#43 (16KB) + Core L#43 + PU L#43 (P#43)
  L2 L#44 (2048KB) + L1 L#44 (16KB) + Core L#44 + PU L#44 (P#44)
  L2 L#45 (2048KB) + L1 L#45 (16KB) + Core L#45 + PU L#45 (P

Re: [OMPI devel] poor btl sm latency

2012-02-16 Thread Jeff Squyres
Yowza.  With inconsistent results like that, it does sound like something is 
going on in the hardware.  Unfortunately, I don't know much/anything about AMDs 
(Cisco is an Intel shop).  :-\

Do you have (AMD's equivalent of) hyperthreading enabled, perchance?
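
(A quick way to see how the OS reports the AMD module/CMT layout - a sketch:)

$ lscpu | egrep -i 'thread|core|socket'
# look at "Thread(s) per core" and "Core(s) per socket" to see whether sibling threads are reported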

In the latest 1.5.5 nightly tarball, I have just upgraded the included version 
of hwloc to be 1.3.2.  Maybe a good step would be to download hwloc 1.3.2 and 
verify that lstopo is faithfully reporting the actual topology of your system.  
Can you do that?
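
A standalone check could look roughly like this (a sketch; the download URL and
paths are assumptions):

$ wget http://www.open-mpi.org/software/hwloc/v1.3/downloads/hwloc-1.3.2.tar.gz
$ tar xzf hwloc-1.3.2.tar.gz && cd hwloc-1.3.2
$ ./configure --prefix=$HOME/hwloc-1.3.2 && make && make install
$ $HOME/hwloc-1.3.2/bin/lstopo --output-format console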


On Feb 16, 2012, at 7:06 AM, Matthias Jurenz wrote:

> Jeff,
> 
> sorry for the confusion - the all2all is a classic pingpong which uses 
> MPI_Send/Recv with 0 byte messages.
> 
> One thing I just noticed when using NetPIPE/MPI: Platform MPI gives almost
> constant latencies for small messages (~0.89us), though I don't know how
> Platform MPI handles process binding - I just used the defaults.
> When using Open MPI (regardless of core/socket-binding) the results differ 
> from 
> run to run:
> 
> === FIRST RUN ===
> $ mpirun -np 2 --bind-to-socket ./NPmpi_ompi1.5.5  -S -u 12 -n 10
> Using synchronous sends
> 1: n029
> Using synchronous sends
> 0: n029
> Now starting the main loop
>  0:   1 bytes 10 times -->  4.66 Mbps in   1.64 usec
>  1:   2 bytes 10 times -->  8.94 Mbps in   1.71 usec
>  2:   3 bytes 10 times --> 13.65 Mbps in   1.68 usec
>  3:   4 bytes 10 times --> 17.91 Mbps in   1.70 usec
>  4:   6 bytes 10 times --> 29.04 Mbps in   1.58 usec
>  5:   8 bytes 10 times --> 39.06 Mbps in   1.56 usec
>  6:  12 bytes 10 times --> 57.58 Mbps in   1.59 usec
> 
> === SECOND RUN (~3s after the previous run) ===
> $ mpirun -np 2 --bind-to-socket ./NPmpi_ompi1.5.5  -S -u 12 -n 10
> Using synchronous sends
> 1: n029
> Using synchronous sends
> 0: n029
> Now starting the main loop
>  0:   1 bytes 10 times -->  5.73 Mbps in   1.33 usec
>  1:   2 bytes 10 times --> 11.45 Mbps in   1.33 usec
>  2:   3 bytes 10 times --> 17.13 Mbps in   1.34 usec
>  3:   4 bytes 10 times --> 22.94 Mbps in   1.33 usec
>  4:   6 bytes 10 times --> 34.39 Mbps in   1.33 usec
>  5:   8 bytes 10 times --> 46.40 Mbps in   1.32 usec
>  6:  12 bytes 10 times --> 68.92 Mbps in   1.33 usec
> 
> === THIRD RUN ===
> $ mpirun -np 2 --bind-to-socket ./NPmpi_ompi1.5.5  -S -u 12 -n 10
> Using synchronous sends
> 0: n029
> Using synchronous sends
> 1: n029
> Now starting the main loop
>  0:   1 bytes 10 times -->  3.50 Mbps in   2.18 usec
>  1:   2 bytes 10 times -->  6.99 Mbps in   2.18 usec
>  2:   3 bytes 10 times --> 10.48 Mbps in   2.18 usec
>  3:   4 bytes 10 times --> 14.00 Mbps in   2.18 usec
>  4:   6 bytes 10 times --> 20.98 Mbps in   2.18 usec
>  5:   8 bytes 10 times --> 27.84 Mbps in   2.19 usec
>  6:  12 bytes 10 times --> 41.99 Mbps in   2.18 usec
> 
> At first glance, I assumed that some CPU power-saving feature was enabled. 
> But the CPU frequency scaling is set to "performance" and there is only one 
> available frequency (2.2GHz).
> 
> Any idea how this can happen?
> 
> 
> Matthias
> 
> On Wednesday 15 February 2012 19:29:38 Jeff Squyres wrote:
>> Something is definitely wrong -- 1.4us is way too high for a 0 or 1 byte
>> HRT ping pong.  What is this all2all benchmark, btw?  Is it measuring an
>> MPI_ALLTOALL, or a pingpong?
>> 
>> FWIW, on an older Nehalem machine running NetPIPE/MPI, I'm getting about
>> .27us latencies for short messages over sm and binding to socket.
>> 
>> On Feb 14, 2012, at 7:20 AM, Matthias Jurenz wrote:
>>> I've built Open MPI 1.5.5rc1 (tarball from Web) with CFLAGS=-O3.
>>> Unfortunately, also without any effect.
>>> 
>>> Here some results with enabled binding reports:
>>> 
>>> $ mpirun *--bind-to-core* --report-bindings -np 2 ./all2all_ompi1.5.5
>>> [n043:61313] [[56788,0],0] odls:default:fork binding child [[56788,1],1]
>>> to cpus 0002
>>> [n043:61313] [[56788,0],0] odls:default:fork binding child [[56788,1],0]
>>> to cpus 0001
>>> latency: 1.415us
>>> 
>>> $ mpirun *-mca maffinity hwloc --bind-to-core* --report-bindings -np 2
>>> ./all2all_ompi1.5.5
>>> [n043:61469] [[49736,0],0] odls:default:fork binding child [[49736,1],1]
>>> to cpus 0002
>>> [n043:61469] [[49736,0],0] odls:default:fork binding child [[49736,1],0]
>>> to cpus 0001
>>> latency: 1.4us
>>> 
>>> $ mpirun *-mca maffinity first_use --bind-to-core* --report-bindings -np
>>> 2 ./all2all_ompi1.5.5
>>> [n043:61508] [[49681,0],0] odls:default:fork binding child [[49681,1],1]
>>> to cpus 0002
>>> [n043:61508] [[49681,0],0] odls:default:fork binding child [[49681,1],0]
>>> to cpus 0001
>>> latency: 1.4us
>>> 
>>> 
>>> $ mpirun *--bind-to-socket* --report-bindings -np 2 ./all2all_ompi1.

Re: [OMPI devel] poor btl sm latency

2012-02-16 Thread Matthias Jurenz
Jeff,

sorry for the confusion - the all2all is a classic pingpong which uses 
MPI_Send/Recv with 0 byte messages.

One thing I just noticed when using NetPIPE/MPI: Platform MPI gives almost
constant latencies for small messages (~0.89us), though I don't know how
Platform MPI handles process binding - I just used the defaults.
When using Open MPI (regardless of core/socket-binding) the results differ from 
run to run:

=== FIRST RUN ===
$ mpirun -np 2 --bind-to-socket ./NPmpi_ompi1.5.5  -S -u 12 -n 10
Using synchronous sends
1: n029
Using synchronous sends
0: n029
Now starting the main loop
  0:   1 bytes 10 times -->  4.66 Mbps in   1.64 usec
  1:   2 bytes 10 times -->  8.94 Mbps in   1.71 usec
  2:   3 bytes 10 times --> 13.65 Mbps in   1.68 usec
  3:   4 bytes 10 times --> 17.91 Mbps in   1.70 usec
  4:   6 bytes 10 times --> 29.04 Mbps in   1.58 usec
  5:   8 bytes 10 times --> 39.06 Mbps in   1.56 usec
  6:  12 bytes 10 times --> 57.58 Mbps in   1.59 usec

=== SECOND RUN (~3s after the previous run) ===
$ mpirun -np 2 --bind-to-socket ./NPmpi_ompi1.5.5  -S -u 12 -n 10
Using synchronous sends
1: n029
Using synchronous sends
0: n029
Now starting the main loop
  0:   1 bytes 10 times -->  5.73 Mbps in   1.33 usec
  1:   2 bytes 10 times --> 11.45 Mbps in   1.33 usec
  2:   3 bytes 10 times --> 17.13 Mbps in   1.34 usec
  3:   4 bytes 10 times --> 22.94 Mbps in   1.33 usec
  4:   6 bytes 10 times --> 34.39 Mbps in   1.33 usec
  5:   8 bytes 10 times --> 46.40 Mbps in   1.32 usec
  6:  12 bytes 10 times --> 68.92 Mbps in   1.33 usec

=== THIRD RUN ===
$ mpirun -np 2 --bind-to-socket ./NPmpi_ompi1.5.5  -S -u 12 -n 10
Using synchronous sends
0: n029
Using synchronous sends
1: n029
Now starting the main loop
  0:   1 bytes 10 times -->  3.50 Mbps in   2.18 usec
  1:   2 bytes 10 times -->  6.99 Mbps in   2.18 usec
  2:   3 bytes 10 times --> 10.48 Mbps in   2.18 usec
  3:   4 bytes 10 times --> 14.00 Mbps in   2.18 usec
  4:   6 bytes 10 times --> 20.98 Mbps in   2.18 usec
  5:   8 bytes 10 times --> 27.84 Mbps in   2.19 usec
  6:  12 bytes 10 times --> 41.99 Mbps in   2.18 usec

At first glance, I assumed that some CPU power-saving feature was enabled. 
But the CPU frequency scaling is set to "performance" and there is only one 
available frequency (2.2GHz).
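
(For completeness, these settings can be double-checked via sysfs - a sketch;
the exact paths depend on the cpufreq driver in use:)

$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
$ grep 'cpu MHz' /proc/cpuinfo | sort -u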

Any idea how this can happen?


Matthias

On Wednesday 15 February 2012 19:29:38 Jeff Squyres wrote:
> Something is definitely wrong -- 1.4us is way too high for a 0 or 1 byte
> HRT ping pong.  What is this all2all benchmark, btw?  Is it measuring an
> MPI_ALLTOALL, or a pingpong?
> 
> FWIW, on an older Nehalem machine running NetPIPE/MPI, I'm getting about
> .27us latencies for short messages over sm and binding to socket.
> 
> On Feb 14, 2012, at 7:20 AM, Matthias Jurenz wrote:
> > I've built Open MPI 1.5.5rc1 (tarball from Web) with CFLAGS=-O3.
> > Unfortunately, also without any effect.
> > 
> > Here some results with enabled binding reports:
> > 
> > $ mpirun *--bind-to-core* --report-bindings -np 2 ./all2all_ompi1.5.5
> > [n043:61313] [[56788,0],0] odls:default:fork binding child [[56788,1],1]
> > to cpus 0002
> > [n043:61313] [[56788,0],0] odls:default:fork binding child [[56788,1],0]
> > to cpus 0001
> > latency: 1.415us
> > 
> > $ mpirun *-mca maffinity hwloc --bind-to-core* --report-bindings -np 2
> > ./all2all_ompi1.5.5
> > [n043:61469] [[49736,0],0] odls:default:fork binding child [[49736,1],1]
> > to cpus 0002
> > [n043:61469] [[49736,0],0] odls:default:fork binding child [[49736,1],0]
> > to cpus 0001
> > latency: 1.4us
> > 
> > $ mpirun *-mca maffinity first_use --bind-to-core* --report-bindings -np
> > 2 ./all2all_ompi1.5.5
> > [n043:61508] [[49681,0],0] odls:default:fork binding child [[49681,1],1]
> > to cpus 0002
> > [n043:61508] [[49681,0],0] odls:default:fork binding child [[49681,1],0]
> > to cpus 0001
> > latency: 1.4us
> > 
> > 
> > $ mpirun *--bind-to-socket* --report-bindings -np 2 ./all2all_ompi1.5.5
> > [n043:61337] [[56780,0],0] odls:default:fork binding child [[56780,1],1]
> > to socket 0 cpus 0001
> > [n043:61337] [[56780,0],0] odls:default:fork binding child [[56780,1],0]
> > to socket 0 cpus 0001
> > latency: 4.0us
> > 
> > $ mpirun *-mca maffinity hwloc --bind-to-socket* --report-bindings -np 2
> > ./all2all_ompi1.5.5
> > [n043:61615] [[49914,0],0] odls:default:fork binding child [[49914,1],1]
> > to socket 0 cpus 0001
> > [n043:61615] [[49914,0],0] odls:default:fork binding child [[49914,1],0]
> > to socket 0 cpus 0001
> > latency: 4.0us
> > 
> > $ mpirun *-mca maffinity first_use --bind-to-socket* --report-bindings
> > -np 2 ./all2all_ompi1.5.5
> > [n043:61639] [[49810,

Re: [OMPI devel] poor btl sm latency

2012-02-15 Thread Jeff Squyres
Something is definitely wrong -- 1.4us is way too high for a 0 or 1 byte HRT 
ping pong.  What is this all2all benchmark, btw?  Is it measuring an 
MPI_ALLTOALL, or a pingpong?

FWIW, on an older Nehalem machine running NetPIPE/MPI, I'm getting about .27us 
latencies for short messages over sm and binding to socket.


On Feb 14, 2012, at 7:20 AM, Matthias Jurenz wrote:

> I've built Open MPI 1.5.5rc1 (tarball from Web) with CFLAGS=-O3. 
> Unfortunately, also without any effect.
> 
> Here some results with enabled binding reports:
> 
> $ mpirun *--bind-to-core* --report-bindings -np 2 ./all2all_ompi1.5.5
> [n043:61313] [[56788,0],0] odls:default:fork binding child [[56788,1],1] to 
> cpus 0002
> [n043:61313] [[56788,0],0] odls:default:fork binding child [[56788,1],0] to 
> cpus 0001
> latency: 1.415us
> 
> $ mpirun *-mca maffinity hwloc --bind-to-core* --report-bindings -np 2 
> ./all2all_ompi1.5.5
> [n043:61469] [[49736,0],0] odls:default:fork binding child [[49736,1],1] to 
> cpus 0002
> [n043:61469] [[49736,0],0] odls:default:fork binding child [[49736,1],0] to 
> cpus 0001
> latency: 1.4us
> 
> $ mpirun *-mca maffinity first_use --bind-to-core* --report-bindings -np 2 
> ./all2all_ompi1.5.5
> [n043:61508] [[49681,0],0] odls:default:fork binding child [[49681,1],1] to 
> cpus 0002
> [n043:61508] [[49681,0],0] odls:default:fork binding child [[49681,1],0] to 
> cpus 0001
> latency: 1.4us
> 
> 
> $ mpirun *--bind-to-socket* --report-bindings -np 2 ./all2all_ompi1.5.5
> [n043:61337] [[56780,0],0] odls:default:fork binding child [[56780,1],1] to 
> socket 0 cpus 0001
> [n043:61337] [[56780,0],0] odls:default:fork binding child [[56780,1],0] to 
> socket 0 cpus 0001
> latency: 4.0us
> 
> $ mpirun *-mca maffinity hwloc --bind-to-socket* --report-bindings -np 2 
> ./all2all_ompi1.5.5 
> [n043:61615] [[49914,0],0] odls:default:fork binding child [[49914,1],1] to 
> socket 0 cpus 0001
> [n043:61615] [[49914,0],0] odls:default:fork binding child [[49914,1],0] to 
> socket 0 cpus 0001
> latency: 4.0us
> 
> $ mpirun *-mca maffinity first_use --bind-to-socket* --report-bindings -np 2 
> ./all2all_ompi1.5.5 
> [n043:61639] [[49810,0],0] odls:default:fork binding child [[49810,1],1] to 
> socket 0 cpus 0001
> [n043:61639] [[49810,0],0] odls:default:fork binding child [[49810,1],0] to 
> socket 0 cpus 0001
> latency: 4.0us
> 
> 
> > If socket-binding is enabled, it seems that all ranks are bound to the very
> > first core of one and the same socket. Is this intended? I expected that
> > each rank gets its own socket (i.e. 2 ranks -> 2 sockets)...
> 
> Matthias
> 
> On Monday 13 February 2012 22:36:50 Jeff Squyres wrote:
>> Also, double check that you have an optimized build, not a debugging build.
>> 
>> SVN and HG checkouts default to debugging builds, which add in lots of
>> latency.
>> 
>> On Feb 13, 2012, at 10:22 AM, Ralph Castain wrote:
>>> Few thoughts
>>> 
>>> 1. Bind to socket is broken in 1.5.4 - fixed in next release
>>> 
>>> 2. Add --report-bindings to cmd line and see where it thinks the procs
>>> are bound
>>> 
>>> 3. Sounds like memory may not be local - might be worth checking mem
>>> binding.
>>> 
>>> Sent from my iPad
>>> 
>>> On Feb 13, 2012, at 7:07 AM, Matthias Jurenz  dresden.de> wrote:
 Hi Sylvain,
 
 thanks for the quick response!
 
 Here some results with enabled process binding. I hope I used the
 parameters correctly...
 
 bind two ranks to one socket:
 $ mpirun -np 2 --bind-to-core ./all2all
 $ mpirun -np 2 -mca mpi_paffinity_alone 1 ./all2all
 
 bind two ranks to two different sockets:
 $ mpirun -np 2 --bind-to-socket ./all2all
 
 All three runs resulted in similar bad latencies (~1.4us).
 
 :-(
 
 Matthias
 
 On Monday 13 February 2012 12:43:22 sylvain.jeau...@bull.net wrote:
> Hi Matthias,
> 
> You might want to play with process binding to see if your problem is
> related to bad memory affinity.
> 
> Try to launch pingpong on two CPUs of the same socket, then on
> different sockets (i.e. bind each process to a core, and try different
> configurations).
> 
> Sylvain
> 
> 
> 
> De :Matthias Jurenz 
> A : Open MPI Developers 
> Date :  13/02/2012 12:12
> Objet : [OMPI devel] poor btl sm latency
> Envoyé par :devel-boun...@open-mpi.org
> 
> 
> 
> Hello all,
> 
> on our new AMD cluster (AMD Opteron 6274, 2,2GHz) we get very bad
> latencies
> (~1.5us) when performing 0-byte p2p communication on one single node
> using the
> Open MPI sm BTL. When using Platform MPI we get ~0.5us latencies which
> is pretty good. The bandwidth results are similar for both MPI
> implementations
> (~3,3GB/s) - this is okay.
> 
> One node has 64 cores and 64Gb RAM where it doesn't matter how many
> ranks allocated by the application. We get similar results with
> different number of
> 

Re: [OMPI devel] poor btl sm latency

2012-02-14 Thread Matthias Jurenz
I've built Open MPI 1.5.5rc1 (tarball from the web) with CFLAGS=-O3. 
Unfortunately, this also had no effect.

Here are some results with binding reports enabled:

$ mpirun *--bind-to-core* --report-bindings -np 2 ./all2all_ompi1.5.5
[n043:61313] [[56788,0],0] odls:default:fork binding child [[56788,1],1] to 
cpus 0002
[n043:61313] [[56788,0],0] odls:default:fork binding child [[56788,1],0] to 
cpus 0001
latency: 1.415us

$ mpirun *-mca maffinity hwloc --bind-to-core* --report-bindings -np 2 
./all2all_ompi1.5.5
[n043:61469] [[49736,0],0] odls:default:fork binding child [[49736,1],1] to 
cpus 0002
[n043:61469] [[49736,0],0] odls:default:fork binding child [[49736,1],0] to 
cpus 0001
latency: 1.4us

$ mpirun *-mca maffinity first_use --bind-to-core* --report-bindings -np 2 
./all2all_ompi1.5.5
[n043:61508] [[49681,0],0] odls:default:fork binding child [[49681,1],1] to 
cpus 0002
[n043:61508] [[49681,0],0] odls:default:fork binding child [[49681,1],0] to 
cpus 0001
latency: 1.4us


$ mpirun *--bind-to-socket* --report-bindings -np 2 ./all2all_ompi1.5.5
[n043:61337] [[56780,0],0] odls:default:fork binding child [[56780,1],1] to 
socket 0 cpus 0001
[n043:61337] [[56780,0],0] odls:default:fork binding child [[56780,1],0] to 
socket 0 cpus 0001
latency: 4.0us

$ mpirun *-mca maffinity hwloc --bind-to-socket* --report-bindings -np 2 
./all2all_ompi1.5.5 
[n043:61615] [[49914,0],0] odls:default:fork binding child [[49914,1],1] to 
socket 0 cpus 0001
[n043:61615] [[49914,0],0] odls:default:fork binding child [[49914,1],0] to 
socket 0 cpus 0001
latency: 4.0us

$ mpirun *-mca maffinity first_use --bind-to-socket* --report-bindings -np 2 
./all2all_ompi1.5.5 
[n043:61639] [[49810,0],0] odls:default:fork binding child [[49810,1],1] to 
socket 0 cpus 0001
[n043:61639] [[49810,0],0] odls:default:fork binding child [[49810,1],0] to 
socket 0 cpus 0001
latency: 4.0us


If socket-binding is enabled, it seems that all ranks are bound to the very first 
core of one and the same socket. Is this intended? I expected that each rank 
gets its own socket (i.e. 2 ranks -> 2 sockets)...

Matthias

On Monday 13 February 2012 22:36:50 Jeff Squyres wrote:
> Also, double check that you have an optimized build, not a debugging build.
> 
> SVN and HG checkouts default to debugging builds, which add in lots of
> latency.
> 
> On Feb 13, 2012, at 10:22 AM, Ralph Castain wrote:
> > Few thoughts
> > 
> > 1. Bind to socket is broken in 1.5.4 - fixed in next release
> > 
> > 2. Add --report-bindings to cmd line and see where it thinks the procs
> > are bound
> > 
> > 3. Sounds like memory may not be local - might be worth checking mem
> > binding.
> > 
> > Sent from my iPad
> > 
> > On Feb 13, 2012, at 7:07 AM, Matthias Jurenz  wrote:
> >> Hi Sylvain,
> >> 
> >> thanks for the quick response!
> >> 
> >> Here some results with enabled process binding. I hope I used the
> >> parameters correctly...
> >> 
> >> bind two ranks to one socket:
> >> $ mpirun -np 2 --bind-to-core ./all2all
> >> $ mpirun -np 2 -mca mpi_paffinity_alone 1 ./all2all
> >> 
> >> bind two ranks to two different sockets:
> >> $ mpirun -np 2 --bind-to-socket ./all2all
> >> 
> >> All three runs resulted in similar bad latencies (~1.4us).
> >> 
> >> :-(
> >> 
> >> Matthias
> >> 
> >> On Monday 13 February 2012 12:43:22 sylvain.jeau...@bull.net wrote:
> >>> Hi Matthias,
> >>> 
> >>> You might want to play with process binding to see if your problem is
> >>> related to bad memory affinity.
> >>> 
> >>> Try to launch pingpong on two CPUs of the same socket, then on
> >>> different sockets (i.e. bind each process to a core, and try different
> >>> configurations).
> >>> 
> >>> Sylvain
> >>> 
> >>> 
> >>> 
> >>> De :Matthias Jurenz 
> >>> A : Open MPI Developers 
> >>> Date :  13/02/2012 12:12
> >>> Objet : [OMPI devel] poor btl sm latency
> >>> Envoyé par :devel-boun...@open-mpi.org
> >>> 
> >>> 
> >>> 
> >>> Hello all,
> >>> 
> >>> on our new AMD cluster (AMD Opteron 6274, 2,2GHz) we get very bad
> >>> latencies
> >>> (~1.5us) when performing 0-byte p2p communication on one single node
> >>> using the
> >>> Open MPI sm BTL. When using Platform MPI we get ~0.5us latencies which
> >>> is pretty good. The bandwidth results are similar for both MPI
> >>> implementations
> >>> (~3,3GB/s) - this is okay.
> >>> 
> >>> One node has 64 cores and 64Gb RAM where it doesn't matter how many
> >>> ranks allocated by the application. We get similar results with
> >>> different number of
> >>> ranks.
> >>> 
> >>> We are using Open MPI 1.5.4 which is built by gcc 4.3.4 without any
> >>> special
> >>> configure options except the installation prefix and the location of
> >>> the LSF
> >>> stuff.
> >>> 
> >>> As mentioned at http://www.open-mpi.org/faq/?category=sm we tried to
> >>> use /dev/shm instead of /tmp for the session directory, but it had no
> >>> effect. Furthermore, we tried the current release candidate 1.5.5rc1
> >>> of Open MPI which
> >>> provides an option to use the SysV shared memor

Re: [OMPI devel] poor btl sm latency

2012-02-13 Thread Jeff Squyres
Also, double check that you have an optimized build, not a debugging build.

SVN and HG checkouts default to debugging builds, which add in lots of latency.
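
(One way to verify this, sketched below - the exact field names printed by
ompi_info may differ between versions:)

$ ompi_info --all | grep -i -E 'debug|cflags'
# "Internal debug support: no" and no -g/-O0 in the build CFLAGS indicate an optimized build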


On Feb 13, 2012, at 10:22 AM, Ralph Castain wrote:

> Few thoughts
> 
> 1. Bind to socket is broken in 1.5.4 - fixed in next release
> 
> 2. Add --report-bindings to cmd line and see where it thinks the procs are 
> bound
> 
> 3. Sounds like memory may not be local - might be worth checking mem binding.
> 
> Sent from my iPad
> 
> On Feb 13, 2012, at 7:07 AM, Matthias Jurenz  
> wrote:
> 
>> Hi Sylvain,
>> 
>> thanks for the quick response!
>> 
>> Here some results with enabled process binding. I hope I used the parameters 
>> correctly...
>> 
>> bind two ranks to one socket:
>> $ mpirun -np 2 --bind-to-core ./all2all
>> $ mpirun -np 2 -mca mpi_paffinity_alone 1 ./all2all
>> 
>> bind two ranks to two different sockets:
>> $ mpirun -np 2 --bind-to-socket ./all2all
>> 
>> All three runs resulted in similar bad latencies (~1.4us).
>> :-(
>> 
>> 
>> Matthias
>> 
>> On Monday 13 February 2012 12:43:22 sylvain.jeau...@bull.net wrote:
>>> Hi Matthias,
>>> 
>>> You might want to play with process binding to see if your problem is
>>> related to bad memory affinity.
>>> 
>>> Try to launch pingpong on two CPUs of the same socket, then on different
>>> sockets (i.e. bind each process to a core, and try different
>>> configurations).
>>> 
>>> Sylvain
>>> 
>>> 
>>> 
>>> De :Matthias Jurenz 
>>> A : Open MPI Developers 
>>> Date :  13/02/2012 12:12
>>> Objet : [OMPI devel] poor btl sm latency
>>> Envoyé par :devel-boun...@open-mpi.org
>>> 
>>> 
>>> 
>>> Hello all,
>>> 
>>> on our new AMD cluster (AMD Opteron 6274, 2,2GHz) we get very bad
>>> latencies
>>> (~1.5us) when performing 0-byte p2p communication on one single node using
>>> the
>>> Open MPI sm BTL. When using Platform MPI we get ~0.5us latencies which is
>>> pretty good. The bandwidth results are similar for both MPI
>>> implementations
>>> (~3,3GB/s) - this is okay.
>>> 
>>> One node has 64 cores and 64Gb RAM where it doesn't matter how many ranks
>>> allocated by the application. We get similar results with different number
>>> of
>>> ranks.
>>> 
>>> We are using Open MPI 1.5.4 which is built by gcc 4.3.4 without any
>>> special
>>> configure options except the installation prefix and the location of the
>>> LSF
>>> stuff.
>>> 
>>> As mentioned at http://www.open-mpi.org/faq/?category=sm we tried to use
>>> /dev/shm instead of /tmp for the session directory, but it had no effect.
>>> Furthermore, we tried the current release candidate 1.5.5rc1 of Open MPI
>>> which
>>> provides an option to use the SysV shared memory (-mca shmem sysv) - also
>>> this
>>> results in similar poor latencies.
>>> 
>>> Do you have any idea? Please help!
>>> 
>>> Thanks,
>>> Matthias
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] poor btl sm latency

2012-02-13 Thread Ralph Castain
Few thoughts

1. Bind to socket is broken in 1.5.4 - fixed in next release

2. Add --report-bindings to cmd line and see where it thinks the procs are bound

3. Sounds like memory may not be local - might be worth checking mem binding.

Sent from my iPad

On Feb 13, 2012, at 7:07 AM, Matthias Jurenz  
wrote:

> Hi Sylvain,
> 
> thanks for the quick response!
> 
> Here some results with enabled process binding. I hope I used the parameters 
> correctly...
> 
> bind two ranks to one socket:
> $ mpirun -np 2 --bind-to-core ./all2all
> $ mpirun -np 2 -mca mpi_paffinity_alone 1 ./all2all
> 
> bind two ranks to two different sockets:
> $ mpirun -np 2 --bind-to-socket ./all2all
> 
> All three runs resulted in similar bad latencies (~1.4us).
> :-(
> 
> 
> Matthias
> 
> On Monday 13 February 2012 12:43:22 sylvain.jeau...@bull.net wrote:
>> Hi Matthias,
>> 
>> You might want to play with process binding to see if your problem is
>> related to bad memory affinity.
>> 
>> Try to launch pingpong on two CPUs of the same socket, then on different
>> sockets (i.e. bind each process to a core, and try different
>> configurations).
>> 
>> Sylvain
>> 
>> 
>> 
>> De :Matthias Jurenz 
>> A : Open MPI Developers 
>> Date :  13/02/2012 12:12
>> Objet : [OMPI devel] poor btl sm latency
>> Envoyé par :devel-boun...@open-mpi.org
>> 
>> 
>> 
>> Hello all,
>> 
>> on our new AMD cluster (AMD Opteron 6274, 2,2GHz) we get very bad
>> latencies
>> (~1.5us) when performing 0-byte p2p communication on one single node using
>> the
>> Open MPI sm BTL. When using Platform MPI we get ~0.5us latencies which is
>> pretty good. The bandwidth results are similar for both MPI
>> implementations
>> (~3,3GB/s) - this is okay.
>> 
>> One node has 64 cores and 64Gb RAM where it doesn't matter how many ranks
>> allocated by the application. We get similar results with different number
>> of
>> ranks.
>> 
>> We are using Open MPI 1.5.4 which is built by gcc 4.3.4 without any
>> special
>> configure options except the installation prefix and the location of the
>> LSF
>> stuff.
>> 
>> As mentioned at http://www.open-mpi.org/faq/?category=sm we tried to use
>> /dev/shm instead of /tmp for the session directory, but it had no effect.
>> Furthermore, we tried the current release candidate 1.5.5rc1 of Open MPI
>> which
>> provides an option to use the SysV shared memory (-mca shmem sysv) - also
>> this
>> results in similar poor latencies.
>> 
>> Do you have any idea? Please help!
>> 
>> Thanks,
>> Matthias
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] poor btl sm latency

2012-02-13 Thread Matthias Jurenz
Hi Sylvain,

thanks for the quick response!

Here are some results with process binding enabled. I hope I used the parameters 
correctly...

bind two ranks to one socket:
$ mpirun -np 2 --bind-to-core ./all2all
$ mpirun -np 2 -mca mpi_paffinity_alone 1 ./all2all

bind two ranks to two different sockets:
$ mpirun -np 2 --bind-to-socket ./all2all

All three runs resulted in similar bad latencies (~1.4us).
:-(


Matthias

On Monday 13 February 2012 12:43:22 sylvain.jeau...@bull.net wrote:
> Hi Matthias,
> 
> You might want to play with process binding to see if your problem is
> related to bad memory affinity.
> 
> Try to launch pingpong on two CPUs of the same socket, then on different
> sockets (i.e. bind each process to a core, and try different
> configurations).
> 
> Sylvain
> 
> 
> 
> De :Matthias Jurenz 
> A : Open MPI Developers 
> Date :  13/02/2012 12:12
> Objet : [OMPI devel] poor btl sm latency
> Envoyé par :devel-boun...@open-mpi.org
> 
> 
> 
> Hello all,
> 
> on our new AMD cluster (AMD Opteron 6274, 2,2GHz) we get very bad
> latencies
> (~1.5us) when performing 0-byte p2p communication on one single node using
> the
> Open MPI sm BTL. When using Platform MPI we get ~0.5us latencies which is
> pretty good. The bandwidth results are similar for both MPI
> implementations
> (~3,3GB/s) - this is okay.
> 
> One node has 64 cores and 64Gb RAM where it doesn't matter how many ranks
> allocated by the application. We get similar results with different number
> of
> ranks.
> 
> We are using Open MPI 1.5.4 which is built by gcc 4.3.4 without any
> special
> configure options except the installation prefix and the location of the
> LSF
> stuff.
> 
> As mentioned at http://www.open-mpi.org/faq/?category=sm we tried to use
> /dev/shm instead of /tmp for the session directory, but it had no effect.
> Furthermore, we tried the current release candidate 1.5.5rc1 of Open MPI
> which
> provides an option to use the SysV shared memory (-mca shmem sysv) - also
> this
> results in similar poor latencies.
> 
> Do you have any idea? Please help!
> 
> Thanks,
> Matthias
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] poor btl sm latency

2012-02-13 Thread sylvain . jeaugey
Hi Matthias,

You might want to play with process binding to see if your problem is 
related to bad memory affinity.

Try to launch pingpong on two CPUs of the same socket, then on different 
sockets (i.e. bind each process to a core, and try different 
configurations).
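
For example, something like the following (a sketch using hwloc-bind; the core
numbering is assumed to match the lstopo output appearing elsewhere in this
thread):

# two cores of the same Bulldozer module (shared L2/L1i)
$ mpirun -np 1 hwloc-bind core:0 ./all2all : -np 1 hwloc-bind core:1 ./all2all
# two cores of the same socket, but different modules
$ mpirun -np 1 hwloc-bind core:0 ./all2all : -np 1 hwloc-bind core:2 ./all2all
# two different sockets
$ mpirun -np 1 hwloc-bind socket:0.core:0 ./all2all : -np 1 hwloc-bind socket:1.core:0 ./all2all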

Sylvain



From: Matthias Jurenz 
To: Open MPI Developers 
Date: 13/02/2012 12:12
Subject: [OMPI devel] poor btl sm latency
Sent by: devel-boun...@open-mpi.org



Hello all,

on our new AMD cluster (AMD Opteron 6274, 2.2GHz) we get very bad latencies
(~1.5us) when performing 0-byte p2p communication on a single node using the
Open MPI sm BTL. When using Platform MPI we get ~0.5us latencies, which is
pretty good. The bandwidth results are similar for both MPI implementations
(~3.3GB/s) - this is okay.

One node has 64 cores and 64 GB RAM; it doesn't matter how many ranks are
allocated by the application - we get similar results with different numbers
of ranks.

We are using Open MPI 1.5.4, built with gcc 4.3.4 without any special
configure options except the installation prefix and the location of the
LSF stuff.

As mentioned at http://www.open-mpi.org/faq/?category=sm we tried to use
/dev/shm instead of /tmp for the session directory, but it had no effect.
Furthermore, we tried the current release candidate 1.5.5rc1 of Open MPI,
which provides an option to use SysV shared memory (-mca shmem sysv) - this
also results in similarly poor latencies.
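
(For reference, the variants mentioned above can be selected explicitly at run
time - a sketch; orte_tmpdir_base is assumed to be the right knob for
relocating the session directory in this 1.5.x series:)

$ mpirun -np 2 --mca shmem mmap --mca orte_tmpdir_base /dev/shm ./all2all
$ mpirun -np 2 --mca shmem sysv ./all2all
$ mpirun -np 2 --mca shmem posix ./all2all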

Do you have any idea? Please help!

Thanks,
Matthias
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel