Re: [OMPI devel] poor btl sm latency

Matthias Jurenz Thu, 15 Mar 2012 11:06:38 -0400

We made a big step forward today!

The kernel we are using has a bug related to the shared L1 instruction cache
of AMD Bulldozer processors.
See
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=dfb09f9b7ab03fd367740e541a5caf830ed56726
and
http://developer.amd.com/Assets/SharedL1InstructionCacheonAMD15hCPU.pdf


Until the kernel is patched, we disable address-space layout randomization
(ASLR) as described in the PDF above:

   $ sudo /sbin/sysctl -w kernel.randomize_va_space=0
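
As a rough sketch (not taken from the AMD document), the current ASLR setting
can be recorded before it is changed, so that the previous value can be
restored once a patched kernel is installed. The helper name describe_aslr is
made up for illustration; the meaning of the values 0, 1 and 2 follows the
kernel's sysctl documentation.

```shell
# Sketch: record and describe the current ASLR setting.
# describe_aslr is a hypothetical helper name used only for illustration.
describe_aslr() {
    case "$1" in
        0) echo "disabled" ;;
        1) echo "stack, mmap base and vDSO randomized" ;;
        2) echo "as 1, plus heap (brk) randomized" ;;
        *) echo "unknown" ;;
    esac
}

aslr_file=/proc/sys/kernel/randomize_va_space
if [ -r "$aslr_file" ]; then
    # Remember the value that was active before any change.
    previous=$(cat "$aslr_file")
    echo "kernel.randomize_va_space=$previous ($(describe_aslr "$previous"))"
    echo "restore later with: sudo /sbin/sysctl -w kernel.randomize_va_space=$previous"
fi
```

If changing the system-wide setting is not desired, setarch -R (util-linux)
turns off address randomization for a single command only. Whether that is
sufficient for the cache issue described above has not been verified here.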

With ASLR disabled, NetPIPE reports ~0.5us latency when the processes are
bound so that they share the L2/L1I caches (i.e. -bind-to-core).

However, when binding the processes for exclusive L2/L1I caches (i.e.
-cpus-per-proc 2) we still get ~1.1us latency. I don't think the upcoming
kernel patch will help with this kind of process binding...
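
The two bindings compared above could be scripted roughly as follows. This is
only a sketch: the NetPIPE binary name and its options are taken from the
quoted runs further down, and the two binding flags are the ones named above.
Everything else is an assumption, including the helper names run_case and
netpipe_latency, the use of -np 2 on a single mpirun, and the merging of
stderr into stdout.

```shell
# Sketch only: compare the shared-cache and exclusive-cache bindings.
# run_case and netpipe_latency are hypothetical helper names.

NETPIPE="./NPmpi_ompi1.5.5 -u 4 -n 100000 -p 0"

# Reduce NetPIPE result lines such as
#   0:       1 bytes 100000 times -->      6.01 Mbps in       1.27 usec
# to "<bytes> <usec>".
netpipe_latency() {
    awk '$3 == "bytes" && $NF == "usec" { print $2, $(NF-1) }'
}

# Print the command for one binding, then run it if mpirun and the
# benchmark binary are available.
run_case() {
    label=$1
    shift
    echo "== $label: mpirun -mca btl sm,self -np 2 $* $NETPIPE"
    if command -v mpirun >/dev/null 2>&1 && [ -x ./NPmpi_ompi1.5.5 ]; then
        # $NETPIPE is intentionally unquoted so it splits into arguments.
        # NetPIPE may print its result lines on stderr, so merge both streams.
        mpirun -mca btl sm,self -np 2 "$@" $NETPIPE 2>&1 | netpipe_latency
    fi
}

run_case "shared L2/L1I" -bind-to-core
run_case "exclusive L2/L1I" -cpus-per-proc 2
```

Printing only the message size and the latency column makes the two cases easy
to compare side by side.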

Matthias

On Monday 12 March 2012 11:09:01 Matthias Jurenz wrote:
> It's a SUSE Linux Enterprise Server 11 Service Pack 1 with kernel version
> 2.6.32.49-0.3-default.
> 
> Matthias
> 
> On Friday 09 March 2012 16:36:41 you wrote:
> > What OS are you using ?
> > 
> > Joshua
> > 
> > ----- Original Message -----
> > From: Matthias Jurenz [mailto:matthias.jur...@tu-dresden.de]
> > Sent: Friday, March 09, 2012 08:50 AM
> > To: Open MPI Developers <de...@open-mpi.org>
> > Cc: Mora, Joshua
> > Subject: Re: [OMPI devel] poor btl sm latency
> > 
> > I just made an interesting observation:
> > 
> > When binding the processes to two neighboring cores (L2 sharing), NetPIPE
> > *sometimes* shows pretty good results: ~0.5us
> > 
> > $ mpirun -mca btl sm,self \
> >     -np 1 hwloc-bind -v core:0 ./NPmpi_ompi1.5.5 -u 4 -n 100000 -p 0 : \
> >     -np 1 hwloc-bind -v core:1 ./NPmpi_ompi1.5.5 -u 4 -n 100000 -p 0
> > using object #0 depth 6 below cpuset 0xffffffff,0xffffffff
> > using object #1 depth 6 below cpuset 0xffffffff,0xffffffff
> > adding 0x00000001 to 0x0
> > adding 0x00000001 to 0x0
> > assuming the command starts at ./NPmpi_ompi1.5.5
> > binding on cpu set 0x00000001
> > adding 0x00000002 to 0x0
> > adding 0x00000002 to 0x0
> > assuming the command starts at ./NPmpi_ompi1.5.5
> > binding on cpu set 0x00000002
> > Using no perturbations
> > 
> > 0: n035
> > Using no perturbations
> > 
> > 1: n035
> > Now starting the main loop
> > 
> >   0:       1 bytes 100000 times -->      6.01 Mbps in       1.27 usec
> >   1:       2 bytes 100000 times -->     12.04 Mbps in       1.27 usec
> >   2:       3 bytes 100000 times -->     18.07 Mbps in       1.27 usec
> >   3:       4 bytes 100000 times -->     24.13 Mbps in       1.26 usec
> > 
> > $ mpirun -mca btl sm,self \
> >     -np 1 hwloc-bind -v core:0 ./NPmpi_ompi1.5.5 -u 4 -n 100000 -p 0 : \
> >     -np 1 hwloc-bind -v core:1 ./NPmpi_ompi1.5.5 -u 4 -n 100000 -p 0
> > using object #0 depth 6 below cpuset 0xffffffff,0xffffffff
> > adding 0x00000001 to 0x0
> > adding 0x00000001 to 0x0
> > assuming the command starts at ./NPmpi_ompi1.5.5
> > binding on cpu set 0x00000001
> > using object #1 depth 6 below cpuset 0xffffffff,0xffffffff
> > adding 0x00000002 to 0x0
> > adding 0x00000002 to 0x0
> > assuming the command starts at ./NPmpi_ompi1.5.5
> > binding on cpu set 0x00000002
> > Using no perturbations
> > 
> > 0: n035
> > Using no perturbations
> > 
> > 1: n035
> > Now starting the main loop
> > 
> >   0:       1 bytes 100000 times -->     12.96 Mbps in       0.59 usec
> >   1:       2 bytes 100000 times -->     25.78 Mbps in       0.59 usec
> >   2:       3 bytes 100000 times -->     38.62 Mbps in       0.59 usec
> >   3:       4 bytes 100000 times -->     52.88 Mbps in       0.58 usec
> > 
> > I can reproduce the good result in approximately every tenth run.
> > 
> > When binding the processes for exclusive L2 caches (e.g. cores 0 and 2), I
> > get a constant latency of ~1.1us.
> > 
> > Matthias
> > 
> > On Monday 05 March 2012 09:52:39 Matthias Jurenz wrote:
> > > Here are the SM BTL parameters:
> > > 
> > > $ ompi_info --param btl sm
> > > MCA btl: parameter "btl_base_verbose" (current value: <0>, data source:
> > > default value) Verbosity level of the BTL framework
> > > MCA btl: parameter "btl" (current value: <self,sm,openib>, data source:
> > > file [/sw/atlas/libraries/openmpi/1.5.5rc3/x86_64/etc/openmpi-mca-params.conf])
> > > Default selection set of components for the btl framework (<none>
> > > means use all components that can be found)
> > > MCA btl: information "btl_sm_have_knem_support" (value: <1>, data
> > > source: default value) Whether this component supports the knem Linux
> > > kernel module or not
> > > MCA btl: parameter "btl_sm_use_knem" (current value: <-1>, data source:
> > > default value) Whether knem support is desired or not (negative = try
> > > to enable knem support, but continue even if it is not available, 0 =
> > > do not enable knem support, positive = try to enable knem support and
> > > fail if it is not available)
> > > MCA btl: parameter "btl_sm_knem_dma_min" (current value: <0>, data
> > > source: default value) Minimum message size (in bytes) to use the knem
> > > DMA mode; ignored if knem does not support DMA mode (0 = do not use the
> > > knem DMA mode)
> > > MCA btl: parameter "btl_sm_knem_max_simultaneous"
> > > (current value: <0>, data source: default value) Max number of
> > > simultaneous ongoing knem operations to support (0 = do everything
> > > synchronously, which probably gives the best large message latency; >0
> > > means to do all operations asynchronously, which supports better
> > > overlap for simultaneous large message sends)
> > > MCA btl: parameter "btl_sm_free_list_num" (current value: <8>, data
> > > source: default value)
> > > MCA btl: parameter "btl_sm_free_list_max" (current value: <-1>, data
> > > source: default value)
> > > MCA btl: parameter "btl_sm_free_list_inc" (current value: <64>, data
> > > source: default value)
> > > MCA btl: parameter "btl_sm_max_procs" (current value: <-1>, data
> > > source: default value)
> > > MCA btl: parameter "btl_sm_mpool" (current value: <sm>, data source:
> > > default value)
> > > MCA btl: parameter "btl_sm_fifo_size" (current value: <4096>, data
> > > source: default value)
> > > MCA btl: parameter "btl_sm_num_fifos" (current value: <1>, data source:
> > > default value)
> > > MCA btl: parameter "btl_sm_fifo_lazy_free" (current value: <120>, data
> > > source: default value)
> > > MCA btl: parameter "btl_sm_sm_extra_procs" (current value: <0>, data
> > > source: default value)
> > > MCA btl: parameter "btl_sm_exclusivity" (current value: <65535>, data
> > > source: default value) BTL exclusivity (must be >= 0)
> > > MCA btl: parameter "btl_sm_flags" (current value: <5>, data source:
> > > default value) BTL bit flags (general flags: SEND=1, PUT=2, GET=4,
> > > SEND_INPLACE=8, RDMA_MATCHED=64, HETEROGENEOUS_RDMA=256; flags only
> > > used by the "dr" PML (ignored by others): ACK=16, CHECKSUM=32,
> > > RDMA_COMPLETION=128; flags only used by the "bfo" PML (ignored by
> > > others): FAILOVER_SUPPORT=512)
> > > MCA btl: parameter "btl_sm_rndv_eager_limit" (current value: <4096>,
> > > data source: default
> > > value) Size (in bytes) of "phase 1" fragment sent for all large
> > > messages (must be >= 0 and <= eager_limit)
> > > MCA btl: parameter "btl_sm_eager_limit" (current value: <4096>, data
> > > source: default value) Maximum size (in bytes) of "short" messages
> > > (must be >= 1).
> > > MCA btl: parameter "btl_sm_max_send_size" (current
> > > value: <32768>, data source: default value) Maximum size (in bytes) of
> > > a single "phase 2" fragment of a long message when using the pipeline
> > > protocol (must be >= 1)
> > > MCA btl: parameter "btl_sm_bandwidth" (current value: <9000>, data
> > > source: default value) Approximate maximum bandwidth of interconnect(0
> > > = auto-detect value at run-time [not supported in all BTL modules], >=
> > > 1 = bandwidth in Mbps)
> > > MCA btl: parameter "btl_sm_latency" (current value: <1>, data source:
> > > default value) Approximate latency of interconnect (must be >= 0)
> > > MCA btl: parameter "btl_sm_priority" (current value: <0>, data source:
> > > default value)
> > > MCA btl: parameter "btl_base_warn_component_unused" (current value:
> > > <1>, data source: default value) This parameter is used to turn on
> > > warning messages when certain NICs are not used
> > > 
> > > Matthias
> > > 
> > > On Friday 02 March 2012 16:23:32 George Bosilca wrote:
> > > > Please run "ompi_info --param btl sm" in your environment. The
> > > > lazy_free parameter directs the internals of the SM BTL not to
> > > > release the memory fragments used to communicate until the lazy
> > > > limit is reached. The default value was deemed reasonable a while
> > > > back, when the default number of fragments was large. Lately there
> > > > were some patches to reduce the memory footprint of the SM BTL, and
> > > > these might have lowered the available fragments to a point where
> > > > the default value for lazy_free is now too large.
> > > > 
> > > >   george.
> > > > 
> > > > On Mar 2, 2012, at 10:08 , Matthias Jurenz wrote:
> > > > > Thanks to the OTPO tool, I figured out that setting the MCA
> > > > > parameter btl_sm_fifo_lazy_free to 1 (default is 120) improves the
> > > > > latency significantly: 0.88µs
> > > > > 
> > > > > But somehow I get the feeling that this doesn't eliminate the
> > > > > actual problem...
> > > > > 
> > > > > Matthias
> > > > > 
> > > > > On Friday 02 March 2012 15:37:03 Matthias Jurenz wrote:
> > > > >> On Friday 02 March 2012 14:58:45 Jeffrey Squyres wrote:
> > > > >>> Ok.  Good that there's no oversubscription bug, at least.  :-)
> > > > >>> 
> > > > >>> Did you see my off-list mail to you yesterday about building with
> > > > >>> an external copy of hwloc 1.4 to see if that helps?
> > > > >> 
> > > > > >> Yes, I did - I answered as well. Our mail server seems to be
> > > > > >> somewhat busy today...
> > > > >> 
> > > > >> Just for the record: Using hwloc-1.4 makes no difference.
> > > > >> 
> > > > >> Matthias
> > > > >> 
> > > > >>> On Mar 2, 2012, at 8:26 AM, Matthias Jurenz wrote:
> > > > >>>> To exclude a possible bug within the LSF component, I rebuilt
> > > > >>>> Open MPI without support for LSF (--without-lsf).
> > > > >>>> 
> > > > >>>> -> It makes no difference - the latency is still bad: ~1.1us.
> > > > >>>> 
> > > > >>>> Matthias
> > > > >>>> 
> > > > >>>> On Friday 02 March 2012 13:50:13 Matthias Jurenz wrote:
> > > > > >>>>> SORRY, that was obviously a big mistake on my part. :-(
> > > > >>>>> 
> > > > > >>>>> Open MPI 1.5.5 was built with LSF support, so when starting an
> > > > > >>>>> LSF job it's necessary to request at least as many tasks/cores
> > > > > >>>>> as the subsequent mpirun command uses. That was not the case -
> > > > > >>>>> I forgot bsub's '-n' option to specify the number of tasks, so
> > > > > >>>>> only *one* task/core was requested.
> > > > >>>>> 
> > > > >>>>> Open MPI 1.4.5 was built *without* LSF support, so the supposed
> > > > >>>>> misbehavior could not happen with it.
> > > > >>>>> 
> > > > > >>>>> In short, there is no bug in Open MPI 1.5.x regarding the
> > > > > >>>>> detection of oversubscription. Sorry for any confusion!
> > > > >>>>> 
> > > > >>>>> Matthias
> > > > >>>>> 
> > > > >>>>> On Tuesday 28 February 2012 13:36:56 Matthias Jurenz wrote:
> > > > >>>>>> When using Open MPI v1.4.5 I get ~1.1us. That's the same
> > > > >>>>>> result as I get with Open MPI v1.5.x using
> > > > >>>>>> mpi_yield_when_idle=0. So I think there is a bug in Open MPI
> > > > > >>>>>> (v1.5.4 and v1.5.5rc2) regarding the automatic performance
> > > > >>>>>> mode selection.
> > > > >>>>>> 
> > > > >>>>>> When enabling the degraded performance mode for Open MPI 1.4.5
> > > > >>>>>> (mpi_yield_when_idle=1) I get ~1.8us latencies.
> > > > >>>>>> 
> > > > >>>>>> Matthias
> > > > >>>>>> 
> > > > >>>>>> On Tuesday 28 February 2012 06:20:28 Christopher Samuel wrote:
> > > > >>>>>>> On 13/02/12 22:11, Matthias Jurenz wrote:
> > > > >>>>>>>> Do you have any idea? Please help!
> > > > >>>>>>> 
> > > > >>>>>>> Do you see the same bad latency in the old branch (1.4.5) ?
> > > > >>>>>>> 
> > > > >>>>>>> cheers,
> > > > >>>>>>> Chris
> > > > >>>>>> 
> > > > >>>>>> _______________________________________________
> > > > >>>>>> devel mailing list
> > > > >>>>>> de...@open-mpi.org
> > > > >>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > > > >>>>> 
> > > > >>>>> _______________________________________________
> > > > >>>>> devel mailing list
> > > > >>>>> de...@open-mpi.org
> > > > >>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > > > >>>> 
> > > > >>>> _______________________________________________
> > > > >>>> devel mailing list
> > > > >>>> de...@open-mpi.org
> > > > >>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > > > >> 
> > > > >> _______________________________________________
> > > > >> devel mailing list
> > > > >> de...@open-mpi.org
> > > > >> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > > > > 
> > > > > _______________________________________________
> > > > > devel mailing list
> > > > > de...@open-mpi.org
> > > > > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > > > 
> > > > _______________________________________________
> > > > devel mailing list
> > > > de...@open-mpi.org
> > > > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > > 
> > > _______________________________________________
> > > devel mailing list
> > > de...@open-mpi.org
> > > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
