[OMPI devel] v1.5 r26132 broken on multiple nodes?
I'm quitting for the day, but happened to notice that all our v1.5 MTT runs are failing with r26133, though tests ran fine as of r26129. Things run fine on-node, but if you run even just "hostname" on a remote node, the job fails with

orted: Command not found

I get this problem whether I include "--prefix $OPAL_PREFIX" or not.

Looking at recent putbacks, I see r26132 pulls in r26081 to fix #3047. According to both the trac ticket and http://www.open-mpi.org/community/lists/devel/2012/03/10672.php , r26081 alone isn't enough, but... whatever, I'm going to bed. It does seem like r26132 isn't quite right.
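For illustration only, a minimal sketch of the failing scenario described above, using a placeholder host name ("remote-node") that is not from the original report:

# A trivial non-MPI command launched on a remote node fails because the
# remote shell cannot find orted on its PATH:
$ mpirun --host remote-node -np 1 hostname
orted: Command not found

# Passing the install prefix explicitly would normally let the remote orted
# be found, but in this report it did not help either:
$ mpirun --prefix $OPAL_PREFIX --host remote-node -np 1 hostname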
Re: [OMPI devel] poor btl sm latency
We made a big step forward today!

The kernel in use has a bug regarding the shared L1 instruction cache in AMD Bulldozer processors. See
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=dfb09f9b7ab03fd367740e541a5caf830ed56726
and
http://developer.amd.com/Assets/SharedL1InstructionCacheonAMD15hCPU.pdf

Until the kernel is patched we disable address-space layout randomization (ASLR) as described in the above PDF:

$ sudo /sbin/sysctl -w kernel.randomize_va_space=0

With that, NetPIPE shows ~0.5us latency when binding the processes for L2/L1I cache sharing (i.e. -bind-to-core). However, when binding the processes for exclusive L2/L1I caches (i.e. -cpus-per-proc 2) we still get ~1.1us latency. I don't think that the upcoming kernel patch will help for this kind of process binding...

Matthias

On Monday 12 March 2012 11:09:01 Matthias Jurenz wrote:
> It's a SUSE Linux Enterprise Server 11 Service Pack 1 with kernel version
> 2.6.32.49-0.3-default.
>
> Matthias
>
> On Friday 09 March 2012 16:36:41 you wrote:
> > What OS are you using ?
> >
> > Joshua
> >
> > - Original Message -
> > From: Matthias Jurenz [mailto:matthias.jur...@tu-dresden.de]
> > Sent: Friday, March 09, 2012 08:50 AM
> > To: Open MPI Developers
> > Cc: Mora, Joshua
> > Subject: Re: [OMPI devel] poor btl sm latency
> >
> > I just made an interesting observation:
> >
> > When binding the processes to two neighboring cores (L2 sharing) NetPIPE
> > shows *sometimes* pretty good results: ~0.5us
> >
> > $ mpirun -mca btl sm,self -np 1 hwloc-bind -v core:0 ./NPmpi_ompi1.5.5 -u 4 -n 10 -p 0 : -np 1 hwloc-bind -v core:1 ./NPmpi_ompi1.5.5 -u 4 -n 10 -p 0
> > using object #0 depth 6 below cpuset 0x,0x
> > using object #1 depth 6 below cpuset 0x,0x
> > adding 0x0001 to 0x0
> > adding 0x0001 to 0x0
> > assuming the command starts at ./NPmpi_ompi1.5.5
> > binding on cpu set 0x0001
> > adding 0x0002 to 0x0
> > adding 0x0002 to 0x0
> > assuming the command starts at ./NPmpi_ompi1.5.5
> > binding on cpu set 0x0002
> > Using no perturbations
> >
> > 0: n035
> > Using no perturbations
> >
> > 1: n035
> > Now starting the main loop
> >
> > 0: 1 bytes 10 times --> 6.01 Mbps in 1.27 usec
> > 1: 2 bytes 10 times --> 12.04 Mbps in 1.27 usec
> > 2: 3 bytes 10 times --> 18.07 Mbps in 1.27 usec
> > 3: 4 bytes 10 times --> 24.13 Mbps in 1.26 usec
> >
> > $ mpirun -mca btl sm,self -np 1 hwloc-bind -v core:0 ./NPmpi_ompi1.5.5 -u 4 -n 10 -p 0 : -np 1 hwloc-bind -v core:1 ./NPmpi_ompi1.5.5 -u 4 -n 10 -p 0
> > using object #0 depth 6 below cpuset 0x,0x
> > adding 0x0001 to 0x0
> > adding 0x0001 to 0x0
> > assuming the command starts at ./NPmpi_ompi1.5.5
> > binding on cpu set 0x0001
> > using object #1 depth 6 below cpuset 0x,0x
> > adding 0x0002 to 0x0
> > adding 0x0002 to 0x0
> > assuming the command starts at ./NPmpi_ompi1.5.5
> > binding on cpu set 0x0002
> > Using no perturbations
> >
> > 0: n035
> > Using no perturbations
> >
> > 1: n035
> > Now starting the main loop
> >
> > 0: 1 bytes 10 times --> 12.96 Mbps in 0.59 usec
> > 1: 2 bytes 10 times --> 25.78 Mbps in 0.59 usec
> > 2: 3 bytes 10 times --> 38.62 Mbps in 0.59 usec
> > 3: 4 bytes 10 times --> 52.88 Mbps in 0.58 usec
> >
> > I can reproduce that approximately every tenth run.
> >
> > When binding the processes for exclusive L2 caches (e.g. core 0 and 2) I
> > get constant latencies of ~1.1us
> >
> > Matthias
> >
> > On Monday 05 March 2012 09:52:39 Matthias Jurenz wrote:
> > > Here the SM BTL parameters:
> > >
> > > $ ompi_info --param btl sm
> > > MCA btl: parameter "btl_base_verbose" (current value: <0>, data source:
> > >   default value) Verbosity level of the BTL framework
> > > MCA btl: parameter "btl" (current value: , data source: file
> > >   [/sw/atlas/libraries/openmpi/1.5.5rc3/x86_64/etc/openmpi-mca-params.conf])
> > >   Default selection set of components for the btl framework ( means use
> > >   all components that can be found)
> > > MCA btl: information "btl_sm_have_knem_support" (value: <1>, data source:
> > >   default value) Whether this component supports the knem Linux kernel
> > >   module or not
> > > MCA btl: parameter "btl_sm_use_knem" (current value: <-1>, data source:
> > >   default value) Whether knem support is desired or not (negative = try
> > >   to enable knem support, but continue even if it is not available, 0 =
> > >   do not enable knem support, positive = try to enable knem support and
> > >   fail if it is not available)
> > > MCA btl: parameter "btl_sm_knem_dma_min" (current value: <0>, data
> > >   source: default value) Minimum message size (in bytes) to use the knem
> > >   DMA mode; ignored if knem
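To make the two binding scenarios in this thread easy to compare side by side, here is a consolidated sketch built only from the commands quoted above. The explicit core numbers assume a Bulldozer layout in which cores 0 and 1 share a module (L2/L1I) while cores 0 and 2 do not; the actual numbering may differ per node, and NPmpi_ompi1.5.5 is the NetPIPE binary name used in the thread.

# Disable ASLR as suggested in the AMD note (requires root).
$ sudo /sbin/sysctl -w kernel.randomize_va_space=0

# Scenario 1: both ranks inside one Bulldozer module (shared L2/L1I),
# comparable to "-bind-to-core" above; ~0.5us reported once ASLR is off.
$ mpirun -mca btl sm,self \
    -np 1 hwloc-bind core:0 ./NPmpi_ompi1.5.5 -u 4 -n 10 -p 0 : \
    -np 1 hwloc-bind core:1 ./NPmpi_ompi1.5.5 -u 4 -n 10 -p 0

# Scenario 2: ranks on separate modules (exclusive L2/L1I), roughly what
# "-cpus-per-proc 2" produces; ~1.1us reported above.
$ mpirun -mca btl sm,self \
    -np 1 hwloc-bind core:0 ./NPmpi_ompi1.5.5 -u 4 -n 10 -p 0 : \
    -np 1 hwloc-bind core:2 ./NPmpi_ompi1.5.5 -u 4 -n 10 -p 0

# To rule knem in or out explicitly (semantics per the ompi_info output
# above): 0 = never use knem, positive = require knem or fail.
$ mpirun -mca btl sm,self -mca btl_sm_use_knem 0 \
    -np 1 hwloc-bind core:0 ./NPmpi_ompi1.5.5 -u 4 -n 10 -p 0 : \
    -np 1 hwloc-bind core:1 ./NPmpi_ompi1.5.5 -u 4 -n 10 -p 0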
Re: [OMPI devel] poor btl sm latency
On Mar 15, 2012, at 8:06 AM, Matthias Jurenz wrote:

> We made a big step forward today!
>
> The kernel in use has a bug regarding the shared L1 instruction cache in AMD
> Bulldozer processors. See
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=dfb09f9b7ab03fd367740e541a5caf830ed56726
> and
> http://developer.amd.com/Assets/SharedL1InstructionCacheonAMD15hCPU.pdf
>
> Until the kernel is patched we disable address-space layout randomization
> (ASLR) as described in the above PDF:
>
> $ sudo /sbin/sysctl -w kernel.randomize_va_space=0
>
> With that, NetPIPE shows ~0.5us latency when binding the processes for
> L2/L1I cache sharing (i.e. -bind-to-core).

This is good! I love it when the bug is not our fault. :-)

> However, when binding the processes for exclusive L2/L1I caches (i.e.
> -cpus-per-proc 2) we still get ~1.1us latency. I don't think that the
> upcoming kernel patch will help for this kind of process binding...

Does this kind of thing happen with Platform MPI, too? I.e., is this another kernel issue, or an OMPI-specific issue?

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
[OMPI devel] RFC: ob1: fallback on put/send on rget failure
What: Update ob1 to do the following:
 - fall back on send after rdma_put_retries_limit failures of prepare_dst
 - fall back on put (single non-pipelined) if the btl returns OMPI_ERR_NOT_AVAILABLE on a get transaction.

When: Timeout in about one week (Mar 22)

Why: Two reasons:
 - Some btls (ugni) need to switch to put for certain transactions. It makes sense to make this switch at the pml level.
 - If prepare_dst repeatedly fails for a get transaction we currently deadlock. We can avoid the deadlock (in most cases) by switching to send for the transaction.

Please take a look at the attached patch. Feedback and constructive criticism are needed!

-Nathan Hjelm
HPC-3, LANL

Attachment: ompi_trunk_ob1_get_fallback.patch.gz (GNU Zip compressed data)
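For reviewers, one possible way to try the attached patch against a trunk working copy (a sketch only; the checkout directory name "ompi-trunk" and the strip level -p0 are assumptions, not stated in the RFC):

# Apply the gzipped patch to an Open MPI trunk checkout for review/testing.
$ cd ompi-trunk
$ gunzip -c ompi_trunk_ob1_get_fallback.patch.gz | patch -p0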
Re: [OMPI devel] RFC: ob1: fallback on put/send on rget failure
Nathan,

I did not get any patch.

Regards,
Pavel (Pasha) Shamis
---
Application Performance Tools Group
Computer Science and Math Division
Oak Ridge National Laboratory


On Mar 15, 2012, at 5:07 PM, Nathan Hjelm wrote:

> What: Update ob1 to do the following:
>    - fall back on send after rdma_put_retries_limit failures of prepare_dst
>    - fall back on put (single non-pipelined) if the btl returns
>      OMPI_ERR_NOT_AVAILABLE on a get transaction.
>
> When: Timeout in about one week (Mar 22)
>
> Why: Two reasons:
>    - Some btls (ugni) need to switch to put for certain transactions. It
>      makes sense to make this switch at the pml level.
>    - If prepare_dst repeatedly fails for a get transaction we currently
>      deadlock. We can avoid the deadlock (in most cases) by switching to
>      send for the transaction.
>
> Please take a look at the attached patch. Feedback and constructive
> criticism are needed!
>
> -Nathan Hjelm
> HPC-3, LANL
Re: [OMPI devel] v1.5 r26132 broken on multiple nodes?
Let me know what you find - I took a look at the code and it looks correct. All required changes were included in the patch that was applied to the branch.

On Mar 14, 2012, at 11:27 PM, Eugene Loh wrote:

> I'm quitting for the day, but happened to notice that all our v1.5 MTT runs
> are failing with r26133, though tests ran fine as of r26129. Things run fine
> on-node, but if you run even just "hostname" on a remote node, the job fails
> with
>
> orted: Command not found
>
> I get this problem whether I include "--prefix $OPAL_PREFIX" or not.
>
> Looking at recent putbacks, I see r26132 pulls in r26081 to fix #3047.
> According to both the trac ticket and
> http://www.open-mpi.org/community/lists/devel/2012/03/10672.php , r26081
> alone isn't enough, but... whatever, I'm going to bed. It does seem like
> r26132 isn't quite right.