[OMPI devel] v1.5 r26132 broken on multiple nodes?
I'm quitting for the day, but happened to notice that all our v1.5 MTT runs are failing with r26133, though tests ran fine as of r26129. Things run fine on-node, but if you run even just "hostname" on a remote node, the job fails with

orted: Command not found

I get this problem whether I include "--prefix $OPAL_PREFIX" or not.

Looking at recent putbacks, I see r26132 pulls in r26081 to fix #3047. According to both the trac ticket and http://www.open-mpi.org/community/lists/devel/2012/03/10672.php , r26081 alone isn't enough, but... whatever, I'm going to bed. It does seem like r26132 isn't quite right.
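For illustration only, a minimal sketch of the failing scenario described above, using a placeholder host name ("remote-node") that is not from the original report:

# A trivial non-MPI command launched on a remote node fails because the
# remote shell cannot find orted on its PATH:
$ mpirun --host remote-node -np 1 hostname
orted: Command not found

# Passing the install prefix explicitly would normally let the remote orted
# be found, but in this report it did not help either:
$ mpirun --prefix $OPAL_PREFIX --host remote-node -np 1 hostname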
Re: [OMPI devel] poor btl sm latency
We made a big step forward today!

The kernel in use has a bug regarding the shared L1 instruction cache in AMD Bulldozer processors. See
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=dfb09f9b7ab03fd367740e541a5caf830ed56726
and
http://developer.amd.com/Assets/SharedL1InstructionCacheonAMD15hCPU.pdf

Until the kernel is patched we disable address-space layout randomization (ASLR) as described in the above PDF:

$ sudo /sbin/sysctl -w kernel.randomize_va_space=0

With that, NetPIPE shows ~0.5us latency when binding the processes for L2/L1I cache sharing (i.e. -bind-to-core). However, when binding the processes for exclusive L2/L1I caches (i.e. -cpus-per-proc 2) we still get ~1.1us latency. I don't think that the upcoming kernel patch will help for this kind of process binding...

Matthias

On Monday 12 March 2012 11:09:01 Matthias Jurenz wrote:
> It's a SUSE Linux Enterprise Server 11 Service Pack 1 with kernel version
> 2.6.32.49-0.3-default.
>
> Matthias
>
> On Friday 09 March 2012 16:36:41 you wrote:
> > What OS are you using ?
> >
> > Joshua
> >
> > - Original Message -
> > From: Matthias Jurenz [mailto:matthias.jur...@tu-dresden.de]
> > Sent: Friday, March 09, 2012 08:50 AM
> > To: Open MPI Developers
> > Cc: Mora, Joshua
> > Subject: Re: [OMPI devel] poor btl sm latency
> >
> > I just made an interesting observation:
> >
> > When binding the processes to two neighboring cores (L2 sharing) NetPIPE
> > shows *sometimes* pretty good results: ~0.5us
> >
> > $ mpirun -mca btl sm,self -np 1 hwloc-bind -v core:0 ./NPmpi_ompi1.5.5 -u 4 -n 10 -p 0 : -np 1 hwloc-bind -v core:1 ./NPmpi_ompi1.5.5 -u 4 -n 10 -p 0
> > using object #0 depth 6 below cpuset 0x,0x
> > using object #1 depth 6 below cpuset 0x,0x
> > adding 0x0001 to 0x0
> > adding 0x0001 to 0x0
> > assuming the command starts at ./NPmpi_ompi1.5.5
> > binding on cpu set 0x0001
> > adding 0x0002 to 0x0
> > adding 0x0002 to 0x0
> > assuming the command starts at ./NPmpi_ompi1.5.5
> > binding on cpu set 0x0002
> > Using no perturbations
> >
> > 0: n035
> > Using no perturbations
> >
> > 1: n035
> > Now starting the main loop
> >
> > 0: 1 bytes 10 times --> 6.01 Mbps in 1.27 usec
> > 1: 2 bytes 10 times --> 12.04 Mbps in 1.27 usec
> > 2: 3 bytes 10 times --> 18.07 Mbps in 1.27 usec
> > 3: 4 bytes 10 times --> 24.13 Mbps in 1.26 usec
> >
> > $ mpirun -mca btl sm,self -np 1 hwloc-bind -v core:0 ./NPmpi_ompi1.5.5 -u 4 -n 10 -p 0 : -np 1 hwloc-bind -v core:1 ./NPmpi_ompi1.5.5 -u 4 -n 10 -p 0
> > using object #0 depth 6 below cpuset 0x,0x
> > adding 0x0001 to 0x0
> > adding 0x0001 to 0x0
> > assuming the command starts at ./NPmpi_ompi1.5.5
> > binding on cpu set 0x0001
> > using object #1 depth 6 below cpuset 0x,0x
> > adding 0x0002 to 0x0
> > adding 0x0002 to 0x0
> > assuming the command starts at ./NPmpi_ompi1.5.5
> > binding on cpu set 0x0002
> > Using no perturbations
> >
> > 0: n035
> > Using no perturbations
> >
> > 1: n035
> > Now starting the main loop
> >
> > 0: 1 bytes 10 times --> 12.96 Mbps in 0.59 usec
> > 1: 2 bytes 10 times --> 25.78 Mbps in 0.59 usec
> > 2: 3 bytes 10 times --> 38.62 Mbps in 0.59 usec
> > 3: 4 bytes 10 times --> 52.88 Mbps in 0.58 usec
> >
> > I can reproduce that approximately every tenth run.
> >
> > When binding the processes for exclusive L2 caches (e.g. core 0 and 2) I
> > get constant latencies of ~1.1us
> >
> > Matthias
> >
> > On Monday 05 March 2012 09:52:39 Matthias Jurenz wrote:
> > > Here the SM BTL parameters:
> > >
> > > $ ompi_info --param btl sm
> > > MCA btl: parameter "btl_base_verbose" (current value: <0>, data source:
> > >   default value) Verbosity level of the BTL framework
> > > MCA btl: parameter "btl" (current value: , data source: file
> > >   [/sw/atlas/libraries/openmpi/1.5.5rc3/x86_64/etc/openmpi-mca-params.conf])
> > >   Default selection set of components for the btl framework ( means use
> > >   all components that can be found)
> > > MCA btl: information "btl_sm_have_knem_support" (value: <1>, data source:
> > >   default value) Whether this component supports the knem Linux kernel
> > >   module or not
> > > MCA btl: parameter "btl_sm_use_knem" (current value: <-1>, data source:
> > >   default value) Whether knem support is desired or not (negative = try
> > >   to enable knem support, but continue even if it is not available, 0 =
> > >   do not enable knem support, positive = try to enable knem support and
> > >   fail if it is not available)
> > > MCA btl: parameter "btl_sm_knem_dma_min" (current value: <0>, data
> > >   source: default value) Minimum message size (in bytes) to use the knem
> > >   DMA mode; ignored if knem
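To make the two binding scenarios in this thread easy to compare side by side, here is a consolidated sketch built only from the commands quoted above. The explicit core numbers assume a Bulldozer layout in which cores 0 and 1 share a module (L2/L1I) while cores 0 and 2 do not; the actual numbering may differ per node, and NPmpi_ompi1.5.5 is the NetPIPE binary name used in the thread.

# Disable ASLR as suggested in the AMD note (requires root).
$ sudo /sbin/sysctl -w kernel.randomize_va_space=0

# Scenario 1: both ranks inside one Bulldozer module (shared L2/L1I),
# comparable to "-bind-to-core" above; ~0.5us reported once ASLR is off.
$ mpirun -mca btl sm,self \
    -np 1 hwloc-bind core:0 ./NPmpi_ompi1.5.5 -u 4 -n 10 -p 0 : \
    -np 1 hwloc-bind core:1 ./NPmpi_ompi1.5.5 -u 4 -n 10 -p 0

# Scenario 2: ranks on separate modules (exclusive L2/L1I), roughly what
# "-cpus-per-proc 2" produces; ~1.1us reported above.
$ mpirun -mca btl sm,self \
    -np 1 hwloc-bind core:0 ./NPmpi_ompi1.5.5 -u 4 -n 10 -p 0 : \
    -np 1 hwloc-bind core:2 ./NPmpi_ompi1.5.5 -u 4 -n 10 -p 0

# To rule knem in or out explicitly (semantics per the ompi_info output
# above): 0 = never use knem, positive = require knem or fail.
$ mpirun -mca btl sm,self -mca btl_sm_use_knem 0 \
    -np 1 hwloc-bind core:0 ./NPmpi_ompi1.5.5 -u 4 -n 10 -p 0 : \
    -np 1 hwloc-bind core:1 ./NPmpi_ompi1.5.5 -u 4 -n 10 -p 0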
Re: [OMPI devel] poor btl sm latency
On Mar 15, 2012, at 8:06 AM, Matthias Jurenz wrote:

> We made a big step forward today!
>
> The kernel in use has a bug regarding the shared L1 instruction cache in AMD
> Bulldozer processors. See
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=dfb09f9b7ab03fd367740e541a5caf830ed56726
> and
> http://developer.amd.com/Assets/SharedL1InstructionCacheonAMD15hCPU.pdf
>
> Until the kernel is patched we disable address-space layout randomization
> (ASLR) as described in the above PDF:
>
> $ sudo /sbin/sysctl -w kernel.randomize_va_space=0
>
> With that, NetPIPE shows ~0.5us latency when binding the processes for
> L2/L1I cache sharing (i.e. -bind-to-core).

This is good! I love it when the bug is not our fault. :-)

> However, when binding the processes for exclusive L2/L1I caches (i.e.
> -cpus-per-proc 2) we still get ~1.1us latency. I don't think that the
> upcoming kernel patch will help for this kind of process binding...

Does this kind of thing happen with Platform MPI, too? I.e., is this another kernel issue, or an OMPI-specific issue?

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
[OMPI devel] RFC: ob1: fallback on put/send on rget failure
What: Update ob1 to do the following:
 - fall back on send after rdma_put_retries_limit failures of prepare_dst
 - fall back on put (single non-pipelined) if the btl returns OMPI_ERR_NOT_AVAILABLE on a get transaction.

When: Timeout in about one week (Mar 22)

Why: Two reasons:
 - Some btls (ugni) need to switch to put for certain transactions. It makes sense to make this switch at the pml level.
 - If prepare_dst repeatedly fails for a get transaction we currently deadlock. We can avoid the deadlock (in most cases) by switching to send for the transaction.

Please take a look at the attached patch. Feedback and constructive criticism are needed!

-Nathan Hjelm
HPC-3, LANL

Attachment: ompi_trunk_ob1_get_fallback.patch.gz (GNU Zip compressed data)
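For reviewers, one possible way to try the attached patch against a trunk working copy (a sketch only; the checkout directory name "ompi-trunk" and the strip level -p0 are assumptions, not stated in the RFC):

# Apply the gzipped patch to an Open MPI trunk checkout for review/testing.
$ cd ompi-trunk
$ gunzip -c ompi_trunk_ob1_get_fallback.patch.gz | patch -p0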
Re: [OMPI devel] RFC: ob1: fallback on put/send on rget failure
Nathan,

I did not get any patch.

Regards,
Pavel (Pasha) Shamis
---
Application Performance Tools Group
Computer Science and Math Division
Oak Ridge National Laboratory


On Mar 15, 2012, at 5:07 PM, Nathan Hjelm wrote:

> What: Update ob1 to do the following:
>    - fall back on send after rdma_put_retries_limit failures of prepare_dst
>    - fall back on put (single non-pipelined) if the btl returns
>      OMPI_ERR_NOT_AVAILABLE on a get transaction.
>
> When: Timeout in about one week (Mar 22)
>
> Why: Two reasons:
>    - Some btls (ugni) need to switch to put for certain transactions. It
>      makes sense to make this switch at the pml level.
>    - If prepare_dst repeatedly fails for a get transaction we currently
>      deadlock. We can avoid the deadlock (in most cases) by switching to
>      send for the transaction.
>
> Please take a look at the attached patch. Feedback and constructive
> criticism are needed!
>
> -Nathan Hjelm
> HPC-3, LANL
Re: [OMPI devel] v1.5 r26132 broken on multiple nodes?
Let me know what you find - I took a look at the code and it looks correct. All required changes were included in the patch that was applied to the branch.

On Mar 14, 2012, at 11:27 PM, Eugene Loh wrote:

> I'm quitting for the day, but happened to notice that all our v1.5 MTT runs
> are failing with r26133, though tests ran fine as of r26129. Things run fine
> on-node, but if you run even just "hostname" on a remote node, the job fails
> with
>
> orted: Command not found
>
> I get this problem whether I include "--prefix $OPAL_PREFIX" or not.
>
> Looking at recent putbacks, I see r26132 pulls in r26081 to fix #3047.
> According to both the trac ticket and
> http://www.open-mpi.org/community/lists/devel/2012/03/10672.php , r26081
> alone isn't enough, but... whatever, I'm going to bed. It does seem like
> r26132 isn't quite right.