Re: [OMPI devel] Trunk VT breakage

2011-11-22 Thread Matthias Jurenz
Hi,

I've just made a clean checkout of the trunk and built it - I didn't see any 
problems. Also the latest MTT test-builds look good.

Could you please re-checkout the trunk to have a clean working copy and try it 
again?

Matthias

On Tuesday 22 November 2011 04:15:25 Ralph Castain wrote:
> Hi VT developers
> 
> The trunk is broken again, at least on a Mac:
> 
> make[7]: *** No rule to make target `vt_filthandler.cc', needed by
> `vtfilter-vt_filthandler.o'.  Stop.
> 
> 
> This is with a vanilla ./configure, no options other than prefix. Can you
> please fix?
> 
> Thanks
> Ralph
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


Re: [OMPI devel] [BUG?] OpenMPI with openib on SPARC64: Signal: Bus error (10)

2011-11-22 Thread TERRY DONTJE
The error you are seeing usually indicates code operating on memory 
that isn't aligned properly for the SPARC instruction being used.  The 
failing address is odd, which is more than likely the culprit.  If you 
have a core dump and can disassemble the code that was running at the 
time, it will probably be some sort of instruction that requires 
alignment.  If the MPI you are using is something you built, can you 
try building OMPI with -g and get the line number in the PML that is 
failing?
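
As a rough sketch of the "get the line number" step: once the library has been rebuilt with -g, a frame from the backtrace quoted below, such as mca_pml_ob1.so(+0x62f0), can be handed to binutils' addr2line. The helper name frame_to_addr2line is made up for illustration, and it only prints the command rather than running it; whether the offset resolves to a useful line depends on the rebuilt library matching the one that crashed.

```shell
#!/bin/sh
# Hypothetical helper: turn one backtrace frame of the form
#   /path/to/object.so(+0xOFFSET)
# into the addr2line command that would look that offset up.
frame_to_addr2line() {
    # $1: a single "object(+0xOFFSET)" frame as printed in the error output
    printf '%s\n' "$1" |
        sed -n 's/^\(.*\)(+\(0x[0-9a-fA-F]*\))$/addr2line -f -e \1 \2/p'
}

# Frame [ 0] from the report quoted in this message.
frame_to_addr2line '/usr/mpi/gcc/openmpi-1.4.3/lib64/openmpi/mca_pml_ob1.so(+0x62f0)'
```

Running the printed command against a -g build should name the function and source line, assuming debug information is present.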


I haven't seen this type of error for some time, but I do all of my SPARC 
testing on Solaris with the Solaris Studio compilers.  You may want to try 
compiling the benchmark with "-m32" to see if that helps, though with an 
odd address I suspect it might not.  If you can use the Studio compilers, 
you could try giving them the -xmemalign=8i option when building the 
benchmark and see if that resolves the issue.  That would help confirm 
that the issue is just the alignment of data we are slicing and dicing, 
as opposed to wrongly addressing memory.


--td

On 11/21/2011 8:51 PM, Lukas Razik wrote:

Hello everybody!

I have Sun T5120 (SPARC64) servers with
- Debian: 6.0.3
- linux-2.6.39.4 (from kernel.org)
- OFED-1.5.3.2
- InfiniBand: Mellanox Technologies MT25418 [ConnectX VPI PCIe 2.0 2.5GT/s - IB 
DDR / 10GigE] (rev a0)
   with newest FW (2.9.1)
and the following issue:

If I try to mpirun a program like the osu_latency benchmark:
$ /usr/mpi/gcc/openmpi-1.4.3/bin/mpirun -np 2 --mca btl_base_verbose 50 \
    --mca btl_openib_verbose 1 -host cluster1,cluster2 \
    /usr/mpi/gcc/openmpi-1.4.3/tests/osu_benchmarks-3.1.1/osu_latency

then I get these errors:

# OSU MPI Latency Test v3.1.1
# Size            Latency (us)
[cluster1:64027] *** Process received signal ***
[cluster1:64027] Signal: Bus error (10)
[cluster1:64027] Signal code: Invalid address alignment (1)
[cluster1:64027] Failing at address: 0xaa9053
[cluster1:64027] [ 0] /usr/mpi/gcc/openmpi-1.4.3/lib64/openmpi/mca_pml_ob1.so(+0x62f0) [0xf8010209e2f0]
[cluster1:64027] [ 1] /usr/mpi/gcc/openmpi-1.4.3/lib64/openmpi/mca_coll_tuned.so(+0x2904) [0xf801031ce904]
[cluster1:64027] [ 2] /usr/mpi/gcc/openmpi-1.4.3/lib64/openmpi/mca_coll_tuned.so(+0xb498) [0xf801031d7498]
[cluster1:64027] [ 3] /usr/mpi/gcc/openmpi-1.4.3/lib64/libmpi.so.0(MPI_Barrier+0xbc) [0xf8010005a97c]
[cluster1:64027] [ 4] /usr/mpi/gcc/openmpi-1.4.3/tests/osu_benchmarks-3.1.1/osu_latency(main+0x2b0) [0x100f34]
[cluster1:64027] [ 5] /lib64/libc.so.6(__libc_start_main+0x100) [0xf80100ac1240]
[cluster1:64027] [ 6] /usr/mpi/gcc/openmpi-1.4.3/tests/osu_benchmarks-3.1.1/osu_latency(_start+0x2c) [0x100bac]
[cluster1:64027] *** End of error message ***
[cluster2:02759] *** Process received signal ***
[cluster2:02759] Signal: Bus error (10)
[cluster2:02759] Signal code: Invalid address alignment (1)
[cluster2:02759] Failing at address: 0xaa9053
[cluster2:02759] [ 0] /usr/mpi/gcc/openmpi-1.4.3/lib64/openmpi/mca_pml_ob1.so(+0x62f0) [0xf8010209e2f0]
[cluster2:02759] [ 1] /usr/mpi/gcc/openmpi-1.4.3/lib64/openmpi/mca_coll_tuned.so(+0x2904) [0xf801031ce904]
[cluster2:02759] [ 2] /usr/mpi/gcc/openmpi-1.4.3/lib64/openmpi/mca_coll_tuned.so(+0xb498) [0xf801031d7498]
[cluster2:02759] [ 3] /usr/mpi/gcc/openmpi-1.4.3/lib64/libmpi.so.0(MPI_Barrier+0xbc) [0xf8010005a97c]
[cluster2:02759] [ 4] /usr/mpi/gcc/openmpi-1.4.3/tests/osu_benchmarks-3.1.1/osu_latency(main+0x2b0) [0x100f34]
[cluster2:02759] [ 5] /lib64/libc.so.6(__libc_start_main+0x100) [0xf80100ac1240]
[cluster2:02759] [ 6] /usr/mpi/gcc/openmpi-1.4.3/tests/osu_benchmarks-3.1.1/osu_latency(_start+0x2c) [0x100bac]
[cluster2:02759] *** End of error message ***
---

The whole output can be found here:
http://net.razik.de/linux/T5120/openmpi-1.4.3-verbose.txt

That's my 'ompi_info --param all all' output:
http://net.razik.de/linux/T5120/openmpi-1.4.3_param_all_all.txt

Same error with OFED-1.5.4-rc4 and also the same with openmpi-1.4.4.

If I disable openib then I get the right results:
$ /usr/mpi/gcc/openmpi-1.4.3/bin/mpirun --mca btl ^openib -np 2 \
    -host cluster1,cluster2 \
    /usr/mpi/gcc/openmpi-1.4.3/tests/osu_benchmarks-3.1.1/osu_latency
# OSU MPI Latency Test v3.1.1
# Size            Latency (us)
0   143.53
1   140.50

---

ibverbs seems to work:
$ ibv_srq_pingpong -n 100 cluster2

819200 bytes in 4.15 seconds = 15806.63 Mbit/sec
100 iters in 4.15 seconds = 4.15 usec/iter
---

These are the installed OFED packages:
kernel-ib
ofed-scripts
libibverbs
libibverbs-devel
libibverbs-utils
libmlx4
libmlx4-devel
libibumad
libibumad-devel
libibmad
libibmad-devel
librdmacm
librdmacm-utils
librdmacm-devel
opensm-libs
ibutils
infiniband-diags
qperf
ofed-docs
mpi-selector
openmpi_gcc
mpitests_openmpi_gcc
---

I don't know which mailing list is the right one and I'm very thankful for any 
help!
If you have questions, please ask!

Best regard

Re: [OMPI devel] [BUG?] OpenMPI with openib on SPARC64: Signal: Bus error (10)

2011-11-22 Thread TERRY DONTJE



On 11/22/2011 5:49 AM, TERRY DONTJE wrote:
The error you are seeing usually indicates code operating on memory 
that isn't aligned properly for the SPARC instruction being used.  The 
failing address is odd, which is more than likely the culprit.  If you 
have a core dump and can disassemble the code that was running at the 
time, it will probably be some sort of instruction that requires 
alignment.  If the MPI you are using is something you built, can you 
try building OMPI with -g and get the line number in the PML that is 
failing?


I haven't seen this type of error for some time, but I do all of my 
SPARC testing on Solaris with the Solaris Studio compilers.  You may 
want to try compiling the benchmark with "-m32" to see if that helps, 
though with an odd address I suspect it might not.  If you can use 
the Studio compilers, you could try giving them the -xmemalign=8i 
option when building the benchmark and see if that resolves the 
issue.  That would help confirm that the issue is just the alignment 
of data we are slicing and dicing, as opposed to wrongly addressing 
memory.


After thinking about this, you probably won't be able to use the Studio 
compilers, because on Linux they only support x86 platforms, not SPARC.  
I'm not sure whether gcc has anything like the -xmemalign option.


--
Oracle
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle *- Performance Technologies*
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com 





Re: [OMPI devel] Open MPI 1.4.4 Configuration Leftover from 1.4.2

2011-11-22 Thread Jeff Squyres
That is very strange.  That sym link does not exist in the 1.4.4 tarball -- it 
is generated during "make".

You didn't accidentally do a VPATH build using the OMPI v1.4.2 source, did you?  
The rules in opal/asm/Makefile.am refer to top_srcdir, which should 
implicitly point to the top directory of the source code.  There's no explicit 
reference to any specific directory -- they're all relative to the top source 
directory.

See 
https://svn.open-mpi.org/trac/ompi/browser/branches/v1.4/opal/asm/Makefile.am#L21.
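
Since the sym link is generated during "make" rather than shipped in the tarball, one way to check whether a build directory carries leftovers from an earlier build is to look for symlinks whose targets no longer exist. The following is only a sketch; the function name list_dangling_symlinks is made up, and the demo directory stands in for a real build tree.

```shell
#!/bin/sh
# Hypothetical helper: print every symlink under a directory whose
# target does not exist (test -e follows the link, so it fails for
# dangling links).
list_dangling_symlinks() {
    # $1: build directory to scan
    find "$1" -type l ! -exec test -e {} \; -print
}

# Small demo in a scratch directory: one stale link, modelled on the
# atomic-asm.S link shown in the quoted report below.
demo=$(mktemp -d)
ln -s "$demo/openmpi-1.4.2/opal/asm/generated/atomic-amd64-linux.s" \
      "$demo/atomic-asm.S"
list_dangling_symlinks "$demo"
rm -rf "$demo"
```

If the scan of a real build directory reports links into an older source tree, starting again from an empty build directory is probably the simplest fix.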



On Nov 21, 2011, at 3:38 PM, Bruce Foster wrote:

> I configured the 1.4.4 release with the following:
>  
> #!/bin/sh
>  
> NUVER=1.4.4
>  
> rm -f config.cache
> rm -f NUInstall.configure
>  
> #/bef/OpenMPI/openmpi-${NUVER}/configure \
> ../../openmpi-${NUVER}/configure \
> --prefix=/opt/openmpi/GNU \
> --with-tm=/usr/pbs \
> --enable-static \
> --disable-dlopen \
> --build=x86_64-redhat-linux-gnu \
> --host=x86_64-redhat-linux-gnu \
> --target=x86_64-redhat-linux-gnu \
> 2>&1 | tee NUInstall.configure
>  
> I don’t see any problems in the configuration output. However, when I try to 
> make the results, there is an explicit reference to release 1.4.2, and the 
> make fails:
>  
> [root@seldon GNU]# more NUInstall.make.all
> Making all in config
> make[1]: Entering directory `/sscc/opt/Apps/OpenMPI/Configs/GNU/config'
> make[1]: Nothing to be done for `all'.
> make[1]: Leaving directory `/sscc/opt/Apps/OpenMPI/Configs/GNU/config'
> Making all in contrib
> make[1]: Entering directory `/sscc/opt/Apps/OpenMPI/Configs/GNU/contrib'
> make[1]: Nothing to be done for `all'.
> make[1]: Leaving directory `/sscc/opt/Apps/OpenMPI/Configs/GNU/contrib'
> Making all in opal
> make[1]: Entering directory `/sscc/opt/Apps/OpenMPI/Configs/GNU/opal'
> Making all in include
> make[2]: Entering directory `/sscc/opt/Apps/OpenMPI/Configs/GNU/opal/include'
> make  all-am
> make[3]: Entering directory `/sscc/opt/Apps/OpenMPI/Configs/GNU/opal/include'
> make[3]: Leaving directory `/sscc/opt/Apps/OpenMPI/Configs/GNU/opal/include'
> make[2]: Leaving directory `/sscc/opt/Apps/OpenMPI/Configs/GNU/opal/include'
> Making all in asm
> make[2]: Entering directory `/sscc/opt/Apps/OpenMPI/Configs/GNU/opal/asm'
> make[2]: *** No rule to make target `../../../../openmpi-1.4.2/opal/asm/asm.c', needed by `asm.lo'.  Stop.
> make[2]: Leaving directory `/sscc/opt/Apps/OpenMPI/Configs/GNU/opal/asm'
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory `/sscc/opt/Apps/OpenMPI/Configs/GNU/opal'
> make: *** [all-recursive] Error 1
>  
> Looking at the configured opal/asm directory, there’s a bad symlink to the 
> 1.4.2 release:
>  
> [bef@seldon GNU]# cd opal/asm
> [bef@seldon asm]# ls -l
> total 199
> -rw-r--r-- 1 bef bef   264 May  5  2010 asm.lo
> -rw-r--r-- 1 bef bef   935 May  5  2010 asm.o
> -rw-r--r-- 1 bef bef   285 May  5  2010 atomic-asm.lo
> -rw-r--r-- 1 bef bef  1115 May  5  2010 atomic-asm.o
> lrwxrwxrwx 1 bef bef    65 May  5  2010 atomic-asm.S -> ../../../../openmpi-1.4.2/opal/asm/generated/atomic-amd64-linux.s
> -rw-r--r-- 1 bef bef   873 May  5  2010 libasm.la
> -rw-r--r-- 1 bef bef 54526 Oct 14 16:19 Makefile
>  
> Bruce


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] Trunk VT breakage

2011-11-22 Thread Jeff Squyres
FWIW, me too -- it builds ok for me, too.

I have a dim recollection of some change a while ago that caused this badness 
(i.e., the trunk is fixed, but an old/stale sym link from before the fix makes 
it look like the problem is still there).  

IIRC, if you rm -rf the ompi/contrib/vt tree and then svn up to restore it, you 
should be good.


On Nov 22, 2011, at 3:52 AM, Matthias Jurenz wrote:

> Hi,
> 
> I've just made a clean checkout of the trunk and built it - I didn't see any 
> problems. Also the latest MTT test-builds look good.
> 
> Could you please re-checkout the trunk to have a clean working copy and try 
> it 
> again?
> 
> Matthias
> 
> On Tuesday 22 November 2011 04:15:25 Ralph Castain wrote:
>> Hi VT developers
>> 
>> The trunk is broken again, at least on a Mac:
>> 
>> make[7]: *** No rule to make target `vt_filthandler.cc', needed by
>> `vtfilter-vt_filthandler.o'.  Stop.
>> 
>> 
>> This is with a vanilla ./configure, no options other than prefix. Can you
>> please fix?
>> 
>> Thanks
>> Ralph
>> 
>> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] PGI error invoked when svnversion is unavailable

2011-11-22 Thread Jeff Squyres
Tom / Larry --

Thanks for looking into this.

1. I just replied on https://svn.open-mpi.org/trac/ompi/ticket/2913 about the 
sed issue; let's iterate there to find the Right solution.

2. Larry: I'll look at the other issues in your patch more closely after the 
Thanksgiving break (I'm supposedly on vacation today; I'm going to be in 
trouble if I don't stop working soon!).




On Nov 15, 2011, at 8:17 PM, Thomas Rothrock CTR SMDC SimCtr/GaN Corporation 
wrote:

> Thank you for the reply and attachments.  I searched the archive before
> sending my own email and did find a couple of the previous messages, but
> believe mine is a different situation.  The macro expansion is working (I'm
> using PGI 11.10) but the macro is set to "" instead of "1.7a1? (no
> svnversion) MM-DD-" by an error in the configure script generation.
> Installing subversion is a cheap workaround that makes the svnversion
> command available and the macro gets set to "1.7a1r25476M" or whatever repo
> version you are at.
> 
> I did not mention it before but this happens when I try to build the trunk
> or the 1.5 development branch.  The 1.4 branch's configure code doesn't even
> bother to test for svnversion failing and just ends up setting OMPI_VERSION
> to "1.4.4rc5r"
> 
> The problem is with configure's with_ident_string assignment:
> 
>   with_ident_string="`echo $with_ident_string | sed -e 's/%VERSION%/'$OMPI_VERSION/`"
> 
> OMPI_VERSION is set to "1.7a1? (no svnversion) MM-DD-" but the spaces in
> OMPI_VERSION break the expression because sed sees the rest of the version
> string as extra arguments instead of part of the -e script.  It doesn't show
> up in config.log, but the configure output to the terminal at this point is:
> 
>   checking if want ident string... sed: -e expression #1, char 18: unterminated `s' command
> 
> The config.status shows:
> 
>   D["OPAL_IDENT_STRING"]=" \"\""
> 
> If I patch config/opal_get_version.m4 to remove the spaces:
> 
> 
> 
> diff -Naurd openmpi-trunk.a/config/opal_get_version.m4 openmpi-trunk.b/config/opal_get_version.m4
> --- openmpi-trunk.a/config/opal_get_version.m4  2011-09-21 23:17:36.0 -0500
> +++ openmpi-trunk.b/config/opal_get_version.m4  2011-11-15 17:36:09.0 -0600
> @@ -81,7 +81,7 @@
> if test $? != 0; then
> # The following is too long for Fortran
> # $2_REPO_REV="unknown svn version (svnversion not found); $d"
> -$2_REPO_REV="? (no svnversion); $d"
> +$2_REPO_REV="?(nosvnversion);$d"
> fi
> elif test -d "$srcdir/.hg" ; then
> # Check to see if we can find the hg command
> 
> 
> 
> it verifies that the spaces were the problem.  OMPI_VERSION gets set to
> "1.7a1?(nosvnversion);11-15-2011", configure runs without an sed error, and
> config.status shows:
> 
>   D["OPAL_IDENT_STRING"]=" \"1.7a1?(nosvnversion);11-15-2011\""
> 
> and opal/runtime/opal_init.c compiles without triggering the misleading PGI
> error, but doesn't look quite as pretty.  Perhaps a better solution is to
> modify the with_ident_string assignment to work correctly with spaces in
> OMPI_VERSION intact:
> 
> 
> 
> diff -Naurd openmpi-trunk.a/opal/config/opal_configure_options.m4 openmpi-trunk.b/opal/config/opal_configure_options.m4
> --- openmpi-trunk.a/opal/config/opal_configure_options.m4   2011-11-15 17:55:36.0 -0600
> +++ openmpi-trunk.b/opal/config/opal_configure_options.m4   2011-11-15 18:29:43.0 -0600
> @@ -489,7 +489,7 @@
> if test "$with_ident_string" = "" -o "$with_ident_string" = "no"; then
> with_ident_string="%VERSION%"
> fi
> -with_ident_string="`echo $with_ident_string | sed -e 's/%VERSION%/'$OMPI_VERSION/`"
> +with_ident_string="`echo $with_ident_string | sed -e 's/%VERSION%/'"$OMPI_VERSION"'/'`"
> AC_DEFINE_UNQUOTED([OPAL_IDENT_STRING], ["$with_ident_string"],
>  [ident string for Open MPI])
> AC_MSG_RESULT([$with_ident_string])
> 
> 
> In this case the resulting config.status has:
> 
>   D["OPAL_IDENT_STRING"]=" \"1.7a1? (no svnversion); 11-15-2011\""
> 
> and the compile works.  I have attached the second patch for both trunk and
> 1.5 as it is probably the better solution (don't assume OMPI_VERSION has no
> spaces) and I have not found other instances of spaces in the version string
> breaking anything.  As for OpenMPI 1.4's development branch, I'll leave the
> choice to patch or leave as-is for someone else to discuss.
> 
> FYI, PGI has assigned my support request TPR #18274.  I'm curious what they
> will come back with.
> 
>   ---
>Tom Rothrock  
>US Army Space & Missile Defense Command Simulation Center
>256-955-3382 (DSN 645)   FAX 256-955-1231
> Main SimCtr Phone:  256-9
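
The quoting failure described in the message quoted above can be reproduced outside of configure. This is a minimal sketch, not the configure code itself: it uses $(...) instead of backticks for readability, and the OMPI_VERSION value is the example string from the quoted config.status output.

```shell
#!/bin/sh
# Example value taken from the quoted message; it contains spaces.
OMPI_VERSION='1.7a1? (no svnversion); 11-15-2011'
with_ident_string='%VERSION%'

# Unquoted expansion, as in the original line: the shell splits
# $OMPI_VERSION on spaces, so sed sees a truncated -e script plus
# stray arguments and fails.  "|| true" keeps the sketch running.
broken=$(echo "$with_ident_string" |
    sed -e 's/%VERSION%/'$OMPI_VERSION/ 2>/dev/null) || true

# Quoted expansion, as in the proposed patch: the whole version
# string stays inside the single -e script.
fixed=$(echo "$with_ident_string" |
    sed -e 's/%VERSION%/'"$OMPI_VERSION"'/')

echo "broken: [$broken]"
echo "fixed:  [$fixed]"
```

With the unquoted form the substitution should come back empty, which matches the empty OPAL_IDENT_STRING reported above. The quoted form would still break on a version string containing "/" or "&", since those are special in a sed replacement.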

Re: [OMPI devel] Trunk VT breakage

2011-11-22 Thread Ralph Castain
Ah, yes - I recall that as well now. Thx!

On Nov 22, 2011, at 5:56 AM, Jeff Squyres wrote:

> FWIW, me too -- it builds ok for me, too.
> 
> I have a dim recollection of some change a while ago that caused this badness 
> (i.e., the trunk is fixed, but an old/stale sym link from before the fix 
> makes it look like the problem is still there).  
> 
> IIRC, if you rm -rf the ompi/contrib/vt tree and then svn up to restore it, 
> you should be good.
> 
> 
> On Nov 22, 2011, at 3:52 AM, Matthias Jurenz wrote:
> 
>> Hi,
>> 
>> I've just made a clean checkout of the trunk and built it - I didn't see any 
>> problems. Also the latest MTT test-builds look good.
>> 
>> Could you please re-checkout the trunk to have a clean working copy and try 
>> it 
>> again?
>> 
>> Matthias
>> 
>> On Tuesday 22 November 2011 04:15:25 Ralph Castain wrote:
>>> Hi VT developers
>>> 
>>> The trunk is broken again, at least on a Mac:
>>> 
>>> make[7]: *** No rule to make target `vt_filthandler.cc', needed by
>>> `vtfilter-vt_filthandler.o'.  Stop.
>>> 
>>> 
>>> This is with a vanilla ./configure, no options other than prefix. Can you
>>> please fix?
>>> 
>>> Thanks
>>> Ralph
>>> 
>>> 
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 




Re: [OMPI devel] Rename "vader" BTL to "xpmem"

2011-11-22 Thread Jeff Squyres
Here's what Nathan and I discussed / decided:

1. Nathan shied away from the name "xpmem" in case some other shared memory 
scheme basically did the same thing as XPMEM (i.e., single copy techniques).  
(just FYI: xpmem's setup is a little different from KNEM, though, so they 
didn't merge in KNEM support to vader)  Hence, he wanted a neutral name that 
could apply to xpmem and others.  He and Sam have some possible names that 
could be suitable ("single copy ...something..."; I don't remember offhand).

2. We've long talked about having an MCA component aliasing scheme.  Perhaps 
now is the time to do it.  Such a scheme would do two things:

   - provide alias names for components.  For example, both of the following
 would be equivalent:

 mpirun --mca btl openib,self ...
 mpirun --mca btl ofrc,self ...

   - automatically register alias MCA parameters.  For example, both of the
 following would be equivalent:

 mpirun --mca btl_openib_param 1 ...
 mpirun --mca btl_ofrc_param 1 ...

This would solve two problems:

2a. Finally be able to rename the "openib" module to something more sensical; 
"ofrc", perhaps?  ("ofrc" = OpenFabrics reliable connected transport, as 
opposed to the existing "ofud" = OpenFabrics unreliable datagram transport BTL).

2b. Rename vader to be xpmem, because it only supports xpmem at the moment.  If 
that component is expanded in the future to support other similar single-copy 
schemes, it can be renamed to some neutral name and have "xpmem" as an alias.

Nathan agreed to look into a module aliasing scheme / vader->xpmem rename after 
he works the hide-OB1/BTL-descriptor-lengths issue that was previously 
discussed on the list.  This will likely be in early/mid December.
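
To make the aliasing idea in item 2 concrete, here is a purely illustrative sketch in shell. Nothing like this exists in the code base at this point, the real MCA is written in C, and the function names and the direction of the mapping (alias "ofrc" resolving to the existing "openib" component) are assumptions made only for the example.

```shell
#!/bin/sh
# Hypothetical alias table: alias name -> existing component name.
resolve_component() {
    case "$1" in
        ofrc) echo openib ;;   # alias proposed in this thread (assumed direction)
        *)    echo "$1" ;;     # not an alias: use the name as given
    esac
}

# Map btl_<alias>_<param> onto btl_<component>_<param>.
# Sketch only: assumes the component name contains no underscore.
resolve_param() {
    rest=${1#btl_}           # drop the framework prefix
    comp=${rest%%_*}         # component name up to the first underscore
    suffix=${rest#"$comp"}   # remainder, including its leading underscore
    echo "btl_$(resolve_component "$comp")$suffix"
}

resolve_component ofrc
resolve_param btl_ofrc_param
```

Under these assumptions both spellings of a component name and of its parameters end up at the same place, which is the equivalence the mpirun examples above describe.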



On Nov 17, 2011, at 8:11 AM, Jeff Squyres wrote:

> After having to explain to someone at SC for the umpteenth time this week 
> that the "vader" BTL uses the XPMEM transport under the covers, I'd like to 
> put forth an appeal to rename the "vader" BTL to be "xpmem."
> 
> Here's my rationale for why:
> 
> 1. Although we have a history of Star Wars-related names, the "ob1" and "r2" 
> components got their names because they're mainly algorithms that have no 
> obvious name that describes what they do.
> 
> 2. All other components that tie into some back-end system are named 
> reflecting the back-end system (e.g., tcp, mx, portals, ...etc.).  "openib" 
> is the weakest example, but we all know that it was named way back when OFED 
> was named "OpenIB", and the name has kinda stuck.
> 
> 3. The BTL name "xpmem" follows the law of least astonishment from the user's 
> perspective.
> 
> 4. Cute names rarely seem so after 6 months.
> 
> I'll even volunteer to do the work to rename it (a bunch of file moves and 
> global search-and-replaces).
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] Rename "vader" BTL to "xpmem"

2011-11-22 Thread TERRY DONTJE
So with the aliasing scheme, the code for openib would still be under 
ompi/mca/btl/openib, but you could access it with -mca btl ofrc?  Ok, so 
when an error happens in the openib btl, how does it identify itself?  
Does it use openib or ofrc?  It seems like adopting the aliasing scheme 
could cause some user confusion.


--td

On 11/22/2011 9:22 AM, Jeff Squyres wrote:

Here's what Nathan and I discussed / decided:

1. Nathan shied away from the name "xpmem" in case some other shared memory scheme 
basically did the same thing as XPMEM (i.e., single copy techniques).  (just FYI: xpmem's setup is 
a little different from KNEM, though, so they didn't merge in KNEM support to vader)  Hence, he 
wanted a neutral name that could apply to xpmem and others.  He and Sam have some possible names 
that could be suitable ("single copy ...something..."; I don't remember offhand).

2. We've long talked about having an MCA component aliasing scheme.  Perhaps 
now is the time to do it.  Such a scheme would do two things:

- provide alias names for components.  For example, both of the following
  would be equivalent:

  mpirun --mca btl openib,self ...
  mpirun --mca btl ofrc,self ...

- automatically register alias MCA parameters.  For example, both of the
  following would be equivalent:

  mpirun --mca btl_openib_param 1 ...
  mpirun --mca btl_ofrc_param 1 ...

This would solve two problems:

2a. Finally be able to rename the "openib" module to something more sensical; "ofrc", perhaps?  
("ofrc" = OpenFabrics reliable connected transport, as opposed to the existing "ofud" = OpenFabrics 
unreliable datagram transport BTL).

2b. Rename vader to be xpmem, because it only supports xpmem at the moment.  If that 
component is expanded in the future to support other similar single-copy schemes, it can 
be renamed to some neutral name and have "xpmem" as an alias.

Nathan agreed to look into a module aliasing scheme / vader->xpmem rename after 
he works the hide-OB1/BTL-descriptor-lengths issue that was previously discussed 
on the list.  This will likely be in early/mid December.



On Nov 17, 2011, at 8:11 AM, Jeff Squyres wrote:


After having to explain to someone at SC for the umpteenth time this week that the "vader" BTL uses 
the XPMEM transport under the covers, I'd like to put forth an appeal to rename the "vader" BTL to 
be "xpmem."

Here's my rationale for why:

1. Although we have a history of Star Wars-related names, the "ob1" and "r2" 
components got their names because they're mainly algorithms that have no obvious name that 
describes what they do.

2. All other components that tie into some back-end system are named reflecting the back-end system 
(e.g., tcp, mx, portals, ...etc.).  "openib" is the weakest example, but we all know that 
it was named way back when OFED was named "OpenIB", and the name has kinda stuck.

3. The BTL name "xpmem" follows the law of least astonishment from the user's 
perspective.

4. Cute names rarely seem so after 6 months.

I'll even volunteer to do the work to rename it (a bunch of file moves and 
global search-and-replaces).

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/






--
Oracle
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle *- Performance Technologies*
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com 





Re: [OMPI devel] Rename "vader" BTL to "xpmem"

2011-11-22 Thread Jeff Squyres
1. Personally, I would love to rename the openib BTL to something that makes 
sense and is a current name.  By "rename", I include "rename the directory" -- 
so it would be ompi/mca/btl/ofrc, or something like that.

2. Good point about error reporting.  I would think that the component (e.g., 
ofrc/openib BTL) should report the name that was specified by the user.  But 
this could be a PITA to implement -- you couldn't just hard-code your component 
name in error messages anymore; there would have to be some level of 
indirection, such as a global variable that the MCA base fills in with the name 
that your component was referred to by.  


On Nov 22, 2011, at 9:34 AM, TERRY DONTJE wrote:

> So with the aliasing scheme the code for openib would still be under 
> ompi/mca/btl/openib but you could access it with -mca btl ofrc?  Ok, so when 
> an error happens in the openib btl how does it identify itself?  Does it use 
> openib or ofrc?  This seems like there could be some user confusion by 
> adopting the aliasing scheme.
> 
> --td
> 
> On 11/22/2011 9:22 AM, Jeff Squyres wrote:
>> Here's what Nathan and I discussed / decided:
>> 
>> 1. Nathan shied away from the name "xpmem" in case some other shared memory 
>> scheme basically did the same thing as XPMEM (i.e., single copy techniques). 
>>  (just FYI: xpmem's setup is a little different from KNEM, though, so they 
>> didn't merge in KNEM support to vader)  Hence, he wanted a neutral name that 
>> could apply to xpmem and others.  He and Sam have some possible names that 
>> could be suitable ("single copy ...something..."; I don't remember offhand).
>> 
>> 2. We've long talked about having an MCA component aliasing scheme.  Perhaps 
>> now is the time to do it.  Such a scheme would do two things:
>> 
>>- provide alias names for components.  For example, both of the following
>>  would be equivalent:
>> 
>>  mpirun --mca btl openib,self ...
>>  mpirun --mca btl ofrc,self ...
>> 
>>- automatically register alias MCA parameters.  For example, both of the
>>  following would be equivalent:
>> 
>>  mpirun --mca btl_openib_param 1 ...
>>  mpirun --mca btl_ofrc_param 1 ...
>> 
>> This would solve two problems:
>> 
>> 2a. Finally be able to rename the "openib" module to something more 
>> sensical; "ofrc", perhaps?  ("ofrc" = OpenFabrics reliable connected 
>> transport, as opposed to the existing "ofud" = OpenFabrics unreliable 
>> datagram transport BTL).
>> 
>> 2b. Rename vader to be xpmem, because it only supports xpmem at the moment.  
>> If that component is expanded in the future to support other similar 
>> single-copy schemes, it can be renamed to some neutral name and have "xpmem" 
>> as an alias.
>> 
>> Nathan agreed to look into a module aliasing scheme / vader->xpmem rename 
>> after he works the hide-OB1/BTL-descriptor-lengths issue that was previously 
>> discussed on the list.  This will likely be in early/mid December.
>> 
>> 
>> 
>> On Nov 17, 2011, at 8:11 AM, Jeff Squyres wrote:
>> 
>> 
>>> After having to explain to someone at SC for the umpteenth time this week 
>>> that the "vader" BTL uses the XPMEM transport under the covers, I'd like to 
>>> put forth an appeal to rename the "vader" BTL to be "xpmem."
>>> 
>>> Here's my rationale for why:
>>> 
>>> 1. Although we have a history of Star Wars-related names, the "ob1" and 
>>> "r2" components got their names because they're mainly algorithms that have 
>>> no obvious name that describes what they do.
>>> 
>>> 2. All other components that tie into some back-end system are named 
>>> reflecting the back-end system (e.g., tcp, mx, portals, ...etc.).  "openib" 
>>> is the weakest example, but we all know that it was named way back when 
>>> OFED was named "OpenIB", and the name has kinda stuck.
>>> 
>>> 3. The BTL name "xpmem" follows the law of least astonishment from the 
>>> user's perspective.
>>> 
>>> 4. Cute names rarely seem so after 6 months.
>>> 
>>> I'll even volunteer to do the work to rename it (a bunch of file moves and 
>>> global search-and-replaces).
>>> 
>>> -- 
>>> Jeff Squyres
>>> 
>>> jsquy...@cisco.com
>>> 
>>> For corporate legal information go to:
>>> 
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>> 
>>> 
>>> 
>> 
> 
> -- 
> 
> Terry D. Dontje | Principal Software Engineer
> Developer Tools Engineering | +1.781.442.2631
> Oracle - Performance Technologies
> 95 Network Drive, Burlington, MA 01803
> Email terry.don...@oracle.com
> 
> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] Rename "vader" BTL to "xpmem"

2011-11-22 Thread George Bosilca
-10!

  george.

On Nov 22, 2011, at 08:38 , Jeff Squyres wrote:

> 1. Personally, I would love to rename the openib BTL to something that makes 
> sense and is a current name.  By "rename", I include "rename the directory" 
> -- so it would be ompi/mca/btl/ofrc, or something like that.
> 
> 2. Good point about error reporting.  I would think that the component (e.g., 
> ofrc/openib BTL) should report the name that was specified by the user.  But 
> this could be a PITA to implement -- you couldn't just hard-code your 
> component name in error messages anymore; there would have to be some level 
> of indirection, such as a global variable that the MCA base fills in with the 
> name that your component was referred to by.  
> 
> 
> On Nov 22, 2011, at 9:34 AM, TERRY DONTJE wrote:
> 
>> So with the aliasing scheme the code for openib would still be under 
>> ompi/mca/btl/openib but you could access it with -mca btl ofrc?  Ok, so when 
>> an error happens in the openib btl how does it identify itself?  Does it use 
>> openib or ofrc?  This seems like there could be some user confusion by 
>> adopting the aliasing scheme.
>> 
>> --td
>> 
>> On 11/22/2011 9:22 AM, Jeff Squyres wrote:
>>> Here's what Nathan and I discussed / decided:
>>> 
>>> 1. Nathan shied away from the name "xpmem" in case some other shared memory 
>>> scheme basically did the same thing as XPMEM (i.e., single copy 
>>> techniques).  (just FYI: xpmem's setup is a little different from KNEM, 
>>> though, so they didn't merge in KNEM support to vader)  Hence, he wanted a 
>>> neutral name that could apply to xpmem and others.  He and Sam have some 
>>> possible names that could be suitable ("single copy ...something..."; I 
>>> don't remember offhand).
>>> 
>>> 2. We've long talked about having an MCA component aliasing scheme.  
>>> Perhaps now is the time to do it.  Such a scheme would do two things:
>>> 
>>>   - provide alias names for components.  For example, both of the following
>>> would be equivalent:
>>> 
>>> mpirun --mca btl openib,self ...
>>> mpirun --mca btl ofrc,self ...
>>> 
>>>   - automatically register alias MCA parameters.  For example, both of the
>>> following would be equivalent:
>>> 
>>> mpirun --mca btl_openib_param 1 ...
>>> mpirun --mca btl_ofrc_param 1 ...
>>> 
>>> This would solve two problems:
>>> 
>>> 2a. Finally be able to rename the "openib" module to something more 
>>> sensical; "ofrc", perhaps?  ("ofrc" = OpenFabrics reliable connected 
>>> transport, as opposed to the existing "ofud" = OpenFabrics unreliable 
>>> datagram transport BTL).
>>> 
>>> 2b. Rename vader to be xpmem, because it only supports xpmem at the moment. 
>>>  If that component is expanded in the future to support other similar 
>>> single-copy schemes, it can be renamed to some neutral name and have 
>>> "xpmem" as an alias.
>>> 
>>> Nathan agreed to look into a module aliasing scheme / vader->xpmem rename 
>>> after he works the hide-OB1/BTL-descriptor-lengths issue that was 
>>> previously discussed on the list.  This will likely be in early/mid 
>>> December.
>>> 
>>> 
>>> 
>>> On Nov 17, 2011, at 8:11 AM, Jeff Squyres wrote:
>>> 
>>> 
>>>> After having to explain to someone at SC for the umpteenth time this week 
>>>> that the "vader" BTL uses the XPMEM transport under the covers, I'd like 
>>>> to put forth an appeal to rename the "vader" BTL to be "xpmem."
>>>> 
>>>> Here's my rationale for why:
>>>> 
>>>> 1. Although we have a history of Star Wars-related names, the "ob1" and 
>>>> "r2" components got their names because they're mainly algorithms that 
>>>> have no obvious name that describes what they do.
>>>> 
>>>> 2. All other components that tie into some back-end system are named 
>>>> reflecting the back-end system (e.g., tcp, mx, portals, ...etc.).  
>>>> "openib" is the weakest example, but we all know that it was named way 
>>>> back when OFED was named "OpenIB", and the name has kinda stuck.
>>>> 
>>>> 3. The BTL name "xpmem" follows the law of least astonishment from the 
>>>> user's perspective.
>>>> 
>>>> 4. Cute names rarely seem so after 6 months.
>>>> 
>>>> I'll even volunteer to do the work to rename it (a bunch of file moves and 
>>>> global search-and-replaces).
>>>> 
>>>> -- 
>>>> Jeff Squyres
>>>> jsquy...@cisco.com
>>>> For corporate legal information go to:
>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>> 
>>>> ___
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> 
>> 
>> -- 
>> 
>> Terry D. Dontje | Principal Software Engineer
>> Developer Tools Engineering | +1.781.442.2631
>> Oracle - Performance Technologies
>> 95 Network Drive, Burlington, MA 01803
>> Email terry.don...@oracle.com
>> 
>> 
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mai
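
The aliasing idea discussed in this thread (an alias such as "ofrc" selecting
the "openib" component, and btl_ofrc_param meaning btl_openib_param) could be
sketched roughly as below. This is only an illustration, not Open MPI code:
alias_table, resolve_component and resolve_param are hypothetical names, the
"xpmem" -> "vader" entry simply mirrors the proposal, and the "btl_" prefix is
hard-coded for brevity.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical alias table: user-facing alias -> real component name. */
struct alias_entry {
    const char *alias;
    const char *component;
};

static const struct alias_entry alias_table[] = {
    { "ofrc",  "openib" },
    { "xpmem", "vader"  },
};

static const size_t alias_count = sizeof(alias_table) / sizeof(alias_table[0]);

/* Return the real component name for `name`, or `name` itself if it is
 * not a known alias. */
const char *resolve_component(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < alias_count; i++) {
        if (strcmp(name, alias_table[i].alias) == 0) {
            return alias_table[i].component;
        }
    }
    return name;
}

/* Rewrite "btl_<alias>_<rest>" as "btl_<component>_<rest>" into `out`.
 * Parameters that do not use an alias are copied unchanged.
 * Returns 0 on success, -1 if the result does not fit in `out`. */
int resolve_param(const char *param, char *out, size_t outlen)
{
    static const char prefix[] = "btl_";
    const size_t plen = sizeof(prefix) - 1;
    int written;

    assert(param != NULL && out != NULL);
    if (strncmp(param, prefix, plen) == 0) {
        const char *rest = param + plen;
        for (size_t i = 0; i < alias_count; i++) {
            size_t alen = strlen(alias_table[i].alias);
            /* Match the alias only when it is followed by '_'. */
            if (strncmp(rest, alias_table[i].alias, alen) == 0 &&
                rest[alen] == '_') {
                written = snprintf(out, outlen, "%s%s%s", prefix,
                                   alias_table[i].component, rest + alen);
                return (written >= 0 && (size_t)written < outlen) ? 0 : -1;
            }
        }
    }
    written = snprintf(out, outlen, "%s", param);
    return (written >= 0 && (size_t)written < outlen) ? 0 : -1;
}
```

With this sketch, resolve_component("ofrc") yields "openib", and
resolve_param("btl_ofrc_param", buf, sizeof(buf)) fills buf with
"btl_openib_param". Note that the lookup throws away the name the user typed;
to report errors under that name, as raised in the thread, a real scheme would
also have to keep the original string somewhere the component can read it.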

Re: [OMPI devel] [BUG?] OpenMPI with openib on SPARC64: Signal: Bus error (10)

2011-11-22 Thread Lukas Razik
Roland Dreier  wrote:
>
> On Mon, Nov 21, 2011 at 5:51 PM, Lukas Razik  wrote:
>> [cluster1:64027] Signal code: Invalid address alignment (1)
>> [cluster1:64027] Failing at address: 0xaa9053
>> [cluster1:64027] [ 0]
> /usr/mpi/gcc/openmpi-1.4.3/lib64/openmpi/mca_pml_ob1.so(+0x62f0)
> [0xf8010209e2f0]
>
> Seems like openmpi is doing a misaligned access somewhere...
>
> not sure how to turn this into a real location in the code, Open MPI guys??

Hello Roland,

one guy (Terry D. Dontje) already answered in the de...@open-mpi.org mailing 
list:
http://www.open-mpi.org/community/lists/devel/2011/11/10011.php

As I understand him, he thinks the same. I'm now trying what he suggested and 
will report back soon...
Thanks for your assessment!

Best regards,
Lukas



Re: [OMPI devel] [BUG?] OpenMPI with openib on SPARC64: Signal: Bus error (10)

2011-11-22 Thread Lukas Razik
TERRY DONTJE  wrote:
>On 11/22/2011 5:49 AM, TERRY DONTJE wrote:
>The error you are seeing is usually indicative of some code operating on 
>memory that isn't aligned properly for a SPARC instruction being used.  The 
>address that is causing the failure is odd aligned which is more than likely 
>the culprit.  If you have a core dump and can disassemble the code that is 
>being ran at the time it probably will be some sort of instruction requiring 
>an alignment.  If the MPI you are using is something you built can you try and 
>build OMPI with -g and get the line number in the PML that is failing?
>>
>>I haven't seen this type of error for some time but I do all of my SPARC 
>>testing on Solaris with Solaris Studio Compilers.  You may want to try to 
>>compile the benchmark with "-m32" to see if that helps.  Though being an odd 
>>address I suspect it might not.  If you can use the Studio Compilers you 
>>could try giving the compilers the -xmemalign=8i option when building the 
>>benchmark and see if that resolves the issue.  This would help to assure the 
>>issue is just an alignment of data we are slicing and dicing as opposed to 
>>wrongly addressing memory.
>>
>>
>>After thinking about this you probably won't be able to use the Studio 
>>Compilers because they only support compiling on Linux with x86 platforms not 
>>Linux with SPARC.  Not sure if gcc has anything like the xmemalign options.


Hello Terry,

we have no Solaris on the machines (anymore). The whole effort is to get Linux 
running on them...
With a lot of help from Roland Dreier and patches from David Miller, it seems 
the InfiniBand drivers now work on our SPARC64 machines with Debian. The only 
big piece of OFED that's still missing is OpenMPI.

--
BTW:
With Debian's gcc (Debian 4.4.5-8) I've built this new environment:
- binutils-2.21.1 (from gnu.org)
- gcc-4.4.6 (from gnu.org)
- libtool-2.2.6b (from gnu.org)

I used this new environment to build:
- linux-2.6.39.4 (from kernel.org)
- OFED-1.5.3.2 with openmpi-1.4.3, the ofa kernel modules etc. (from 
openfabrics.org)
- openmpi-1.4.4 (from open-mpi.org)
---

You asked for debugging information. Here you can see a screen shot of kdbg 
with the stack, the line number etc.
http://net.razik.de/linux/T5120/kdbg-openmpi-1.4.4-osu_latency.png

That's the backtrace of the core file made by gdb:
---
# gdb /usr/mpi/gcc/openmpi-1.4.3/tests/osu_benchmarks-3.1.1/osu_latency core

Reading symbols from 
/usr/mpi/gcc/openmpi-1.4.3/tests/osu_benchmarks-3.1.1/osu_latency...(no 
debugging symbols found)...done.
[New LWP 54054]
[New LWP 54055]
[New LWP 54056]
[Thread debugging using libthread_db enabled]
Core was generated by 
`/usr/mpi/gcc/openmpi-1.4.3/tests/osu_benchmarks-3.1.1/osu_latency'.
Program terminated with signal 10, Bus error.
#0  0xf8010229ba9c in mca_pml_ob1_send_request_start_copy 
(sendreq=0xb23200, bml_btl=0xb29050, size=0) at pml_ob1_sendreq.c:551
551 hdr->hdr_match.hdr_ctx = 
sendreq->req_send.req_base.req_comm->c_contextid;
(gdb) backtrace
#0  0xf8010229ba9c in mca_pml_ob1_send_request_start_copy 
(sendreq=0xb23200, bml_btl=0xb29050, size=0) at pml_ob1_sendreq.c:551
#1  0xf80102286d28 in mca_pml_ob1_send_request_start_btl (sendreq=0xb23200, 
bml_btl=0xb29050) at pml_ob1_sendreq.h:363
#2  0xf80102287050 in mca_pml_ob1_send_request_start (sendreq=0xb23200) at 
pml_ob1_sendreq.h:429
#3  0xf801022879ec in mca_pml_ob1_isend (buf=0x0, count=0, 
datatype=0xf80100290dc0, dst=1, tag=-16,
    sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x201b50, request=0x7feffa130c8) 
at pml_ob1_isend.c:87
#4  0xf8010343d338 in ompi_coll_tuned_sendrecv_actual (sendbuf=0x0, 
scount=0, sdatatype=0xf80100290dc0, dest=1, stag=-16,
    recvbuf=0x0, rcount=0, rdatatype=0xf80100290dc0, source=1, rtag=-16, 
comm=0x201b50, status=0x0) at coll_tuned_util.c:51
#5  0xf8010344fd94 in ompi_coll_tuned_barrier_intra_two_procs 
(comm=0x201b50, module=0xb2b070) at coll_tuned_barrier.c:258
#6  0xf8010343de94 in ompi_coll_tuned_barrier_intra_dec_fixed 
(comm=0x201b50, module=0xb2b070) at coll_tuned_decision_fixed.c:192
#7  0xf801000bfff0 in PMPI_Barrier (comm=0x201b50) at pbarrier.c:59
#8  0x00100f3c in main ()
---


That's the corresponding mpirun output:
---
# /usr/mpi/gcc/openmpi-1.4.4/bin/mpirun -np 2 -host cluster1,cluster2 
/usr/mpi/gcc/openmpi-1.4.3/tests/osu_benchmarks-3.1.1/osu_latency
# OSU MPI Latency Test v3.1.1
# Size    Latency (us)
[cluster1:54054] *** Process received signal ***
[cluster1:54054] Signal: Bus error (10)
[cluster1:54054] Signal code: Invalid address alignment (1)
[cluster1:54054] Failing at address: 0xad7393
[cluster1:54054] [ 0] 
/usr/mpi/gcc/openmpi-1.4.4/lib/openmpi/mca_pml_ob1.so(+0xed20) 
[0xf80102286d20]
[cluster1:54054] [ 1] 
/usr/mpi/gcc/openmpi-1.4.4/lib/openmpi/mca_pml_ob1.so(+0xf048) 
[0xf80102287048]
[cluster1:54054] [ 2] 
/usr/mpi/gcc/openmpi-1.4.4/lib/openmpi/mca_pml_ob1.so(+0xf9e4) 
[0xf80102

Re: [OMPI devel] [BUG?] OpenMPI with openib on SPARC64: Signal: Bus error (10)

2011-11-22 Thread Lukas Razik
Roland Dreier  wrote:

> On Tue, Nov 22, 2011 at 3:05 PM, Lukas Razik  wrote:
>>  #0  0xf8010229ba9c in mca_pml_ob1_send_request_start_copy 
> (sendreq=0xb23200, bml_btl=0xb29050, size=0) at pml_ob1_sendreq.c:551
>>  551 hdr->hdr_match.hdr_ctx = 
> sendreq->req_send.req_base.req_comm->c_contextid;
>>  (gdb) backtrace
> 
> If you can get into gdb here, I guess it would be useful to print the
> address of hdr->hdr_match.hdr_ctx and
> sendreq->req_send.req_base.req_comm->c_contextid to see which one is
> misaligned.
> 
> Not sure of the gdb syntax... does it work to just do
> 
> p &hdr->hdr_match.hdr_ctx
> p &sendreq->req_send.req_base.req_comm->c_contextid
> 

Oh, sorry that I didn't do that before...
The values are:
&hdr->hdr_match.hdr_ctx  =  (uint16_t *) 0xad7393
&sendreq->req_send.req_base.req_comm->c_contextid  =  (uint32_t *) 0x201c20

So hdr_ctx is the bad one: 0xad7393 is odd, so the uint16_t store is misaligned...

Regards,
Lukas


PS:
I never remember the syntax of gdb - hence I use the nice kdbg. *g*
http://net.razik.de/linux/T5120/kdbg-openmpi-1.4.4-osu_latency-02.png
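
For readers following along: the two addresses printed above can be checked
for alignment with a few lines of C, and a 16-bit field at an odd address can
be written safely with memcpy instead of a direct uint16_t store. This is a
generic sketch of the technique, not the fix that went into Open MPI;
is_aligned, store_u16_unaligned and load_u16_unaligned are hypothetical helper
names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Nonzero if `ptr` is a multiple of `align`, which must be a power of two. */
int is_aligned(const void *ptr, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t)ptr & (align - 1)) == 0;
}

/* Store a 16-bit value at an address that may be misaligned.  memcpy makes
 * no alignment assumption about `dst`, so the compiler has to emit accesses
 * that are legal on strict-alignment CPUs such as SPARC. */
void store_u16_unaligned(void *dst, uint16_t value)
{
    memcpy(dst, &value, sizeof(value));
}

/* Load a 16-bit value from an address that may be misaligned. */
uint16_t load_u16_unaligned(const void *src)
{
    uint16_t value;
    memcpy(&value, src, sizeof(value));
    return value;
}
```

Applied to the values above, is_aligned((void *)0x201c20, 4) is true and
is_aligned((void *)0xad7393, 2) is false, which matches the "Invalid address
alignment" signal code. A direct assignment through a uint16_t pointer at the
second address is what traps on SPARC; the memcpy variant avoids that at the
cost of byte-wise access.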