Re: [OMPI users] which info is needed for SIGSEGV in Java for openmpi-dev-124-g91e9686 on Solaris

2014-10-26 Thread Ralph Castain
Just took a glance thru 249 and have a few suggestions on it - will pass them 
along tomorrow. I think the right solution is to (a) dump opal_identifier_t in 
favor of using opal_process_name_t everywhere in the opal layer, (b) typedef 
orte_process_name_t to opal_process_name_t, and (c) leave ompi_process_name_t 
as typedef’d to the RTE component in the MPI layer. This lets other RTEs decide 
for themselves how they want to handle it.
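Roughly, the layering would look like this. A minimal sketch, assuming an
illustrative jobid/vpid layout for opal_process_name_t (the real header may
differ):

#include <stdint.h>

/* (a) one canonical name type, used everywhere in the opal layer */
typedef struct {
    uint32_t jobid;   /* illustrative fields, not the actual definition */
    uint32_t vpid;
} opal_process_name_t;

/* (b) orte simply aliases the opal type */
typedef opal_process_name_t orte_process_name_t;

/* (c) ompi keeps a typedef to whatever its RTE component provides */
typedef orte_process_name_t ompi_process_name_t;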

If you add changes to your branch, I can pass you a patch with my suggested 
alterations.

> On Oct 26, 2014, at 5:55 PM, Gilles Gouaillardet 
>  wrote:
> 
> No :-(
> It will take some extra work to stop declaring orte_process_name_t and 
> ompi_process_name_t variables.
> #249 will make things much easier.
> One option is to use opal_process_name_t everywhere, or to typedef the orte 
> and ompi types to the opal one.
> Another option (lightweight but error-prone, IMHO) is to change only the 
> variable declarations.
> Any thoughts?
> 
> Ralph Castain  wrote:
>> Will PR#249 solve it? If so, we should just go with it as I suspect that is 
>> the long-term solution.
>> 
>>> On Oct 26, 2014, at 4:25 PM, Gilles Gouaillardet 
>>>  wrote:
>>> 
>>> It looks like we faced a similar issue:
>>> opal_process_name_t is 64-bit aligned whereas orte_process_name_t is 32-bit 
>>> aligned. If you run on an alignment-sensitive CPU such as SPARC and you 
>>> are not lucky (so to speak), you can run into this issue.
>>> I will make a patch for this shortly.
>>> 
>>> Ralph Castain  wrote:
 Afraid this must be something about the Sparc - just ran on a Solaris 11 
 x86 box and everything works fine.
 
 
> On Oct 26, 2014, at 8:22 AM, Siegmar Gross 
>  wrote:
> 
> Hi Gilles,
> 
> I wanted to explore which function is called, when I call MPI_Init
> in a C program, because this function should be called from a Java
> program as well. Unfortunately C programs break with a Bus Error
> once more for openmpi-dev-124-g91e9686 on Solaris. I assume that's
> the reason why I get no useful backtrace for my Java program.
> 
> tyr small_prog 117 mpicc -o init_finalize init_finalize.c
> tyr small_prog 118 /usr/local/gdb-7.6.1_64_gcc/bin/gdb mpiexec
> ...
> (gdb) run -np 1 init_finalize
> Starting program: /usr/local/openmpi-1.9.0_64_gcc/bin/mpiexec -np 1 
> init_finalize
> [Thread debugging using libthread_db enabled]
> [New Thread 1 (LWP 1)]
> [New LWP2]
> [tyr:19240] *** Process received signal ***
> [tyr:19240] Signal: Bus Error (10)
> [tyr:19240] Signal code: Invalid address alignment (1)
> [tyr:19240] Failing at address: 7bd1c10c
> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-pal.so.0.0.0:opal_backtrace_print+0x2c
> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-pal.so.0.0.0:0xdcc04
> /lib/sparcv9/libc.so.1:0xd8b98
> /lib/sparcv9/libc.so.1:0xcc70c
> /lib/sparcv9/libc.so.1:0xcc918
> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-pal.so.0.0.0:opal_proc_set_name+0x1c
>  [ Signal 10 (BUS)]
> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/openmpi/mca_pmix_native.so:0x103e8
> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/openmpi/mca_ess_pmi.so:0x33dc
> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-rte.so.0.0.0:orte_init+0x67c
> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libmpi.so.0.0.0:ompi_mpi_init+0x374
> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libmpi.so.0.0.0:PMPI_Init+0x2a8
> /home/fd1026/work/skripte/master/parallel/prog/mpi/small_prog/init_finalize:main+0x20
> /home/fd1026/work/skripte/master/parallel/prog/mpi/small_prog/init_finalize:_start+0x7c
> [tyr:19240] *** End of error message ***
> --
> mpiexec noticed that process rank 0 with PID 0 on node tyr exited on 
> signal 10 (Bus Error).
> --
> [LWP2 exited]
> [New Thread 2]
> [Switching to Thread 1 (LWP 1)]
> sol_thread_fetch_registers: td_ta_map_id2thr: no thread can be found to 
> satisfy query
> (gdb) bt
> #0  0x7f6173d0 in rtld_db_dlactivity () from 
> /usr/lib/sparcv9/ld.so.1
> #1  0x7f6175a8 in rd_event () from /usr/lib/sparcv9/ld.so.1
> #2  0x7f618950 in lm_delete () from /usr/lib/sparcv9/ld.so.1
> #3  0x7f6226bc in remove_so () from /usr/lib/sparcv9/ld.so.1
> #4  0x7f624574 in remove_hdl () from /usr/lib/sparcv9/ld.so.1
> #5  0x7f61d97c in dlclose_core () from /usr/lib/sparcv9/ld.so.1
> #6  0x7f61d9d4 in dlclose_intn () from /usr/lib/sparcv9/ld.so.1
> #7  

Re: [OMPI users] which info is needed for SIGSEGV in Java for openmpi-dev-124-g91e9686 on Solaris

2014-10-26 Thread Ralph Castain
Oh yeah - that would indeed be very bad :-(


> On Oct 26, 2014, at 6:06 PM, Kawashima, Takahiro  
> wrote:
> 
> Siegmar, Oscar,
> 
> I suspect that the problem is calling mca_base_var_register
> without initializing OPAL in JNI_OnLoad.
> 
> ompi/mpi/java/c/mpi_MPI.c:
> 
> jint JNI_OnLoad(JavaVM *vm, void *reserved)
> {
>libmpi = dlopen("libmpi." OPAL_DYN_LIB_SUFFIX, RTLD_NOW | RTLD_GLOBAL);
> 
>if(libmpi == NULL)
>{
>fprintf(stderr, "Java bindings failed to load liboshmem.\n");
>exit(1);
>}
> 
>mca_base_var_register("ompi", "mpi", "java", "eager",
>  "Java buffers eager size",
>  MCA_BASE_VAR_TYPE_INT, NULL, 0, 0,
>  OPAL_INFO_LVL_5,
>  MCA_BASE_VAR_SCOPE_READONLY,
>  &ompi_mpi_java_eager);
> 
>return JNI_VERSION_1_6;
> }
> 
> 
> I suppose JNI_OnLoad is the first function in libmpi_java.so that is 
> called by the JVM, so OPAL is not yet initialized.
> As shown in Siegmar's JRE log, SEGV occurred in asprintf called
> by mca_base_var_cache_files.
> 
> Siegmar's hs_err_pid13080.log:
> 
> siginfo:si_signo=SIGSEGV: si_errno=0, si_code=1 (SEGV_MAPERR), 
> si_addr=0x
> 
> Stack: [0x7b40,0x7b50],  sp=0x7b4fc730,  free 
> space=1009k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native 
> code)
> C  [libc.so.1+0x3c7f0]  strlen+0x50
> C  [libc.so.1+0xaf640]  vsnprintf+0x84
> C  [libc.so.1+0xaadb4]  vasprintf+0x20
> C  [libc.so.1+0xaaf04]  asprintf+0x28
> C  [libopen-pal.so.0.0.0+0xaf3cc]  mca_base_var_cache_files+0x160
> C  [libopen-pal.so.0.0.0+0xaed90]  mca_base_var_init+0x4e8
> C  [libopen-pal.so.0.0.0+0xb260c]  register_variable+0x214
> C  [libopen-pal.so.0.0.0+0xb36a0]  mca_base_var_register+0x104
> C  [libmpi_java.so.0.0.0+0x221e8]  JNI_OnLoad+0x128
> C  [libjava.so+0x10860]  
> Java_java_lang_ClassLoader_00024NativeLibrary_load+0xb8
> j  java.lang.ClassLoader$NativeLibrary.load(Ljava/lang/String;Z)V+-665819
> j  java.lang.ClassLoader$NativeLibrary.load(Ljava/lang/String;Z)V+0
> j  java.lang.ClassLoader.loadLibrary0(Ljava/lang/Class;Ljava/io/File;)Z+328
> j  
> java.lang.ClassLoader.loadLibrary(Ljava/lang/Class;Ljava/lang/String;Z)V+290
> j  java.lang.Runtime.loadLibrary0(Ljava/lang/Class;Ljava/lang/String;)V+54
> j  java.lang.System.loadLibrary(Ljava/lang/String;)V+7
> j  mpi.MPI.()V+28
> 
> 
> mca_base_var_cache_files passes opal_install_dirs.sysconfdir to
> asprintf.
> 
> opal/mca/base/mca_base_var.c:
> 
>    asprintf(&mca_base_var_files, "%s"OPAL_PATH_SEP".openmpi" OPAL_PATH_SEP
> "mca-params.conf%c%s" OPAL_PATH_SEP "openmpi-mca-params.conf",
> home, OPAL_ENV_SEP, opal_install_dirs.sysconfdir);
> 
> 
> In this situation, opal_install_dirs.sysconfdir is still NULL.
> 
> I ran an MPI Java program that only calls MPI.Init() and
> MPI.Finalize() with the MCA variable mpi_show_mca_params=1 on
> Linux to confirm this; mca_base_param_files contains "(null)".
> 
> mpi_show_mca_params=1:
> 
> [ppc:12232] 
> mca_base_param_files=/home/rivis/.openmpi/mca-params.conf:(null)/openmpi-mca-params.conf
>  (default)
> [ppc:12232] 
> mca_param_files=/home/rivis/.openmpi/mca-params.conf:(null)/openmpi-mca-params.conf
>  (default)
> [ppc:12232] 
> mca_base_override_param_file=(null)/openmpi-mca-params-override.conf (default)
> [ppc:12232] mca_base_suppress_override_warning=false (default)
> [ppc:12232] mca_base_param_file_prefix= (default)
> [ppc:12232] 
> mca_base_param_file_path=(null)/amca-param-sets:/home/rivis/src/mpisample 
> (default)
> [ppc:12232] mca_base_param_file_path_force= (default)
> [ppc:12232] mca_base_env_list= (default)
> [ppc:12232] mca_base_env_list_delimiter=; (default)
> [ppc:12232] mpi_java_eager=65536 (default)
> (snip)
> 
> 
> GNU libc formats the NULL argument as "(null)" in asprintf(&buf, "%s", NULL),
> but Solaris libc raises SEGV for it. I think this explains the difference
> between Siegmar's runs on Linux and Solaris.
> 
> I think this mca_base_var_register call should be moved elsewhere, or
> opal_init_util (or something similar) should be called before it.
> 
> Thanks,
> Takahiro
> 
>> Hi Gilles,
>> 
>> thank you very much for the quick tutorial. Unfortunately I still
>> can't get a backtrace.
>> 
>>> You might need to configure with --enable-debug and add -g -O0
>>> to your CFLAGS and LDFLAGS
>>> 
>>> Then once you attach with gdb, you have 

Re: [OMPI users] which info is needed for SIGSEGV in Java for openmpi-dev-124-g91e9686 on Solaris

2014-10-26 Thread Kawashima, Takahiro
Siegmar, Oscar,

I suspect that the problem is calling mca_base_var_register
without initializing OPAL in JNI_OnLoad.

ompi/mpi/java/c/mpi_MPI.c:

jint JNI_OnLoad(JavaVM *vm, void *reserved)
{
libmpi = dlopen("libmpi." OPAL_DYN_LIB_SUFFIX, RTLD_NOW | RTLD_GLOBAL);

if(libmpi == NULL)
{
fprintf(stderr, "Java bindings failed to load liboshmem.\n");
exit(1);
}

mca_base_var_register("ompi", "mpi", "java", "eager",
  "Java buffers eager size",
  MCA_BASE_VAR_TYPE_INT, NULL, 0, 0,
  OPAL_INFO_LVL_5,
  MCA_BASE_VAR_SCOPE_READONLY,
  &ompi_mpi_java_eager);

return JNI_VERSION_1_6;
}


I suppose JNI_OnLoad is the first function in libmpi_java.so that is
called by the JVM, so OPAL is not yet initialized.
As shown in Siegmar's JRE log, SEGV occurred in asprintf called
by mca_base_var_cache_files.

Siegmar's hs_err_pid13080.log:

siginfo:si_signo=SIGSEGV: si_errno=0, si_code=1 (SEGV_MAPERR), 
si_addr=0x

Stack: [0x7b40,0x7b50],  sp=0x7b4fc730,  free 
space=1009k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libc.so.1+0x3c7f0]  strlen+0x50
C  [libc.so.1+0xaf640]  vsnprintf+0x84
C  [libc.so.1+0xaadb4]  vasprintf+0x20
C  [libc.so.1+0xaaf04]  asprintf+0x28
C  [libopen-pal.so.0.0.0+0xaf3cc]  mca_base_var_cache_files+0x160
C  [libopen-pal.so.0.0.0+0xaed90]  mca_base_var_init+0x4e8
C  [libopen-pal.so.0.0.0+0xb260c]  register_variable+0x214
C  [libopen-pal.so.0.0.0+0xb36a0]  mca_base_var_register+0x104
C  [libmpi_java.so.0.0.0+0x221e8]  JNI_OnLoad+0x128
C  [libjava.so+0x10860]  Java_java_lang_ClassLoader_00024NativeLibrary_load+0xb8
j  java.lang.ClassLoader$NativeLibrary.load(Ljava/lang/String;Z)V+-665819
j  java.lang.ClassLoader$NativeLibrary.load(Ljava/lang/String;Z)V+0
j  java.lang.ClassLoader.loadLibrary0(Ljava/lang/Class;Ljava/io/File;)Z+328
j  java.lang.ClassLoader.loadLibrary(Ljava/lang/Class;Ljava/lang/String;Z)V+290
j  java.lang.Runtime.loadLibrary0(Ljava/lang/Class;Ljava/lang/String;)V+54
j  java.lang.System.loadLibrary(Ljava/lang/String;)V+7
j  mpi.MPI.()V+28


mca_base_var_cache_files passes opal_install_dirs.sysconfdir to
asprintf.

opal/mca/base/mca_base_var.c:

asprintf(&mca_base_var_files, "%s"OPAL_PATH_SEP".openmpi" OPAL_PATH_SEP
 "mca-params.conf%c%s" OPAL_PATH_SEP "openmpi-mca-params.conf",
 home, OPAL_ENV_SEP, opal_install_dirs.sysconfdir);


In this situation, opal_install_dirs.sysconfdir is still NULL.

I ran an MPI Java program that only calls MPI.Init() and
MPI.Finalize() with the MCA variable mpi_show_mca_params=1 on
Linux to confirm this; mca_base_param_files contains "(null)".

mpi_show_mca_params=1:

[ppc:12232] 
mca_base_param_files=/home/rivis/.openmpi/mca-params.conf:(null)/openmpi-mca-params.conf
 (default)
[ppc:12232] 
mca_param_files=/home/rivis/.openmpi/mca-params.conf:(null)/openmpi-mca-params.conf
 (default)
[ppc:12232] 
mca_base_override_param_file=(null)/openmpi-mca-params-override.conf (default)
[ppc:12232] mca_base_suppress_override_warning=false (default)
[ppc:12232] mca_base_param_file_prefix= (default)
[ppc:12232] 
mca_base_param_file_path=(null)/amca-param-sets:/home/rivis/src/mpisample 
(default)
[ppc:12232] mca_base_param_file_path_force= (default)
[ppc:12232] mca_base_env_list= (default)
[ppc:12232] mca_base_env_list_delimiter=; (default)
[ppc:12232] mpi_java_eager=65536 (default)
(snip)


GNU libc formats the NULL argument as "(null)" in asprintf(&buf, "%s", NULL),
but Solaris libc raises SEGV for it. I think this explains the difference
between Siegmar's runs on Linux and Solaris.
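A tiny stand-alone reproducer of that libc difference (a sketch; sysconfdir
below merely stands in for opal_install_dirs.sysconfdir):

#define _GNU_SOURCE        /* asprintf on glibc */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    char *buf = NULL;
    const char *sysconfdir = NULL;   /* stands in for opal_install_dirs.sysconfdir */

    /* glibc formats the NULL "%s" argument as "(null)";
       Solaris libc passes NULL to strlen() and crashes, as in the JRE log */
    if (asprintf(&buf, "%s/openmpi-mca-params.conf", sysconfdir) != -1) {
        printf("%s\n", buf);
        free(buf);
    }
    return 0;
}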

I think this mca_base_var_register call should be moved elsewhere, or
opal_init_util (or something similar) should be called before it.
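One possible shape of that fix (a sketch only, not a tested patch; it assumes
opal_init_util() may safely be called from JNI_OnLoad and that it populates
opal_install_dirs before the variable system is used):

jint JNI_OnLoad(JavaVM *vm, void *reserved)
{
    libmpi = dlopen("libmpi." OPAL_DYN_LIB_SUFFIX, RTLD_NOW | RTLD_GLOBAL);

    if(libmpi == NULL)
    {
        fprintf(stderr, "Java bindings failed to load libmpi.\n");
        exit(1);
    }

    /* assumption: bring up OPAL (install dirs, MCA variable system)
       before the first mca_base_var_register() call */
    opal_init_util(NULL, NULL);

    mca_base_var_register("ompi", "mpi", "java", "eager",
                          "Java buffers eager size",
                          MCA_BASE_VAR_TYPE_INT, NULL, 0, 0,
                          OPAL_INFO_LVL_5,
                          MCA_BASE_VAR_SCOPE_READONLY,
                          &ompi_mpi_java_eager);

    return JNI_VERSION_1_6;
}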

Thanks,
Takahiro

> Hi Gilles,
> 
> thank you very much for the quick tutorial. Unfortunately I still
> can't get a backtrace.
> 
> > You might need to configure with --enable-debug and add -g -O0
> > to your CFLAGS and LDFLAGS
> > 
> > Then once you attach with gdb, you have to find the thread that is polling :
> > thread 1
> > bt
> > thread 2
> > bt
> > and so on until you find the good thread
> > If _dbg is a local variable, you need to select the right frame
> > before you can change the value :
> > get the frame number from bt (generally 1 under linux)
> > f 
> > set _dbg=0
> > 
> > I hope this helps
> 
> 

Re: [OMPI users] which info is needed for SIGSEGV in Java for openmpi-dev-124-g91e9686 on Solaris

2014-10-26 Thread Gilles Gouaillardet
No :-(
It will take some extra work to stop declaring orte_process_name_t and 
ompi_process_name_t variables.
#249 will make things much easier.
One option is to use opal_process_name_t everywhere, or to typedef the orte 
and ompi types to the opal one.
Another option (lightweight but error-prone, IMHO) is to change only the 
variable declarations.
Any thoughts?

Ralph Castain  wrote:
>Will PR#249 solve it? If so, we should just go with it as I suspect that is 
>the long-term solution.
>
>> On Oct 26, 2014, at 4:25 PM, Gilles Gouaillardet 
>>  wrote:
>> 
>> It looks like we faced a similar issue:
>> opal_process_name_t is 64-bit aligned whereas orte_process_name_t is 32-bit 
>> aligned. If you run on an alignment-sensitive CPU such as SPARC and you 
>> are not lucky (so to speak), you can run into this issue.
>> I will make a patch for this shortly.
>> 
>> Ralph Castain  wrote:
>>> Afraid this must be something about the Sparc - just ran on a Solaris 11 
>>> x86 box and everything works fine.
>>> 
>>> 
 On Oct 26, 2014, at 8:22 AM, Siegmar Gross 
  wrote:
 
 Hi Gilles,
 
 I wanted to explore which function is called, when I call MPI_Init
 in a C program, because this function should be called from a Java
 program as well. Unfortunately C programs break with a Bus Error
 once more for openmpi-dev-124-g91e9686 on Solaris. I assume that's
 the reason why I get no useful backtrace for my Java program.
 
 tyr small_prog 117 mpicc -o init_finalize init_finalize.c
 tyr small_prog 118 /usr/local/gdb-7.6.1_64_gcc/bin/gdb mpiexec
 ...
 (gdb) run -np 1 init_finalize
 Starting program: /usr/local/openmpi-1.9.0_64_gcc/bin/mpiexec -np 1 
 init_finalize
 [Thread debugging using libthread_db enabled]
 [New Thread 1 (LWP 1)]
 [New LWP2]
 [tyr:19240] *** Process received signal ***
 [tyr:19240] Signal: Bus Error (10)
 [tyr:19240] Signal code: Invalid address alignment (1)
 [tyr:19240] Failing at address: 7bd1c10c
 /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-pal.so.0.0.0:opal_backtrace_print+0x2c
 /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-pal.so.0.0.0:0xdcc04
 /lib/sparcv9/libc.so.1:0xd8b98
 /lib/sparcv9/libc.so.1:0xcc70c
 /lib/sparcv9/libc.so.1:0xcc918
 /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-pal.so.0.0.0:opal_proc_set_name+0x1c
  [ Signal 10 (BUS)]
 /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/openmpi/mca_pmix_native.so:0x103e8
 /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/openmpi/mca_ess_pmi.so:0x33dc
 /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-rte.so.0.0.0:orte_init+0x67c
 /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libmpi.so.0.0.0:ompi_mpi_init+0x374
 /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libmpi.so.0.0.0:PMPI_Init+0x2a8
 /home/fd1026/work/skripte/master/parallel/prog/mpi/small_prog/init_finalize:main+0x20
 /home/fd1026/work/skripte/master/parallel/prog/mpi/small_prog/init_finalize:_start+0x7c
 [tyr:19240] *** End of error message ***
 --
 mpiexec noticed that process rank 0 with PID 0 on node tyr exited on 
 signal 10 (Bus Error).
 --
 [LWP2 exited]
 [New Thread 2]
 [Switching to Thread 1 (LWP 1)]
 sol_thread_fetch_registers: td_ta_map_id2thr: no thread can be found to 
 satisfy query
 (gdb) bt
 #0  0x7f6173d0 in rtld_db_dlactivity () from 
 /usr/lib/sparcv9/ld.so.1
 #1  0x7f6175a8 in rd_event () from /usr/lib/sparcv9/ld.so.1
 #2  0x7f618950 in lm_delete () from /usr/lib/sparcv9/ld.so.1
 #3  0x7f6226bc in remove_so () from /usr/lib/sparcv9/ld.so.1
 #4  0x7f624574 in remove_hdl () from /usr/lib/sparcv9/ld.so.1
 #5  0x7f61d97c in dlclose_core () from /usr/lib/sparcv9/ld.so.1
 #6  0x7f61d9d4 in dlclose_intn () from /usr/lib/sparcv9/ld.so.1
 #7  0x7f61db0c in dlclose () from /usr/lib/sparcv9/ld.so.1
 #8  0x7ec87f60 in vm_close (loader_data=0x0, 
 module=0x7c901fe0)
   at ../../../openmpi-dev-124-g91e9686/opal/libltdl/loaders/dlopen.c:212
 #9  0x7ec85534 in lt_dlclose (handle=0x100189b50)
   at ../../../openmpi-dev-124-g91e9686/opal/libltdl/ltdl.c:1982
 #10 0x7ecaabd4 in ri_destructor (obj=0x1001893a0)
   at 
 ../../../../openmpi-dev-124-g91e9686/opal/mca/base/mca_base_component_repository.c:382
 #11 0x7eca9504 in opal_obj_run_destructors (object=0x1001893a0)
   at ../../../../openmpi-dev-124-g91e9686/opal/class/opal_object.h:446
 #12 0x7ecaa474 in 

Re: [OMPI users] which info is needed for SIGSEGV in Java for openmpi-dev-124-g91e9686 on Solaris

2014-10-26 Thread Ralph Castain
Will PR#249 solve it? If so, we should just go with it as I suspect that is the 
long-term solution.

> On Oct 26, 2014, at 4:25 PM, Gilles Gouaillardet 
>  wrote:
> 
> It looks like we faced a similar issue:
> opal_process_name_t is 64-bit aligned whereas orte_process_name_t is 32-bit 
> aligned. If you run on an alignment-sensitive CPU such as SPARC and you are 
> not lucky (so to speak), you can run into this issue.
> I will make a patch for this shortly.
> 
> Ralph Castain  wrote:
>> Afraid this must be something about the Sparc - just ran on a Solaris 11 x86 
>> box and everything works fine.
>> 
>> 
>>> On Oct 26, 2014, at 8:22 AM, Siegmar Gross 
>>>  wrote:
>>> 
>>> Hi Gilles,
>>> 
>>> I wanted to explore which function is called, when I call MPI_Init
>>> in a C program, because this function should be called from a Java
>>> program as well. Unfortunately C programs break with a Bus Error
>>> once more for openmpi-dev-124-g91e9686 on Solaris. I assume that's
>>> the reason why I get no useful backtrace for my Java program.
>>> 
>>> tyr small_prog 117 mpicc -o init_finalize init_finalize.c
>>> tyr small_prog 118 /usr/local/gdb-7.6.1_64_gcc/bin/gdb mpiexec
>>> ...
>>> (gdb) run -np 1 init_finalize
>>> Starting program: /usr/local/openmpi-1.9.0_64_gcc/bin/mpiexec -np 1 
>>> init_finalize
>>> [Thread debugging using libthread_db enabled]
>>> [New Thread 1 (LWP 1)]
>>> [New LWP2]
>>> [tyr:19240] *** Process received signal ***
>>> [tyr:19240] Signal: Bus Error (10)
>>> [tyr:19240] Signal code: Invalid address alignment (1)
>>> [tyr:19240] Failing at address: 7bd1c10c
>>> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-pal.so.0.0.0:opal_backtrace_print+0x2c
>>> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-pal.so.0.0.0:0xdcc04
>>> /lib/sparcv9/libc.so.1:0xd8b98
>>> /lib/sparcv9/libc.so.1:0xcc70c
>>> /lib/sparcv9/libc.so.1:0xcc918
>>> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-pal.so.0.0.0:opal_proc_set_name+0x1c
>>>  [ Signal 10 (BUS)]
>>> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/openmpi/mca_pmix_native.so:0x103e8
>>> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/openmpi/mca_ess_pmi.so:0x33dc
>>> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-rte.so.0.0.0:orte_init+0x67c
>>> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libmpi.so.0.0.0:ompi_mpi_init+0x374
>>> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libmpi.so.0.0.0:PMPI_Init+0x2a8
>>> /home/fd1026/work/skripte/master/parallel/prog/mpi/small_prog/init_finalize:main+0x20
>>> /home/fd1026/work/skripte/master/parallel/prog/mpi/small_prog/init_finalize:_start+0x7c
>>> [tyr:19240] *** End of error message ***
>>> --
>>> mpiexec noticed that process rank 0 with PID 0 on node tyr exited on signal 
>>> 10 (Bus Error).
>>> --
>>> [LWP2 exited]
>>> [New Thread 2]
>>> [Switching to Thread 1 (LWP 1)]
>>> sol_thread_fetch_registers: td_ta_map_id2thr: no thread can be found to 
>>> satisfy query
>>> (gdb) bt
>>> #0  0x7f6173d0 in rtld_db_dlactivity () from 
>>> /usr/lib/sparcv9/ld.so.1
>>> #1  0x7f6175a8 in rd_event () from /usr/lib/sparcv9/ld.so.1
>>> #2  0x7f618950 in lm_delete () from /usr/lib/sparcv9/ld.so.1
>>> #3  0x7f6226bc in remove_so () from /usr/lib/sparcv9/ld.so.1
>>> #4  0x7f624574 in remove_hdl () from /usr/lib/sparcv9/ld.so.1
>>> #5  0x7f61d97c in dlclose_core () from /usr/lib/sparcv9/ld.so.1
>>> #6  0x7f61d9d4 in dlclose_intn () from /usr/lib/sparcv9/ld.so.1
>>> #7  0x7f61db0c in dlclose () from /usr/lib/sparcv9/ld.so.1
>>> #8  0x7ec87f60 in vm_close (loader_data=0x0, 
>>> module=0x7c901fe0)
>>>   at ../../../openmpi-dev-124-g91e9686/opal/libltdl/loaders/dlopen.c:212
>>> #9  0x7ec85534 in lt_dlclose (handle=0x100189b50)
>>>   at ../../../openmpi-dev-124-g91e9686/opal/libltdl/ltdl.c:1982
>>> #10 0x7ecaabd4 in ri_destructor (obj=0x1001893a0)
>>>   at 
>>> ../../../../openmpi-dev-124-g91e9686/opal/mca/base/mca_base_component_repository.c:382
>>> #11 0x7eca9504 in opal_obj_run_destructors (object=0x1001893a0)
>>>   at ../../../../openmpi-dev-124-g91e9686/opal/class/opal_object.h:446
>>> #12 0x7ecaa474 in mca_base_component_repository_release (
>>>   component=0x7b1236f0 )
>>>   at 
>>> ../../../../openmpi-dev-124-g91e9686/opal/mca/base/mca_base_component_repository.c:240
>>> #13 0x7ecac774 in mca_base_component_unload (
>>>   component=0x7b1236f0 , output_id=-1)
>>>   at 
>>> ../../../../openmpi-dev-124-g91e9686/opal/mca/base/mca_base_components_close.c:47
>>> #14 0x7ecac808 in mca_base_component_close (
>>>   component=0x7b1236f0 , output_id=-1)

Re: [OMPI users] which info is needed for SIGSEGV in Java for openmpi-dev-124-g91e9686 on Solaris

2014-10-26 Thread Gilles Gouaillardet
It looks like we faced a similar issue:
opal_process_name_t is 64-bit aligned whereas orte_process_name_t is 32-bit 
aligned. If you run on an alignment-sensitive CPU such as SPARC and you are 
not lucky (so to speak), you can run into this issue.
I will make a patch for this shortly.
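A minimal illustration of that kind of alignment trap, with stand-in types
rather than the real opal/orte structures: on SPARC, dereferencing a 64-bit
value through a pointer that is only 4-byte aligned raises SIGBUS ("invalid
address alignment"), which matches the opal_proc_set_name frame in Siegmar's
trace; a byte-wise copy avoids the trap.

#include <stdint.h>
#include <string.h>

typedef struct { uint32_t jobid; uint32_t vpid; } name32_t;  /* 4-byte alignment */
typedef uint64_t name64_t;                                   /* 8-byte alignment */

/* SIGBUS on SPARC if dst points into a name32_t that sits on a
   4-byte (but not 8-byte) boundary */
void set_name_unsafe(name64_t *dst, name64_t value)
{
    *dst = value;
}

/* a byte-wise copy makes no alignment assumption */
void set_name_safe(void *dst, name64_t value)
{
    memcpy(dst, &value, sizeof value);
}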

Ralph Castain  wrote:
>Afraid this must be something about the Sparc - just ran on a Solaris 11 x86 
>box and everything works fine.
>
>
>> On Oct 26, 2014, at 8:22 AM, Siegmar Gross 
>>  wrote:
>> 
>> Hi Gilles,
>> 
>> I wanted to explore which function is called, when I call MPI_Init
>> in a C program, because this function should be called from a Java
>> program as well. Unfortunately C programs break with a Bus Error
>> once more for openmpi-dev-124-g91e9686 on Solaris. I assume that's
>> the reason why I get no useful backtrace for my Java program.
>> 
>> tyr small_prog 117 mpicc -o init_finalize init_finalize.c
>> tyr small_prog 118 /usr/local/gdb-7.6.1_64_gcc/bin/gdb mpiexec
>> ...
>> (gdb) run -np 1 init_finalize
>> Starting program: /usr/local/openmpi-1.9.0_64_gcc/bin/mpiexec -np 1 
>> init_finalize
>> [Thread debugging using libthread_db enabled]
>> [New Thread 1 (LWP 1)]
>> [New LWP2]
>> [tyr:19240] *** Process received signal ***
>> [tyr:19240] Signal: Bus Error (10)
>> [tyr:19240] Signal code: Invalid address alignment (1)
>> [tyr:19240] Failing at address: 7bd1c10c
>> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-pal.so.0.0.0:opal_backtrace_print+0x2c
>> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-pal.so.0.0.0:0xdcc04
>> /lib/sparcv9/libc.so.1:0xd8b98
>> /lib/sparcv9/libc.so.1:0xcc70c
>> /lib/sparcv9/libc.so.1:0xcc918
>> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-pal.so.0.0.0:opal_proc_set_name+0x1c
>>  [ Signal 10 (BUS)]
>> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/openmpi/mca_pmix_native.so:0x103e8
>> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/openmpi/mca_ess_pmi.so:0x33dc
>> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-rte.so.0.0.0:orte_init+0x67c
>> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libmpi.so.0.0.0:ompi_mpi_init+0x374
>> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libmpi.so.0.0.0:PMPI_Init+0x2a8
>> /home/fd1026/work/skripte/master/parallel/prog/mpi/small_prog/init_finalize:main+0x20
>> /home/fd1026/work/skripte/master/parallel/prog/mpi/small_prog/init_finalize:_start+0x7c
>> [tyr:19240] *** End of error message ***
>> --
>> mpiexec noticed that process rank 0 with PID 0 on node tyr exited on signal 
>> 10 (Bus Error).
>> --
>> [LWP2 exited]
>> [New Thread 2]
>> [Switching to Thread 1 (LWP 1)]
>> sol_thread_fetch_registers: td_ta_map_id2thr: no thread can be found to 
>> satisfy query
>> (gdb) bt
>> #0  0x7f6173d0 in rtld_db_dlactivity () from /usr/lib/sparcv9/ld.so.1
>> #1  0x7f6175a8 in rd_event () from /usr/lib/sparcv9/ld.so.1
>> #2  0x7f618950 in lm_delete () from /usr/lib/sparcv9/ld.so.1
>> #3  0x7f6226bc in remove_so () from /usr/lib/sparcv9/ld.so.1
>> #4  0x7f624574 in remove_hdl () from /usr/lib/sparcv9/ld.so.1
>> #5  0x7f61d97c in dlclose_core () from /usr/lib/sparcv9/ld.so.1
>> #6  0x7f61d9d4 in dlclose_intn () from /usr/lib/sparcv9/ld.so.1
>> #7  0x7f61db0c in dlclose () from /usr/lib/sparcv9/ld.so.1
>> #8  0x7ec87f60 in vm_close (loader_data=0x0, 
>> module=0x7c901fe0)
>>at ../../../openmpi-dev-124-g91e9686/opal/libltdl/loaders/dlopen.c:212
>> #9  0x7ec85534 in lt_dlclose (handle=0x100189b50)
>>at ../../../openmpi-dev-124-g91e9686/opal/libltdl/ltdl.c:1982
>> #10 0x7ecaabd4 in ri_destructor (obj=0x1001893a0)
>>at 
>> ../../../../openmpi-dev-124-g91e9686/opal/mca/base/mca_base_component_repository.c:382
>> #11 0x7eca9504 in opal_obj_run_destructors (object=0x1001893a0)
>>at ../../../../openmpi-dev-124-g91e9686/opal/class/opal_object.h:446
>> #12 0x7ecaa474 in mca_base_component_repository_release (
>>component=0x7b1236f0 )
>>at 
>> ../../../../openmpi-dev-124-g91e9686/opal/mca/base/mca_base_component_repository.c:240
>> #13 0x7ecac774 in mca_base_component_unload (
>>component=0x7b1236f0 , output_id=-1)
>>at 
>> ../../../../openmpi-dev-124-g91e9686/opal/mca/base/mca_base_components_close.c:47
>> #14 0x7ecac808 in mca_base_component_close (
>>component=0x7b1236f0 , output_id=-1)
>>at 
>> ../../../../openmpi-dev-124-g91e9686/opal/mca/base/mca_base_components_close.c:60
>> #15 0x7ecac8dc in mca_base_components_close (output_id=-1, 
>>components=0x7f14ba58 , skip=0x0)
>>at 
>> 

Re: [OMPI users] which info is needed for SIGSEGV in Java for openmpi-dev-124-g91e9686 on Solaris

2014-10-26 Thread Ralph Castain
Afraid this must be something about the Sparc - just ran on a Solaris 11 x86 
box and everything works fine.


> On Oct 26, 2014, at 8:22 AM, Siegmar Gross 
>  wrote:
> 
> Hi Gilles,
> 
> I wanted to explore which function is called, when I call MPI_Init
> in a C program, because this function should be called from a Java
> program as well. Unfortunately C programs break with a Bus Error
> once more for openmpi-dev-124-g91e9686 on Solaris. I assume that's
> the reason why I get no useful backtrace for my Java program.
> 
> tyr small_prog 117 mpicc -o init_finalize init_finalize.c
> tyr small_prog 118 /usr/local/gdb-7.6.1_64_gcc/bin/gdb mpiexec
> ...
> (gdb) run -np 1 init_finalize
> Starting program: /usr/local/openmpi-1.9.0_64_gcc/bin/mpiexec -np 1 
> init_finalize
> [Thread debugging using libthread_db enabled]
> [New Thread 1 (LWP 1)]
> [New LWP2]
> [tyr:19240] *** Process received signal ***
> [tyr:19240] Signal: Bus Error (10)
> [tyr:19240] Signal code: Invalid address alignment (1)
> [tyr:19240] Failing at address: 7bd1c10c
> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-pal.so.0.0.0:opal_backtrace_print+0x2c
> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-pal.so.0.0.0:0xdcc04
> /lib/sparcv9/libc.so.1:0xd8b98
> /lib/sparcv9/libc.so.1:0xcc70c
> /lib/sparcv9/libc.so.1:0xcc918
> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-pal.so.0.0.0:opal_proc_set_name+0x1c
>  [ Signal 10 (BUS)]
> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/openmpi/mca_pmix_native.so:0x103e8
> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/openmpi/mca_ess_pmi.so:0x33dc
> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-rte.so.0.0.0:orte_init+0x67c
> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libmpi.so.0.0.0:ompi_mpi_init+0x374
> /export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libmpi.so.0.0.0:PMPI_Init+0x2a8
> /home/fd1026/work/skripte/master/parallel/prog/mpi/small_prog/init_finalize:main+0x20
> /home/fd1026/work/skripte/master/parallel/prog/mpi/small_prog/init_finalize:_start+0x7c
> [tyr:19240] *** End of error message ***
> --
> mpiexec noticed that process rank 0 with PID 0 on node tyr exited on signal 
> 10 (Bus Error).
> --
> [LWP2 exited]
> [New Thread 2]
> [Switching to Thread 1 (LWP 1)]
> sol_thread_fetch_registers: td_ta_map_id2thr: no thread can be found to 
> satisfy query
> (gdb) bt
> #0  0x7f6173d0 in rtld_db_dlactivity () from /usr/lib/sparcv9/ld.so.1
> #1  0x7f6175a8 in rd_event () from /usr/lib/sparcv9/ld.so.1
> #2  0x7f618950 in lm_delete () from /usr/lib/sparcv9/ld.so.1
> #3  0x7f6226bc in remove_so () from /usr/lib/sparcv9/ld.so.1
> #4  0x7f624574 in remove_hdl () from /usr/lib/sparcv9/ld.so.1
> #5  0x7f61d97c in dlclose_core () from /usr/lib/sparcv9/ld.so.1
> #6  0x7f61d9d4 in dlclose_intn () from /usr/lib/sparcv9/ld.so.1
> #7  0x7f61db0c in dlclose () from /usr/lib/sparcv9/ld.so.1
> #8  0x7ec87f60 in vm_close (loader_data=0x0, 
> module=0x7c901fe0)
>at ../../../openmpi-dev-124-g91e9686/opal/libltdl/loaders/dlopen.c:212
> #9  0x7ec85534 in lt_dlclose (handle=0x100189b50)
>at ../../../openmpi-dev-124-g91e9686/opal/libltdl/ltdl.c:1982
> #10 0x7ecaabd4 in ri_destructor (obj=0x1001893a0)
>at 
> ../../../../openmpi-dev-124-g91e9686/opal/mca/base/mca_base_component_repository.c:382
> #11 0x7eca9504 in opal_obj_run_destructors (object=0x1001893a0)
>at ../../../../openmpi-dev-124-g91e9686/opal/class/opal_object.h:446
> #12 0x7ecaa474 in mca_base_component_repository_release (
>component=0x7b1236f0 )
>at 
> ../../../../openmpi-dev-124-g91e9686/opal/mca/base/mca_base_component_repository.c:240
> #13 0x7ecac774 in mca_base_component_unload (
>component=0x7b1236f0 , output_id=-1)
>at 
> ../../../../openmpi-dev-124-g91e9686/opal/mca/base/mca_base_components_close.c:47
> #14 0x7ecac808 in mca_base_component_close (
>component=0x7b1236f0 , output_id=-1)
>at 
> ../../../../openmpi-dev-124-g91e9686/opal/mca/base/mca_base_components_close.c:60
> #15 0x7ecac8dc in mca_base_components_close (output_id=-1, 
>components=0x7f14ba58 , skip=0x0)
>at 
> ../../../../openmpi-dev-124-g91e9686/opal/mca/base/mca_base_components_close.c:86
> #16 0x7ecac844 in mca_base_framework_components_close (
>framework=0x7f14ba08 , skip=0x0)
>at 
> ../../../../openmpi-dev-124-g91e9686/opal/mca/base/mca_base_components_close.c:66
> #17 0x7efcaf58 in orte_oob_base_close ()
>at 
> ../../../../openmpi-dev-124-g91e9686/orte/mca/oob/base/oob_base_frame.c:112
> #18 0x7ecc136c 

Re: [OMPI users] which info is needed for SIGSEGV in Java for openmpi-dev-124-g91e9686 on Solaris

2014-10-26 Thread Siegmar Gross
Hi Gilles,

I wanted to explore which function is called when I call MPI_Init
in a C program, because this function should be called from a Java
program as well. Unfortunately, C programs once more break with a Bus
Error for openmpi-dev-124-g91e9686 on Solaris. I assume that's
the reason why I get no useful backtrace for my Java program.

tyr small_prog 117 mpicc -o init_finalize init_finalize.c
tyr small_prog 118 /usr/local/gdb-7.6.1_64_gcc/bin/gdb mpiexec
...
(gdb) run -np 1 init_finalize
Starting program: /usr/local/openmpi-1.9.0_64_gcc/bin/mpiexec -np 1 
init_finalize
[Thread debugging using libthread_db enabled]
[New Thread 1 (LWP 1)]
[New LWP2]
[tyr:19240] *** Process received signal ***
[tyr:19240] Signal: Bus Error (10)
[tyr:19240] Signal code: Invalid address alignment (1)
[tyr:19240] Failing at address: 7bd1c10c
/export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-pal.so.0.0.0:opal_backtrace_print+0x2c
/export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-pal.so.0.0.0:0xdcc04
/lib/sparcv9/libc.so.1:0xd8b98
/lib/sparcv9/libc.so.1:0xcc70c
/lib/sparcv9/libc.so.1:0xcc918
/export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-pal.so.0.0.0:opal_proc_set_name+0x1c
 [ Signal 10 (BUS)]
/export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/openmpi/mca_pmix_native.so:0x103e8
/export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/openmpi/mca_ess_pmi.so:0x33dc
/export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-rte.so.0.0.0:orte_init+0x67c
/export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libmpi.so.0.0.0:ompi_mpi_init+0x374
/export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libmpi.so.0.0.0:PMPI_Init+0x2a8
/home/fd1026/work/skripte/master/parallel/prog/mpi/small_prog/init_finalize:main+0x20
/home/fd1026/work/skripte/master/parallel/prog/mpi/small_prog/init_finalize:_start+0x7c
[tyr:19240] *** End of error message ***
--
mpiexec noticed that process rank 0 with PID 0 on node tyr exited on signal 10 
(Bus Error).
--
[LWP2 exited]
[New Thread 2]
[Switching to Thread 1 (LWP 1)]
sol_thread_fetch_registers: td_ta_map_id2thr: no thread can be found to satisfy 
query
(gdb) bt
#0  0x7f6173d0 in rtld_db_dlactivity () from /usr/lib/sparcv9/ld.so.1
#1  0x7f6175a8 in rd_event () from /usr/lib/sparcv9/ld.so.1
#2  0x7f618950 in lm_delete () from /usr/lib/sparcv9/ld.so.1
#3  0x7f6226bc in remove_so () from /usr/lib/sparcv9/ld.so.1
#4  0x7f624574 in remove_hdl () from /usr/lib/sparcv9/ld.so.1
#5  0x7f61d97c in dlclose_core () from /usr/lib/sparcv9/ld.so.1
#6  0x7f61d9d4 in dlclose_intn () from /usr/lib/sparcv9/ld.so.1
#7  0x7f61db0c in dlclose () from /usr/lib/sparcv9/ld.so.1
#8  0x7ec87f60 in vm_close (loader_data=0x0, module=0x7c901fe0)
at ../../../openmpi-dev-124-g91e9686/opal/libltdl/loaders/dlopen.c:212
#9  0x7ec85534 in lt_dlclose (handle=0x100189b50)
at ../../../openmpi-dev-124-g91e9686/opal/libltdl/ltdl.c:1982
#10 0x7ecaabd4 in ri_destructor (obj=0x1001893a0)
at 
../../../../openmpi-dev-124-g91e9686/opal/mca/base/mca_base_component_repository.c:382
#11 0x7eca9504 in opal_obj_run_destructors (object=0x1001893a0)
at ../../../../openmpi-dev-124-g91e9686/opal/class/opal_object.h:446
#12 0x7ecaa474 in mca_base_component_repository_release (
component=0x7b1236f0 )
at 
../../../../openmpi-dev-124-g91e9686/opal/mca/base/mca_base_component_repository.c:240
#13 0x7ecac774 in mca_base_component_unload (
component=0x7b1236f0 , output_id=-1)
at 
../../../../openmpi-dev-124-g91e9686/opal/mca/base/mca_base_components_close.c:47
#14 0x7ecac808 in mca_base_component_close (
component=0x7b1236f0 , output_id=-1)
at 
../../../../openmpi-dev-124-g91e9686/opal/mca/base/mca_base_components_close.c:60
#15 0x7ecac8dc in mca_base_components_close (output_id=-1, 
components=0x7f14ba58 , skip=0x0)
at 
../../../../openmpi-dev-124-g91e9686/opal/mca/base/mca_base_components_close.c:86
#16 0x7ecac844 in mca_base_framework_components_close (
framework=0x7f14ba08 , skip=0x0)
at 
../../../../openmpi-dev-124-g91e9686/opal/mca/base/mca_base_components_close.c:66
#17 0x7efcaf58 in orte_oob_base_close ()
at 
../../../../openmpi-dev-124-g91e9686/orte/mca/oob/base/oob_base_frame.c:112
#18 0x7ecc136c in mca_base_framework_close (
framework=0x7f14ba08 )
at 
../../../../openmpi-dev-124-g91e9686/opal/mca/base/mca_base_framework.c:187
#19 0x7be07858 in rte_finalize ()
at 
../../../../../openmpi-dev-124-g91e9686/orte/mca/ess/hnp/ess_hnp_module.c:857
#20 0x7ef338a4 in orte_finalize ()
at 

Re: [OMPI users] which info is needed for SIGSEGV in Java for openmpi-dev-124-g91e9686 on Solaris

2014-10-26 Thread Siegmar Gross
Hi Gilles,

thank you very much for the quick tutorial. Unfortunately I still
can't get a backtrace.

> You might need to configure with --enable-debug and add -g -O0
> to your CFLAGS and LDFLAGS
> 
> Then once you attach with gdb, you have to find the thread that is polling :
> thread 1
> bt
> thread 2
> bt
> and so on until you find the good thread
> If _dbg is a local variable, you need to select the right frame
> before you can change the value :
> get the frame number from bt (generally 1 under linux)
> f 
> set _dbg=0
> 
> I hope this helps
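For reference, the _dbg hook described above is roughly of this shape (an
illustrative sketch only; the real variable lives somewhere in the Open MPI
Java/C glue code and may differ): the process spins until you attach a
debugger, select the frame that contains _dbg, and clear it.

#include <unistd.h>

static volatile int _dbg = 1;

static void wait_for_debugger(void)
{
    while (_dbg) {
        sleep(1);   /* in gdb: select this frame, then "set _dbg=0" */
    }
}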

"--enable-debug" is one of my default options. Now I used the
following command to configure Open MPI. I always start the
build process in an empty directory and I always remove
/usr/local/openmpi-1.9.0_64_gcc, before I install a new version.

tyr openmpi-dev-124-g91e9686-SunOS.sparc.64_gcc 112 head config.log \
  | grep openmpi
$ ../openmpi-dev-124-g91e9686/configure
  --prefix=/usr/local/openmpi-1.9.0_64_gcc
  --libdir=/usr/local/openmpi-1.9.0_64_gcc/lib64
  --with-jdk-bindir=/usr/local/jdk1.8.0/bin
  --with-jdk-headers=/usr/local/jdk1.8.0/include
  JAVA_HOME=/usr/local/jdk1.8.0
  LDFLAGS=-m64 -g -O0 CC=gcc CXX=g++ FC=gfortran
  CFLAGS=-m64 -D_REENTRANT -g -O0
  CXXFLAGS=-m64 FCFLAGS=-m64 CPP=cpp CXXCPP=cpp
  CPPFLAGS=-D_REENTRANT CXXCPPFLAGS=
  --enable-mpi-cxx --enable-cxx-exceptions --enable-mpi-java
  --enable-heterogeneous --enable-mpi-thread-multiple
  --with-threads=posix --with-hwloc=internal --without-verbs
  --with-wrapper-cflags=-std=c11 -m64 --enable-debug
tyr openmpi-dev-124-g91e9686-SunOS.sparc.64_gcc 113 


"gbd" doesn't allow any backtrace for any thread.

tyr java 124 /usr/local/gdb-7.6.1_64_gcc/bin/gdb
GNU gdb (GDB) 7.6.1
...
(gdb) attach 18876
Attaching to process 18876
[New process 18876]
Retry #1:
Retry #2:
Retry #3:
Retry #4:
0x7eadcb04 in ?? ()
(gdb) info threads
[New LWP 12]
[New LWP 11]
[New LWP 10]
[New LWP 9]
[New LWP 8]
[New LWP 7]
[New LWP 6]
[New LWP 5]
[New LWP 4]
[New LWP 3]
[New LWP 2]
  Id   Target Id Frame 
  12   LWP 2 0x7eadc6b0 in ?? ()
  11   LWP 3 0x7eadcbb8 in ?? ()
  10   LWP 4 0x7eadcbb8 in ?? ()
  9LWP 5 0x7eadcbb8 in ?? ()
  8LWP 6 0x7eadcbb8 in ?? ()
  7LWP 7 0x7eadcbb8 in ?? ()
  6LWP 8 0x7ead8b0c in ?? ()
  5LWP 9 0x7eadcbb8 in ?? ()
  4LWP 100x7eadcbb8 in ?? ()
  3LWP 110x7eadcbb8 in ?? ()
  2LWP 120x7eadcbb8 in ?? ()
* 1LWP 1 0x7eadcb04 in ?? ()
(gdb) thread 1
[Switching to thread 1 (LWP 1)]
#0  0x7eadcb04 in ?? ()
(gdb) bt
#0  0x7eadcb04 in ?? ()
#1  0x7eaca12c in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) thread 2
[Switching to thread 2 (LWP 12)]
#0  0x7eadcbb8 in ?? ()
(gdb) bt
#0  0x7eadcbb8 in ?? ()
#1  0x7eac2638 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) thread 3
[Switching to thread 3 (LWP 11)]
#0  0x7eadcbb8 in ?? ()
(gdb) bt
#0  0x7eadcbb8 in ?? ()
#1  0x7eac25a8 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) thread 4
[Switching to thread 4 (LWP 10)]
#0  0x7eadcbb8 in ?? ()
(gdb) bt
#0  0x7eadcbb8 in ?? ()
#1  0x7eac2638 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) thread 5
[Switching to thread 5 (LWP 9)]
#0  0x7eadcbb8 in ?? ()
(gdb) bt
#0  0x7eadcbb8 in ?? ()
#1  0x7eac2638 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) thread 6
[Switching to thread 6 (LWP 8)]
#0  0x7ead8b0c in ?? ()
(gdb) bt
#0  0x7ead8b0c in ?? ()
#1  0x7eacbcb0 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) thread 7
[Switching to thread 7 (LWP 7)]
#0  0x7eadcbb8 in ?? ()
(gdb) bt
#0  0x7eadcbb8 in ?? ()
#1  0x7eac25a8 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) thread 8
[Switching to thread 8 (LWP 6)]
#0  0x7eadcbb8 in ?? ()
(gdb) bt
#0  0x7eadcbb8 in ?? ()
#1  0x7eac25a8 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) thread 9
[Switching to thread 9 (LWP 5)]
#0  0x7eadcbb8 in ?? ()
(gdb) bt
#0  0x7eadcbb8 in ?? ()
#1  0x7eac2638 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) thread 10
[Switching to thread 10 (LWP 4)]
#0  0x7eadcbb8 in ?? ()
(gdb) bt
#0  0x7eadcbb8 in ?? ()
#1  0x7eac25a8 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) thread 11
[Switching to thread 11 (LWP 3)]
#0  0x7eadcbb8 in ?? ()
(gdb) bt
#0  0x7eadcbb8 in ?? ()
#1  0x7eac25a8 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) thread 12
[Switching to thread 12 (LWP 2)]
#0  0x7eadc6b0 in ?? ()
(gdb) 



I also tried to set _dbg in all available frames without success.

(gdb) f 1
#1  0x7eacb46c in ?? ()