Hi,

> Your commit r32459 fixed the bus error by correcting
> opal/dss/dss_copy.c. That is sufficient for trunk because mca_dstore_hash
> calls dss to copy data, but it is not enough for v1.8 because
> mca_db_hash doesn't call dss and copies the data itself.
> 
> The attached patch is the minimal change needed to fix it in v1.8.
> My fix doesn't call dss but uses memcpy instead. I have confirmed it on
> SPARC64/Linux.
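
As I understand the patch, this is the usual SPARC alignment rule: reading a
multi-byte value through a cast of a possibly unaligned pointer raises SIGBUS
("invalid address alignment"), while copying it with memcpy into an aligned
local is safe. A minimal sketch of that pattern follows; it is only an
illustration with a made-up helper name, not the actual mca_db_hash/dss code.

#include <stdint.h>
#include <string.h>

/* Hedged sketch only -- NOT the actual Open MPI code; the helper name and
 * buffer layout are invented for illustration.  On SPARC, dereferencing an
 * address that is not naturally aligned raises SIGBUS, whereas memcpy works
 * for any alignment. */
static int32_t read_int32_unaligned(const unsigned char *buf, size_t offset)
{
    int32_t value;
    /* Would fault on SPARC if (buf + offset) is not 4-byte aligned:
     *     value = *(const int32_t *)(buf + offset);                    */
    memcpy(&value, buf + offset, sizeof(value));
    return value;
}

This is only meant to show why a memcpy-based copy fixes the bus error on
SPARC64 while the cast-and-dereference form does not.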

Thank you very much for your help. I applied your patch and it
fixes the bus error for my C programs as well. Unfortunately I
get a SIGSEGV for Java programs.

tyr java 126 mpiexec -np 1 java InitFinalizeMain
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0xffffffff7ea3c7f0, pid=10506, tid=2
...


gdb shows the following backtrace.

tyr java 127 /usr/local/gdb-7.6.1_64_gcc/bin/gdb /usr/local/openmpi-1.8.2_64_gcc/bin/mpiexec
GNU gdb (GDB) 7.6.1
...
(gdb) run -np 1 java InitFinalizeMain 
Starting program: /usr/local/openmpi-1.8.2_64_gcc/bin/mpiexec -np 1 java InitFinalizeMain
[Thread debugging using libthread_db enabled]
[New Thread 1 (LWP 1)]
[New LWP    2        ]
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0xffffffff7ea3c7f0, pid=10524, tid=2
#
# JRE version: Java(TM) SE Runtime Environment (8.0-b132) (build 1.8.0-b132)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.0-b70 mixed mode solaris-sparc compressed oops)
# Problematic frame:
# C  [libc.so.1+0x3c7f0]  strlen+0x50
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/fd1026/work/skripte/master/parallel/prog/mpi/java/hs_err_pid10524.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.sun.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 10524 on node tyr exited on signal 6 (Abort).
--------------------------------------------------------------------------
[LWP    2         exited]
[New Thread 2        ]
[Switching to Thread 1 (LWP 1)]
sol_thread_fetch_registers: td_ta_map_id2thr: no thread can be found to satisfy query
(gdb) bt
#0  0xffffffff7f6173d0 in rtld_db_dlactivity () from /usr/lib/sparcv9/ld.so.1
#1  0xffffffff7f6175a8 in rd_event () from /usr/lib/sparcv9/ld.so.1
#2  0xffffffff7f618950 in lm_delete () from /usr/lib/sparcv9/ld.so.1
#3  0xffffffff7f6226bc in remove_so () from /usr/lib/sparcv9/ld.so.1
#4  0xffffffff7f624574 in remove_hdl () from /usr/lib/sparcv9/ld.so.1
#5  0xffffffff7f61d97c in dlclose_core () from /usr/lib/sparcv9/ld.so.1
#6  0xffffffff7f61d9d4 in dlclose_intn () from /usr/lib/sparcv9/ld.so.1
#7  0xffffffff7f61db0c in dlclose () from /usr/lib/sparcv9/ld.so.1
#8  0xffffffff7ec7748c in vm_close ()
   from /usr/local/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6
#9  0xffffffff7ec74a6c in lt_dlclose ()
   from /usr/local/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6
#10 0xffffffff7ec99b90 in ri_destructor (obj=0x1001eae10)
    at 
../../../../openmpi-1.8.2rc4r32485/opal/mca/base/mca_base_component_repository.c:391
#11 0xffffffff7ec984a8 in opal_obj_run_destructors (object=0x1001eae10)
    at ../../../../openmpi-1.8.2rc4r32485/opal/class/opal_object.h:446
#12 0xffffffff7ec9940c in mca_base_component_repository_release (
    component=0xffffffff7b023df0 <mca_oob_tcp_component>)
    at 
../../../../openmpi-1.8.2rc4r32485/opal/mca/base/mca_base_component_repository.c:244
#13 0xffffffff7ec9b754 in mca_base_component_unload (
    component=0xffffffff7b023df0 <mca_oob_tcp_component>, output_id=-1)
    at 
../../../../openmpi-1.8.2rc4r32485/opal/mca/base/mca_base_components_close.c:47
#14 0xffffffff7ec9b7e8 in mca_base_component_close (
    component=0xffffffff7b023df0 <mca_oob_tcp_component>, output_id=-1)
    at 
../../../../openmpi-1.8.2rc4r32485/opal/mca/base/mca_base_components_close.c:60
#15 0xffffffff7ec9b8bc in mca_base_components_close (output_id=-1, 
    components=0xffffffff7f12b930 <orte_oob_base_framework+80>, skip=0x0)
    at 
../../../../openmpi-1.8.2rc4r32485/opal/mca/base/mca_base_components_close.c:86
#16 0xffffffff7ec9b824 in mca_base_framework_components_close (
    framework=0xffffffff7f12b8e0 <orte_oob_base_framework>, skip=0x0)
    at 
../../../../openmpi-1.8.2rc4r32485/opal/mca/base/mca_base_components_close.c:66
#17 0xffffffff7efae21c in orte_oob_base_close ()
    at ../../../../openmpi-1.8.2rc4r32485/orte/mca/oob/base/oob_base_frame.c:94
#18 0xffffffff7ecb28cc in mca_base_framework_close (
    framework=0xffffffff7f12b8e0 <orte_oob_base_framework>)
---Type <return> to continue, or q <return> to quit---
    at ../../../../openmpi-1.8.2rc4r32485/opal/mca/base/mca_base_framework.c:187
#19 0xffffffff7bf078c0 in rte_finalize ()
    at 
../../../../../openmpi-1.8.2rc4r32485/orte/mca/ess/hnp/ess_hnp_module.c:858
#20 0xffffffff7ef30a44 in orte_finalize ()
    at ../../openmpi-1.8.2rc4r32485/orte/runtime/orte_finalize.c:65
#21 0x00000001000070c4 in orterun (argc=5, argv=0xffffffff7fffe0d8)
    at ../../../../openmpi-1.8.2rc4r32485/orte/tools/orterun/orterun.c:1096
#22 0x0000000100003d70 in main (argc=5, argv=0xffffffff7fffe0d8)
    at ../../../../openmpi-1.8.2rc4r32485/orte/tools/orterun/main.c:13
(gdb) 


Kind regards and once more thank you very much

Siegmar



> Sorry to respond so late.
> 
> Regards,
> Takahiro Kawashima,
> MPI development team,
> Fujitsu
> 
> > Siegmar, Ralph,
> > 
> > I'm sorry for responding so late since last week.
> > 
> > Ralph fixed the problem in r32459 and it was merged to v1.8
> > in r32474. But in v1.8 an additional custom patch is needed
> > because the db/dstore source code differs between trunk
> > and v1.8.
> > 
> > I'm preparing and testing the custom patch right now.
> > Please wait a moment.
> > 
> > Takahiro Kawashima,
> > MPI development team,
> > Fujitsu
> > 
> > > Hi,
> > > 
> > > thank you very much to everybody who tried to solve my bus
> > > error problem on Solaris 10 Sparc. I thought that you had found
> > > and fixed it, so I installed openmpi-1.8.2rc4r32485 on
> > > my machines (Solaris 10 Sparc (tyr), Solaris 10 x86_64 (sunpc1),
> > > openSUSE Linux 12.1 x86_64 (linpc1)) with gcc-4.9.0. A small
> > > program works on my x86_64 machines, but still breaks
> > > with a bus error on my Sparc system.
> > > 
> > > linpc1 fd1026 106 mpiexec -np 1 init_finalize
> > > Hello!
> > > linpc1 fd1026 106 exit
> > > logout
> > > tyr small_prog 113 ssh sunpc1
> > > sunpc1 fd1026 101 mpiexec -np 1 init_finalize
> > > Hello!
> > > sunpc1 fd1026 102 exit
> > > logout
> > > tyr small_prog 114 mpiexec -np 1 init_finalize
> > > [tyr:21109] *** Process received signal ***
> > > [tyr:21109] Signal: Bus Error (10)
> > > ...
> > > 
> > > 
> > > gdb shows the following backtrace.
> > > 
> > > tyr small_prog 122 /usr/local/gdb-7.6.1_64_gcc/bin/gdb /usr/local/openmpi-1.8.2_64_gcc/bin/mpiexec
> > > GNU gdb (GDB) 7.6.1
> > > ...
> > > (gdb) run -np 1 init_finalize
> > > Starting program: /usr/local/openmpi-1.8.2_64_gcc/bin/mpiexec -np 1 init_finalize
> > > [Thread debugging using libthread_db enabled]
> > > [New Thread 1 (LWP 1)]
> > > [New LWP    2        ]
> > > [tyr:21158] *** Process received signal ***
> > > [tyr:21158] Signal: Bus Error (10)
> > > [tyr:21158] Signal code: Invalid address alignment (1)
> > > [tyr:21158] Failing at address: ffffffff7fffd224
> > > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_backtrace_print+0x2c
> > > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:0xcd130
> > > /lib/sparcv9/libc.so.1:0xd8b98
> > > /lib/sparcv9/libc.so.1:0xcc70c
> > > /lib/sparcv9/libc.so.1:0xcc918
> > > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_db_hash.so:0x3ee8
> > >  [ Signal 10 (BUS)]
> > > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_db_base_store+0xc8
> > > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_util_decode_pidmap+0x798
> > > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_util_nidmap_init+0x3cc
> > > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_ess_env.so:0x226c
> > > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_init+0x308
> > > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:ompi_mpi_init+0x31c
> > > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:MPI_Init+0x2a8
> > > /home/fd1026/SunOS/sparc/bin/init_finalize:main+0x10
> > > /home/fd1026/SunOS/sparc/bin/init_finalize:_start+0x7c
> > > [tyr:21158] *** End of error message ***
> > > --------------------------------------------------------------------------
> > > mpiexec noticed that process rank 0 with PID 21158 on node tyr exited on signal 10 (Bus Error).
> > > --------------------------------------------------------------------------
> > > [LWP    2         exited]
> > > [New Thread 2        ]
> > > [Switching to Thread 1 (LWP 1)]
> > > sol_thread_fetch_registers: td_ta_map_id2thr: no thread can be found to satisfy query
> > > (gdb) bt
> > > #0  0xffffffff7f6173d0 in rtld_db_dlactivity () from /usr/lib/sparcv9/ld.so.1
> > > #1  0xffffffff7f6175a8 in rd_event () from /usr/lib/sparcv9/ld.so.1
> > > #2  0xffffffff7f618950 in lm_delete () from /usr/lib/sparcv9/ld.so.1
> > > #3  0xffffffff7f6226bc in remove_so () from /usr/lib/sparcv9/ld.so.1
> > > #4  0xffffffff7f624574 in remove_hdl () from /usr/lib/sparcv9/ld.so.1
> > > #5  0xffffffff7f61d97c in dlclose_core () from /usr/lib/sparcv9/ld.so.1
> > > #6  0xffffffff7f61d9d4 in dlclose_intn () from /usr/lib/sparcv9/ld.so.1
> > > #7  0xffffffff7f61db0c in dlclose () from /usr/lib/sparcv9/ld.so.1
> > > #8  0xffffffff7ec7748c in vm_close () from 
> > > /usr/local/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6
> > > #9  0xffffffff7ec74a6c in lt_dlclose () from 
> > > /usr/local/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6
> > > #10 0xffffffff7ec99b90 in ri_destructor (obj=0x1001ead30)
> > >     at 
> > > ../../../../openmpi-1.8.2rc4r32485/opal/mca/base/mca_base_component_repository.c:391
> > > #11 0xffffffff7ec984a8 in opal_obj_run_destructors (object=0x1001ead30)
> > >     at ../../../../openmpi-1.8.2rc4r32485/opal/class/opal_object.h:446
> > > #12 0xffffffff7ec9940c in mca_base_component_repository_release (
> > >     component=0xffffffff7b023df0 <mca_oob_tcp_component>)
> > >     at 
> > > ../../../../openmpi-1.8.2rc4r32485/opal/mca/base/mca_base_component_repository.c:244
> > > #13 0xffffffff7ec9b754 in mca_base_component_unload (
> > >     component=0xffffffff7b023df0 <mca_oob_tcp_component>, output_id=-1)
> > >     at 
> > > ../../../../openmpi-1.8.2rc4r32485/opal/mca/base/mca_base_components_close.c:47
> > > #14 0xffffffff7ec9b7e8 in mca_base_component_close (
> > >     component=0xffffffff7b023df0 <mca_oob_tcp_component>, output_id=-1)
> > >     at 
> > > ../../../../openmpi-1.8.2rc4r32485/opal/mca/base/mca_base_components_close.c:60
> > > #15 0xffffffff7ec9b8bc in mca_base_components_close (output_id=-1, 
> > >     components=0xffffffff7f12b930 <orte_oob_base_framework+80>, skip=0x0)
> > >     at 
> > > ../../../../openmpi-1.8.2rc4r32485/opal/mca/base/mca_base_components_close.c:86
> > > #16 0xffffffff7ec9b824 in mca_base_framework_components_close (
> > >     framework=0xffffffff7f12b8e0 <orte_oob_base_framework>, skip=0x0)
> > >     at 
> > > ../../../../openmpi-1.8.2rc4r32485/opal/mca/base/mca_base_components_close.c:66
> > > #17 0xffffffff7efae21c in orte_oob_base_close ()
> > >     at 
> > > ../../../../openmpi-1.8.2rc4r32485/orte/mca/oob/base/oob_base_frame.c:94
> > > #18 0xffffffff7ecb28cc in mca_base_framework_close (
> > >     framework=0xffffffff7f12b8e0 <orte_oob_base_framework>)
> > >     at 
> > > ../../../../openmpi-1.8.2rc4r32485/opal/mca/base/mca_base_framework.c:187
> > > #19 0xffffffff7bf078c0 in rte_finalize ()
> > >     at 
> > > ../../../../../openmpi-1.8.2rc4r32485/orte/mca/ess/hnp/ess_hnp_module.c:858
> > > #20 0xffffffff7ef30a44 in orte_finalize ()
> > >     at ../../openmpi-1.8.2rc4r32485/orte/runtime/orte_finalize.c:65
> > > #21 0x00000001000070c4 in orterun (argc=4, argv=0xffffffff7fffe0d8)
> > >     at 
> > > ../../../../openmpi-1.8.2rc4r32485/orte/tools/orterun/orterun.c:1096
> > > #22 0x0000000100003d70 in main (argc=4, argv=0xffffffff7fffe0d8)
> > >     at ../../../../openmpi-1.8.2rc4r32485/orte/tools/orterun/main.c:13
> > > (gdb) 
> > > 
> > > 
> > > Is this a new problem? I would be grateful if somebody could
> > > fix it. Thank you very much in advance for any help.
> > > 
> > > Kind regards
> > > 
> > > Siegmar
