Sorry for the delay; this past weekend was a holiday in the US. I'm just now
catching up on the backlog.
Have you edited pml_ob1_comm.c? For me, line 56 (on the trunk) is:
OBJ_CONSTRUCT(&comm->matching_lock, opal_mutex_t);
But your trace shows line 56 calling opal_list_construct, so you seem to be
executing the line above that:
OBJ_CONSTRUCT(&comm->wild_receives, opal_list_t);
I can't imagine why that line would segv -- it would imply that the "class
definition" for opal_list_t is hosed in memory somehow.
Are you 100% sure that you're compiling / linking against your development copy
of Open MPI, and not accidentally mixing it with some other OMPI installation
at run time? (e.g., via LD_LIBRARY_PATH or somesuch)
If you're sure the installations aren't being mixed, you might want to run
hello_c through a debugger and put a watchpoint on the opal_list_t_class
variable to see when it changes. It should be initialized early in opal_init()
somewhere and then used many times during MPI_Init() before the place where it
fails. The sentinel value 0xDEAFBEEDDEAFBEED is used in OMPI debug builds to
mark an object that has been destroyed. But this should never happen to the
opal_list_t_class instance itself.
On Jul 4, 2011, at 9:37 AM, Xin He wrote:
> Hi, here is the result:
>
> ehhexxn@oak:~/git/test$ mpirun -n 2 -mca btl tipc,self valgrind ./hello_c >
> 11.out
> ==30850== Memcheck, a memory error detector
> ==30850== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
> ==30850== Using Valgrind-3.6.0.SVN-Debian and LibVEX; rerun with -h for
> copyright info
> ==30850== Command: ./hello_c
> ==30850==
> ==30849== Memcheck, a memory error detector
> ==30849== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
> ==30849== Using Valgrind-3.6.0.SVN-Debian and LibVEX; rerun with -h for
> copyright info
> ==30849== Command: ./hello_c
> ==30849==
> ==30849== Jump to the invalid address stated on the next line
> ==30849==    at 0xDEAFBEEDDEAFBEED: ???
> ==30849==    by 0x50151F1: opal_list_construct (opal_list.c:88)
> ==30849==    by 0xA8A49F1: opal_obj_run_constructors (opal_object.h:427)
> ==30849==    by 0xA8A4E59: mca_pml_ob1_comm_construct (pml_ob1_comm.c:56)
> ==30849==    by 0xA8A1385: opal_obj_run_constructors (opal_object.h:427)
> ==30849==    by 0xA8A149F: opal_obj_new (opal_object.h:477)
> ==30849==    by 0xA8A12FA: opal_obj_new_debug (opal_object.h:252)
> ==30849==    by 0xA8A2A5F: mca_pml_ob1_add_comm (pml_ob1.c:182)
> ==30849==    by 0x4E95F50: ompi_mpi_init (ompi_mpi_init.c:770)
> ==30849==    by 0x4EC6C32: PMPI_Init (pinit.c:84)
> ==30849==    by 0x400935: main (in /home/ehhexxn/git/test/hello_c)
> ==30849== Address 0xdeafbeeddeafbeed is not stack'd, malloc'd or (recently)
> free'd
> ==30849==
> [oak:30849] *** Process received signal ***
> [oak:30849] Signal: Segmentation fault (11)
> [oak:30849] Signal code: Invalid permissions (2)
> [oak:30849] Failing at address: 0xdeafbeeddeafbeed
> ==30849== Invalid read of size 1
> ==30849==    at 0xA011FDB: ??? (in /lib/libgcc_s.so.1)
> ==30849==    by 0xA012B0B: _Unwind_Backtrace (in /lib/libgcc_s.so.1)
> ==30849==    by 0x60BE69D: backtrace (backtrace.c:91)
> ==30849==    by 0x4FAB055: opal_backtrace_buffer (backtrace_execinfo.c:54)
> ==30849==    by 0x5026DF3: show_stackframe (stacktrace.c:348)
> ==30849==    by 0x5DB1B3F: ??? (in /lib/libpthread-2.12.1.so)
> ==30849==    by 0xDEAFBEEDDEAFBEEC: ???
> ==30849==    by 0x50151F1: opal_list_construct (opal_list.c:88)
> ==30849==    by 0xA8A49F1: opal_obj_run_constructors (opal_object.h:427)
> ==30849==    by 0xA8A4E59: mca_pml_ob1_comm_construct (pml_ob1_comm.c:56)
> ==30849==    by 0xA8A1385: opal_obj_run_constructors (opal_object.h:427)
> ==30849==    by 0xA8A149F: opal_obj_new (opal_object.h:477)
> ==30849== Address 0xdeafbeeddeafbeed is not stack'd, malloc'd or (recently)
> free'd
> ==30849==
> ==30849==
> ==30849== Process terminating with default action of signal 11 (SIGSEGV):
> dumping core
> ==30849== General Protection Fault
> ==30849==    at 0xA011FDB: ??? (in /lib/libgcc_s.so.1)
> ==30849==    by 0xA012B0B: _Unwind_Backtrace (in /lib/libgcc_s.so.1)
> ==30849==    by 0x60BE69D: backtrace (backtrace.c:91)
> ==30849==    by 0x4FAB055: opal_backtrace_buffer (backtrace_execinfo.c:54)
> ==30849==    by 0x5026DF3: show_stackframe (stacktrace.c:348)
> ==30849==    by 0x5DB1B3F: ??? (in /lib/libpthread-2.12.1.so)
> ==30849==    by 0xDEAFBEEDDEAFBEEC: ???
> ==30849==    by 0x50151F1: opal_list_construct (opal_list.c:88)
> ==30849==    by 0xA8A49F1: opal_obj_run_constructors (opal_object.h:427)
> ==30849==    by 0xA8A4E59: mca_pml_ob1_comm_construct (pml_ob1_comm.c:56)
> ==30849==    by 0xA8A1385: opal_obj_run_constructors (opal_object.h:427)
> ==30849==    by 0xA8A149F: opal_obj_new (opal_object.h:477)
> ==30850== Jump to the invalid address stated on the next line
> ==30850==    at 0xDEAFBEEDDEAFBEED: ???
> ==30850==    by 0x50151F1: opal_list_construct (opal_list.c:88)
> ==30850==    by 0xA8A49F1: opal_obj_run_constructors (opal_