Hi, here is the result:

ehhexxn@oak:~/git/test$ mpirun -n 2 -mca btl tipc,self valgrind ./hello_c > 11.out
==30850== Memcheck, a memory error detector
==30850== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==30850== Using Valgrind-3.6.0.SVN-Debian and LibVEX; rerun with -h for copyright info
==30850== Command: ./hello_c
==30850==
==30849== Memcheck, a memory error detector
==30849== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==30849== Using Valgrind-3.6.0.SVN-Debian and LibVEX; rerun with -h for copyright info
==30849== Command: ./hello_c
==30849==
==30849== Jump to the invalid address stated on the next line
==30849==    at 0xDEAFBEEDDEAFBEED: ???
==30849==    by 0x50151F1: opal_list_construct (opal_list.c:88)
==30849==    by 0xA8A49F1: opal_obj_run_constructors (opal_object.h:427)
==30849==    by 0xA8A4E59: mca_pml_ob1_comm_construct (pml_ob1_comm.c:56)
==30849==    by 0xA8A1385: opal_obj_run_constructors (opal_object.h:427)
==30849==    by 0xA8A149F: opal_obj_new (opal_object.h:477)
==30849==    by 0xA8A12FA: opal_obj_new_debug (opal_object.h:252)
==30849==    by 0xA8A2A5F: mca_pml_ob1_add_comm (pml_ob1.c:182)
==30849==    by 0x4E95F50: ompi_mpi_init (ompi_mpi_init.c:770)
==30849==    by 0x4EC6C32: PMPI_Init (pinit.c:84)
==30849==    by 0x400935: main (in /home/ehhexxn/git/test/hello_c)
==30849== Address 0xdeafbeeddeafbeed is not stack'd, malloc'd or (recently) free'd
==30849==
[oak:30849] *** Process received signal ***
[oak:30849] Signal: Segmentation fault (11)
[oak:30849] Signal code: Invalid permissions (2)
[oak:30849] Failing at address: 0xdeafbeeddeafbeed
==30849== Invalid read of size 1
==30849==    at 0xA011FDB: ??? (in /lib/libgcc_s.so.1)
==30849==    by 0xA012B0B: _Unwind_Backtrace (in /lib/libgcc_s.so.1)
==30849==    by 0x60BE69D: backtrace (backtrace.c:91)
==30849==    by 0x4FAB055: opal_backtrace_buffer (backtrace_execinfo.c:54)
==30849==    by 0x5026DF3: show_stackframe (stacktrace.c:348)
==30849==    by 0x5DB1B3F: ??? (in /lib/libpthread-2.12.1.so)
==30849==    by 0xDEAFBEEDDEAFBEEC: ???
==30849==    by 0x50151F1: opal_list_construct (opal_list.c:88)
==30849==    by 0xA8A49F1: opal_obj_run_constructors (opal_object.h:427)
==30849==    by 0xA8A4E59: mca_pml_ob1_comm_construct (pml_ob1_comm.c:56)
==30849==    by 0xA8A1385: opal_obj_run_constructors (opal_object.h:427)
==30849==    by 0xA8A149F: opal_obj_new (opal_object.h:477)
==30849== Address 0xdeafbeeddeafbeed is not stack'd, malloc'd or (recently) free'd
==30849==
==30849==
==30849== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==30849==  General Protection Fault
==30849==    at 0xA011FDB: ??? (in /lib/libgcc_s.so.1)
==30849==    by 0xA012B0B: _Unwind_Backtrace (in /lib/libgcc_s.so.1)
==30849==    by 0x60BE69D: backtrace (backtrace.c:91)
==30849==    by 0x4FAB055: opal_backtrace_buffer (backtrace_execinfo.c:54)
==30849==    by 0x5026DF3: show_stackframe (stacktrace.c:348)
==30849==    by 0x5DB1B3F: ??? (in /lib/libpthread-2.12.1.so)
==30849==    by 0xDEAFBEEDDEAFBEEC: ???
==30849==    by 0x50151F1: opal_list_construct (opal_list.c:88)
==30849==    by 0xA8A49F1: opal_obj_run_constructors (opal_object.h:427)
==30849==    by 0xA8A4E59: mca_pml_ob1_comm_construct (pml_ob1_comm.c:56)
==30849==    by 0xA8A1385: opal_obj_run_constructors (opal_object.h:427)
==30849==    by 0xA8A149F: opal_obj_new (opal_object.h:477)
==30850== Jump to the invalid address stated on the next line
==30850==    at 0xDEAFBEEDDEAFBEED: ???
==30850==    by 0x50151F1: opal_list_construct (opal_list.c:88)
==30850==    by 0xA8A49F1: opal_obj_run_constructors (opal_object.h:427)
==30850==    by 0xA8A4E59: mca_pml_ob1_comm_construct (pml_ob1_comm.c:56)
==30850==    by 0xA8A1385: opal_obj_run_constructors (opal_object.h:427)
==30850==    by 0xA8A149F: opal_obj_new (opal_object.h:477)
==30850==    by 0xA8A12FA: opal_obj_new_debug (opal_object.h:252)
==30850==    by 0xA8A2A5F: mca_pml_ob1_add_comm (pml_ob1.c:182)
==30850==    by 0x4E95F50: ompi_mpi_init (ompi_mpi_init.c:770)
==30850==    by 0x4EC6C32: PMPI_Init (pinit.c:84)
==30850==    by 0x400935: main (in /home/ehhexxn/git/test/hello_c)
==30850== Address 0xdeafbeeddeafbeed is not stack'd, malloc'd or (recently) free'd
==30850==
[oak:30850] *** Process received signal ***
[oak:30850] Signal: Segmentation fault (11)
[oak:30850] Signal code: Invalid permissions (2)
[oak:30850] Failing at address: 0xdeafbeeddeafbeed
==30849==
==30849== HEAP SUMMARY:
==30849==     in use at exit: 2,338,964 bytes in 3,213 blocks
==30849== total heap usage: 5,205 allocs, 1,992 frees, 12,942,078 bytes allocated
==30849==
==30850== Invalid read of size 1
==30850==    at 0xA011FDB: ??? (in /lib/libgcc_s.so.1)
==30850==    by 0xA012B0B: _Unwind_Backtrace (in /lib/libgcc_s.so.1)
==30850==    by 0x60BE69D: backtrace (backtrace.c:91)
==30850==    by 0x4FAB055: opal_backtrace_buffer (backtrace_execinfo.c:54)
==30850==    by 0x5026DF3: show_stackframe (stacktrace.c:348)
==30850==    by 0x5DB1B3F: ??? (in /lib/libpthread-2.12.1.so)
==30850==    by 0xDEAFBEEDDEAFBEEC: ???
==30850==    by 0x50151F1: opal_list_construct (opal_list.c:88)
==30850==    by 0xA8A49F1: opal_obj_run_constructors (opal_object.h:427)
==30850==    by 0xA8A4E59: mca_pml_ob1_comm_construct (pml_ob1_comm.c:56)
==30850==    by 0xA8A1385: opal_obj_run_constructors (opal_object.h:427)
==30850==    by 0xA8A149F: opal_obj_new (opal_object.h:477)
==30850== Address 0xdeafbeeddeafbeed is not stack'd, malloc'd or (recently) free'd
==30850==
==30850==
==30850== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==30850==  General Protection Fault
==30850==    at 0xA011FDB: ??? (in /lib/libgcc_s.so.1)
==30850==    by 0xA012B0B: _Unwind_Backtrace (in /lib/libgcc_s.so.1)
==30850==    by 0x60BE69D: backtrace (backtrace.c:91)
==30850==    by 0x4FAB055: opal_backtrace_buffer (backtrace_execinfo.c:54)
==30850==    by 0x5026DF3: show_stackframe (stacktrace.c:348)
==30850==    by 0x5DB1B3F: ??? (in /lib/libpthread-2.12.1.so)
==30850==    by 0xDEAFBEEDDEAFBEEC: ???
==30850==    by 0x50151F1: opal_list_construct (opal_list.c:88)
==30850==    by 0xA8A49F1: opal_obj_run_constructors (opal_object.h:427)
==30850==    by 0xA8A4E59: mca_pml_ob1_comm_construct (pml_ob1_comm.c:56)
==30850==    by 0xA8A1385: opal_obj_run_constructors (opal_object.h:427)
==30850==    by 0xA8A149F: opal_obj_new (opal_object.h:477)
==30849== LEAK SUMMARY:
==30849==    definitely lost: 453 bytes in 13 blocks
==30849==    indirectly lost: 7,440 bytes in 12 blocks
==30849==      possibly lost: 0 bytes in 0 blocks
==30849==    still reachable: 2,331,071 bytes in 3,188 blocks
==30849==         suppressed: 0 bytes in 0 blocks
==30849== Rerun with --leak-check=full to see details of leaked memory
==30849==
==30849== For counts of detected and suppressed errors, rerun with: -v
==30849== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 4 from 4)



On 07/04/2011 01:51 PM, Jeff Squyres wrote:
Keep in mind, too, that opal_object is the "base" object -- put in C++ terms, 
it's the abstract class from which all other classes derive.  So it's rare that 
we would create an opal_object by itself.  opal_objects are usually created as 
part of some other, higher-level object.
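
To make the "base object" idea concrete, here is a minimal sketch in C of a base-object scheme in which constructors run parent-first and a higher-level object constructs an embedded list, much like the comm -> list -> object chain in the backtrace. Every name here (base_object_t, class_desc_t, object_construct, list_t, comm_t) is hypothetical and chosen for illustration; this is not the actual OPAL API or its data layout.

```c
#include <stddef.h>

/* Hypothetical, simplified model of a base-object scheme; NOT the OPAL API. */
typedef struct base_object base_object_t;
typedef void (*construct_fn_t)(base_object_t *);

typedef struct class_desc {
    const char *name;
    const struct class_desc *parent;  /* NULL for the root class */
    construct_fn_t construct;         /* may be NULL */
} class_desc_t;

struct base_object {
    const class_desc_t *cls;
    int ref_count;
};

/* Run the constructors of the whole class chain, parent first. */
static void run_constructors(base_object_t *obj, const class_desc_t *cls)
{
    if (cls == NULL) {
        return;
    }
    run_constructors(obj, cls->parent);
    if (cls->construct != NULL) {
        cls->construct(obj);  /* indirect call: a bad pointer here would crash */
    }
}

void object_construct(base_object_t *obj, const class_desc_t *cls)
{
    obj->cls = cls;
    obj->ref_count = 1;
    run_constructors(obj, cls);
}

/* A "list" class that embeds the base object as its first member. */
typedef struct {
    base_object_t super;
    size_t length;
} list_t;

static void list_construct(base_object_t *obj)
{
    ((list_t *)obj)->length = 0;
}

const class_desc_t base_class = { "base", NULL, NULL };
const class_desc_t list_class = { "list", &base_class, list_construct };

/* A higher-level object that contains a list. */
typedef struct {
    base_object_t super;
    list_t pending;
    int constructed;
} comm_t;

static void comm_construct(base_object_t *obj)
{
    comm_t *c = (comm_t *)obj;
    object_construct(&c->pending.super, &list_class);  /* nested construction */
    c->constructed = 1;
}

const class_desc_t comm_class = { "comm", &base_class, comm_construct };
```

Calling object_construct(&c.super, &comm_class) on a comm_t walks the chain and constructs the embedded list along the way. In a scheme like this, the constructor is reached through a function pointer stored in a class descriptor, so a damaged descriptor would show up as a jump to a nonsense address inside the constructor runner.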

What's the full call stack of where Valgrind is showing the error?

Make sure you have the most recent valgrind (www.valgrind.org); the versions 
that ship in various distros may be somewhat old.  Newer valgrind versions show 
lots of things that older versions don't.  A new valgrind *might* be able to 
show some prior memory fault that is causing the issue...?


On Jul 4, 2011, at 7:45 AM, Xin He wrote:

Hi,

I ran the program with valgrind, and it showed almost the same error. It 
appears that the segmentation fault happened during
the initialization of an opal_object.  That's why it puzzled me.

/Xin


On 07/04/2011 01:40 PM, Jeff Squyres wrote:
Ah -- so this is in the template code.  I suspect this code might have bit 
rotted a bit.  :-\

If you run this through valgrind, does anything obvious show up?  I ask because 
this kind of error is typically a symptom of the real error.  I.e., the real 
error was some kind of memory corruption that occurred earlier, and this is the 
memory access that exposes that prior memory corruption.
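
As a rough illustration of "the crash is only the symptom", the following C sketch keeps a name field and a handler index in one byte record. A copy that does not respect the name field's size silently damages the handler index, and the problem only becomes visible later, at dispatch time. The record layout and all names (set_name_buggy, dispatch, handlers) are assumptions made up for this example and have nothing to do with the Open MPI sources; the overrun is deliberately kept inside the record so the demo stays well-defined.

```c
#include <string.h>

/* Hypothetical record layout: 8 name bytes, then a 1-byte handler index. */
enum { NAME_OFF = 0, NAME_LEN = 8, HANDLER_OFF = 8, RECORD_LEN = 16 };
enum { NUM_HANDLERS = 2 };

static int handler_a(void) { return 1; }
static int handler_b(void) { return 2; }
static int (*const handlers[NUM_HANDLERS])(void) = { handler_a, handler_b };

/* The "real" bug: len is not clamped to NAME_LEN, so a long name spills
 * into the handler index. It is clamped to RECORD_LEN only so that this
 * demo never writes outside the record. */
void set_name_buggy(unsigned char *rec, const char *name, size_t len)
{
    if (len > RECORD_LEN) {
        len = RECORD_LEN;
    }
    memcpy(rec + NAME_OFF, name, len);
}

/* The place where the damage is noticed. Without the range check this
 * would index past the handler table, and that is where a debugger would
 * report the fault, far away from the copy that caused it. */
int dispatch(const unsigned char *rec)
{
    unsigned idx = rec[HANDLER_OFF];
    if (idx >= NUM_HANDLERS) {
        return -1;  /* corrupted index detected */
    }
    return handlers[idx]();
}
```

With a record whose handler index is 1, dispatch returns 2; after set_name_buggy(rec, "tipc-endpoint", 13) the index byte has been overwritten and dispatch returns -1, even though dispatch itself contains no bug.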


On Jul 4, 2011, at 5:08 AM, Xin He wrote:

Yes, it is an opal_object.

And this error seems to be caused by this code:

  void mca_btl_template_proc_construct(mca_btl_template_proc_t* template_proc){
      .......
      .........
      /* add to list of all proc instance */
      OPAL_THREAD_LOCK(&mca_btl_template_component.template_lock);
      opal_list_append(&mca_btl_template_component.template_procs, &template_proc->super);
      OPAL_THREAD_UNLOCK(&mca_btl_template_component.template_lock);
  }

/Xin

On 07/02/2011 10:49 PM, Jeff Squyres (jsquyres) wrote:
Do you know which object it is that is being constructed?  When you compile with 
debugging enabled, there are strings in the object struct that identify the file 
and line where the object was created.
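
For readers unfamiliar with the technique, here is a minimal C sketch of how an allocation macro can record the creating file and line in the object, similar in spirit to the file="pml_ob1.c", line=182 arguments visible in the gdb backtrace further down. The names tracked_obj_t, tracked_new_at and TRACKED_NEW are hypothetical and are not the OPAL macros or struct fields.

```c
#include <stdlib.h>

/* Hypothetical object that remembers where it was created. */
typedef struct {
    const char *created_file;  /* points at a __FILE__ string literal */
    int created_line;
    int value;
} tracked_obj_t;

/* Allocate a zeroed object and stamp it with the caller's location. */
static tracked_obj_t *tracked_new_at(const char *file, int line)
{
    tracked_obj_t *obj = calloc(1, sizeof(*obj));
    if (obj != NULL) {
        obj->created_file = file;
        obj->created_line = line;
    }
    return obj;
}

/* The macro expands at the call site, so __FILE__ and __LINE__ describe
 * the caller rather than this helper. */
#define TRACKED_NEW() tracked_new_at(__FILE__, __LINE__)
```

In a debugger, printing the object then shows created_file and created_line directly, which is usually enough to tell which higher-level object is being constructed.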

Sent from my phone. No type good.

On Jun 29, 2011, at 8:48 AM, "Xin He" <xin.i...@ericsson.com> wrote:


Hi,

As I advanced in my implementation of the TIPC BTL, I added the component and 
tried to run the hello_c program as a test.

Then I got this segmentation fault. It seems to happen after the call to 
"mca_btl_tipc_add_procs".

The error message displayed:

[oak:23192] *** Process received signal ***
[oak:23192] Signal: Segmentation fault (11)
[oak:23192] Signal code:  (128)
[oak:23192] Failing at address: (nil)
[oak:23192] [ 0] /lib/libpthread.so.0(+0xfb40) [0x7fec2a40fb40]
[oak:23192] [ 1] /usr/lib/libmpi.so.0(+0x1e6c10) [0x7fec2b2afc10]
[oak:23192] [ 2] /usr/lib/libmpi.so.0(+0x1e71f2) [0x7fec2b2b01f2]
[oak:23192] [ 3] /usr/lib/openmpi/mca_pml_ob1.so(+0x59f2) [0x7fec264fc9f2]
[oak:23192] [ 4] /usr/lib/openmpi/mca_pml_ob1.so(+0x5e5a) [0x7fec264fce5a]
[oak:23192] [ 5] /usr/lib/openmpi/mca_pml_ob1.so(+0x2386) [0x7fec264f9386]
[oak:23192] [ 6] /usr/lib/openmpi/mca_pml_ob1.so(+0x24a0) [0x7fec264f94a0]
[oak:23192] [ 7] /usr/lib/openmpi/mca_pml_ob1.so(+0x22fb) [0x7fec264f92fb]
[oak:23192] [ 8] /usr/lib/openmpi/mca_pml_ob1.so(+0x3a60) [0x7fec264faa60]
[oak:23192] [ 9] /usr/lib/libmpi.so.0(+0x67f51) [0x7fec2b130f51]
[oak:23192] [10] /usr/lib/libmpi.so.0(MPI_Init+0x173) [0x7fec2b161c33]
[oak:23192] [11] hello_i(main+0x22) [0x400936]
[oak:23192] [12] /lib/libc.so.6(__libc_start_main+0xfe) [0x7fec2a09bd8e]
[oak:23192] [13] hello_i() [0x400859]
[oak:23192] *** End of error message ***

I used gdb to check the stack:
(gdb) bt
#0  0x00007ffff7afac10 in opal_obj_run_constructors (object=0x6ca980)
    at ../opal/class/opal_object.h:427
#1  0x00007ffff7afb1f2 in opal_list_construct (list=0x6ca958)
    at class/opal_list.c:88
#2  0x00007ffff2d479f2 in opal_obj_run_constructors (object=0x6ca958)
    at ../../../../opal/class/opal_object.h:427
#3  0x00007ffff2d47e5a in mca_pml_ob1_comm_construct (comm=0x6ca8c0)
    at pml_ob1_comm.c:55
#4  0x00007ffff2d44386 in opal_obj_run_constructors (object=0x6ca8c0)
    at ../../../../opal/class/opal_object.h:427
#5  0x00007ffff2d444a0 in opal_obj_new (cls=0x7ffff2f6c040)
    at ../../../../opal/class/opal_object.h:477
#6  0x00007ffff2d442fb in opal_obj_new_debug (type=0x7ffff2f6c040,
    file=0x7ffff2d62840 "pml_ob1.c", line=182)
    at ../../../../opal/class/opal_object.h:252
#7  0x00007ffff2d45a60 in mca_pml_ob1_add_comm (comm=0x601060) at pml_ob1.c:182
#8  0x00007ffff797bf51 in ompi_mpi_init (argc=1, argv=0x7fffffffdf58, requested=0,
    provided=0x7fffffffde28) at runtime/ompi_mpi_init.c:770
#9  0x00007ffff79acc33 in PMPI_Init (argc=0x7fffffffde5c, argv=0x7fffffffde50)
    at pinit.c:84
#10 0x0000000000400936 in main (argc=1, argv=0x7fffffffdf58) at hello_c.c:17

It seems the error happens while an object is being constructed. Any idea why 
this is happening?

Thanks.

Best regards,
Xin


_______________________________________________
devel mailing list

de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel