Re: [OMPI devel] TIPC BTL Segmentation fault

2011-07-07 Thread Jeff Squyres
Sorry for the delay; this past weekend was a holiday in the US.  I'm just now 
catching up on the backlog.

Have you edited pml_ob1_comm.c?  For me, line 56 (on the trunk) is:

OBJ_CONSTRUCT(&comm->matching_lock, opal_mutex_t);

But clearly you seem to be executing the line above that:

OBJ_CONSTRUCT(&comm->wild_receives, opal_list_t);

I can't imagine why that line would segv -- it would imply that the "class 
definition" for opal_list_t is hosed in memory somehow.

Are you 100% sure that you're compiling / linking against your development copy 
of Open MPI, and not accidentally mixing it with some other OMPI installation 
at run time?  (e.g., via LD_LIBRARY_PATH or somesuch)

If you're not mixing installations, you might want to run hello_c through a 
debugger, put a watch on the opal_list_t_class variable, and see when it 
changes.  It should be initialized early in opal_init() somewhere and then 
used many times during MPI_Init() before the place where it fails.  The 
sentinel value 0xDEAFBEEDDEAFBEED is used in OMPI debug builds to mark an 
object that has been destroyed, but that should never happen to the 
opal_list_t_class instance itself.
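
To make the failure mode concrete, here is a minimal sketch in plain C. It is 
not Open MPI code: it assumes a class descriptor is simply a struct holding a 
constructor pointer, and toy_class_t, toy_class_poison, toy_class_is_sane, and 
toy_safe_construct are hypothetical names invented for this example.  Only the 
sentinel value is taken from the thread.  If a descriptor has been overwritten 
with the sentinel, calling its constructor means jumping to 
0xDEAFBEEDDEAFBEED, which matches what Valgrind reports; the sketch checks for 
that pattern instead of jumping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sentinel value quoted in the thread for destroyed objects in debug builds. */
#define TOY_SENTINEL UINT64_C(0xDEAFBEEDDEAFBEED)

typedef void (*toy_construct_fn)(void *obj);

/* Hypothetical stand-in for a class descriptor; NOT Open MPI's real layout. */
typedef struct toy_class {
    const char *name;
    toy_construct_fn construct;
} toy_class_t;

/* Number of pointer bytes that can be compared against the sentinel. */
static size_t toy_ptr_bytes(void)
{
    return sizeof(toy_construct_fn) < sizeof(uint64_t)
               ? sizeof(toy_construct_fn) : sizeof(uint64_t);
}

/* Example constructor: resets an int-sized "object" to zero. */
static void toy_list_construct(void *obj) { *(int *)obj = 0; }

/* Simulate the corruption: overwrite the constructor pointer's bytes. */
static void toy_class_poison(toy_class_t *cls)
{
    const uint64_t poison = TOY_SENTINEL;
    memcpy(&cls->construct, &poison, toy_ptr_bytes());
}

/* Returns 1 if the constructor pointer looks usable, 0 if poisoned or NULL. */
static int toy_class_is_sane(const toy_class_t *cls)
{
    const uint64_t poison = TOY_SENTINEL;
    if (memcmp(&cls->construct, &poison, toy_ptr_bytes()) == 0) {
        return 0; /* descriptor carries the sentinel pattern */
    }
    return cls->construct != NULL;
}

/* Runs the constructor only if the descriptor passes the check.
 * Returns 0 on success, -1 if the descriptor was refused. */
static int toy_safe_construct(const toy_class_t *cls, void *obj)
{
    assert(cls != NULL && obj != NULL);
    if (!toy_class_is_sane(cls)) {
        return -1;
    }
    cls->construct(obj);
    return 0;
}
```

The same idea, adapted to the real class structure, could be used as a 
temporary check just before the failing OBJ_CONSTRUCT to confirm whether the 
descriptor is already poisoned at that point; a debugger watchpoint would then 
show which code wrote the sentinel.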



On Jul 4, 2011, at 9:37 AM, Xin He wrote:

> Hi, here is the result:
> 
> ehhexxn@oak:~/git/test$ mpirun -n 2 -mca btl tipc,self valgrind ./hello_c > 
> 11.out
> ==30850== Memcheck, a memory error detector
> ==30850== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
> ==30850== Using Valgrind-3.6.0.SVN-Debian and LibVEX; rerun with -h for 
> copyright info
> ==30850== Command: ./hello_c
> ==30850==
> ==30849== Memcheck, a memory error detector
> ==30849== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
> ==30849== Using Valgrind-3.6.0.SVN-Debian and LibVEX; rerun with -h for 
> copyright info
> ==30849== Command: ./hello_c
> ==30849==
> ==30849== Jump to the invalid address stated on the next line
> ==30849==at 0xDEAFBEEDDEAFBEED: ???
> ==30849==by 0x50151F1: opal_list_construct (opal_list.c:88)
> ==30849==by 0xA8A49F1: opal_obj_run_constructors (opal_object.h:427)
> ==30849==by 0xA8A4E59: mca_pml_ob1_comm_construct (pml_ob1_comm.c:56)
> ==30849==by 0xA8A1385: opal_obj_run_constructors (opal_object.h:427)
> ==30849==by 0xA8A149F: opal_obj_new (opal_object.h:477)
> ==30849==by 0xA8A12FA: opal_obj_new_debug (opal_object.h:252)
> ==30849==by 0xA8A2A5F: mca_pml_ob1_add_comm (pml_ob1.c:182)
> ==30849==by 0x4E95F50: ompi_mpi_init (ompi_mpi_init.c:770)
> ==30849==by 0x4EC6C32: PMPI_Init (pinit.c:84)
> ==30849==by 0x400935: main (in /home/ehhexxn/git/test/hello_c)
> ==30849==  Address 0xdeafbeeddeafbeed is not stack'd, malloc'd or (recently) 
> free'd
> ==30849==
> [oak:30849] *** Process received signal ***
> [oak:30849] Signal: Segmentation fault (11)
> [oak:30849] Signal code: Invalid permissions (2)
> [oak:30849] Failing at address: 0xdeafbeeddeafbeed
> ==30849== Invalid read of size 1
> ==30849==at 0xA011FDB: ??? (in /lib/libgcc_s.so.1)
> ==30849==by 0xA012B0B: _Unwind_Backtrace (in /lib/libgcc_s.so.1)
> ==30849==by 0x60BE69D: backtrace (backtrace.c:91)
> ==30849==by 0x4FAB055: opal_backtrace_buffer (backtrace_execinfo.c:54)
> ==30849==by 0x5026DF3: show_stackframe (stacktrace.c:348)
> ==30849==by 0x5DB1B3F: ??? (in /lib/libpthread-2.12.1.so)
> ==30849==by 0xDEAFBEEDDEAFBEEC: ???
> ==30849==by 0x50151F1: opal_list_construct (opal_list.c:88)
> ==30849==by 0xA8A49F1: opal_obj_run_constructors (opal_object.h:427)
> ==30849==by 0xA8A4E59: mca_pml_ob1_comm_construct (pml_ob1_comm.c:56)
> ==30849==by 0xA8A1385: opal_obj_run_constructors (opal_object.h:427)
> ==30849==by 0xA8A149F: opal_obj_new (opal_object.h:477)
> ==30849==  Address 0xdeafbeeddeafbeed is not stack'd, malloc'd or (recently) 
> free'd
> ==30849==
> ==30849==
> ==30849== Process terminating with default action of signal 11 (SIGSEGV): 
> dumping core
> ==30849==  General Protection Fault
> ==30849==at 0xA011FDB: ??? (in /lib/libgcc_s.so.1)
> ==30849==by 0xA012B0B: _Unwind_Backtrace (in /lib/libgcc_s.so.1)
> ==30849==by 0x60BE69D: backtrace (backtrace.c:91)
> ==30849==by 0x4FAB055: opal_backtrace_buffer (backtrace_execinfo.c:54)
> ==30849==by 0x5026DF3: show_stackframe (stacktrace.c:348)
> ==30849==by 0x5DB1B3F: ??? (in /lib/libpthread-2.12.1.so)
> ==30849==by 0xDEAFBEEDDEAFBEEC: ???
> ==30849==by 0x50151F1: opal_list_construct (opal_list.c:88)
> ==30849==by 0xA8A49F1: opal_obj_run_constructors (opal_object.h:427)
> ==30849==by 0xA8A4E59: mca_pml_ob1_comm_construct (pml_ob1_comm.c:56)
> ==30849==by 0xA8A1385: opal_obj_run_constructors (opal_object.h:427)
> ==30849==by 0xA8A149F: opal_obj_new (opal_object.h:477)
> ==30850== Jump to the invalid address stated on the next line
> ==30850==at 0xDEAFBEEDDEAFBEED: ???
> ==30850==by 0x50151F1: opal_list_construct (opal_list.c:88)
> ==30850==by 0xA8A49F1: opal_obj_run_constructors (opal_

Re: [OMPI devel] TIPC BTL Segmentation fault

2011-07-04 Thread Xin He

Hi, here is the result:

ehhexxn@oak:~/git/test$ mpirun -n 2 -mca btl tipc,self valgrind ./hello_c > 11.out

==30850== Memcheck, a memory error detector
==30850== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==30850== Using Valgrind-3.6.0.SVN-Debian and LibVEX; rerun with -h for copyright info
==30850== Command: ./hello_c
==30850==
==30849== Memcheck, a memory error detector
==30849== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==30849== Using Valgrind-3.6.0.SVN-Debian and LibVEX; rerun with -h for copyright info
==30849== Command: ./hello_c
==30849==
==30849== Jump to the invalid address stated on the next line
==30849==at 0xDEAFBEEDDEAFBEED: ???
==30849==by 0x50151F1: opal_list_construct (opal_list.c:88)
==30849==by 0xA8A49F1: opal_obj_run_constructors (opal_object.h:427)
==30849==by 0xA8A4E59: mca_pml_ob1_comm_construct (pml_ob1_comm.c:56)
==30849==by 0xA8A1385: opal_obj_run_constructors (opal_object.h:427)
==30849==by 0xA8A149F: opal_obj_new (opal_object.h:477)
==30849==by 0xA8A12FA: opal_obj_new_debug (opal_object.h:252)
==30849==by 0xA8A2A5F: mca_pml_ob1_add_comm (pml_ob1.c:182)
==30849==by 0x4E95F50: ompi_mpi_init (ompi_mpi_init.c:770)
==30849==by 0x4EC6C32: PMPI_Init (pinit.c:84)
==30849==by 0x400935: main (in /home/ehhexxn/git/test/hello_c)
==30849==  Address 0xdeafbeeddeafbeed is not stack'd, malloc'd or (recently) free'd
==30849==
[oak:30849] *** Process received signal ***
[oak:30849] Signal: Segmentation fault (11)
[oak:30849] Signal code: Invalid permissions (2)
[oak:30849] Failing at address: 0xdeafbeeddeafbeed
==30849== Invalid read of size 1
==30849==at 0xA011FDB: ??? (in /lib/libgcc_s.so.1)
==30849==by 0xA012B0B: _Unwind_Backtrace (in /lib/libgcc_s.so.1)
==30849==by 0x60BE69D: backtrace (backtrace.c:91)
==30849==by 0x4FAB055: opal_backtrace_buffer (backtrace_execinfo.c:54)
==30849==by 0x5026DF3: show_stackframe (stacktrace.c:348)
==30849==by 0x5DB1B3F: ??? (in /lib/libpthread-2.12.1.so)
==30849==by 0xDEAFBEEDDEAFBEEC: ???
==30849==by 0x50151F1: opal_list_construct (opal_list.c:88)
==30849==by 0xA8A49F1: opal_obj_run_constructors (opal_object.h:427)
==30849==by 0xA8A4E59: mca_pml_ob1_comm_construct (pml_ob1_comm.c:56)
==30849==by 0xA8A1385: opal_obj_run_constructors (opal_object.h:427)
==30849==by 0xA8A149F: opal_obj_new (opal_object.h:477)
==30849==  Address 0xdeafbeeddeafbeed is not stack'd, malloc'd or (recently) free'd
==30849==
==30849==
==30849== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==30849==  General Protection Fault
==30849==at 0xA011FDB: ??? (in /lib/libgcc_s.so.1)
==30849==by 0xA012B0B: _Unwind_Backtrace (in /lib/libgcc_s.so.1)
==30849==by 0x60BE69D: backtrace (backtrace.c:91)
==30849==by 0x4FAB055: opal_backtrace_buffer (backtrace_execinfo.c:54)
==30849==by 0x5026DF3: show_stackframe (stacktrace.c:348)
==30849==by 0x5DB1B3F: ??? (in /lib/libpthread-2.12.1.so)
==30849==by 0xDEAFBEEDDEAFBEEC: ???
==30849==by 0x50151F1: opal_list_construct (opal_list.c:88)
==30849==by 0xA8A49F1: opal_obj_run_constructors (opal_object.h:427)
==30849==by 0xA8A4E59: mca_pml_ob1_comm_construct (pml_ob1_comm.c:56)
==30849==by 0xA8A1385: opal_obj_run_constructors (opal_object.h:427)
==30849==by 0xA8A149F: opal_obj_new (opal_object.h:477)
==30850== Jump to the invalid address stated on the next line
==30850==at 0xDEAFBEEDDEAFBEED: ???
==30850==by 0x50151F1: opal_list_construct (opal_list.c:88)
==30850==by 0xA8A49F1: opal_obj_run_constructors (opal_object.h:427)
==30850==by 0xA8A4E59: mca_pml_ob1_comm_construct (pml_ob1_comm.c:56)
==30850==by 0xA8A1385: opal_obj_run_constructors (opal_object.h:427)
==30850==by 0xA8A149F: opal_obj_new (opal_object.h:477)
==30850==by 0xA8A12FA: opal_obj_new_debug (opal_object.h:252)
==30850==by 0xA8A2A5F: mca_pml_ob1_add_comm (pml_ob1.c:182)
==30850==by 0x4E95F50: ompi_mpi_init (ompi_mpi_init.c:770)
==30850==by 0x4EC6C32: PMPI_Init (pinit.c:84)
==30850==by 0x400935: main (in /home/ehhexxn/git/test/hello_c)
==30850==  Address 0xdeafbeeddeafbeed is not stack'd, malloc'd or (recently) free'd
==30850==
[oak:30850] *** Process received signal ***
[oak:30850] Signal: Segmentation fault (11)
[oak:30850] Signal code: Invalid permissions (2)
[oak:30850] Failing at address: 0xdeafbeeddeafbeed
==30849==
==30849== HEAP SUMMARY:
==30849== in use at exit: 2,338,964 bytes in 3,213 blocks
==30849==   total heap usage: 5,205 allocs, 1,992 frees, 12,942,078 bytes allocated
==30849==
==30850== Invalid read of size 1
==30850==at 0xA011FDB: ??? (in /lib/libgcc_s.so.1)
==30850==by 0xA012B0B: _Unwind_Backtrace (in /lib/libgcc_s.so.1)
==30850==by 0x60BE69D: backtrace (backtrace.c:91)
==30850==by 0x4FAB055: opal_backtrace_buffer (backtrace_execinfo.c:54)
==30850==by 0x5026DF3: show_stackframe (stacktra

Re: [OMPI devel] TIPC BTL Segmentation fault

2011-07-04 Thread Jeff Squyres
Keep in mind, too, that opal_object is the "base" object -- put in C++ terms, 
it's the abstract base class that all other classes derive from.  So it's rare 
that we would create an opal_object by itself.  opal_objects are usually 
created as part of some other, higher-level object.
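
As a rough illustration of that relationship, the sketch below shows one 
common way to get "inheritance" in C: the derived struct embeds its parent as 
its first member, and construction runs the parent's constructor before the 
child's.  This is a simplified model under my own assumptions, not Open MPI's 
implementation; toy_object_t, toy_list_t, toy_class_t, and toy_construct are 
hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

typedef struct toy_class toy_class_t;
typedef void (*toy_ctor_t)(void *obj);

/* Hypothetical class descriptor: a name, a parent link, and a constructor. */
struct toy_class {
    const char *name;
    const toy_class_t *parent; /* NULL for the root class */
    toy_ctor_t construct;      /* may be NULL */
};

/* Root "base" object: every other object starts with one of these. */
typedef struct {
    const toy_class_t *cls;
    int refcount;
} toy_object_t;

/* Derived object: the base is embedded as the FIRST member, so a pointer
 * to a toy_list_t can also be used as a pointer to a toy_object_t. */
typedef struct {
    toy_object_t super;
    size_t length;
} toy_list_t;

static void toy_object_construct(void *obj)
{
    ((toy_object_t *)obj)->refcount = 1;
}

static void toy_list_construct(void *obj)
{
    ((toy_list_t *)obj)->length = 0;
}

static const toy_class_t toy_object_class = {
    "toy_object_t", NULL, toy_object_construct
};
static const toy_class_t toy_list_class = {
    "toy_list_t", &toy_object_class, toy_list_construct
};

/* Run constructors from the root class down to cls.
 * Returns how many constructors were run. */
static int toy_run_constructors(void *obj, const toy_class_t *cls)
{
    int count;
    if (cls == NULL) {
        return 0;
    }
    count = toy_run_constructors(obj, cls->parent);
    if (cls->construct != NULL) {
        cls->construct(obj);
        count++;
    }
    return count;
}

/* Record the object's class, then run the whole constructor chain. */
static int toy_construct(void *obj, const toy_class_t *cls)
{
    assert(obj != NULL && cls != NULL);
    ((toy_object_t *)obj)->cls = cls;
    return toy_run_constructors(obj, cls);
}
```

In this model, constructing a toy_list_t always constructs the embedded base 
first, which is why a fault "during the construction of the base object" can 
show up in the backtrace of any higher-level object.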

What's the full call stack of where Valgrind is showing the error?

Make sure you have the most recent valgrind (www.valgrind.org); the versions 
that ship in various distros may be somewhat old.  Newer valgrind versions show 
lots of things that older versions don't.  A new valgrind *might* be able to 
show some prior memory fault that is causing the issue...?


On Jul 4, 2011, at 7:45 AM, Xin He wrote:

> Hi,
> 
> I ran the program with valgrind, and it showed almost the same error. It 
> appeared that the segmentation fault happened during
> the initiation of an opal_object.  That's why it puzzled me.
> 
> /Xin
> 
> 
> On 07/04/2011 01:40 PM, Jeff Squyres wrote:
>> Ah -- so this is in the template code.  I suspect this code might have bit 
>> rotted a bit.  :-\
>> 
>> If you run this through valgrind, does anything obvious show up?  I ask 
>> because this kind of error is typically a symptom of the real error.  I.e., 
>> the real error was some kind of memory corruption that occurred earlier, and 
>> this is the memory access that exposes that prior memory corruption.
>> 
>> 
>> On Jul 4, 2011, at 5:08 AM, Xin He wrote:
>> 
>>> Yes, it is a opal_object.
>>> 
>>> And this error seems to be caused by these code:
>>> 
>>>  void mca_btl_template_proc_construct(mca_btl_template_proc_t* 
>>> template_proc){
>>> ...
>>>.
>>> /* add to list of all proc instance */
>>> OPAL_THREAD_LOCK(&mca_btl_template_component.template_lock);
>>> 
>>> opal_list_append(&mca_btl_template_component.template_procs,&template_proc->super);
>>> OPAL_THREAD_UNLOCK(&mca_btl_template_component.template_lock);
>>> }
>>> 
>>> /Xin
>>> 
>>> On 07/02/2011 10:49 PM, Jeff Squyres (jsquyres) wrote:
>>>> Do u know which object it is that is being constructed?  When you compile 
>>>> with debugging enabled, theres strings in the object struct that identify 
>>>> te file and line where the obj was created.
>>>> 
>>>> Sent from my phone. No type good.
>>>> 
>>>> On Jun 29, 2011, at 8:48 AM, "Xin He" wrote:
>>>> 
> Hi,
> 
> As I advanced in my implementation of TIPC BTL, I added the component and 
> tried to run hello_c program to test.
> 
> Then I got this segmentation fault. It seemed happening after the call 
> "mca_btl_tipc_add_procs".
> 
> The error message displayed:
> 
> [oak:23192] *** Process received signal ***
> [oak:23192] Signal: Segmentation fault (11)
> [oak:23192] Signal code:  (128)
> [oak:23192] Failing at address: (nil)
> [oak:23192] [ 0] /lib/libpthread.so.0(+0xfb40) [0x7fec2a40fb40]
> [oak:23192] [ 1] /usr/lib/libmpi.so.0(+0x1e6c10) [0x7fec2b2afc10]
> [oak:23192] [ 2] /usr/lib/libmpi.so.0(+0x1e71f2) [0x7fec2b2b01f2]
> [oak:23192] [ 3] /usr/lib/openmpi/mca_pml_ob1.so(+0x59f2) [0x7fec264fc9f2]
> [oak:23192] [ 4] /usr/lib/openmpi/mca_pml_ob1.so(+0x5e5a) [0x7fec264fce5a]
> [oak:23192] [ 5] /usr/lib/openmpi/mca_pml_ob1.so(+0x2386) [0x7fec264f9386]
> [oak:23192] [ 6] /usr/lib/openmpi/mca_pml_ob1.so(+0x24a0) [0x7fec264f94a0]
> [oak:23192] [ 7] /usr/lib/openmpi/mca_pml_ob1.so(+0x22fb) [0x7fec264f92fb]
> [oak:23192] [ 8] /usr/lib/openmpi/mca_pml_ob1.so(+0x3a60) [0x7fec264faa60]
> [oak:23192] [ 9] /usr/lib/libmpi.so.0(+0x67f51) [0x7fec2b130f51]
> [oak:23192] [10] /usr/lib/libmpi.so.0(MPI_Init+0x173) [0x7fec2b161c33]
> [oak:23192] [11] hello_i(main+0x22) [0x400936]
> [oak:23192] [12] /lib/libc.so.6(__libc_start_main+0xfe) [0x7fec2a09bd8e]
> [oak:23192] [13] hello_i() [0x400859]
> [oak:23192] *** End of error message ***
> 
> I used gdb to check the stack:
> (gdb) bt
> #0  0x77afac10 in opal_obj_run_constructors (object=0x6ca980)
>at ../opal/class/opal_object.h:427
> #1  0x77afb1f2 in opal_list_construct (list=0x6ca958) at 
> class/opal_list.c:88
> #2  0x72d479f2 in opal_obj_run_constructors (object=0x6ca958)
>at ../../../../opal/class/opal_object.h:427
> #3  0x72d47e5a in mca_pml_ob1_comm_construct (comm=0x6ca8c0)
>at pml_ob1_comm.c:55
> #4  0x72d44386 in opal_obj_run_constructors (object=0x6ca8c0)
>at ../../../../opal/class/opal_object.h:427
> #5  0x72d444a0 in opal_obj_new (cls=0x72f6c040)
>at ../../../../opal/class/opal_object.h:477
> #6  0x72d442fb in opal_obj_new_debug (type=0x72f6c040,
>file=0x72d62840 "pml_ob1.c", line=182)
>at ../../../../opal/class/opal_object.h:252
> #7  0x72d45a60 in mca_pml_ob1_add_comm (comm=0x601060) at 
> pml_ob1.c:182
> #8  0x7797bf51 in ompi_mpi_init (argc=1, argv=0x7fff

Re: [OMPI devel] TIPC BTL Segmentation fault

2011-07-04 Thread Xin He

Hi,

I ran the program with valgrind, and it showed almost the same error. The 
segmentation fault appears to happen during the initialization of an 
opal_object, which is what puzzles me.

/Xin


On 07/04/2011 01:40 PM, Jeff Squyres wrote:

Ah -- so this is in the template code.  I suspect this code might have bit 
rotted a bit.  :-\

If you run this through valgrind, does anything obvious show up?  I ask because 
this kind of error is typically a symptom of the real error.  I.e., the real 
error was some kind of memory corruption that occurred earlier, and this is the 
memory access that exposes that prior memory corruption.


On Jul 4, 2011, at 5:08 AM, Xin He wrote:


Yes, it is a opal_object.

And this error seems to be caused by these code:

  void mca_btl_template_proc_construct(mca_btl_template_proc_t* template_proc){
 ...
.
 /* add to list of all proc instance */
 OPAL_THREAD_LOCK(&mca_btl_template_component.template_lock);
 
opal_list_append(&mca_btl_template_component.template_procs,&template_proc->super);
 OPAL_THREAD_UNLOCK(&mca_btl_template_component.template_lock);
}

/Xin

On 07/02/2011 10:49 PM, Jeff Squyres (jsquyres) wrote:

Do u know which object it is that is being constructed?  When you compile with 
debugging enabled, theres strings in the object struct that identify te file 
and line where the obj was created.

Sent from my phone. No type good.

On Jun 29, 2011, at 8:48 AM, "Xin He" wrote:



Hi,

As I advanced in my implementation of TIPC BTL, I added the component and tried 
to run hello_c program to test.

Then I got this segmentation fault. It seemed happening after the call 
"mca_btl_tipc_add_procs".

The error message displayed:

[oak:23192] *** Process received signal ***
[oak:23192] Signal: Segmentation fault (11)
[oak:23192] Signal code:  (128)
[oak:23192] Failing at address: (nil)
[oak:23192] [ 0] /lib/libpthread.so.0(+0xfb40) [0x7fec2a40fb40]
[oak:23192] [ 1] /usr/lib/libmpi.so.0(+0x1e6c10) [0x7fec2b2afc10]
[oak:23192] [ 2] /usr/lib/libmpi.so.0(+0x1e71f2) [0x7fec2b2b01f2]
[oak:23192] [ 3] /usr/lib/openmpi/mca_pml_ob1.so(+0x59f2) [0x7fec264fc9f2]
[oak:23192] [ 4] /usr/lib/openmpi/mca_pml_ob1.so(+0x5e5a) [0x7fec264fce5a]
[oak:23192] [ 5] /usr/lib/openmpi/mca_pml_ob1.so(+0x2386) [0x7fec264f9386]
[oak:23192] [ 6] /usr/lib/openmpi/mca_pml_ob1.so(+0x24a0) [0x7fec264f94a0]
[oak:23192] [ 7] /usr/lib/openmpi/mca_pml_ob1.so(+0x22fb) [0x7fec264f92fb]
[oak:23192] [ 8] /usr/lib/openmpi/mca_pml_ob1.so(+0x3a60) [0x7fec264faa60]
[oak:23192] [ 9] /usr/lib/libmpi.so.0(+0x67f51) [0x7fec2b130f51]
[oak:23192] [10] /usr/lib/libmpi.so.0(MPI_Init+0x173) [0x7fec2b161c33]
[oak:23192] [11] hello_i(main+0x22) [0x400936]
[oak:23192] [12] /lib/libc.so.6(__libc_start_main+0xfe) [0x7fec2a09bd8e]
[oak:23192] [13] hello_i() [0x400859]
[oak:23192] *** End of error message ***

I used gdb to check the stack:
(gdb) bt
#0  0x77afac10 in opal_obj_run_constructors (object=0x6ca980)
at ../opal/class/opal_object.h:427
#1  0x77afb1f2 in opal_list_construct (list=0x6ca958) at 
class/opal_list.c:88
#2  0x72d479f2 in opal_obj_run_constructors (object=0x6ca958)
at ../../../../opal/class/opal_object.h:427
#3  0x72d47e5a in mca_pml_ob1_comm_construct (comm=0x6ca8c0)
at pml_ob1_comm.c:55
#4  0x72d44386 in opal_obj_run_constructors (object=0x6ca8c0)
at ../../../../opal/class/opal_object.h:427
#5  0x72d444a0 in opal_obj_new (cls=0x72f6c040)
at ../../../../opal/class/opal_object.h:477
#6  0x72d442fb in opal_obj_new_debug (type=0x72f6c040,
file=0x72d62840 "pml_ob1.c", line=182)
at ../../../../opal/class/opal_object.h:252
#7  0x72d45a60 in mca_pml_ob1_add_comm (comm=0x601060) at pml_ob1.c:182
#8  0x7797bf51 in ompi_mpi_init (argc=1, argv=0x7fffdf58, 
requested=0,
provided=0x7fffde28) at runtime/ompi_mpi_init.c:770
#9  0x779acc33 in PMPI_Init (argc=0x7fffde5c, argv=0x7fffde50)
at pinit.c:84
#10 0x00400936 in main (argc=1, argv=0x7fffdf58) at hello_c.c:17

It seems the error happened when an object is constructed. Any idea why this is 
happening?

Thanks.

Best regards,
Xin


___
devel mailing list

de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

Re: [OMPI devel] TIPC BTL Segmentation fault

2011-07-04 Thread Jeff Squyres
Ah -- so this is in the template code.  I suspect this code might have bit 
rotted a bit.  :-\

If you run this through valgrind, does anything obvious show up?  I ask because 
this kind of error is typically a symptom of the real error.  I.e., the real 
error was some kind of memory corruption that occurred earlier, and this is the 
memory access that exposes that prior memory corruption.
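
One way to picture "the crash is only the symptom" is with guard words: known 
values placed on either side of a buffer and checked at chosen points, so that 
an overrun is noticed near the write that caused it rather than at some later, 
unrelated access.  The sketch below is a generic illustration of that 
technique and not something Open MPI or Valgrind is claimed to do in this 
form; toy_guarded_t, TOY_GUARD, and the helper functions are hypothetical 
names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Arbitrary guard pattern chosen for this example. */
#define TOY_GUARD UINT64_C(0xA5A5A5A5A5A5A5A5)

/* A small buffer with a guard word on each side. */
typedef struct {
    uint64_t guard_before;
    unsigned char payload[8];
    uint64_t guard_after;
} toy_guarded_t;

static void toy_guarded_init(toy_guarded_t *g)
{
    g->guard_before = TOY_GUARD;
    memset(g->payload, 0, sizeof(g->payload));
    g->guard_after = TOY_GUARD;
}

/* Returns 1 if both guard words are intact, 0 if either was overwritten. */
static int toy_guarded_ok(const toy_guarded_t *g)
{
    return g->guard_before == TOY_GUARD && g->guard_after == TOY_GUARD;
}

/* Deliberately buggy writer: fills len bytes starting at payload without
 * checking len against the payload size.  It writes through a pointer to
 * the whole struct, so the overrun stays inside the object and the demo
 * itself remains well defined. */
static void toy_buggy_fill(toy_guarded_t *g, unsigned char value, size_t len)
{
    unsigned char *bytes = (unsigned char *)g;
    assert(offsetof(toy_guarded_t, payload) + len <= sizeof(*g));
    memset(bytes + offsetof(toy_guarded_t, payload), value, len);
}
```

Calling toy_guarded_ok() after each suspicious operation narrows down which 
write did the damage: an 8-byte fill leaves the guards intact, while a 12-byte 
fill spills into guard_after and is caught at the next check, long before 
anything tries to use the clobbered data.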


On Jul 4, 2011, at 5:08 AM, Xin He wrote:

> Yes, it is a opal_object.
> 
> And this error seems to be caused by these code:
> 
>  void mca_btl_template_proc_construct(mca_btl_template_proc_t* template_proc){
> ...
>.
> /* add to list of all proc instance */
> OPAL_THREAD_LOCK(&mca_btl_template_component.template_lock);
> opal_list_append(&mca_btl_template_component.template_procs, 
> &template_proc->super);
> OPAL_THREAD_UNLOCK(&mca_btl_template_component.template_lock);
> }
> 
> /Xin
> 
> On 07/02/2011 10:49 PM, Jeff Squyres (jsquyres) wrote:
>> Do u know which object it is that is being constructed?  When you compile 
>> with debugging enabled, theres strings in the object struct that identify te 
>> file and line where the obj was created. 
>> 
>> Sent from my phone. No type good. 
>> 
>> On Jun 29, 2011, at 8:48 AM, "Xin He" wrote:
>> 
>> 
>>> Hi,
>>> 
>>> As I advanced in my implementation of TIPC BTL, I added the component and 
>>> tried to run hello_c program to test.
>>> 
>>> Then I got this segmentation fault. It seemed happening after the call 
>>> "mca_btl_tipc_add_procs".
>>> 
>>> The error message displayed:
>>> 
>>> [oak:23192] *** Process received signal ***
>>> [oak:23192] Signal: Segmentation fault (11)
>>> [oak:23192] Signal code:  (128)
>>> [oak:23192] Failing at address: (nil)
>>> [oak:23192] [ 0] /lib/libpthread.so.0(+0xfb40) [0x7fec2a40fb40]
>>> [oak:23192] [ 1] /usr/lib/libmpi.so.0(+0x1e6c10) [0x7fec2b2afc10]
>>> [oak:23192] [ 2] /usr/lib/libmpi.so.0(+0x1e71f2) [0x7fec2b2b01f2]
>>> [oak:23192] [ 3] /usr/lib/openmpi/mca_pml_ob1.so(+0x59f2) [0x7fec264fc9f2]
>>> [oak:23192] [ 4] /usr/lib/openmpi/mca_pml_ob1.so(+0x5e5a) [0x7fec264fce5a]
>>> [oak:23192] [ 5] /usr/lib/openmpi/mca_pml_ob1.so(+0x2386) [0x7fec264f9386]
>>> [oak:23192] [ 6] /usr/lib/openmpi/mca_pml_ob1.so(+0x24a0) [0x7fec264f94a0]
>>> [oak:23192] [ 7] /usr/lib/openmpi/mca_pml_ob1.so(+0x22fb) [0x7fec264f92fb]
>>> [oak:23192] [ 8] /usr/lib/openmpi/mca_pml_ob1.so(+0x3a60) [0x7fec264faa60]
>>> [oak:23192] [ 9] /usr/lib/libmpi.so.0(+0x67f51) [0x7fec2b130f51]
>>> [oak:23192] [10] /usr/lib/libmpi.so.0(MPI_Init+0x173) [0x7fec2b161c33]
>>> [oak:23192] [11] hello_i(main+0x22) [0x400936]
>>> [oak:23192] [12] /lib/libc.so.6(__libc_start_main+0xfe) [0x7fec2a09bd8e]
>>> [oak:23192] [13] hello_i() [0x400859]
>>> [oak:23192] *** End of error message ***
>>> 
>>> I used gdb to check the stack:
>>> (gdb) bt
>>> #0  0x77afac10 in opal_obj_run_constructors (object=0x6ca980)
>>>at ../opal/class/opal_object.h:427
>>> #1  0x77afb1f2 in opal_list_construct (list=0x6ca958) at 
>>> class/opal_list.c:88
>>> #2  0x72d479f2 in opal_obj_run_constructors (object=0x6ca958)
>>>at ../../../../opal/class/opal_object.h:427
>>> #3  0x72d47e5a in mca_pml_ob1_comm_construct (comm=0x6ca8c0)
>>>at pml_ob1_comm.c:55
>>> #4  0x72d44386 in opal_obj_run_constructors (object=0x6ca8c0)
>>>at ../../../../opal/class/opal_object.h:427
>>> #5  0x72d444a0 in opal_obj_new (cls=0x72f6c040)
>>>at ../../../../opal/class/opal_object.h:477
>>> #6  0x72d442fb in opal_obj_new_debug (type=0x72f6c040,
>>>file=0x72d62840 "pml_ob1.c", line=182)
>>>at ../../../../opal/class/opal_object.h:252
>>> #7  0x72d45a60 in mca_pml_ob1_add_comm (comm=0x601060) at 
>>> pml_ob1.c:182
>>> #8  0x7797bf51 in ompi_mpi_init (argc=1, argv=0x7fffdf58, 
>>> requested=0,
>>>provided=0x7fffde28) at runtime/ompi_mpi_init.c:770
>>> #9  0x779acc33 in PMPI_Init (argc=0x7fffde5c, 
>>> argv=0x7fffde50)
>>>at pinit.c:84
>>> #10 0x00400936 in main (argc=1, argv=0x7fffdf58) at hello_c.c:17
>>> 
>>> It seems the error happened when an object is constructed. Any idea why 
>>> this is happening?
>>> 
>>> Thanks.
>>> 
>>> Best regards,
>>> Xin
>>> 
>>> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] TIPC BTL Segmentation fault

2011-07-04 Thread Xin He

Yes, it is an opal_object.

And this error seems to be caused by this code:

void mca_btl_template_proc_construct(mca_btl_template_proc_t* template_proc){
    ...
    .
    /* add to list of all proc instance */
    OPAL_THREAD_LOCK(&mca_btl_template_component.template_lock);
    opal_list_append(&mca_btl_template_component.template_procs,
                     &template_proc->super);
    OPAL_THREAD_UNLOCK(&mca_btl_template_component.template_lock);
}

/Xin

On 07/02/2011 10:49 PM, Jeff Squyres (jsquyres) wrote:

Do u know which object it is that is being constructed?  When you compile with 
debugging enabled, theres strings in the object struct that identify te file 
and line where the obj was created.

Sent from my phone. No type good.

On Jun 29, 2011, at 8:48 AM, "Xin He"  wrote:


Hi,

As I advanced in my implementation of TIPC BTL, I added the component and tried 
to run hello_c program to test.

Then I got this segmentation fault. It seemed happening after the call 
"mca_btl_tipc_add_procs".

The error message displayed:

[oak:23192] *** Process received signal ***
[oak:23192] Signal: Segmentation fault (11)
[oak:23192] Signal code:  (128)
[oak:23192] Failing at address: (nil)
[oak:23192] [ 0] /lib/libpthread.so.0(+0xfb40) [0x7fec2a40fb40]
[oak:23192] [ 1] /usr/lib/libmpi.so.0(+0x1e6c10) [0x7fec2b2afc10]
[oak:23192] [ 2] /usr/lib/libmpi.so.0(+0x1e71f2) [0x7fec2b2b01f2]
[oak:23192] [ 3] /usr/lib/openmpi/mca_pml_ob1.so(+0x59f2) [0x7fec264fc9f2]
[oak:23192] [ 4] /usr/lib/openmpi/mca_pml_ob1.so(+0x5e5a) [0x7fec264fce5a]
[oak:23192] [ 5] /usr/lib/openmpi/mca_pml_ob1.so(+0x2386) [0x7fec264f9386]
[oak:23192] [ 6] /usr/lib/openmpi/mca_pml_ob1.so(+0x24a0) [0x7fec264f94a0]
[oak:23192] [ 7] /usr/lib/openmpi/mca_pml_ob1.so(+0x22fb) [0x7fec264f92fb]
[oak:23192] [ 8] /usr/lib/openmpi/mca_pml_ob1.so(+0x3a60) [0x7fec264faa60]
[oak:23192] [ 9] /usr/lib/libmpi.so.0(+0x67f51) [0x7fec2b130f51]
[oak:23192] [10] /usr/lib/libmpi.so.0(MPI_Init+0x173) [0x7fec2b161c33]
[oak:23192] [11] hello_i(main+0x22) [0x400936]
[oak:23192] [12] /lib/libc.so.6(__libc_start_main+0xfe) [0x7fec2a09bd8e]
[oak:23192] [13] hello_i() [0x400859]
[oak:23192] *** End of error message ***

I used gdb to check the stack:
(gdb) bt
#0  0x77afac10 in opal_obj_run_constructors (object=0x6ca980)
at ../opal/class/opal_object.h:427
#1  0x77afb1f2 in opal_list_construct (list=0x6ca958) at 
class/opal_list.c:88
#2  0x72d479f2 in opal_obj_run_constructors (object=0x6ca958)
at ../../../../opal/class/opal_object.h:427
#3  0x72d47e5a in mca_pml_ob1_comm_construct (comm=0x6ca8c0)
at pml_ob1_comm.c:55
#4  0x72d44386 in opal_obj_run_constructors (object=0x6ca8c0)
at ../../../../opal/class/opal_object.h:427
#5  0x72d444a0 in opal_obj_new (cls=0x72f6c040)
at ../../../../opal/class/opal_object.h:477
#6  0x72d442fb in opal_obj_new_debug (type=0x72f6c040,
file=0x72d62840 "pml_ob1.c", line=182)
at ../../../../opal/class/opal_object.h:252
#7  0x72d45a60 in mca_pml_ob1_add_comm (comm=0x601060) at pml_ob1.c:182
#8  0x7797bf51 in ompi_mpi_init (argc=1, argv=0x7fffdf58, 
requested=0,
provided=0x7fffde28) at runtime/ompi_mpi_init.c:770
#9  0x779acc33 in PMPI_Init (argc=0x7fffde5c, argv=0x7fffde50)
at pinit.c:84
#10 0x00400936 in main (argc=1, argv=0x7fffdf58) at hello_c.c:17

It seems the error happened when an object is constructed. Any idea why this is 
happening?

Thanks.

Best regards,
Xin





Re: [OMPI devel] TIPC BTL Segmentation fault

2011-07-02 Thread Jeff Squyres (jsquyres)
Do you know which object it is that is being constructed?  When you compile 
with debugging enabled, there are strings in the object struct that identify 
the file and line where the obj was created. 

Sent from my phone. No type good. 

On Jun 29, 2011, at 8:48 AM, "Xin He"  wrote:

> Hi,
> 
> As I advanced in my implementation of TIPC BTL, I added the component and 
> tried to run hello_c program to test.
> 
> Then I got this segmentation fault. It seemed happening after the call 
> "mca_btl_tipc_add_procs".
> 
> The error message displayed:
> 
> [oak:23192] *** Process received signal ***
> [oak:23192] Signal: Segmentation fault (11)
> [oak:23192] Signal code:  (128)
> [oak:23192] Failing at address: (nil)
> [oak:23192] [ 0] /lib/libpthread.so.0(+0xfb40) [0x7fec2a40fb40]
> [oak:23192] [ 1] /usr/lib/libmpi.so.0(+0x1e6c10) [0x7fec2b2afc10]
> [oak:23192] [ 2] /usr/lib/libmpi.so.0(+0x1e71f2) [0x7fec2b2b01f2]
> [oak:23192] [ 3] /usr/lib/openmpi/mca_pml_ob1.so(+0x59f2) [0x7fec264fc9f2]
> [oak:23192] [ 4] /usr/lib/openmpi/mca_pml_ob1.so(+0x5e5a) [0x7fec264fce5a]
> [oak:23192] [ 5] /usr/lib/openmpi/mca_pml_ob1.so(+0x2386) [0x7fec264f9386]
> [oak:23192] [ 6] /usr/lib/openmpi/mca_pml_ob1.so(+0x24a0) [0x7fec264f94a0]
> [oak:23192] [ 7] /usr/lib/openmpi/mca_pml_ob1.so(+0x22fb) [0x7fec264f92fb]
> [oak:23192] [ 8] /usr/lib/openmpi/mca_pml_ob1.so(+0x3a60) [0x7fec264faa60]
> [oak:23192] [ 9] /usr/lib/libmpi.so.0(+0x67f51) [0x7fec2b130f51]
> [oak:23192] [10] /usr/lib/libmpi.so.0(MPI_Init+0x173) [0x7fec2b161c33]
> [oak:23192] [11] hello_i(main+0x22) [0x400936]
> [oak:23192] [12] /lib/libc.so.6(__libc_start_main+0xfe) [0x7fec2a09bd8e]
> [oak:23192] [13] hello_i() [0x400859]
> [oak:23192] *** End of error message ***
> 
> I used gdb to check the stack:
> (gdb) bt
> #0  0x77afac10 in opal_obj_run_constructors (object=0x6ca980)
>at ../opal/class/opal_object.h:427
> #1  0x77afb1f2 in opal_list_construct (list=0x6ca958) at 
> class/opal_list.c:88
> #2  0x72d479f2 in opal_obj_run_constructors (object=0x6ca958)
>at ../../../../opal/class/opal_object.h:427
> #3  0x72d47e5a in mca_pml_ob1_comm_construct (comm=0x6ca8c0)
>at pml_ob1_comm.c:55
> #4  0x72d44386 in opal_obj_run_constructors (object=0x6ca8c0)
>at ../../../../opal/class/opal_object.h:427
> #5  0x72d444a0 in opal_obj_new (cls=0x72f6c040)
>at ../../../../opal/class/opal_object.h:477
> #6  0x72d442fb in opal_obj_new_debug (type=0x72f6c040,
>file=0x72d62840 "pml_ob1.c", line=182)
>at ../../../../opal/class/opal_object.h:252
> #7  0x72d45a60 in mca_pml_ob1_add_comm (comm=0x601060) at 
> pml_ob1.c:182
> #8  0x7797bf51 in ompi_mpi_init (argc=1, argv=0x7fffdf58, 
> requested=0,
>provided=0x7fffde28) at runtime/ompi_mpi_init.c:770
> #9  0x779acc33 in PMPI_Init (argc=0x7fffde5c, argv=0x7fffde50)
>at pinit.c:84
> #10 0x00400936 in main (argc=1, argv=0x7fffdf58) at hello_c.c:17
> 
> It seems the error happened when an object is constructed. Any idea why this 
> is happening?
> 
> Thanks.
> 
> Best regards,
> Xin
> 
> 



[OMPI devel] TIPC BTL Segmentation fault

2011-06-29 Thread Xin He

Hi,

As I advanced in my implementation of the TIPC BTL, I added the component 
and tried to run the hello_c program to test it.

Then I got this segmentation fault. It seems to happen after the call to 
"mca_btl_tipc_add_procs".


The error message displayed:

[oak:23192] *** Process received signal ***
[oak:23192] Signal: Segmentation fault (11)
[oak:23192] Signal code:  (128)
[oak:23192] Failing at address: (nil)
[oak:23192] [ 0] /lib/libpthread.so.0(+0xfb40) [0x7fec2a40fb40]
[oak:23192] [ 1] /usr/lib/libmpi.so.0(+0x1e6c10) [0x7fec2b2afc10]
[oak:23192] [ 2] /usr/lib/libmpi.so.0(+0x1e71f2) [0x7fec2b2b01f2]
[oak:23192] [ 3] /usr/lib/openmpi/mca_pml_ob1.so(+0x59f2) [0x7fec264fc9f2]
[oak:23192] [ 4] /usr/lib/openmpi/mca_pml_ob1.so(+0x5e5a) [0x7fec264fce5a]
[oak:23192] [ 5] /usr/lib/openmpi/mca_pml_ob1.so(+0x2386) [0x7fec264f9386]
[oak:23192] [ 6] /usr/lib/openmpi/mca_pml_ob1.so(+0x24a0) [0x7fec264f94a0]
[oak:23192] [ 7] /usr/lib/openmpi/mca_pml_ob1.so(+0x22fb) [0x7fec264f92fb]
[oak:23192] [ 8] /usr/lib/openmpi/mca_pml_ob1.so(+0x3a60) [0x7fec264faa60]
[oak:23192] [ 9] /usr/lib/libmpi.so.0(+0x67f51) [0x7fec2b130f51]
[oak:23192] [10] /usr/lib/libmpi.so.0(MPI_Init+0x173) [0x7fec2b161c33]
[oak:23192] [11] hello_i(main+0x22) [0x400936]
[oak:23192] [12] /lib/libc.so.6(__libc_start_main+0xfe) [0x7fec2a09bd8e]
[oak:23192] [13] hello_i() [0x400859]
[oak:23192] *** End of error message ***

I used gdb to check the stack:
(gdb) bt
#0  0x77afac10 in opal_obj_run_constructors (object=0x6ca980)
    at ../opal/class/opal_object.h:427
#1  0x77afb1f2 in opal_list_construct (list=0x6ca958) at class/opal_list.c:88
#2  0x72d479f2 in opal_obj_run_constructors (object=0x6ca958)
    at ../../../../opal/class/opal_object.h:427
#3  0x72d47e5a in mca_pml_ob1_comm_construct (comm=0x6ca8c0)
    at pml_ob1_comm.c:55
#4  0x72d44386 in opal_obj_run_constructors (object=0x6ca8c0)
    at ../../../../opal/class/opal_object.h:427
#5  0x72d444a0 in opal_obj_new (cls=0x72f6c040)
    at ../../../../opal/class/opal_object.h:477
#6  0x72d442fb in opal_obj_new_debug (type=0x72f6c040,
    file=0x72d62840 "pml_ob1.c", line=182)
    at ../../../../opal/class/opal_object.h:252
#7  0x72d45a60 in mca_pml_ob1_add_comm (comm=0x601060) at pml_ob1.c:182
#8  0x7797bf51 in ompi_mpi_init (argc=1, argv=0x7fffdf58, requested=0,
    provided=0x7fffde28) at runtime/ompi_mpi_init.c:770
#9  0x779acc33 in PMPI_Init (argc=0x7fffde5c, argv=0x7fffde50)
    at pinit.c:84
#10 0x00400936 in main (argc=1, argv=0x7fffdf58) at hello_c.c:17

It seems the error happens while an object is being constructed. Any idea 
why?


Thanks.

Best regards,
Xin