Re: [OMPI devel] Fix a hang in carto_base_select() if carto_module_init() fails

2011-07-07 Thread Jeff Squyres
I'd go even slightly simpler than that:

Index: opal/mca/carto/base/carto_base_select.c
===================================================================
--- opal/mca/carto/base/carto_base_select.c (revision 24842)
+++ opal/mca/carto/base/carto_base_select.c (working copy)
@@ -64,10 +64,7 @@
 cleanup:
     /* Initialize the winner */
     if (NULL != opal_carto_base_module) {
-        if (OPAL_SUCCESS != (ret = opal_carto_base_module->carto_module_init()) ) {
-            exit_status = ret;
-            goto cleanup;
-        }
+        exit_status = opal_carto_base_module->carto_module_init();
     }
 
     return exit_status;

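For anyone following along: the hang in the pre-patch code comes from that
failure branch's "goto cleanup;" -- the jump target is the very label the
branch already sits under, so if carto_module_init() keeps failing the
function just loops. A standalone sketch of the two control flows (stand-in
names and error codes, not the real OPAL code):

#include <stdio.h>

#define OPAL_SUCCESS 0
#define OPAL_ERROR  -1

/* Stand-in for a carto_module_init() that always fails. */
static int failing_module_init(void) { return OPAL_ERROR; }

static int select_pre_patch(void)
{
    int exit_status = OPAL_SUCCESS, ret;

cleanup:
    /* Pre-patch shape: the error branch jumps back to this very label,
       so a persistent init failure spins here forever. */
    if (OPAL_SUCCESS != (ret = failing_module_init())) {
        exit_status = ret;
        goto cleanup;                    /* <-- the hang */
    }
    return exit_status;
}

static int select_post_patch(void)
{
    /* Post-patch shape: just hand back whatever the init returns. */
    return failing_module_init();
}

int main(void)
{
    printf("post-patch returns %d\n", select_post_patch());
    /* select_pre_patch() would never return with a failing init, so it is
       deliberately not called here. */
    (void)select_pre_patch;
    return 0;
}
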


On Jun 28, 2011, at 3:02 AM, nadia.derbey wrote:

> Hi,
> 
> When using the carto/file module with a syntactically incorrect carto
> file, we get stuck into opal_carto_base_select().
> 
> The attached trivial patch fixes the issue.
> 
> Regards,
> Nadia
> 
> 
> -- 
> nadia.derbey 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] TIPC BTL Segmentation fault

2011-07-07 Thread Jeff Squyres
Sorry for the delay; this past weekend was a holiday in the US.  I'm just now 
catching up on the backlog.

Have you edited pml_ob1_comm.c?  For me, line 56 (on the trunk) is:

OBJ_CONSTRUCT(&comm->matching_lock, opal_mutex_t);

But clearly you seem to be executing the line above that:

OBJ_CONSTRUCT(&comm->wild_receives, opal_list_t);

I can't imagine why that line would segv -- it would imply that the "class 
definition" for opal_list_t is hosed in memory somehow.

Are you 100% sure that you're compiling / linking against your development copy 
of Open MPI, and not accidentally mixing it with some other OMPI installation 
at run time?  (e.g., via LD_LIBRARY_PATH or somesuch)

If you're not, you might want to run hello_c through a debugger and put a watch 
on the opal_list_t_class variable and see when it changes.  It should be 
initialed early in opal_init() somewhere and then used many times during 
MPI_Init() before the place where it fails.  The sentinel value 
0xDEAFBEEDDEAFBEED is used in OMPI debug builds to mean that it's an object 
that has been destroyed.  But this should never happen in the opal_list_t_class 
instance itself.
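
To make the failure mode concrete: OBJ_CONSTRUCT ends up walking a table of 
constructor function pointers stored in the class descriptor and calling each 
one, so if opal_list_t_class gets overwritten, the very first indirect call 
lands on garbage -- which is exactly the "jump to 0xDEAFBEEDDEAFBEED" in your 
valgrind trace. A rough, self-contained model of that mechanism (simplified 
stand-ins, not the actual opal_object.h definitions):

#include <stdint.h>
#include <stdio.h>

/* Very rough model of the OPAL object system: a class descriptor holds a
 * NULL-terminated array of constructor function pointers, and constructing
 * an object simply calls each one.  Names here are stand-ins for
 * opal_class_t / OBJ_CONSTRUCT. */
typedef void (*construct_fn_t)(void *obj);

typedef struct {
    const char     *cls_name;
    construct_fn_t *cls_construct_array;   /* NULL-terminated */
} fake_class_t;

static void list_construct(void *obj) { (void)obj; /* set up list sentinels */ }

static construct_fn_t list_ctors[] = { list_construct, NULL };
static fake_class_t fake_list_class = { "fake_list_t", list_ctors };

static void construct_object(void *obj, fake_class_t *cls)
{
    /* Blind indirect calls through the class descriptor. */
    for (construct_fn_t *c = cls->cls_construct_array; NULL != *c; ++c) {
        (*c)(obj);
    }
}

int main(void)
{
    char obj[64];

    construct_object(obj, &fake_list_class);          /* fine */

    /* If something scribbles the debug "destroyed" sentinel over the class,
     * the next construction jumps straight to 0xDEAFBEEDDEAFBEED -- the same
     * signature as the valgrind trace below.  (Left commented out on purpose.)
     *
     *   list_ctors[0] = (construct_fn_t)(uintptr_t)0xDEAFBEEDDEAFBEEDULL;
     *   construct_object(obj, &fake_list_class);
     */

    puts("constructed ok");
    return 0;
}
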



On Jul 4, 2011, at 9:37 AM, Xin He wrote:

> Hi, here is the result:
> 
> ehhexxn@oak:~/git/test$ mpirun -n 2 -mca btl tipc,self valgrind ./hello_c > 
> 11.out
> ==30850== Memcheck, a memory error detector
> ==30850== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
> ==30850== Using Valgrind-3.6.0.SVN-Debian and LibVEX; rerun with -h for 
> copyright info
> ==30850== Command: ./hello_c
> ==30850==
> ==30849== Memcheck, a memory error detector
> ==30849== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
> ==30849== Using Valgrind-3.6.0.SVN-Debian and LibVEX; rerun with -h for 
> copyright info
> ==30849== Command: ./hello_c
> ==30849==
> ==30849== Jump to the invalid address stated on the next line
> ==30849==at 0xDEAFBEEDDEAFBEED: ???
> ==30849==by 0x50151F1: opal_list_construct (opal_list.c:88)
> ==30849==by 0xA8A49F1: opal_obj_run_constructors (opal_object.h:427)
> ==30849==by 0xA8A4E59: mca_pml_ob1_comm_construct (pml_ob1_comm.c:56)
> ==30849==by 0xA8A1385: opal_obj_run_constructors (opal_object.h:427)
> ==30849==by 0xA8A149F: opal_obj_new (opal_object.h:477)
> ==30849==by 0xA8A12FA: opal_obj_new_debug (opal_object.h:252)
> ==30849==by 0xA8A2A5F: mca_pml_ob1_add_comm (pml_ob1.c:182)
> ==30849==by 0x4E95F50: ompi_mpi_init (ompi_mpi_init.c:770)
> ==30849==by 0x4EC6C32: PMPI_Init (pinit.c:84)
> ==30849==by 0x400935: main (in /home/ehhexxn/git/test/hello_c)
> ==30849==  Address 0xdeafbeeddeafbeed is not stack'd, malloc'd or (recently) 
> free'd
> ==30849==
> [oak:30849] *** Process received signal ***
> [oak:30849] Signal: Segmentation fault (11)
> [oak:30849] Signal code: Invalid permissions (2)
> [oak:30849] Failing at address: 0xdeafbeeddeafbeed
> ==30849== Invalid read of size 1
> ==30849==at 0xA011FDB: ??? (in /lib/libgcc_s.so.1)
> ==30849==by 0xA012B0B: _Unwind_Backtrace (in /lib/libgcc_s.so.1)
> ==30849==by 0x60BE69D: backtrace (backtrace.c:91)
> ==30849==by 0x4FAB055: opal_backtrace_buffer (backtrace_execinfo.c:54)
> ==30849==by 0x5026DF3: show_stackframe (stacktrace.c:348)
> ==30849==by 0x5DB1B3F: ??? (in /lib/libpthread-2.12.1.so)
> ==30849==by 0xDEAFBEEDDEAFBEEC: ???
> ==30849==by 0x50151F1: opal_list_construct (opal_list.c:88)
> ==30849==by 0xA8A49F1: opal_obj_run_constructors (opal_object.h:427)
> ==30849==by 0xA8A4E59: mca_pml_ob1_comm_construct (pml_ob1_comm.c:56)
> ==30849==by 0xA8A1385: opal_obj_run_constructors (opal_object.h:427)
> ==30849==by 0xA8A149F: opal_obj_new (opal_object.h:477)
> ==30849==  Address 0xdeafbeeddeafbeed is not stack'd, malloc'd or (recently) 
> free'd
> ==30849==
> ==30849==
> ==30849== Process terminating with default action of signal 11 (SIGSEGV): 
> dumping core
> ==30849==  General Protection Fault
> ==30849==at 0xA011FDB: ??? (in /lib/libgcc_s.so.1)
> ==30849==by 0xA012B0B: _Unwind_Backtrace (in /lib/libgcc_s.so.1)
> ==30849==by 0x60BE69D: backtrace (backtrace.c:91)
> ==30849==by 0x4FAB055: opal_backtrace_buffer (backtrace_execinfo.c:54)
> ==30849==by 0x5026DF3: show_stackframe (stacktrace.c:348)
> ==30849==by 0x5DB1B3F: ??? (in /lib/libpthread-2.12.1.so)
> ==30849==by 0xDEAFBEEDDEAFBEEC: ???
> ==30849==by 0x50151F1: opal_list_construct (opal_list.c:88)
> ==30849==by 0xA8A49F1: opal_obj_run_constructors (opal_object.h:427)
> ==30849==by 0xA8A4E59: mca_pml_ob1_comm_construct (pml_ob1_comm.c:56)
> ==30849==by 0xA8A1385: opal_obj_run_constructors (opal_object.h:427)
> ==30849==by 0xA8A149F: opal_obj_new (opal_object.h:477)
> ==30850== Jump to the invalid address stated on the next line
> ==30850==at 0xDEAFBEEDDEAFBEED: ???
> ==30850==by 0x50151F1: opal_list_construct (opal_list.c:88)
> ==30850==by 0xA8A49F1: opal_obj_run_constructors (opal_

Re: [OMPI devel] Question about hanging mpirun

2011-07-07 Thread Jeff Squyres
On Jul 5, 2011, at 2:21 PM, Ralph Castain wrote:

>> Ok I think I figured out what the deadlock in my application was... and 
>> please confirm if this makes sense:
>> 1. There was an 'if' condition that was met, causing 2 (out of 3) of my 
>> processes to call MPI_Finalize(). 
>> 2. The remaining process was still trying to run and at some point was 
>> making calls like MPI_Recv(), MPI_Send() and MPI_Wait() while the 
>> other two processes were at MPI_Finalize() (although they would never 
>> exit). The application would hang at that point, but the program was too big 
>> for me to figure out where exactly the lonely running process would hang. 
>> 3. I am no expert on openmpi, so I would appreciate it if someone can 
>> confirm if this was an expected behavior. I addressed the condition and now 
>> all processes run their course.
> 
> That is correct behavior for MPI - i.e., if one process is rattling off MPI 
> requests while the others have already entered finalize, then the job will 
> hang since the requests cannot possibly be met and that proc never calls 
> finalize to release completion of the job.

One clarification on this point...

If process A calls MPI_Send to process B and that send completes before B 
actually receives the message (e.g., if the message was small and there were no 
other messages pending between A and B), and then A calls MPI_Finalize, then B 
can still legally call MPI_Recv to receive the outstanding message.  That 
scenario should work fine.

What doesn't work is if you initiate new communication with a process that has 
called MPI_Finalize -- e.g., if you MPI_Send to a finalized process, or you try 
to MPI_Recv a message that wasn't sent before the peer finalized.
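
For concreteness, a minimal sketch of both cases (ranks and tags are arbitrary; 
run with at least 2 processes, e.g. "mpirun -n 2 ./a.out"):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, x = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (0 == rank) {
        /* Legal: the send completes (small message), then we finalize. */
        MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (1 == rank) {
        /* Still legal: receiving a message that was sent before the peer
         * finalized works fine, even if rank 0 is already in MPI_Finalize. */
        MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* NOT legal: initiating *new* communication with a peer that may
         * already be finalized -- e.g. the line below -- can hang or abort.
         *
         *   MPI_Send(&x, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
         */
    }

    MPI_Finalize();
    printf("rank %d done\n", rank);
    return 0;
}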

Make sense?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/