On Wed, 2009-09-09 at 17:44 +0200, Thomas Ropars wrote:

Thank you.  I think you missed the top three lines of the output, but that
doesn't matter.
> main() at ?:?
> PMPI_Comm_dup() at pcomm_dup.c:62
> ompi_comm_dup() at communicator/comm.c:661
> -----------------
> [0,2] (2 processes)
> -----------------
> ompi_comm_nextcid() at communicator/comm_cid.c:264
> ompi_comm_allreduce_intra() at communicator/comm_cid.c:619
> ompi_coll_tuned_allreduce_intra_dec_fixed() at coll_tuned_decision_fixed.c:61
> ompi_coll_tuned_allreduce_intra_recursivedoubling() at coll_tuned_allreduce.c:223
> ompi_request_default_wait_all() at request/req_wait.c:262
> opal_condition_wait() at ../opal/threads/condition.h:99
> -----------------
> [1,3] (2 processes)
> -----------------
> ompi_comm_nextcid() at communicator/comm_cid.c:245
> ompi_comm_allreduce_intra() at communicator/comm_cid.c:619
> ompi_coll_tuned_allreduce_intra_dec_fixed() at coll_tuned_decision_fixed.c:61
> ompi_coll_tuned_allreduce_intra_recursivedoubling() at coll_tuned_allreduce.c:223
> ompi_request_default_wait_all() at request/req_wait.c:262
> opal_condition_wait() at ../opal/threads/condition.h:99

Lines 264 and 245 of comm_cid.c both sit inside a for loop which calls
allreduce() twice per iteration until a certain condition is met, so it's
hard to tell from this trace whether processes [0,2] are "ahead" or [1,3]
are "behind".  Either way you look at it, though, the allreduce() should not
deadlock like that, so from the trace it's as likely to be a bug in the
allreduce as in ompi_comm_nextcid().

I assume all four processes are actually in the same call to comm_dup;
re-compiling your program with -g and re-running padb would confirm this,
as it would then show the line numbers.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
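
P.S.  Purely to illustrate why the two line numbers are ambiguous, below is
a rough sketch of the pattern I described above: a loop that makes two
allreduce calls per iteration until all ranks agree on a context id.  It is
NOT copied from the Open MPI source and all the names in it are made up.
With two ranks blocked at the first call site and two at the second, the
stack alone cannot tell you which pair has fallen behind.  Compile the real
test case with "mpicc -g" so padb can report the source lines for main().

#include <mpi.h>
#include <stdio.h>

/* Sketch only: stands in for the kind of loop found in ompi_comm_nextcid(). */
static int pick_next_cid_sketch(MPI_Comm comm)
{
    int candidate = 0;
    int agreed = 0;
    int all_ok = 0;

    while (!all_ok) {
        candidate++;   /* pretend this is the lowest locally free id */

        /* First collective of the iteration (first call site in the loop). */
        MPI_Allreduce(&candidate, &agreed, 1, MPI_INT, MPI_MAX, comm);

        int ok = 1;    /* pretend we check that "agreed" is still free here */

        /* Second collective of the iteration (second call site in the loop). */
        MPI_Allreduce(&ok, &all_ok, 1, MPI_INT, MPI_MIN, comm);
    }
    return agreed;
}

int main(int argc, char **argv)
{
    int rank, cid;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cid = pick_next_cid_sketch(MPI_COMM_WORLD);
    printf("rank %d agreed on cid %d\n", rank, cid);
    MPI_Finalize();
    return 0;
}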
