On Wed, 2009-09-09 at 17:44 +0200, Thomas Ropars wrote:

Thank you.  I think you missed the top three lines of the output, but that
doesn't matter.
> main() at ?:?
> PMPI_Comm_dup() at pcomm_dup.c:62
> ompi_comm_dup() at communicator/comm.c:661
> -----------------
> [0,2] (2 processes)
> -----------------
> ompi_comm_nextcid() at communicator/comm_cid.c:264
> ompi_comm_allreduce_intra() at communicator/comm_cid.c:619
> ompi_coll_tuned_allreduce_intra_dec_fixed() at coll_tuned_decision_fixed.c:61
> ompi_coll_tuned_allreduce_intra_recursivedoubling() at coll_tuned_allreduce.c:223
> ompi_request_default_wait_all() at request/req_wait.c:262
> opal_condition_wait() at ../opal/threads/condition.h:99
> -----------------
> [1,3] (2 processes)
> -----------------
> ompi_comm_nextcid() at communicator/comm_cid.c:245
> ompi_comm_allreduce_intra() at communicator/comm_cid.c:619
> ompi_coll_tuned_allreduce_intra_dec_fixed() at coll_tuned_decision_fixed.c:61
> ompi_coll_tuned_allreduce_intra_recursivedoubling() at coll_tuned_allreduce.c:223
> ompi_request_default_wait_all() at request/req_wait.c:262
> opal_condition_wait() at ../opal/threads/condition.h:99

Lines 264 and 245 of comm_cid.c both sit inside a for loop which calls
allreduce() twice per iteration until a certain condition is met, so it's
hard to tell from this trace whether processes [0,2] are "ahead" or [1,3]
are "behind".  Either way you look at it, though, the allreduce() should not
deadlock like that, so from the trace it's as likely to be a bug in the
allreduce as in ompi_comm_nextcid().

I assume all four processes are actually in the same call to comm_dup;
re-compiling your program with -g and re-running padb would confirm this,
as it would then show the line numbers.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
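
P.S.  Purely to illustrate why the two line numbers are ambiguous, below is
a rough sketch of the pattern I described above: a loop that makes two
allreduce calls per iteration until all ranks agree on a context id.  It is
NOT copied from the Open MPI source and all the names in it are made up.
With two ranks blocked at the first call site and two at the second, the
stack alone cannot tell you which pair has fallen behind.  Compile the real
test case with "mpicc -g" so padb can report the source lines for main().

#include <mpi.h>
#include <stdio.h>

/* Sketch only: stands in for the kind of loop found in ompi_comm_nextcid(). */
static int pick_next_cid_sketch(MPI_Comm comm)
{
    int candidate = 0;
    int agreed = 0;
    int all_ok = 0;

    while (!all_ok) {
        candidate++;   /* pretend this is the lowest locally free id */

        /* First collective of the iteration (first call site in the loop). */
        MPI_Allreduce(&candidate, &agreed, 1, MPI_INT, MPI_MAX, comm);

        int ok = 1;    /* pretend we check that "agreed" is still free here */

        /* Second collective of the iteration (second call site in the loop). */
        MPI_Allreduce(&ok, &all_ok, 1, MPI_INT, MPI_MIN, comm);
    }
    return agreed;
}

int main(int argc, char **argv)
{
    int rank, cid;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cid = pick_next_cid_sketch(MPI_COMM_WORLD);
    printf("rank %d agreed on cid %d\n", rank, cid);
    MPI_Finalize();
    return 0;
}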
