Thanks Tim - I found the problem and will commit a fix shortly. Appreciate your testing and reporting!
On 3/27/08 8:24 AM, "Tim Prins" <tpr...@cs.indiana.edu> wrote:

> This commit breaks things for me. Running on 3 nodes of odin:
>
> mpirun -mca btl tcp,sm,self examples/ring_c
>
> causes a hang. All of the processes are stuck in
> orte_grpcomm_base_barrier during MPI_Finalize. Not all programs hang,
> and the ring program does not hang all the time, but fairly often.
>
> Tim
>
> r...@osl.iu.edu wrote:
>> Author: rhc
>> Date: 2008-03-24 16:50:31 EDT (Mon, 24 Mar 2008)
>> New Revision: 17941
>> URL: https://svn.open-mpi.org/trac/ompi/changeset/17941
>>
>> Log:
>> Fix the allgather and allgather_list functions to avoid deadlocks at large
>> node/proc counts. Violated the RML rules here - we received the allgather
>> buffer and then did an xcast, which causes a send to go out, and is then
>> subsequently received by the sender. This fix breaks that pattern by forcing
>> the recv to complete outside of the function itself - thus, the allgather and
>> allgather_list always complete their recvs before returning or sending.
>>
>> Reorganize the grpcomm code a little to provide support for soon-to-come new
>> grpcomm components. The revised organization puts what will be common code
>> elements in the base to avoid duplication, while allowing components that
>> don't need those functions to ignore them.
>>
>> Added:
>>    trunk/orte/mca/grpcomm/base/grpcomm_base_allgather.c
>>    trunk/orte/mca/grpcomm/base/grpcomm_base_barrier.c
>>    trunk/orte/mca/grpcomm/base/grpcomm_base_modex.c
>> Text files modified:
>>    trunk/orte/mca/grpcomm/base/Makefile.am                |   5
>>    trunk/orte/mca/grpcomm/base/base.h                     |  23 +
>>    trunk/orte/mca/grpcomm/base/grpcomm_base_close.c       |   4
>>    trunk/orte/mca/grpcomm/base/grpcomm_base_open.c        |   1
>>    trunk/orte/mca/grpcomm/base/grpcomm_base_select.c      | 121 ++---
>>    trunk/orte/mca/grpcomm/basic/grpcomm_basic.h           |  16
>>    trunk/orte/mca/grpcomm/basic/grpcomm_basic_component.c |  30 -
>>    trunk/orte/mca/grpcomm/basic/grpcomm_basic_module.c    | 845 ++-------------------------------------
>>    trunk/orte/mca/grpcomm/cnos/grpcomm_cnos.h             |   8
>>    trunk/orte/mca/grpcomm/cnos/grpcomm_cnos_component.c   |   8
>>    trunk/orte/mca/grpcomm/cnos/grpcomm_cnos_module.c      |  21
>>    trunk/orte/mca/grpcomm/grpcomm.h                       |  45 +
>>    trunk/orte/mca/rml/rml_types.h                         |  31
>>    trunk/orte/orted/orted_comm.c                          |  27 +
>>    14 files changed, 226 insertions(+), 959 deletions(-)
>>
>>
>> Diff not shown due to size (92619 bytes).
>> To see the diff, run the following command:
>>
>> svn diff -r 17940:17941 --no-diff-deleted
>>
>> _______________________________________________
>> svn mailing list
>> s...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/svn
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
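
For anyone following the r17941 log quoted above, here is a rough sketch of the pattern it describes: the recv callback only records the buffer, and the follow-on send is issued after the recv has completed, outside the callback. This is a toy Python model and not ORTE code. `ToyRml`, `post_recv`, `send`, `progress`, and `allgather_then_xcast` are all hypothetical names, and the "no send from inside a recv callback" check is only my reading of the RML rule the log mentions.

```python
from collections import deque


class ToyRml:
    """Hypothetical stand-in for a messaging layer; NOT the ORTE RML API."""

    def __init__(self):
        self.queue = deque()      # messages waiting to be delivered
        self.handlers = {}        # tag -> recv callback
        self.in_callback = False  # True while a recv callback is running

    def post_recv(self, tag, callback):
        self.handlers[tag] = callback

    def send(self, tag, payload):
        # Assumed rule being modeled: no sends from inside a recv callback.
        if self.in_callback:
            raise RuntimeError("send issued from inside a recv callback")
        self.queue.append((tag, payload))

    def progress(self):
        """Deliver one queued message; return False if nothing was queued."""
        if not self.queue:
            return False
        tag, payload = self.queue.popleft()
        self.in_callback = True
        try:
            self.handlers[tag](payload)
        finally:
            self.in_callback = False
        return True


def allgather_then_xcast(rml, contribution):
    """Complete the recv first, then send, as the commit log describes."""
    state = {"done": False, "buffer": None}

    def on_allgather(payload):
        # Only record the data here; the follow-on send happens later.
        state["buffer"] = payload
        state["done"] = True

    received = []
    rml.post_recv("allgather", on_allgather)
    rml.post_recv("xcast", received.append)

    rml.send("allgather", contribution)
    while not state["done"]:
        rml.progress()            # the recv completes outside the callback

    rml.send("xcast", state["buffer"])  # safe: no callback is active now
    while rml.progress():
        pass
    return received
```

In this toy, calling `rml.send(...)` directly from `on_allgather` would raise, which stands in for the hang seen in the real code; moving the send after the wait loop is the part that mirrors the fix.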