Appears fixed with r17992 - at least, it works on TM, slurm (odin), and Mac.
On 3/27/08 11:06 AM, "Ralph H Castain" <r...@lanl.gov> wrote:
> Found the problem - should have a fix committed soon. Issue is with
> differences in the number of daemons launched by the various plms (whether
> or not procs are launched local to mpirun).
>
>
> On 3/27/08 10:39 AM, "Ralph H Castain" <r...@lanl.gov> wrote:
>
>> Hmmm...puzzling. It is working fine for me on TM machines and on my Mac.
>> However, Galen reports it borked on alps as well.
>>
>> I'll have to dig a little to check this out and see if there is something
>> missing on those PLMs. Will get back shortly.
>>
>> Sorry for problem
>>
>>
>> On 3/27/08 10:28 AM, "Tim Prins" <tpr...@cs.indiana.edu> wrote:
>>
>>> Unfortunately now with r17988 I cannot run any mpi programs, they seem
>>> to hang in the modex.
>>>
>>> Tim
>>>
>>> Ralph H Castain wrote:
>>>> Thanks Tim - I found the problem and will commit a fix shortly.
>>>>
>>>> Appreciate your testing and reporting!
>>>>
>>>>
>>>> On 3/27/08 8:24 AM, "Tim Prins" <tpr...@cs.indiana.edu> wrote:
>>>>
>>>>> This commit breaks things for me. Running on 3 nodes of odin:
>>>>>
>>>>> mpirun -mca btl tcp,sm,self examples/ring_c
>>>>>
>>>>> causes a hang. All of the processes are stuck in
>>>>> orte_grpcomm_base_barrier during MPI_Finalize. Not all programs hang,
>>>>> and the ring program does not hang all the time, but fairly often.
>>>>>
>>>>> Tim
>>>>>
>>>>> r...@osl.iu.edu wrote:
>>>>>> Author: rhc
>>>>>> Date: 2008-03-24 16:50:31 EDT (Mon, 24 Mar 2008)
>>>>>> New Revision: 17941
>>>>>> URL: https://svn.open-mpi.org/trac/ompi/changeset/17941
>>>>>>
>>>>>> Log:
>>>>>> Fix the allgather and allgather_list functions to avoid deadlocks at
>>>>>> large node/proc counts. Violated the RML rules here - we received the
>>>>>> allgather buffer and then did an xcast, which causes a send to go out,
>>>>>> and is then subsequently received by the sender.
>>>>>> This fix breaks that pattern by forcing the recv to complete outside
>>>>>> of the function itself - thus, the allgather and allgather_list always
>>>>>> complete their recvs before returning or sending.
>>>>>>
>>>>>> Reorganize the grpcomm code a little to provide support for soon-to-come
>>>>>> new grpcomm components. The revised organization puts what will be
>>>>>> common code elements in the base to avoid duplication, while allowing
>>>>>> components that don't need those functions to ignore them.
>>>>>>
>>>>>> Added:
>>>>>>    trunk/orte/mca/grpcomm/base/grpcomm_base_allgather.c
>>>>>>    trunk/orte/mca/grpcomm/base/grpcomm_base_barrier.c
>>>>>>    trunk/orte/mca/grpcomm/base/grpcomm_base_modex.c
>>>>>> Text files modified:
>>>>>>    trunk/orte/mca/grpcomm/base/Makefile.am | 5
>>>>>>    trunk/orte/mca/grpcomm/base/base.h | 23 +
>>>>>>    trunk/orte/mca/grpcomm/base/grpcomm_base_close.c | 4
>>>>>>    trunk/orte/mca/grpcomm/base/grpcomm_base_open.c | 1
>>>>>>    trunk/orte/mca/grpcomm/base/grpcomm_base_select.c | 121 ++---
>>>>>>    trunk/orte/mca/grpcomm/basic/grpcomm_basic.h | 16
>>>>>>    trunk/orte/mca/grpcomm/basic/grpcomm_basic_component.c | 30 -
>>>>>>    trunk/orte/mca/grpcomm/basic/grpcomm_basic_module.c | 845 ++-------------------------------------
>>>>>>    trunk/orte/mca/grpcomm/cnos/grpcomm_cnos.h | 8
>>>>>>    trunk/orte/mca/grpcomm/cnos/grpcomm_cnos_component.c | 8
>>>>>>    trunk/orte/mca/grpcomm/cnos/grpcomm_cnos_module.c | 21
>>>>>>    trunk/orte/mca/grpcomm/grpcomm.h | 45 +
>>>>>>    trunk/orte/mca/rml/rml_types.h | 31
>>>>>>    trunk/orte/orted/orted_comm.c | 27 +
>>>>>>    14 files changed, 226 insertions(+), 959 deletions(-)
>>>>>>
>>>>>>
>>>>>> Diff not shown due to size (92619 bytes).
>>>>>> To see the diff, run the following command:
>>>>>>
>>>>>> svn diff -r 17940:17941 --no-diff-deleted
>>>>>>
>>>>>> _______________________________________________
>>>>>> svn mailing list
>>>>>> s...@open-mpi.org
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/svn
>>>>> _______________________________________________
>>>>> devel mailing list
>>>>> de...@open-mpi.org
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel