I’ve managed to create a 100% reproducer - I’ll try to track this down as quickly as I can. Meantime, I’m working on that internal timeout so we don’t hang in case anything else interferes.
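
For the curious, here is roughly what I have in mind for that timeout: instead of letting the client sit in a blocking recv() waiting for the daemon's connect ack, wait on the socket with poll() up to a deadline and error out cleanly if nothing arrives. This is only a minimal sketch of the pattern, not the actual change; PMIX_CONNECT_TIMEOUT_SECS and recv_bounded() are made-up names for illustration:

/* Sketch only, not the real PMIx code.  Bound each wait for data with
 * poll() so a silent daemon produces a clean error instead of a hang. */
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>

#define PMIX_CONNECT_TIMEOUT_SECS 10            /* hypothetical default */

static int recv_bounded(int sd, void *buf, size_t size)
{
    char *ptr = (char *)buf;
    size_t remain = size;

    while (remain > 0) {
        struct pollfd pfd = { .fd = sd, .events = POLLIN };
        int rc = poll(&pfd, 1, PMIX_CONNECT_TIMEOUT_SECS * 1000);
        if (rc == 0) {
            return -1;                          /* timed out: error out cleanly */
        }
        if (rc < 0) {
            if (errno == EINTR) continue;
            return -1;
        }

        ssize_t n = recv(sd, ptr, remain, 0);
        if (n <= 0) {
            if (n < 0 && (errno == EINTR || errno == EAGAIN)) continue;
            return -1;                          /* peer closed or hard error */
        }
        ptr += n;
        remain -= (size_t)n;
    }
    return 0;
}
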
> On Sep 3, 2015, at 12:53 PM, Howard Pritchard <hpprit...@gmail.com> wrote:
>
> Hi Ralph,
>
> If it's any help, the first run has yet to hang. It's always one of the
> subsequent mpiruns (and hence why it's the Fortran) that shows this problem.
>
> Howard
>
> 2015-09-03 13:52 GMT-06:00 Ralph Castain <r...@open-mpi.org>:
>
> Thanks! I'll at least try, and can certainly provide some diag output (just
> have to live thru it when it doesn't fail, and hopefully it won't change the
> timing so much that it won't reproduce any more)
>
>> On Sep 3, 2015, at 12:44 PM, Howard Pritchard <hpprit...@gmail.com> wrote:
>>
>> Hi Ralph,
>>
>> Warning that it seems to be hard to reproduce, at least on the UH server.
>>
>> Howard
>>
>> 2015-09-03 13:12 GMT-06:00 Ralph Castain <r...@open-mpi.org>:
>>
>> I'll try to replicate, and provide some diagnostics targeting this exchange.
>> What is happening is that the client process is attempting to connect to the
>> ORTE daemon, and for some reason the connection isn't generating a response
>> from the daemon.
>>
>> I'll also add a timeout function in there so we don't hang when this
>> happens, but instead cleanly error out.
>>
>>> On Sep 3, 2015, at 11:15 AM, Howard Pritchard <hpprit...@gmail.com> wrote:
>>>
>>> Hi Folks,
>>>
>>> I'm seeing again a case of a hang (yes, I'm going to start using timeout)
>>> of a two-process run on the IU jenkins server for master. This is the
>>> --disable-dlopen jenkins project for the IU jenkins server.
>>>
>>> I attached to the hanging processes and got this for a backtrace:
>>>
>>> #0  0x00007fdd4ca7ae94 in recv () from /lib64/libpthread.so.0
>>> #1  0x00007fdd4bab622a in opal_pmix_pmix1xx_pmix_usock_recv_blocking (sd=13, data=0x7fff9342fb78 "&", size=4)
>>>     at src/usock/usock.c:157
>>> #2  0x00007fdd4babad69 in recv_connect_ack (sd=13) at src/client/pmix_client.c:777
>>> #3  0x00007fdd4babbc59 in usock_connect (addr=0x7fff9342fe80) at src/client/pmix_client.c:1026
>>> #4  0x00007fdd4bab88ae in connect_to_server (address=0x7fff9342fe80, cbdata=0x7fff9342fc30) at src/client/pmix_client.c:177
>>> #5  0x00007fdd4bab90f7 in OPAL_PMIX_PMIX1XX_PMIx_Init (proc=0x7fdd4c2e9820 <myproc>) at src/client/pmix_client.c:329
>>> #6  0x00007fdd4bff1892 in pmix1_client_init () at pmix1_client.c:58
>>> #7  0x00007fdd4c37ce1d in pmi_component_query (module=0x7fff9342ffd0, priority=0x7fff9342ffcc) at ess_pmi_component.c:89
>>> #8  0x00007fdd4bf54c38 in mca_base_select (type_name=0x7fdd4c45e5b9 "ess", output_id=-1,
>>>     components_available=0x7fdd4c6b21d0 <orte_ess_base_framework+80>, best_module=0x7fff93430000, best_component=0x7fff93430008)
>>>     at mca_base_components_select.c:73
>>> #9  0x00007fdd4c373f0d in orte_ess_base_select () at base/ess_base_select.c:39
>>> #10 0x00007fdd4c312fed in orte_init (pargc=0x0, pargv=0x0, flags=32) at runtime/orte_init.c:221
>>> #11 0x00007fdd4d788e26 in ompi_mpi_init (argc=0, argv=0x0, requested=0, provided=0x7fff934300fc) at runtime/ompi_mpi_init.c:468
>>> #12 0x00007fdd4d7be27a in PMPI_Init (argc=0x7fff93430138, argv=0x7fff93430130) at pinit.c:84
>>> #13 0x00007fdd4dce515e in ompi_init_f (ierr=0x7fff9343043c) at pinit_f.c:82
>>> #14 0x0000000000400dff in MAIN__ ()
>>> #15 0x0000000000400f38 in main ()
>>>
>>> This seems to only happen periodically.
>>>
>>> Any suggestions on how to further analyze?
>>>
>>> Howard
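
P.S. For anyone trying to follow the backtrace above: frames #1 and #2 are the client doing a blocking read of the 4-byte connect ack it expects from the ORTE daemon right after connect() succeeds, so if the daemon accepts the connection but never answers, recv() simply never returns. Roughly the shape of that path (a paraphrase, not the actual usock.c/pmix_client.c source; the helper names and the ack_hdr type are invented for illustration):

/* Paraphrase of the hanging path in the backtrace, not the real code. */
#include <errno.h>
#include <stdint.h>
#include <sys/socket.h>

/* Read exactly "size" bytes; blocks indefinitely if the peer never sends. */
static int usock_recv_blocking_sketch(int sd, void *buf, size_t size)
{
    char *ptr = (char *)buf;
    size_t remain = size;

    while (remain > 0) {
        ssize_t n = recv(sd, ptr, remain, 0);   /* frame #0 blocks here */
        if (n < 0) {
            if (errno == EINTR || errno == EAGAIN) continue;
            return -1;
        }
        if (n == 0) {
            return -1;                          /* daemon closed the socket */
        }
        ptr += n;
        remain -= (size_t)n;
    }
    return 0;
}

/* Frame #2: after connect() succeeds, wait for the daemon's 4-byte ack.
 * If the daemon never replies, the client sits here forever, which is
 * the hang in the backtrace. */
static int recv_connect_ack_sketch(int sd)
{
    uint32_t ack_hdr;                           /* invented stand-in type */
    return usock_recv_blocking_sketch(sd, &ack_hdr, sizeof(ack_hdr));
}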