I’ve managed to create a 100% reproducer, so I’ll try to track this down as
quickly as I can. In the meantime, I’m working on that internal timeout so we
don’t hang if anything else interferes.
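
The rough shape of what I have in mind (purely a sketch, not the actual usock
code; the helper name below is made up) is to bound the blocking handshake
read with a receive timeout, so it errors out instead of blocking forever:

#include <errno.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>

/* Sketch only: read exactly 'size' bytes, but give up after 'seconds'
 * instead of blocking forever if the daemon never answers. */
static int recv_with_timeout(int sd, void *buf, size_t size, int seconds)
{
    struct timeval tv = { .tv_sec = seconds, .tv_usec = 0 };
    size_t got = 0;

    /* ask the kernel to abort a blocked recv() after 'seconds' */
    if (setsockopt(sd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0) {
        return -1;
    }

    while (got < size) {
        ssize_t rc = recv(sd, (char *)buf + got, size - got, 0);
        if (rc > 0) {
            got += (size_t)rc;
        } else if (rc == 0) {
            return -1;    /* peer closed the socket */
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            return -1;    /* timed out; error out instead of hanging */
        } else if (errno != EINTR) {
            return -1;    /* hard error */
        }
    }
    return 0;
}

That blocking read is the recv() sitting at the top of Howard's backtrace
below.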


> On Sep 3, 2015, at 12:53 PM, Howard Pritchard <hpprit...@gmail.com> wrote:
> 
> Hi Ralph,
> 
> If it's any help, the first run has yet to hang. It's always one of the
> subsequent mpirun invocations (which is why it's the Fortran test) that
> shows this problem.
> 
> Howard
> 
> 
> 2015-09-03 13:52 GMT-06:00 Ralph Castain <r...@open-mpi.org>:
> Thanks! I’ll at least try, and I can certainly provide some diagnostic
> output (I just have to live through the runs where it doesn’t fail, and
> hopefully the extra output won’t change the timing so much that it no
> longer reproduces).
> 
>> On Sep 3, 2015, at 12:44 PM, Howard Pritchard <hpprit...@gmail.com> wrote:
>> 
>> Hi Ralph,
>> 
>> Just a warning: this seems to be hard to reproduce, at least on the UH server.
>> 
>> Howard
>> 
>> 
>> 2015-09-03 13:12 GMT-06:00 Ralph Castain <r...@open-mpi.org>:
>> I’ll try to replicate, and provide some diagnostics targeting this exchange. 
>> What is happening is that the client process is attempting to connect to the 
>> ORTE daemon, and for some reason the connection isn’t generating a response 
>> from the daemon.
>> 
>> I’ll also add a timeout function in there so we don’t hang when this 
>> happens, but instead cleanly error out.
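>> 
>> As a sketch of what I mean (illustrative only; wait_for_connect_ack is a
>> made-up name, not the real pmix code), the client could wait on the socket
>> with select() for a bounded time before reading the ack, and return an
>> error if the daemon never responds:
>> 
>> #include <sys/select.h>
>> #include <sys/time.h>
>> 
>> /* Sketch only: wait up to 'seconds' for the daemon's ack to become
>>  * readable on 'sd'; return -1 on timeout or error so the caller can
>>  * report a failure instead of hanging in recv(). */
>> static int wait_for_connect_ack(int sd, int seconds)
>> {
>>     fd_set readfds;
>>     struct timeval tv = { .tv_sec = seconds, .tv_usec = 0 };
>> 
>>     FD_ZERO(&readfds);
>>     FD_SET(sd, &readfds);
>> 
>>     int rc = select(sd + 1, &readfds, NULL, NULL, &tv);
>>     if (rc == 0) {
>>         return -1;   /* timed out: no ack from the daemon */
>>     }
>>     if (rc < 0) {
>>         return -1;   /* select() itself failed */
>>     }
>>     return 0;        /* ack is waiting; safe to recv() it now */
>> }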
>> 
>> 
>>> On Sep 3, 2015, at 11:15 AM, Howard Pritchard <hpprit...@gmail.com> wrote:
>>> 
>>> Hi Folks,
>>> 
>>> I'm again seeing a hang (yes, I'm going to start using timeout) of a
>>> two-process run on the IU Jenkins server for master. This is the
>>> --disable-dlopen Jenkins project on that server.
>>> 
>>> I attached to the hanging processes and got this backtrace:
>>> 
>>> #0  0x00007fdd4ca7ae94 in recv () from /lib64/libpthread.so.0
>>> #1  0x00007fdd4bab622a in opal_pmix_pmix1xx_pmix_usock_recv_blocking (sd=13, data=0x7fff9342fb78 "&", size=4) at src/usock/usock.c:157
>>> #2  0x00007fdd4babad69 in recv_connect_ack (sd=13) at src/client/pmix_client.c:777
>>> #3  0x00007fdd4babbc59 in usock_connect (addr=0x7fff9342fe80) at src/client/pmix_client.c:1026
>>> #4  0x00007fdd4bab88ae in connect_to_server (address=0x7fff9342fe80, cbdata=0x7fff9342fc30) at src/client/pmix_client.c:177
>>> #5  0x00007fdd4bab90f7 in OPAL_PMIX_PMIX1XX_PMIx_Init (proc=0x7fdd4c2e9820 <myproc>) at src/client/pmix_client.c:329
>>> #6  0x00007fdd4bff1892 in pmix1_client_init () at pmix1_client.c:58
>>> #7  0x00007fdd4c37ce1d in pmi_component_query (module=0x7fff9342ffd0, priority=0x7fff9342ffcc) at ess_pmi_component.c:89
>>> #8  0x00007fdd4bf54c38 in mca_base_select (type_name=0x7fdd4c45e5b9 "ess", output_id=-1, components_available=0x7fdd4c6b21d0 <orte_ess_base_framework+80>, best_module=0x7fff93430000, best_component=0x7fff93430008) at mca_base_components_select.c:73
>>> #9  0x00007fdd4c373f0d in orte_ess_base_select () at base/ess_base_select.c:39
>>> #10 0x00007fdd4c312fed in orte_init (pargc=0x0, pargv=0x0, flags=32) at runtime/orte_init.c:221
>>> #11 0x00007fdd4d788e26 in ompi_mpi_init (argc=0, argv=0x0, requested=0, provided=0x7fff934300fc) at runtime/ompi_mpi_init.c:468
>>> #12 0x00007fdd4d7be27a in PMPI_Init (argc=0x7fff93430138, argv=0x7fff93430130) at pinit.c:84
>>> #13 0x00007fdd4dce515e in ompi_init_f (ierr=0x7fff9343043c) at pinit_f.c:82
>>> #14 0x0000000000400dff in MAIN__ ()
>>> #15 0x0000000000400f38 in main ()
>>> 
>>> This only seems to happen intermittently.
>>> 
>>> Any suggestions on how to analyze this further?
>>> 
>>> Howard