Thanks! I’ll at least try, and I can certainly provide some diagnostic output. 
We’ll just have to live with it when the run doesn’t fail, and hopefully it 
won’t change the timing so much that the hang no longer reproduces.

> On Sep 3, 2015, at 12:44 PM, Howard Pritchard <hpprit...@gmail.com> wrote:
> 
> Hi Ralph,
> 
> A warning: this seems to be hard to reproduce, at least on the UH server.
> 
> Howard
> 
> 
> 2015-09-03 13:12 GMT-06:00 Ralph Castain <r...@open-mpi.org>:
> I’ll try to replicate, and provide some diagnostics targeting this exchange. 
> What is happening is that the client process is attempting to connect to the 
> ORTE daemon, and for some reason the connection isn’t generating a response 
> from the daemon.
> 
> I’ll also add a timeout function in there so we don’t hang when this happens, 
> but instead cleanly error out.
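
A minimal sketch of what such a timeout could look like, so that a blocking receive like the one in frame #1 of the backtrace errors out instead of hanging. This is not the actual PMIx code: the helper name `recv_blocking_timeout`, its signature, and the choice of `poll()` with a per-wait timeout are all assumptions made for illustration.

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper (not part of PMIx): read exactly `size` bytes from
 * socket `sd`, waiting at most `timeout_ms` milliseconds each time we wait
 * for more data. Returns 0 on success; returns -1 with errno set on error,
 * on timeout (ETIMEDOUT), or if the peer closes early (ECONNRESET). */
static int recv_blocking_timeout(int sd, void *data, size_t size, int timeout_ms)
{
    char *ptr = data;
    size_t got = 0;

    while (got < size) {
        struct pollfd pfd = { .fd = sd, .events = POLLIN, .revents = 0 };

        /* Wait until the socket is readable or the timeout expires. */
        int rc = poll(&pfd, 1, timeout_ms);
        if (rc < 0) {
            if (errno == EINTR) {
                continue;           /* interrupted: wait again (timer restarts) */
            }
            return -1;
        }
        if (rc == 0) {
            errno = ETIMEDOUT;      /* no response from the server in time */
            return -1;
        }

        ssize_t n = recv(sd, ptr + got, size - got, 0);
        if (n < 0) {
            if (errno == EINTR || errno == EAGAIN || errno == EWOULDBLOCK) {
                continue;           /* transient condition: poll again */
            }
            return -1;
        }
        if (n == 0) {
            errno = ECONNRESET;     /* peer closed before sending everything */
            return -1;
        }
        got += (size_t)n;
    }
    return 0;
}
```

A caller waiting for the 4-byte connect ack could call `recv_blocking_timeout(sd, &reply, sizeof(reply), 5000)` and treat a return of -1 with `errno == ETIMEDOUT` as "the daemon never answered". The 5000 ms value is arbitrary. Note that the timeout applies to each wait, not to the transfer as a whole.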
> 
> 
>> On Sep 3, 2015, at 11:15 AM, Howard Pritchard <hpprit...@gmail.com> wrote:
>> 
>> Hi Folks,
>> 
>> I'm again seeing a hang (yes, I'm going to start using timeout) in a 
>> two-process run on the IU Jenkins server for master.  This is the 
>> --disable-dlopen Jenkins project on that server.
>> 
>> I attached to the hanging processes and get this for a backtrace:
>> 
>> #0  0x00007fdd4ca7ae94 in recv () from /lib64/libpthread.so.0
>> #1  0x00007fdd4bab622a in opal_pmix_pmix1xx_pmix_usock_recv_blocking (sd=13, data=0x7fff9342fb78 "&", size=4) at src/usock/usock.c:157
>> #2  0x00007fdd4babad69 in recv_connect_ack (sd=13) at src/client/pmix_client.c:777
>> #3  0x00007fdd4babbc59 in usock_connect (addr=0x7fff9342fe80) at src/client/pmix_client.c:1026
>> #4  0x00007fdd4bab88ae in connect_to_server (address=0x7fff9342fe80, cbdata=0x7fff9342fc30) at src/client/pmix_client.c:177
>> #5  0x00007fdd4bab90f7 in OPAL_PMIX_PMIX1XX_PMIx_Init (proc=0x7fdd4c2e9820 <myproc>) at src/client/pmix_client.c:329
>> #6  0x00007fdd4bff1892 in pmix1_client_init () at pmix1_client.c:58
>> #7  0x00007fdd4c37ce1d in pmi_component_query (module=0x7fff9342ffd0, priority=0x7fff9342ffcc) at ess_pmi_component.c:89
>> #8  0x00007fdd4bf54c38 in mca_base_select (type_name=0x7fdd4c45e5b9 "ess", output_id=-1, components_available=0x7fdd4c6b21d0 <orte_ess_base_framework+80>, best_module=0x7fff93430000, best_component=0x7fff93430008) at mca_base_components_select.c:73
>> #9  0x00007fdd4c373f0d in orte_ess_base_select () at base/ess_base_select.c:39
>> #10 0x00007fdd4c312fed in orte_init (pargc=0x0, pargv=0x0, flags=32) at runtime/orte_init.c:221
>> #11 0x00007fdd4d788e26 in ompi_mpi_init (argc=0, argv=0x0, requested=0, provided=0x7fff934300fc) at runtime/ompi_mpi_init.c:468
>> #12 0x00007fdd4d7be27a in PMPI_Init (argc=0x7fff93430138, argv=0x7fff93430130) at pinit.c:84
>> #13 0x00007fdd4dce515e in ompi_init_f (ierr=0x7fff9343043c) at pinit_f.c:82
>> #14 0x0000000000400dff in MAIN__ ()
>> #15 0x0000000000400f38 in main ()
>> 
>> This seems to happen only intermittently.
>> 
>> Any suggestions on how to analyze it further?
>> 
>> Howard
>> 
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2015/09/17943.php
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/09/17946.php
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/09/17947.php
