On Fri, Dec 12, 2014 at 2:58 PM, Ralph Castain <r...@open-mpi.org> wrote:

> Aha! You are the first to fall thru the timeout. How interesting.
>

When it comes to the release candidates, I seem to own a lot of "firsts".
It is not as fun as one might imagine :-).

> Can you please try adding "-mca oob_tcp_connect_timeout 5:0"?
>

That appeared to produce a timeout of about 5 SECONDS ("time mpirun"
reports 5.8s elapsed).  Was that really the intent?   No difference if I
change "5:0" to "5:00".  So, you might have an "extra" bug lurking there.


New stderr attached for
  $ mpirun -mca oob_tcp_if_include bge0 -mca oob_tcp_connect_timeout 5:0 \
      -mca oob_base_verbose 20 -mca btl sm,self,openib -np 2 \
      -host pcp-j-19,pcp-j-20 examples/ring_c

Assuming "5:0" was intended to get a 5 MINUTE timeout, I also tried "-mca
oob_tcp_connect_timeout 300", and have also attached the resulting stderr.

No joy for either timeout value.
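
For anyone comparing the two attached stderr files, here is a minimal sketch (Python, not part of Open MPI) that pulls the "waiting N:NNN for connect completion" message out of the verbose output, to confirm which timeout value a given run actually picked up. The function name connect_waits and the sample text are illustrative only, and the two numeric fields are returned as raw strings because their units are not assumed here.

```python
import re

# Matches the wait message emitted at oob_base_verbose 20, e.g.
# "[pcp-j-19:12105] [[32105,0],1] waiting 5:000 for connect completion".
# The pattern stays within one physical line, so mail-client wrapping of the
# trailing peer name does not matter.
WAIT_RE = re.compile(
    r"\[([\w.-]+):(\d+)\].*?waiting (\d+):(\d+) for connect completion"
)

def connect_waits(stderr_text):
    """Return (host, pid, first_field, second_field) for each wait message.

    The two timeout fields are kept as strings exactly as logged.
    """
    return [
        (m.group(1), int(m.group(2)), m.group(3), m.group(4))
        for m in WAIT_RE.finditer(stderr_text)
    ]

# Sample lines copied from the two attached logs (wrapped as received).
sample = (
    "[pcp-j-19:12105] [[32105,0],1] waiting 5:000 for connect completion to \n"
    "[[32105,0],0]\n"
    "[pcp-j-19:12141] [[32097,0],1] waiting 300:000 for connect completion to \n"
    "[[32097,0],0]\n"
)

for host, pid, first, second in connect_waits(sample):
    print(f"{host} pid {pid}: waiting {first}:{second}")
```

Running the same function over a whole stderr file (read with open(path).read()) gives one entry per connection attempt, which makes it easy to see whether "5:0" and "300" were parsed differently.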

-Paul



>
> On Dec 12, 2014, at 8:53 AM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
>
>
> First, I want to ask what became of the issue discussed in this thread?
>    http://www.open-mpi.org/community/lists/devel/2014/11/16160.php
> I thought we had concluded that one just needed -D_REENTRANT.
> I mention that only for completeness, because I think my current problem
> is different.
>
> The following works fine with 1.8.3, making the current behavior a
> regression.
>
> I am still on the same system as that previous report, and still/again see
> a message like the following:
>
> ------------------------------------------------------------
> A process or daemon was unable to complete a TCP connection
> to another process:
>   Local host:    pcp-j-19
>   Remote host:   172.18.0.120
> This is usually caused by a firewall on the remote host. Please
> check that any firewall (e.g., iptables) has been disabled and
> try again.
> ------------------------------------------------------------
> --------------------------------------------------------------------------
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
> [...etc...]
>
> It may be worth noting that the hostname pcp-j-19 (172.16.0.119) and the
> address 172.18.0.120 are on different subnets.
>
> I CANNOT resolve the issue this time by adding -D_REENTRANT to CFLAGS at
> configure time (I didn't bother to check whether it is there by default now).
>
> NOR can I resolve it by using "-mca oob_tcp_if_include bge0" to allow only
> the 172.16.0.120 subnet.
> IN FACT, the message is the same with that option, other than "172.18"
> changing to "172.16".
>
> I've attached the output generated by "-mca oob_base_verbose 20" both with
> and without the oob_tcp_if_include.
>
> I should also note that the following is my full mpirun command,
> which excludes the tcp BTL.
> pcp-j-20$ mpirun -mca oob_tcp_if_include bge0 -mca oob_base_verbose 20 \
>     -mca btl sm,self,openib -np 2 -host pcp-j-19,pcp-j-20 examples/ring_c
>
>
> -Paul
>
> --
> Paul H. Hargrove                          phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department               Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>  <stdout-inc.txt><stderr-2if.txt>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16551.php
>
>
>
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16561.php
>



-- 
Paul H. Hargrove                          phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department               Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900

[Attached stderr, run with -mca oob_tcp_connect_timeout 5:0]

[pcp-j-20:07201] mca: base: components_register: registering oob components
[pcp-j-20:07201] mca: base: components_register: found loaded component tcp
[pcp-j-20:07201] mca: base: components_register: component tcp register function successful
[pcp-j-20:07201] mca: base: components_open: opening oob components
[pcp-j-20:07201] mca: base: components_open: found loaded component tcp
[pcp-j-20:07201] mca: base: components_open: component tcp open function successful
[pcp-j-20:07201] mca:oob:select: checking available component tcp
[pcp-j-20:07201] mca:oob:select: Querying component [tcp]
[pcp-j-20:07201] oob:tcp: component_available called
[pcp-j-20:07201] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[pcp-j-20:07201] [[32105,0],0] oob:tcp:init rejecting interface lo0 (not in include list)
[pcp-j-20:07201] WORKING INTERFACE 2 KERNEL INDEX 2 FAMILY: V4
[pcp-j-20:07201] [[32105,0],0] oob:tcp:init adding 172.16.0.120 to our list of V4 connections
[pcp-j-20:07201] WORKING INTERFACE 3 KERNEL INDEX 3 FAMILY: V4
[pcp-j-20:07201] [[32105,0],0] oob:tcp:init rejecting interface pFFFF.ibp0 (not in include list)
[pcp-j-20:07201] [[32105,0],0] TCP STARTUP
[pcp-j-20:07201] [[32105,0],0] attempting to bind to IPv4 port 0
[pcp-j-20:07201] [[32105,0],0] assigned IPv4 port 62289
[pcp-j-20:07201] mca:oob:select: Adding component to end
[pcp-j-20:07201] mca:oob:select: Found 1 active transports
[pcp-j-19:12105] mca: base: components_register: registering oob components
[pcp-j-19:12105] mca: base: components_register: found loaded component tcp
[pcp-j-19:12105] mca: base: components_register: component tcp register function successful
[pcp-j-19:12105] mca: base: components_open: opening oob components
[pcp-j-19:12105] mca: base: components_open: found loaded component tcp
[pcp-j-19:12105] mca: base: components_open: component tcp open function successful
[pcp-j-19:12105] mca:oob:select: checking available component tcp
[pcp-j-19:12105] mca:oob:select: Querying component [tcp]
[pcp-j-19:12105] oob:tcp: component_available called
[pcp-j-19:12105] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[pcp-j-19:12105] [[32105,0],1] oob:tcp:init rejecting interface lo0 (not in include list)
[pcp-j-19:12105] WORKING INTERFACE 2 KERNEL INDEX 2 FAMILY: V4
[pcp-j-19:12105] [[32105,0],1] oob:tcp:init adding 172.16.0.119 to our list of V4 connections
[pcp-j-19:12105] WORKING INTERFACE 3 KERNEL INDEX 3 FAMILY: V4
[pcp-j-19:12105] [[32105,0],1] oob:tcp:init rejecting interface pFFFF.ibp0 (not in include list)
[pcp-j-19:12105] [[32105,0],1] TCP STARTUP
[pcp-j-19:12105] [[32105,0],1] attempting to bind to IPv4 port 0
[pcp-j-19:12105] [[32105,0],1] assigned IPv4 port 49889
[pcp-j-19:12105] mca:oob:select: Adding component to end
[pcp-j-19:12105] mca:oob:select: Found 1 active transports
[pcp-j-19:12105] [[32105,0],1]: set_addr to uri 2104033280.0;tcp://172.16.0.120:62289
[pcp-j-19:12105] [[32105,0],1]:set_addr checking if peer [[32105,0],0] is reachable via component tcp
[pcp-j-19:12105] [[32105,0],1] oob:tcp: working peer [[32105,0],0] address tcp://172.16.0.120:62289
[pcp-j-19:12105] [[32105,0],1] PASSING ADDR 172.16.0.120 TO MODULE
[pcp-j-19:12105] [[32105,0],1]:tcp set addr for peer [[32105,0],0]
[pcp-j-19:12105] [[32105,0],1]: peer [[32105,0],0] is reachable via component tcp
[pcp-j-19:12105] [[32105,0],1] OOB_SEND: /shared/OMPI/openmpi-1.8.4rc3-solaris11-x64-ib-gcc452/openmpi-1.8.4rc3/orte/mca/rml/oob/rml_oob_send.c:199
[pcp-j-19:12105] [[32105,0],1]:tcp:processing set_peer cmd
[pcp-j-19:12105] [[32105,0],1] SET_PEER ADDING PEER [[32105,0],0]
[pcp-j-19:12105] [[32105,0],1] set_peer: peer [[32105,0],0] is listening on net 172.16.0.120 port 62289
[pcp-j-19:12105] [[32105,0],1] oob:base:send to target [[32105,0],0]
[pcp-j-19:12105] [[32105,0],1] oob:tcp:send_nb to peer [[32105,0],0]:10
[pcp-j-19:12105] [[32105,0],1] tcp:send_nb to peer [[32105,0],0]
[pcp-j-19:12105] [[32105,0],1]:[/shared/OMPI/openmpi-1.8.4rc3-solaris11-x64-ib-gcc452/openmpi-1.8.4rc3/orte/mca/oob/tcp/oob_tcp.c:478] post send to [[32105,0],0]
[pcp-j-19:12105] [[32105,0],1]:[/shared/OMPI/openmpi-1.8.4rc3-solaris11-x64-ib-gcc452/openmpi-1.8.4rc3/orte/mca/oob/tcp/oob_tcp.c:415] processing send to peer [[32105,0],0]:10
[pcp-j-19:12105] [[32105,0],1]:[/shared/OMPI/openmpi-1.8.4rc3-solaris11-x64-ib-gcc452/openmpi-1.8.4rc3/orte/mca/oob/tcp/oob_tcp.c:449] queue pending to [[32105,0],0]
[pcp-j-19:12105] [[32105,0],1] tcp:send_nb: initiating connection to [[32105,0],0]
[pcp-j-19:12105] [[32105,0],1]:[/shared/OMPI/openmpi-1.8.4rc3-solaris11-x64-ib-gcc452/openmpi-1.8.4rc3/orte/mca/oob/tcp/oob_tcp.c:463] connect to [[32105,0],0]
[pcp-j-20:07201] [[32105,0],0] mca_oob_tcp_listen_thread: new connection: (15, 0) 172.16.0.119:35311
[pcp-j-20:07201] [[32105,0],0] connection_handler: working connection (15, 11) 172.16.0.119:35311
[pcp-j-20:07201] [[32105,0],0] accept_connection: 172.16.0.119:35311
[pcp-j-19:12105] [[32105,0],1] orte_tcp_peer_try_connect: attempting to connect to proc [[32105,0],0]
[pcp-j-19:12105] [[32105,0],1] oob:tcp:peer creating socket to [[32105,0],0]
[pcp-j-19:12105] [[32105,0],1] orte_tcp_peer_try_connect: attempting to connect to proc [[32105,0],0] on socket 10
[pcp-j-19:12105] [[32105,0],1] orte_tcp_peer_try_connect: attempting to connect to proc [[32105,0],0] on 172.16.0.120:62289 - 0 retries
[pcp-j-19:12105] [[32105,0],1] waiting 5:000 for connect completion to [[32105,0],0]
[pcp-j-20:07201] [[32105,0],0]:tcp:recv:handler called
[pcp-j-20:07201] [[32105,0],0] RECV CONNECT ACK FROM UNKNOWN ON SOCKET 15
[pcp-j-20:07201] [[32105,0],0] waiting for connect ack from UNKNOWN
[pcp-j-20:07201] [[32105,0],0]-UNKNOWN tcp_peer_recv_blocking: peer closed connection: peer state 0
[pcp-j-20:07201] [[32105,0],0] unable to complete recv of connect-ack from UNKNOWN ON SOCKET 15
[pcp-j-19:12105] [[32105,0],1] tcp:failed_to_connect called for peer [[32105,0],0]
[pcp-j-19:12105] [[32105,0],1] tcp:failed_to_connect unable to reach peer [[32105,0],0]
[pcp-j-19:12105] [[32105,0],1] TCP SHUTDOWN
[pcp-j-19:12105] [[32105,0],1] RELEASING PEER OBJ [[32105,0],0]
[pcp-j-19:12105] [[32105,0],1] CLOSING SOCKET 10
[pcp-j-19:12105] mca: base: close: component tcp closed
[pcp-j-19:12105] mca: base: close: unloading component tcp
[pcp-j-20:07201] [[32105,0],0] TCP SHUTDOWN
[pcp-j-20:07201] mca: base: close: component tcp closed
[pcp-j-20:07201] mca: base: close: unloading component tcp

[Attached stderr, run with -mca oob_tcp_connect_timeout 300]

[pcp-j-20:07209] mca: base: components_register: registering oob components
[pcp-j-20:07209] mca: base: components_register: found loaded component tcp
[pcp-j-20:07209] mca: base: components_register: component tcp register function successful
[pcp-j-20:07209] mca: base: components_open: opening oob components
[pcp-j-20:07209] mca: base: components_open: found loaded component tcp
[pcp-j-20:07209] mca: base: components_open: component tcp open function successful
[pcp-j-20:07209] mca:oob:select: checking available component tcp
[pcp-j-20:07209] mca:oob:select: Querying component [tcp]
[pcp-j-20:07209] oob:tcp: component_available called
[pcp-j-20:07209] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[pcp-j-20:07209] [[32097,0],0] oob:tcp:init rejecting interface lo0 (not in include list)
[pcp-j-20:07209] WORKING INTERFACE 2 KERNEL INDEX 2 FAMILY: V4
[pcp-j-20:07209] [[32097,0],0] oob:tcp:init adding 172.16.0.120 to our list of V4 connections
[pcp-j-20:07209] WORKING INTERFACE 3 KERNEL INDEX 3 FAMILY: V4
[pcp-j-20:07209] [[32097,0],0] oob:tcp:init rejecting interface pFFFF.ibp0 (not in include list)
[pcp-j-20:07209] [[32097,0],0] TCP STARTUP
[pcp-j-20:07209] [[32097,0],0] attempting to bind to IPv4 port 0
[pcp-j-20:07209] [[32097,0],0] assigned IPv4 port 61664
[pcp-j-20:07209] mca:oob:select: Adding component to end
[pcp-j-20:07209] mca:oob:select: Found 1 active transports
[pcp-j-19:12141] mca: base: components_register: registering oob components
[pcp-j-19:12141] mca: base: components_register: found loaded component tcp
[pcp-j-19:12141] mca: base: components_register: component tcp register function successful
[pcp-j-19:12141] mca: base: components_open: opening oob components
[pcp-j-19:12141] mca: base: components_open: found loaded component tcp
[pcp-j-19:12141] mca: base: components_open: component tcp open function successful
[pcp-j-19:12141] mca:oob:select: checking available component tcp
[pcp-j-19:12141] mca:oob:select: Querying component [tcp]
[pcp-j-19:12141] oob:tcp: component_available called
[pcp-j-19:12141] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[pcp-j-19:12141] [[32097,0],1] oob:tcp:init rejecting interface lo0 (not in include list)
[pcp-j-19:12141] WORKING INTERFACE 2 KERNEL INDEX 2 FAMILY: V4
[pcp-j-19:12141] [[32097,0],1] oob:tcp:init adding 172.16.0.119 to our list of V4 connections
[pcp-j-19:12141] WORKING INTERFACE 3 KERNEL INDEX 3 FAMILY: V4
[pcp-j-19:12141] [[32097,0],1] oob:tcp:init rejecting interface pFFFF.ibp0 (not in include list)
[pcp-j-19:12141] [[32097,0],1] TCP STARTUP
[pcp-j-19:12141] [[32097,0],1] attempting to bind to IPv4 port 0
[pcp-j-19:12141] [[32097,0],1] assigned IPv4 port 46106
[pcp-j-19:12141] mca:oob:select: Adding component to end
[pcp-j-19:12141] mca:oob:select: Found 1 active transports
[pcp-j-19:12141] [[32097,0],1]: set_addr to uri 2103508992.0;tcp://172.16.0.120:61664
[pcp-j-19:12141] [[32097,0],1]:set_addr checking if peer [[32097,0],0] is reachable via component tcp
[pcp-j-19:12141] [[32097,0],1] oob:tcp: working peer [[32097,0],0] address tcp://172.16.0.120:61664
[pcp-j-19:12141] [[32097,0],1] PASSING ADDR 172.16.0.120 TO MODULE
[pcp-j-19:12141] [[32097,0],1]:tcp set addr for peer [[32097,0],0]
[pcp-j-19:12141] [[32097,0],1]: peer [[32097,0],0] is reachable via component tcp
[pcp-j-19:12141] [[32097,0],1] OOB_SEND: /shared/OMPI/openmpi-1.8.4rc3-solaris11-x64-ib-gcc452/openmpi-1.8.4rc3/orte/mca/rml/oob/rml_oob_send.c:199
[pcp-j-19:12141] [[32097,0],1]:tcp:processing set_peer cmd
[pcp-j-19:12141] [[32097,0],1] SET_PEER ADDING PEER [[32097,0],0]
[pcp-j-19:12141] [[32097,0],1] set_peer: peer [[32097,0],0] is listening on net 172.16.0.120 port 61664
[pcp-j-19:12141] [[32097,0],1] oob:base:send to target [[32097,0],0]
[pcp-j-19:12141] [[32097,0],1] oob:tcp:send_nb to peer [[32097,0],0]:10
[pcp-j-19:12141] [[32097,0],1] tcp:send_nb to peer [[32097,0],0]
[pcp-j-19:12141] [[32097,0],1]:[/shared/OMPI/openmpi-1.8.4rc3-solaris11-x64-ib-gcc452/openmpi-1.8.4rc3/orte/mca/oob/tcp/oob_tcp.c:478] post send to [[32097,0],0]
[pcp-j-19:12141] [[32097,0],1]:[/shared/OMPI/openmpi-1.8.4rc3-solaris11-x64-ib-gcc452/openmpi-1.8.4rc3/orte/mca/oob/tcp/oob_tcp.c:415] processing send to peer [[32097,0],0]:10
[pcp-j-19:12141] [[32097,0],1]:[/shared/OMPI/openmpi-1.8.4rc3-solaris11-x64-ib-gcc452/openmpi-1.8.4rc3/orte/mca/oob/tcp/oob_tcp.c:449] queue pending to [[32097,0],0]
[pcp-j-19:12141] [[32097,0],1] tcp:send_nb: initiating connection to [[32097,0],0]
[pcp-j-19:12141] [[32097,0],1]:[/shared/OMPI/openmpi-1.8.4rc3-solaris11-x64-ib-gcc452/openmpi-1.8.4rc3/orte/mca/oob/tcp/oob_tcp.c:463] connect to [[32097,0],0]
[pcp-j-19:12141] [[32097,0],1] orte_tcp_peer_try_connect: attempting to connect to proc [[32097,0],0]
[pcp-j-19:12141] [[32097,0],1] oob:tcp:peer creating socket to [[32097,0],0]
[pcp-j-20:07209] [[32097,0],0] mca_oob_tcp_listen_thread: new connection: (15, 0) 172.16.0.119:33495
[pcp-j-20:07209] [[32097,0],0] connection_handler: working connection (15, 11) 172.16.0.119:33495
[pcp-j-20:07209] [[32097,0],0] accept_connection: 172.16.0.119:33495
[pcp-j-19:12141] [[32097,0],1] orte_tcp_peer_try_connect: attempting to connect to proc [[32097,0],0] on socket 10
[pcp-j-19:12141] [[32097,0],1] orte_tcp_peer_try_connect: attempting to connect to proc [[32097,0],0] on 172.16.0.120:61664 - 0 retries
[pcp-j-19:12141] [[32097,0],1] waiting 300:000 for connect completion to [[32097,0],0]
[pcp-j-20:07209] [[32097,0],0]:tcp:recv:handler called
[pcp-j-20:07209] [[32097,0],0] RECV CONNECT ACK FROM UNKNOWN ON SOCKET 15
[pcp-j-20:07209] [[32097,0],0] waiting for connect ack from UNKNOWN
[pcp-j-20:07209] [[32097,0],0]-UNKNOWN tcp_peer_recv_blocking: peer closed connection: peer state 0
[pcp-j-20:07209] [[32097,0],0] unable to complete recv of connect-ack from UNKNOWN ON SOCKET 15
[pcp-j-19:12141] [[32097,0],1] tcp:failed_to_connect called for peer [[32097,0],0]
[pcp-j-19:12141] [[32097,0],1] tcp:failed_to_connect unable to reach peer [[32097,0],0]
[pcp-j-19:12141] [[32097,0],1] TCP SHUTDOWN
[pcp-j-19:12141] [[32097,0],1] RELEASING PEER OBJ [[32097,0],0]
[pcp-j-19:12141] [[32097,0],1] CLOSING SOCKET 10
[pcp-j-19:12141] mca: base: close: component tcp closed
[pcp-j-19:12141] mca: base: close: unloading component tcp
[pcp-j-20:07209] [[32097,0],0] TCP SHUTDOWN
[pcp-j-20:07209] mca: base: close: component tcp closed
[pcp-j-20:07209] mca: base: close: unloading component tcp
