First, I want to ask what became of the issue discussed in this thread?
http://www.open-mpi.org/community/lists/devel/2014/11/16160.php
I thought we had concluded that one just needed -D_REENTRANT.
I mention that only for completeness, because I think my current problem is different.
The following works fine with 1.8.3, making the current behavior a regression.

I am still on the same system as in that previous report, and still/again see a message like the following:

------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
  Local host:    pcp-j-19
  Remote host:   172.18.0.120
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
------------------------------------------------------------
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
[...etc...]

It may be worth noting that the hostname pcp-j-19 (172.16.0.119) and the address 172.18.0.120 are on different subnets.

I CANNOT resolve the issue this time by adding -D_REENTRANT to CFLAGS at configure time (I didn't bother to check whether it is there by default now).
NOR can I resolve it by using "-mca oob_tcp_if_include bge0" to allow only the 172.16.0.120 subnet.
IN FACT, the message is the same with that option, other than "172.18" changing to "172.16".

I've attached the output generated by "-mca oob_base_verbose 20" both with and without the oob_tcp_if_include.

I should also note that the following is my full mpirun command, which excludes the tcp BTL.

pcp-j-20$ mpirun -mca oob_tcp_if_include bge0 -mca oob_base_verbose 20 -mca btl sm,self,openib -np 2 -host pcp-j-19,pcp-j-20 examples/ring_c

-Paul

--
Paul H. Hargrove                          phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department               Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
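The subnet observation above can be double-checked with a few lines of Python. This is only a sketch: the netmask is not stated anywhere in this report, so the /16 prefix length below is an assumption, and the helper name same_subnet is made up for illustration.

```python
import ipaddress


def same_subnet(addr_a, addr_b, prefix_len=16):
    """Return True if addr_b lies in the network containing addr_a.

    prefix_len is an ASSUMPTION (the report does not give the netmask).
    """
    # strict=False lets us build the network from a host address.
    network = ipaddress.ip_network(f"{addr_a}/{prefix_len}", strict=False)
    return ipaddress.ip_address(addr_b) in network


# Addresses taken from the report above.
print(same_subnet("172.16.0.119", "172.18.0.120"))  # pcp-j-19 vs. the 172.18 address
print(same_subnet("172.16.0.119", "172.16.0.120"))  # pcp-j-19 vs. the bge0 address
```

With the assumed /16 (or a /24) the first pair is reported as being on different networks and the second as being on the same one, which matches the description in the text.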
[pcp-j-20:29156] mca: base: components_register: registering oob components
[pcp-j-20:29156] mca: base: components_register: found loaded component tcp
[pcp-j-20:29156] mca: base: components_register: component tcp register function successful
[pcp-j-20:29156] mca: base: components_open: opening oob components
[pcp-j-20:29156] mca: base: components_open: found loaded component tcp
[pcp-j-20:29156] mca: base: components_open: component tcp open function successful
[pcp-j-20:29156] mca:oob:select: checking available component tcp
[pcp-j-20:29156] mca:oob:select: Querying component [tcp]
[pcp-j-20:29156] oob:tcp: component_available called
[pcp-j-20:29156] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[pcp-j-20:29156] [[4268,0],0] oob:tcp:init rejecting interface lo0 (not in include list)
[pcp-j-20:29156] WORKING INTERFACE 2 KERNEL INDEX 2 FAMILY: V4
[pcp-j-20:29156] [[4268,0],0] oob:tcp:init adding 172.16.0.120 to our list of V4 connections
[pcp-j-20:29156] WORKING INTERFACE 3 KERNEL INDEX 3 FAMILY: V4
[pcp-j-20:29156] [[4268,0],0] oob:tcp:init rejecting interface pFFFF.ibp0 (not in include list)
[pcp-j-20:29156] [[4268,0],0] TCP STARTUP
[pcp-j-20:29156] [[4268,0],0] attempting to bind to IPv4 port 0
[pcp-j-20:29156] [[4268,0],0] assigned IPv4 port 33536
[pcp-j-20:29156] mca:oob:select: Adding component to end
[pcp-j-20:29156] mca:oob:select: Found 1 active transports
[pcp-j-19:26282] mca: base: components_register: registering oob components
[pcp-j-19:26282] mca: base: components_register: found loaded component tcp
[pcp-j-19:26282] mca: base: components_register: component tcp register function successful
[pcp-j-19:26282] mca: base: components_open: opening oob components
[pcp-j-19:26282] mca: base: components_open: found loaded component tcp
[pcp-j-19:26282] mca: base: components_open: component tcp open function successful
[pcp-j-19:26282] mca:oob:select: checking available component tcp
[pcp-j-19:26282] mca:oob:select: Querying component [tcp]
[pcp-j-19:26282] oob:tcp: component_available called
[pcp-j-19:26282] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[pcp-j-19:26282] [[4268,0],1] oob:tcp:init rejecting interface lo0 (not in include list)
[pcp-j-19:26282] WORKING INTERFACE 2 KERNEL INDEX 2 FAMILY: V4
[pcp-j-19:26282] [[4268,0],1] oob:tcp:init adding 172.16.0.119 to our list of V4 connections
[pcp-j-19:26282] WORKING INTERFACE 3 KERNEL INDEX 3 FAMILY: V4
[pcp-j-19:26282] [[4268,0],1] oob:tcp:init rejecting interface pFFFF.ibp0 (not in include list)
[pcp-j-19:26282] [[4268,0],1] TCP STARTUP
[pcp-j-19:26282] [[4268,0],1] attempting to bind to IPv4 port 0
[pcp-j-19:26282] [[4268,0],1] assigned IPv4 port 33429
[pcp-j-19:26282] mca:oob:select: Adding component to end
[pcp-j-19:26282] mca:oob:select: Found 1 active transports
[pcp-j-19:26282] [[4268,0],1]: set_addr to uri 279707648.0;tcp://172.16.0.120:33536
[pcp-j-19:26282] [[4268,0],1]:set_addr checking if peer [[4268,0],0] is reachable via component tcp
[pcp-j-19:26282] [[4268,0],1] oob:tcp: working peer [[4268,0],0] address tcp://172.16.0.120:33536
[pcp-j-19:26282] [[4268,0],1] PASSING ADDR 172.16.0.120 TO MODULE
[pcp-j-19:26282] [[4268,0],1]:tcp set addr for peer [[4268,0],0]
[pcp-j-19:26282] [[4268,0],1]: peer [[4268,0],0] is reachable via component tcp
[pcp-j-19:26282] [[4268,0],1] OOB_SEND: /shared/OMPI/openmpi-1.8.4rc3-solaris11-x64-ib-gcc452/openmpi-1.8.4rc3/orte/mca/rml/oob/rml_oob_send.c:199
[pcp-j-19:26282] [[4268,0],1]:tcp:processing set_peer cmd
[pcp-j-19:26282] [[4268,0],1] SET_PEER ADDING PEER [[4268,0],0]
[pcp-j-19:26282] [[4268,0],1] set_peer: peer [[4268,0],0] is listening on net 172.16.0.120 port 33536
[pcp-j-19:26282] [[4268,0],1] oob:base:send to target [[4268,0],0]
[pcp-j-19:26282] [[4268,0],1] oob:tcp:send_nb to peer [[4268,0],0]:10
[pcp-j-19:26282] [[4268,0],1] tcp:send_nb to peer [[4268,0],0]
[pcp-j-19:26282] [[4268,0],1]:[/shared/OMPI/openmpi-1.8.4rc3-solaris11-x64-ib-gcc452/openmpi-1.8.4rc3/orte/mca/oob/tcp/oob_tcp.c:478] post send to [[4268,0],0]
[pcp-j-19:26282] [[4268,0],1]:[/shared/OMPI/openmpi-1.8.4rc3-solaris11-x64-ib-gcc452/openmpi-1.8.4rc3/orte/mca/oob/tcp/oob_tcp.c:415] processing send to peer [[4268,0],0]:10
[pcp-j-19:26282] [[4268,0],1]:[/shared/OMPI/openmpi-1.8.4rc3-solaris11-x64-ib-gcc452/openmpi-1.8.4rc3/orte/mca/oob/tcp/oob_tcp.c:449] queue pending to [[4268,0],0]
[pcp-j-19:26282] [[4268,0],1] tcp:send_nb: initiating connection to [[4268,0],0]
[pcp-j-19:26282] [[4268,0],1]:[/shared/OMPI/openmpi-1.8.4rc3-solaris11-x64-ib-gcc452/openmpi-1.8.4rc3/orte/mca/oob/tcp/oob_tcp.c:463] connect to [[4268,0],0]
[pcp-j-19:26282] [[4268,0],1] orte_tcp_peer_try_connect: attempting to connect to proc [[4268,0],0]
[pcp-j-19:26282] [[4268,0],1] oob:tcp:peer creating socket to [[4268,0],0]
[pcp-j-19:26282] [[4268,0],1] orte_tcp_peer_try_connect: attempting to connect to proc [[4268,0],0] on socket 10
[pcp-j-19:26282] [[4268,0],1] orte_tcp_peer_try_connect: attempting to connect to proc [[4268,0],0] on 172.16.0.120:33536 - 0 retries
[pcp-j-20:29156] [[4268,0],0] mca_oob_tcp_listen_thread: new connection: (15, 0) 172.16.0.119:51495
[pcp-j-20:29156] [[4268,0],0] connection_handler: working connection (15, 11) 172.16.0.119:51495
[pcp-j-20:29156] [[4268,0],0] accept_connection: 172.16.0.119:51495
[pcp-j-19:26282] [[4268,0],1] waiting 2:000 for connect completion to [[4268,0],0]
[pcp-j-20:29156] [[4268,0],0]:tcp:recv:handler called
[pcp-j-20:29156] [[4268,0],0] RECV CONNECT ACK FROM UNKNOWN ON SOCKET 15
[pcp-j-20:29156] [[4268,0],0] waiting for connect ack from UNKNOWN
[pcp-j-20:29156] [[4268,0],0]-UNKNOWN tcp_peer_recv_blocking: peer closed connection: peer state 0
[pcp-j-20:29156] [[4268,0],0] unable to complete recv of connect-ack from UNKNOWN ON SOCKET 15
[pcp-j-19:26282] [[4268,0],1] tcp:failed_to_connect called for peer [[4268,0],0]
[pcp-j-19:26282] [[4268,0],1] tcp:failed_to_connect unable to reach peer [[4268,0],0]
[pcp-j-19:26282] [[4268,0],1] TCP SHUTDOWN
[pcp-j-19:26282] [[4268,0],1] RELEASING PEER OBJ [[4268,0],0]
[pcp-j-19:26282] [[4268,0],1] CLOSING SOCKET 10
[pcp-j-19:26282] mca: base: close: component tcp closed
[pcp-j-19:26282] mca: base: close: unloading component tcp
[pcp-j-20:29156] [[4268,0],0] TCP SHUTDOWN
[pcp-j-20:29156] mca: base: close: component tcp closed
[pcp-j-20:29156] mca: base: close: unloading component tcp
[pcp-j-20:28881] mca: base: components_register: registering oob components
[pcp-j-20:28881] mca: base: components_register: found loaded component tcp
[pcp-j-20:28881] mca: base: components_register: component tcp register function successful
[pcp-j-20:28881] mca: base: components_open: opening oob components
[pcp-j-20:28881] mca: base: components_open: found loaded component tcp
[pcp-j-20:28881] mca: base: components_open: component tcp open function successful
[pcp-j-20:28881] mca:oob:select: checking available component tcp
[pcp-j-20:28881] mca:oob:select: Querying component [tcp]
[pcp-j-20:28881] oob:tcp: component_available called
[pcp-j-20:28881] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[pcp-j-20:28881] [[4505,0],0] oob:tcp:init rejecting loopback interface lo0
[pcp-j-20:28881] WORKING INTERFACE 2 KERNEL INDEX 2 FAMILY: V4
[pcp-j-20:28881] [[4505,0],0] oob:tcp:init adding 172.16.0.120 to our list of V4 connections
[pcp-j-20:28881] WORKING INTERFACE 3 KERNEL INDEX 3 FAMILY: V4
[pcp-j-20:28881] [[4505,0],0] oob:tcp:init adding 172.18.0.120 to our list of V4 connections
[pcp-j-20:28881] [[4505,0],0] TCP STARTUP
[pcp-j-20:28881] [[4505,0],0] attempting to bind to IPv4 port 0
[pcp-j-20:28881] [[4505,0],0] assigned IPv4 port 40203
[pcp-j-20:28881] mca:oob:select: Adding component to end
[pcp-j-20:28881] mca:oob:select: Found 1 active transports
[pcp-j-19:25913] mca: base: components_register: registering oob components
[pcp-j-19:25913] mca: base: components_register: found loaded component tcp
[pcp-j-19:25913] mca: base: components_register: component tcp register function successful
[pcp-j-19:25913] mca: base: components_open: opening oob components
[pcp-j-19:25913] mca: base: components_open: found loaded component tcp
[pcp-j-19:25913] mca: base: components_open: component tcp open function successful
[pcp-j-19:25913] mca:oob:select: checking available component tcp
[pcp-j-19:25913] mca:oob:select: Querying component [tcp]
[pcp-j-19:25913] oob:tcp: component_available called
[pcp-j-19:25913] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[pcp-j-19:25913] [[4505,0],1] oob:tcp:init rejecting loopback interface lo0
[pcp-j-19:25913] WORKING INTERFACE 2 KERNEL INDEX 2 FAMILY: V4
[pcp-j-19:25913] [[4505,0],1] oob:tcp:init adding 172.16.0.119 to our list of V4 connections
[pcp-j-19:25913] WORKING INTERFACE 3 KERNEL INDEX 3 FAMILY: V4
[pcp-j-19:25913] [[4505,0],1] oob:tcp:init adding 172.18.0.119 to our list of V4 connections
[pcp-j-19:25913] [[4505,0],1] TCP STARTUP
[pcp-j-19:25913] [[4505,0],1] attempting to bind to IPv4 port 0
[pcp-j-19:25913] [[4505,0],1] assigned IPv4 port 47045
[pcp-j-19:25913] mca:oob:select: Adding component to end
[pcp-j-19:25913] mca:oob:select: Found 1 active transports
[pcp-j-19:25913] [[4505,0],1]: set_addr to uri 295239680.0;tcp://172.16.0.120,172.18.0.120:40203
[pcp-j-19:25913] [[4505,0],1]:set_addr checking if peer [[4505,0],0] is reachable via component tcp
[pcp-j-19:25913] [[4505,0],1] oob:tcp: working peer [[4505,0],0] address tcp://172.16.0.120,172.18.0.120:40203
[pcp-j-19:25913] [[4505,0],1] PASSING ADDR 172.16.0.120 TO MODULE
[pcp-j-19:25913] [[4505,0],1]:tcp set addr for peer [[4505,0],0]
[pcp-j-19:25913] [[4505,0],1] PASSING ADDR 172.18.0.120 TO MODULE
[pcp-j-19:25913] [[4505,0],1]:tcp set addr for peer [[4505,0],0]
[pcp-j-19:25913] [[4505,0],1]: peer [[4505,0],0] is reachable via component tcp
[pcp-j-19:25913] [[4505,0],1] OOB_SEND: /shared/OMPI/openmpi-1.8.4rc3-solaris11-x64-ib-gcc452/openmpi-1.8.4rc3/orte/mca/rml/oob/rml_oob_send.c:199
[pcp-j-19:25913] [[4505,0],1]:tcp:processing set_peer cmd
[pcp-j-19:25913] [[4505,0],1] SET_PEER ADDING PEER [[4505,0],0]
[pcp-j-19:25913] [[4505,0],1] set_peer: peer [[4505,0],0] is listening on net 172.16.0.120 port 40203
[pcp-j-19:25913] [[4505,0],1]:tcp:processing set_peer cmd
[pcp-j-19:25913] [[4505,0],1] set_peer: peer [[4505,0],0] is listening on net 172.18.0.120 port 40203
[pcp-j-19:25913] [[4505,0],1] oob:base:send to target [[4505,0],0]
[pcp-j-19:25913] [[4505,0],1] oob:tcp:send_nb to peer [[4505,0],0]:10
[pcp-j-19:25913] [[4505,0],1] tcp:send_nb to peer [[4505,0],0]
[pcp-j-19:25913] [[4505,0],1]:[/shared/OMPI/openmpi-1.8.4rc3-solaris11-x64-ib-gcc452/openmpi-1.8.4rc3/orte/mca/oob/tcp/oob_tcp.c:478] post send to [[4505,0],0]
[pcp-j-19:25913] [[4505,0],1]:[/shared/OMPI/openmpi-1.8.4rc3-solaris11-x64-ib-gcc452/openmpi-1.8.4rc3/orte/mca/oob/tcp/oob_tcp.c:415] processing send to peer [[4505,0],0]:10
[pcp-j-19:25913] [[4505,0],1]:[/shared/OMPI/openmpi-1.8.4rc3-solaris11-x64-ib-gcc452/openmpi-1.8.4rc3/orte/mca/oob/tcp/oob_tcp.c:449] queue pending to [[4505,0],0]
[pcp-j-19:25913] [[4505,0],1] tcp:send_nb: initiating connection to [[4505,0],0]
[pcp-j-20:28881] [[4505,0],0] mca_oob_tcp_listen_thread: new connection: (15, 0) 172.16.0.119:41947
[pcp-j-20:28881] [[4505,0],0] connection_handler: working connection (15, 11) 172.16.0.119:41947
[pcp-j-20:28881] [[4505,0],0] accept_connection: 172.16.0.119:41947
[pcp-j-19:25913] [[4505,0],1]:[/shared/OMPI/openmpi-1.8.4rc3-solaris11-x64-ib-gcc452/openmpi-1.8.4rc3/orte/mca/oob/tcp/oob_tcp.c:463] connect to [[4505,0],0]
[pcp-j-19:25913] [[4505,0],1] orte_tcp_peer_try_connect: attempting to connect to proc [[4505,0],0]
[pcp-j-19:25913] [[4505,0],1] oob:tcp:peer creating socket to [[4505,0],0]
[pcp-j-19:25913] [[4505,0],1] orte_tcp_peer_try_connect: attempting to connect to proc [[4505,0],0] on socket 10
[pcp-j-19:25913] [[4505,0],1] orte_tcp_peer_try_connect: attempting to connect to proc [[4505,0],0] on 172.16.0.120:40203 - 0 retries
[pcp-j-19:25913] [[4505,0],1] waiting 2:000 for connect completion to [[4505,0],0]
[pcp-j-19:25913] [[4505,0],1] orte_tcp_peer_try_connect: attempting to connect to proc [[4505,0],0] on 172.18.0.120:40203 - 0 retries
[pcp-j-20:28881] [[4505,0],0]:tcp:recv:handler called
[pcp-j-20:28881] [[4505,0],0] RECV CONNECT ACK FROM UNKNOWN ON SOCKET 15
[pcp-j-20:28881] [[4505,0],0] waiting for connect ack from UNKNOWN
[pcp-j-20:28881] [[4505,0],0]-UNKNOWN tcp_peer_recv_blocking: peer closed connection: peer state 0
[pcp-j-20:28881] [[4505,0],0] unable to complete recv of connect-ack from UNKNOWN ON SOCKET 15
[pcp-j-19:25913] [[4505,0],1] tcp:failed_to_connect called for peer [[4505,0],0]
[pcp-j-19:25913] [[4505,0],1] tcp:failed_to_connect unable to reach peer [[4505,0],0]
[pcp-j-19:25913] [[4505,0],1] TCP SHUTDOWN
[pcp-j-19:25913] [[4505,0],1] RELEASING PEER OBJ [[4505,0],0]
[pcp-j-19:25913] [[4505,0],1] CLOSING SOCKET 10
[pcp-j-19:25913] mca: base: close: component tcp closed
[pcp-j-19:25913] mca: base: close: unloading component tcp
[pcp-j-20:28881] [[4505,0],0] TCP SHUTDOWN
[pcp-j-20:28881] mca: base: close: component tcp closed
[pcp-j-20:28881] mca: base: close: unloading component tcp
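
For anyone comparing the two logs above, a small filter can pull out just the connection attempts and the hosts that gave up. This is a rough sketch, not part of Open MPI: the function name summarize and the regular expressions are illustrative only, and they assume the message wording stays exactly as it appears in these 1.8.4rc3 logs. It matches on whitespace rather than line breaks, so it should also cope with a log whose newlines were lost.

```python
import re

# Matches "... attempting to connect to proc [[J,0],V] on A.B.C.D:PORT"
# (the "on socket N" variant does not match because an IPv4 address is required).
ATTEMPT = re.compile(
    r"\[([\w.-]+):\d+\]\s+\[\[\d+,\d+\],\d+\]\s+orte_tcp_peer_try_connect:"
    r"\s+attempting\s+to\s+connect\s+to\s+proc\s+\[\[\d+,\d+\],\d+\]"
    r"\s+on\s+(\d+(?:\.\d+){3}):(\d+)"
)

# Matches the final "unable to reach peer" report and captures the host name.
FAILED = re.compile(
    r"\[([\w.-]+):\d+\]\s+\[\[\d+,\d+\],\d+\]\s+tcp:failed_to_connect"
    r"\s+unable\s+to\s+reach\s+peer"
)


def summarize(log_text):
    """Return (attempts, failed_hosts) extracted from oob_base_verbose output."""
    attempts = [(host, f"{ip}:{port}") for host, ip, port in ATTEMPT.findall(log_text)]
    failed_hosts = sorted(set(FAILED.findall(log_text)))
    return attempts, failed_hosts


# Three entries copied from the log without oob_tcp_if_include.
sample = (
    "[pcp-j-19:25913] [[4505,0],1] orte_tcp_peer_try_connect: attempting to connect "
    "to proc [[4505,0],0] on 172.16.0.120:40203 - 0 retries "
    "[pcp-j-19:25913] [[4505,0],1] orte_tcp_peer_try_connect: attempting to connect "
    "to proc [[4505,0],0] on 172.18.0.120:40203 - 0 retries "
    "[pcp-j-19:25913] [[4505,0],1] tcp:failed_to_connect unable to reach peer [[4505,0],0]"
)
attempts, failed_hosts = summarize(sample)
print(attempts)
print(failed_hosts)
```

Run over the full attachments, this would list one attempted address for the log with oob_tcp_if_include and two for the log without it, with pcp-j-19 reporting the failure in both.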