Ralph,

Requested output is attached.
I have a Linux/x86 system with the same network configuration and will soon
be able to determine if the problem is specific to Solaris.

-Paul

On Mon, Nov 3, 2014 at 7:11 PM, Ralph Castain <rhc.open...@gmail.com> wrote:

> Could you please set -mca oob_base_verbose 20? I'm not sure why the
> connection is failing.
>
> Thanks
> Ralph
>
> On Nov 3, 2014, at 5:56 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
> Not clear if the following failure is Solaris-specific, but it *IS* a
> regression relative to 1.8.3.
>
> The system has two IPv4 interfaces:
>   Ethernet on 172.16.0.119/16
>   IPoIB    on 172.18.0.119/16
>
> $ ifconfig bge0
> bge0: flags=1004843<UP,BROADCAST,RUNNING,MULTICAST,DHCP,IPv4> mtu 1500 index 2
>         inet 172.16.0.119 netmask ffff0000 broadcast 172.16.255.255
> $ ifconfig pFFFF.ibp0
> pFFFF.ibp0: flags=1001000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,FIXEDMTU> mtu 2044 index 3
>         inet 172.18.0.119 netmask ffff0000 broadcast 172.18.255.255
>
> However, I get a message from mca/oob/tcp about not being able to
> communicate between these two interfaces ON THE SAME NODE:
>
> $ /shared/OMPI/openmpi-1.8.4rc1-solaris11-x86-ib-ss12u3/INST/bin/mpirun -mca btl sm,self,openib -np 1 -host pcp-j-19 examples/ring_c
> [pcp-j-19:00899] mca_oob_tcp_accept: accept() failed: Error 0 (0).
> ------------------------------------------------------------
> A process or daemon was unable to complete a TCP connection
> to another process:
>   Local host:  pcp-j-19
>   Remote host: 172.18.0.119
> This is usually caused by a firewall on the remote host. Please
> check that any firewall (e.g., iptables) has been disabled and
> try again.
> ------------------------------------------------------------
>
> Let me know what sort of verbose options I should use to gather any
> additional info you may need.
> -Paul
>
> On Fri, Oct 31, 2014 at 7:14 PM, Ralph Castain <rhc.open...@gmail.com> wrote:
>
>> Hi folks
>>
>> I know 1.8.4 isn't entirely complete just yet, but I'd like to get a head
>> start on the testing so we can release by Fri Nov 7th. So please take a
>> little time and test the current tarball:
>>
>> http://www.open-mpi.org/software/ompi/v1.8/
>>
>> Thanks
>> Ralph
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/10/16138.php

--
Paul H. Hargrove                          phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department     Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
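As a diagnostic (not a fix), it may help to restrict the ORTE out-of-band
channel to a single interface, so the OOB/TCP layer never attempts the
cross-interface connection between the Ethernet and IPoIB addresses on the
same node. The `oob_tcp_if_include` MCA parameter exists in the 1.8 series
for exactly this; whether it masks the regression here is an open question.
This is a CLI fragment, assuming the same install prefix and hostname as in
the report:

```shell
# Hypothetical workaround: pin the OOB channel to the Ethernet interface only.
# If the failure disappears, that points at the cross-interface connect path.
/shared/OMPI/openmpi-1.8.4rc1-solaris11-x86-ib-ss12u3/INST/bin/mpirun \
    -mca btl sm,self,openib \
    -mca oob_tcp_if_include bge0 \
    -np 1 -host pcp-j-19 examples/ring_c
```

The complementary parameter `oob_tcp_if_exclude` can be used the other way
around (e.g. to exclude the IPoIB interface) if including by name is awkward.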
[pcp-j-19:01003] mca: base: components_register: registering oob components
[pcp-j-19:01003] mca: base: components_register: found loaded component tcp
[pcp-j-19:01003] mca: base: components_register: component tcp register function successful
[pcp-j-19:01003] mca: base: components_open: opening oob components
[pcp-j-19:01003] mca: base: components_open: found loaded component tcp
[pcp-j-19:01003] mca: base: components_open: component tcp open function successful
[pcp-j-19:01003] mca:oob:select: checking available component tcp
[pcp-j-19:01003] mca:oob:select: Querying component [tcp]
[pcp-j-19:01003] oob:tcp: component_available called
[pcp-j-19:01003] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[pcp-j-19:01003] [[26539,0],0] oob:tcp:init rejecting loopback interface lo0
[pcp-j-19:01003] WORKING INTERFACE 2 KERNEL INDEX 2 FAMILY: V4
[pcp-j-19:01003] [[26539,0],0] oob:tcp:init adding 172.16.0.119 to our list of V4 connections
[pcp-j-19:01003] WORKING INTERFACE 3 KERNEL INDEX 3 FAMILY: V4
[pcp-j-19:01003] [[26539,0],0] oob:tcp:init adding 172.18.0.119 to our list of V4 connections
[pcp-j-19:01003] [[26539,0],0] TCP STARTUP
[pcp-j-19:01003] [[26539,0],0] attempting to bind to IPv4 port 0
[pcp-j-19:01003] [[26539,0],0] assigned IPv4 port 43391
[pcp-j-19:01003] mca:oob:select: Adding component to end
[pcp-j-19:01003] mca:oob:select: Found 1 active transports
[pcp-j-19:01003] [[26539,0],0]: set_addr to uri 1739259904.0;tcp://172.16.0.119,172.18.0.119:43391
[pcp-j-19:01003] [[26539,0],0]:set_addr peer [[26539,0],0] is me
[pcp-j-19:01004] mca: base: components_register: registering oob components
[pcp-j-19:01004] mca: base: components_register: found loaded component tcp
[pcp-j-19:01004] mca: base: components_register: component tcp register function successful
[pcp-j-19:01004] mca: base: components_open: opening oob components
[pcp-j-19:01004] mca: base: components_open: found loaded component tcp
[pcp-j-19:01004] mca: base: components_open: component tcp open function successful
[pcp-j-19:01004] mca:oob:select: checking available component tcp
[pcp-j-19:01004] mca:oob:select: Querying component [tcp]
[pcp-j-19:01004] oob:tcp: component_available called
[pcp-j-19:01004] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[pcp-j-19:01004] [[26539,1],0] oob:tcp:init rejecting loopback interface lo0
[pcp-j-19:01004] WORKING INTERFACE 2 KERNEL INDEX 2 FAMILY: V4
[pcp-j-19:01004] [[26539,1],0] oob:tcp:init adding 172.16.0.119 to our list of V4 connections
[pcp-j-19:01004] WORKING INTERFACE 3 KERNEL INDEX 3 FAMILY: V4
[pcp-j-19:01004] [[26539,1],0] oob:tcp:init adding 172.18.0.119 to our list of V4 connections
[pcp-j-19:01004] [[26539,1],0] TCP STARTUP
[pcp-j-19:01004] [[26539,1],0] attempting to bind to IPv4 port 0
[pcp-j-19:01004] [[26539,1],0] assigned IPv4 port 56330
[pcp-j-19:01004] mca:oob:select: Adding component to end
[pcp-j-19:01004] mca:oob:select: Found 1 active transports
[pcp-j-19:01004] [[26539,1],0]: set_addr to uri 1739259904.0;tcp://172.16.0.119,172.18.0.119:43391
[pcp-j-19:01004] [[26539,1],0]:set_addr checking if peer [[26539,0],0] is reachable via component tcp
[pcp-j-19:01004] [[26539,1],0] oob:tcp: working peer [[26539,0],0] address tcp://172.16.0.119,172.18.0.119:43391
[pcp-j-19:01004] [[26539,1],0] PASSING ADDR 172.16.0.119 TO MODULE
[pcp-j-19:01004] [[26539,1],0]:tcp set addr for peer [[26539,0],0]
[pcp-j-19:01004] [[26539,1],0] PASSING ADDR 172.18.0.119 TO MODULE
[pcp-j-19:01004] [[26539,1],0]:tcp set addr for peer [[26539,0],0]
[pcp-j-19:01004] [[26539,1],0]: peer [[26539,0],0] is reachable via component tcp
[pcp-j-19:01004] [[26539,1],0] OOB_SEND: /shared/OMPI/openmpi-1.8.4rc1-solaris11-x86-ib-ss12u3/openmpi-1.8.4rc1/orte/mca/rml/oob/rml_oob_send.c:199
[pcp-j-19:01004] [[26539,1],0]:tcp:processing set_peer cmd
[pcp-j-19:01004] [[26539,1],0] SET_PEER ADDING PEER [[26539,0],0]
[pcp-j-19:01004] [[26539,1],0] set_peer: peer [[26539,0],0] is listening on net 172.16.0.119 port 43391
[pcp-j-19:01004] [[26539,1],0]:tcp:processing set_peer cmd
[pcp-j-19:01004] [[26539,1],0] set_peer: peer [[26539,0],0] is listening on net 172.18.0.119 port 43391
[pcp-j-19:01004] [[26539,1],0] oob:base:send to target [[26539,0],0]
[pcp-j-19:01004] [[26539,1],0] oob:tcp:send_nb to peer [[26539,0],0]:1
[pcp-j-19:01004] [[26539,1],0] tcp:send_nb to peer [[26539,0],0]
[pcp-j-19:01004] [[26539,1],0]:[/shared/OMPI/openmpi-1.8.4rc1-solaris11-x86-ib-ss12u3/openmpi-1.8.4rc1/orte/mca/oob/tcp/oob_tcp.c:478] post send to [[26539,0],0]
[pcp-j-19:01004] [[26539,1],0]:[/shared/OMPI/openmpi-1.8.4rc1-solaris11-x86-ib-ss12u3/openmpi-1.8.4rc1/orte/mca/oob/tcp/oob_tcp.c:415] processing send to peer [[26539,0],0]:1
[pcp-j-19:01004] [[26539,1],0]:[/shared/OMPI/openmpi-1.8.4rc1-solaris11-x86-ib-ss12u3/openmpi-1.8.4rc1/orte/mca/oob/tcp/oob_tcp.c:449] queue pending to [[26539,0],0]
[pcp-j-19:01004] [[26539,1],0] tcp:send_nb: initiating connection to [[26539,0],0]
[pcp-j-19:01004] [[26539,1],0]:[/shared/OMPI/openmpi-1.8.4rc1-solaris11-x86-ib-ss12u3/openmpi-1.8.4rc1/orte/mca/oob/tcp/oob_tcp.c:463] connect to [[26539,0],0]
[pcp-j-19:01004] [[26539,1],0] orte_tcp_peer_try_connect: attempting to connect to proc [[26539,0],0]
[pcp-j-19:01004] [[26539,1],0] oob:tcp:peer creating socket to [[26539,0],0]
[pcp-j-19:01004] [[26539,1],0] orte_tcp_peer_try_connect: attempting to connect to proc [[26539,0],0] on socket 14
[pcp-j-19:01004] [[26539,1],0] orte_tcp_peer_try_connect: attempting to connect to proc [[26539,0],0] on 172.16.0.119:43391 - 0 retries
[pcp-j-19:01004] [[26539,1],0] orte_tcp_peer_try_connect: attempting to connect to proc [[26539,0],0] on 172.18.0.119:43391 - 0 retries
[pcp-j-19:01003] [[26539,0],0] mca_oob_tcp_listen_thread: new connection: (16, 0) 172.16.0.119:44249
[pcp-j-19:01003] mca_oob_tcp_accept: accept() failed: Error 0 (0).
[pcp-j-19:01003] [[26539,0],0] connection_handler: working connection (16, 11) 172.16.0.119:44249
[pcp-j-19:01003] [[26539,0],0] accept_connection: 172.16.0.119:44249
[pcp-j-19:01004] [[26539,1],0] tcp:failed_to_connect called for peer [[26539,0],0]
[pcp-j-19:01004] [[26539,1],0] tcp:failed_to_connect unable to reach peer [[26539,0],0]
[pcp-j-19:01003] [[26539,0],0]:tcp:recv:handler called
[pcp-j-19:01003] [[26539,0],0] RECV CONNECT ACK FROM UNKNOWN ON SOCKET 16
[pcp-j-19:01003] [[26539,0],0] waiting for connect ack from UNKNOWN
[pcp-j-19:01003] [[26539,0],0]-UNKNOWN tcp_peer_recv_blocking: peer closed connection: peer state 0
[pcp-j-19:01003] [[26539,0],0] unable to complete recv of connect-ack from UNKNOWN ON SOCKET 16
[pcp-j-19:01003] [[26539,0],0] TCP SHUTDOWN
[pcp-j-19:01003] mca: base: close: component tcp closed
[pcp-j-19:01003] mca: base: close: unloading component tcp