Allan,

About IPoIB: the error message (No route to host) is very puzzling.
Did you double check that IPoIB is working between all nodes?
The error message suggests IPoIB is not working between sm3 and sm4;
this could be caused by the subnet manager or by a firewall.
ping is the first tool you should use to test that, and then you can use nc 
(netcat).
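For example, a first quick check from sm3 (a sketch, assuming 10.1.0.5 is 
sm4's ib0 address, as in your diagram):
ping -c 3 10.1.0.5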
Then, for the nc test, on sm4
nc -l 1234
and on sm3
echo hello | nc 10.1.0.5 1234
(expected result: "hello" should be displayed on sm4)
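If ping or nc fails, the firewall is the usual suspect. A minimal check, 
assuming the CentOS 7 nodes run the default firewalld (adjust if you use 
plain iptables instead):
# on each compute node, see whether firewalld is running
systemctl is-active firewalld
# for a quick test, stop it temporarily and re-run the nc test
sudo systemctl stop firewalld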

About openib: you first need to double check that the btl/openib component was built.
Assuming you did not configure with --disable-dlopen, you should have a 
mca_btl_openib.so
file in /.../lib/openmpi. It should be readable by the user, and
ldd /.../lib/openmpi/mca_btl_openib.so
should not have any unresolved dependencies, on *all* your nodes.
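A quick way to run that check on every node in one go (just a sketch, 
assuming passwordless ssh and that /.../ stands for your Open MPI install 
prefix):
for node in sm1 sm2 sm3 sm4; do
    echo "== $node =="
    # the component must exist, be readable, and have no missing shared libraries
    ssh $node 'ls -l /.../lib/openmpi/mca_btl_openib.so; ldd /.../lib/openmpi/mca_btl_openib.so | grep "not found"'
done
Any "not found" output (or a missing file) points at the node that makes 
mca_bml_base_open() fail with "Not found" (-13).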

Cheers,

Gilles

----- Original Message -----
> I have been having some issues with using Open MPI with TCP over IPoIB 
> and with openib. The problems arise when I run a program that uses basic 
> collective communication. The two programs that I have been using are 
> attached.
> 
> *** IPoIB ***
> 
> The mpirun command I am using to run MPI over IPoIB is,
> mpirun --mca oob_tcp_if_include 192.168.1.0/24 --mca btl_tcp_include 
> 10.1.0.0/24 --mca pml ob1 --mca btl tcp,sm,vader,self -hostfile nodes 
> -np 8 ./avg 8000
> 
> This program will appear to run on the nodes, but will sit at 100% CPU 
> and use no memory. On the host node an error will be printed,
> 
> [sm1][[58411,1],0][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.0.3 failed: No route to host (113)
> 
> Using another program,
> 
> mpirun --mca oob_tcp_if_include 192.168.1.0/24 --mca btl_tcp_if_include 
> 10.1.0.0/24 --mca pml ob1 --mca btl tcp,sm,vader,self -hostfile nodes 
> -np 8 ./congrad 800
> produces the following result. This program will also run on the nodes 
> sm1, sm2, sm3, and sm4 at 100% CPU and use no memory.
> [sm3][[61383,1],4][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.0.5 failed: No route to host (113)
> [sm4][[61383,1],6][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.0.4 failed: No route to host (113)
> [sm2][[61383,1],3][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.0.2 failed: No route to host (113)
> [sm3][[61383,1],5][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.0.5 failed: No route to host (113)
> [sm4][[61383,1],7][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.0.4 failed: No route to host (113)
> [sm2][[61383,1],2][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.0.2 failed: No route to host (113)
> [sm1][[61383,1],0][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.0.3 failed: No route to host (113)
> [sm1][[61383,1],1][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.0.3 failed: No route to host (113)
> 
> *** openib ***
> 
> Running the avg program over openib will produce the following result,
> mpirun --mca btl self,sm,openib --mca mtl ^psm --mca btl_tcp_if_include 
> 10.1.0.0/24 -hostfile nodes -np 8 ./avg 800
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> --------------------------------------------------------------------------
> A requested component was not found, or was unable to be opened. This
> means that this component is either not installed or is unable to be
> used on your system (e.g., sometimes this means that shared libraries
> that the component requires are unable to be found/loaded).  Note that
> Open MPI stopped checking at the first component that it did not find.
> Host:      sm2.overst.local
> Framework: btl
> Component: openib
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>    mca_bml_base_open() failed
>    --> Returned "Not found" (-13) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [sm1.overst.local:32239] [[57506,0],1] usock_peer_send_blocking: send() to socket 29 failed: Broken pipe (32)
> [sm1.overst.local:32239] [[57506,0],1] ORTE_ERROR_LOG: Unreachable in 
> file oob_usock_connection.c at line 316
> [sm1.overst.local:32239] [[57506,0],1]-[[57506,1],1] usock_peer_accept: usock_peer_send_connect_ack failed
> [sm1.overst.local:32239] [[57506,0],1] usock_peer_send_blocking: send() to socket 27 failed: Broken pipe (32)
> [sm1.overst.local:32239] [[57506,0],1] ORTE_ERROR_LOG: Unreachable in 
> file oob_usock_connection.c at line 316
> [sm1.overst.local:32239] [[57506,0],1]-[[57506,1],0] usock_peer_accept: usock_peer_send_connect_ack failed
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [smd:31760] 4 more processes have sent help message help-mca-base.txt / find-available:not-valid
> [smd:31760] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
> [smd:31760] 4 more processes have sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
> === Later errors printed out on the host node ===
> ------------------------------------------------------------
> A process or daemon was unable to complete a TCP connection
> to another process:
>    Local host:    sm3
>    Remote host:   10.1.0.1
> This is usually caused by a firewall on the remote host. Please
> check that any firewall (e.g., iptables) has been disabled and
> try again.
> ------------------------------------------------------------
> ------------------------------------------------------------
> A process or daemon was unable to complete a TCP connection
> to another process:
>    Local host:    sm1
>    Remote host:   10.1.0.1
> This is usually caused by a firewall on the remote host. Please
> check that any firewall (e.g., iptables) has been disabled and
> try again.
> ------------------------------------------------------------
> ------------------------------------------------------------
> A process or daemon was unable to complete a TCP connection
> to another process:
>    Local host:    sm2
>    Remote host:   10.1.0.1
> This is usually caused by a firewall on the remote host. Please
> check that any firewall (e.g., iptables) has been disabled and
> try again.
> ------------------------------------------------------------
> ------------------------------------------------------------
> A process or daemon was unable to complete a TCP connection
> to another process:
>    Local host:    sm4
>    Remote host:   10.1.0.1
> This is usually caused by a firewall on the remote host. Please
> check that any firewall (e.g., iptables) has been disabled and
> try again.
> ------------------------------------------------------------
> The ./avg process was not created on any of the nodes.
> Running the ./congrad program,
> mpirun --mca btl self,sm,openib --mca mtl ^psm --mca btl_tcp_if_include 
> 10.1.0.0/24 -hostfile nodes -np 8 ./congrad 800
> will result in the following errors,
> 
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> --------------------------------------------------------------------------
> A requested component was not found, or was unable to be opened. This
> means that this component is either not installed or is unable to be
> used on your system (e.g., sometimes this means that shared libraries
> that the component requires are unable to be found/loaded).  Note that
> Open MPI stopped checking at the first component that it did not find.
> Host:      sm3.overst.local
> Framework: btl
> Component: openib
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>    mca_bml_base_open() failed
>    --> Returned "Not found" (-13) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [sm1.overst.local:32271] [[57834,0],1] usock_peer_send_blocking: send() to socket 29 failed: Broken pipe (32)
> [sm1.overst.local:32271] [[57834,0],1] ORTE_ERROR_LOG: Unreachable in 
> file oob_usock_connection.c at line 316
> [sm1.overst.local:32271] [[57834,0],1]-[[57834,1],0] usock_peer_accept: usock_peer_send_connect_ack failed
> [sm1.overst.local:32271] [[57834,0],1] usock_peer_send_blocking: send() to socket 27 failed: Broken pipe (32)
> [sm1.overst.local:32271] [[57834,0],1] ORTE_ERROR_LOG: Unreachable in 
> file oob_usock_connection.c at line 316
> [sm1.overst.local:32271] [[57834,0],1]-[[57834,1],1] usock_peer_accept: usock_peer_send_connect_ack failed
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [smd:32088] 5 more processes have sent help message help-mca-base.txt / find-available:not-valid
> [smd:32088] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
> [smd:32088] 5 more processes have sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
> 
> These mpirun commands will run successfully with a test program 
> that uses only point-to-point communication.
> 
> The nodes are interconnected in the following way: each host has a dual 
> 1Gb Ethernet bond (bond0) connected to a Gb Ethernet switch, and an 
> InfiniBand card connected to a Voltaire 4036 QDR switch.
> 
>   HOST: smd    Bond0 IP: 192.168.1.200   IB card: MHQH29B-XTR      Ib0 IP: 10.1.0.1   OS: Ubuntu Mate
>   HOST: sm1    Bond0 IP: 192.168.1.196   IB card: QLOGIC QLE7340   Ib0 IP: 10.1.0.2   OS: CentOS 7 Minimal
>   HOST: sm2    Bond0 IP: 192.168.1.199   IB card: QLOGIC QLE7340   Ib0 IP: 10.1.0.3   OS: CentOS 7 Minimal
>   HOST: sm3    Bond0 IP: 192.168.1.203   IB card: QLOGIC QLE7340   Ib0 IP: 10.1.0.4   OS: CentOS 7 Minimal
>   HOST: sm4    Bond0 IP: 192.168.1.204   IB card: QLOGIC QLE7340   Ib0 IP: 10.1.0.5   OS: CentOS 7 Minimal
>   HOST: dl580  Bond0 IP: 192.168.1.201   IB card: QLOGIC QLE7340   Ib0 IP: 10.1.0.6   OS: CentOS 7 Minimal
> 
> Thanks for the help again.
> 
> Sincerely,
> 
> Allan Overstreet
> 
>  
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
