The most common cause of this problem is a firewall between the nodes: you can
ssh across, but the TCP connections Open MPI opens on arbitrary ports cannot get
through. Have you checked that the firewall is turned off on both nodes?
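
For example, a quick way to check (a rough sketch only: this assumes iptables is
the firewall in use on the nodes and that nc/netcat is installed; the port 5000
below is arbitrary, and some netcat variants want "nc -l -p 5000" instead of
"nc -l 5000"):

  # On each node, list the firewall rules; empty chains with an ACCEPT
  # policy mean nothing is being filtered:
  sudo iptables -L -n

  # Test a raw TCP connection on a high port, which is roughly what the
  # Open MPI TCP BTL does. On bro128, start a listener:
  nc -l 5000
  # ...then connect from bro127 over each interface:
  nc -w 5 10.27.2.128 5000
  nc -w 5 10.29.4.128 5000

If ssh works but those raw connections time out, a firewall rule on the 10.27.x
or 10.29.x network is the likely culprit; on RHEL/CentOS-style nodes you can
temporarily turn the firewall off on both hosts with "sudo service iptables stop"
and re-run the test.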

On Jan 17, 2014, at 4:59 PM, Doug Roberts <robe...@sharcnet.ca> wrote:

> 
> 1) When Open MPI programs run across multiple nodes they hang
> rather quickly, as shown in the mpi_test example below. Note
> that I am assuming the initial topology error message is a
> separate issue, since single-node Open MPI jobs run just fine.
> 
> [roberpj@bro127:~/samples/mpi_test]
> /opt/sharcnet/openmpi/1.6.5/intel/bin/mpirun -np 2 --mca btl tcp,sm,self
> --mca btl_tcp_if_include eth0,eth2 --mca btl_base_verbose 30
> --debug-daemons --host bro127,bro128 ./a.out
> Daemon was launched on bro128 - beginning to initialize
> ****************************************************************************
> * Hwloc has encountered what looks like an error from the operating system.
> *
> * object intersection without inclusion!
> * Error occurred in topology.c line 594
> *
> * Please report this error message to the hwloc user's mailing list,
> * along with the output from the hwloc-gather-topology.sh script.
> ****************************************************************************
> Daemon [[9945,0],1] checking in as pid 20978 on host bro128
> [bro127:19340] [[9945,0],0] orted_cmd: received add_local_procs
> [bro128:20978] [[9945,0],1] orted: up and running - waiting for commands!
> [bro128:20978] [[9945,0],1] node[0].name bro127 daemon 0
> [bro128:20978] [[9945,0],1] node[1].name bro128 daemon 1
> [bro128:20978] [[9945,0],1] orted_cmd: received add_local_procs
>  MPIR_being_debugged = 0
>  MPIR_debug_state = 1
>  MPIR_partial_attach_ok = 1
>  MPIR_i_am_starter = 0
>  MPIR_forward_output = 0
>  MPIR_proctable_size = 2
>  MPIR_proctable:
>    (i, host, exe, pid) = (0, bro127, /home/roberpj/samples/mpi_test/./a.out, 
> 19348)
>    (i, host, exe, pid) = (1, bro128, /home/roberpj/samples/mpi_test/./a.out, 
> 20979)
> MPIR_executable_path: NULL
> MPIR_server_arguments: NULL
> [bro128:20978] [[9945,0],1] orted_recv: received sync+nidmap from local proc 
> [[9945,1],1]
> [bro127:19340] [[9945,0],0] orted_recv: received sync+nidmap from local proc 
> [[9945,1],0]
> [bro128:20979] mca: base: components_open: Looking for btl components
> [bro127:19348] mca: base: components_open: Looking for btl components
> [bro128:20979] mca: base: components_open: opening btl components
> [bro128:20979] mca: base: components_open: found loaded component self
> [bro128:20979] mca: base: components_open: component self has no register 
> function
> [bro128:20979] mca: base: components_open: component self open function 
> successful
> [bro128:20979] mca: base: components_open: found loaded component sm
> [bro128:20979] mca: base: components_open: component sm has no register 
> function
> [bro128:20979] mca: base: components_open: component sm open function 
> successful
> [bro128:20979] mca: base: components_open: found loaded component tcp
> [bro128:20979] mca: base: components_open: component tcp register function 
> successful
> [bro128:20979] mca: base: components_open: component tcp open function 
> successful
> [bro127:19348] mca: base: components_open: opening btl components
> [bro127:19348] mca: base: components_open: found loaded component self
> [bro127:19348] mca: base: components_open: component self has no register 
> function
> [bro127:19348] mca: base: components_open: component self open function 
> successful
> [bro127:19348] mca: base: components_open: found loaded component sm
> [bro127:19348] mca: base: components_open: component sm has no register 
> function
> [bro127:19348] mca: base: components_open: component sm open function 
> successful
> [bro127:19348] mca: base: components_open: found loaded component tcp
> [bro127:19348] mca: base: components_open: component tcp register function 
> successful
> [bro127:19348] mca: base: components_open: component tcp open function 
> successful
> [bro128:20979] select: initializing btl component self
> [bro128:20979] select: init of component self returned success
> [bro128:20979] select: initializing btl component sm
> [bro128:20979] select: init of component sm returned success
> [bro128:20979] select: initializing btl component tcp
> [bro128:20979] select: init of component tcp returned success
> [bro127:19348] select: initializing btl component self
> [bro127:19348] select: init of component self returned success
> [bro127:19348] select: initializing btl component sm
> [bro127:19348] select: init of component sm returned success
> [bro127:19348] select: initializing btl component tcp
> [bro127:19348] select: init of component tcp returned success
> [bro127:19340] [[9945,0],0] orted_cmd: received message_local_procs
> [bro128:20978] [[9945,0],1] orted_cmd: received message_local_procs
> [bro127:19340] [[9945,0],0] orted_cmd: received message_local_procs
> [bro128:20978] [[9945,0],1] orted_cmd: received message_local_procs
> [bro127:19348] btl: tcp: attempting to connect() to address 10.27.2.128 on 
> port 4
> Number of processes = 2
> Test repeated 3 times for reliability
> [bro128:20979] btl: tcp: attempting to connect() to address 10.27.2.127 on 
> port 4
> [bro127:19348] btl: tcp: attempting to connect() to address 10.29.4.128 on 
> port 4
> I am process 0 on node bro127
> Run 1 of 3
> P0: Sending to P1
> P0: Waiting to receive from P1
> I am process 1 on node bro128
> P1: Waiting to receive from to P0
> [bro127][[9945,1],0][../../../../../../openmpi-1.6.5/ompi/mca/btl/tcp/btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
>  mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)
> ^C
> mpirun: killing job...
> Killed by signal 2.
> [bro127:19340] [[9945,0],0] orted_cmd: received exit cmd
> [bro127:19340] [[9945,0],0] orted_cmd: received iof_complete cmd
> 
> 
> 2) The interfaces on the bro127 and bro128 compute nodes include a 1 GbE
> network on eth0 and a high-speed 10 GbE network on eth2, as follows ...
> 
> [roberpj@bro127:~] ifconfig
> eth0      Link encap:Ethernet  HWaddr 00:E0:81:C7:A8:E3
>          inet addr:10.27.2.127  Bcast:10.27.2.255  Mask:255.255.254.0
> 
> eth2      Link encap:Ethernet  HWaddr 90:E2:BA:2D:83:F0
>          inet addr:10.29.4.127  Bcast:10.29.63.255  Mask:255.255.192.0
> 
> lo        Link encap:Local Loopback
>          inet addr:127.0.0.1  Mask:255.0.0.0
> 
> 
> 3) Hostnames resolve, and I can connect between the 10.x addresses
> using passwordless ssh on the internal network ...
> 
> [roberpj@bro127:~] host bro127
> bro127.brown.sharcnet has address 10.27.2.127
> [roberpj@bro127:~] host bro128
> bro128.brown.sharcnet has address 10.27.2.128
> [roberpj@bro127:~] host ic-bro127
> ic-bro127.brown.sharcnet has address 10.29.4.127
> [roberpj@bro127:~] host ic-bro128
> ic-bro128.brown.sharcnet has address 10.29.4.128
> 
> [roberpj@bro127:~] ssh bro128
> [roberpj@bro128:~]
> [roberpj@bro127:~] ssh ic-bro128
> [roberpj@bro128:~]
> 
> 
> 4) I'm attaching the output file "ompi_info--all_bro127.out.bz2", created
> by running the command:  ompi_info --all >& ompi_info--all_bro127.out, in case
> that helps.  If anything else is needed please let me know, thank you.
> -Doug
> <ompi_info--all_bro127.out.bz2>
