The most common cause of this problem is a firewall between the nodes: ssh gets across, but the MPI processes cannot open TCP connections to each other. Have you checked that the firewall is turned off?
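If you want a quick way to check, something like the following works on a typical RHEL/CentOS-era compute node -- the service name here is an assumption on my part, so adjust for whatever firewall your distribution actually runs:

  # list any active iptables rules on each node
  # (an empty ruleset means nothing is being filtered)
  sudo iptables -L -n

  # temporarily stop the firewall on both nodes, then retry the mpirun
  sudo service iptables stop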
On Jan 17, 2014, at 4:59 PM, Doug Roberts <robe...@sharcnet.ca> wrote:

> 
> 1) When openmpi programs run across multiple nodes they hang
> rather quickly, as shown in the mpi_test example below. Note
> that I am assuming the initial topology error message is a
> separate issue, since single-node openmpi jobs run just fine.
> 
> [roberpj@bro127:~/samples/mpi_test]
> /opt/sharcnet/openmpi/1.6.5/intel/bin/mpirun -np 2 --mca btl tcp,sm,self --mca btl_tcp_if_include eth0,eth2 --mca btl_base_verbose 30 --debug-daemons --host bro127,bro128 ./a.out
> Daemon was launched on bro128 - beginning to initialize
> ****************************************************************************
> * Hwloc has encountered what looks like an error from the operating system.
> *
> * object intersection without inclusion!
> * Error occurred in topology.c line 594
> *
> * Please report this error message to the hwloc user's mailing list,
> * along with the output from the hwloc-gather-topology.sh script.
> ****************************************************************************
> Daemon [[9945,0],1] checking in as pid 20978 on host bro128
> [bro127:19340] [[9945,0],0] orted_cmd: received add_local_procs
> [bro128:20978] [[9945,0],1] orted: up and running - waiting for commands!
> [bro128:20978] [[9945,0],1] node[0].name bro127 daemon 0
> [bro128:20978] [[9945,0],1] node[1].name bro128 daemon 1
> [bro128:20978] [[9945,0],1] orted_cmd: received add_local_procs
> MPIR_being_debugged = 0
> MPIR_debug_state = 1
> MPIR_partial_attach_ok = 1
> MPIR_i_am_starter = 0
> MPIR_forward_output = 0
> MPIR_proctable_size = 2
> MPIR_proctable:
> (i, host, exe, pid) = (0, bro127, /home/roberpj/samples/mpi_test/./a.out, 19348)
> (i, host, exe, pid) = (1, bro128, /home/roberpj/samples/mpi_test/./a.out, 20979)
> MPIR_executable_path: NULL
> MPIR_server_arguments: NULL
> [bro128:20978] [[9945,0],1] orted_recv: received sync+nidmap from local proc [[9945,1],1]
> [bro127:19340] [[9945,0],0] orted_recv: received sync+nidmap from local proc [[9945,1],0]
> [bro128:20979] mca: base: components_open: Looking for btl components
> [bro127:19348] mca: base: components_open: Looking for btl components
> [bro128:20979] mca: base: components_open: opening btl components
> [bro128:20979] mca: base: components_open: found loaded component self
> [bro128:20979] mca: base: components_open: component self has no register function
> [bro128:20979] mca: base: components_open: component self open function successful
> [bro128:20979] mca: base: components_open: found loaded component sm
> [bro128:20979] mca: base: components_open: component sm has no register function
> [bro128:20979] mca: base: components_open: component sm open function successful
> [bro128:20979] mca: base: components_open: found loaded component tcp
> [bro128:20979] mca: base: components_open: component tcp register function successful
> [bro128:20979] mca: base: components_open: component tcp open function successful
> [bro127:19348] mca: base: components_open: opening btl components
> [bro127:19348] mca: base: components_open: found loaded component self
> [bro127:19348] mca: base: components_open: component self has no register function
> [bro127:19348] mca: base: components_open: component self open function successful
> [bro127:19348] mca: base: components_open: found loaded component sm
> [bro127:19348] mca: base: components_open: component sm has no register function
> [bro127:19348] mca: base: components_open: component sm open function successful
> [bro127:19348] mca: base: components_open: found loaded component tcp
> [bro127:19348] mca: base: components_open: component tcp register function successful
> [bro127:19348] mca: base: components_open: component tcp open function successful
> [bro128:20979] select: initializing btl component self
> [bro128:20979] select: init of component self returned success
> [bro128:20979] select: initializing btl component sm
> [bro128:20979] select: init of component sm returned success
> [bro128:20979] select: initializing btl component tcp
> [bro128:20979] select: init of component tcp returned success
> [bro127:19348] select: initializing btl component self
> [bro127:19348] select: init of component self returned success
> [bro127:19348] select: initializing btl component sm
> [bro127:19348] select: init of component sm returned success
> [bro127:19348] select: initializing btl component tcp
> [bro127:19348] select: init of component tcp returned success
> [bro127:19340] [[9945,0],0] orted_cmd: received message_local_procs
> [bro128:20978] [[9945,0],1] orted_cmd: received message_local_procs
> [bro127:19340] [[9945,0],0] orted_cmd: received message_local_procs
> [bro128:20978] [[9945,0],1] orted_cmd: received message_local_procs
> [bro127:19348] btl: tcp: attempting to connect() to address 10.27.2.128 on port 4
> Number of processes = 2
> Test repeated 3 times for reliability
> [bro128:20979] btl: tcp: attempting to connect() to address 10.27.2.127 on port 4
> [bro127:19348] btl: tcp: attempting to connect() to address 10.29.4.128 on port 4
> I am process 0 on node bro127
> Run 1 of 3
> P0: Sending to P1
> P0: Waiting to receive from P1
> I am process 1 on node bro128
> P1: Waiting to receive from to P0
> [bro127][[9945,1],0][../../../../../../openmpi-1.6.5/ompi/mca/btl/tcp/btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)
> ^C
> mpirun: killing job...
> Killed by signal 2.
> [bro127:19340] [[9945,0],0] orted_cmd: received exit cmd
> [bro127:19340] [[9945,0],0] orted_cmd: received iof_complete cmd
> 
> 
> 2) The interfaces on the bro127 and bro128 compute nodes include a 1 GbE
> network on eth0 and a high-speed 10 GbE network on eth2, such as ...
> 
> [roberpj@bro127:~] ifconfig
> eth0  Link encap:Ethernet  HWaddr 00:E0:81:C7:A8:E3
>       inet addr:10.27.2.127  Bcast:10.27.2.255  Mask:255.255.254.0
> 
> eth2  Link encap:Ethernet  HWaddr 90:E2:BA:2D:83:F0
>       inet addr:10.29.4.127  Bcast:10.29.63.255  Mask:255.255.192.0
> 
> lo    Link encap:Local Loopback
>       inet addr:127.0.0.1  Mask:255.0.0.0
> 
> 
> 3) Hostnames resolve, and I can connect between the 10.x addresses
> using ssh without passwords on the internal network ...
> 
> [roberpj@bro127:~] host bro127
> bro127.brown.sharcnet has address 10.27.2.127
> [roberpj@bro127:~] host bro128
> bro128.brown.sharcnet has address 10.27.2.128
> [roberpj@bro127:~] host ic-bro127
> ic-bro127.brown.sharcnet has address 10.29.4.127
> [roberpj@bro127:~] host ic-bro128
> ic-bro128.brown.sharcnet has address 10.29.4.128
> 
> [roberpj@bro127:~] ssh bro128
> [roberpj@bro128:~]
> [roberpj@bro127:~] ssh ic-bro128
> [roberpj@bro128:~]
> 
> 
> 4) I'm attaching the output file "ompi_info--all_bro127.out.bz2", created
> by running the command: ompi_info --all >& ompi_info--all_bro127.out, in case
> that helps. If anything else is needed please let me know, thank you.
> -Doug
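One more way to confirm the diagnosis without involving MPI at all: the hang happens when the TCP BTL's connect() between the nodes never completes, so try opening a plain TCP connection on an arbitrary unprivileged port (5000 below is just an example, and the nc syntax varies a bit between netcat versions):

  # on bro128: listen on a test port
  nc -l 5000

  # on bro127: try to reach it over each network
  nc 10.27.2.128 5000
  nc 10.29.4.128 5000

If ssh works between the nodes but these connections hang or time out, a firewall in between is almost certainly dropping the traffic.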