Hi, I'm trying to run an Open MPI 1.6.5 job across a set of nodes, some with 
Mellanox cards and some with QLogic cards.  I'm getting errors indicating "At 
least one pair of MPI processes are unable to reach each other for MPI 
communications".  As far as I can tell, all of the nodes are properly 
configured and able to reach each other over both IP and non-IP connections.
I've also discovered that even if I take InfiniBand out of the picture 
entirely by restricting the BTLs with "--mca btl tcp,self", I still get the 
same error.
The test (a simple all-to-all program) works fine if I confine it to hosts 
with identical IB cards.
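For reference, the test is essentially a bare MPI_Alltoall; the sketch below 
is representative of what it does (it is not the exact source of 
alltoall.mpi-1.6.5):

  /* Representative sketch of the all-to-all test (not the exact
   * alltoall.mpi-1.6.5 source): each rank exchanges one int with every
   * other rank and then exits cleanly. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank, size, i;
      int *sendbuf, *recvbuf;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      sendbuf = malloc(size * sizeof(int));
      recvbuf = malloc(size * sizeof(int));
      for (i = 0; i < size; i++)
          sendbuf[i] = rank;              /* each rank sends its own id */

      MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);

      if (rank == 0)
          printf("alltoall completed across %d ranks\n", size);

      free(sendbuf);
      free(recvbuf);
      MPI_Finalize();
      return 0;
  }

It never gets that far here, though; MPI_Init itself fails, as the log below 
shows.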
I'd appreciate some assistance in figuring out what I'm doing wrong.

Thanks,
Kevin

Here's a log of a failed run:

> mpirun -d --debug-daemons --mca btl tcp,self --mca orte_base_help_aggregate 0 
> --mca btl_base_verbose 100 -np 2 -machinefile foo.hosts 
> /homes/kevin/alltoall.mpi-1.6.5
[compute-g18-5.deepthought.umd.edu:20574] procdir: 
/tmp/openmpi-sessions-ke...@compute-g18-5.deepthought.umd.edu_0/63142/0/0
[compute-g18-5.deepthought.umd.edu:20574] jobdir: 
/tmp/openmpi-sessions-ke...@compute-g18-5.deepthought.umd.edu_0/63142/0
[compute-g18-5.deepthought.umd.edu:20574] top: 
openmpi-sessions-ke...@compute-g18-5.deepthought.umd.edu_0
[compute-g18-5.deepthought.umd.edu:20574] tmp: /tmp
[compute-g18-5.deepthought.umd.edu:20574] mpirun: reset PATH: 
/cell_root/software/openmpi/1.6.5/gnu/4.8.1/threaded/sys/bin:/cell_root/software/openmpi/1.6.5/gnu/4.8.1/threaded/sys/bin:/cell_root/software/gcc/4.8.1/sys/bin:/cell_root/software/moab/bin:/cell_root/software/gold/bin:/usr/local/ofed/1.5.4/sbin:/usr/local/ofed/1.5.4/bin:/homes/kevin/bin:/homes/kevin/bin/amd64:/dept/oit/glue/scripts:/usr/local/scripts:/usr/local/bin:/usr/bin:/bin:/sbin:/usr/sbin:/usr/afsws/bin:/usr/afsws/etc
[compute-g18-5.deepthought.umd.edu:20574] mpirun: reset LD_LIBRARY_PATH: 
/cell_root/software/openmpi/1.6.5/gnu/4.8.1/threaded/sys/lib:/usr/local/ofed/1.5.4/lib64
Daemon was launched on compute-g17-33.deepthought.umd.edu - beginning to 
initialize
[compute-g17-33.deepthought.umd.edu:20174] procdir: 
/tmp/openmpi-sessions-ke...@compute-g17-33.deepthought.umd.edu_0/63142/0/1
[compute-g17-33.deepthought.umd.edu:20174] jobdir: 
/tmp/openmpi-sessions-ke...@compute-g17-33.deepthought.umd.edu_0/63142/0
[compute-g17-33.deepthought.umd.edu:20174] top: 
openmpi-sessions-ke...@compute-g17-33.deepthought.umd.edu_0
[compute-g17-33.deepthought.umd.edu:20174] tmp: /tmp
Daemon [[63142,0],1] checking in as pid 20174 on host 
compute-g17-33.deepthought.umd.edu
[compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] orted: up and running 
- waiting for commands!
[compute-g18-5.deepthought.umd.edu:20574] [[63142,0],0] orted_cmd: received 
add_local_procs
[compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] node[0].name 
compute-g18-5 daemon 0
[compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] node[1].name 
compute-g17-33 daemon 1
[compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] orted_cmd: received 
add_local_procs
  MPIR_being_debugged = 0
  MPIR_debug_state = 1
  MPIR_partial_attach_ok = 1
  MPIR_i_am_starter = 0
  MPIR_forward_output = 0
  MPIR_proctable_size = 2
  MPIR_proctable:
    (i, host, exe, pid) = (0, compute-g18-5.deepthought.umd.edu, 
/homes/kevin/alltoall.mpi-1.6.5, 20576)
    (i, host, exe, pid) = (1, compute-g17-33, /homes/kevin/alltoall.mpi-1.6.5, 
20175)
MPIR_executable_path: NULL
MPIR_server_arguments: NULL
[compute-g18-5.deepthought.umd.edu:20576] procdir: 
/tmp/openmpi-sessions-ke...@compute-g18-5.deepthought.umd.edu_0/63142/1/0
[compute-g18-5.deepthought.umd.edu:20576] jobdir: 
/tmp/openmpi-sessions-ke...@compute-g18-5.deepthought.umd.edu_0/63142/1
[compute-g18-5.deepthought.umd.edu:20576] top: 
openmpi-sessions-ke...@compute-g18-5.deepthought.umd.edu_0
[compute-g18-5.deepthought.umd.edu:20576] tmp: /tmp
[compute-g18-5.deepthought.umd.edu:20574] [[63142,0],0] orted_recv: received 
sync+nidmap from local proc [[63142,1],0]
[compute-g18-5.deepthought.umd.edu:20576] [[63142,1],0] node[0].name 
compute-g18-5 daemon 0
[compute-g18-5.deepthought.umd.edu:20576] [[63142,1],0] node[1].name 
compute-g17-33 daemon 1
[compute-g17-33.deepthought.umd.edu:20175] procdir: 
/tmp/openmpi-sessions-ke...@compute-g17-33.deepthought.umd.edu_0/63142/1/1
[compute-g17-33.deepthought.umd.edu:20175] jobdir: 
/tmp/openmpi-sessions-ke...@compute-g17-33.deepthought.umd.edu_0/63142/1
[compute-g17-33.deepthought.umd.edu:20175] top: 
openmpi-sessions-ke...@compute-g17-33.deepthought.umd.edu_0
[compute-g17-33.deepthought.umd.edu:20175] tmp: /tmp
[compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] orted_recv: received 
sync+nidmap from local proc [[63142,1],1]
[compute-g17-33.deepthought.umd.edu:20175] [[63142,1],1] node[0].name 
compute-g18-5 daemon 0
[compute-g17-33.deepthought.umd.edu:20175] [[63142,1],1] node[1].name 
compute-g17-33 daemon 1
[compute-g18-5.deepthought.umd.e:20576] mca: base: components_open: Looking for 
btl components
[compute-g18-5.deepthought.umd.e:20576] mca: base: components_open: opening btl 
components
[compute-g18-5.deepthought.umd.e:20576] mca: base: components_open: found 
loaded component self
[compute-g18-5.deepthought.umd.e:20576] mca: base: components_open: component 
self has no register function
[compute-g18-5.deepthought.umd.e:20576] mca: base: components_open: component 
self open function successful
[compute-g18-5.deepthought.umd.e:20576] mca: base: components_open: found 
loaded component tcp
[compute-g18-5.deepthought.umd.e:20576] mca: base: components_open: component 
tcp register function successful
[compute-g18-5.deepthought.umd.e:20576] mca: base: components_open: component 
tcp open function successful
[compute-g17-33.deepthought.umd.:20175] mca: base: components_open: Looking for 
btl components
[compute-g17-33.deepthought.umd.:20175] mca: base: components_open: opening btl 
components
[compute-g17-33.deepthought.umd.:20175] mca: base: components_open: found 
loaded component self
[compute-g17-33.deepthought.umd.:20175] mca: base: components_open: component 
self has no register function
[compute-g17-33.deepthought.umd.:20175] mca: base: components_open: component 
self open function successful
[compute-g17-33.deepthought.umd.:20175] mca: base: components_open: found 
loaded component tcp
[compute-g17-33.deepthought.umd.:20175] mca: base: components_open: component 
tcp register function successful
[compute-g17-33.deepthought.umd.:20175] mca: base: components_open: component 
tcp open function successful
[compute-g17-33.deepthought.umd.:20175] select: initializing btl component self
[compute-g17-33.deepthought.umd.:20175] select: init of component self returned 
success
[compute-g17-33.deepthought.umd.:20175] select: initializing btl component tcp
[compute-g17-33.deepthought.umd.:20175] btl: tcp: Searching for exclude 
address+prefix: 127.0.0.1 / 8
[compute-g17-33.deepthought.umd.:20175] btl: tcp: Found match: 127.0.0.1 (lo)
[compute-g17-33.deepthought.umd.:20175] select: init of component tcp returned 
success
[compute-g18-5.deepthought.umd.e:20576] mca: base: close: component self closed
[compute-g18-5.deepthought.umd.e:20576] mca: base: close: unloading component 
self
[compute-g18-5.deepthought.umd.e:20576] mca: base: close: component tcp closed
[compute-g18-5.deepthought.umd.e:20576] mca: base: close: unloading component 
tcp
[compute-g18-5.deepthought.umd.edu:20574] [[63142,0],0] orted_cmd: received 
message_local_procs
[compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] orted_cmd: received 
message_local_procs
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
An MPI process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly.  You should
double check that everything has shut down cleanly.

  Reason:     Before MPI_INIT completed
  Local host: compute-g18-5.deepthought.umd.edu
  PID:        20576
--------------------------------------------------------------------------
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[63142,1],1]) is on host: compute-g17-33.deepthought.umd.edu
  Process 2 ([[63142,1],0]) is on host: compute-g18-5
  BTLs attempted: self tcp

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
MPI_INIT has failed because at least one MPI process is unreachable
from another.  This *usually* means that an underlying communication
plugin -- such as a BTL or an MTL -- has either not loaded or not
allowed itself to be used.  Your MPI job will now abort.

You may wish to try to narrow down the problem;

* Check the output of ompi_info to see which BTL/MTL plugins are
   available.
* Run your application with MPI_THREAD_SINGLE.
* Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
   if using MTL-based communications) to see exactly which
   communication plugins were considered and/or discarded.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
An MPI process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly.  You should
double check that everything has shut down cleanly.

  Reason:     Before MPI_INIT completed
  Local host: compute-g17-33.deepthought.umd.edu
  PID:        20175
--------------------------------------------------------------------------
[compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] orted_cmd: received 
waitpid_fired cmd
[compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] orted_cmd: received 
iof_complete cmd
[compute-g17-33.deepthought.umd.edu:20174] sess_dir_finalize: proc session dir 
not empty - leaving
[compute-g18-5.deepthought.umd.edu:20574] sess_dir_finalize: proc session dir 
not empty - leaving
[compute-g18-5.deepthought.umd.edu:20574] [[63142,0],0] orted_cmd: received 
iof_complete cmd
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 20175 on
node compute-g17-33 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[compute-g18-5.deepthought.umd.edu:20574] [[63142,0],0] orted_cmd: received 
exit cmd
[compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] orted_cmd: received 
exit cmd
[compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] orted: finalizing
[compute-g18-5.deepthought.umd.edu:20574] sess_dir_finalize: job session dir 
not empty - leaving
[compute-g17-33.deepthought.umd.edu:20174] sess_dir_finalize: job session dir 
not empty - leaving
[compute-g18-5.deepthought.umd.edu:20574] [[63142,0],0] Releasing job data for 
[63142,0]
[compute-g18-5.deepthought.umd.edu:20574] [[63142,0],0] Releasing job data for 
[63142,1]
[compute-g18-5.deepthought.umd.edu:20574] sess_dir_finalize: proc session dir 
not empty - leaving
orterun: exiting with status 1
