I am trying to run NetPIPE-3.7.1 NPmpi using Open MPI version 1.3.2 with
the openib BTL in an OFED-1.4 environment. The system environment is two
Linux (2.6.27) ppc64 blades, each with one Chelsio RNIC device,
interconnected by a 10GbE switch. The problem is that, with Open MPI, I
cannot establish connections between the two MPI ranks.
I have already read the OMPI FAQ entries and searched this mailing list
for similar problem reports, without success. I have a compressed
config.log that I can provide separately (it is 80 KB, so I'll spare
everyone here), and I can also share the output of ompi_info --all.
I can successfully run small diagnostic programs such as rping,
ib_rdma_bw, ib_rdma_lat, etc. between the same two blades. I can also
run NPmpi using another MPI library (MVAPICH2) and the Chelsio/iWARP
interface.
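For reference, the rping checks were along these lines (the address and
count are illustrative, not the exact invocations):

  server blade:  rping -s -v -C 4
  client blade:  rping -c -a <server-ip> -v -C 4

These complete cleanly, so the iWARP path between the blades appears
healthy.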
Here is one example mpirun command line I used:
mpirun --mca orte_base_help_aggregate 0 --mca btl openib,self --hostfile
~/1usrv_ompi_machfile -np 2 ./NPmpi -p0 -l 1 -u 1024 > outfile1 2>&1
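For completeness, ~/1usrv_ompi_machfile just lists the two blades, one
per line, roughly like this (the slot counts shown are illustrative):

  aae1 slots=1
  aae4 slots=1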
Here is the output of that run:
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
Local host: aae1
Local device: cxgb3_0
CPCs attempted: oob, xoob, rdmacm
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
Local host: aae4
Local device: cxgb3_0
CPCs attempted: oob, xoob, rdmacm
--------------------------------------------------------------------------
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.
Process 1 ([[3115,1],0]) is on host: aae4
Process 2 ([[3115,1],1]) is on host: aae1
BTLs attempted: self
Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.
Process 1 ([[3115,1],1]) is on host: aae1
Process 2 ([[3115,1],0]) is on host: aae4
BTLs attempted: self
Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
PML add procs failed
--> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
PML add procs failed
--> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
*** before MPI was initialized
[aae1:6598] Abort before MPI_INIT completed successfully; not able to guarantee
that all other processes were killed!
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[aae4:19434] Abort before MPI_INIT completed successfully; not able to
guarantee that all other processes were killed!
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 19434 on
node aae4 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
Here is another mpirun command I used, adding verbosity and more
specific btl parameters:
mpirun --mca orte_base_help_aggregate 0 --mca btl openib,self,sm --mca
btl_base_verbose 10 --mca btl_openib_verbose 10 --mca
btl_openib_if_include cxgb3_0:1 --mca btl_openib_cpc_include rdmacm
--mca btl_openib_device_type iwarp --mca btl_openib_max_btls 1 --mca
mpi_leave_pinned 1 --hostfile ~/1usrv_ompi_machfile -np 2 ./NPmpi -p0 -l
1 -u 1024 > ~/outfile2 2>&1
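(Aside: the cxgb3_0:1 device/port name given to btl_openib_if_include
can be double-checked on each blade with something like

  ibv_devinfo -d cxgb3_0 | grep -E 'hca_id|port:|state'

which reports the device's hca_id and the state of each port.)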
Here is the output of that run:
[aae4:19426] mca: base: components_open: Looking for btl components
[aae4:19426] mca: base: components_open: opening btl components
[aae4:19426] mca: base: components_open: found loaded component openib
[aae4:19426] mca: base: components_open: component openib has no register
function
[aae4:19426] mca: base: components_open: component openib open function
successful
[aae4:19426] mca: base: components_open: found loaded component self
[aae4:19426] mca: base: components_open: component self has no register function
[aae4:19426] mca: base: components_open: component self open function successful
[aae4:19426] mca: base: components_open: found loaded component sm
[aae4:19426] mca: base: components_open: component sm has no register function
[aae4:19426] mca: base: components_open: component sm open function successful
[aae1:06503] mca: base: components_open: Looking for btl components
[aae1:06503] mca: base: components_open: opening btl components
[aae1:06503] mca: base: components_open: found loaded component openib
[aae1:06503] mca: base: components_open: component openib has no register
function
[aae1:06503] mca: base: components_open: component openib open function
successful
[aae1:06503] mca: base: components_open: found loaded component self
[aae1:06503] mca: base: components_open: component self has no register function
[aae1:06503] mca: base: components_open: component self open function successful
[aae1:06503] mca: base: components_open: found loaded component sm
[aae1:06503] mca: base: components_open: component sm has no register function
[aae1:06503] mca: base: components_open: component sm open function successful
[aae4:19426] select: initializing btl component openib
[aae4][[3107,1],0][btl_openib_ini.c:166:ompi_btl_openib_ini_query] Querying INI
files for vendor 0x1425, part ID 49
[aae4][[3107,1],0][btl_openib_ini.c:185:ompi_btl_openib_ini_query] Found
corresponding INI values: Chelsio T3
[aae4][[3107,1],0][btl_openib_ini.c:166:ompi_btl_openib_ini_query] Querying INI
files for vendor 0x0000, part ID 0
[aae4][[3107,1],0][btl_openib_ini.c:185:ompi_btl_openib_ini_query] Found
corresponding INI values: default
[aae4:19426] openib BTL: rdmacm CPC available for use on cxgb3_0
[aae4:19426] select: init of component openib returned success
[aae4:19426] select: initializing btl component self
[aae4:19426] select: init of component self returned success
[aae4:19426] select: initializing btl component sm
[aae4:19426] select: init of component sm returned success
[aae1:06503] select: initializing btl component openib
[aae1][[3107,1],1][btl_openib_ini.c:166:ompi_btl_openib_ini_query] Querying INI
files for vendor 0x1425, part ID 49
[aae1][[3107,1],1][btl_openib_ini.c:185:ompi_btl_openib_ini_query] Found
corresponding INI values: Chelsio T3
[aae1][[3107,1],1][btl_openib_ini.c:166:ompi_btl_openib_ini_query] Querying INI
files for vendor 0x0000, part ID 0
[aae1][[3107,1],1][btl_openib_ini.c:185:ompi_btl_openib_ini_query] Found
corresponding INI values: default
[aae1:06503] openib BTL: rdmacm CPC available for use on cxgb3_0
[aae1:06503] select: init of component openib returned success
[aae1:06503] select: initializing btl component self
[aae1:06503] select: init of component self returned success
[aae1:06503] select: initializing btl component sm
[aae1:06503] select: init of component sm returned success
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.
Process 1 ([[3107,1],0]) is on host: aae4
Process 2 ([[3107,1],1]) is on host: aae1
BTLs attempted: openib self sm
Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.
Process 1 ([[3107,1],1]) is on host: aae1
Process 2 ([[3107,1],0]) is on host: aae4
BTLs attempted: openib self sm
Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
PML add procs failed
--> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
PML add procs failed
--> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[aae1:6503] Abort before MPI_INIT completed successfully; not able to guarantee
that all other processes were killed!
[aae4:19426] Abort before MPI_INIT completed successfully; not able to
guarantee that all other processes were killed!
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 19426 on
node aae4 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
Thanks for any advice/help you can offer.
-Ken