OK, we've entered the Land of Really Weird: I've never seen a BTL work with 
one MPI app and not another.

Some questions:

- Are you running both apps on the same nodes?
- Is anything else running on the nodes at the same time (e.g., other MPI jobs 
using OpenFabrics)?
- Is IMB compiled against Open MPI 1.4.1?
- Can you run ldd on both apps to ensure they're linking to the same libmpi?
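
For the ldd check, something along these lines should do. This is only a rough sketch: the binary names are the ones from the command lines quoted below, and `libmpi_of` is a hypothetical helper made up here, not anything shipped with Open MPI.

```shell
# Hypothetical helper: reads ldd output on stdin and prints the libmpi
# soname together with the path the dynamic linker resolved it to.
libmpi_of() {
    grep -E 'libmpi\.so' | awk '{ print $1, $3 }'
}

# Binary names assumed from the mpirun command lines in this thread.
for app in ./IMB-MPI1 ./osu_lat_ompi-1.4.1; do
    if [ -x "$app" ]; then
        printf '%s: ' "$app"
        ldd "$app" | libmpi_of
    fi
done
```

If the two paths differ, the apps are picking up different Open MPI installations. It is worth running this on the compute nodes as well, since the library search path there may not match the login node.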

-jms
Sent from my PDA.  No type good.

----- Original Message -----
From: users-boun...@open-mpi.org <users-boun...@open-mpi.org>
To: Open MPI Users <us...@open-mpi.org>
Sent: Wed May 19 02:45:58 2010
Subject: Re: [OMPI users] init of component openib returned failure

Hello,

thanks for your reply.

Jeff Squyres wrote:
> Try running with:
> 
> mpirun.openmpi-1.4.1 --mca btl_base_verbose 50  --mca btl self,openib -n 2 
> --mca btl_openib_verbose 100 ./IMB-MPI1 -npmin 2 PingPong

The output is exactly the same as before.

> 
> Also, are you saying that running the same command line with osu_latency 
> works just fine?  That would be really weird...

Yes, if I run:

mpirun.openmpi-1.4.1 --mca btl_base_verbose 50 --mca btl self,openib -n 2 
--mca btl_openib_verbose 100 ./osu_lat_ompi-1.4.1

the openib component can be initialized:

----------------------------8<----------------------------------------------

[beo-15:29479] mca: base: components_open: Looking for btl components
[beo-16:29063] mca: base: components_open: Looking for btl components
[beo-15:29479] mca: base: components_open: opening btl components
[beo-15:29479] mca: base: components_open: found loaded component openib
[beo-15:29479] mca: base: components_open: component openib has no register 
function
[beo-15:29479] mca: base: components_open: component openib open function 
successful
[beo-15:29479] mca: base: components_open: found loaded component self
[beo-15:29479] mca: base: components_open: component self has no register 
function
[beo-15:29479] mca: base: components_open: component self open function 
successful
[beo-16:29063] mca: base: components_open: opening btl components
[beo-16:29063] mca: base: components_open: found loaded component openib
[beo-16:29063] mca: base: components_open: component openib has no register 
function
[beo-16:29063] mca: base: components_open: component openib open function 
successful
[beo-16:29063] mca: base: components_open: found loaded component self
[beo-16:29063] mca: base: components_open: component self has no register 
function
[beo-16:29063] mca: base: components_open: component self open function 
successful
[beo-15:29479] select: initializing btl component openib
[beo-16:29063] select: initializing btl component openib
[beo-15][[12785,1],0][btl_openib_ini.c:166:ompi_btl_openib_ini_query] Querying 
INI files for vendor 0x02c9, part ID 25204
[beo-15][[12785,1],0][btl_openib_ini.c:185:ompi_btl_openib_ini_query] Found 
corresponding INI values: Mellanox Sinai Infinihost III
[beo-15][[12785,1],0][btl_openib_ini.c:166:ompi_btl_openib_ini_query] Querying 
INI files for vendor 0x0000, part ID 0
[beo-15][[12785,1],0][btl_openib_ini.c:185:ompi_btl_openib_ini_query] Found 
corresponding INI values: default
[beo-15:29479] openib BTL: oob CPC available for use on mthca0:1
[beo-15:29479] openib BTL: xoob CPC only supported with XRC receive queues; 
skipped on mthca0:1
[beo-15:29479] openib BTL: rdmacm CPC available for use on mthca0:1
[beo-15:29479] select: init of component openib returned success
[beo-15:29479] select: initializing btl component self
[beo-15:29479] select: init of component self returned success
[beo-16][[12785,1],1][btl_openib_ini.c:166:ompi_btl_openib_ini_query] Querying 
INI files for vendor 0x02c9, part ID 25204
[beo-16][[12785,1],1][btl_openib_ini.c:185:ompi_btl_openib_ini_query] Found 
corresponding INI values: Mellanox Sinai Infinihost III
[beo-16][[12785,1],1][btl_openib_ini.c:166:ompi_btl_openib_ini_query] Querying 
INI files for vendor 0x0000, part ID 0
[beo-16][[12785,1],1][btl_openib_ini.c:185:ompi_btl_openib_ini_query] Found 
corresponding INI values: default
[beo-16:29063] openib BTL: oob CPC available for use on mthca0:1
[beo-16:29063] openib BTL: xoob CPC only supported with XRC receive queues; 
skipped on mthca0:1
[beo-16:29063] openib BTL: rdmacm CPC available for use on mthca0:1
[beo-16:29063] select: init of component openib returned success
[beo-16:29063] select: initializing btl component self
[beo-16:29063] select: init of component self returned success
# OSU MPI Latency Test (Version 2.2)
# Size          Latency (us)
[beo-16][[12785,1],1][connect/btl_openib_connect_oob.c:313:qp_connect_all] Set 
MTU to IBV value 4 (2048 bytes)
[beo-16][[12785,1],1][connect/btl_openib_connect_oob.c:313:qp_connect_all] Set 
MTU to IBV value 4 (2048 bytes)
[beo-16][[12785,1],1][connect/btl_openib_connect_oob.c:313:qp_connect_all] Set 
MTU to IBV value 4 (2048 bytes)
[beo-16][[12785,1],1][connect/btl_openib_connect_oob.c:313:qp_connect_all] Set 
MTU to IBV value 4 (2048 bytes)
[beo-15][[12785,1],0][connect/btl_openib_connect_oob.c:313:qp_connect_all] Set 
MTU to IBV value 4 (2048 bytes)
[beo-15][[12785,1],0][connect/btl_openib_connect_oob.c:313:qp_connect_all] Set 
MTU to IBV value 4 (2048 bytes)
[beo-15][[12785,1],0][connect/btl_openib_connect_oob.c:313:qp_connect_all] Set 
MTU to IBV value 4 (2048 bytes)
[beo-15][[12785,1],0][connect/btl_openib_connect_oob.c:313:qp_connect_all] Set 
MTU to IBV value 4 (2048 bytes)
0               3.57
1               3.65
2               3.63
4               3.64
8               3.68
16              3.72
32              3.77
64              3.95
128             4.95
256             5.36
512             6.03
1024            7.64
2048            9.95
4096            12.78
8192            18.22
16384           25.48
32768           37.03
65536           60.21
131072          107.90
262144          201.18
524288          389.08
1048576         762.38
2097152         1510.91
4194304         3005.72
[beo-15:29479] mca: base: close: component openib closed
[beo-16:29063] mca: base: close: component openib closed
[beo-16:29063] mca: base: close: unloading component openib
[beo-15:29479] mca: base: close: unloading component openib
[beo-16:29063] mca: base: close: component self closed
[beo-16:29063] mca: base: close: unloading component self
[beo-15:29479] mca: base: close: component self closed
[beo-15:29479] mca: base: close: unloading component self


----------------------------8<----------------------------------------------
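
To compare the two runs without reading through the whole verbose output, the "select: init of component" lines can be pulled out of saved logs. This is a minimal sketch, assuming the mpirun output was redirected to files; the names `imb.log` and `osu.log` and the helper `btl_init_summary` are hypothetical.

```shell
# Hypothetical helper: reads btl_base_verbose output on stdin and prints
# one "host component result" line per distinct BTL init result.
btl_init_summary() {
    sed -n 's/^.*\[\([^]:]*\):[0-9]*\] select: init of component \([a-z]*\) returned \([a-z]*\).*$/\1 \2 \3/p' | sort -u
}

# Assumed log file names; adjust to wherever the output was saved.
for log in imb.log osu.log; do
    if [ -f "$log" ]; then
        echo "== $log"
        btl_init_summary < "$log"
    fi
done
```

For the IMB run this would show openib returning failure on both hosts, and for the OSU run success on both.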

really weird.

   Peter

> 
> 
> On May 18, 2010, at 6:18 AM, Peter Kruse wrote:
> 
>> Hello,
>>
>> trying to run Intel MPI Benchmarks with OpenMPI 1.4.1 fails in initializing
>> the component openib.  System is Debian GNU/Linux 5.0.4.
>> The command to start the job (under Torque 2.4.7) was:
>>
>> mpirun.openmpi-1.4.1 --mca btl_base_verbose 50  --mca btl self,openib -n 2
>> ./IMB-MPI1 -npmin 2 PingPong
>>
>> and results in these messages:
>>
>> ----------------------------8<----------------------------------------------
>>
>> [beo-15:20933] mca: base: components_open: Looking for btl components
>> [beo-16:20605] mca: base: components_open: Looking for btl components
>> [beo-15:20933] mca: base: components_open: opening btl components
>> [beo-15:20933] mca: base: components_open: found loaded component openib
>> [beo-15:20933] mca: base: components_open: component openib has no register
>> function
>> [beo-15:20933] mca: base: components_open: component openib open function
>> successful
>> [beo-15:20933] mca: base: components_open: found loaded component self
>> [beo-15:20933] mca: base: components_open: component self has no register 
>> function
>> [beo-15:20933] mca: base: components_open: component self open function 
>> successful
>> [beo-16:20605] mca: base: components_open: opening btl components
>> [beo-16:20605] mca: base: components_open: found loaded component openib
>> [beo-16:20605] mca: base: components_open: component openib has no register
>> function
>> [beo-16:20605] mca: base: components_open: component openib open function
>> successful
>> [beo-16:20605] mca: base: components_open: found loaded component self
>> [beo-16:20605] mca: base: components_open: component self has no register 
>> function
>> [beo-16:20605] mca: base: components_open: component self open function 
>> successful
>> [beo-15:20933] select: initializing btl component openib
>> [beo-15:20933] select: init of component openib returned failure
>> [beo-15:20933] select: module openib unloaded
>> [beo-15:20933] select: initializing btl component self
>> [beo-15:20933] select: init of component self returned success
>> [beo-16:20605] select: initializing btl component openib
>> [beo-16:20605] select: init of component openib returned failure
>> [beo-16:20605] select: module openib unloaded
>> [beo-16:20605] select: initializing btl component self
>> [beo-16:20605] select: init of component self returned success
>> --------------------------------------------------------------------------
>> At least one pair of MPI processes are unable to reach each other for
>> MPI communications.  This means that no Open MPI device has indicated
>> that it can be used to communicate between these processes.  This is
>> an error; Open MPI requires that all MPI processes be able to reach
>> each other.  This error can sometimes be the result of forgetting to
>> specify the "self" BTL.
>>
>>    Process 1 ([[4887,1],0]) is on host: beo-15
>>    Process 2 ([[4887,1],1]) is on host: beo-16
>>    BTLs attempted: self
>>
>> Your MPI job is now going to abort; sorry.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> It looks like MPI_INIT failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during MPI_INIT; some of which are due to configuration or environment
>> problems.  This failure appears to be an internal failure; here's some
>> additional information (which may only be relevant to an Open MPI
>> developer):
>>
>>    PML add procs failed
>>    --> Returned "Unreachable" (-12) instead of "Success" (0)
>> --------------------------------------------------------------------------
>> *** An error occurred in MPI_Init_thread
>> *** before MPI was initialized
>> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>> [beo-15:20933] Abort before MPI_INIT completed successfully; not able to
>> guarantee that all other processes were killed!
>> --------------------------------------------------------------------------
>> orterun has exited due to process rank 0 with PID 20933 on
>> node beo-15 exiting without calling "finalize". This may
>> have caused other processes in the application to be
>> terminated by signals sent by orterun (as reported here).
>> --------------------------------------------------------------------------
>> *** An error occurred in MPI_Init_thread
>> *** before MPI was initialized
>> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>> [beo-16:20605] Abort before MPI_INIT completed successfully; not able to
>> guarantee that all other processes were killed!
>> [beo-15:20930] 1 more process has sent help message help-mca-bml-r2.txt /
>> unreachable proc
>> [beo-15:20930] Set MCA parameter "orte_base_help_aggregate" to 0 to see all
>> help / error messages
>> [beo-15:20930] 1 more process has sent help message help-mpi-runtime /
>> mpi_init:startup:internal-failure
>>
>> ----------------------------8<----------------------------------------------
>>
>> Running another benchmark (OSU) succeeds in loading the openib component.
>>
>> "ibstat |grep -i state" on both nodes gives:
>>
>> ----------------------------8<----------------------------------------------
>>                  State: Active
>>                  Physical state: LinkUp
>> ----------------------------8<----------------------------------------------
>>
>> Running with "mpi_abort_delay -1" and attaching strace to the process
>> is not very helpful; it loops with:
>>
>> ----------------------------8<----------------------------------------------
>> rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
>> rt_sigaction(SIGCHLD, NULL, {0x2aee58ff3250, [CHLD], SA_RESTORER|SA_RESTART,
>> 0x2aee59d44f60}, 8) = 0
>> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
>> nanosleep({5, 0}, {5, 0})               = 0
>> ----------------------------8<----------------------------------------------
>>
>> Does anybody have an idea what is wrong or how can we get more debugging
>> information about the initialization of the openib module?
>>
>> Thanks for any help,
>>
>>    Peter
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
> 
> 

