Hello,

Thanks for your reply.

Jeff Squyres wrote:
Try running with:

mpirun.openmpi-1.4.1 --mca btl_base_verbose 50 --mca btl self,openib -n 2 --mca btl_openib_verbose 100 ./IMB-MPI1 -npmin 2 PingPong

The output is exactly the same as before.


Also, are you saying that running the same command line with osu_latency works 
just fine?  That would be really weird...

Yes, if I run:

mpirun.openmpi-1.4.1 --mca btl_base_verbose 50 --mca btl self,openib -n 2 --mca btl_openib_verbose 100 ./osu_lat_ompi-1.4.1

the openib component can be initialized:

----------------------------8<----------------------------------------------

[beo-15:29479] mca: base: components_open: Looking for btl components
[beo-16:29063] mca: base: components_open: Looking for btl components
[beo-15:29479] mca: base: components_open: opening btl components
[beo-15:29479] mca: base: components_open: found loaded component openib
[beo-15:29479] mca: base: components_open: component openib has no register function
[beo-15:29479] mca: base: components_open: component openib open function successful
[beo-15:29479] mca: base: components_open: found loaded component self
[beo-15:29479] mca: base: components_open: component self has no register function
[beo-15:29479] mca: base: components_open: component self open function successful
[beo-16:29063] mca: base: components_open: opening btl components
[beo-16:29063] mca: base: components_open: found loaded component openib
[beo-16:29063] mca: base: components_open: component openib has no register function
[beo-16:29063] mca: base: components_open: component openib open function successful
[beo-16:29063] mca: base: components_open: found loaded component self
[beo-16:29063] mca: base: components_open: component self has no register function
[beo-16:29063] mca: base: components_open: component self open function successful
[beo-15:29479] select: initializing btl component openib
[beo-16:29063] select: initializing btl component openib
[beo-15][[12785,1],0][btl_openib_ini.c:166:ompi_btl_openib_ini_query] Querying INI files for vendor 0x02c9, part ID 25204
[beo-15][[12785,1],0][btl_openib_ini.c:185:ompi_btl_openib_ini_query] Found corresponding INI values: Mellanox Sinai Infinihost III
[beo-15][[12785,1],0][btl_openib_ini.c:166:ompi_btl_openib_ini_query] Querying INI files for vendor 0x0000, part ID 0
[beo-15][[12785,1],0][btl_openib_ini.c:185:ompi_btl_openib_ini_query] Found corresponding INI values: default
[beo-15:29479] openib BTL: oob CPC available for use on mthca0:1
[beo-15:29479] openib BTL: xoob CPC only supported with XRC receive queues; skipped on mthca0:1
[beo-15:29479] openib BTL: rdmacm CPC available for use on mthca0:1
[beo-15:29479] select: init of component openib returned success
[beo-15:29479] select: initializing btl component self
[beo-15:29479] select: init of component self returned success
[beo-16][[12785,1],1][btl_openib_ini.c:166:ompi_btl_openib_ini_query] Querying INI files for vendor 0x02c9, part ID 25204
[beo-16][[12785,1],1][btl_openib_ini.c:185:ompi_btl_openib_ini_query] Found corresponding INI values: Mellanox Sinai Infinihost III
[beo-16][[12785,1],1][btl_openib_ini.c:166:ompi_btl_openib_ini_query] Querying INI files for vendor 0x0000, part ID 0
[beo-16][[12785,1],1][btl_openib_ini.c:185:ompi_btl_openib_ini_query] Found corresponding INI values: default
[beo-16:29063] openib BTL: oob CPC available for use on mthca0:1
[beo-16:29063] openib BTL: xoob CPC only supported with XRC receive queues; skipped on mthca0:1
[beo-16:29063] openib BTL: rdmacm CPC available for use on mthca0:1
[beo-16:29063] select: init of component openib returned success
[beo-16:29063] select: initializing btl component self
[beo-16:29063] select: init of component self returned success
# OSU MPI Latency Test (Version 2.2)
# Size          Latency (us)
[beo-16][[12785,1],1][connect/btl_openib_connect_oob.c:313:qp_connect_all] Set MTU to IBV value 4 (2048 bytes)
[beo-16][[12785,1],1][connect/btl_openib_connect_oob.c:313:qp_connect_all] Set MTU to IBV value 4 (2048 bytes)
[beo-16][[12785,1],1][connect/btl_openib_connect_oob.c:313:qp_connect_all] Set MTU to IBV value 4 (2048 bytes)
[beo-16][[12785,1],1][connect/btl_openib_connect_oob.c:313:qp_connect_all] Set MTU to IBV value 4 (2048 bytes)
[beo-15][[12785,1],0][connect/btl_openib_connect_oob.c:313:qp_connect_all] Set MTU to IBV value 4 (2048 bytes)
[beo-15][[12785,1],0][connect/btl_openib_connect_oob.c:313:qp_connect_all] Set MTU to IBV value 4 (2048 bytes)
[beo-15][[12785,1],0][connect/btl_openib_connect_oob.c:313:qp_connect_all] Set MTU to IBV value 4 (2048 bytes)
[beo-15][[12785,1],0][connect/btl_openib_connect_oob.c:313:qp_connect_all] Set MTU to IBV value 4 (2048 bytes)
0               3.57
1               3.65
2               3.63
4               3.64
8               3.68
16              3.72
32              3.77
64              3.95
128             4.95
256             5.36
512             6.03
1024            7.64
2048            9.95
4096            12.78
8192            18.22
16384           25.48
32768           37.03
65536           60.21
131072          107.90
262144          201.18
524288          389.08
1048576         762.38
2097152         1510.91
4194304         3005.72
[beo-15:29479] mca: base: close: component openib closed
[beo-16:29063] mca: base: close: component openib closed
[beo-16:29063] mca: base: close: unloading component openib
[beo-15:29479] mca: base: close: unloading component openib
[beo-16:29063] mca: base: close: component self closed
[beo-16:29063] mca: base: close: unloading component self
[beo-15:29479] mca: base: close: component self closed
[beo-15:29479] mca: base: close: unloading component self


----------------------------8<----------------------------------------------

Really weird.

  Peter



On May 18, 2010, at 6:18 AM, Peter Kruse wrote:

Hello,

Trying to run the Intel MPI Benchmarks with Open MPI 1.4.1 fails while
initializing the openib component.  The system is Debian GNU/Linux 5.0.4.
The command to start the job (under Torque 2.4.7) was:

mpirun.openmpi-1.4.1 --mca btl_base_verbose 50 --mca btl self,openib -n 2 ./IMB-MPI1 -npmin 2 PingPong

and results in these messages:

----------------------------8<----------------------------------------------

[beo-15:20933] mca: base: components_open: Looking for btl components
[beo-16:20605] mca: base: components_open: Looking for btl components
[beo-15:20933] mca: base: components_open: opening btl components
[beo-15:20933] mca: base: components_open: found loaded component openib
[beo-15:20933] mca: base: components_open: component openib has no register function
[beo-15:20933] mca: base: components_open: component openib open function successful
[beo-15:20933] mca: base: components_open: found loaded component self
[beo-15:20933] mca: base: components_open: component self has no register function
[beo-15:20933] mca: base: components_open: component self open function successful
[beo-16:20605] mca: base: components_open: opening btl components
[beo-16:20605] mca: base: components_open: found loaded component openib
[beo-16:20605] mca: base: components_open: component openib has no register function
[beo-16:20605] mca: base: components_open: component openib open function successful
[beo-16:20605] mca: base: components_open: found loaded component self
[beo-16:20605] mca: base: components_open: component self has no register function
[beo-16:20605] mca: base: components_open: component self open function successful
[beo-15:20933] select: initializing btl component openib
[beo-15:20933] select: init of component openib returned failure
[beo-15:20933] select: module openib unloaded
[beo-15:20933] select: initializing btl component self
[beo-15:20933] select: init of component self returned success
[beo-16:20605] select: initializing btl component openib
[beo-16:20605] select: init of component openib returned failure
[beo-16:20605] select: module openib unloaded
[beo-16:20605] select: initializing btl component self
[beo-16:20605] select: init of component self returned success
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

   Process 1 ([[4887,1],0]) is on host: beo-15
   Process 2 ([[4887,1],1]) is on host: beo-16
   BTLs attempted: self

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

   PML add procs failed
   --> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[beo-15:20933] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
orterun has exited due to process rank 0 with PID 20933 on
node beo-15 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by orterun (as reported here).
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[beo-16:20605] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
[beo-15:20930] 1 more process has sent help message help-mca-bml-r2.txt / unreachable proc
[beo-15:20930] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[beo-15:20930] 1 more process has sent help message help-mpi-runtime / mpi_init:startup:internal-failure

----------------------------8<----------------------------------------------
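
When comparing runs like this, it can help to pull the per-host component init results out of the verbose output instead of reading it by eye. A minimal Python sketch, assuming the log keeps the "select: init of component NAME returned RESULT" form shown above; the function name btl_init_results and the sample text are made up for illustration and are not part of Open MPI:

```python
import re
from collections import defaultdict

# Matches lines such as:
#   [beo-15:20933] select: init of component openib returned failure
SELECT_RE = re.compile(
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+)\] select: init of component "
    r"(?P<component>\S+) returned (?P<result>\w+)"
)


def btl_init_results(log_text):
    """Return {host: {component: result}} parsed from btl_base_verbose output.

    Hypothetical helper for illustration only.
    """
    results = defaultdict(dict)
    for match in SELECT_RE.finditer(log_text):
        results[match["host"]][match["component"]] = match["result"]
    return dict(results)


# Sample lines copied from the failing IMB-MPI1 run.
sample = """\
[beo-15:20933] select: initializing btl component openib
[beo-15:20933] select: init of component openib returned failure
[beo-15:20933] select: init of component self returned success
[beo-16:20605] select: initializing btl component openib
[beo-16:20605] select: init of component openib returned failure
[beo-16:20605] select: init of component self returned success
"""

for host, components in sorted(btl_init_results(sample).items()):
    failed = [name for name, result in components.items() if result != "success"]
    print(host, "failed:", ", ".join(failed) or "none")
```

Feeding it the saved output of both the IMB-MPI1 and the OSU run would show at a glance on which hosts openib failed to initialize. It only summarizes what the log already says; it does not explain why the init failed.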

Running another benchmark (OSU) succeeds in loading the openib component.

"ibstat |grep -i state" on both nodes gives:

----------------------------8<----------------------------------------------
                 State: Active
                 Physical state: LinkUp
----------------------------8<----------------------------------------------

Running with "mpi_abort_delay -1" and attaching strace to the process
is not very helpful; it just loops with:

----------------------------8<----------------------------------------------
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {0x2aee58ff3250, [CHLD], SA_RESTORER|SA_RESTART, 0x2aee59d44f60}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({5, 0}, {5, 0})               = 0
----------------------------8<----------------------------------------------

Does anybody have an idea what is wrong, or how we can get more debugging
information about the initialization of the openib module?

Thanks for any help,

   Peter
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



