I've been happily using Open MPI 4.1.4 for a while, but I've run into a weird
new problem. I mainly use it with UCX, typically running with the mpirun flags

    --bind-to core --report-bindings --mca pml ucx --mca osc ucx --mca btl ^vader,tcp,openib

and with our compiled Fortran codes it seems to work fine. When I turn on
"--mca pml_base_verbose 10" the output looks something like:
[compute-9-6:290747] mca: base: components_register: registering framework pml components
[compute-9-6:290747] mca: base: components_register: found loaded component ucx
[compute-9-6:290747] mca: base: components_register: component ucx register function successful
[compute-9-6:290747] mca: base: components_open: opening pml components
[compute-9-6:290747] mca: base: components_open: found loaded component ucx
[compute-9-6:290747] mca: base: components_open: component ucx open function successful
[compute-9-6:290747] select: initializing pml component ucx
[compute-9-6:290747] select: init returned priority 51
[compute-9-6:290747] selected ucx best priority 51
[compute-9-6:290747] select: component ucx selected
[compute-9-6:290747] mca: base: close: component ucx closed
[compute-9-6:290747] mca: base: close: unloading component ucx

The problem comes when I try to use MPI via Python + mpi4py. It fails,
giving these messages:
[compute-9-6:290971] mca: base: components_register: registering framework pml components
[compute-9-6:290971] mca: base: components_register: found loaded component ucx
[compute-9-6:290971] mca: base: components_register: component ucx register function successful
[compute-9-6:290971] mca: base: components_open: opening pml components
[compute-9-6:290971] mca: base: components_open: found loaded component ucx
[compute-9-6:290971] mca: base: components_open: component ucx open function successful
[compute-9-6:290971] select: initializing pml component ucx
[compute-9-6:290971] select: init returned failure for component ucx
[compute-9-6:290971] PML ucx cannot be selected
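
For reference, the failure is at MPI_Init time, i.e. when mpi4py.MPI is imported, so
something as minimal as the sketch below (script name hypothetical, not our real code)
should be enough to reproduce it when launched with the same mpirun flags as above:

    # repro.py -- minimal mpi4py test (hypothetical name)
    # launched as, e.g.:
    #   mpirun --bind-to core --report-bindings --mca pml ucx --mca osc ucx \
    #          --mca btl ^vader,tcp,openib --mca pml_base_verbose 10 python repro.py
    from mpi4py import MPI   # MPI_Init runs on import; the PML selection happens here

    comm = MPI.COMM_WORLD
    print("rank %d of %d on %s" % (comm.Get_rank(), comm.Get_size(),
                                   MPI.Get_processor_name()))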

When I run ldd on the Fortran executable and on the shared object in mpi4py
that pulls in the MPI library, they both claim they will use the same MPI
library. (They point at different actual .so files, because the Fortran
executable was built with the Intel compiler while the mpi4py extension uses
a library compiled with g++, but the two .so files were built identically
apart from the compiler.)
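
To double-check that from the running process, here is a rough sketch (Linux-specific,
reading /proc/self/maps, with mpi4py's automatic MPI_Init disabled so the failing
selection isn't hit; the script name is just a placeholder) of how to see which libmpi
and UCX libraries the Python process actually maps, rather than what ldd predicts:

    # which_libmpi.py -- list the MPI/UCX shared objects actually mapped by this process
    import mpi4py
    mpi4py.rc.initialize = False      # don't call MPI_Init on import; just load the libraries
    from mpi4py import MPI            # importing pulls in libmpi and its UCX dependencies

    with open("/proc/self/maps") as f:
        libs = sorted({line.split()[-1] for line in f
                       if "libmpi" in line or "libuc" in line})
    for path in libs:
        print(path)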

Does anyone have any idea how I can figure out why the mpi4py run is failing?
I tried pml_base_verbose values up to 100 and it makes no difference to the
level of detail. Is there some other way to get more information about what
exactly is failing?
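
Along the same lines, a last sketch (again with MPI_Init skipped, and the file name
is just a placeholder) that at least confirms which MPI mpi4py was built against and
which library the interpreter picks up at runtime:

    # mpi_info.py -- report mpi4py's build configuration and the runtime MPI library version
    import mpi4py
    print(mpi4py.get_config())        # mpicc etc. recorded when the mpi4py extension was built

    mpi4py.rc.initialize = False      # skip MPI_Init; MPI_Get_library_version may be called before init
    from mpi4py import MPI
    print(MPI.Get_library_version())  # should report something like "Open MPI v4.1.4, ..."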

thanks,
Noam