I've been happily using OpenMPI 4.1.4 for a while, but I've run into a weird new problem. I mainly use it with ucx, typically running with the mpirun flags --bind-to core --report-bindings --mca pml ucx --mca osc ucx --mca btl ^vader,tcp,openib, and with our compiled Fortran codes it seems to work fine. When I turn on "--mca pml_base_verbose 10" the output looks something like:

[compute-9-6:290747] mca: base: components_register: registering framework pml components
[compute-9-6:290747] mca: base: components_register: found loaded component ucx
[compute-9-6:290747] mca: base: components_register: component ucx register function successful
[compute-9-6:290747] mca: base: components_open: opening pml components
[compute-9-6:290747] mca: base: components_open: found loaded component ucx
[compute-9-6:290747] mca: base: components_open: component ucx open function successful
[compute-9-6:290747] select: initializing pml component ucx
[compute-9-6:290747] select: init returned priority 51
[compute-9-6:290747] selected ucx best priority 51
[compute-9-6:290747] select: component ucx selected
[compute-9-6:290747] mca: base: close: component ucx closed
[compute-9-6:290747] mca: base: close: unloading component ucx
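For reference, a full invocation on the Fortran side looks roughly like this (the process count and executable name are just placeholders):

    mpirun -np 16 --bind-to core --report-bindings \
        --mca pml ucx --mca osc ucx --mca btl ^vader,tcp,openib \
        --mca pml_base_verbose 10 \
        ./fortran_executable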
The problem is when I try to use MPI via python + mpi4py. It fails to work, giving the messages:

[compute-9-6:290971] mca: base: components_register: registering framework pml components
[compute-9-6:290971] mca: base: components_register: found loaded component ucx
[compute-9-6:290971] mca: base: components_register: component ucx register function successful
[compute-9-6:290971] mca: base: components_open: opening pml components
[compute-9-6:290971] mca: base: components_open: found loaded component ucx
[compute-9-6:290971] mca: base: components_open: component ucx open function successful
[compute-9-6:290971] select: initializing pml component ucx
[compute-9-6:290971] select: init returned failure for component ucx
[compute-9-6:290971] PML ucx cannot be selected

When I run ldd on the Fortran executable and on the shared object in mpi4py that pulls in the MPI library, they both claim they will use the same MPI library (two different actual .so files, because the Fortran executable was built with the Intel compiler and mpi4py is using a library that was compiled with g++, but the two .so files were built identically except for the compiler).

Does anyone have any idea how I can figure out why the mpi4py run is failing? I tried pml_base_verbose up to 100 and it makes no difference to the level of detail. Is there some other way to get more information as to what exactly is failing?

thanks,
Noam
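P.S. The only other knobs I've thought of so far are UCX's own logging and (if I'm remembering the parameter name right) the UCX PML's verbose parameter, i.e. something along these lines, with a trivial mpi4py test in place of the Fortran executable:

    # same flags as the Fortran run; pml_ucx_verbose and UCX_LOG_LEVEL are
    # from memory, so please correct me if those aren't the right names
    mpirun -np 2 --bind-to core --report-bindings \
        --mca pml ucx --mca osc ucx --mca btl ^vader,tcp,openib \
        --mca pml_base_verbose 10 --mca pml_ucx_verbose 100 \
        -x UCX_LOG_LEVEL=debug \
        python3 -c "from mpi4py import MPI; print(MPI.COMM_WORLD.Get_rank())"

but I'd still welcome any other ideas about how to see why the ucx pml init fails for the python run and not the Fortran one.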