Hi,

We see this on our cluster as well. We traced it to Python loading its 
shared library extensions using RTLD_LOCAL.

The Python module (mpi4py?) has a dependency on libmpi.so, which in turn has a 
dependency on libhcoll.so. Because the Python module is loaded with 
RTLD_LOCAL, anything that it pulls in with it also ends up being loaded that 
way. Later, hcoll tries to load its own plugin .so files, but since 
libhcoll.so was loaded with RTLD_LOCAL, its symbols are not in the global 
namespace and the plugin libraries can’t resolve them.
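
If you want to check how your interpreter loads extension modules, here is a 
small sketch. It assumes a Unix-like system, where sys.getdlopenflags() and 
os.RTLD_GLOBAL are available; the helper name loads_globally is only for 
illustration.

```python
import os
import sys


def loads_globally(flags):
    """Return True if the given dlopen flags include RTLD_GLOBAL."""
    return bool(flags & os.RTLD_GLOBAL)


if __name__ == "__main__":
    # sys.getdlopenflags() reports the flags the interpreter passes to
    # dlopen() when it imports a compiled extension module.
    print("extension modules loaded with RTLD_GLOBAL:",
          loads_globally(sys.getdlopenflags()))
```

If RTLD_GLOBAL is not set, symbols from libraries pulled in by an extension 
module are not visible to anything that gets dlopen()ed later.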

It might be fixable by having the hcoll plugins linked against libhcoll.so, but 
since it’s just a pre-built bundle from Mellanox it’s not something I can test 
easily.

Otherwise, the solution we use is to set LD_PRELOAD=libmpi.so when launching 
Python, so that libmpi.so gets loaded into the global namespace, as would 
happen with a “normal” compiled program.
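
A possible in-process alternative to LD_PRELOAD is to open libmpi.so with 
RTLD_GLOBAL from Python before importing the MPI module. This is an untested 
sketch, not something we run here: the helper name preload_global is made up, 
and it assumes libmpi.so can be found by the dynamic loader (for example via 
LD_LIBRARY_PATH).

```python
import ctypes


def preload_global(libname):
    """Load libname with RTLD_GLOBAL; return the handle, or None on failure."""
    try:
        return ctypes.CDLL(libname, mode=ctypes.RTLD_GLOBAL)
    except OSError:
        # dlopen() could not find or load the library.
        return None


if __name__ == "__main__":
    if preload_global("libmpi.so") is None:
        print("could not preload libmpi.so; check LD_LIBRARY_PATH")
    # The MPI import must come after the preload, e.g.:
    # from mpi4py import MPI
```

The intent is the same as with LD_PRELOAD: get libmpi.so and its dependencies 
into the global namespace before hcoll tries to load its plugins.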

Cheers,
Ben



> On 8 Nov 2022, at 1:48 am, Tomislav Janjusic via devel 
> <devel@lists.open-mpi.org> wrote:
> 
> Ugh - runtime command is literally in the e-mail.
>  
> Sorry about that.
>  
>  
> --
> Tomislav Janjusic
> Staff Eng., Mellanox, HPC SW
> +1 (512) 598-0386
> NVIDIA <http://www.nvidia.com/>
>  
> From: Tomislav Janjusic 
> Sent: Monday, November 7, 2022 8:48 AM
> To: 'Open MPI Developers' <devel@lists.open-mpi.org>; Open MPI Users 
> <us...@lists.open-mpi.org>
> Cc: mrlong <mrlong...@gmail.com>
> Subject: RE: [OMPI devel] [LOG_CAT_ML] component basesmuma is not available 
> but requested in hierarchy: basesmuma, basesmuma, ucx_p2p:basesmsocket, 
> basesmuma, p2p
>  
> What is the runtime command?
> It’s coming from HCOLL. If HCOLL is not needed feel free to disable it -mca 
> coll ^hcoll
>  
> Tomislav Janjusic
> Staff Eng., Mellanox, HPC SW
> +1 (512) 598-0386
> NVIDIA <http://www.nvidia.com/>
>  
> From: devel <devel-boun...@lists.open-mpi.org 
> <mailto:devel-boun...@lists.open-mpi.org>> On Behalf Of mrlong via devel
> Sent: Monday, November 7, 2022 2:33 AM
> To: devel@lists.open-mpi.org <mailto:devel@lists.open-mpi.org>; Open MPI 
> Users <us...@lists.open-mpi.org <mailto:us...@lists.open-mpi.org>>
> Cc: mrlong <mrlong...@gmail.com <mailto:mrlong...@gmail.com>>
> Subject: [OMPI devel] [LOG_CAT_ML] component basesmuma is not available but 
> requested in hierarchy: basesmuma, basesmuma, ucx_p2p:basesmsocket, 
> basesmuma, p2p
>  
> External email: Use caution opening links or attachments
>  
> The execution of openmpi 5.0.0rc9 results in the following:
> 
> (py3.9) [user@machine01 share]$  mpirun -n 2 python test.py
> [LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: 
> basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
> [LOG_CAT_ML] ml_discover_hierarchy exited with error
> [LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: 
> basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
> [LOG_CAT_ML] ml_discover_hierarchy exited with error
> 
> Why is this message printed?
> 
