[OMPI devel] [LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma, basesmuma, ucx_p2p:basesmsocket, basesmuma, p2p

2022-11-07 Thread mrlong via devel

Running openmpi 5.0.0rc9 produces the following output:

(py3.9) [user@machine01 share]$  mpirun -n 2 python test.py
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error

Why is this message printed?


[OMPI devel] There are not enough slots available in the system to satisfy the 2 slots that were requested by the application

2022-11-07 Thread mrlong via devel

Two machines, each with 64 cores. The contents of the hosts file are:

192.168.180.48 slots=1
192.168.60.203 slots=1

Why does the following error occur when running with openmpi 5.0.0rc9?

(py3.9) [user@machine01 share]$  mpirun -n 2 --machinefile hosts hostname
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 2
slots that were requested by the application:

  hostname

Either request fewer procs for your application, or make more slots
available for use.

A "slot" is the PRRTE term for an allocatable unit where we can
launch a process.  The number of slots available are defined by the
environment in which PRRTE processes are run:

  1. Hostfile, via "slots=N" clauses (N defaults to number of
     processor cores if not provided)
  2. The --host command line parameter, via a ":N" suffix on the
     hostname (N defaults to 1 if not provided)
  3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
  4. If none of a hostfile, the --host command line parameter, or an
     RM is present, PRRTE defaults to the number of processor cores

In all the above cases, if you want PRRTE to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.

Alternatively, you can use the --map-by :OVERSUBSCRIBE option to ignore the
number of available slots when deciding the number of processes to
launch.
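
If the goal is simply to get two ranks running while the underlying problem is investigated, two workarounds follow directly from the help text above (a sketch only, assuming the same two hosts):

# Option 1: advertise more slots per node in the hosts file
192.168.180.48 slots=64
192.168.60.203 slots=64

# Option 2: keep the hosts file as-is and allow oversubscription
mpirun -n 2 --machinefile hosts --map-by :OVERSUBSCRIBE hostname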



Re: [OMPI devel] [LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma, basesmuma, ucx_p2p:basesmsocket, basesmuma, p2p

2022-11-07 Thread Tomislav Janjusic via devel
What is the runtime command?
It’s coming from HCOLL. If HCOLL is not needed, feel free to disable it with: -mca coll ^hcoll

Tomislav Janjusic
Staff Eng., Mellanox, HPC SW
+1 (512) 598-0386
NVIDIA

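
As a concrete sketch of the suggestion above (assuming the same test program as in the original report), HCOLL can be excluded for a single run:

mpirun --mca coll ^hcoll -n 2 python test.py

or, equivalently, via the environment so that every subsequent mpirun picks it up:

export OMPI_MCA_coll="^hcoll"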


Re: [OMPI devel] [LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma, basesmuma, ucx_p2p:basesmsocket, basesmuma, p2p

2022-11-07 Thread Tomislav Janjusic via devel
Ugh - runtime command is literally in the e-mail.

Sorry about that.


--
Tomislav Janjusic
Staff Eng., Mellanox, HPC SW
+1 (512) 598-0386
NVIDIA



[OMPI devel] Open MPI v5.0.0 release timeline delay

2022-11-07 Thread Geoffrey Paulsen via devel
Open MPI developers,

I’ve got some bad news about the Open MPI v5.0.0 release timeframe. IBM has asked Austen and me (and our team) to focus 100% on another project for the next two full weeks.

  Open MPI v5.0.x still has a few remaining blocking items, including documentation, the PRRTE 3.0 release, some collective performance data/marketing/messaging, and a few platform-specific bugs (see: https://github.com/open-mpi/ompi/projects/3).

  For these reasons (along with Supercomputing and the holidays), the Open MPI v5.0 RMs feel that January 2023 is a more realistic timeframe for release.

Thank you for your understanding.

The Open MPI v5.0.x Release Managers:
   - Tomislav Janjusic, nVidia
   - Austen Lauria, IBM
   - Geoff Paulsen, IBM


Re: [OMPI devel] Fwd: --mca btl_base_verbose 30 not working in version 5.0

2022-11-07 Thread Jeff Squyres (jsquyres) via devel
Sorry; I missed that this email came in a week ago.  😕

The "btl_base_verbose" MCA param only works on the BTL components.  The Linux 
"hostname(1)" command is not an MPI application, and therefore does not utilize 
any of the BTL components.  Hence, you can set btl_base_verbose to whatever you 
want, but it'll be ignored by non-MPI applications (but is harmless).

--
Jeff Squyres
jsquy...@cisco.com

From: devel  on behalf of 龙龙 via devel 

Sent: Sunday, October 30, 2022 10:34 AM
To: devel@lists.open-mpi.org 
Cc: 龙龙 
Subject: [OMPI devel] Fwd: --mca btl_base_verbose 30 not working in version 5.0



---------- Forwarded message ---------
From: mrlong <mrlong...@gmail.com>
Date: Sunday, 30 October 2022 at 22:03
Subject: --mca btl_base_verbose 30 not working in version 5.0
To: <us...@lists.open-mpi.org>



mpirun --mca btl self,sm,tcp --mca btl_base_verbose 30 -np 2 --machinefile hostfile hostname

Why does this command not print the "IP addresses are routable" messages in openmpi 5.0.0rc9?
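
Since btl_base_verbose only produces output from the BTL components of a running MPI job, a minimal sketch of a test that actually exercises them is shown below (mpi4py is assumed to be installed; the file name ping.py is just an example). Whether the TCP "IP addresses are routable" messages appear still depends on which PML and BTLs are selected at run time.

# ping.py - force some point-to-point traffic so the BTLs are exercised
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    comm.send("hello", dest=1, tag=0)   # rank 0 sends a small message
elif rank == 1:
    msg = comm.recv(source=0, tag=0)    # rank 1 receives it
    print("rank 1 got:", msg)

mpirun --mca btl self,sm,tcp --mca btl_base_verbose 30 -np 2 --machinefile hostfile python ping.py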



Re: [OMPI devel] [LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma, basesmuma, ucx_p2p:basesmsocket, basesmuma, p2p

2022-11-07 Thread Ben Menadue via devel
Hi,

We see this on our cluster as well; we traced it to the fact that Python loads shared library extensions using RTLD_LOCAL.

The Python module (mpi4py?) has a dependency on libmpi.so, which in turn has a dependency on libhcoll.so. Since the Python module is loaded with RTLD_LOCAL, anything it pulls in with it also ends up being loaded that way. Later, hcoll tries to load its own plugin .so files, but because libhcoll.so was loaded with RTLD_LOCAL, those plugin libraries cannot resolve any symbols there.

It might be fixable by having the hcoll plugins linked against libhcoll.so, but 
since it’s just a pre-built bundle from Mellanox it’s not something I can test 
easily.

Otherwise, the workaround we use is to set LD_PRELOAD=libmpi.so when launching Python, so that libmpi.so gets loaded into the global namespace, as would happen with a “normal” compiled program.

Cheers,
Ben

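
A minimal sketch of the LD_PRELOAD workaround described above (assuming libmpi.so is on the dynamic loader's search path; otherwise give its full path), using -x so that the variable also reaches the python processes started on the remote nodes:

mpirun -n 2 -x LD_PRELOAD=libmpi.so python test.py

A purely illustrative Python-side variant of the same idea is to pull libmpi.so into the global symbol namespace before mpi4py is imported:

# at the very top of test.py, before any "from mpi4py import MPI"
import ctypes
ctypes.CDLL("libmpi.so", mode=ctypes.RTLD_GLOBAL)  # adjust the name/path to the installed library
from mpi4py import MPI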