[OMPI devel] GitHub v4.0.2 tag is broken

2020-04-01 Thread Ben Menadue via devel
Hi,

The v4.0.2 tag in GitHub is broken at the moment -- trying to go to it
just takes you to the v4.0.2 _branch_, which looks to be a separate,
much more recent fork from master:

https://github.com/open-mpi/ompi/tree/v4.0.2

Cheers,
Ben




[OMPI devel] v5.0 equivalent of --map-by numa

2021-11-10 Thread Ben Menadue via devel
Hi,

Quick question: what's the equivalent of "--map-by numa" for the new
PRRTE-based runtime for v5.0? I can see "package" and "l3cache" in the
help, which are close, but don't quite match "numa" for our system.

In more detail...

We have dual-socket CLX- and SKL-based nodes with sub-NUMA clustering
enabled. This shows up in the OS as two packages, each with 1 L3 cache
domain and 2 NUMA domains. Even worse, each compute node effectively
has its own unique mapping of each socket's cores across the NUMA domains.

A common way of running for our users is with 1 MPI process per NUMA
domain and then some form of threading within the cores associated with
that domain. This effectively gives each MPI process its own memory
controller and DIMMs.
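
In practice that looks something like the sketch below (Open MPI 4.x syntax; the binary name and thread count are placeholders -- 12 matches the cores per NUMA domain on these nodes):

# One rank per NUMA domain, each bound to the 12 cores of that domain,
# with the threaded part of the code filling those cores.
export OMP_NUM_THREADS=12
mpirun --map-by numa:PE=12 --report-bindings ./app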

Using "--map-by numa" worked really well for this, since it took care
of the unique core numbering of each node. The only way I can think of
to set up something equivalent without that would be manually
enumerating the nodes in each job and building a rank file.
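
For illustration, such a rank file might look like the sketch below (the hostname and slot lists are hypothetical, cover a single node, and assume the logical core ordering follows the Group0/NUMA nesting that lstopo reports; a real file would need a block like this for every node in the job):

# Hypothetical rank file: 4 ranks on one node, one per NUMA domain,
# using socket:core-range slot syntax.
cat > myrankfile <<'EOF'
rank 0=gadi-cpu-clx-0143 slot=0:0-11
rank 1=gadi-cpu-clx-0143 slot=0:12-23
rank 2=gadi-cpu-clx-0143 slot=1:0-11
rank 3=gadi-cpu-clx-0143 slot=1:12-23
EOF
mpirun --rankfile myrankfile ./app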

I've included an example topology below.

Or do you think this is better as a GitHub issue?

Thanks,
Ben

[bjm900@gadi-cpu-clx-0143 build]$ lstopo
Machine (189GB total)
  Package L#0 + L3 L#0 (36MB)
    Group0 L#0
      NUMANode L#0 (P#0 47GB)
      L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
      L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
      L2 L#2 (1024KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
      L2 L#3 (1024KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
      L2 L#4 (1024KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#7)
      L2 L#5 (1024KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#8)
      L2 L#6 (1024KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#9)
      L2 L#7 (1024KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#13)
      L2 L#8 (1024KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#14)
      L2 L#9 (1024KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#15)
      L2 L#10 (1024KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#19)
      L2 L#11 (1024KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#20)
      HostBridge
        PCI 00:11.5 (SATA)
        PCI 00:17.0 (SATA)
          Block(Disk) "sda"
        PCIBridge
          PCIBridge
            PCI 02:00.0 (VGA)
      HostBridge
        PCIBridge
          PCIBridge
            PCIBridge
              PCI 08:00.2 (Ethernet)
                Net "eno1"
    Group0 L#1
      NUMANode L#1 (P#1 47GB)
      L2 L#12 (1024KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#4)
      L2 L#13 (1024KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#5)
      L2 L#14 (1024KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#6)
      L2 L#15 (1024KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#10)
      L2 L#16 (1024KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16 + PU L#16 (P#11)
      L2 L#17 (1024KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17 + PU L#17 (P#12)
      L2 L#18 (1024KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18 + PU L#18 (P#16)
      L2 L#19 (1024KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19 + PU L#19 (P#17)
      L2 L#20 (1024KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20 + PU L#20 (P#18)
      L2 L#21 (1024KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21 + PU L#21 (P#21)
      L2 L#22 (1024KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22 + PU L#22 (P#22)
      L2 L#23 (1024KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23 + PU L#23 (P#23)
      HostBridge
        PCIBridge
          PCI 58:00.0 (InfiniBand)
            Net "ib0"
            OpenFabrics "mlx5_0"
  Package L#1 + L3 L#1 (36MB)
    Group0 L#2
      NUMANode L#2 (P#2 47GB)
      L2 L#24 (1024KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24 + PU L#24 (P#24)
      L2 L#25 (1024KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25 + PU L#25 (P#25)
      L2 L#26 (1024KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26 + PU L#26 (P#26)
      L2 L#27 (1024KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27 + PU L#27 (P#27)
      L2 L#28 (1024KB) + L1d L#28 (32KB) + L1i L#28 (32KB) + Core L#28 + PU L#28 (P#30)
      L2 L#29 (1024KB) + L1d L#29 (32KB) + L1i L#29 (32KB) + Core L#29 + PU L#29 (P#31)
      L2 L#30 (1024KB) + L1d L#30 (32KB) + L1i L#30 (32KB) + Core L#30 + PU L#30 (P#35)
      L2 L#31 (1024KB) + L1d L#31 (32KB) + L1i L#31 (32KB) + Core L#31 + PU L#31 (P#36)
      L2 L#32 (1024KB) + L1d L#32 (32KB) + L1i L#32 (32KB) + Core L#32 + PU L#32 (P#37)
      L2 L#33 (1024KB) + L1d L#33 (32KB) + L1i L#33 (32KB) + Core L#33 + PU L#33 (P#42)
      L2 L#34 (1024KB) + L1d L#34 (32KB) + L1i L#34 (32KB) + Core L#34 + PU L#34 (P#43)
      L2 L#35 (1024KB) + L1d L#35 (32KB) + L1i L#35 (32KB) + Core L#35 + PU L#35 (P#44)
    Group0 L#3
      NUMANode L#3 (P#3 47GB)
      L2 L#36 (1024KB) + L1d

Re: [OMPI devel] v5.0 equivalent of --map-by numa

2021-11-11 Thread Ben Menadue via devel
Hi Brice,

Great, thanks for that! I think I need to brush up on my searching skills --
not sure why I didn't find that issue. Sorry for the noise.

Cheers,
Ben

On 11 Nov. 2021 19:05, Brice Goglin wrote:

Hello Ben

It will be back, at least for the majority of platforms (those without
heterogeneous memory).

See https://github.com/open-mpi/ompi/issues/8170 and
https://github.com/openpmix/prrte/pull/1141

Brice

On 11/11/2021 at 05:33, Ben Menadue via devel wrote:

> Hi,
>
> Quick question: what's the equivalent of "--map-by numa" for the new
> PRRTE-based runtime for v5.0? I can see "package" and "l3cache" in the
> help, which are close, but don't quite match "numa" for our system.

Re: [OMPI devel] [LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma, basesmuma, ucx_p2p:basesmsocket, basesmuma, p2p

2022-11-07 Thread Ben Menadue via devel
Hi,

We see this on our cluster as well -- we traced it back to Python loading
shared library extensions with RTLD_LOCAL.

The Python module (mpi4py?) has a dependency on libmpi.so, which in turn has a
dependency on libhcoll.so. Since the Python module is loaded with RTLD_LOCAL,
anything it pulls in with it also ends up being loaded that way. Later, hcoll
tries to load its own plugin .so files, but because libhcoll.so was loaded with
RTLD_LOCAL those plugins can't resolve any of its symbols.

It might be fixable by linking the hcoll plugins against libhcoll.so
explicitly, but since it's just a pre-built bundle from Mellanox that's not
something I can test easily.
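
(A quick way to check -- the install path below is a guess, adjust it for your HPC-X/hcoll bundle -- is to see whether the plugin .so files list libhcoll.so as a dependency at all; if they don't, they can only resolve its symbols from the global namespace, which RTLD_LOCAL prevents.)

# Path is hypothetical; point it at wherever your hcoll plugins live.
ldd /opt/mellanox/hcoll/lib/hcoll/*.so | grep libhcoll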

Otherwise, the workaround we use is to just set LD_PRELOAD=libmpi.so when
launching Python, so that it gets loaded into the global namespace as would
happen with a “normal” compiled program.
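
Concretely, that's something like the following (the library path is a placeholder -- point it at the libmpi.so from whichever Open MPI install you're running):

# Preload libmpi into the global namespace; -x forwards the variable
# to the launched ranks.
export LD_PRELOAD=/path/to/openmpi/lib/libmpi.so
mpirun -n 2 -x LD_PRELOAD python test.py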

Cheers,
Ben



> On 8 Nov 2022, at 1:48 am, Tomislav Janjusic via devel wrote:
> 
> Ugh - runtime command is literally in the e-mail.
>  
> Sorry about that.
>  
>  
> --
> Tomislav Janjusic
> Staff Eng., Mellanox, HPC SW
> +1 (512) 598-0386
> NVIDIA 
>  
> From: Tomislav Janjusic 
> Sent: Monday, November 7, 2022 8:48 AM
> To: 'Open MPI Developers'; Open MPI Users
> Cc: mrlong 
> Subject: RE: [OMPI devel] [LOG_CAT_ML] component basesmuma is not available 
> but requested in hierarchy: basesmuma, basesmuma, ucx_p2p:basesmsocket, 
> basesmuma, p2p
>  
> What is the runtime command?
> It’s coming from HCOLL. If HCOLL is not needed, feel free to disable it with
> -mca coll ^hcoll
>  
> Tomislav Janjusic
> Staff Eng., Mellanox, HPC SW
> +1 (512) 598-0386
> NVIDIA 
>  
> From: devel On Behalf Of mrlong via devel
> Sent: Monday, November 7, 2022 2:33 AM
> To: devel@lists.open-mpi.org; Open MPI Users
> Cc: mrlong
> Subject: [OMPI devel] [LOG_CAT_ML] component basesmuma is not available but 
> requested in hierarchy: basesmuma, basesmuma, ucx_p2p:basesmsocket, 
> basesmuma, p2p
>  
> The execution of openmpi 5.0.0rc9 results in the following:
> 
> (py3.9) [user@machine01 share]$  mpirun -n 2 python test.py
> [LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: 
> basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
> [LOG_CAT_ML] ml_discover_hierarchy exited with error
> [LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: 
> basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
> [LOG_CAT_ML] ml_discover_hierarchy exited with error
> 
> Why is this message printed?
>