Dear OMPI community,

I'm a simple end user with no particular expertise. I compile quantum chemistry programs and run them in parallel.
My system is a four-socket Opteron 6168 machine (12 cores per socket, 48 cores total) with 64 GB of RAM. It has 8 NUMA nodes:

openmpi $ hwloc-info
depth 0:           1 Machine (type #0)
depth 1:           4 Package (type #1)
depth 2:           8 L3Cache (type #6)
depth 3:          48 L2Cache (type #5)
depth 4:          48 L1dCache (type #4)
depth 5:          48 L1iCache (type #9)
depth 6:          48 Core (type #2)
depth 7:          48 PU (type #3)
Special depth -3:  8 NUMANode (type #13)
Special depth -4:  3 Bridge (type #14)
Special depth -5:  5 PCIDev (type #15)
Special depth -6:  5 OSDev (type #16)

lstopo:

openmpi $ lstopo
Machine (63GB total)
  Package L#0
    L3 L#0 (5118KB)
      NUMANode L#0 (P#0 7971MB)
      L2 L#0 (512KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 + PU L#0 (P#0)
      L2 L#1 (512KB) + L1d L#1 (64KB) + L1i L#1 (64KB) + Core L#1 + PU L#1 (P#1)
      L2 L#2 (512KB) + L1d L#2 (64KB) + L1i L#2 (64KB) + Core L#2 + PU L#2 (P#2)
      L2 L#3 (512KB) + L1d L#3 (64KB) + L1i L#3 (64KB) + Core L#3 + PU L#3 (P#3)
      L2 L#4 (512KB) + L1d L#4 (64KB) + L1i L#4 (64KB) + Core L#4 + PU L#4 (P#4)
      L2 L#5 (512KB) + L1d L#5 (64KB) + L1i L#5 (64KB) + Core L#5 + PU L#5 (P#5)
    HostBridge
      PCIBridge
        PCI 02:00.0 (Ethernet)
          Net "enp2s0f0"
        PCI 02:00.1 (Ethernet)
          Net "enp2s0f1"
      PCI 00:11.0 (RAID)
        Block(Disk) "sdb"
        Block(Disk) "sdc"
        Block(Disk) "sda"
      PCI 00:14.1 (IDE)
      PCIBridge
        PCI 01:04.0 (VGA)
    L3 L#1 (5118KB)
      NUMANode L#1 (P#1 8063MB)
      L2 L#6 (512KB) + L1d L#6 (64KB) + L1i L#6 (64KB) + Core L#6 + PU L#6 (P#6)
      L2 L#7 (512KB) + L1d L#7 (64KB) + L1i L#7 (64KB) + Core L#7 + PU L#7 (P#7)
      L2 L#8 (512KB) + L1d L#8 (64KB) + L1i L#8 (64KB) + Core L#8 + PU L#8 (P#8)
      L2 L#9 (512KB) + L1d L#9 (64KB) + L1i L#9 (64KB) + Core L#9 + PU L#9 (P#9)
      L2 L#10 (512KB) + L1d L#10 (64KB) + L1i L#10 (64KB) + Core L#10 + PU L#10 (P#10)
      L2 L#11 (512KB) + L1d L#11 (64KB) + L1i L#11 (64KB) + Core L#11 + PU L#11 (P#11)
  Package L#1
    L3 L#2 (5118KB)
      NUMANode L#2 (P#2 8063MB)
      L2 L#12 (512KB) + L1d L#12 (64KB) + L1i L#12 (64KB) + Core L#12 + PU L#12 (P#12)
      L2 L#13 (512KB) + L1d L#13 (64KB) + L1i L#13 (64KB) + Core L#13 + PU L#13 (P#13)
      L2 L#14 (512KB) + L1d L#14 (64KB) + L1i L#14 (64KB) + Core L#14 + PU L#14 (P#14)
      L2 L#15 (512KB) + L1d L#15 (64KB) + L1i L#15 (64KB) + Core L#15 + PU L#15 (P#15)
      L2 L#16 (512KB) + L1d L#16 (64KB) + L1i L#16 (64KB) + Core L#16 + PU L#16 (P#16)
      L2 L#17 (512KB) + L1d L#17 (64KB) + L1i L#17 (64KB) + Core L#17 + PU L#17 (P#17)
    L3 L#3 (5118KB)
      NUMANode L#3 (P#3 8063MB)
      L2 L#18 (512KB) + L1d L#18 (64KB) + L1i L#18 (64KB) + Core L#18 + PU L#18 (P#18)
      L2 L#19 (512KB) + L1d L#19 (64KB) + L1i L#19 (64KB) + Core L#19 + PU L#19 (P#19)
      L2 L#20 (512KB) + L1d L#20 (64KB) + L1i L#20 (64KB) + Core L#20 + PU L#20 (P#20)
      L2 L#21 (512KB) + L1d L#21 (64KB) + L1i L#21 (64KB) + Core L#21 + PU L#21 (P#21)
      L2 L#22 (512KB) + L1d L#22 (64KB) + L1i L#22 (64KB) + Core L#22 + PU L#22 (P#22)
      L2 L#23 (512KB) + L1d L#23 (64KB) + L1i L#23 (64KB) + Core L#23 + PU L#23 (P#23)
  Package L#2
    L3 L#4 (5118KB)
      NUMANode L#4 (P#4 8063MB)
      L2 L#24 (512KB) + L1d L#24 (64KB) + L1i L#24 (64KB) + Core L#24 + PU L#24 (P#24)
      L2 L#25 (512KB) + L1d L#25 (64KB) + L1i L#25 (64KB) + Core L#25 + PU L#25 (P#25)
      L2 L#26 (512KB) + L1d L#26 (64KB) + L1i L#26 (64KB) + Core L#26 + PU L#26 (P#26)
      L2 L#27 (512KB) + L1d L#27 (64KB) + L1i L#27 (64KB) + Core L#27 + PU L#27 (P#27)
      L2 L#28 (512KB) + L1d L#28 (64KB) + L1i L#28 (64KB) + Core L#28 + PU L#28 (P#28)
      L2 L#29 (512KB) + L1d L#29 (64KB) + L1i L#29 (64KB) + Core L#29 + PU L#29 (P#29)
    L3 L#5 (5118KB)
      NUMANode L#5 (P#5 8063MB)
      L2 L#30 (512KB) + L1d L#30 (64KB) + L1i L#30 (64KB) + Core L#30 + PU L#30 (P#30)
      L2 L#31 (512KB) + L1d L#31 (64KB) + L1i L#31 (64KB) + Core L#31 + PU L#31 (P#31)
      L2 L#32 (512KB) + L1d L#32 (64KB) + L1i L#32 (64KB) + Core L#32 + PU L#32 (P#32)
      L2 L#33 (512KB) + L1d L#33 (64KB) + L1i L#33 (64KB) + Core L#33 + PU L#33 (P#33)
      L2 L#34 (512KB) + L1d L#34 (64KB) + L1i L#34 (64KB) + Core L#34 + PU L#34 (P#34)
      L2 L#35 (512KB) + L1d L#35 (64KB) + L1i L#35 (64KB) + Core L#35 + PU L#35 (P#35)
  Package L#3
    L3 L#6 (5118KB)
      NUMANode L#6 (P#6 8063MB)
      L2 L#36 (512KB) + L1d L#36 (64KB) + L1i L#36 (64KB) + Core L#36 + PU L#36 (P#36)
      L2 L#37 (512KB) + L1d L#37 (64KB) + L1i L#37 (64KB) + Core L#37 + PU L#37 (P#37)
      L2 L#38 (512KB) + L1d L#38 (64KB) + L1i L#38 (64KB) + Core L#38 + PU L#38 (P#38)
      L2 L#39 (512KB) + L1d L#39 (64KB) + L1i L#39 (64KB) + Core L#39 + PU L#39 (P#39)
      L2 L#40 (512KB) + L1d L#40 (64KB) + L1i L#40 (64KB) + Core L#40 + PU L#40 (P#40)
      L2 L#41 (512KB) + L1d L#41 (64KB) + L1i L#41 (64KB) + Core L#41 + PU L#41 (P#41)
    L3 L#7 (5118KB)
      NUMANode L#7 (P#7 8062MB)
      L2 L#42 (512KB) + L1d L#42 (64KB) + L1i L#42 (64KB) + Core L#42 + PU L#42 (P#42)
      L2 L#43 (512KB) + L1d L#43 (64KB) + L1i L#43 (64KB) + Core L#43 + PU L#43 (P#43)
      L2 L#44 (512KB) + L1d L#44 (64KB) + L1i L#44 (64KB) + Core L#44 + PU L#44 (P#44)
      L2 L#45 (512KB) + L1d L#45 (64KB) + L1i L#45 (64KB) + Core L#45 + PU L#45 (P#45)
      L2 L#46 (512KB) + L1d L#46 (64KB) + L1i L#46 (64KB) + Core L#46 + PU L#46 (P#46)
      L2 L#47 (512KB) + L1d L#47 (64KB) + L1i L#47 (64KB) + Core L#47 + PU L#47 (P#47)

openmpi $ numactl -H
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5
node 0 size: 7971 MB
node 0 free: 6858 MB
node 1 cpus: 6 7 8 9 10 11
node 1 size: 8062 MB
node 1 free: 6860 MB
node 2 cpus: 12 13 14 15 16 17
node 2 size: 8062 MB
node 2 free: 6979 MB
node 3 cpus: 18 19 20 21 22 23
node 3 size: 8062 MB
node 3 free: 7132 MB
node 4 cpus: 24 25 26 27 28 29
node 4 size: 8062 MB
node 4 free: 6276 MB
node 5 cpus: 30 31 32 33 34 35
node 5 size: 8062 MB
node 5 free: 7190 MB
node 6 cpus: 36 37 38 39 40 41
node 6 size: 8062 MB
node 6 free: 7059 MB
node 7 cpus: 42 43 44 45 46 47
node 7 size: 8061 MB
node 7 free: 7075 MB
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  16  16  22  16  22  16  22
  1:  16  10  22  16  22  16  22  16
  2:  16  22  10  16  16  22  16  22
  3:  22  16  16  10  22  16  22  16
  4:  16  22  16  22  10  16  16  22
  5:  22  16  22  16  16  10  22  16
  6:  16  22  16  22  16  22  10  16
  7:  22  16  22  16  22  16  16  10

I compiled openmpi 4.0.4, but some bugs probably alter the behavior of mpirun, and therefore I'm asking you
suggestions on how to properly run parallel code on my system. I'm using Gentoo Linux.

Questions:

1) If I recompile openmpi with a different version, should I also recompile all my openmpi programs? I'm a little confused, since I tried to downgrade to 4.0.1 but new random errors popped up.

2) I have many jobs to run, and each of them can use from 1 to N CPUs (all MPI: for now I tend to avoid OpenMP). Ideally I would like to run the same simple mpirun command for each job, so that mpirun distributes jobs and processes automatically, taking into account the NUMA architecture (8 NUMA nodes, 6 CPUs per node). I tried --map-by numa in 4.0.1 (--map-by numa gives errors in 4.0.4), but many processes ran at only 10-30% CPU. Then I switched back to 4.0.4 (the version I used to compile my programs), and the only effective option I found was "mpirun --bind-to none". This way all CPUs run at 100%, but I lose efficiency due to NUMA. What is the correct way (if one exists) to bind (almost) all the processes of the same job to a single NUMA node? I'm not looking for a perfect solution, but I would at least like the first 8 parallel jobs distributed across the 8 NUMA nodes.

I hope I made myself clear! Thanks for your patience, and sorry for the long letter,

Carlo

--
------------------------------------------------------------
Prof. Carlo Nervi          carlo.ne...@unito.it
Tel: +39 0116707507/8   Fax: +39 0116707855
Dipartimento di Chimica, via P. Giuria 7, 10125 Torino, Italy.
http://lem.ch.unito.it/
ICCC 2020, 5-10 July 2020, Rimini, Italy: http://www.iccc2020.com
International Conference on Coordination Chemistry (ICCC 2020)
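P.S. For completeness, the invocations I experimented with looked roughly like the sketch below. The program name and process count are placeholders, not my actual job script:

```shell
# Under Open MPI 4.0.1: mapping by NUMA node (this option errors out under my
# 4.0.4 build, and here many processes sat at 10-30% CPU).
# "./myprog" and "-np 6" are illustrative placeholders.
mpirun -np 6 --map-by numa ./myprog

# Under Open MPI 4.0.4: the only variant that kept all CPUs at 100%,
# at the cost of losing NUMA locality.
mpirun -np 6 --bind-to none ./myprog
```

If there is a recommended combination of --map-by / --bind-to options for a machine like this, I would be glad to know it.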