Your use-case sounds more like a workflow than an application - in which case you should probably be using PRRTE to execute it instead of "mpirun", as PRRTE will "remember" the multiple jobs and avoid the overload scenario you describe.
This link will walk you through how to get and build it:
https://openpmix.github.io/code/getting-the-pmix-reference-server

This link provides some directions on how to use it:
https://openpmix.github.io/support/how-to/running-apps-under-psrvr

You would need to update your script to use "prun --personality ompi" instead of "mpirun", but things should otherwise be the same.

On Aug 20, 2020, at 9:01 AM, Carlo Nervi via users <users@lists.open-mpi.org> wrote:

Thank you, Christoph. I did not consider --cpu-list. However, that works if a single script launches several jobs (note that each job may use a different number of CPUs). In my case the same script (which calls mpirun) is invoked many times: it is called periodically until all 48 CPUs are busy, and again whenever other jobs finish. If the script is always the same, I'm afraid that all the jobs after the first will be bound to the same cpu list. Am I wrong? Will

mpirun -n xx --cpu-list "$(seq -s, 0 47)" --bind-to cpu-list:ordered $app

(with xx < 48) bind all the processes to the first xx CPUs? I need a way to manage the cpu list dynamically, but I was hoping that mpirun could do that. If not, I'm afraid the simplest solution is to use --bind-to none.

Thank you,
Carlo

On Thu, 20 Aug 2020 at 17:24, Christoph Niethammer <nietham...@hlrs.de> wrote:

Hello Carlo,

if you execute multiple mpirun commands they will not know about each other's resource bindings. E.g. if you bind to cores, each mpirun will start assigning from the same first core again. This then results in oversubscription of the cores, which slows down your programs - as you noticed.
You can use "--cpu-list" together with "--bind-to cpu-list:ordered". So if you start all your simulations in a single script, this would look like

mpirun -n 6 --cpu-list "$(seq -s, 0 5)" --bind-to cpu-list:ordered $app
mpirun -n 6 --cpu-list "$(seq -s, 6 11)" --bind-to cpu-list:ordered $app
...
mpirun -n 6 --cpu-list "$(seq -s, 42 47)" --bind-to cpu-list:ordered $app

Best
Christoph

----- Original Message -----
From: "Open MPI Users" <users@lists.open-mpi.org>
To: "Open MPI Users" <users@lists.open-mpi.org>
Cc: "Carlo Nervi" <carlo.ne...@unito.it>
Sent: Thursday, 20 August, 2020 12:17:21
Subject: [OMPI users] OMPI 4.0.4 how to use mpirun properly in numa architecture

Dear OMPI community,
I'm a simple end-user with no particular experience. I compile quantum chemistry programs and use them in parallel. My system is a 4-socket, 12-cores-per-socket Opteron 6168 machine, for a total of 48 cores and 64 GB of RAM.
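As a side note on the --cpu-list approach discussed in this thread: the per-job core lists do not have to be written out by hand; they can be generated programmatically. A minimal sketch (the helper name is illustrative, not part of Open MPI):

```python
# Sketch: build the disjoint "--cpu-list" strings used by the per-job
# mpirun commands in this thread. On this machine each 6-core block
# coincides with one NUMA node. `cpu_lists` is a hypothetical helper.
def cpu_lists(total_cores=48, cores_per_job=6):
    """Return one comma-separated core list per job slot."""
    return [
        ",".join(str(c) for c in range(start, start + cores_per_job))
        for start in range(0, total_cores, cores_per_job)
    ]

for lst in cpu_lists():
    print(f'mpirun -n 6 --cpu-list "{lst}" --bind-to cpu-list:ordered $app')
```

The first generated list is "0,1,2,3,4,5" and the last is "42,43,44,45,46,47", matching the `seq -s, 0 5` … `seq -s, 42 47` commands above.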
It has 8 NUMA nodes:

openmpi $ hwloc-info
depth 0: 1 Machine (type #0)
depth 1: 4 Package (type #1)
depth 2: 8 L3Cache (type #6)
depth 3: 48 L2Cache (type #5)
depth 4: 48 L1dCache (type #4)
depth 5: 48 L1iCache (type #9)
depth 6: 48 Core (type #2)
depth 7: 48 PU (type #3)
Special depth -3: 8 NUMANode (type #13)
Special depth -4: 3 Bridge (type #14)
Special depth -5: 5 PCIDev (type #15)
Special depth -6: 5 OSDev (type #16)

lstopo:

openmpi $ lstopo
Machine (63GB total)
  Package L#0
    L3 L#0 (5118KB)
      NUMANode L#0 (P#0 7971MB)
      L2 L#0 (512KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 + PU L#0 (P#0)
      L2 L#1 (512KB) + L1d L#1 (64KB) + L1i L#1 (64KB) + Core L#1 + PU L#1 (P#1)
      L2 L#2 (512KB) + L1d L#2 (64KB) + L1i L#2 (64KB) + Core L#2 + PU L#2 (P#2)
      L2 L#3 (512KB) + L1d L#3 (64KB) + L1i L#3 (64KB) + Core L#3 + PU L#3 (P#3)
      L2 L#4 (512KB) + L1d L#4 (64KB) + L1i L#4 (64KB) + Core L#4 + PU L#4 (P#4)
      L2 L#5 (512KB) + L1d L#5 (64KB) + L1i L#5 (64KB) + Core L#5 + PU L#5 (P#5)
    HostBridge
      PCIBridge
        PCI 02:00.0 (Ethernet)
          Net "enp2s0f0"
        PCI 02:00.1 (Ethernet)
          Net "enp2s0f1"
      PCI 00:11.0 (RAID)
        Block(Disk) "sdb"
        Block(Disk) "sdc"
        Block(Disk) "sda"
      PCI 00:14.1 (IDE)
      PCIBridge
        PCI 01:04.0 (VGA)
    L3 L#1 (5118KB)
      NUMANode L#1 (P#1 8063MB)
      L2 L#6 (512KB) + L1d L#6 (64KB) + L1i L#6 (64KB) + Core L#6 + PU L#6 (P#6)
      L2 L#7 (512KB) + L1d L#7 (64KB) + L1i L#7 (64KB) + Core L#7 + PU L#7 (P#7)
      L2 L#8 (512KB) + L1d L#8 (64KB) + L1i L#8 (64KB) + Core L#8 + PU L#8 (P#8)
      L2 L#9 (512KB) + L1d L#9 (64KB) + L1i L#9 (64KB) + Core L#9 + PU L#9 (P#9)
      L2 L#10 (512KB) + L1d L#10 (64KB) + L1i L#10 (64KB) + Core L#10 + PU L#10 (P#10)
      L2 L#11 (512KB) + L1d L#11 (64KB) + L1i L#11 (64KB) + Core L#11 + PU L#11 (P#11)
  Package L#1
    L3 L#2 (5118KB)
      NUMANode L#2 (P#2 8063MB)
      L2 L#12 (512KB) + L1d L#12 (64KB) + L1i L#12 (64KB) + Core L#12 + PU L#12 (P#12)
      L2 L#13 (512KB) + L1d L#13 (64KB) + L1i L#13 (64KB) + Core L#13 + PU L#13 (P#13)
      L2 L#14 (512KB) + L1d L#14 (64KB) + L1i L#14 (64KB) + Core L#14 + PU L#14 (P#14)
      L2 L#15 (512KB) + L1d L#15 (64KB) + L1i L#15 (64KB) + Core L#15 + PU L#15 (P#15)
      L2 L#16 (512KB) + L1d L#16 (64KB) + L1i L#16 (64KB) + Core L#16 + PU L#16 (P#16)
      L2 L#17 (512KB) + L1d L#17 (64KB) + L1i L#17 (64KB) + Core L#17 + PU L#17 (P#17)
    L3 L#3 (5118KB)
      NUMANode L#3 (P#3 8063MB)
      L2 L#18 (512KB) + L1d L#18 (64KB) + L1i L#18 (64KB) + Core L#18 + PU L#18 (P#18)
      L2 L#19 (512KB) + L1d L#19 (64KB) + L1i L#19 (64KB) + Core L#19 + PU L#19 (P#19)
      L2 L#20 (512KB) + L1d L#20 (64KB) + L1i L#20 (64KB) + Core L#20 + PU L#20 (P#20)
      L2 L#21 (512KB) + L1d L#21 (64KB) + L1i L#21 (64KB) + Core L#21 + PU L#21 (P#21)
      L2 L#22 (512KB) + L1d L#22 (64KB) + L1i L#22 (64KB) + Core L#22 + PU L#22 (P#22)
      L2 L#23 (512KB) + L1d L#23 (64KB) + L1i L#23 (64KB) + Core L#23 + PU L#23 (P#23)
  Package L#2
    L3 L#4 (5118KB)
      NUMANode L#4 (P#4 8063MB)
      L2 L#24 (512KB) + L1d L#24 (64KB) + L1i L#24 (64KB) + Core L#24 + PU L#24 (P#24)
      L2 L#25 (512KB) + L1d L#25 (64KB) + L1i L#25 (64KB) + Core L#25 + PU L#25 (P#25)
      L2 L#26 (512KB) + L1d L#26 (64KB) + L1i L#26 (64KB) + Core L#26 + PU L#26 (P#26)
      L2 L#27 (512KB) + L1d L#27 (64KB) + L1i L#27 (64KB) + Core L#27 + PU L#27 (P#27)
      L2 L#28 (512KB) + L1d L#28 (64KB) + L1i L#28 (64KB) + Core L#28 + PU L#28 (P#28)
      L2 L#29 (512KB) + L1d L#29 (64KB) + L1i L#29 (64KB) + Core L#29 + PU L#29 (P#29)
    L3 L#5 (5118KB)
      NUMANode L#5 (P#5 8063MB)
      L2 L#30 (512KB) + L1d L#30 (64KB) + L1i L#30 (64KB) + Core L#30 + PU L#30 (P#30)
      L2 L#31 (512KB) + L1d L#31 (64KB) + L1i L#31 (64KB) + Core L#31 + PU L#31 (P#31)
      L2 L#32 (512KB) + L1d L#32 (64KB) + L1i L#32 (64KB) + Core L#32 + PU L#32 (P#32)
      L2 L#33 (512KB) + L1d L#33 (64KB) + L1i L#33 (64KB) + Core L#33 + PU L#33 (P#33)
      L2 L#34 (512KB) + L1d L#34 (64KB) + L1i L#34 (64KB) + Core L#34 + PU L#34 (P#34)
      L2 L#35 (512KB) + L1d L#35 (64KB) + L1i L#35 (64KB) + Core L#35 + PU L#35 (P#35)
  Package L#3
    L3 L#6 (5118KB)
      NUMANode L#6 (P#6 8063MB)
      L2 L#36 (512KB) + L1d L#36 (64KB) + L1i L#36 (64KB) + Core L#36 + PU L#36 (P#36)
      L2 L#37 (512KB) + L1d L#37 (64KB) + L1i L#37 (64KB) + Core L#37 + PU L#37 (P#37)
      L2 L#38 (512KB) + L1d L#38 (64KB) + L1i L#38 (64KB) + Core L#38 + PU L#38 (P#38)
      L2 L#39 (512KB) + L1d L#39 (64KB) + L1i L#39 (64KB) + Core L#39 + PU L#39 (P#39)
      L2 L#40 (512KB) + L1d L#40 (64KB) + L1i L#40 (64KB) + Core L#40 + PU L#40 (P#40)
      L2 L#41 (512KB) + L1d L#41 (64KB) + L1i L#41 (64KB) + Core L#41 + PU L#41 (P#41)
    L3 L#7 (5118KB)
      NUMANode L#7 (P#7 8062MB)
      L2 L#42 (512KB) + L1d L#42 (64KB) + L1i L#42 (64KB) + Core L#42 + PU L#42 (P#42)
      L2 L#43 (512KB) + L1d L#43 (64KB) + L1i L#43 (64KB) + Core L#43 + PU L#43 (P#43)
      L2 L#44 (512KB) + L1d L#44 (64KB) + L1i L#44 (64KB) + Core L#44 + PU L#44 (P#44)
      L2 L#45 (512KB) + L1d L#45 (64KB) + L1i L#45 (64KB) + Core L#45 + PU L#45 (P#45)
      L2 L#46 (512KB) + L1d L#46 (64KB) + L1i L#46 (64KB) + Core L#46 + PU L#46 (P#46)
      L2 L#47 (512KB) + L1d L#47 (64KB) + L1i L#47 (64KB) + Core L#47 + PU L#47 (P#47)

openmpi $ numactl -H
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5
node 0 size: 7971 MB
node 0 free: 6858 MB
node 1 cpus: 6 7 8 9 10 11
node 1 size: 8062 MB
node 1 free: 6860 MB
node 2 cpus: 12 13 14 15 16 17
node 2 size: 8062 MB
node 2 free: 6979 MB
node 3 cpus: 18 19 20 21 22 23
node 3 size: 8062 MB
node 3 free: 7132 MB
node 4 cpus: 24 25 26 27 28 29
node 4 size: 8062 MB
node 4 free: 6276 MB
node 5 cpus: 30 31 32 33 34 35
node 5 size: 8062 MB
node 5 free: 7190 MB
node 6 cpus: 36 37 38 39 40 41
node 6 size: 8062 MB
node 6 free: 7059 MB
node 7 cpus: 42 43 44 45 46 47
node 7 size: 8061 MB
node 7 free: 7075 MB
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  16  16  22  16  22  16  22
  1:  16  10  22  16  22  16  22  16
  2:  16  22  10  16  16  22  16  22
  3:  22  16  16  10  22  16  22  16
  4:  16  22  16  22  10  16  16  22
  5:  22  16  22  16  16  10  22  16
  6:  16  22  16  22  16  22  10  16
  7:  22  16  22  16  22  16  16  10

I compiled Open MPI 4.0.4, but possibly some bugs alter the behavior of mpirun, so I'm asking for suggestions on how to properly run parallel code on my system. I'm using Gentoo Linux.
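A side note on the node-distance table printed by numactl -H above: it can be used to pick "close" nodes when one job needs more than a single NUMA node's worth of cores. A minimal sketch (illustrative only; the matrix is copied verbatim from the output above):

```python
# "node distances" matrix from `numactl -H` on this machine:
# 10 = local, 16 = one HyperTransport hop, 22 = two hops.
DIST = [
    [10, 16, 16, 22, 16, 22, 16, 22],
    [16, 10, 22, 16, 22, 16, 22, 16],
    [16, 22, 10, 16, 16, 22, 16, 22],
    [22, 16, 16, 10, 22, 16, 22, 16],
    [16, 22, 16, 22, 10, 16, 16, 22],
    [22, 16, 22, 16, 16, 10, 22, 16],
    [16, 22, 16, 22, 16, 22, 10, 16],
    [22, 16, 22, 16, 22, 16, 16, 10],
]

def nearest_nodes(node):
    """Other NUMA nodes at the minimum off-diagonal distance from `node`."""
    row = DIST[node]
    best = min(d for i, d in enumerate(row) if i != node)
    return [i for i, d in enumerate(row) if i != node and d == best]
```

For example, nearest_nodes(0) yields [1, 2, 4, 6]: a 12-rank job pinned to node 0 plus any of those four nodes stays within one hop. (The interpretation of 16 vs. 22 as hop counts is an assumption from typical Opteron topologies, not stated in the thread.)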
Questions:

1) If I recompile Open MPI with a different version, should I also recompile all my Open MPI programs? I'm a little confused, since I tried to downgrade to 4.0.1 but new random errors popped up.

2) I have many jobs to run, each of which can use from 1 to N CPUs (all MPI: for now I tend to avoid OpenMP). Ideally I would like to run the same simple mpirun command for each job, so that mpirun distributes jobs and processes automatically, taking into account the NUMA architecture (8 NUMA nodes, 6 CPUs per node). I tried to use --map-by numa in 4.0.1 (--map-by numa gives errors in 4.0.4), but many processes ran at only 10-30% CPU. Then I switched back to 4.0.4 (the version I used to compile my programs), and the only effective way I found is "mpirun --bind-to none". That way all CPUs run at 100%, but I lose efficiency due to NUMA. What is the correct way (if one exists) to bind (almost) all the processes of the same job to a single NUMA node? I'm not looking for a perfect solution, but I would at least like the first 8 parallel jobs distributed across the 8 NUMA nodes.

I hope I made myself clear! Thanks for your patience, and sorry for the long letter,
Carlo

--
------------------------------------------------------------
Prof. Carlo Nervi  carlo.ne...@unito.it  Tel: +39 0116707507/8
Fax: +39 0116707855 - Dipartimento di Chimica, via P. Giuria 7,
10125 Torino, Italy. http://lem.ch.unito.it/
ICCC 2020 5-10 July 2020, Rimini, Italy: http://www.iccc2020.com
International Conference on Coordination Chemistry (ICCC 2020)
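One possible answer to the "dynamic cpu list" problem raised in this thread, short of moving to PRRTE: have the launching script keep a record of which NUMA nodes are in use and hand each new job the core range of a free node. A minimal sketch of the allocation logic only (entirely illustrative; real use would need the busy set persisted in a lock-protected file, and nodes released when their jobs exit):

```python
import math

# 8 NUMA nodes x 6 cores, as reported by `numactl -H` above.
CORES_PER_NODE = 6
NUM_NODES = 8

def claim_cores(busy, ncpus):
    """Pick enough free NUMA nodes to host `ncpus` ranks.

    `busy` is the set of node ids already claimed by running jobs.
    Returns (claimed node ids, comma-separated cpu list) suitable for
    --cpu-list, or None if not enough free nodes remain.
    """
    need = math.ceil(ncpus / CORES_PER_NODE)
    free = [n for n in range(NUM_NODES) if n not in busy]
    if len(free) < need:
        return None
    nodes = free[:need]
    cores = [c for n in nodes
             for c in range(n * CORES_PER_NODE, (n + 1) * CORES_PER_NODE)]
    return nodes, ",".join(map(str, cores[:ncpus]))
```

The script would then invoke mpirun -n $ncpus --cpu-list <returned list> --bind-to cpu-list:ordered, following Christoph's suggestion, and the first 8 six-rank jobs land on the 8 distinct NUMA nodes.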