Thank you Ralph for the suggestion! I will consider it carefully, although I'm a chemist and not a sysadmin (I really miss having a dedicated sysadmin in our Department!).
Carlo
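[Editorial aside: the sticking point in the quoted thread below is dynamically managing `--cpu-list` across independent invocations of the same job script. A minimal sketch of one way to do that, using a state file as a running CPU-index counter; the file name, helper name, and the fixed total of 48 CPUs are illustrative assumptions, not anything from the thread.]

```shell
#!/bin/sh
# Hypothetical allocator: hand each new job the next free block of CPUs
# so successive mpirun invocations do not overlap.
STATE=${STATE:-/tmp/next_cpu}   # persists the index of the next free CPU
TOTAL=48                        # CPUs on the machine described below

next_cpu_list() {               # $1 = number of CPUs the job needs
    n=$1
    start=$(cat "$STATE" 2>/dev/null || echo 0)
    end=$((start + n - 1))
    if [ "$end" -ge "$TOTAL" ]; then
        echo "no free CPUs" >&2
        return 1
    fi
    echo $((start + n)) > "$STATE"   # remember where the next job starts
    seq -s, "$start" "$end"          # e.g. "0,1,2,3,4,5"
}

# Each job script would then do something like:
#   CPUS=$(next_cpu_list 6) || exit 1
#   mpirun -n 6 --cpu-list "$CPUS" --bind-to cpu-list:ordered $app
```

A real version would also need locking (e.g. `flock` around the read/write of the state file) and a way to return CPUs to the pool when jobs finish, which is exactly the bookkeeping a resource manager such as PRRTE does for you.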
On Thu, 20 Aug 2020 at 18:45, Ralph Castain via users <users@lists.open-mpi.org> wrote:

> Your use-case sounds more like a workflow than an application - in which
> case, you probably should be using PRRTE to execute it instead of "mpirun",
> as PRRTE will "remember" the multiple jobs and avoid the overload scenario
> you describe.
>
> This link will walk you through how to get and build it:
> https://openpmix.github.io/code/getting-the-pmix-reference-server
>
> This link will provide some directions on how to use it:
> https://openpmix.github.io/support/how-to/running-apps-under-psrvr
>
> You would need to update your script to use "prun --personality ompi"
> instead of "mpirun", but things should otherwise be the same.
>
> On Aug 20, 2020, at 9:01 AM, Carlo Nervi via users <users@lists.open-mpi.org> wrote:
>
> Thank you, Christoph. I did not consider --cpu-list.
> However, this works if I have a single script that launches several
> jobs (note that each job may need a different number of CPUs). In my
> case I have one script (which calls mpirun) that is invoked many times:
> it is called periodically until all 48 CPUs are busy, and again whenever
> other jobs finish...
>
> If the script is always the same, I'm afraid that all the jobs after the
> first will be bound to the same cpu-list. Am I wrong? Will
>
> mpirun -n xx --cpu-list "$(seq -s, 0 47)" --bind-to cpu-list:ordered $app
>
> (with xx < 48) bind all the processes to the first 6 CPUs?
> I need a way to manage the cpu-list dynamically, but I was hoping that
> mpirun could handle this itself.
> If not, I'm afraid the simplest solution is to use --bind-to none.
> Thank you,
> Carlo
>
> On Thu, 20 Aug 2020 at 17:24, Christoph Niethammer <nietham...@hlrs.de> wrote:
>
>> Hello Carlo,
>>
>> If you execute multiple mpirun commands, they will not know about each
>> other's resource bindings. E.g. if you bind to cores, each mpirun will
>> start assigning from the same first core again. This then results in
>> oversubscription of the cores, which slows down your programs - as you
>> noticed.
>>
>> You can use "--cpu-list" together with "--bind-to cpu-list:ordered".
>> So if you start all your simulations in a single script, this would look
>> like:
>>
>> mpirun -n 6 --cpu-list "$(seq -s, 0 5)" --bind-to cpu-list:ordered $app
>> mpirun -n 6 --cpu-list "$(seq -s, 6 11)" --bind-to cpu-list:ordered $app
>> ...
>> mpirun -n 6 --cpu-list "$(seq -s, 42 47)" --bind-to cpu-list:ordered $app
>>
>> Best
>> Christoph
>>
>> ----- Original Message -----
>> From: "Open MPI Users" <users@lists.open-mpi.org>
>> To: "Open MPI Users" <users@lists.open-mpi.org>
>> Cc: "Carlo Nervi" <carlo.ne...@unito.it>
>> Sent: Thursday, 20 August, 2020 12:17:21
>> Subject: [OMPI users] OMPI 4.0.4 how to use mpirun properly in numa architecture
>>
>> Dear OMPI community,
>> I'm a simple end-user with no particular experience.
>> I compile quantum chemistry programs and use them in parallel.
>>
>> My system is a 4-socket, 12-core-per-socket Opteron 6168 system, for a
>> total of 48 cores and 64 GB of RAM.
>> It has 8 NUMA nodes:
>>
>> openmpi $ hwloc-info
>> depth 0: 1 Machine (type #0)
>> depth 1: 4 Package (type #1)
>> depth 2: 8 L3Cache (type #6)
>> depth 3: 48 L2Cache (type #5)
>> depth 4: 48 L1dCache (type #4)
>> depth 5: 48 L1iCache (type #9)
>> depth 6: 48 Core (type #2)
>> depth 7: 48 PU (type #3)
>> Special depth -3: 8 NUMANode (type #13)
>> Special depth -4: 3 Bridge (type #14)
>> Special depth -5: 5 PCIDev (type #15)
>> Special depth -6: 5 OSDev (type #16)
>>
>> lstopo:
>>
>> openmpi $ lstopo
>> Machine (63GB total)
>>   Package L#0
>>     L3 L#0 (5118KB)
>>       NUMANode L#0 (P#0 7971MB)
>>       L2 L#0 (512KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 + PU L#0 (P#0)
>>       L2 L#1 (512KB) + L1d L#1 (64KB) + L1i L#1 (64KB) + Core L#1 + PU L#1 (P#1)
>>       L2 L#2 (512KB) + L1d L#2 (64KB) + L1i L#2 (64KB) + Core L#2 + PU L#2 (P#2)
>>       L2 L#3 (512KB) + L1d L#3 (64KB) + L1i L#3 (64KB) + Core L#3 + PU L#3 (P#3)
>>       L2 L#4 (512KB) + L1d L#4 (64KB) + L1i L#4 (64KB) + Core L#4 + PU L#4 (P#4)
>>       L2 L#5 (512KB) + L1d L#5 (64KB) + L1i L#5 (64KB) + Core L#5 + PU L#5 (P#5)
>>     HostBridge
>>       PCIBridge
>>         PCI 02:00.0 (Ethernet)
>>           Net "enp2s0f0"
>>         PCI 02:00.1 (Ethernet)
>>           Net "enp2s0f1"
>>       PCI 00:11.0 (RAID)
>>         Block(Disk) "sdb"
>>         Block(Disk) "sdc"
>>         Block(Disk) "sda"
>>       PCI 00:14.1 (IDE)
>>       PCIBridge
>>         PCI 01:04.0 (VGA)
>>     L3 L#1 (5118KB)
>>       NUMANode L#1 (P#1 8063MB)
>>       L2 L#6 (512KB) + L1d L#6 (64KB) + L1i L#6 (64KB) + Core L#6 + PU L#6 (P#6)
>>       L2 L#7 (512KB) + L1d L#7 (64KB) + L1i L#7 (64KB) + Core L#7 + PU L#7 (P#7)
>>       L2 L#8 (512KB) + L1d L#8 (64KB) + L1i L#8 (64KB) + Core L#8 + PU L#8 (P#8)
>>       L2 L#9 (512KB) + L1d L#9 (64KB) + L1i L#9 (64KB) + Core L#9 + PU L#9 (P#9)
>>       L2 L#10 (512KB) + L1d L#10 (64KB) + L1i L#10 (64KB) + Core L#10 + PU L#10 (P#10)
>>       L2 L#11 (512KB) + L1d L#11 (64KB) + L1i L#11 (64KB) + Core L#11 + PU L#11 (P#11)
>>   Package L#1
>>     L3 L#2 (5118KB)
>>       NUMANode L#2 (P#2 8063MB)
>>       L2 L#12 (512KB) + L1d L#12 (64KB) + L1i L#12 (64KB) + Core L#12 + PU L#12 (P#12)
>>       L2 L#13 (512KB) + L1d L#13 (64KB) + L1i L#13 (64KB) + Core L#13 + PU L#13 (P#13)
>>       L2 L#14 (512KB) + L1d L#14 (64KB) + L1i L#14 (64KB) + Core L#14 + PU L#14 (P#14)
>>       L2 L#15 (512KB) + L1d L#15 (64KB) + L1i L#15 (64KB) + Core L#15 + PU L#15 (P#15)
>>       L2 L#16 (512KB) + L1d L#16 (64KB) + L1i L#16 (64KB) + Core L#16 + PU L#16 (P#16)
>>       L2 L#17 (512KB) + L1d L#17 (64KB) + L1i L#17 (64KB) + Core L#17 + PU L#17 (P#17)
>>     L3 L#3 (5118KB)
>>       NUMANode L#3 (P#3 8063MB)
>>       L2 L#18 (512KB) + L1d L#18 (64KB) + L1i L#18 (64KB) + Core L#18 + PU L#18 (P#18)
>>       L2 L#19 (512KB) + L1d L#19 (64KB) + L1i L#19 (64KB) + Core L#19 + PU L#19 (P#19)
>>       L2 L#20 (512KB) + L1d L#20 (64KB) + L1i L#20 (64KB) + Core L#20 + PU L#20 (P#20)
>>       L2 L#21 (512KB) + L1d L#21 (64KB) + L1i L#21 (64KB) + Core L#21 + PU L#21 (P#21)
>>       L2 L#22 (512KB) + L1d L#22 (64KB) + L1i L#22 (64KB) + Core L#22 + PU L#22 (P#22)
>>       L2 L#23 (512KB) + L1d L#23 (64KB) + L1i L#23 (64KB) + Core L#23 + PU L#23 (P#23)
>>   Package L#2
>>     L3 L#4 (5118KB)
>>       NUMANode L#4 (P#4 8063MB)
>>       L2 L#24 (512KB) + L1d L#24 (64KB) + L1i L#24 (64KB) + Core L#24 + PU L#24 (P#24)
>>       L2 L#25 (512KB) + L1d L#25 (64KB) + L1i L#25 (64KB) + Core L#25 + PU L#25 (P#25)
>>       L2 L#26 (512KB) + L1d L#26 (64KB) + L1i L#26 (64KB) + Core L#26 + PU L#26 (P#26)
>>       L2 L#27 (512KB) + L1d L#27 (64KB) + L1i L#27 (64KB) + Core L#27 + PU L#27 (P#27)
>>       L2 L#28 (512KB) + L1d L#28 (64KB) + L1i L#28 (64KB) + Core L#28 + PU L#28 (P#28)
>>       L2 L#29 (512KB) + L1d L#29 (64KB) + L1i L#29 (64KB) + Core L#29 + PU L#29 (P#29)
>>     L3 L#5 (5118KB)
>>       NUMANode L#5 (P#5 8063MB)
>>       L2 L#30 (512KB) + L1d L#30 (64KB) + L1i L#30 (64KB) + Core L#30 + PU L#30 (P#30)
>>       L2 L#31 (512KB) + L1d L#31 (64KB) + L1i L#31 (64KB) + Core L#31 + PU L#31 (P#31)
>>       L2 L#32 (512KB) + L1d L#32 (64KB) + L1i L#32 (64KB) + Core L#32 + PU L#32 (P#32)
>>       L2 L#33 (512KB) + L1d L#33 (64KB) + L1i L#33 (64KB) + Core L#33 + PU L#33 (P#33)
>>       L2 L#34 (512KB) + L1d L#34 (64KB) + L1i L#34 (64KB) + Core L#34 + PU L#34 (P#34)
>>       L2 L#35 (512KB) + L1d L#35 (64KB) + L1i L#35 (64KB) + Core L#35 + PU L#35 (P#35)
>>   Package L#3
>>     L3 L#6 (5118KB)
>>       NUMANode L#6 (P#6 8063MB)
>>       L2 L#36 (512KB) + L1d L#36 (64KB) + L1i L#36 (64KB) + Core L#36 + PU L#36 (P#36)
>>       L2 L#37 (512KB) + L1d L#37 (64KB) + L1i L#37 (64KB) + Core L#37 + PU L#37 (P#37)
>>       L2 L#38 (512KB) + L1d L#38 (64KB) + L1i L#38 (64KB) + Core L#38 + PU L#38 (P#38)
>>       L2 L#39 (512KB) + L1d L#39 (64KB) + L1i L#39 (64KB) + Core L#39 + PU L#39 (P#39)
>>       L2 L#40 (512KB) + L1d L#40 (64KB) + L1i L#40 (64KB) + Core L#40 + PU L#40 (P#40)
>>       L2 L#41 (512KB) + L1d L#41 (64KB) + L1i L#41 (64KB) + Core L#41 + PU L#41 (P#41)
>>     L3 L#7 (5118KB)
>>       NUMANode L#7 (P#7 8062MB)
>>       L2 L#42 (512KB) + L1d L#42 (64KB) + L1i L#42 (64KB) + Core L#42 + PU L#42 (P#42)
>>       L2 L#43 (512KB) + L1d L#43 (64KB) + L1i L#43 (64KB) + Core L#43 + PU L#43 (P#43)
>>       L2 L#44 (512KB) + L1d L#44 (64KB) + L1i L#44 (64KB) + Core L#44 + PU L#44 (P#44)
>>       L2 L#45 (512KB) + L1d L#45 (64KB) + L1i L#45 (64KB) + Core L#45 + PU L#45 (P#45)
>>       L2 L#46 (512KB) + L1d L#46 (64KB) + L1i L#46 (64KB) + Core L#46 + PU L#46 (P#46)
>>       L2 L#47 (512KB) + L1d L#47 (64KB) + L1i L#47 (64KB) + Core L#47 + PU L#47 (P#47)
>>
>> openmpi $ numactl -H
>> available: 8 nodes (0-7)
>> node 0 cpus: 0 1 2 3 4 5
>> node 0 size: 7971 MB
>> node 0 free: 6858 MB
>> node 1 cpus: 6 7 8 9 10 11
>> node 1 size: 8062 MB
>> node 1 free: 6860 MB
>> node 2 cpus: 12 13 14 15 16 17
>> node 2 size: 8062 MB
>> node 2 free: 6979 MB
>> node 3 cpus: 18 19 20 21 22 23
>> node 3 size: 8062 MB
>> node 3 free: 7132 MB
>> node 4 cpus: 24 25 26 27 28 29
>> node 4 size: 8062 MB
>> node 4 free: 6276 MB
>> node 5 cpus: 30 31 32 33 34 35
>> node 5 size: 8062 MB
>> node 5 free: 7190 MB
>> node 6 cpus: 36 37 38 39 40 41
>> node 6 size: 8062 MB
>> node 6 free: 7059 MB
>> node 7 cpus: 42 43 44 45 46 47
>> node 7 size: 8061 MB
>> node 7 free: 7075 MB
>> node distances:
>> node   0   1   2   3   4   5   6   7
>>   0:  10  16  16  22  16  22  16  22
>>   1:  16  10  22  16  22  16  22  16
>>   2:  16  22  10  16  16  22  16  22
>>   3:  22  16  16  10  22  16  22  16
>>   4:  16  22  16  22  10  16  16  22
>>   5:  22  16  22  16  16  10  22  16
>>   6:  16  22  16  22  16  22  10  16
>>   7:  22  16  22  16  22  16  16  10
>>
>> I compiled Open MPI 4.0.4, but some bugs probably alter the behavior of
>> mpirun, so I'm asking for your suggestions on how to run parallel code
>> properly on my system. I'm using Gentoo Linux.
>> Questions:
>>
>> 1) If I recompile Open MPI and change the version, should I also
>> recompile all my Open MPI programs? I'm a little confused, since I tried
>> to downgrade to 4.0.1 but new random errors popped up.
>>
>> 2) I have many jobs to run, each of which can use from 1 to N CPUs (all
>> MPI; for now I tend to avoid OpenMP). Ideally I would like to run the
>> same simple mpirun command for each job, so that mpirun distributes jobs
>> and processes automatically, taking into account the NUMA architecture
>> (8 NUMA nodes, 6 CPUs per node).
>> I tried to use --map-by numa in 4.0.1 (--map-by numa gives errors in
>> 4.0.4), but many processes ran at only 10-30% CPU.
>> Then I switched back to 4.0.4 (the version I used to compile my
>> programs), and the only effective way I found is to use "mpirun
>> --bind-to none". This way all CPUs run at 100%, but I lose efficiency
>> due to NUMA.
>>
>> What is the correct way (if one exists) to bind (almost) all the
>> processes of the same job to a single NUMA node? I'm not looking for a
>> perfect solution, but I'd like at least the first 8 parallel jobs to be
>> distributed across the 8 NUMA nodes.
>>
>> I hope I made myself clear!
>> Thanks for your patience, and sorry for the long letter,
>> Carlo
>>
>> --
>> ------------------------------------------------------------
>> Prof. Carlo Nervi carlo.ne...@unito.it Tel: +39 0116707507/8
>> Fax: +39 0116707855 - Dipartimento di Chimica, via
>> P. Giuria 7, 10125 Torino, Italy. http://lem.ch.unito.it/
>>
>> ICCC 2020 5-10 July 2020, Rimini, Italy: http://www.iccc2020.com
>> International Conference on Coordination Chemistry (ICCC 2020)
>> <http://www.iccc2020.com/>
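[Editorial aside: the last question in the thread, confining each whole job to a single NUMA node, can also be handled outside mpirun by wrapping each job in numactl and picking the node round-robin. A hedged sketch follows; pick_node, launch_cmd, and the application name are made-up helpers, and the command line is printed rather than executed so the sketch runs even without numactl or mpirun installed.]

```shell
#!/bin/sh
# Hypothetical helper: choose which NUMA node job number $1 should use,
# assuming the 8 nodes of the machine described above.
pick_node() {
    echo $(($1 % 8))
}

# Illustrative launch line ($2 is a placeholder application path).
# numactl restricts both CPUs and memory to the chosen node, so mpirun
# itself needs no binding policy (--bind-to none).
launch_cmd() {                  # $1 = job number, $2 = app
    node=$(pick_node "$1")
    echo "numactl --cpunodebind=$node --membind=$node" \
         "mpirun -n 6 --bind-to none $2"
}

# A job script would then run:  eval "$(launch_cmd "$JOBNUM" ./myapp)"
```

This sidesteps the version-dependent --map-by numa behavior mentioned in the thread: the kernel cpuset set up by numactl confines all ranks of one job to one node, while successive jobs land on successive nodes.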