Your use-case sounds more like a workflow than an application - in which case you should probably be using PRRTE to execute it instead of "mpirun", as PRRTE will "remember" the multiple jobs and avoid the overload scenario you describe.
This link will walk you through how to get and build it:
https://openpmix.github.io/code/getting-the-pmix-reference-server

This link provides some directions on how to use it:
https://openpmix.github.io/support/how-to/running-apps-under-psrvr

You would need to update your script to use "prun --personality ompi" instead of "mpirun", but things should otherwise be the same.

On Aug 20, 2020, at 9:01 AM, Carlo Nervi via users <users@lists.open-mpi.org> wrote:

Thank you, Christoph. I did not consider --cpu-list. However, that works if a single script launches several jobs (note that each job may use a different number of CPUs). In my case the same script (which calls mpirun) is invoked many times: it is called periodically until all 48 CPUs are busy, and again whenever other jobs finish. If the script is always the same, I'm afraid that all the jobs after the first will be bound to the same cpu list. Am I wrong? Will

mpirun -n xx --cpu-list "$(seq -s, 0 47)" --bind-to cpu-list:ordered $app

(with xx < 48) bind all the processes to the first xx CPUs? I need a way to manage the cpu list dynamically, but I was hoping that mpirun could do that. If not, I'm afraid the simplest solution is to use --bind-to none.

Thank you,
Carlo

On Thu, 20 Aug 2020 at 17:24, Christoph Niethammer <nietham...@hlrs.de> wrote:

Hello Carlo,

if you execute multiple mpirun commands they will not know about each other's resource bindings. E.g. if you bind to cores, each mpirun will start assigning from the same first core again. This then results in oversubscription of the cores, which slows down your programs - as you noticed.
You can use "--cpu-list" together with "--bind-to cpu-list:ordered". So if you start all your simulations in a single script, this would look like

mpirun -n 6 --cpu-list "$(seq -s, 0 5)" --bind-to cpu-list:ordered $app
mpirun -n 6 --cpu-list "$(seq -s, 6 11)" --bind-to cpu-list:ordered $app
...
mpirun -n 6 --cpu-list "$(seq -s, 42 47)" --bind-to cpu-list:ordered $app

Best
Christoph

----- Original Message -----
From: "Open MPI Users" <users@lists.open-mpi.org>
To: "Open MPI Users" <users@lists.open-mpi.org>
Cc: "Carlo Nervi" <carlo.ne...@unito.it>
Sent: Thursday, 20 August, 2020 12:17:21
Subject: [OMPI users] OMPI 4.0.4 how to use mpirun properly in numa architecture

Dear OMPI community,
I'm a simple end-user with no particular experience. I compile quantum chemistry programs and use them in parallel. My system is a 4-socket, 12-cores-per-socket Opteron 6168 machine, for a total of 48 cores and 64 GB of RAM.
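As a side note on the --cpu-list approach discussed in this thread: the per-job core lists do not have to be written out by hand; they can be generated programmatically. A minimal sketch (the helper name is illustrative, not part of Open MPI):

```python
# Sketch: build the disjoint "--cpu-list" strings used by the per-job
# mpirun commands in this thread. On this machine each 6-core block
# coincides with one NUMA node. `cpu_lists` is a hypothetical helper.
def cpu_lists(total_cores=48, cores_per_job=6):
    """Return one comma-separated core list per job slot."""
    return [
        ",".join(str(c) for c in range(start, start + cores_per_job))
        for start in range(0, total_cores, cores_per_job)
    ]

for lst in cpu_lists():
    print(f'mpirun -n 6 --cpu-list "{lst}" --bind-to cpu-list:ordered $app')
```

The first generated list is "0,1,2,3,4,5" and the last is "42,43,44,45,46,47", matching the `seq -s, 0 5` … `seq -s, 42 47` commands above.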
It has 8 NUMA nodes:

openmpi $ hwloc-info
depth 0: 1 Machine (type #0)
depth 1: 4 Package (type #1)
depth 2: 8 L3Cache (type #6)
depth 3: 48 L2Cache (type #5)
depth 4: 48 L1dCache (type #4)
depth 5: 48 L1iCache (type #9)
depth 6: 48 Core (type #2)
depth 7: 48 PU (type #3)
Special depth -3: 8 NUMANode (type #13)
Special depth -4: 3 Bridge (type #14)
Special depth -5: 5 PCIDev (type #15)
Special depth -6: 5 OSDev (type #16)

lstopo:

openmpi $ lstopo
Machine (63GB total)
  Package L#0
    L3 L#0 (5118KB)
      NUMANode L#0 (P#0 7971MB)
      L2 L#0 (512KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 + PU L#0 (P#0)
      L2 L#1 (512KB) + L1d L#1 (64KB) + L1i L#1 (64KB) + Core L#1 + PU L#1 (P#1)
      L2 L#2 (512KB) + L1d L#2 (64KB) + L1i L#2 (64KB) + Core L#2 + PU L#2 (P#2)
      L2 L#3 (512KB) + L1d L#3 (64KB) + L1i L#3 (64KB) + Core L#3 + PU L#3 (P#3)
      L2 L#4 (512KB) + L1d L#4 (64KB) + L1i L#4 (64KB) + Core L#4 + PU L#4 (P#4)
      L2 L#5 (512KB) + L1d L#5 (64KB) + L1i L#5 (64KB) + Core L#5 + PU L#5 (P#5)
    HostBridge
      PCIBridge
        PCI 02:00.0 (Ethernet)
          Net "enp2s0f0"
        PCI 02:00.1 (Ethernet)
          Net "enp2s0f1"
      PCI 00:11.0 (RAID)
        Block(Disk) "sdb"
        Block(Disk) "sdc"
        Block(Disk) "sda"
      PCI 00:14.1 (IDE)
      PCIBridge
        PCI 01:04.0 (VGA)
    L3 L#1 (5118KB)
      NUMANode L#1 (P#1 8063MB)
      L2 L#6 (512KB) + L1d L#6 (64KB) + L1i L#6 (64KB) + Core L#6 + PU L#6 (P#6)
      L2 L#7 (512KB) + L1d L#7 (64KB) + L1i L#7 (64KB) + Core L#7 + PU L#7 (P#7)
      L2 L#8 (512KB) + L1d L#8 (64KB) + L1i L#8 (64KB) + Core L#8 + PU L#8 (P#8)
      L2 L#9 (512KB) + L1d L#9 (64KB) + L1i L#9 (64KB) + Core L#9 + PU L#9 (P#9)
      L2 L#10 (512KB) + L1d L#10 (64KB) + L1i L#10 (64KB) + Core L#10 + PU L#10 (P#10)
      L2 L#11 (512KB) + L1d L#11 (64KB) + L1i L#11 (64KB) + Core L#11 + PU L#11 (P#11)
  Package L#1
    L3 L#2 (5118KB)
      NUMANode L#2 (P#2 8063MB)
      L2 L#12 (512KB) + L1d L#12 (64KB) + L1i L#12 (64KB) + Core L#12 + PU L#12 (P#12)
      L2 L#13 (512KB) + L1d L#13 (64KB) + L1i L#13 (64KB) + Core L#13 + PU L#13 (P#13)
      L2 L#14 (512KB) + L1d L#14 (64KB) + L1i L#14 (64KB) + Core L#14 + PU L#14 (P#14)
      L2 L#15 (512KB) + L1d L#15 (64KB) + L1i L#15 (64KB) + Core L#15 + PU L#15 (P#15)
      L2 L#16 (512KB) + L1d L#16 (64KB) + L1i L#16 (64KB) + Core L#16 + PU L#16 (P#16)
      L2 L#17 (512KB) + L1d L#17 (64KB) + L1i L#17 (64KB) + Core L#17 + PU L#17 (P#17)
    L3 L#3 (5118KB)
      NUMANode L#3 (P#3 8063MB)
      L2 L#18 (512KB) + L1d L#18 (64KB) + L1i L#18 (64KB) + Core L#18 + PU L#18 (P#18)
      L2 L#19 (512KB) + L1d L#19 (64KB) + L1i L#19 (64KB) + Core L#19 + PU L#19 (P#19)
      L2 L#20 (512KB) + L1d L#20 (64KB) + L1i L#20 (64KB) + Core L#20 + PU L#20 (P#20)
      L2 L#21 (512KB) + L1d L#21 (64KB) + L1i L#21 (64KB) + Core L#21 + PU L#21 (P#21)
      L2 L#22 (512KB) + L1d L#22 (64KB) + L1i L#22 (64KB) + Core L#22 + PU L#22 (P#22)
      L2 L#23 (512KB) + L1d L#23 (64KB) + L1i L#23 (64KB) + Core L#23 + PU L#23 (P#23)
  Package L#2
    L3 L#4 (5118KB)
      NUMANode L#4 (P#4 8063MB)
      L2 L#24 (512KB) + L1d L#24 (64KB) + L1i L#24 (64KB) + Core L#24 + PU L#24 (P#24)
      L2 L#25 (512KB) + L1d L#25 (64KB) + L1i L#25 (64KB) + Core L#25 + PU L#25 (P#25)
      L2 L#26 (512KB) + L1d L#26 (64KB) + L1i L#26 (64KB) + Core L#26 + PU L#26 (P#26)
      L2 L#27 (512KB) + L1d L#27 (64KB) + L1i L#27 (64KB) + Core L#27 + PU L#27 (P#27)
      L2 L#28 (512KB) + L1d L#28 (64KB) + L1i L#28 (64KB) + Core L#28 + PU L#28 (P#28)
      L2 L#29 (512KB) + L1d L#29 (64KB) + L1i L#29 (64KB) + Core L#29 + PU L#29 (P#29)
    L3 L#5 (5118KB)
      NUMANode L#5 (P#5 8063MB)
      L2 L#30 (512KB) + L1d L#30 (64KB) + L1i L#30 (64KB) + Core L#30 + PU L#30 (P#30)
      L2 L#31 (512KB) + L1d L#31 (64KB) + L1i L#31 (64KB) + Core L#31 + PU L#31 (P#31)
      L2 L#32 (512KB) + L1d L#32 (64KB) + L1i L#32 (64KB) + Core L#32 + PU L#32 (P#32)
      L2 L#33 (512KB) + L1d L#33 (64KB) + L1i L#33 (64KB) + Core L#33 + PU L#33 (P#33)
      L2 L#34 (512KB) + L1d L#34 (64KB) + L1i L#34 (64KB) + Core L#34 + PU L#34 (P#34)
      L2 L#35 (512KB) + L1d L#35 (64KB) + L1i L#35 (64KB) + Core L#35 + PU L#35 (P#35)
  Package L#3
    L3 L#6 (5118KB)
      NUMANode L#6 (P#6 8063MB)
      L2 L#36 (512KB) + L1d L#36 (64KB) + L1i L#36 (64KB) + Core L#36 + PU L#36 (P#36)
      L2 L#37 (512KB) + L1d L#37 (64KB) + L1i L#37 (64KB) + Core L#37 + PU L#37 (P#37)
      L2 L#38 (512KB) + L1d L#38 (64KB) + L1i L#38 (64KB) + Core L#38 + PU L#38 (P#38)
      L2 L#39 (512KB) + L1d L#39 (64KB) + L1i L#39 (64KB) + Core L#39 + PU L#39 (P#39)
      L2 L#40 (512KB) + L1d L#40 (64KB) + L1i L#40 (64KB) + Core L#40 + PU L#40 (P#40)
      L2 L#41 (512KB) + L1d L#41 (64KB) + L1i L#41 (64KB) + Core L#41 + PU L#41 (P#41)
    L3 L#7 (5118KB)
      NUMANode L#7 (P#7 8062MB)
      L2 L#42 (512KB) + L1d L#42 (64KB) + L1i L#42 (64KB) + Core L#42 + PU L#42 (P#42)
      L2 L#43 (512KB) + L1d L#43 (64KB) + L1i L#43 (64KB) + Core L#43 + PU L#43 (P#43)
      L2 L#44 (512KB) + L1d L#44 (64KB) + L1i L#44 (64KB) + Core L#44 + PU L#44 (P#44)
      L2 L#45 (512KB) + L1d L#45 (64KB) + L1i L#45 (64KB) + Core L#45 + PU L#45 (P#45)
      L2 L#46 (512KB) + L1d L#46 (64KB) + L1i L#46 (64KB) + Core L#46 + PU L#46 (P#46)
      L2 L#47 (512KB) + L1d L#47 (64KB) + L1i L#47 (64KB) + Core L#47 + PU L#47 (P#47)

openmpi $ numactl -H
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5
node 0 size: 7971 MB
node 0 free: 6858 MB
node 1 cpus: 6 7 8 9 10 11
node 1 size: 8062 MB
node 1 free: 6860 MB
node 2 cpus: 12 13 14 15 16 17
node 2 size: 8062 MB
node 2 free: 6979 MB
node 3 cpus: 18 19 20 21 22 23
node 3 size: 8062 MB
node 3 free: 7132 MB
node 4 cpus: 24 25 26 27 28 29
node 4 size: 8062 MB
node 4 free: 6276 MB
node 5 cpus: 30 31 32 33 34 35
node 5 size: 8062 MB
node 5 free: 7190 MB
node 6 cpus: 36 37 38 39 40 41
node 6 size: 8062 MB
node 6 free: 7059 MB
node 7 cpus: 42 43 44 45 46 47
node 7 size: 8061 MB
node 7 free: 7075 MB
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  16  16  22  16  22  16  22
  1:  16  10  22  16  22  16  22  16
  2:  16  22  10  16  16  22  16  22
  3:  22  16  16  10  22  16  22  16
  4:  16  22  16  22  10  16  16  22
  5:  22  16  22  16  16  10  22  16
  6:  16  22  16  22  16  22  10  16
  7:  22  16  22  16  22  16  16  10

I compiled Open MPI 4.0.4, but possibly some bugs alter the behavior of mpirun, so I'm asking for suggestions on how to properly run parallel code on my system. I'm using Gentoo Linux.
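A side note on the node-distance table printed by numactl -H above: it can be used to pick "close" nodes when one job needs more than a single NUMA node's worth of cores. A minimal sketch (illustrative only; the matrix is copied verbatim from the output above):

```python
# "node distances" matrix from `numactl -H` on this machine:
# 10 = local, 16 = one HyperTransport hop, 22 = two hops.
DIST = [
    [10, 16, 16, 22, 16, 22, 16, 22],
    [16, 10, 22, 16, 22, 16, 22, 16],
    [16, 22, 10, 16, 16, 22, 16, 22],
    [22, 16, 16, 10, 22, 16, 22, 16],
    [16, 22, 16, 22, 10, 16, 16, 22],
    [22, 16, 22, 16, 16, 10, 22, 16],
    [16, 22, 16, 22, 16, 22, 10, 16],
    [22, 16, 22, 16, 22, 16, 16, 10],
]

def nearest_nodes(node):
    """Other NUMA nodes at the minimum off-diagonal distance from `node`."""
    row = DIST[node]
    best = min(d for i, d in enumerate(row) if i != node)
    return [i for i, d in enumerate(row) if i != node and d == best]
```

For example, nearest_nodes(0) yields [1, 2, 4, 6]: a 12-rank job pinned to node 0 plus any of those four nodes stays within one hop. (The interpretation of 16 vs. 22 as hop counts is an assumption from typical Opteron topologies, not stated in the thread.)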
Questions:

1) If I recompile Open MPI with a different version, should I also recompile all my Open MPI programs? I'm a little confused, since I tried to downgrade to 4.0.1 but new random errors popped up.

2) I have many jobs to run, each of which can use from 1 to N CPUs (all MPI: for now I tend to avoid OpenMP). Ideally I would like to run the same simple mpirun command for each job, so that mpirun distributes jobs and processes automatically, taking into account the NUMA architecture (8 NUMA nodes, 6 CPUs per node). I tried to use --map-by numa in 4.0.1 (--map-by numa gives errors in 4.0.4), but many processes ran at only 10-30% CPU. Then I switched back to 4.0.4 (the version I used to compile my programs), and the only effective way I found is "mpirun --bind-to none". That way all CPUs run at 100%, but I lose efficiency due to NUMA. What is the correct way (if one exists) to bind (almost) all the processes of the same job to a single NUMA node? I'm not looking for a perfect solution, but I would at least like the first 8 parallel jobs distributed across the 8 NUMA nodes.

I hope I made myself clear! Thanks for your patience, and sorry for the long letter,
Carlo

--
------------------------------------------------------------
Prof. Carlo Nervi  carlo.ne...@unito.it  Tel: +39 0116707507/8
Fax: +39 0116707855 - Dipartimento di Chimica, via P. Giuria 7,
10125 Torino, Italy. http://lem.ch.unito.it/
ICCC 2020 5-10 July 2020, Rimini, Italy: http://www.iccc2020.com
International Conference on Coordination Chemistry (ICCC 2020)
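One possible answer to the "dynamic cpu list" problem raised in this thread, short of moving to PRRTE: have the launching script keep a record of which NUMA nodes are in use and hand each new job the core range of a free node. A minimal sketch of the allocation logic only (entirely illustrative; real use would need the busy set persisted in a lock-protected file, and nodes released when their jobs exit):

```python
import math

# 8 NUMA nodes x 6 cores, as reported by `numactl -H` above.
CORES_PER_NODE = 6
NUM_NODES = 8

def claim_cores(busy, ncpus):
    """Pick enough free NUMA nodes to host `ncpus` ranks.

    `busy` is the set of node ids already claimed by running jobs.
    Returns (claimed node ids, comma-separated cpu list) suitable for
    --cpu-list, or None if not enough free nodes remain.
    """
    need = math.ceil(ncpus / CORES_PER_NODE)
    free = [n for n in range(NUM_NODES) if n not in busy]
    if len(free) < need:
        return None
    nodes = free[:need]
    cores = [c for n in nodes
             for c in range(n * CORES_PER_NODE, (n + 1) * CORES_PER_NODE)]
    return nodes, ",".join(map(str, cores[:ncpus]))
```

The script would then invoke mpirun -n $ncpus --cpu-list <returned list> --bind-to cpu-list:ordered, following Christoph's suggestion, and the first 8 six-rank jobs land on the 8 distinct NUMA nodes.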