Dear OMPI community,

I'm a simple end user with no particular expertise. I compile quantum chemistry programs and run them in parallel.
My system is a four-socket Opteron 6168 machine (12 cores per socket, 48 cores total) with 64 GB of RAM. It has 8 NUMA nodes:

openmpi $ hwloc-info
depth 0:           1 Machine (type #0)
depth 1:           4 Package (type #1)
depth 2:           8 L3Cache (type #6)
depth 3:          48 L2Cache (type #5)
depth 4:          48 L1dCache (type #4)
depth 5:          48 L1iCache (type #9)
depth 6:          48 Core (type #2)
depth 7:          48 PU (type #3)
Special depth -3:  8 NUMANode (type #13)
Special depth -4:  3 Bridge (type #14)
Special depth -5:  5 PCIDev (type #15)
Special depth -6:  5 OSDev (type #16)

lstopo:

openmpi $ lstopo
Machine (63GB total)
  Package L#0
    L3 L#0 (5118KB)
      NUMANode L#0 (P#0 7971MB)
      L2 L#0 (512KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 + PU L#0 (P#0)
      L2 L#1 (512KB) + L1d L#1 (64KB) + L1i L#1 (64KB) + Core L#1 + PU L#1 (P#1)
      L2 L#2 (512KB) + L1d L#2 (64KB) + L1i L#2 (64KB) + Core L#2 + PU L#2 (P#2)
      L2 L#3 (512KB) + L1d L#3 (64KB) + L1i L#3 (64KB) + Core L#3 + PU L#3 (P#3)
      L2 L#4 (512KB) + L1d L#4 (64KB) + L1i L#4 (64KB) + Core L#4 + PU L#4 (P#4)
      L2 L#5 (512KB) + L1d L#5 (64KB) + L1i L#5 (64KB) + Core L#5 + PU L#5 (P#5)
    HostBridge
      PCIBridge
        PCI 02:00.0 (Ethernet)
          Net "enp2s0f0"
        PCI 02:00.1 (Ethernet)
          Net "enp2s0f1"
      PCI 00:11.0 (RAID)
        Block(Disk) "sdb"
        Block(Disk) "sdc"
        Block(Disk) "sda"
      PCI 00:14.1 (IDE)
      PCIBridge
        PCI 01:04.0 (VGA)
    L3 L#1 (5118KB)
      NUMANode L#1 (P#1 8063MB)
      L2 L#6 (512KB) + L1d L#6 (64KB) + L1i L#6 (64KB) + Core L#6 + PU L#6 (P#6)
      L2 L#7 (512KB) + L1d L#7 (64KB) + L1i L#7 (64KB) + Core L#7 + PU L#7 (P#7)
      L2 L#8 (512KB) + L1d L#8 (64KB) + L1i L#8 (64KB) + Core L#8 + PU L#8 (P#8)
      L2 L#9 (512KB) + L1d L#9 (64KB) + L1i L#9 (64KB) + Core L#9 + PU L#9 (P#9)
      L2 L#10 (512KB) + L1d L#10 (64KB) + L1i L#10 (64KB) + Core L#10 + PU L#10 (P#10)
      L2 L#11 (512KB) + L1d L#11 (64KB) + L1i L#11 (64KB) + Core L#11 + PU L#11 (P#11)
  Package L#1
    L3 L#2 (5118KB)
      NUMANode L#2 (P#2 8063MB)
      L2 L#12 (512KB) + L1d L#12 (64KB) + L1i L#12 (64KB) + Core L#12 + PU L#12 (P#12)
      L2 L#13 (512KB) + L1d L#13 (64KB) + L1i L#13 (64KB) + Core L#13 + PU L#13 (P#13)
      L2 L#14 (512KB) + L1d L#14 (64KB) + L1i L#14 (64KB) + Core L#14 + PU L#14 (P#14)
      L2 L#15 (512KB) + L1d L#15 (64KB) + L1i L#15 (64KB) + Core L#15 + PU L#15 (P#15)
      L2 L#16 (512KB) + L1d L#16 (64KB) + L1i L#16 (64KB) + Core L#16 + PU L#16 (P#16)
      L2 L#17 (512KB) + L1d L#17 (64KB) + L1i L#17 (64KB) + Core L#17 + PU L#17 (P#17)
    L3 L#3 (5118KB)
      NUMANode L#3 (P#3 8063MB)
      L2 L#18 (512KB) + L1d L#18 (64KB) + L1i L#18 (64KB) + Core L#18 + PU L#18 (P#18)
      L2 L#19 (512KB) + L1d L#19 (64KB) + L1i L#19 (64KB) + Core L#19 + PU L#19 (P#19)
      L2 L#20 (512KB) + L1d L#20 (64KB) + L1i L#20 (64KB) + Core L#20 + PU L#20 (P#20)
      L2 L#21 (512KB) + L1d L#21 (64KB) + L1i L#21 (64KB) + Core L#21 + PU L#21 (P#21)
      L2 L#22 (512KB) + L1d L#22 (64KB) + L1i L#22 (64KB) + Core L#22 + PU L#22 (P#22)
      L2 L#23 (512KB) + L1d L#23 (64KB) + L1i L#23 (64KB) + Core L#23 + PU L#23 (P#23)
  Package L#2
    L3 L#4 (5118KB)
      NUMANode L#4 (P#4 8063MB)
      L2 L#24 (512KB) + L1d L#24 (64KB) + L1i L#24 (64KB) + Core L#24 + PU L#24 (P#24)
      L2 L#25 (512KB) + L1d L#25 (64KB) + L1i L#25 (64KB) + Core L#25 + PU L#25 (P#25)
      L2 L#26 (512KB) + L1d L#26 (64KB) + L1i L#26 (64KB) + Core L#26 + PU L#26 (P#26)
      L2 L#27 (512KB) + L1d L#27 (64KB) + L1i L#27 (64KB) + Core L#27 + PU L#27 (P#27)
      L2 L#28 (512KB) + L1d L#28 (64KB) + L1i L#28 (64KB) + Core L#28 + PU L#28 (P#28)
      L2 L#29 (512KB) + L1d L#29 (64KB) + L1i L#29 (64KB) + Core L#29 + PU L#29 (P#29)
    L3 L#5 (5118KB)
      NUMANode L#5 (P#5 8063MB)
      L2 L#30 (512KB) + L1d L#30 (64KB) + L1i L#30 (64KB) + Core L#30 + PU L#30 (P#30)
      L2 L#31 (512KB) + L1d L#31 (64KB) + L1i L#31 (64KB) + Core L#31 + PU L#31 (P#31)
      L2 L#32 (512KB) + L1d L#32 (64KB) + L1i L#32 (64KB) + Core L#32 + PU L#32 (P#32)
      L2 L#33 (512KB) + L1d L#33 (64KB) + L1i L#33 (64KB) + Core L#33 + PU L#33 (P#33)
      L2 L#34 (512KB) + L1d L#34 (64KB) + L1i L#34 (64KB) + Core L#34 + PU L#34 (P#34)
      L2 L#35 (512KB) + L1d L#35 (64KB) + L1i L#35 (64KB) + Core L#35 + PU L#35 (P#35)
  Package L#3
    L3 L#6 (5118KB)
      NUMANode L#6 (P#6 8063MB)
      L2 L#36 (512KB) + L1d L#36 (64KB) + L1i L#36 (64KB) + Core L#36 + PU L#36 (P#36)
      L2 L#37 (512KB) + L1d L#37 (64KB) + L1i L#37 (64KB) + Core L#37 + PU L#37 (P#37)
      L2 L#38 (512KB) + L1d L#38 (64KB) + L1i L#38 (64KB) + Core L#38 + PU L#38 (P#38)
      L2 L#39 (512KB) + L1d L#39 (64KB) + L1i L#39 (64KB) + Core L#39 + PU L#39 (P#39)
      L2 L#40 (512KB) + L1d L#40 (64KB) + L1i L#40 (64KB) + Core L#40 + PU L#40 (P#40)
      L2 L#41 (512KB) + L1d L#41 (64KB) + L1i L#41 (64KB) + Core L#41 + PU L#41 (P#41)
    L3 L#7 (5118KB)
      NUMANode L#7 (P#7 8062MB)
      L2 L#42 (512KB) + L1d L#42 (64KB) + L1i L#42 (64KB) + Core L#42 + PU L#42 (P#42)
      L2 L#43 (512KB) + L1d L#43 (64KB) + L1i L#43 (64KB) + Core L#43 + PU L#43 (P#43)
      L2 L#44 (512KB) + L1d L#44 (64KB) + L1i L#44 (64KB) + Core L#44 + PU L#44 (P#44)
      L2 L#45 (512KB) + L1d L#45 (64KB) + L1i L#45 (64KB) + Core L#45 + PU L#45 (P#45)
      L2 L#46 (512KB) + L1d L#46 (64KB) + L1i L#46 (64KB) + Core L#46 + PU L#46 (P#46)
      L2 L#47 (512KB) + L1d L#47 (64KB) + L1i L#47 (64KB) + Core L#47 + PU L#47 (P#47)

openmpi $ numactl -H
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5
node 0 size: 7971 MB
node 0 free: 6858 MB
node 1 cpus: 6 7 8 9 10 11
node 1 size: 8062 MB
node 1 free: 6860 MB
node 2 cpus: 12 13 14 15 16 17
node 2 size: 8062 MB
node 2 free: 6979 MB
node 3 cpus: 18 19 20 21 22 23
node 3 size: 8062 MB
node 3 free: 7132 MB
node 4 cpus: 24 25 26 27 28 29
node 4 size: 8062 MB
node 4 free: 6276 MB
node 5 cpus: 30 31 32 33 34 35
node 5 size: 8062 MB
node 5 free: 7190 MB
node 6 cpus: 36 37 38 39 40 41
node 6 size: 8062 MB
node 6 free: 7059 MB
node 7 cpus: 42 43 44 45 46 47
node 7 size: 8061 MB
node 7 free: 7075 MB
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  16  16  22  16  22  16  22
  1:  16  10  22  16  22  16  22  16
  2:  16  22  10  16  16  22  16  22
  3:  22  16  16  10  22  16  22  16
  4:  16  22  16  22  10  16  16  22
  5:  22  16  22  16  16  10  22  16
  6:  16  22  16  22  16  22  10  16
  7:  22  16  22  16  22  16  16  10

I compiled openmpi 4.0.4, but some bugs probably alter the behavior of mpirun, and therefore I'm asking you
suggestions on how to properly run parallel code on my system. I'm using Gentoo Linux.

Questions:

1) If I recompile openmpi with a different version, should I also recompile all my openmpi programs? I'm a little confused, since I tried to downgrade to 4.0.1 but new random errors popped up.

2) I have many jobs to run, and each of them can use from 1 to N CPUs (all MPI: for now I tend to avoid OpenMP). Ideally I would like to run the same simple mpirun command for each job, so that mpirun distributes jobs and processes automatically, taking into account the NUMA architecture (8 NUMA nodes, 6 CPUs per node). I tried --map-by numa in 4.0.1 (--map-by numa gives errors in 4.0.4), but many processes ran at only 10-30% CPU. Then I switched back to 4.0.4 (the version I used to compile my programs), and the only effective option I found was "mpirun --bind-to none". This way all CPUs run at 100%, but I lose efficiency due to NUMA. What is the correct way (if one exists) to bind (almost) all the processes of the same job to a single NUMA node? I'm not looking for a perfect solution, but I would at least like the first 8 parallel jobs distributed across the 8 NUMA nodes.

I hope I made myself clear! Thanks for your patience, and sorry for the long letter,

Carlo

--
------------------------------------------------------------
Prof. Carlo Nervi          carlo.ne...@unito.it
Tel: +39 0116707507/8   Fax: +39 0116707855
Dipartimento di Chimica, via P. Giuria 7, 10125 Torino, Italy.
http://lem.ch.unito.it/
ICCC 2020, 5-10 July 2020, Rimini, Italy: http://www.iccc2020.com
International Conference on Coordination Chemistry (ICCC 2020)
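P.S. For completeness, the invocations I experimented with looked roughly like the sketch below. The program name and process count are placeholders, not my actual job script:

```shell
# Under Open MPI 4.0.1: mapping by NUMA node (this option errors out under my
# 4.0.4 build, and here many processes sat at 10-30% CPU).
# "./myprog" and "-np 6" are illustrative placeholders.
mpirun -np 6 --map-by numa ./myprog

# Under Open MPI 4.0.4: the only variant that kept all CPUs at 100%,
# at the cost of losing NUMA locality.
mpirun -np 6 --bind-to none ./myprog
```

If there is a recommended combination of --map-by / --bind-to options for a machine like this, I would be glad to know it.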