Thank you Ralph for the suggestion! I will consider it carefully, although I'm a chemist and not a sysadmin (I really miss having a dedicated sysadmin in our Department!).
Carlo
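[Editorial aside: the sticking point in the quoted thread below is dynamically managing `--cpu-list` across independent invocations of the same job script. A minimal sketch of one way to do that, using a state file as a running CPU-index counter; the file name, helper name, and the fixed total of 48 CPUs are illustrative assumptions, not anything from the thread.]

```shell
#!/bin/sh
# Hypothetical allocator: hand each new job the next free block of CPUs
# so successive mpirun invocations do not overlap.
STATE=${STATE:-/tmp/next_cpu}   # persists the index of the next free CPU
TOTAL=48                        # CPUs on the machine described below

next_cpu_list() {               # $1 = number of CPUs the job needs
    n=$1
    start=$(cat "$STATE" 2>/dev/null || echo 0)
    end=$((start + n - 1))
    if [ "$end" -ge "$TOTAL" ]; then
        echo "no free CPUs" >&2
        return 1
    fi
    echo $((start + n)) > "$STATE"   # remember where the next job starts
    seq -s, "$start" "$end"          # e.g. "0,1,2,3,4,5"
}

# Each job script would then do something like:
#   CPUS=$(next_cpu_list 6) || exit 1
#   mpirun -n 6 --cpu-list "$CPUS" --bind-to cpu-list:ordered $app
```

A real version would also need locking (e.g. `flock` around the read/write of the state file) and a way to return CPUs to the pool when jobs finish, which is exactly the bookkeeping a resource manager such as PRRTE does for you.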
On Thu, 20 Aug 2020 at 18:45, Ralph Castain via users <users@lists.open-mpi.org> wrote:

> Your use-case sounds more like a workflow than an application - in which
> case, you probably should be using PRRTE to execute it instead of "mpirun",
> as PRRTE will "remember" the multiple jobs and avoid the overload scenario
> you describe.
>
> This link will walk you through how to get and build it:
> https://openpmix.github.io/code/getting-the-pmix-reference-server
>
> This link will provide some directions on how to use it:
> https://openpmix.github.io/support/how-to/running-apps-under-psrvr
>
> You would need to update your script to use "prun --personality ompi"
> instead of "mpirun", but things should otherwise be the same.
>
> On Aug 20, 2020, at 9:01 AM, Carlo Nervi via users <users@lists.open-mpi.org> wrote:
>
> Thank you, Christoph. I did not consider --cpu-list.
> However, this works if I have a single script that launches several
> jobs (note that each job may need a different number of CPUs). In my
> case I have one script (which calls mpirun) that is invoked many times:
> it is called periodically until all 48 CPUs are busy, and again whenever
> other jobs finish...
>
> If the script is always the same, I'm afraid that all the jobs after the
> first will be bound to the same cpu-list. Am I wrong? Will
>
> mpirun -n xx --cpu-list "$(seq -s, 0 47)" --bind-to cpu-list:ordered $app
>
> (with xx < 48) bind all the processes to the first 6 CPUs?
> I need a way to manage the cpu-list dynamically, but I was hoping that
> mpirun could handle this itself.
> If not, I'm afraid the simplest solution is to use --bind-to none.
> Thank you,
> Carlo
>
> On Thu, 20 Aug 2020 at 17:24, Christoph Niethammer <nietham...@hlrs.de> wrote:
>
>> Hello Carlo,
>>
>> If you execute multiple mpirun commands, they will not know about each
>> other's resource bindings. E.g. if you bind to cores, each mpirun will
>> start assigning from the same first core again. This then results in
>> oversubscription of the cores, which slows down your programs - as you
>> noticed.
>>
>> You can use "--cpu-list" together with "--bind-to cpu-list:ordered".
>> So if you start all your simulations in a single script, this would look
>> like:
>>
>> mpirun -n 6 --cpu-list "$(seq -s, 0 5)" --bind-to cpu-list:ordered $app
>> mpirun -n 6 --cpu-list "$(seq -s, 6 11)" --bind-to cpu-list:ordered $app
>> ...
>> mpirun -n 6 --cpu-list "$(seq -s, 42 47)" --bind-to cpu-list:ordered $app
>>
>> Best
>> Christoph
>>
>> ----- Original Message -----
>> From: "Open MPI Users" <users@lists.open-mpi.org>
>> To: "Open MPI Users" <users@lists.open-mpi.org>
>> Cc: "Carlo Nervi" <carlo.ne...@unito.it>
>> Sent: Thursday, 20 August, 2020 12:17:21
>> Subject: [OMPI users] OMPI 4.0.4 how to use mpirun properly in numa architecture
>>
>> Dear OMPI community,
>> I'm a simple end-user with no particular experience.
>> I compile quantum chemistry programs and use them in parallel.
>>
>> My system is a 4-socket, 12-core-per-socket Opteron 6168 system, for a
>> total of 48 cores and 64 GB of RAM.
>> It has 8 NUMA nodes:
>>
>> openmpi $ hwloc-info
>> depth 0: 1 Machine (type #0)
>> depth 1: 4 Package (type #1)
>> depth 2: 8 L3Cache (type #6)
>> depth 3: 48 L2Cache (type #5)
>> depth 4: 48 L1dCache (type #4)
>> depth 5: 48 L1iCache (type #9)
>> depth 6: 48 Core (type #2)
>> depth 7: 48 PU (type #3)
>> Special depth -3: 8 NUMANode (type #13)
>> Special depth -4: 3 Bridge (type #14)
>> Special depth -5: 5 PCIDev (type #15)
>> Special depth -6: 5 OSDev (type #16)
>>
>> lstopo:
>>
>> openmpi $ lstopo
>> Machine (63GB total)
>>   Package L#0
>>     L3 L#0 (5118KB)
>>       NUMANode L#0 (P#0 7971MB)
>>       L2 L#0 (512KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 + PU L#0 (P#0)
>>       L2 L#1 (512KB) + L1d L#1 (64KB) + L1i L#1 (64KB) + Core L#1 + PU L#1 (P#1)
>>       L2 L#2 (512KB) + L1d L#2 (64KB) + L1i L#2 (64KB) + Core L#2 + PU L#2 (P#2)
>>       L2 L#3 (512KB) + L1d L#3 (64KB) + L1i L#3 (64KB) + Core L#3 + PU L#3 (P#3)
>>       L2 L#4 (512KB) + L1d L#4 (64KB) + L1i L#4 (64KB) + Core L#4 + PU L#4 (P#4)
>>       L2 L#5 (512KB) + L1d L#5 (64KB) + L1i L#5 (64KB) + Core L#5 + PU L#5 (P#5)
>>     HostBridge
>>       PCIBridge
>>         PCI 02:00.0 (Ethernet)
>>           Net "enp2s0f0"
>>         PCI 02:00.1 (Ethernet)
>>           Net "enp2s0f1"
>>       PCI 00:11.0 (RAID)
>>         Block(Disk) "sdb"
>>         Block(Disk) "sdc"
>>         Block(Disk) "sda"
>>       PCI 00:14.1 (IDE)
>>       PCIBridge
>>         PCI 01:04.0 (VGA)
>>     L3 L#1 (5118KB)
>>       NUMANode L#1 (P#1 8063MB)
>>       L2 L#6 (512KB) + L1d L#6 (64KB) + L1i L#6 (64KB) + Core L#6 + PU L#6 (P#6)
>>       L2 L#7 (512KB) + L1d L#7 (64KB) + L1i L#7 (64KB) + Core L#7 + PU L#7 (P#7)
>>       L2 L#8 (512KB) + L1d L#8 (64KB) + L1i L#8 (64KB) + Core L#8 + PU L#8 (P#8)
>>       L2 L#9 (512KB) + L1d L#9 (64KB) + L1i L#9 (64KB) + Core L#9 + PU L#9 (P#9)
>>       L2 L#10 (512KB) + L1d L#10 (64KB) + L1i L#10 (64KB) + Core L#10 + PU L#10 (P#10)
>>       L2 L#11 (512KB) + L1d L#11 (64KB) + L1i L#11 (64KB) + Core L#11 + PU L#11 (P#11)
>>   Package L#1
>>     L3 L#2 (5118KB)
>>       NUMANode L#2 (P#2 8063MB)
>>       L2 L#12 (512KB) + L1d L#12 (64KB) + L1i L#12 (64KB) + Core L#12 + PU L#12 (P#12)
>>       L2 L#13 (512KB) + L1d L#13 (64KB) + L1i L#13 (64KB) + Core L#13 + PU L#13 (P#13)
>>       L2 L#14 (512KB) + L1d L#14 (64KB) + L1i L#14 (64KB) + Core L#14 + PU L#14 (P#14)
>>       L2 L#15 (512KB) + L1d L#15 (64KB) + L1i L#15 (64KB) + Core L#15 + PU L#15 (P#15)
>>       L2 L#16 (512KB) + L1d L#16 (64KB) + L1i L#16 (64KB) + Core L#16 + PU L#16 (P#16)
>>       L2 L#17 (512KB) + L1d L#17 (64KB) + L1i L#17 (64KB) + Core L#17 + PU L#17 (P#17)
>>     L3 L#3 (5118KB)
>>       NUMANode L#3 (P#3 8063MB)
>>       L2 L#18 (512KB) + L1d L#18 (64KB) + L1i L#18 (64KB) + Core L#18 + PU L#18 (P#18)
>>       L2 L#19 (512KB) + L1d L#19 (64KB) + L1i L#19 (64KB) + Core L#19 + PU L#19 (P#19)
>>       L2 L#20 (512KB) + L1d L#20 (64KB) + L1i L#20 (64KB) + Core L#20 + PU L#20 (P#20)
>>       L2 L#21 (512KB) + L1d L#21 (64KB) + L1i L#21 (64KB) + Core L#21 + PU L#21 (P#21)
>>       L2 L#22 (512KB) + L1d L#22 (64KB) + L1i L#22 (64KB) + Core L#22 + PU L#22 (P#22)
>>       L2 L#23 (512KB) + L1d L#23 (64KB) + L1i L#23 (64KB) + Core L#23 + PU L#23 (P#23)
>>   Package L#2
>>     L3 L#4 (5118KB)
>>       NUMANode L#4 (P#4 8063MB)
>>       L2 L#24 (512KB) + L1d L#24 (64KB) + L1i L#24 (64KB) + Core L#24 + PU L#24 (P#24)
>>       L2 L#25 (512KB) + L1d L#25 (64KB) + L1i L#25 (64KB) + Core L#25 + PU L#25 (P#25)
>>       L2 L#26 (512KB) + L1d L#26 (64KB) + L1i L#26 (64KB) + Core L#26 + PU L#26 (P#26)
>>       L2 L#27 (512KB) + L1d L#27 (64KB) + L1i L#27 (64KB) + Core L#27 + PU L#27 (P#27)
>>       L2 L#28 (512KB) + L1d L#28 (64KB) + L1i L#28 (64KB) + Core L#28 + PU L#28 (P#28)
>>       L2 L#29 (512KB) + L1d L#29 (64KB) + L1i L#29 (64KB) + Core L#29 + PU L#29 (P#29)
>>     L3 L#5 (5118KB)
>>       NUMANode L#5 (P#5 8063MB)
>>       L2 L#30 (512KB) + L1d L#30 (64KB) + L1i L#30 (64KB) + Core L#30 + PU L#30 (P#30)
>>       L2 L#31 (512KB) + L1d L#31 (64KB) + L1i L#31 (64KB) + Core L#31 + PU L#31 (P#31)
>>       L2 L#32 (512KB) + L1d L#32 (64KB) + L1i L#32 (64KB) + Core L#32 + PU L#32 (P#32)
>>       L2 L#33 (512KB) + L1d L#33 (64KB) + L1i L#33 (64KB) + Core L#33 + PU L#33 (P#33)
>>       L2 L#34 (512KB) + L1d L#34 (64KB) + L1i L#34 (64KB) + Core L#34 + PU L#34 (P#34)
>>       L2 L#35 (512KB) + L1d L#35 (64KB) + L1i L#35 (64KB) + Core L#35 + PU L#35 (P#35)
>>   Package L#3
>>     L3 L#6 (5118KB)
>>       NUMANode L#6 (P#6 8063MB)
>>       L2 L#36 (512KB) + L1d L#36 (64KB) + L1i L#36 (64KB) + Core L#36 + PU L#36 (P#36)
>>       L2 L#37 (512KB) + L1d L#37 (64KB) + L1i L#37 (64KB) + Core L#37 + PU L#37 (P#37)
>>       L2 L#38 (512KB) + L1d L#38 (64KB) + L1i L#38 (64KB) + Core L#38 + PU L#38 (P#38)
>>       L2 L#39 (512KB) + L1d L#39 (64KB) + L1i L#39 (64KB) + Core L#39 + PU L#39 (P#39)
>>       L2 L#40 (512KB) + L1d L#40 (64KB) + L1i L#40 (64KB) + Core L#40 + PU L#40 (P#40)
>>       L2 L#41 (512KB) + L1d L#41 (64KB) + L1i L#41 (64KB) + Core L#41 + PU L#41 (P#41)
>>     L3 L#7 (5118KB)
>>       NUMANode L#7 (P#7 8062MB)
>>       L2 L#42 (512KB) + L1d L#42 (64KB) + L1i L#42 (64KB) + Core L#42 + PU L#42 (P#42)
>>       L2 L#43 (512KB) + L1d L#43 (64KB) + L1i L#43 (64KB) + Core L#43 + PU L#43 (P#43)
>>       L2 L#44 (512KB) + L1d L#44 (64KB) + L1i L#44 (64KB) + Core L#44 + PU L#44 (P#44)
>>       L2 L#45 (512KB) + L1d L#45 (64KB) + L1i L#45 (64KB) + Core L#45 + PU L#45 (P#45)
>>       L2 L#46 (512KB) + L1d L#46 (64KB) + L1i L#46 (64KB) + Core L#46 + PU L#46 (P#46)
>>       L2 L#47 (512KB) + L1d L#47 (64KB) + L1i L#47 (64KB) + Core L#47 + PU L#47 (P#47)
>>
>> openmpi $ numactl -H
>> available: 8 nodes (0-7)
>> node 0 cpus: 0 1 2 3 4 5
>> node 0 size: 7971 MB
>> node 0 free: 6858 MB
>> node 1 cpus: 6 7 8 9 10 11
>> node 1 size: 8062 MB
>> node 1 free: 6860 MB
>> node 2 cpus: 12 13 14 15 16 17
>> node 2 size: 8062 MB
>> node 2 free: 6979 MB
>> node 3 cpus: 18 19 20 21 22 23
>> node 3 size: 8062 MB
>> node 3 free: 7132 MB
>> node 4 cpus: 24 25 26 27 28 29
>> node 4 size: 8062 MB
>> node 4 free: 6276 MB
>> node 5 cpus: 30 31 32 33 34 35
>> node 5 size: 8062 MB
>> node 5 free: 7190 MB
>> node 6 cpus: 36 37 38 39 40 41
>> node 6 size: 8062 MB
>> node 6 free: 7059 MB
>> node 7 cpus: 42 43 44 45 46 47
>> node 7 size: 8061 MB
>> node 7 free: 7075 MB
>> node distances:
>> node   0   1   2   3   4   5   6   7
>>   0:  10  16  16  22  16  22  16  22
>>   1:  16  10  22  16  22  16  22  16
>>   2:  16  22  10  16  16  22  16  22
>>   3:  22  16  16  10  22  16  22  16
>>   4:  16  22  16  22  10  16  16  22
>>   5:  22  16  22  16  16  10  22  16
>>   6:  16  22  16  22  16  22  10  16
>>   7:  22  16  22  16  22  16  16  10
>>
>> I compiled Open MPI 4.0.4, but some bugs probably alter the behavior of
>> mpirun, so I'm asking for your suggestions on how to run parallel code
>> properly on my system. I'm using Gentoo Linux.
>> Questions:
>>
>> 1) If I recompile Open MPI and change the version, should I also
>> recompile all my Open MPI programs? I'm a little confused, since I tried
>> to downgrade to 4.0.1 but new random errors popped up.
>>
>> 2) I have many jobs to run, each of which can use from 1 to N CPUs (all
>> MPI; for now I tend to avoid OpenMP). Ideally I would like to run the
>> same simple mpirun command for each job, so that mpirun distributes jobs
>> and processes automatically, taking into account the NUMA architecture
>> (8 NUMA nodes, 6 CPUs per node).
>> I tried to use --map-by numa in 4.0.1 (--map-by numa gives errors in
>> 4.0.4), but many processes ran at only 10-30% CPU.
>> Then I switched back to 4.0.4 (the version I used to compile my
>> programs), and the only effective way I found is to use "mpirun
>> --bind-to none". This way all CPUs run at 100%, but I lose efficiency
>> due to NUMA.
>>
>> What is the correct way (if one exists) to bind (almost) all the
>> processes of the same job to a single NUMA node? I'm not looking for a
>> perfect solution, but I'd like at least the first 8 parallel jobs to be
>> distributed across the 8 NUMA nodes.
>>
>> I hope I made myself clear!
>> Thanks for your patience, and sorry for the long letter,
>> Carlo
>>
>> --
>> ------------------------------------------------------------
>> Prof. Carlo Nervi carlo.ne...@unito.it Tel: +39 0116707507/8
>> Fax: +39 0116707855 - Dipartimento di Chimica, via
>> P. Giuria 7, 10125 Torino, Italy. http://lem.ch.unito.it/
>>
>> ICCC 2020 5-10 July 2020, Rimini, Italy: http://www.iccc2020.com
>> International Conference on Coordination Chemistry (ICCC 2020)
>> <http://www.iccc2020.com/>
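[Editorial aside: the last question in the thread, confining each whole job to a single NUMA node, can also be handled outside mpirun by wrapping each job in numactl and picking the node round-robin. A hedged sketch follows; pick_node, launch_cmd, and the application name are made-up helpers, and the command line is printed rather than executed so the sketch runs even without numactl or mpirun installed.]

```shell
#!/bin/sh
# Hypothetical helper: choose which NUMA node job number $1 should use,
# assuming the 8 nodes of the machine described above.
pick_node() {
    echo $(($1 % 8))
}

# Illustrative launch line ($2 is a placeholder application path).
# numactl restricts both CPUs and memory to the chosen node, so mpirun
# itself needs no binding policy (--bind-to none).
launch_cmd() {                  # $1 = job number, $2 = app
    node=$(pick_node "$1")
    echo "numactl --cpunodebind=$node --membind=$node" \
         "mpirun -n 6 --bind-to none $2"
}

# A job script would then run:  eval "$(launch_cmd "$JOBNUM" ./myapp)"
```

This sidesteps the version-dependent --map-by numa behavior mentioned in the thread: the kernel cpuset set up by numactl confines all ranks of one job to one node, while successive jobs land on successive nodes.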