Hello Lucas,

On Tue, Sep 13, 2022 at 14:23, Lucas Chaloyard via users
<users@lists.open-mpi.org> wrote:
> Hello,
>
> I'm working as a research intern in a lab where we're studying
> virtualization, and I've been working with several benchmarks using
> Open MPI 4.1.0 (ASKAP, GPAW and Incompact3d from the Phoronix Test
> Suite).
>
> To briefly explain my experiments: I'm running those benchmarks on
> several virtual machines using different topologies. During one
> experiment I compared these two:
> - Topology1: 96 vCPUs divided into 96 sockets, each containing 1 thread
> - Topology2: 96 vCPUs divided into 48 sockets, each containing 2
>   threads (using hyperthreading)
>
> For the ASKAP benchmark:
> - While using Topology2, 2306 processes are created by the application
>   to do its work.
> - While using Topology1, 4612 processes are created by the application
>   to do its work.
> The same thing happens when running the GPAW and Incompact3d
> benchmarks.
>
> What I've been wondering (and looking for) is: does Open MPI take the
> topology into account and reduce the number of processes it creates in
> order to avoid using hyperthreading? Or is that something done by the
> application itself?
>
> I was looking at the source code, trying to find how and when the
> information about the MPI_COMM_WORLD communicator is filled in, to see
> whether the 'num_procs' field depends on the topology, but I haven't
> had any luck so far.
>
> Respectfully, Chaloyard Lucas.

I would like to add that the VMM (Virtual Machine Monitor) may not
completely expose the physical topology to a guest, and this varies from
one hypervisor to another. Thus, the VM topology may never match the
physical topology, and I am not even sure you can tweak the VMM to make
them match perfectly. There was an interesting talk about this at the
KVM Forum a few years ago; you can watch it at
https://youtu.be/hHPuEF7qP_Q.

That said, I am experimenting with running MPI applications in a
unikernel. The unikernel is deployed in a single VM with the same number
of vCPUs as the host. In this deployment, I am using one thread per vCPU
and the communication goes over shared memory, i.e., virtio. The
deployment aims at leveraging the NUMA topology by using dynamic memory
that is allocated per core; in other words, threads allocate only local
memory. I have not been able to benchmark this deployment yet, but I
will do so soon.

Matias
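PS: To double-check what topology the guest actually exposes, hwloc (the
library Open MPI itself uses for topology discovery) is handy: lstopo
prints the whole tree, and the minimal sketch below just counts
packages, cores and hardware threads. Compile with
cc check_topo.c -lhwloc (the file name is just an example).

    #include <stdio.h>
    #include <hwloc.h>

    int main(void)
    {
        hwloc_topology_t topo;

        /* Discover the topology the OS (here: the guest) reports. */
        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);

        printf("packages: %d\n",
               hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PACKAGE));
        printf("cores:    %d\n",
               hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE));
        printf("PUs:      %d\n",
               hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU));

        hwloc_topology_destroy(topo);
        return 0;
    }

If the counts differ from what you configured in the VMM, that is the
mismatch I mentioned above.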
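PPS: Regarding 'num_procs': the size of MPI_COMM_WORLD is fixed when the
processes are launched; the library does not shrink it afterwards based
on the topology. Without an explicit -np, mpirun fills the available
slots, and by default Open MPI counts one slot per core rather than per
hardware thread (you can make it count hardware threads with
--use-hwthread-cpus), which might explain the factor of two you are
seeing between your two topologies. A minimal sketch to check what a
given launch actually gives you (again, the file name is just an
example):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int size, rank;
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* the 'num_procs' you saw */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0)
            printf("MPI_COMM_WORLD contains %d processes\n", size);

        MPI_Finalize();
        return 0;
    }

Build it with mpicc size.c -o size, then compare what 'mpirun ./size'
and 'mpirun --use-hwthread-cpus ./size' report under both topologies.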