Hello Lucas,

On Tue, Sep 13, 2022 at 14:23, Lucas Chaloyard via users
<users@lists.open-mpi.org> wrote:

> Hello,
>
> I'm working as a research intern in a lab where we're studying
> virtualization.
> And I've been working with several benchmarks using OpenMPI 4.1.0 (ASKAP,
> GPAW and Incompact3d from the Phoronix Test Suite).
>
> To briefly explain my experiments, I'm running those benchmarks on several
> virtual machines using different topologies.
> During one experiment I've been comparing these two topologies:
> - Topology1: 96 vCPUs divided into 96 sockets containing 1 thread each
> - Topology2: 96 vCPUs divided into 48 sockets containing 2 threads each
> (using hyperthreading)
>
> For the ASKAP Benchmark:
> - While using Topology2, 2306 processes will be created by the application
> to do its work.
> - While using Topology1, 4612 processes will be created by the application
> to do its work.
> This is also happening when running GPAW and Incompact3d benchmarks.
>
> What I've been wondering (and looking for) is: does OpenMPI take the
> topology into account and reduce the number of processes created to
> execute its work, in order to avoid using hyperthreading?
> Or is it something done by the application itself?
>
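
Regarding the question about the number of processes: one quick way to
separate what the launcher starts from what the benchmark spawns on its
own is to run a trivial MPI program with the exact same mpirun invocation
as the benchmark and compare the size of MPI_COMM_WORLD it reports under
both topologies. A minimal sketch (the file name and build line are just
illustrative):

  /* world_size.c - print the size of MPI_COMM_WORLD and where each rank
   * runs. Build: mpicc world_size.c -o world_size
   * Run it with the same mpirun options as the benchmark. */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, size, len;
      char host[MPI_MAX_PROCESSOR_NAME];

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      MPI_Get_processor_name(host, &len);

      if (rank == 0)
          printf("MPI_COMM_WORLD size: %d\n", size);
      printf("rank %d runs on %s\n", rank, host);

      MPI_Finalize();
      return 0;
  }

If this already reports twice as many ranks under Topology1, the difference
comes from how mpirun counts slots (as far as I remember, it counts one
slot per core by default, and --use-hwthread-cpus makes it count one slot
per hardware thread); if the counts match, then the benchmark itself is
deciding how many processes to create.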

I would like to add that the VMM (Virtual Machine Monitor) may not expose
the physical topology completely to a guest, and this varies from one
hypervisor to another. As a result, the VM topology may never match the
physical topology, and I am not even sure the VMM can be tweaked to make
the two match exactly. There was an interesting talk about this at the KVM
Forum a few years ago; you can watch it at https://youtu.be/hHPuEF7qP_Q.
The small hwloc sketch below shows one way to check what topology the
guest actually sees.

That said, I am experimenting with running MPI applications on a
unikernel. The unikernel is deployed in a single VM with the same number
of vCPUs as the host, with one thread per vCPU, and communication goes
over shared memory (virtio). The deployment aims at exploiting the NUMA
topology by allocating dynamic memory per core, so that threads only
allocate local memory. I have not been able to benchmark this deployment
yet, but I will do so soon.
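
Here is the hwloc sketch I mentioned (lstopo from hwloc, which Open MPI
uses internally, shows the same information; the file name and build line
are just illustrative, and it assumes hwloc 2.x):

  /* vm_topo.c - print the topology visible inside the VM as hwloc sees it.
   * Build: gcc vm_topo.c -o vm_topo -lhwloc */
  #include <hwloc.h>
  #include <stdio.h>

  int main(void)
  {
      hwloc_topology_t topo;

      hwloc_topology_init(&topo);
      hwloc_topology_load(topo);

      printf("packages:   %d\n",
             hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PACKAGE));
      printf("cores:      %d\n",
             hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE));
      printf("PUs:        %d\n",
             hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU));
      printf("NUMA nodes: %d\n",
             hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_NUMANODE));

      hwloc_topology_destroy(topo);
      return 0;
  }

Running it (or just lstopo) inside the guest under Topology1 and Topology2
shows exactly how many packages, cores and hardware threads the hypervisor
exposes, which is the topology Open MPI's mapping logic will work with.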

Matias



> I was looking at the source code, trying to find how and when the
> information about the MPI_COMM_WORLD communicator is filled in, to see
> whether the 'num_procs' field depends on the topology, but I haven't had
> any luck so far.
>
> Respectfully, Chaloyard Lucas.
>
