On Tue, Sep 21, 2021 at 16:50:31 +0200, Michal Privoznik wrote:
> QEMU is trying to obsolete -numa node,cpus= because that uses
> ambiguous vCPU id to [socket, die, core, thread] mapping. The new
> form is:
> 
>   -numa cpu,node-id=N,socket-id=S,die-id=D,core-id=C,thread-id=T
> 
> which is repeated for every vCPU and places it at [S, D, C, T]
> into guest NUMA node N.
> 
> While in general this is magic mapping, we can deal with it.
> Firstly, with QEMU 2.7 or newer, libvirt ensures that if topology
> is given then maxvcpus must be sockets * dies * cores * threads
> (i.e. there are no 'holes').
> Secondly, if no topology is given then libvirt itself places each
> vCPU into a different socket (basically, it fakes topology of:
> [maxvcpus, 1, 1, 1])
> Thirdly, we can copy whatever QEMU is doing when mapping vCPUs
> onto topology, to make sure vCPUs don't start to move around.

There's a problem with this premise though and unfortunately we don't
seem to have qemuxml2argvtest for it.

On PPC64, in certain situations the CPU can be configured such that
threads are visible only to VMs. This has substantial impact on how CPUs
are configured using the modern parameters (until now used only for
cpu hotplug purposes, and that's the reason vCPU hotplug has such
complicated incantations when starting the VM).

In the above situation a CPU with topology of:
 sockets=1, cores=4, threads=8 (thus 32 cpus)

will only expose 4 CPU "devices".

 core-id: 0,  core-id: 8, core-id: 16 and core-id: 24

yet the guest will correctly see 32 cpus when used as such.

You can see this in:

tests/qemuhotplugtestcpus/ppc64-modern-individual-monitor.json

Also note that the 'props' object does _not_ have any socket-id, and
management apps are supposed to pass in 'props' as is. (There's a bunch
of code to do that on hotplug).

The problem is that you need to query the topology first (unless we want
to duplicate all of qemu code that has to do with topology state and
keep up with changes to it) to know how it's behaving on current
machine.  This historically was not possible. The supposed solution for
this was the pre-config state where we'd be able to query and set it up
via QMP, but I was not keeping up sufficiently with that work, so I
don't know if it's possible.

If preconfig is a viable option we IMO should start using it sooner
rather than later and avoid duplicating qemu's logic here.

> Note, migration from old to new cmd line works and therefore
> doesn't need any special handling.
> 
> Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1678085
> Signed-off-by: Michal Privoznik <mpriv...@redhat.com>
> ---
>  src/qemu/qemu_command.c                       | 112 +++++++++++++++++-
>  .../hugepages-nvdimm.x86_64-latest.args       |   4 +-
>  ...memory-default-hugepage.x86_64-latest.args |  10 +-
>  .../memfd-memory-numa.x86_64-latest.args      |  10 +-
>  ...y-hotplug-nvdimm-access.x86_64-latest.args |   4 +-
>  ...ory-hotplug-nvdimm-align.x86_64-5.2.0.args |   4 +-
>  ...ry-hotplug-nvdimm-align.x86_64-latest.args |   4 +-
>  ...ory-hotplug-nvdimm-label.x86_64-5.2.0.args |   4 +-
>  ...ry-hotplug-nvdimm-label.x86_64-latest.args |   4 +-
>  ...mory-hotplug-nvdimm-pmem.x86_64-5.2.0.args |   4 +-
>  ...ory-hotplug-nvdimm-pmem.x86_64-latest.args |   4 +-
>  ...-hotplug-nvdimm-readonly.x86_64-5.2.0.args |   4 +-
>  ...hotplug-nvdimm-readonly.x86_64-latest.args |   4 +-
>  .../memory-hotplug-nvdimm.x86_64-latest.args  |   4 +-
>  ...mory-hotplug-virtio-pmem.x86_64-5.2.0.args |   4 +-
>  ...ory-hotplug-virtio-pmem.x86_64-latest.args |   4 +-
>  .../numatune-hmat.x86_64-latest.args          |  18 ++-
>  ...emnode-restrictive-mode.x86_64-latest.args |  38 +++++-
>  .../numatune-memnode.x86_64-5.2.0.args        |  38 +++++-
>  .../numatune-memnode.x86_64-latest.args       |  38 +++++-
>  ...vhost-user-fs-fd-memory.x86_64-latest.args |   4 +-
>  ...vhost-user-fs-hugepages.x86_64-latest.args |   4 +-
>  ...host-user-gpu-secondary.x86_64-latest.args |   3 +-
>  .../vhost-user-vga.x86_64-latest.args         |   3 +-
>  24 files changed, 296 insertions(+), 34 deletions(-)
> 
> diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
> index f04ae1e311..5192bd7630 100644
> --- a/src/qemu/qemu_command.c
> +++ b/src/qemu/qemu_command.c

[...]

> @@ -7432,6 +7432,94 @@ qemuBuildNumaCPUs(virBuffer *buf,
>  }
>  
>  
> +/**
> + * qemuTranlsatevCPUID:
> + *
> + * For given vCPU @id and vCPU topology (@cpu) compute corresponding
> + * @socket, @die, @core and @thread). This assumes linear topology,
> + * that is every [socket, die, core, thread] combination is valid vCPU
> + * ID and there are no 'holes'. This is ensured by
> + * qemuValidateDomainDef() if QEMU_CAPS_QUERY_HOTPLUGGABLE_CPUS is
> + * set.

As noted above, this assumption does not hold on PPC64. There are indeed
"holes" in certain cases, while filled with cpus you are e.g. unable to
spread them across multiple numa nodes.

In fact allowing to have two sibling threads to be spread across
multiple NUMA nodes is also nonsensical configuration, which we allowed
unfortunately.

> + * Moreover, if @diesSupported is false (QEMU lacks
> + * QEMU_CAPS_SMP_DIES) then @die is set to zero and @socket is
> + * computed without taking number of dies into account.
> + *
> + * The algorithm is shamelessly copied over from QEMU's
> + * x86_topo_ids_from_idx() and its history (before introducing dies).
> + */
> +static void
> +qemuTranlsatevCPUID(unsigned int id,
> +                    bool diesSupported,
> +                    virCPUDef *cpu,
> +                    unsigned int *socket,
> +                    unsigned int *die,
> +                    unsigned int *core,
> +                    unsigned int *thread)
> +{
> +    if (cpu && cpu->sockets) {
> +        *thread = id % cpu->threads;
> +        *core = id / cpu->threads % cpu->cores;
> +        if (diesSupported) {
> +            *die = id / (cpu->cores * cpu->threads) % cpu->dies;
> +            *socket = id / (cpu->dies * cpu->cores * cpu->threads);
> +        } else {
> +            *die = 0;
> +            *socket = id / (cpu->cores * cpu->threads) % cpu->sockets;
> +        }
> +    } else {
> +        /* If no topology was provided, then qemuBuildSmpCommandLine()
> +         * puts all vCPUs into a separate socket. */
> +        *thread = 0;
> +        *core = 0;
> +        *die = 0;
> +        *socket = id;
> +    }
> +}
> +
> +
> +static void
> +qemuBuildNumaNewCPUs(virCommand *cmd,
> +                     virCPUDef *cpu,
> +                     virBitmap *cpumask,
> +                     size_t nodeid,
> +                     virQEMUCaps *qemuCaps)
> +{
> +    const bool diesSupported = virQEMUCapsGet(qemuCaps, QEMU_CAPS_SMP_DIES);
> +    ssize_t vcpuid = -1;
> +
> +    if (!cpumask)
> +        return;
> +
> +    while ((vcpuid = virBitmapNextSetBit(cpumask, vcpuid)) >= 0) {
> +        unsigned int socket;
> +        unsigned int die;
> +        unsigned int core;
> +        unsigned int thread;
> +
> +        qemuTranlsatevCPUID(vcpuid, diesSupported, cpu,
> +                            &socket, &die, &core, &thread);
> +
> +        virCommandAddArg(cmd, "-numa");
> +
> +        /* The simple fact that dies are supported by QEMU doesn't mean we 
> can
> +         * put it onto command line. QEMU will accept die-id only if -smp 
> dies
> +         * was set to a value greater than 1. On the other hand, this allows 
> us
> +         * to generate shorter command line. */
> +        if (diesSupported && cpu && cpu->dies > 1) {
> +            virCommandAddArgFormat(cmd,
> +                                   
> "cpu,node-id=%zu,socket-id=%u,die-id=%u,core-id=%u,thread-id=%u",
> +                                   nodeid, socket, die, core, thread);
> +        } else {
> +            virCommandAddArgFormat(cmd,
> +                                   
> "cpu,node-id=%zu,socket-id=%u,core-id=%u,thread-id=%u",
> +                                   nodeid, socket, core, thread);
> +        }
> +    }
> +}
> +
> +
>  static int
>  qemuBuildNumaCommandLine(virQEMUDriverConfig *cfg,
>                           virDomainDef *def,

[...]

> @@ -7484,6 +7573,17 @@ qemuBuildNumaCommandLine(virQEMUDriverConfig *cfg,
>          qemuBuildMemPathStr(def, cmd, priv) < 0)
>          goto cleanup;
>  
> +    /* Use modern style of specifying vCPU topology only if:
> +     * -numa cpu is available, introduced in the same time as -numa
> +     *           dist, hence slightly misleading capability test, and
> +     * query-hotpluggable-cpus is avialable, because then
> +     *                         qemuValidateDomainDef() ensures that if
> +     *                         topology is specified it matches max vCPU
> +     *                         count and we can make some shortcuts in
> +     *                         qemuTranlsatevCPUID().
> +     */
> +    newCpus = virQEMUCapsGet(qemuCaps, QEMU_CAPS_QUERY_HOTPLUGGABLE_CPUS);
> +
>      for (i = 0; i < ncells; i++) {
>          if (virDomainNumaGetNodeCpumask(def->numa, i)) {
>              masterInitiator = i;

[...]

> diff --git a/tests/qemuxml2argvdata/hugepages-nvdimm.x86_64-latest.args 
> b/tests/qemuxml2argvdata/hugepages-nvdimm.x86_64-latest.args
> index 9f3c6fa63f..8af4b44758 100644
> --- a/tests/qemuxml2argvdata/hugepages-nvdimm.x86_64-latest.args
> +++ b/tests/qemuxml2argvdata/hugepages-nvdimm.x86_64-latest.args

[...]

> diff --git a/tests/qemuxml2argvdata/vhost-user-vga.x86_64-latest.args 
> b/tests/qemuxml2argvdata/vhost-user-vga.x86_64-latest.args
> index 0bdd98d3b6..8c463496a1 100644
> --- a/tests/qemuxml2argvdata/vhost-user-vga.x86_64-latest.args
> +++ b/tests/qemuxml2argvdata/vhost-user-vga.x86_64-latest.args
> @@ -16,7 +16,8 @@ XDG_CONFIG_HOME=/tmp/lib/domain--1-QEMUGuest1/.config \
>  -overcommit mem-lock=off \
>  -smp 1,sockets=1,cores=1,threads=1 \
>  -object 
> '{"qom-type":"memory-backend-memfd","id":"ram-node0","share":true,"size":224395264}'
>  \
> --numa node,nodeid=0,cpus=0,memdev=ram-node0 \
> +-numa node,nodeid=0,memdev=ram-node0 \
> +-numa cpu,node-id=0,socket-id=0,core-id=0,thread-id=0 \
>  -uuid c7a5fdbd-edaf-9455-926a-d65c16db1809 \
>  -display none \
>  -no-user-config \

None of the impacted tests have 'threads' set to anything else than 1 so
we are not getting any 'thread-id' coverage. Please add some before this
commit. Also as noted, we'll need some PPC64 tests that are impacted.

Reply via email to