Hi Igor,

Thanks for the concrete pointers in this round -- the machine_done /
ACPI-runtime timing notes are useful regardless of where this lands.
Replying inline.

On 5/18/2026 10:32 PM, Igor Mammedov wrote:
> On Mon, 18 May 2026 18:43:10 +0800
> "Huang, FangSheng (Jerry)" <[email protected]> wrote:
>
>> Hi Igor,
>>
>> Thanks again for the careful read.
>>
>> On 5/15/2026 9:04 PM, Igor Mammedov wrote:


> Without magic/hardcoding, firmware is likely to discover HBM memory
> as a CXL device and add E820/SRAT entries for it. That is ideally how
> it should be modeled in QEMU as well (not ad-hoc -numa foo options).
>
> I won't push towards the feature being part of GPU pass-through,
> or HBM being a CXL device (which it probably should be),
> but read on ...

Understood, and thanks for not making those preconditions.  Setting
CXL and GPU-passthrough-bundle aside for this reply, the remaining
question is whether the SPM topology is expressed on the -numa node
line (v7) or on a separate -device line (your proposal).  I'll
address that below under the mixed-memory point.


> It shouldn't be done at realize time.
>
> Ultimately we publish the E820 table at machine_done time,
> and it's the right time to iterate over present memory devices
> and add relevant entries if necessary.
> SRAT is produced even later
> (effectively at runtime on 1st access to the ACPI tables blob),
> so it can pick up memory devices at that time as well.

Noted -- machine_done for E820 and ACPI-tables-runtime for SRAT is
the right hook set for a memory-device-based implementation; that's
where the iteration over present memory devices would land.
(v7 itself sets the E820 entries during pc machine-init, which is
the appropriate hook for the topology-driven -numa path -- there's
no per-device realize step whose results would need to be iterated
over.)
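
To make "topology-driven" concrete, here is a rough sketch (Python,
not QEMU code) of what that path amounts to: one pass over the -numa
node list, one E820 entry per node.  The helper name
e820_from_nodes(), the flat layout starting at address 0, and the
reserved -> E820_RESERVED mapping are assumptions for illustration
only; the real pc memory layout (PCI hole below 4G and so on) is more
involved.  0xEFFFFFFF is the soft-reserved type value Linux uses.

```python
# Illustrative model only; names and layout are assumptions, not v7 code.
E820_RAM = 1
E820_RESERVED = 2
E820_SOFT_RESERVED = 0xEFFFFFFF  # value used by Linux for soft-reserved

# Assumed mapping from the memmap-type= values to E820 entry types.
MEMMAP_TO_E820 = {
    "normal": E820_RAM,
    "spm": E820_SOFT_RESERVED,
    "reserved": E820_RESERVED,
}

def e820_from_nodes(nodes, base=0):
    """Emit one (address, length, type) entry per (size, memmap_type) node."""
    entries = []
    addr = base
    for size, memmap_type in nodes:
        entries.append((addr, size, MEMMAP_TO_E820[memmap_type]))
        addr += size  # simplified: nodes are laid out back to back
    return entries

GiB = 1 << 30
entries = e820_from_nodes([(4 * GiB, "normal"), (8 * GiB, "spm")])
```

Everything needed is already known from the -numa lines at
machine-init, so there is no later state to collect.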

>> (4) On the longer-term direction
>
> that's exactly what I'm concerned about
>
> If we merge the v7 approach, we are bound to support it for years,
> and if we later add a device-based variant, it will increase the
> support burden even more and complicate configuration, which
> in turn will propagate up the stack.
>
> Hence, I'd like the possible options on the table explored 1st
> before we commit to a particular approach.
>
> I wouldn't object much if it were a fix that we had to rush in,
> but this is not the case (are we in a hurry to rush this in?).

This is where I'd like to push back, on three connected points: the
mixed-memory question that the v4 thread already converged on; the
relationship between v7 and a future spm-memory device; and the
review timeline.

(1) The mixed-memory question -- the v4 thread carries over directly

In the v4 round, an -object/backend-property variant of this feature
was set aside after pushback from Gregory and David on a single
concern: allowing a single NUMA node to mix SPM and normal memory.
The current memmap-type=[normal,spm,reserved] form was Gregory's
suggestion in that same discussion, and v7 carries Acked-by from
David and Reviewed-by from Gregory.

A -device spm-memory model reopens that question.  A memory device
is plugged into a target NUMA node via its node= parameter; unless
additional constraints are added, that node can also have normal
memory attached via -numa node,memdev=.  Adding a runtime check
that errors out on mixed-node configurations is possible, but it
changes the failure mode from "read your command line and see which
node is SPM" to "assemble a command line that looks fine, hit a
realize-time error, edit, retry."  For users composing multi-node
topologies on the CLI (or via libvirt-generated cmdlines), that's a
meaningful UX regression.
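
As a rough illustration of the check such a model would need (a
Python sketch, not QEMU code; the Attachment record,
find_mixed_nodes() and the memdev= property on spm-memory are
hypothetical, made up for this example):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Attachment:
    node: int    # target NUMA node id
    kind: str    # "normal" or "spm"
    source: str  # command-line fragment, kept for the error message

def find_mixed_nodes(attachments):
    """Return {node: [sources]} for nodes carrying both SPM and normal memory."""
    kinds = {}
    sources = {}
    for a in attachments:
        kinds.setdefault(a.node, set()).add(a.kind)
        sources.setdefault(a.node, []).append(a.source)
    return {n: sources[n] for n, k in kinds.items() if len(k) > 1}

cfg = [
    Attachment(0, "normal", "-numa node,nodeid=0,memdev=m0"),
    Attachment(1, "normal", "-numa node,nodeid=1,memdev=m1"),
    # Hypothetical device syntax; only the node= parameter is from the proposal.
    Attachment(1, "spm", "-device spm-memory,node=1,memdev=hbm0"),
]
mixed = find_mixed_nodes(cfg)  # node 1 is flagged, node 0 is not
```

Note that the conflict on node 1 only shows up once two separate
options are correlated; neither line looks wrong on its own.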

The whole-node typing in v7 expresses the SPM-vs-normal property
declaratively, on the same -numa line that already carries the
node's other attributes -- which is the form the v4 review
converged on, precisely to keep this kind of constraint visible at
the configuration layer rather than at realize time.

(Gregory, David: this is exactly the regression you raised in v4
that pushed the series away from -object form.  Would value your
read on whether a -device spm-memory variant raises it again, and
what an acceptable resolution would look like in that form.)

(2) v7 and a future spm-memory device aren't mutually exclusive

I don't see anything in v7 that conflicts with the device-based
direction you sketched.  v7 establishes the user-visible semantic
(whole-node SPM, with E820 SOFT_RESERVED + SRAT memory-affinity
emission) and the firmware-side compatibility (OVMF already
upstream, SeaBIOS in flight).  A future spm-memory device built on
memory-device infrastructure can target the same E820 + SRAT
emission and either co-exist with v7 (different idioms for the same
underlying semantic) or, if and when it proves out, subsume v7 via
the standard QEMU deprecation flow.

Three observations on the "merging v7 means committing to support
burden for years" framing:

(A) On your own framing, the existing -numa node,memdev= interface
    for attaching memory to NUMA nodes is itself on a long
    deprecation arc toward memory-device.  v7's memmap-type= is a
    sibling attribute on the same -numa node configuration, sharing
    the same lifetime envelope; it doesn't create a new long-term
    contract beyond what -numa node,memdev= already commits us to.

(B) v7 is the interim shape that a future spm-memory device can
    subsume.  If the device-based variant lands later and proves
    out, v7's memmap-type=spm can be marked deprecated under the
    standard two-cycle policy and removed.  That's the same path
    we'd take for any of the -numa node,memdev= family.  Treating
    v7 as a permanent commitment overstates the contract.

(C) The marginal code surface of v7 is small: a single new -numa
    node attribute that routes into the existing E820 plumbing in
    machine-init.  The marginal maintenance cost is bounded by
    that surface.  Compare against the cost of holding the in-tree
    use case while a memory-device-based prototype is designed,
    implemented, reviewed and stabilised.

(3) The review timeline

I want to be transparent about this rather than leave it implied.
This series has been in upstream review for roughly 1.5 years.  The
v5, v6 and v7 cycles went through with Acked-by from David and
Reviewed-by from Gregory; your previous engagement on this thread
was on the v4 round in January.  v7 was posted with the QEMU 11.1
window in mind, and the soft freeze is 7 July.

I'm raising this not to dismiss the technical questions you've put
on the table -- those are worth a separate RFC and I'd be glad to
take part -- but because gating v7 on prototyping a memory-device-
based variant first effectively pushes this past 11.1, and the
production deployment timeline that this series enables doesn't
accommodate that.  If the in-tree review on v7 itself had surfaced
this direction earlier, we'd be in a different place; at this point
in the cycle, the cleanest path is to land v7 as it stands (with
the tags it already carries) and pursue the device-based variant as
its own RFC.

> My hunch is that a memory-device-based approach will end up with
> more straightforward and cleaner code, not to mention proper
> backend/frontend modeling.
>
> (also related to long term: ideally the existing -numa 'some memory'
> interface should go away and be replaced by memory devices.
> Adding more similar '-numa' options doesn't help that case and only
> increases our technical debt).


On the cleaner-code hunch -- agreed in principle for a green-field
design.  On the -numa-should-go-away point -- agreed as a long-term
direction; that's exactly why v7 sits within the existing -numa
node,memdev= lifetime envelope rather than committing us to a new
one (point 2A above).

I'd be glad to take part in the spm-memory RFC and in the wider
migration of the -numa node,memdev= interface family to
memory-device that you sketched.  Realistically, there are
architecture-level questions still open on our side that pace what
we can take on -- and landing v7 in 11.1 helps here by giving us a
stable in-tree baseline to iterate forward from.

My ask on this thread is that v7 lands for 11.1 on the tags it
already carries, and that the device-based direction proceeds as
separate, parallel work.

Best regards,
FangSheng Huang (Jerry)

