Hi Igor,
Thanks again for the careful read.
On 5/15/2026 9:04 PM, Igor Mammedov wrote:
On Fri, 15 May 2026 15:53:07 +0800
"Huang, FangSheng (Jerry)" <[email protected]> wrote:
On 5/14/2026 9:05 PM, Igor Mammedov wrote:
Hi Igor,
Thanks for taking the time to review this -- and for the candor in
the bashing chapter. Before going into the bigger picture, let me
re-establish one factual point that v7 didn't carry forward from
the v6 cover letter.
feel free to bash my review as well, I hope that we end up with
clear picture what and why we are doing.
Let me take your points in order: first the bigger-picture context
you asked for (which I'll also put in v8's commit message), then
the dax/kmem compatibility state, the prior thread on this CLI
shape, and where this could go longer-term.
bigger picture should be somewhere in commit message so later on
a reader could understand why we are doing it at all/this way.
lets continue with questions wrt impl.
Agreed, v8 will carry a condensed version of the bigger picture
in the commit message body. Setting the architectural context
out in full here first -- the rest of the reply depends on it.
(1) The architectural context: HBM is system memory in this class
of systems
This patch targets coherent CPU+accelerator shared-address-space
systems, where the accelerator's high-bandwidth memory is no
longer a device-private framebuffer behind a PCIe BAR. It lives
in the host physical address space -- visible to the CPU as
system memory (not as device MMIO) and shared coherently with
the accelerator over the platform fabric. Both sides must keep
consistent views of those pages or the coherent contract breaks.
The use case is GPU/accelerator HBM exposed to the OS as SPM. On
bare metal, the platform firmware:
- emits E820 type 0xEFFFFFFF (SOFT_RESERVED) for the HBM region;
- emits ACPI SRAT memory affinity entries that bind HBM to a
dedicated proximity domain (NUMA node);
- tags the accelerator's PCI device with _PXM matching that node.
That gives the device driver a stable lookup chain at runtime:
dev -> pci_dev_to_node(dev) -> SRAT walk -> HBM GPA range
it looks kind of convoluted, isn't it.
PCI devices were supposed to be self describing/discoverable.
Preferably without above mentioned firmware 'hooks'.
Above example could be just early impl. issues, rather than by
design issue.
NUMA node here is not incidental -- it is the OS-exposed
intermediary ID that the device driver uses to find its own HBM.
This is the in-tree path used by accelerator drivers today.
I'm assuming GPU is exposed as some composite PCI/CXL device.
and use-case is its pass-through to guest.
Perhaps we can't do anything about it now.
But shouldn't device driver discover its own memory (HBM and what not)
without external parties that magically gain knowledge about parts of
device that driver supposedly driving the device has not a clue about?
How doesn bios know about SPM when device's driver with knowledge of
device internals knows nothing about?
To be blunt about it: in this architecture, "the device alone
knows where its own memory is" doesn't apply. The accelerator
package knows it has HBM and how much, but the GPA it's mapped to
is set by platform firmware during boot fabric training, in
coordination with the CPU host bridge and the fabric coherency
hardware. Firmware "knows the GPA" because firmware authored
that binding, not because some external party magically learned
the device's internals; every party including the driver has to
conform to it.
That makes the OS-level requirement twofold: the memory has to be
visible in the system memory map (so the CPU side can address it)
and reserved exclusively for the accelerator's driver (so the
general allocator doesn't hand HBM pages to random workloads).
The SP marking therefore has to reach the CPU memory subsystem,
not just the driver. EFI_MEMORY_SP / E820 SOFT_RESERVED + SRAT
memory-affinity is the mechanism that fits this: a system-level,
firmware-produced topology that the CPU memory subsystem honors
and that the accelerator's driver consumes to discover and
manage its HBM range. Any path that emits only an ACPI namespace
node, without E820 SOFT_RESERVED, leaves the CPU memory subsystem
without that signal and the contract breaks; any path that does
also emit E820 SOFT_RESERVED restores it.
The "-numa node + memmap-type=spm + E820 SOFT_RESERVED" combo in
v7 is a direct 1:1 model of this BM topology. The E820 retyping
in the patch is exactly what makes the guest-visible E820 match
what BM firmware emits for the same kind of region.
On the DIMM / device-memory alternative:
wrt modeling GPU pass-through, my 1st attempt would be
to make -device gpu-foo take everything need to compose the device
(like in real hw) and be done with it (and PCI/CXL machinery would
take care of mapping/exposing memory to guest).
Why we aren't doing it?
barring that, and assuming we have to pass SPM as a separate memory
(why and why it should be exposed in E820 and at boot time only?)
I'd try -device foo-memory approach.
So this isn't HBM-as-VRAM, the device-private framebuffer model
from discrete GPUs. It's HBM as a system memory tier under
driver-owned allocation policy, and -numa node + memmap-type=spm
is the direct expression of that, not a workaround layered on top.
(2) v7 is naturally compatible with stock upstream Linux dax/kmem
v7 emits E820 SOFT_RESERVED via existing machine-init, into the
same fw_cfg blob the firmware and the guest kernel already
consume. The upstream kernel already understands this signal --
dax/kmem is one such consumer in tree, picking the region up via
the IORES_DESC_SOFT_RESERVED hook in drivers/dax/hmem/device.c.
The accelerator's own driver discovers its HBM range from the
same E820 + SRAT entries via dev_to_node + SRAT walk. Neither
side needs new code for v7.
For a memory-device-derived spm-memory device to reach the same
dax/kmem path, it would have to inject E820 SOFT_RESERVED from
its realize() -- which is new plumbing inside the device_memory
subsystem.
That also bears on two of your other points:
And no need for messing with firmware (SeaBIOS: RamSizeOver4G
patch) nor EDK2.
That holds only for the path where the driver doesn't go through
E820 SOFT_RESERVED at all -- but that path needs the in-tree
accelerator driver rewritten to walk ACPI namespace instead of
SRAT, plus a kernel-side patch to wire ACPI memory device
descriptors into dax/kmem. For the e820+SRAT fallback path you
describe as Step 3, the firmware-side requirement is identical to
v7.
figure out why device driver has to fetch memory map
and proximity from static tables as opposed to getting it dynamically
from _PXM -> maped-memory range. (at the time PCI devices enum runs,
> all ACPI
info incl. run time one is fully accessible to in-kernel users)
i.e. try to make driver work with runtime proximity
The current amdgpu lookup is already a runtime, PXM-keyed dynamic
lookup -- dev_to_node(pdev) evaluates _PXM on the PCI device,
then a SRAT memory-affinity walk matches by proximity_domain.
PXM is the matching key; SRAT is the storage venue. What you're
suggesting is moving the (PXM -> range) mapping from a system-
level table to a per-device ACPI node. Both are populated by
firmware at boot and queried by the driver at probe; the
difference is which kernel subsystem reads the result. The
system-table choice is what keeps the CPU memory subsystem in the
loop, per (1).
(3) The current CLI shape reflects prior maintainer alignment
The memmap-type=[normal,spm,reserved] form itself came from
Gregory's suggestion on the v4 thread in January, where you also
noted that AMD hardware "such memory has 1:1 mapping" -- which
is exactly the use case this patch targets. v6 adopted that
naming, and v7 carries Acked-by: David and Reviewed-by: Gregory.
On the firmware side, OVMF SOFT_RESERVED handling is merged
upstream (edk2 commit 4e90b65e7b); the SeaBIOS-side conversation
with Gerd and Paul is converging on a small downstream knob.
So the current shape isn't a unilateral pick -- it's the form
the review converged on, with explicit tags from David and
Gregory and consistent firmware-side adoption.
wrt my suggestion using memory-device.
It's true that the device memory region has started as hotpluggable memory.
But that's impl. detail, nothing fundamentally prevents us from
describing mix of present at boot time memory devices within it in e820/SRAT.
Answer to why DIMMs aren't in e820 was for us to avoid dealing with
linux kernel putting that memory into zone_normal instead of zone_movable.
On real hardware, one is likely to see all present at boot dimms, in e820
and SRAT.
For already existing memory devices, I'd like us continue dodging e820,
so we wouldn't break existing deployments. however for a new memory device
we don't have such limitations.
What I'd try is:
1: inherit spm-memory device from memory-device
(all memory mapping and APCI memory device descriptors, can be made
to pick it along with DIMM devices)
2: figure out why device driver has to fetch memory map
and proximity from static tables as opposed to getting it dynamically
from _PXM -> maped-memory range. (at the time PCI devices enum runs, all
ACPI
info incl. run time one is fully accessible to in-kernel users)
i.e. try to make driver work with runtime proximity
3. if #2 is impossible, we can try to expose SPM memory devices in e820,
and partition SRAT to match actual device_memory region layout.
(4) On the longer-term direction
An spm-memory device built on memory-device infrastructure -- one
that emits the same E820 SOFT_RESERVED + SRAT memory-affinity
entries v7 produces today -- is a reasonable direction for a
separate RFC in QEMU 11.2 / 12.x. Such an RFC would close
exactly the gap David noted earlier, that boot memory isn't
currently modeled as a memory device on the QEMU side.
One thing that has to stay in any such direction: the SRAT
memory-affinity emission -- without it the CPU memory subsystem
loses visibility into HBM and the contract from (1) breaks.
Happy to take part in that RFC when someone kicks it off, but
it's better treated as its own work than as a gate on v7.
Best regards,
FangSheng Huang (Jerry)