David Hildenbrand <da...@redhat.com> writes:

> On 24.05.24 15:14, Daniel Henrique Barboza wrote:
>>
>>
>> On 5/21/24 07:56, Björn Töpel wrote:
>>> From: Björn Töpel <bj...@rivosinc.com>
>>>
>>> Virtio-based memory devices (virtio-mem/virtio-pmem) allow for
>>> dynamic resizing of virtual machine memory and require proper
>>> hotplugging (add/remove) support to work.
>>>
>>> Add device memory support for the RISC-V "virt" machine, and enable
>>> virtio-md-pci with the corresponding missing hotplugging callbacks.
>>>
>>> Signed-off-by: Björn Töpel <bj...@rivosinc.com>
>>> ---
>>>  hw/riscv/Kconfig       |  2 +
>>>  hw/riscv/virt.c        | 83 +++++++++++++++++++++++++++++++++++++++++-
>>>  hw/virtio/virtio-mem.c |  5 ++-
>>>  3 files changed, 87 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/hw/riscv/Kconfig b/hw/riscv/Kconfig
>>> index a2030e3a6ff0..08f82dbb681a 100644
>>> --- a/hw/riscv/Kconfig
>>> +++ b/hw/riscv/Kconfig
>>> @@ -56,6 +56,8 @@ config RISCV_VIRT
>>>      select PLATFORM_BUS
>>>      select ACPI
>>>      select ACPI_PCI
>>> +    select VIRTIO_MEM_SUPPORTED
>>> +    select VIRTIO_PMEM_SUPPORTED
>>>
>>>  config SHAKTI_C
>>>      bool
>>> diff --git a/hw/riscv/virt.c b/hw/riscv/virt.c
>>> index 4fdb66052587..443902f919d2 100644
>>> --- a/hw/riscv/virt.c
>>> +++ b/hw/riscv/virt.c
>>> @@ -53,6 +53,8 @@
>>>  #include "hw/pci-host/gpex.h"
>>>  #include "hw/display/ramfb.h"
>>>  #include "hw/acpi/aml-build.h"
>>> +#include "hw/mem/memory-device.h"
>>> +#include "hw/virtio/virtio-mem-pci.h"
>>>  #include "qapi/qapi-visit-common.h"
>>>  #include "hw/virtio/virtio-iommu.h"
>>>
>>> @@ -1407,6 +1409,7 @@ static void virt_machine_init(MachineState *machine)
>>>      DeviceState *mmio_irqchip, *virtio_irqchip, *pcie_irqchip;
>>>      int i, base_hartid, hart_count;
>>>      int socket_count = riscv_socket_count(machine);
>>> +    hwaddr device_memory_base, device_memory_size;
>>>
>>>      /* Check socket count limit */
>>>      if (VIRT_SOCKETS_MAX < socket_count) {
>>> @@ -1420,6 +1423,12 @@ static void virt_machine_init(MachineState
*machine)
>>>              exit(1);
>>>      }
>>>
>>> +    if (machine->ram_slots > ACPI_MAX_RAM_SLOTS) {
>>> +        error_report("unsupported amount of memory slots: %"PRIu64,
>>> +                     machine->ram_slots);
>>> +        exit(EXIT_FAILURE);
>>> +    }
>>> +
>>>      /* Initialize sockets */
>>>      mmio_irqchip = virtio_irqchip = pcie_irqchip = NULL;
>>>      for (i = 0; i < socket_count; i++) {
>>> @@ -1553,6 +1562,37 @@ static void virt_machine_init(MachineState *machine)
>>>      memory_region_add_subregion(system_memory, memmap[VIRT_MROM].base,
>>>                                  mask_rom);
>>>
>>> +    /* device memory */
>>> +    device_memory_base = ROUND_UP(s->memmap[VIRT_DRAM].base + machine->ram_size,
>>> +                                  GiB);
>>> +    device_memory_size = machine->maxram_size - machine->ram_size;
>>> +    if (device_memory_size > 0) {
>>> +        /*
>>> +         * Each DIMM is aligned based on the backend's alignment value.
>>> +         * Assume max 1G hugepage alignment per slot.
>>> +         */
>>> +        device_memory_size += machine->ram_slots * GiB;
>>
>> We don't need to align to 1 GiB. This calculation can use 2 MiB instead
>> (or 4 MiB if we're running 32-bit).
>>
>>> +
>>> +        if (riscv_is_32bit(&s->soc[0])) {
>>> +            hwaddr memtop = device_memory_base + ROUND_UP(device_memory_size,
>>> +                                                          GiB);
>>
>> Same here - the alignment is 2/4 MiB.
>>
>>> +
>>> +            if (memtop > UINT32_MAX) {
>>> +                error_report("memory exceeds 32-bit limit by %lu bytes",
>>> +                             memtop - UINT32_MAX);
>>> +                exit(EXIT_FAILURE);
>>> +            }
>>> +        }
>>> +
>>> +        if (device_memory_base + device_memory_size < device_memory_size) {
>>> +            error_report("unsupported amount of device memory");
>>> +            exit(EXIT_FAILURE);
>>> +        }
>>
>> Took another look and found this a bit strange. These are all unsigned
>> vars, so if (unsigned a + unsigned b < unsigned b) will always be
>> 'false'. The compiler is probably cropping this out.
>
> No. Unsigned integer overflow is defined behavior, and this is a common
> check to detect such overflow. It's consistent with what we do for other
> architectures.
>
>>
>> The calculation we need to do is to ensure that the extra
>> ram_slots * alignment will fit into the VIRT_DRAM block, i.e.
>> maxram_size + (ram_slots * alignment) < memmap[VIRT_DRAM].size.
>>
>> TBH I'm starting to have second thoughts about letting users hotplug
>> whatever they want. It seems cleaner to just force the 2/4 MiB
>> alignment in pre_plug() and be done with it; no need to allocate
>> ram_slots * alignment and do all these extra checks.
>
> It's worth noting that if user space decides to specify addresses
> manually, it can mess up everything already. There are other events that
> can result in fragmentation of the memory device area (repeated
> hot(un)plug of differing DIMMs).
>
> Assume you have a 1 GiB range and hotplug a 512 MiB DIMM at offset
> 256 MiB. You won't be able to hotplug another 512 MiB DIMM even though
> we reserved space.
>
> My take so far is: if the user wants to do such stuff, they should size
> the area (maxmem) much larger or deal with the consequences (not being
> able to hotplug memory).
>
> It usually does not happen in practice ...
Daniel/David, again thanks for spending time on the patches.

The reason I picked 1 GiB per slot was that the alignment is also
dependent on the backend, AFAIU. Rationale in commit 085f8e88ba73 ("pc:
count in 1Gb hugepage alignment when sizing hotplug-memory container").

What I'm reading from you guys is: just depend on whatever maxmem says
(modulo 2/4 MiB alignment), and leave it at that. I agree that that's
much easier to reason about. Correct?

>> As I sent in an earlier email, users must already comply with the
>> alignment of the host memory when plugging pc-dimms, so I'm not sure
>> the value proposition of all this extra code is worth it - the
>> alignment will most likely be forced by the host memory backend, so we
>> might as well force it ourselves in pre_plug().
>
> At least on x86-64, the 2 MiB alignment requirement is handled
> automatically. QEMU_VMALLOC_ALIGN implicitly enforces that.
>
> Maybe RISCV also wants to set QEMU_VMALLOC_ALIGN?

I'll look into this for the next rev!


Björn