Re: [PATCH v3] hw/arm/virt: Avoid unexpected warning from Linux guest on host with Fujitsu CPUs
On 2024-06-12 1:50 pm, Philippe Mathieu-Daudé wrote:
> On 12/6/24 14:48, Peter Maydell wrote:
>> On Wed, 12 Jun 2024 at 13:33, Philippe Mathieu-Daudé wrote:
>>> Hi Zhenyu,
>>>
>>> On 12/6/24 04:05, Zhenyu Zhang wrote:
>>>> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
>>>> index 3c93c0c0a6..3cefac6d43 100644
>>>> --- a/hw/arm/virt.c
>>>> +++ b/hw/arm/virt.c
>>>> @@ -271,6 +271,17 @@ static void create_fdt(VirtMachineState *vms)
>>>>      qemu_fdt_setprop_cell(fdt, "/", "#size-cells", 0x2);
>>>>      qemu_fdt_setprop_string(fdt, "/", "model", "linux,dummy-virt");
>>>>
>>>> +    /*
>>>> +     * For QEMU, all DMA is coherent. Advertising this in the root node
>>>> +     * has two benefits:
>>>> +     *
>>>> +     * - It avoids potential bugs where we forget to mark a DMA
>>>> +     *   capable device as being dma-coherent
>>>> +     * - It avoids spurious warnings from the Linux kernel about
>>>> +     *   devices which can't do DMA at all
>>>> +     */
>>>> +    qemu_fdt_setprop(fdt, "/", "dma-coherent", NULL, 0);
>>>
>>> OK, but why restrict that to the AArch64 virt machine? Shouldn't we advertise this generically in create_device_tree()? Or otherwise at least in the other virt machines?
>>
>> create_device_tree() creates an empty device tree, not one with stuff in it. It seems reasonable to me for this property on the root to be set in the same place we set other properties of the root node.
>
> OK. Still the question about other virt machines remains unanswered :)

From the DT consumer point of view, the interpretation and assumptions around coherency *are* generally architecture- or platform-specific. For instance on RISC-V, many platforms want to assume coherency by default (and potentially use "dma-noncoherent" to mark individual devices that aren't), while others may still want to do the opposite and use "dma-coherent" in the same manner as Arm and AArch64. Neither property existed back in ePAPR, so typical PowerPC systems wouldn't even be looking and will just make their own assumptions by other means.

Thanks,
Robin.
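For readers less familiar with how the property is consumed, the effect under discussion boils down to roughly the sketch below. This is purely illustrative C with made-up types and helpers (struct dt_node, dt_has_prop), not the kernel's actual of_dma_is_coherent() implementation: an explicit property found while walking towards the root wins, and only the fully unspecified case falls back to the per-architecture default described above.

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Minimal stand-in for a DT node: a parent link and a NULL-terminated
 * list of property names (values don't matter for this example). */
struct dt_node {
    const struct dt_node *parent;
    const char *const *props;
};

static bool dt_has_prop(const struct dt_node *np, const char *name)
{
    for (const char *const *p = np->props; p && *p; p++) {
        if (strcmp(*p, name) == 0)
            return true;
    }
    return false;
}

/* Resolve coherency for a device node: an explicit "dma-coherent" or
 * "dma-noncoherent" anywhere up to the root wins, otherwise fall back
 * to the architecture's default assumption. */
static bool device_is_dma_coherent(const struct dt_node *np,
                                   bool arch_default_coherent)
{
    for (; np; np = np->parent) {
        if (dt_has_prop(np, "dma-coherent"))
            return true;              /* Arm/AArch64-style opt-in */
        if (dt_has_prop(np, "dma-noncoherent"))
            return false;             /* RISC-V-style opt-out */
    }
    return arch_default_coherent;
}

With this shape, setting "dma-coherent" once on "/" covers every device below it, which is exactly what the QEMU patch above relies on.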
Re: [PATCH RFC] hw/arm/virt: Avoid unexpected warning from Linux guest on host with Fujitsu CPUs
> So I took a look. of_dma_configure() is being passed force_dma = true.
> https://elixir.bootlin.com/linux/v6.10-rc2/source/drivers/amba/bus.c#L361
>
> There is a comment in of_dma_configure():
>
> /*
>  * For legacy reasons, we have to assume some devices need
>  * DMA configuration regardless of whether "dma-ranges" is
>  * correctly specified or not.
>  */
>
> So I think this is being triggered by a workaround for broken DT. This was introduced by Robin Murphy +CC, though you may need to ask on the kernel list because ARM / QEMU fun.
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=723288836628b
>
> Relevant comment from that patch description:
>
> "Certain bus types have a general expectation of DMA capability and carry a well-established precedent that an absent "dma-ranges" implies the same as the empty property, so we automatically opt those in to DMA configuration regardless, to avoid regressing most existing platforms."
>
> The patch implies that AMBA is one of those. So I'm not sure this is solvable without a hack such as eliding the warning message if force_dma was set, as the situation probably isn't relevant then...

Except it absolutely is, because the whole reason for setting force_dma on those buses is that they *do* commonly have DMA-capable devices, and they are also commonly non-coherent such that this condition would be serious. Especially AMBA, given that the things old enough to still be using that abstraction rather than plain platform (PL080, PL111, PL330, ...) all predate ACE-Lite so don't even have the *possibility* of being coherent without external trickery in the interconnect.

Thanks,
Robin.
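To make the shape of that code path concrete, here is a heavily simplified, purely illustrative sketch of the gate being pointed at above. The names (example_dev, example_dma_configure) are made up and this is not the real of_dma_configure(); it only shows how a force_dma bus proceeds to DMA setup even when the DT says nothing about DMA, which is how the warning discussed in this thread can be reached.

#include <stdbool.h>
#include <stdio.h>

/* Purely illustrative model; the real logic lives in drivers/of/device.c. */
struct example_dev {
    const char *name;
    bool has_dma_ranges;   /* parent bus provided "dma-ranges" */
    bool dma_coherent;     /* "dma-coherent" found on the node or an ancestor */
};

static int example_dma_configure(struct example_dev *dev, bool force_dma)
{
    if (!dev->has_dma_ranges && !force_dma) {
        /* DT describes no DMA and the bus type doesn't insist: do nothing */
        return 0;
    }
    /* Bus types like AMBA pass force_dma=true, so they always get here,
     * even for a device whose DT node never mentions DMA at all. */
    if (!dev->dma_coherent)
        printf("%s: configuring non-coherent DMA ops\n", dev->name);
    else
        printf("%s: configuring coherent DMA ops\n", dev->name);
    return 0;
}

The point Robin makes above is that the force_dma override is deliberate, so suppressing the warning whenever it is set would hide exactly the cases it was meant to catch.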
Re: [RFC] virtio-iommu: Take into account possible aliasing in virtio_iommu_mr()
On 2023-01-20 15:28, Jean-Philippe Brucker wrote:

For some reason this came through as blank mail with a text attachment, so apologies for the lack of quoting, but for reference this is a long-standing known issue:

https://lore.kernel.org/linux-iommu/9625faf4-48ef-2dd3-d82f-931d9cf26...@huawei.com/

In summary, yes, hanging {of,acpi}_iommu_configure() off driver probe is massively wrong, and pulling that step into iommu_probe_device() in a sane and robust manner is one of the next things to start on after getting the bus ops out of the way.

Thanks,
Robin.
Re: [PATCH v2 9/9] hw/arm/smmuv3: Advertise SMMUv3.2 range invalidation
On 2020-07-02 16:26, Eric Auger wrote:
> Expose the RIL bit so that the guest driver uses range invalidation.

Hmm, this is a v3.2 feature... so strictly, in order to advertise it you would need to claim at least v3.1 in SMMU_AIDR and implement all the mandatory v3.1 behaviour ;)

Robin.

> Signed-off-by: Eric Auger
> Reviewed-by: Peter Maydell
> ---
>  hw/arm/smmuv3-internal.h | 1 +
>  hw/arm/smmuv3.c          | 2 ++
>  2 files changed, 3 insertions(+)
>
> diff --git a/hw/arm/smmuv3-internal.h b/hw/arm/smmuv3-internal.h
> index 5babf72f7d..4e7ec252ed 100644
> --- a/hw/arm/smmuv3-internal.h
> +++ b/hw/arm/smmuv3-internal.h
> @@ -54,6 +54,7 @@ REG32(IDR1,0x4)
>  REG32(IDR2,0x8)
>  REG32(IDR3,0xc)
> +FIELD(IDR3, RIL, 10, 1);
>  REG32(IDR4,0x10)
>  REG32(IDR5,0x14)
>  FIELD(IDR5, OAS, 0, 3);
>
> diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
> index 89ab11fc36..add4ba4543 100644
> --- a/hw/arm/smmuv3.c
> +++ b/hw/arm/smmuv3.c
> @@ -254,6 +254,8 @@ static void smmuv3_init_regs(SMMUv3State *s)
>      s->idr[1] = FIELD_DP32(s->idr[1], IDR1, EVENTQS, SMMU_EVENTQS);
>      s->idr[1] = FIELD_DP32(s->idr[1], IDR1, CMDQS, SMMU_CMDQS);
>
> +    s->idr[3] = FIELD_DP32(s->idr[3], IDR3, RIL, 1);
> +
>      /* 4K and 64K granule support */
>      s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN4K, 1);
>      s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN64K, 1);
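For context on the guest-driver side, discovering the feature is just a matter of reading that IDR3 bit. The register offset (0xc) and bit position (10) below are taken from the patch above; the helper names are invented for this illustrative sketch and are not the Linux SMMUv3 driver's code.

#include <stdbool.h>
#include <stdint.h>

#define SMMU_REG_IDR3   0x0c
#define SMMU_IDR3_RIL   (1u << 10)

static inline uint32_t smmu_readl(const volatile void *base, unsigned off)
{
    return *(const volatile uint32_t *)((const volatile char *)base + off);
}

/* If RIL is advertised, TLB invalidation commands may carry an address
 * range instead of being issued once per page, so large unmaps need far
 * fewer commands. */
static bool smmu_supports_range_inval(const volatile void *smmu_base)
{
    return smmu_readl(smmu_base, SMMU_REG_IDR3) & SMMU_IDR3_RIL;
}

Robin's point above is that advertising the bit alone is not enough: a driver is entitled to assume the rest of the v3.1/v3.2 baseline that comes with it.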
Re: [Qemu-devel] [Question] Memory hotplug clarification for Qemu ARM/virt
Hi Shameer,

On 08/05/2019 11:15, Shameerali Kolothum Thodi wrote:
> Hi,
>
> This series here[0] attempts to add support for PCDIMM in QEMU for the ARM/Virt platform and has stumbled upon an issue, as it is not clear (at least from the Qemu/EDK2 point of view) how the hotpluggable memory is handled by the kernel in the physical world.
>
> The proposed implementation in Qemu builds the SRAT and DSDT parts and uses the GED device to trigger the hotplug. This works fine. But when we added the DT node corresponding to the PCDIMM (cold plug scenario), we noticed that the Guest kernel sees this memory during early boot even if we are booting with ACPI. Because of this, hotpluggable memory may end up in zone normal and become non-hot-un-pluggable even if the Guest boots with ACPI.
>
> Further discussions[1] revealed that EDK2 UEFI has no means to interpret the ACPI content from Qemu (this is by design) and uses the DT info to build GetMemoryMap(). To solve this, we introduced a "hotpluggable" property for the DT memory node (patches #7 & #8 from [0]) so that UEFI can differentiate the nodes and exclude the hotpluggable ones from GetMemoryMap(). But then Laszlo rightly pointed out that in order to accommodate the changes into UEFI we need to know how exactly Linux expects/handles all the hotpluggable memory scenarios. Please find the discussion here[2]. For ease, I am just copying the relevant comment from Laszlo below,
>
> /**
> "Given patches #7 and #8, as I understand them, the firmware cannot distinguish hotpluggable & present, from hotpluggable & absent. The firmware can only skip both hotpluggable cases. That's fine in that the firmware will hog neither type -- but is that OK for the OS as well, for both ACPI boot and DT boot?
>
> Consider in particular the "hotpluggable & present, ACPI boot" case. Assuming we modify the firmware to skip "hotpluggable" altogether, the UEFI memmap will not include the range despite it being present at boot. Presumably, ACPI will refer to the range somehow, however. Will that not confuse the OS?
>
> When Igor raised this earlier, I suggested that hotpluggable-and-present should be added by the firmware, but also allocated immediately, as EfiBootServicesData type memory. This will prevent other drivers in the firmware from allocating AcpiNVS or Reserved chunks from the same memory range, the UEFI memmap will contain the range as EfiBootServicesData, and then the OS can release that allocation in one go early during boot.
>
> But this really has to be clarified from the Linux kernel's expectations. Please formalize all of the following cases:
>
>   OS boot     hotpluggable &   GetMemoryMap()       DT/ACPI should
>   (DT/ACPI)   ...              should report as     report as
>   ---------   --------------   ----------------     --------------
>   DT          present          ?                    ?
>   DT          absent           ?                    ?
>   ACPI        present          ?                    ?
>   ACPI        absent           ?                    ?
>
> Again, this table is dictated by Linux."
> **/
>
> Could you please take a look at this and let us know what is expected here from a Linux kernel view point.

For arm64, so far we've not even been considering DT-based hotplug - as far as I'm aware there would still be a big open question there around notification mechanisms and how to describe them. The DT stuff so far has come from the PowerPC folks, so it's probably worth seeing what their ideas are.

ACPI-wise I've always assumed/hoped that hotplug-related things should be sufficiently well-specified in UEFI that "do whatever x86/IA-64 do" would be enough for us.

Robin.

> (Hi Laszlo/Igor/Eric, please feel free to add/change if I have missed any valid points above).
> Thanks,
> Shameer
>
> [0] https://patchwork.kernel.org/cover/10890919/
> [1] https://patchwork.kernel.org/patch/10863299/
> [2] https://patchwork.kernel.org/patch/10890937/
Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
On 28/03/2019 12:53, Auger Eric wrote:
> Hi Robin,
>
> On 3/28/19 11:56 AM, Robin Murphy wrote:
>> On 28/03/2019 10:38, Auger Eric wrote:
>>> Hi Alex, [+ Robin]
>>>
>>> On 3/27/19 5:37 PM, Alex Williamson wrote:
>>>> On Wed, 27 Mar 2019 14:25:00 +0800 Peter Xu wrote:
>>>>> On Tue, Mar 26, 2019 at 04:55:19PM -0600, Alex Williamson wrote:
>>>>>> Conventional PCI buses pre-date requester IDs. An IOMMU cannot distinguish by devfn & bus between devices in a conventional PCI topology and therefore we cannot assign them separate AddressSpaces. By taking this requester ID aliasing into account, QEMU better matches the bare metal behavior and restrictions, and enables shared AddressSpace configurations that are otherwise not possible with guest IOMMU support.
>>>>>>
>>>>>> For the latter case, given any example where an IOMMU group on the host includes multiple devices:
>>>>>>
>>>>>>   $ ls /sys/kernel/iommu_groups/1/devices/
>>>>>>   :00:01.0  :01:00.0  :01:00.1  [1]
>>>>>>
>>>>>> If we incorporate a vIOMMU into the VM configuration, we're restricted that we can only assign one of the endpoints to the guest because a second endpoint will attempt to use a different AddressSpace. VFIO only supports IOMMU group level granularity at the container level, preventing this second endpoint from being assigned:
>>>>>>
>>>>>>   qemu-system-x86_64 -machine q35... \
>>>>>>    -device intel-iommu,intremap=on \
>>>>>>    -device pcie-root-port,addr=1e.0,id=pcie.1 \
>>>>>>    -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \
>>>>>>    -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1
>>>>>>
>>>>>>   qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \
>>>>>>    :01:00.1: group 1 used in multiple address spaces
>>>>>>
>>>>>> However, when QEMU incorporates proper aliasing, we can make use of a PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that provides the downstream devices with the same AddressSpace, ex:
>>>>>>
>>>>>>   qemu-system-x86_64 -machine q35... \
>>>>>>    -device intel-iommu,intremap=on \
>>>>>>    -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
>>>>>>    -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
>>>>>>    -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1
>>>>>>
>>>>>> While the utility of this hack may be limited, this AddressSpace aliasing is the correct behavior for QEMU to emulate bare metal.
>>>>>>
>>>>>> Signed-off-by: Alex Williamson
>>>>>
>>>>> The patch looks sane to me even as a bug fix since otherwise the DMA address spaces used under misc kinds of PCI bridges can be wrong, so:
>>>>
>>>> I'm not sure if "as a bug fix" here is encouraging a 4.0 target, but I'd be cautious about this if so. Eric Auger noted that he's seen an SMMU VM hit a guest kernel bug-on, which needs further investigation. It's not clear if it's just an untested or unimplemented scenario for SMMU to see a conventional PCI bus or if there's something wrong in QEMU. I also haven't tested AMD IOMMU and only VT-d to a very limited degree, thus RFC.
>>>
>>> So I have tracked this further and here is what I can see.
>>>
>>> On guest side, the 2 assigned devices that I have put downstream to the pcie-to-pci bridge get an iommu_fwspec handle with 2 ids, the first one corresponding to the requester id of the very device and the second one corresponding to the rid matching the same bus number and devfn=0
>>>
>>> dev0 = :02:01.0  :02:00.0
>>> dev1 = :02:01.1  :02:00.0
>>>
>>> Then iommu_probe_device is called for :02:01.0 and :02:01.1. Each time it iterates over the associated ids and we call add_device twice for :02:00.0. The second time, the arm-smmu-v3 driver recognizes a context is already alive for :02:00.0 and triggers a BUG_ON().
>> Hmm, aliasing bridges are supposed to be handled as of commit 563b5cbe334e ("iommu/arm-smmu-v3: Cope with duplicated Stream IDs") - what's changed since then?
>
> In our case, the problem is not that we have the same ids[] for a given device, but that we have 2 functions attached to the pcie-to-pci bridge and their fwspecs have an ids[] entry in common, the one with the subordinate bus and devfn=0.

Oh, right, the recollection of having fixed something like this before distracted me from that important detail, sorry :) Yeah, it's a shortcoming of the driver - in general we don't support arbitrary devices sharing Stream IDs (as arm_smmu_device_group() says), but since we hadn't yet seen (nor expected) a non-trivial legacy PCI topology with SMMUv3, that case has never really been tested properly either.

Robin.

>   device pcie-pci-bridge,addr=1e.0,id=pci.1 \
>   -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
>   -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1
>
> [2.509381] ixgbe :02:01.0: arm_smmu_attach_dev ARM_SMMU_DOMAIN_S1
> [2.512095] arm_smmu_install_ste_for_dev sid[0]=520
> [2.513524] arm_smmu_write_strtab_ent sid=520 val=1
> [2.514872] arm_smmu_write_strtab_ent sid=520 ABORT
> [2.516266] arm_smmu_install_ste_for_dev sid[1]=512
> [2.517609] arm_smmu_write_strtab_ent sid=512 val=1
> [2.518958] arm_smmu_wr
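To spell out where the two stream IDs per device come from, here is a self-contained toy calculation (not QEMU or kernel code; it just mirrors the numbers in Eric's log above): each function keeps its own requester ID, and additionally picks up the bridge's alias of (secondary bus, devfn 0), which is what pci_for_each_dma_alias() reports for devices behind a PCIe-to-PCI bridge.

#include <stdio.h>

/* Requester ID = bus[15:8] | device[7:3] | function[2:0]. */
static unsigned int rid(unsigned int bus, unsigned int dev, unsigned int fn)
{
    return (bus << 8) | (dev << 3) | fn;
}

int main(void)
{
    unsigned int bus = 0x02;   /* the bridge's secondary bus */

    /* 02:01.0 -> 520 plus alias 512; 02:01.1 -> 521 plus the same 512 */
    printf("02:01.0 ids: %u, %u\n", rid(bus, 1, 0), rid(bus, 0, 0));
    printf("02:01.1 ids: %u, %u\n", rid(bus, 1, 1), rid(bus, 0, 0));

    /* The shared 512 is the ID that gets add_device() called twice and
     * trips the BUG_ON() described above. */
    return 0;
}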
Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
On 28/03/2019 10:38, Auger Eric wrote:
> Hi Alex, [+ Robin]
>
> On 3/27/19 5:37 PM, Alex Williamson wrote:
>> On Wed, 27 Mar 2019 14:25:00 +0800 Peter Xu wrote:
>>> On Tue, Mar 26, 2019 at 04:55:19PM -0600, Alex Williamson wrote:
>>>> Conventional PCI buses pre-date requester IDs. An IOMMU cannot distinguish by devfn & bus between devices in a conventional PCI topology and therefore we cannot assign them separate AddressSpaces. By taking this requester ID aliasing into account, QEMU better matches the bare metal behavior and restrictions, and enables shared AddressSpace configurations that are otherwise not possible with guest IOMMU support.
>>>>
>>>> For the latter case, given any example where an IOMMU group on the host includes multiple devices:
>>>>
>>>>   $ ls /sys/kernel/iommu_groups/1/devices/
>>>>   :00:01.0  :01:00.0  :01:00.1  [1]
>>>>
>>>> If we incorporate a vIOMMU into the VM configuration, we're restricted that we can only assign one of the endpoints to the guest because a second endpoint will attempt to use a different AddressSpace. VFIO only supports IOMMU group level granularity at the container level, preventing this second endpoint from being assigned:
>>>>
>>>>   qemu-system-x86_64 -machine q35... \
>>>>    -device intel-iommu,intremap=on \
>>>>    -device pcie-root-port,addr=1e.0,id=pcie.1 \
>>>>    -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \
>>>>    -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1
>>>>
>>>>   qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \
>>>>    :01:00.1: group 1 used in multiple address spaces
>>>>
>>>> However, when QEMU incorporates proper aliasing, we can make use of a PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that provides the downstream devices with the same AddressSpace, ex:
>>>>
>>>>   qemu-system-x86_64 -machine q35... \
>>>>    -device intel-iommu,intremap=on \
>>>>    -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
>>>>    -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
>>>>    -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1
>>>>
>>>> While the utility of this hack may be limited, this AddressSpace aliasing is the correct behavior for QEMU to emulate bare metal.
>>>>
>>>> Signed-off-by: Alex Williamson
>>>
>>> The patch looks sane to me even as a bug fix since otherwise the DMA address spaces used under misc kinds of PCI bridges can be wrong, so:
>>
>> I'm not sure if "as a bug fix" here is encouraging a 4.0 target, but I'd be cautious about this if so. Eric Auger noted that he's seen an SMMU VM hit a guest kernel bug-on, which needs further investigation. It's not clear if it's just an untested or unimplemented scenario for SMMU to see a conventional PCI bus or if there's something wrong in QEMU. I also haven't tested AMD IOMMU and only VT-d to a very limited degree, thus RFC.
>
> So I have tracked this further and here is what I can see.
>
> On guest side, the 2 assigned devices that I have put downstream to the pcie-to-pci bridge get an iommu_fwspec handle with 2 ids, the first one corresponding to the requester id of the very device and the second one corresponding to the rid matching the same bus number and devfn=0
>
> dev0 = :02:01.0  :02:00.0
> dev1 = :02:01.1  :02:00.0
>
> Then iommu_probe_device is called for :02:01.0 and :02:01.1. Each time it iterates over the associated ids and we call add_device twice for :02:00.0. The second time, the arm-smmu-v3 driver recognizes a context is already alive for :02:00.0 and triggers a BUG_ON().

Hmm, aliasing bridges are supposed to be handled as of commit 563b5cbe334e ("iommu/arm-smmu-v3: Cope with duplicated Stream IDs") - what's changed since then?

Robin.
> At the origin of the creation of 2 ids for each device, iort_iommu_configure is called on each downstream device, which calls pci_for_each_dma_alias(). We enter the pci_is_pcie(tmp)/PCI_EXP_TYPE_PCI_BRIDGE code path and iort_pci_iommu_init is called with bus number 2 and devfn=0.
>
> Thanks
>
> Eric
>
>>> Reviewed-by: Peter Xu
>>>
>>> Though I have a question that confused me even before: Alex, do you know why all the context entries of the devices in the IOMMU root table will be programmed even if the devices are under a pcie-to-pci bridge? I'm giving an example with above [1] to be clear: in that case IIUC we'll program context entries for all the three devices (00:01.0, 01:00.0, 01:00.1) but they'll point to the same IOMMU table. DMAs of devices 01:00.0 and 01:00.1 should always be tagged with 01:00.0 on bare metal, so why do we bother to program the context entry of 01:00.1? It seems never used. (It should be used for current QEMU to work with pcie-to-pci bridges if without this patch, but I feel like I don't know the real answer behind it)
>>
>> We actually have two different scenarios that could be represented by [1], the group can be formed by lack of isolation or by lack of visibility. In the group above, it's the former, isolation. The PCIe root port does not support ACS, so while the IOMMU has visibility of the individual devices, peer-to-peer between
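As a point of reference for the commit mentioned above (563b5cbe334e), the deduplication it introduced only looks within a single device's own stream ID list. The sketch below is illustrative C, not the actual arm-smmu-v3 code (write_strtab_ent() is a made-up stub), and it shows why that check does not cover the case above, where the duplicate ID belongs to a different device's fwspec.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for programming one stream table entry. */
static void write_strtab_ent(uint32_t sid)
{
    printf("install STE for sid %u\n", sid);
}

/* Skip stream IDs duplicated within one device's own list; it cannot
 * help when two *different* devices carry the same alias ID. */
static void install_ste_for_dev(const uint32_t *sids, int num_sids)
{
    for (int i = 0; i < num_sids; i++) {
        bool seen = false;
        for (int j = 0; j < i; j++) {
            if (sids[j] == sids[i])
                seen = true;
        }
        if (!seen)
            write_strtab_ent(sids[i]);
    }
}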