Hi Daniel, On Tue, May 21, 2024 at 12:17 AM Daniel Henrique Barboza < dbarb...@ventanamicro.com> wrote:
> Hi Frank, > > On 5/16/24 04:13, Frank Chang wrote: > > On Mon, May 13, 2024 at 8:37 PM Daniel Henrique Barboza < > dbarb...@ventanamicro.com <mailto:dbarb...@ventanamicro.com>> wrote: > > > > Hi Frank, > > > > > > On 5/8/24 08:15, Daniel Henrique Barboza wrote: > > > Hi Frank, > > > > > > I'll reply with that I've done so far. Still missing some stuff: > > > > > > On 5/2/24 08:37, Frank Chang wrote: > > >> Hi Daniel, > > >> > > >> Daniel Henrique Barboza <dbarb...@ventanamicro.com <mailto: > dbarb...@ventanamicro.com>> 於 2024年3月8日 週五 上午12:04寫道: > > >>> > > >>> From: Tomasz Jeznach <tjezn...@rivosinc.com <mailto: > tjezn...@rivosinc.com>> > > >>> > > >>> The RISC-V IOMMU specification is now ratified as-per the RISC-V > > >>> international process. The latest frozen specifcation can be > found > > >>> at: > > >>> > > >>> > https://github.com/riscv-non-isa/riscv-iommu/releases/download/v1.0/riscv-iommu.pdf > < > https://github.com/riscv-non-isa/riscv-iommu/releases/download/v1.0/riscv-iommu.pdf > > > > >>> > > >>> Add the foundation of the device emulation for RISC-V IOMMU, > which > > >>> includes an IOMMU that has no capabilities but MSI interrupt > support and > > >>> fault queue interfaces. We'll add add more features > incrementally in the > > >>> next patches. > > >>> > > >>> Co-developed-by: Sebastien Boeuf <s...@rivosinc.com <mailto: > s...@rivosinc.com>> > > >>> Signed-off-by: Sebastien Boeuf <s...@rivosinc.com <mailto: > s...@rivosinc.com>> > > >>> Signed-off-by: Tomasz Jeznach <tjezn...@rivosinc.com <mailto: > tjezn...@rivosinc.com>> > > >>> Signed-off-by: Daniel Henrique Barboza < > dbarb...@ventanamicro.com <mailto:dbarb...@ventanamicro.com>> > > >>> --- > > >>> hw/riscv/Kconfig | 4 + > > > > (...) > > > > >>> + > > >>> + s->iommus.le_next = NULL; > > >>> + s->iommus.le_prev = NULL; > > >>> + QLIST_INIT(&s->spaces); > > >>> + qemu_cond_init(&s->core_cond); > > >>> + qemu_mutex_init(&s->core_lock); > > >>> + qemu_spin_init(&s->regs_lock); > > >>> + qemu_thread_create(&s->core_proc, "riscv-iommu-core", > > >>> + riscv_iommu_core_proc, s, QEMU_THREAD_JOINABLE); > > >> > > >> In our experience, using QEMU thread increases the latency of > command > > >> queue processing, > > >> which leads to the potential IOMMU fence timeout in the Linux > driver > > >> when using IOMMU with KVM, > > >> e.g. booting the guest Linux. > > >> > > >> Is it possible to remove the thread from the IOMMU just like > ARM, AMD, > > >> and Intel IOMMU models? > > > > > > Interesting. We've been using this emulation internally in > Ventana, with > > > KVM and VFIO, and didn't experience this issue. Drew is on CC and > can talk > > > more about it. > > > > > > That said, I don't mind this change, assuming it's feasible to > make it for this > > > first version. I'll need to check it how other IOMMUs are doing > it. > > > > > > I removed the threading and it seems to be working fine without it. > I'll commit this > > change for v3. > > > > > > > > > > > > > >> > > >>> +} > > >>> + > > > > > > (...) > > > > > >>> + > > >>> +static AddressSpace *riscv_iommu_find_as(PCIBus *bus, void > *opaque, int devfn) > > >>> +{ > > >>> + RISCVIOMMUState *s = (RISCVIOMMUState *) opaque; > > >>> + PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), > devfn); > > >>> + AddressSpace *as = NULL; > > >>> + > > >>> + if (pdev && pci_is_iommu(pdev)) { > > >>> + return s->target_as; > > >>> + } > > >>> + > > >>> + /* Find first registered IOMMU device */ > > >>> + while (s->iommus.le_prev) { > > >>> + s = *(s->iommus.le_prev); > > >>> + } > > >>> + > > >>> + /* Find first matching IOMMU */ > > >>> + while (s != NULL && as == NULL) { > > >>> + as = riscv_iommu_space(s, > PCI_BUILD_BDF(pci_bus_num(bus), devfn)); > > >> > > >> For pci_bus_num(), > > >> riscv_iommu_find_as() can be called at the very early stage > > >> where software has no chance to enumerate the bus numbers. > > > > I investigated and this doesn't seem to be a problem. This function > is called at the > > last step of the realize() steps of both riscv_iommu_pci_realize() > and > > riscv_iommu_sys_realize(), and by that time the pci_bus_num() is > already assigned. > > Other iommus use pci_bus_num() into their own get_address_space() > callbacks like > > this too. > > > > > > Hi Daniel, > > > > IIUC, pci_bus_num() by default is assigned to pcibus_num(): > > > > static int pcibus_num(PCIBus *bus) > > { > > if (pci_bus_is_root(bus)) { > > return 0; /* pci host bridge */ > > } > > return bus->parent_dev->config[PCI_SECONDARY_BUS]; > > } > > > > If the bus is not the root bus, it tries to read the bus' parent device's > > secondary bus number (PCI_SECONDARY_BUS) field in the PCI configuration > space. > > This field should be programmable by the SW during PCIe enumeration. > > But I don't think SW has a chance to be executed before > riscv_iommu_sys_realize() is called, > > since it's pretty early before CPU's execution unless RISC-V IOMMU is > hot-plugged. > > Even if RISC-V IOMMU is hot-plugged, I think riscv_iommu_sys_realize() > is still called > > before SW aware of the existence of IOMMU on the PCI topology tree. > > > > Do you think this matches your observation? > > It does. You have a good point on how the pcibus_num() can vary if SW > wants to > change the PCI_SECONDARY_BUS and the IOMMU isn't in a root bus. Note that > this > will not happen with the existing riscv-iommu-pci device as it is now, > since it > has code to prevent the device to be attached to non-pci root buses, but > there's > no restrictions in the riscv-iommu-sys device. > Thanks for the explanation. Do you know where this limitation is from? Is it in this patchset or it's somewhere else in the Linux RISC-V IOMMU driver? BTW, for the case like DesignWare PCIe host controller [1], we cannot connect RISC-V IOMMU to the root bus ("pcie") [2] because it already has a child bus ("dw-pcie") connecting to it [3]. If we try to connect RISC-V IOMMU to the root bus ("pcie"), it can't be discovered by Linux PCIe driver as a PCIe Downstream Port normally leads to a Link with only Device 0 on it. PCIe spec 6.0, section 7.3.1 stats: "Downstream Ports that do not have ARI Forwarding enabled must associate only Device 0 with the device attached to the Logical Bus representing the Link from the Port." The PCI slot scan is early returned in the Linux PCIe driver [4][5]. Do you think it's possible to remove this limitation? [1] https://github.com/qemu/qemu/blob/master/hw/pci-host/designware.c [2] https://github.com/qemu/qemu/blob/master/hw/pci-host/designware.c#L695 [3] https://github.com/qemu/qemu/blob/master/hw/pci-host/designware.c#L409 [4] https://github.com/torvalds/linux/blob/master/drivers/pci/probe.c#L2674 [5] https://github.com/torvalds/linux/blob/master/drivers/pci/probe.c#L2652 Regards, Frank Chang > > And speaking on riscv-iommu-bus, the current device we have in this series > is too > bare bones, without an actual use case for it (e.g. code to add it in the > 'virt' > machine), but it's getting in the way nevertheless. > > I'll remove the riscv-iommu-sys device from v3 and re-introduce it in a > later > revision or as a follow up series. Sunil has a handful of patches that add > the > riscv-iommu-sys device in the 'virt' machine and the proper ACPI support > for it [1], > and I intend to use them as a base. We'll then need some minor adjustments > in the > existing code to make it fully functional like we're doing with > riscv-iommu-pci. > > > Thanks, > > Daniel > > > [1] https://github.com/vlsunil/qemu/commits/acpi_rimt_poc_v1/ > > > > Regards, > > Frank Chang > > > > > > > > Thanks, > > > > > > Daniel > > > > > > > > > > I'll see how other IOMMUs are handling their iommu_find_as() > > > > > > > > > Thanks, > > > > > > > > > Daniel > > > > > > > > >> > > >> > > >> > > >> > > >>> + s = s->iommus.le_next; > > >>> + } > > >>> + > > >>> + return as ? as : &address_space_memory; > > >>> +} > > >>> + > > >>> +static const PCIIOMMUOps riscv_iommu_ops = { > > >>> + .get_address_space = riscv_iommu_find_as, > > >>> +}; > > >>> + > > >>> +void riscv_iommu_pci_setup_iommu(RISCVIOMMUState *iommu, > PCIBus *bus, > > >>> + Error **errp) > > >>> +{ > > >>> + if (bus->iommu_ops && > > >>> + bus->iommu_ops->get_address_space == > riscv_iommu_find_as) { > > >>> + /* Allow multiple IOMMUs on the same PCIe bus, link > known devices */ > > >>> + RISCVIOMMUState *last = (RISCVIOMMUState > *)bus->iommu_opaque; > > >>> + QLIST_INSERT_AFTER(last, iommu, iommus); > > >>> + } else if (bus->iommu_ops == NULL) { > > >>> + pci_setup_iommu(bus, &riscv_iommu_ops, iommu); > > >>> + } else { > > >>> + error_setg(errp, "can't register secondary IOMMU for > PCI bus #%d", > > >>> + pci_bus_num(bus)); > > >>> + } > > >>> +} > > >>> + > > >>> +static int riscv_iommu_memory_region_index(IOMMUMemoryRegion > *iommu_mr, > > >>> + MemTxAttrs attrs) > > >>> +{ > > >>> + return attrs.unspecified ? RISCV_IOMMU_NOPASID : > (int)attrs.pasid; > > >>> +} > > >>> + > > >>> +static int > riscv_iommu_memory_region_index_len(IOMMUMemoryRegion *iommu_mr) > > >>> +{ > > >>> + RISCVIOMMUSpace *as = container_of(iommu_mr, > RISCVIOMMUSpace, iova_mr); > > >>> + return 1 << as->iommu->pasid_bits; > > >>> +} > > >>> + > > >>> +static void riscv_iommu_memory_region_init(ObjectClass *klass, > void *data) > > >>> +{ > > >>> + IOMMUMemoryRegionClass *imrc = > IOMMU_MEMORY_REGION_CLASS(klass); > > >>> + > > >>> + imrc->translate = riscv_iommu_memory_region_translate; > > >>> + imrc->notify_flag_changed = > riscv_iommu_memory_region_notify; > > >>> + imrc->attrs_to_index = riscv_iommu_memory_region_index; > > >>> + imrc->num_indexes = riscv_iommu_memory_region_index_len; > > >>> +} > > >>> + > > >>> +static const TypeInfo riscv_iommu_memory_region_info = { > > >>> + .parent = TYPE_IOMMU_MEMORY_REGION, > > >>> + .name = TYPE_RISCV_IOMMU_MEMORY_REGION, > > >>> + .class_init = riscv_iommu_memory_region_init, > > >>> +}; > > >>> + > > >>> +static void riscv_iommu_register_mr_types(void) > > >>> +{ > > >>> + type_register_static(&riscv_iommu_memory_region_info); > > >>> + type_register_static(&riscv_iommu_info); > > >>> +} > > >>> + > > >>> +type_init(riscv_iommu_register_mr_types); > > >>> diff --git a/hw/riscv/riscv-iommu.h b/hw/riscv/riscv-iommu.h > > >>> new file mode 100644 > > >>> index 0000000000..6f740de690 > > >>> --- /dev/null > > >>> +++ b/hw/riscv/riscv-iommu.h > > >>> @@ -0,0 +1,141 @@ > > >>> +/* > > >>> + * QEMU emulation of an RISC-V IOMMU (Ziommu) > > >>> + * > > >>> + * Copyright (C) 2022-2023 Rivos Inc. > > >>> + * > > >>> + * This program is free software; you can redistribute it > and/or modify > > >>> + * it under the terms of the GNU General Public License as > published by > > >>> + * the Free Software Foundation; either version 2 of the > License. > > >>> + * > > >>> + * This program is distributed in the hope that it will be > useful, > > >>> + * but WITHOUT ANY WARRANTY; without even the implied warranty > of > > >>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See > the > > >>> + * GNU General Public License for more details. > > >>> + * > > >>> + * You should have received a copy of the GNU General Public > License along > > >>> + * with this program; if not, see < > http://www.gnu.org/licenses/ <http://www.gnu.org/licenses/>>. > > >>> + */ > > >>> + > > >>> +#ifndef HW_RISCV_IOMMU_STATE_H > > >>> +#define HW_RISCV_IOMMU_STATE_H > > >>> + > > >>> +#include "qemu/osdep.h" > > >>> +#include "qom/object.h" > > >>> + > > >>> +#include "hw/riscv/iommu.h" > > >>> + > > >>> +struct RISCVIOMMUState { > > >>> + /*< private >*/ > > >>> + DeviceState parent_obj; > > >>> + > > >>> + /*< public >*/ > > >>> + uint32_t version; /* Reported interface version number > */ > > >>> + uint32_t pasid_bits; /* process identifier width */ > > >>> + uint32_t bus; /* PCI bus mapping for non-root > endpoints */ > > >>> + > > >>> + uint64_t cap; /* IOMMU supported capabilities */ > > >>> + uint64_t fctl; /* IOMMU enabled features */ > > >>> + > > >>> + bool enable_off; /* Enable out-of-reset OFF mode (DMA > disabled) */ > > >>> + bool enable_msi; /* Enable MSI remapping */ > > >>> + > > >>> + /* IOMMU Internal State */ > > >>> + uint64_t ddtp; /* Validated Device Directory Tree > Root Pointer */ > > >>> + > > >>> + dma_addr_t cq_addr; /* Command queue base physical > address */ > > >>> + dma_addr_t fq_addr; /* Fault/event queue base physical > address */ > > >>> + dma_addr_t pq_addr; /* Page request queue base physical > address */ > > >>> + > > >>> + uint32_t cq_mask; /* Command queue index bit mask */ > > >>> + uint32_t fq_mask; /* Fault/event queue index bit mask > */ > > >>> + uint32_t pq_mask; /* Page request queue index bit mask > */ > > >>> + > > >>> + /* interrupt notifier */ > > >>> + void (*notify)(RISCVIOMMUState *iommu, unsigned vector); > > >>> + > > >>> + /* IOMMU State Machine */ > > >>> + QemuThread core_proc; /* Background processing thread */ > > >>> + QemuMutex core_lock; /* Global IOMMU lock, used for > cache/regs updates */ > > >>> + QemuCond core_cond; /* Background processing wake up > signal */ > > >>> + unsigned core_exec; /* Processing thread execution > actions */ > > >>> + > > >>> + /* IOMMU target address space */ > > >>> + AddressSpace *target_as; > > >>> + MemoryRegion *target_mr; > > >>> + > > >>> + /* MSI / MRIF access trap */ > > >>> + AddressSpace trap_as; > > >>> + MemoryRegion trap_mr; > > >>> + > > >>> + GHashTable *ctx_cache; /* Device translation > Context Cache */ > > >>> + > > >>> + /* MMIO Hardware Interface */ > > >>> + MemoryRegion regs_mr; > > >>> + QemuSpin regs_lock; > > >>> + uint8_t *regs_rw; /* register state (user write) */ > > >>> + uint8_t *regs_wc; /* write-1-to-clear mask */ > > >>> + uint8_t *regs_ro; /* read-only mask */ > > >>> + > > >>> + QLIST_ENTRY(RISCVIOMMUState) iommus; > > >>> + QLIST_HEAD(, RISCVIOMMUSpace) spaces; > > >>> +}; > > >>> + > > >>> +void riscv_iommu_pci_setup_iommu(RISCVIOMMUState *iommu, > PCIBus *bus, > > >>> + Error **errp); > > >>> + > > >>> +/* private helpers */ > > >>> + > > >>> +/* Register helper functions */ > > >>> +static inline uint32_t riscv_iommu_reg_mod32(RISCVIOMMUState > *s, > > >>> + unsigned idx, uint32_t set, uint32_t clr) > > >>> +{ > > >>> + uint32_t val; > > >>> + qemu_spin_lock(&s->regs_lock); > > >>> + val = ldl_le_p(s->regs_rw + idx); > > >>> + stl_le_p(s->regs_rw + idx, (val & ~clr) | set); > > >>> + qemu_spin_unlock(&s->regs_lock); > > >>> + return val; > > >>> +} > > >>> + > > >>> +static inline void riscv_iommu_reg_set32(RISCVIOMMUState *s, > > >>> + unsigned idx, uint32_t set) > > >>> +{ > > >>> + qemu_spin_lock(&s->regs_lock); > > >>> + stl_le_p(s->regs_rw + idx, set); > > >>> + qemu_spin_unlock(&s->regs_lock); > > >>> +} > > >>> + > > >>> +static inline uint32_t riscv_iommu_reg_get32(RISCVIOMMUState > *s, > > >>> + unsigned idx) > > >>> +{ > > >>> + return ldl_le_p(s->regs_rw + idx); > > >>> +} > > >>> + > > >>> +static inline uint64_t riscv_iommu_reg_mod64(RISCVIOMMUState > *s, > > >>> + unsigned idx, uint64_t set, uint64_t clr) > > >>> +{ > > >>> + uint64_t val; > > >>> + qemu_spin_lock(&s->regs_lock); > > >>> + val = ldq_le_p(s->regs_rw + idx); > > >>> + stq_le_p(s->regs_rw + idx, (val & ~clr) | set); > > >>> + qemu_spin_unlock(&s->regs_lock); > > >>> + return val; > > >>> +} > > >>> + > > >>> +static inline void riscv_iommu_reg_set64(RISCVIOMMUState *s, > > >>> + unsigned idx, uint64_t set) > > >>> +{ > > >>> + qemu_spin_lock(&s->regs_lock); > > >>> + stq_le_p(s->regs_rw + idx, set); > > >>> + qemu_spin_unlock(&s->regs_lock); > > >>> +} > > >>> + > > >>> +static inline uint64_t riscv_iommu_reg_get64(RISCVIOMMUState > *s, > > >>> + unsigned idx) > > >>> +{ > > >>> + return ldq_le_p(s->regs_rw + idx); > > >>> +} > > >>> + > > >>> + > > >>> + > > >>> +#endif > > >>> diff --git a/hw/riscv/trace-events b/hw/riscv/trace-events > > >>> new file mode 100644 > > >>> index 0000000000..42a97caffa > > >>> --- /dev/null > > >>> +++ b/hw/riscv/trace-events > > >>> @@ -0,0 +1,11 @@ > > >>> +# See documentation at docs/devel/tracing.rst > > >>> + > > >>> +# riscv-iommu.c > > >>> +riscv_iommu_new(const char *id, unsigned b, unsigned d, > unsigned f) "%s: device attached %04x:%02x.%d" > > >>> +riscv_iommu_flt(const char *id, unsigned b, unsigned d, > unsigned f, uint64_t reason, uint64_t iova) "%s: fault %04x:%02x.%u reason: > 0x%"PRIx64" iova: 0x%"PRIx64 > > >>> +riscv_iommu_pri(const char *id, unsigned b, unsigned d, > unsigned f, uint64_t iova) "%s: page request %04x:%02x.%u iova: 0x%"PRIx64 > > >>> +riscv_iommu_dma(const char *id, unsigned b, unsigned d, > unsigned f, unsigned pasid, const char *dir, uint64_t iova, uint64_t phys) > "%s: translate %04x:%02x.%u #%u %s 0x%"PRIx64" -> 0x%"PRIx64 > > >>> +riscv_iommu_msi(const char *id, unsigned b, unsigned d, > unsigned f, uint64_t iova, uint64_t phys) "%s: translate %04x:%02x.%u MSI > 0x%"PRIx64" -> 0x%"PRIx64 > > >>> +riscv_iommu_cmd(const char *id, uint64_t l, uint64_t u) "%s: > command 0x%"PRIx64" 0x%"PRIx64 > > >>> +riscv_iommu_notifier_add(const char *id) "%s: dev-iotlb > notifier added" > > >>> +riscv_iommu_notifier_del(const char *id) "%s: dev-iotlb > notifier removed" > > >>> diff --git a/hw/riscv/trace.h b/hw/riscv/trace.h > > >>> new file mode 100644 > > >>> index 0000000000..b88504b750 > > >>> --- /dev/null > > >>> +++ b/hw/riscv/trace.h > > >>> @@ -0,0 +1,2 @@ > > >>> +#include "trace/trace-hw_riscv.h" > > >>> + > > >>> diff --git a/include/hw/riscv/iommu.h b/include/hw/riscv/iommu.h > > >>> new file mode 100644 > > >>> index 0000000000..403b365893 > > >>> --- /dev/null > > >>> +++ b/include/hw/riscv/iommu.h > > >>> @@ -0,0 +1,36 @@ > > >>> +/* > > >>> + * QEMU emulation of an RISC-V IOMMU (Ziommu) > > >>> + * > > >>> + * Copyright (C) 2022-2023 Rivos Inc. > > >>> + * > > >>> + * This program is free software; you can redistribute it > and/or modify > > >>> + * it under the terms of the GNU General Public License as > published by > > >>> + * the Free Software Foundation; either version 2 of the > License. > > >>> + * > > >>> + * This program is distributed in the hope that it will be > useful, > > >>> + * but WITHOUT ANY WARRANTY; without even the implied warranty > of > > >>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See > the > > >>> + * GNU General Public License for more details. > > >>> + * > > >>> + * You should have received a copy of the GNU General Public > License along > > >>> + * with this program; if not, see < > http://www.gnu.org/licenses/ <http://www.gnu.org/licenses/>>. > > >>> + */ > > >>> + > > >>> +#ifndef HW_RISCV_IOMMU_H > > >>> +#define HW_RISCV_IOMMU_H > > >>> + > > >>> +#include "qemu/osdep.h" > > >>> +#include "qom/object.h" > > >>> + > > >>> +#define TYPE_RISCV_IOMMU "riscv-iommu" > > >>> +OBJECT_DECLARE_SIMPLE_TYPE(RISCVIOMMUState, RISCV_IOMMU) > > >>> +typedef struct RISCVIOMMUState RISCVIOMMUState; > > >>> + > > >>> +#define TYPE_RISCV_IOMMU_MEMORY_REGION "riscv-iommu-mr" > > >>> +typedef struct RISCVIOMMUSpace RISCVIOMMUSpace; > > >>> + > > >>> +#define TYPE_RISCV_IOMMU_PCI "riscv-iommu-pci" > > >>> +OBJECT_DECLARE_SIMPLE_TYPE(RISCVIOMMUStatePci, RISCV_IOMMU_PCI) > > >>> +typedef struct RISCVIOMMUStatePci RISCVIOMMUStatePci; > > >>> + > > >>> +#endif > > >>> diff --git a/meson.build b/meson.build > > >>> index c59ca496f2..75e56f3282 100644 > > >>> --- a/meson.build > > >>> +++ b/meson.build > > >>> @@ -3361,6 +3361,7 @@ if have_system > > >>> 'hw/rdma', > > >>> 'hw/rdma/vmw', > > >>> 'hw/rtc', > > >>> + 'hw/riscv', > > >>> 'hw/s390x', > > >>> 'hw/scsi', > > >>> 'hw/sd', > > >>> -- > > >>> 2.43.2 > > >>> > > >>> > > >