Re: Question: KVM: Failed to bind vfio with PCI-e / SMMU on Juno-r2
Hi Robin,

On Mon, Mar 18, 2019 at 12:25:33PM +, Robin Murphy wrote:

[...]

> > diff --git a/arm/include/arm-common/kvm-arch.h b/arm/include/arm-common/kvm-arch.h
> > index b9d486d..43f78b1 100644
> > --- a/arm/include/arm-common/kvm-arch.h
> > +++ b/arm/include/arm-common/kvm-arch.h
> > @@ -7,10 +7,10 @@
> >
> >  #include "arm-common/gic.h"
> >
> > -#define ARM_IOPORT_AREA	_AC(0x, UL)
> > -#define ARM_MMIO_AREA	_AC(0x0001, UL)
> > -#define ARM_AXI_AREA	_AC(0x4000, UL)
> > -#define ARM_MEMORY_AREA	_AC(0x8000, UL)
> > +#define ARM_IOPORT_AREA	_AC(0x8000, UL)
> > +#define ARM_MMIO_AREA	_AC(0x8001, UL)
> > +#define ARM_AXI_AREA	_AC(0x8800, UL)
> > +#define ARM_MEMORY_AREA	_AC(0x9000, UL)
> >
> > Anyway, I very much appreciate the suggestions; they are sufficient for me to
> > dig further into the memory-related information (e.g. PCIe configuration,
> > IOMMU, etc.) and I will keep you posted if I make any progress.
>
> None of those should need to change (all the MMIO emulation stuff is
> irrelevant to PCIe DMA anyway) - provided you don't give the guest more than
> 2GB of RAM, passthrough with legacy INTx ought to work out-of-the-box. For
> MSIs to get through, you'll further need to change the host kernel to place
> its software MSI region[2] within any of the host bridge windows as well.

From the PCI configuration dump, I can see that after launching the guest with kvmtool, the host receives the first interrupt (confirmed by checking that vfio_intx_handler() has been invoked once), and then the PCI core sends a command with PCI_COMMAND_INTX_DISABLE to disable the interrupt line. So in this flow it is very likely that the interrupt is not forwarded properly and the guest never receives it.
Luckily, I found that the below flow lets interrupts be forwarded from host to guest, after I set "sky2.disable_msi=1" on both kernel command lines:

  host          guest
  INTx mode     INTx mode

So far, it still cannot work if I only set "sky2.disable_msi=1" on the host kernel command line; with that config it runs with the below flow, which cannot forward interrupts properly from host to guest:

  host          guest
  INTx mode     msi enable
                msi disable
                switch back to INTx mode

I am so happy that I can now use pure INTx mode on the Juno board for NIC enabling, and I pinged successfully from the guest OS to my router :)

Will look into the issue in the second scenario; and if I have more time I will look into MSI mode as well (I confirmed MSI mode works with the host OS but fails in the guest OS).

Very much appreciate you & Eric helping!

Thanks,
Leo Yan

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [RFC PATCH] KVM: arm/arm64: Enable direct irqfd MSI injection
Hi all,

On 2019/3/18 3:35, Marc Zyngier wrote:
> On Sun, 17 Mar 2019 14:36:13 +, Zenghui Yu wrote:
>> Currently, IRQFD on arm still uses the deferred workqueue mechanism
>> to inject interrupts into guest, which will likely lead to a busy
>> context-switching from/to the kworker thread. This overhead is for
>> no purpose (only in my view ...) and will result in an interrupt
>> performance degradation.
>>
>> Implement kvm_arch_set_irq_inatomic() for arm/arm64 to support direct
>> irqfd MSI injection, by which we can get rid of the annoying latency.
>> As a result, irqfd MSI intensive scenarios (e.g., DPDK with high packet
>> processing workloads) will benefit from it.
>>
>> Signed-off-by: Zenghui Yu
>> ---
>>
>> It seems that only MSI will follow the IRQFD path, did I miss something?
>>
>> This patch is still under test and sent out for early feedback. If I have
>> any mis-understanding, please fix me up and let me know. Thanks!
>
> As mentioned by other folks in the thread, this is clearly wrong. The
> first thing kvm_inject_msi does is to lock the corresponding ITS using
> a mutex. So the "no purpose" bit was a bit too quick. When doing this
> kind of work, I suggest you enable lockdep and all the related
> checkers. Also, for any optimisation, please post actual numbers for
> the relevant benchmarks. Saying "application X will benefit from it"
> is meaningless without any actual data.
>> ---
>>  virt/kvm/arm/vgic/trace.h      | 22 ++
>>  virt/kvm/arm/vgic/vgic-irqfd.c | 21 +
>>  2 files changed, 43 insertions(+)
>>
>> diff --git a/virt/kvm/arm/vgic/trace.h b/virt/kvm/arm/vgic/trace.h
>> index 55fed77..bc1f4db 100644
>> --- a/virt/kvm/arm/vgic/trace.h
>> +++ b/virt/kvm/arm/vgic/trace.h
>> @@ -27,6 +27,28 @@
>>  		  __entry->vcpu_id, __entry->irq, __entry->level)
>>  );
>>
>> +TRACE_EVENT(kvm_arch_set_irq_inatomic,
>> +	TP_PROTO(u32 gsi, u32 type, int level, int irq_source_id),
>> +	TP_ARGS(gsi, type, level, irq_source_id),
>> +
>> +	TP_STRUCT__entry(
>> +		__field(u32, gsi)
>> +		__field(u32, type)
>> +		__field(int, level)
>> +		__field(int, irq_source_id)
>> +	),
>> +
>> +	TP_fast_assign(
>> +		__entry->gsi = gsi;
>> +		__entry->type = type;
>> +		__entry->level = level;
>> +		__entry->irq_source_id = irq_source_id;
>> +	),
>> +
>> +	TP_printk("gsi %u type %u level %d source %d", __entry->gsi,
>> +		  __entry->type, __entry->level, __entry->irq_source_id)
>> +);
>> +
>>  #endif /* _TRACE_VGIC_H */
>>
>>  #undef TRACE_INCLUDE_PATH
>> diff --git a/virt/kvm/arm/vgic/vgic-irqfd.c b/virt/kvm/arm/vgic/vgic-irqfd.c
>> index 99e026d..4cfc3f4 100644
>> --- a/virt/kvm/arm/vgic/vgic-irqfd.c
>> +++ b/virt/kvm/arm/vgic/vgic-irqfd.c
>> @@ -19,6 +19,7 @@
>>  #include
>>  #include
>>  #include "vgic.h"
>> +#include "trace.h"
>>
>>  /**
>>   * vgic_irqfd_set_irq: inject the IRQ corresponding to the
>> @@ -105,6 +106,26 @@ int kvm_set_msi(struct kvm_kernel_irq_routing_entry *e,
>>  	return vgic_its_inject_msi(kvm, );
>>  }
>>
>> +/**
>> + * kvm_arch_set_irq_inatomic: fast-path for irqfd injection
>> + *
>> + * Currently only direct MSI injecton is supported.
>> + */
>> +int kvm_arch_set_irq_inatomic(struct kvm_kernel_irq_routing_entry *e,
>> +			      struct kvm *kvm, int irq_source_id, int level,
>> +			      bool line_status)
>> +{
>> +	int ret;
>> +
>> +	trace_kvm_arch_set_irq_inatomic(e->gsi, e->type, level, irq_source_id);
>> +
>> +	if (unlikely(e->type != KVM_IRQ_ROUTING_MSI))
>> +		return -EWOULDBLOCK;
>> +
>> +	ret = kvm_set_msi(e, kvm, irq_source_id, level, line_status);
>> +	return ret;
>> +}
>
> Although we've established that the approach is wrong, maybe we can
> look at improving this aspect.
> A first approach would be to keep a small cache of the last few
> successful translations for this ITS, cache that could be looked-up by
> holding a spinlock instead. A hit in this cache could directly be
> injected. Any command that invalidates or changes anything (DISCARD,
> INV, INVALL, MAPC with V=0, MAPD with V=0, MOVALL, MOVI) should nuke
> the cache altogether.
>
> Of course, all of that needs to be quantified.

Thanks for all of your explanations, especially Marc's suggestions! It took me a long time to figure out my mistakes, since I am not very familiar with the locking stuff. I have to apologize for my noise.

As for the its-translation-cache code (really good news to us), we have taken a rough look at it and are starting to test it now!

thanks,
zenghui

> Thanks,
>
> M.
Re: [RFC PATCH] KVM: arm/arm64: Enable direct irqfd MSI injection
On Sun, 2019-03-17 at 14:36 +, Zenghui Yu wrote:
> Currently, IRQFD on arm still uses the deferred workqueue mechanism
> to inject interrupts into guest, which will likely lead to a busy
> context-switching from/to the kworker thread. This overhead is for
> no purpose (only in my view ...) and will result in an interrupt
> performance degradation.
>
> Implement kvm_arch_set_irq_inatomic() for arm/arm64 to support direct
> irqfd MSI injection, by which we can get rid of the annoying latency.
> As a result, irqfd MSI intensive scenarios (e.g., DPDK with high packet
> processing workloads) will benefit from it.
>
> Signed-off-by: Zenghui Yu
> ---
>
> It seems that only MSI will follow the IRQFD path, did I miss something?
>
> This patch is still under test and sent out for early feedback. If I have
> any mis-understanding, please fix me up and let me know. Thanks!
>
> ---
>  virt/kvm/arm/vgic/trace.h      | 22 ++
>  virt/kvm/arm/vgic/vgic-irqfd.c | 21 +
>  2 files changed, 43 insertions(+)
>
> diff --git a/virt/kvm/arm/vgic/trace.h b/virt/kvm/arm/vgic/trace.h
> index 55fed77..bc1f4db 100644
> --- a/virt/kvm/arm/vgic/trace.h
> +++ b/virt/kvm/arm/vgic/trace.h
> @@ -27,6 +27,28 @@
>  		  __entry->vcpu_id, __entry->irq, __entry->level)
>  );
>
> +TRACE_EVENT(kvm_arch_set_irq_inatomic,
> +	TP_PROTO(u32 gsi, u32 type, int level, int irq_source_id),
> +	TP_ARGS(gsi, type, level, irq_source_id),
> +
> +	TP_STRUCT__entry(
> +		__field(u32, gsi)
> +		__field(u32, type)
> +		__field(int, level)
> +		__field(int, irq_source_id)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->gsi = gsi;
> +		__entry->type = type;
> +		__entry->level = level;
> +		__entry->irq_source_id = irq_source_id;
> +	),
> +
> +	TP_printk("gsi %u type %u level %d source %d", __entry->gsi,
> +		  __entry->type, __entry->level, __entry->irq_source_id)
> +);
> +
>  #endif /* _TRACE_VGIC_H */
>
>  #undef TRACE_INCLUDE_PATH
> diff --git a/virt/kvm/arm/vgic/vgic-irqfd.c b/virt/kvm/arm/vgic/vgic-irqfd.c
> index 99e026d..4cfc3f4 100644
> --- a/virt/kvm/arm/vgic/vgic-irqfd.c
> +++ b/virt/kvm/arm/vgic/vgic-irqfd.c
> @@ -19,6 +19,7 @@
>  #include
>  #include
>  #include "vgic.h"
> +#include "trace.h"
>
>  /**
>   * vgic_irqfd_set_irq: inject the IRQ corresponding to the
> @@ -105,6 +106,26 @@ int kvm_set_msi(struct kvm_kernel_irq_routing_entry *e,
>  	return vgic_its_inject_msi(kvm, );
>  }
>
> +/**
> + * kvm_arch_set_irq_inatomic: fast-path for irqfd injection
> + *
> + * Currently only direct MSI injecton is supported.
> + */
> +int kvm_arch_set_irq_inatomic(struct kvm_kernel_irq_routing_entry *e,
> +			      struct kvm *kvm, int irq_source_id, int level,
> +			      bool line_status)
> +{
> +	int ret;
> +
> +	trace_kvm_arch_set_irq_inatomic(e->gsi, e->type, level, irq_source_id);
> +
> +	if (unlikely(e->type != KVM_IRQ_ROUTING_MSI))
> +		return -EWOULDBLOCK;
> +
> +	ret = kvm_set_msi(e, kvm, irq_source_id, level, line_status);

The implementation of kvm_set_msi is not atomic: there is a mutex held in one of its execution paths. That is why it cannot be used directly in this atomic context.

> +	return ret;
> +}
> +
>  int kvm_vgic_setup_default_irq_routing(struct kvm *kvm)
>  {
>  	struct kvm_irq_routing_entry *entries;

Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrer: Christian Schlaeger, Ralf Herbrich
Ust-ID: DE 289 237 879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B
Re: [RFC] Question about TLB flush while set Stage-2 huge pages
Hi!

On Sun, Mar 17, 2019 at 09:34:11PM +0800, Zenghui Yu wrote:
> Hi Suzuki,
>
> On 2019/3/15 22:56, Suzuki K Poulose wrote:
> > Hi Zhengui,
>
> s/Zhengui/Zheng/
>
> (I think you must have wanted to say "Hi" to Zheng :-) )

Sorry about that.

> I have looked into your patch and the kernel log, and I believe that
> your patch has already addressed this issue. But I think we can do it
> a little better - two more points need to be handled with caution.
>
> Take a PMD hugepage (PMD_SIZE == 2M) for example:
> ...
>
> > Thats bad, we seem to be making upto 4 unbalanced put_page().
>
> ---
>  virt/kvm/arm/mmu.c | 51 +++
>  1 file changed, 35 insertions(+), 16 deletions(-)
>
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index 66e0fbb5..04b0f9b 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -1076,24 +1076,38 @@ static int stage2_set_pmd_huge(struct kvm *kvm, struct kvm_mmu_memory_cache
>  	 * Skip updating the page table if the entry is
>  	 * unchanged.
>  	 */
> -	if (pmd_val(old_pmd) == pmd_val(*new_pmd))
> +	if (pmd_val(old_pmd) == pmd_val(*new_pmd)) {
>  		return 0;
> -
> +	} else if (WARN_ON_ONCE(!pmd_thp_or_huge(old_pmd))) {
>  	/*
> -	 * Mapping in huge pages should only happen through a
> -	 * fault.  If a page is merged into a transparent huge
> -	 * page, the individual subpages of that huge page
> -	 * should be unmapped through MMU notifiers before we
> -	 * get here.
> -	 *
> -	 * Merging of CompoundPages is not supported; they
> -	 * should become splitting first, unmapped, merged,
> -	 * and mapped back in on-demand.
> +	 * If we have PTE level mapping for this block,
> +	 * we must unmap it to avoid inconsistent TLB
> +	 * state. We could end up in this situation if
> +	 * the memory slot was marked for dirty logging
> +	 * and was reverted, leaving PTE level mappings
> +	 * for the pages accessed during the period.
> +	 * Normal THP split/merge follows mmu_notifier
> +	 * callbacks and do get handled accordingly.
> 	 */
> -	VM_BUG_ON(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));
> +	unmap_stage2_range(kvm, (addr & S2_PMD_MASK), S2_PMD_SIZE);
>
> First, using unmap_stage2_range() here is not quite appropriate. Suppose
> we've only accessed one 2M page in the HPA [x, x+1]GiB range, with the other
> pages unaccessed. What will happen if we unmap_stage2_range(this_2M_page)?
> We'll unexpectedly reach clear_stage2_pud_entry(), and things are going
> to get really bad. So we'd better use unmap_stage2_ptes() here, since we
> only want to unmap a 2M range.

Yes, you're right. If this PMD entry is the only entry in the parent PUD table, then the PUD table may get freed and we may install the entry in a table which is no longer plugged into the page table.

> Second, consider the below function stack:
>
>   unmap_stage2_ptes()
>     clear_stage2_pmd_entry()
>       put_page(virt_to_page(pmd))
>
> It seems that we have one "redundant" put_page() here (thus comes the
> bad kernel log ...), but actually we do not. By stage2_set_pmd_huge(),
> the PMD table entry will then point to a 2M block (originally pointed
> to a PTE table), so the _refcount of this PMD-level table page should _not_
> change after unmap_stage2_ptes(). So what we really should do is add
> a get_page() after unmapping to keep the _refcount in balance!

Yes, we need an additional refcount on the new huge pmd table if we are tearing down the PTE level table.

> thoughts? A simple patch below (based on yours) for details.
>
> thanks,
> zenghui
>
> > > It seems that kvm decreases the _refcount of the page twice in
> > > transparent_hugepage_adjust() and unmap_stage2_range().
> >
> > But I thought we should be doing that on the head_page already, as this is THP.
> > I will take a look and get back to you on this. Btw, is it possible for you
> > to turn on CONFIG_DEBUG_VM and re-run with the above patch?
> >
> > Kind regards
> > Suzuki
>
> ---8<---
> test: kvm: arm: Maybe two more fixes
>
> Applied based on Suzuki's patch.
> Signed-off-by: Zenghui Yu
> ---
>  virt/kvm/arm/mmu.c | 8 ++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index 05765df..ccd5d5d 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -1089,7 +1089,9 @@ static int stage2_set_pmd_huge(struct kvm *kvm, struct
> kvm_mmu_memory_cache
>  	 * Normal THP split/merge follows mmu_notifier
>  	 * callbacks and do get handled accordingly.
>  	 */
> -
Re: [RFC PATCH] KVM: arm/arm64: Enable direct irqfd MSI injection
On Sun, 17 Mar 2019 19:35:48 +, Marc Zyngier wrote:

[...]

> A first approach would be to keep a small cache of the last few
> successful translations for this ITS, cache that could be looked-up by
> holding a spinlock instead. A hit in this cache could directly be
> injected. Any command that invalidates or changes anything (DISCARD,
> INV, INVALL, MAPC with V=0, MAPD with V=0, MOVALL, MOVI) should nuke
> the cache altogether.

And to explain what I meant by this, I've pushed a branch[1] with a basic prototype. It is good enough to get a VM to boot, but I wouldn't trust it for anything serious just yet.

If anyone feels like giving it a go and checking whether it has any performance benefit, please do so.

Thanks,

M.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/log/?h=kvm-arm64/its-translation-cache

--
Without deviation from the norm, progress is not possible.
Re: [PATCH v5 05/22] iommu: Introduce cache_invalidate API
Hi Jean,

On 3/18/19 12:01 PM, Jean-Philippe Brucker wrote:
> On 17/03/2019 16:43, Auger Eric wrote:
>>>> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
>>>> index 532a64075f23..e4c6a447e85a 100644
>>>> --- a/include/uapi/linux/iommu.h
>>>> +++ b/include/uapi/linux/iommu.h
>>>> @@ -159,4 +159,75 @@ struct iommu_pasid_table_config {
>>>>  	};
>>>>  };
>>>>
>>>> +/* defines the granularity of the invalidation */
>>>> +enum iommu_inv_granularity {
>>>> +	IOMMU_INV_GRANU_DOMAIN,	/* domain-selective invalidation */
>>>> +	IOMMU_INV_GRANU_PASID,	/* pasid-selective invalidation */
>>>> +	IOMMU_INV_GRANU_ADDR,	/* page-selective invalidation */
>>>> +};
>>>> +
>>>> +/**
>>>> + * Address Selective Invalidation Structure
>>>> + *
>>>> + * @flags indicates the granularity of the address-selective invalidation
>>>> + * - if PASID bit is set, @pasid field is populated and the invalidation
>>>> + *   relates to cache entries tagged with this PASID and matching the
>>>> + *   address range.
>>>> + * - if ARCHID bit is set, @archid is populated and the invalidation relates
>>>> + *   to cache entries tagged with this architecture specific id and matching
>>>> + *   the address range.
>>>> + * - Both PASID and ARCHID can be set as they may tag different caches.
>>>> + * - if neither PASID or ARCHID is set, global addr invalidation applies
>>>> + * - LEAF flag indicates whether only the leaf PTE caching needs to be
>>>> + *   invalidated and other paging structure caches can be preserved.
>>>> + * @pasid: process address space id
>>>> + * @archid: architecture-specific id
>>>> + * @addr: first stage/level input address
>>>> + * @granule_size: page/block size of the mapping in bytes
>>>> + * @nb_granules: number of contiguous granules to be invalidated
>>>> + */
>>>> +struct iommu_inv_addr_info {
>>>> +#define IOMMU_INV_ADDR_FLAGS_PASID	(1 << 0)
>>>> +#define IOMMU_INV_ADDR_FLAGS_ARCHID	(1 << 1)
>>>> +#define IOMMU_INV_ADDR_FLAGS_LEAF	(1 << 2)
>>>> +	__u32	flags;
>>>> +	__u32	archid;
>>>> +	__u64	pasid;
>>>> +	__u64	addr;
>>>> +	__u64	granule_size;
>>>> +	__u64	nb_granules;
>>>> +};
>>>> +
>>>> +/**
>>>> + * First level/stage invalidation information
>>>> + * @cache: bitfield that allows to select which caches to invalidate
>>>> + * @granularity: defines the lowest granularity used for the invalidation:
>>>> + *     domain > pasid > addr
>>>> + *
>>>> + * Not all the combinations of cache/granularity make sense:
>>>> + *
>>>> + * type        | DEV_IOTLB | IOTLB | PASID cache |
>>>> + * granularity |           |       |             |
>>>> + * ------------+-----------+-------+-------------+
>>>> + * DOMAIN      |    N/A    |   Y   |      Y      |
>>>> + * PASID       |     Y     |   Y   |      Y      |
>>>> + * ADDR        |     Y     |   Y   |     N/A     |
>>>> + */
>>>> +struct iommu_cache_invalidate_info {
>>>> +#define IOMMU_CACHE_INVALIDATE_INFO_VERSION_1 1
>>>> +	__u32	version;
>>>> +/* IOMMU paging structure cache */
>>>> +#define IOMMU_CACHE_INV_TYPE_IOTLB	(1 << 0) /* IOMMU IOTLB */
>>>> +#define IOMMU_CACHE_INV_TYPE_DEV_IOTLB	(1 << 1) /* Device IOTLB */
>>>> +#define IOMMU_CACHE_INV_TYPE_PASID	(1 << 2) /* PASID cache */
>>>> +	__u8	cache;
>>>> +	__u8	granularity;
>>>> +	__u8	padding[2];
>>>> +	union {
>>>> +		__u64	pasid;
>>>
>>> just realized there is already a pasid field in the addr_info, do we
>>> still need this?
>>
>> I think so. Either you do a PASID based invalidation and you directly
>> use the pasid field or you do an address based invalidation and you use
>> the addr_info where the pasid may or may not be passed.
>
> I guess a comment would be useful?
>
> - Invalidations by %IOMMU_INV_GRANU_ADDR use field @addr_info.
> - Invalidations by %IOMMU_INV_GRANU_PASID use field @pasid.
> - Invalidations by %IOMMU_INV_GRANU_DOMAIN don't take an argument.

OK. I will add those comments in v7.
Thanks

Eric

> Thanks,
> Jean
>
>>
>> Thanks
>>
>> Eric
>>>> +		struct iommu_inv_addr_info addr_info;
>>>> +	};
>>>> +};
>>>> +
>>>> +
>>>>  #endif /* _UAPI_IOMMU_H */
>>>
>>> [Jacob Pan]
Re: Question: KVM: Failed to bind vfio with PCI-e / SMMU on Juno-r2
On 16/03/2019 04:56, Leo Yan wrote:

Hi Robin,

On Fri, Mar 15, 2019 at 12:54:10PM +, Robin Murphy wrote:

Hi Leo,

Sorry for the delay - I'm on holiday this week, but since I've made the mistake of glancing at my inbox I should probably save you from wasting any more time...

Sorry for disturbing you on holiday, and I appreciate your help. There is no rush to reply.

On 2019-03-15 11:03 am, Auger Eric wrote:

Hi Leo, + Jean-Philippe

On 3/15/19 10:37 AM, Leo Yan wrote:

Hi Eric, Robin,

On Wed, Mar 13, 2019 at 11:24:25AM +0100, Auger Eric wrote:

[...]

If the NIC supports MSIs they logically are used. This can easily be checked on the host by issuing "cat /proc/interrupts | grep vfio". Can you check whether the guest receives any interrupt? I remember that Robin said in the past that on Juno, the MSI doorbell was in the PCI host bridge window, and transactions towards the doorbell possibly could not reach it since they were considered peer-to-peer.

I found Robin's explanation again. It was not related to the MSI IOVA being within the PCI host bridge window, but to RAM GPA colliding with host PCI config space:

"MSI doorbells integral to PCIe root complexes (and thus untranslatable) typically have a programmable address, so could be anywhere. In the more general category of "special hardware addresses", QEMU's default ARM guest memory map puts RAM starting at 0x4000; on the ARM Juno platform, that happens to be where PCI config space starts; as Juno's PCIe doesn't support ACS, peer-to-peer or anything clever, if you assign the PCI bus to a guest (all of it, given the lack of ACS), the root complex just sees the guest's attempts to DMA to "memory" as the device attempting to access config space and aborts them."

Below is some follow-up investigation on my side:

Firstly, I must admit that I don't understand the above paragraph well; based on the description I am wondering whether I can use INTx mode, and whether I might be lucky enough to avoid this hardware pitfall.
The problem above is that during the assignment process, the virtualizer maps the whole guest RAM through the IOMMU (+ the MSI doorbell on ARM) to allow the device, programmed in GPA, to access the whole guest RAM. Unfortunately, if the device emits a DMA request with IOVA address 0x4000, this IOVA is interpreted by the Juno RC as a transaction towards the PCIe config space. So this DMA request will not go beyond the RC, will never reach the IOMMU and will never reach the guest RAM. So globally the device is not able to reach part of the guest RAM. That's how I interpret the above statement. Then I don't know the details of the collision; I don't have access to this HW. I don't know either whether this problem still exists on the r2 HW.

Thanks a lot for rephrasing, Eric :)

The short answer is that if you want PCI passthrough to work on Juno, the guest memory map has to look like a Juno. The PCIe root complex uses an internal lookup table to generate appropriate AXI attributes for outgoing PCIe transactions; unfortunately this has no notion of 'default' attributes, so addresses *must* match one of the programmed windows in order to be valid. From memory, EDK2 sets up a 2GB window covering the lower DRAM bank, an 8GB window covering the upper DRAM bank, and a 1MB (or thereabouts) window covering the GICv2m region with Device attributes.

I checked the kernel memory block info, and it gives the below result:

root@debian:~# cat /sys/kernel/debug/memblock/memory
   0: 0x8000..0xfeff
   1: 0x00088000..0x0009

So I think the lower 2GB DRAM window is [0x8000_..0xfeff_] and the upper DRAM window is [0x8_8000_..0x9__].

BTW, I am now using U-Boot rather than UEFI, so I'm not sure whether U-Boot has programmed the memory windows for PCIe. Could you point out which registers should be set in UEFI, so that I can also check the related configuration in U-Boot?
U-Boot does the same thing[1] - you can confirm that by whether PCIe works at all on the host ;)

Any PCIe transactions to addresses not within one of those windows will be aborted by the RC without ever going out to the AXI side where the SMMU lies (and I think anything matching the config space or I/O space windows, or a region claimed by a BAR, will be aborted even earlier as a peer-to-peer attempt, regardless of the AXI Translation Table setup). You could potentially modify the firmware to change the window configuration, but the alignment restrictions make it awkward.

I've only ever tested passthrough on Juno using kvmtool, which IIRC already has guest RAM in an appropriate place (and is trivially easy to hack if not) - I don't remember if I ever actually tried guest MSI with that.

I made several tries with kvmtool to tweak the memory regions, but had no luck. Since the host uses [0x8000_..0xfeff_] as the first valid memory window for PCIe, I tried to change all memory/IO regions into this window with the below changes, but with no luck:

diff --git
Re: [PATCH v5 05/22] iommu: Introduce cache_invalidate API
On 17/03/2019 16:43, Auger Eric wrote:
>>> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
>>> index 532a64075f23..e4c6a447e85a 100644
>>> --- a/include/uapi/linux/iommu.h
>>> +++ b/include/uapi/linux/iommu.h
>>> @@ -159,4 +159,75 @@ struct iommu_pasid_table_config {
>>>  	};
>>>  };
>>>
>>> +/* defines the granularity of the invalidation */
>>> +enum iommu_inv_granularity {
>>> +	IOMMU_INV_GRANU_DOMAIN,	/* domain-selective invalidation */
>>> +	IOMMU_INV_GRANU_PASID,	/* pasid-selective invalidation */
>>> +	IOMMU_INV_GRANU_ADDR,	/* page-selective invalidation */
>>> +};
>>> +
>>> +/**
>>> + * Address Selective Invalidation Structure
>>> + *
>>> + * @flags indicates the granularity of the address-selective invalidation
>>> + * - if PASID bit is set, @pasid field is populated and the invalidation
>>> + *   relates to cache entries tagged with this PASID and matching the
>>> + *   address range.
>>> + * - if ARCHID bit is set, @archid is populated and the invalidation relates
>>> + *   to cache entries tagged with this architecture specific id and matching
>>> + *   the address range.
>>> + * - Both PASID and ARCHID can be set as they may tag different caches.
>>> + * - if neither PASID or ARCHID is set, global addr invalidation applies
>>> + * - LEAF flag indicates whether only the leaf PTE caching needs to be
>>> + *   invalidated and other paging structure caches can be preserved.
>>> + * @pasid: process address space id
>>> + * @archid: architecture-specific id
>>> + * @addr: first stage/level input address
>>> + * @granule_size: page/block size of the mapping in bytes
>>> + * @nb_granules: number of contiguous granules to be invalidated
>>> + */
>>> +struct iommu_inv_addr_info {
>>> +#define IOMMU_INV_ADDR_FLAGS_PASID	(1 << 0)
>>> +#define IOMMU_INV_ADDR_FLAGS_ARCHID	(1 << 1)
>>> +#define IOMMU_INV_ADDR_FLAGS_LEAF	(1 << 2)
>>> +	__u32	flags;
>>> +	__u32	archid;
>>> +	__u64	pasid;
>>> +	__u64	addr;
>>> +	__u64	granule_size;
>>> +	__u64	nb_granules;
>>> +};
>>> +
>>> +/**
>>> + * First level/stage invalidation information
>>> + * @cache: bitfield that allows to select which caches to invalidate
>>> + * @granularity: defines the lowest granularity used for the invalidation:
>>> + *     domain > pasid > addr
>>> + *
>>> + * Not all the combinations of cache/granularity make sense:
>>> + *
>>> + * type        | DEV_IOTLB | IOTLB | PASID cache |
>>> + * granularity |           |       |             |
>>> + * ------------+-----------+-------+-------------+
>>> + * DOMAIN      |    N/A    |   Y   |      Y      |
>>> + * PASID       |     Y     |   Y   |      Y      |
>>> + * ADDR        |     Y     |   Y   |     N/A     |
>>> + */
>>> +struct iommu_cache_invalidate_info {
>>> +#define IOMMU_CACHE_INVALIDATE_INFO_VERSION_1 1
>>> +	__u32	version;
>>> +/* IOMMU paging structure cache */
>>> +#define IOMMU_CACHE_INV_TYPE_IOTLB	(1 << 0) /* IOMMU IOTLB */
>>> +#define IOMMU_CACHE_INV_TYPE_DEV_IOTLB	(1 << 1) /* Device IOTLB */
>>> +#define IOMMU_CACHE_INV_TYPE_PASID	(1 << 2) /* PASID cache */
>>> +	__u8	cache;
>>> +	__u8	granularity;
>>> +	__u8	padding[2];
>>> +	union {
>>> +		__u64	pasid;
>>
>> just realized there is already a pasid field in the addr_info, do we
>> still need this?
>
> I think so. Either you do a PASID based invalidation and you directly
> use the pasid field or you do an address based invalidation and you use
> the addr_info where the pasid may or may not be passed.

I guess a comment would be useful?

- Invalidations by %IOMMU_INV_GRANU_ADDR use field @addr_info.
- Invalidations by %IOMMU_INV_GRANU_PASID use field @pasid.
- Invalidations by %IOMMU_INV_GRANU_DOMAIN don't take an argument.

Thanks,
Jean

>
> Thanks
>
> Eric
>>> +		struct iommu_inv_addr_info addr_info;
>>> +	};
>>> +};
>>> +
>>> +
>>>  #endif /* _UAPI_IOMMU_H */
>>
>> [Jacob Pan]