Re: Question: KVM: Failed to bind vfio with PCI-e / SMMU on Juno-r2

2019-03-18 Thread Leo Yan
Hi Robin,

On Mon, Mar 18, 2019 at 12:25:33PM +0000, Robin Murphy wrote:

[...]

> > diff --git a/arm/include/arm-common/kvm-arch.h b/arm/include/arm-common/kvm-arch.h
> > index b9d486d..43f78b1 100644
> > --- a/arm/include/arm-common/kvm-arch.h
> > +++ b/arm/include/arm-common/kvm-arch.h
> > @@ -7,10 +7,10 @@
> > 
> >   #include "arm-common/gic.h"
> > 
> > -#define ARM_IOPORT_AREA	_AC(0x0000000000000000, UL)
> > -#define ARM_MMIO_AREA	_AC(0x0000000000010000, UL)
> > -#define ARM_AXI_AREA	_AC(0x0000000040000000, UL)
> > -#define ARM_MEMORY_AREA	_AC(0x0000000080000000, UL)
> > +#define ARM_IOPORT_AREA	_AC(0x0000000080000000, UL)
> > +#define ARM_MMIO_AREA	_AC(0x0000000080010000, UL)
> > +#define ARM_AXI_AREA	_AC(0x0000000088000000, UL)
> > +#define ARM_MEMORY_AREA	_AC(0x0000000090000000, UL)
> > 
> > Anyway, I really appreciate the suggestions; they're sufficient for me
> > to dig into more memory-related information (e.g. PCIe configuration,
> > IOMMU, etc.) and I will keep you posted if I make any progress.
> 
> None of those should need to change (all the MMIO emulation stuff is
> irrelevant to PCIe DMA anyway) - provided you don't give the guest more than
> 2GB of RAM, passthrough with legacy INTx ought to work out-of-the-box. For
> MSIs to get through, you'll further need to change the host kernel to place
> its software MSI region[2] within any of the host bridge windows as well.
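
(For context, the "software MSI region" is the IOVA window the host SMMU
driver reserves for mapping MSI doorbells. A sketch of what such a host
kernel change could look like; the existing values match the v5.0-era
driver, while the replacement base is purely an illustrative assumption,
not a tested choice:)

--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
-#define MSI_IOVA_BASE			0x8000000
+#define MSI_IOVA_BASE			0x80000000	/* assumption: inside the 2GB PCIe window */
 #define MSI_IOVA_LENGTH			0x100000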

From the PCI configuration dump, I can see that after launching the
guest with kvmtool, the host receives the first interrupt (I checked
that vfio_intx_handler() is invoked once) and then PCI_COMMAND_INTX_DISABLE
is set in the command register to disable the interrupt line.  So very
likely the interrupt is not forwarded properly and the guest doesn't
receive it.
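
(For reference, a minimal sketch of how that bit can be inspected from
kernel code; the helper itself is hypothetical:)

#include <linux/pci.h>

/* Sketch: check whether INTx has been masked via the command register. */
static bool intx_disabled(struct pci_dev *pdev)
{
	u16 cmd;

	pci_read_config_word(pdev, PCI_COMMAND, &cmd);
	return cmd & PCI_COMMAND_INTX_DISABLE;
}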

Luckily, I found that the flow below lets interrupts be forwarded from
host to guest, after setting "sky2.disable_msi=1" on both kernel
command lines:

    host                        guest

    INTx mode                   INTx mode

So far it still doesn't work if I set "sky2.disable_msi=1" only on the
host kernel command line; with that configuration it runs through the
flow below, which cannot forward interrupts properly from host to
guest:

    host                        guest

    INTx mode                   msi enable
                                msi disable
                                Switch back to INTx mode

I am happy that I can now use pure INTx mode on the Juno board to bring
up the NIC, and I pinged successfully from the guest OS to my router :)

I will look into the issue in the second scenario; and if I have more
time I will look into MSI mode as well (I confirmed MSI mode works on
the host OS but fails in the guest OS).

Thanks to you and Eric for the help, much appreciated!

Thanks,
Leo Yan


Re: [RFC PATCH] KVM: arm/arm64: Enable direct irqfd MSI injection

2019-03-18 Thread Zenghui Yu

Hi all,

On 2019/3/18 3:35, Marc Zyngier wrote:

On Sun, 17 Mar 2019 14:36:13 +0000,
Zenghui Yu  wrote:


Currently, IRQFD on arm still uses the deferred workqueue mechanism
to inject interrupts into the guest, which will likely lead to busy
context-switching from/to the kworker thread. This overhead is for
no purpose (only in my view ...) and will result in interrupt
performance degradation.

Implement kvm_arch_set_irq_inatomic() for arm/arm64 to support direct
irqfd MSI injection, by which we can get rid of the annoying latency.
As a result, irqfd MSI intensive scenarios (e.g., DPDK with high packet
processing workloads) will benefit from it.

Signed-off-by: Zenghui Yu 
---

It seems that only MSI will follow the IRQFD path; did I miss something?

This patch is still under test and sent out for early feedback. If I have
any misunderstanding, please fix me up and let me know. Thanks!


As mentioned by other folks in the thread, this is clearly wrong. The
first thing kvm_inject_msi does is to lock the corresponding ITS using
a mutex. So the "no purpose" bit was a bit too quick.

When doing this kind of work, I suggest you enable lockdep and all the
related checkers. Also, for any optimisation, please post actual
numbers for the relevant benchmarks. Saying "application X will
benefit from it" is meaningless without any actual data.
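
(As a concrete starting point, the checkers Marc mentions can be switched
on with a .config fragment along these lines; this particular option set
is indicative, not from the original mail:)

CONFIG_PROVE_LOCKING=y        # lockdep: proves lock ordering at runtime
CONFIG_DEBUG_ATOMIC_SLEEP=y   # flags sleeping calls (e.g. mutex_lock) in atomic context
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y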



---
  virt/kvm/arm/vgic/trace.h      | 22 ++++++++++++++++++++++
  virt/kvm/arm/vgic/vgic-irqfd.c | 21 +++++++++++++++++++++
  2 files changed, 43 insertions(+)

diff --git a/virt/kvm/arm/vgic/trace.h b/virt/kvm/arm/vgic/trace.h
index 55fed77..bc1f4db 100644
--- a/virt/kvm/arm/vgic/trace.h
+++ b/virt/kvm/arm/vgic/trace.h
@@ -27,6 +27,28 @@
  __entry->vcpu_id, __entry->irq, __entry->level)
  );
  
+TRACE_EVENT(kvm_arch_set_irq_inatomic,
+	TP_PROTO(u32 gsi, u32 type, int level, int irq_source_id),
+	TP_ARGS(gsi, type, level, irq_source_id),
+
+	TP_STRUCT__entry(
+		__field(	u32,	gsi		)
+		__field(	u32,	type		)
+		__field(	int,	level		)
+		__field(	int,	irq_source_id	)
+	),
+
+	TP_fast_assign(
+		__entry->gsi		= gsi;
+		__entry->type		= type;
+		__entry->level		= level;
+		__entry->irq_source_id	= irq_source_id;
+	),
+
+	TP_printk("gsi %u type %u level %d source %d", __entry->gsi,
+		  __entry->type, __entry->level, __entry->irq_source_id)
+);
+
  #endif /* _TRACE_VGIC_H */
  
  #undef TRACE_INCLUDE_PATH

diff --git a/virt/kvm/arm/vgic/vgic-irqfd.c b/virt/kvm/arm/vgic/vgic-irqfd.c
index 99e026d..4cfc3f4 100644
--- a/virt/kvm/arm/vgic/vgic-irqfd.c
+++ b/virt/kvm/arm/vgic/vgic-irqfd.c
@@ -19,6 +19,7 @@
  #include 
  #include 
  #include "vgic.h"
+#include "trace.h"
  
  /**

   * vgic_irqfd_set_irq: inject the IRQ corresponding to the
@@ -105,6 +106,26 @@ int kvm_set_msi(struct kvm_kernel_irq_routing_entry *e,
	return vgic_its_inject_msi(kvm, &msi);
  }
  
+/**
+ * kvm_arch_set_irq_inatomic: fast-path for irqfd injection
+ *
+ * Currently only direct MSI injection is supported.
+ */
+int kvm_arch_set_irq_inatomic(struct kvm_kernel_irq_routing_entry *e,
+			      struct kvm *kvm, int irq_source_id, int level,
+			      bool line_status)
+{
+	int ret;
+
+	trace_kvm_arch_set_irq_inatomic(e->gsi, e->type, level, irq_source_id);
+
+	if (unlikely(e->type != KVM_IRQ_ROUTING_MSI))
+		return -EWOULDBLOCK;
+
+	ret = kvm_set_msi(e, kvm, irq_source_id, level, line_status);
+	return ret;
+}
+


Although we've established that the approach is wrong, maybe we can
look at improving this aspect.

A first approach would be to keep a small cache of the last few
successful translations for this ITS, cache that could be looked-up by
holding a spinlock instead. A hit in this cache could directly be
injected. Any command that invalidates or changes anything (DISCARD,
INV, INVALL, MAPC with V=0, MAPD with V=0, MOVALL, MOVI) should nuke
the cache altogether.

Of course, all of that needs to be quantified.
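
To make the suggestion concrete, here is a minimal sketch of such a cache
(all names, sizes and the fill/eviction policy are illustrative
assumptions, not the actual kvm-arm64/its-translation-cache code):

/* Hypothetical sketch: (devid, eventid) -> vgic_irq translations,
 * protected by a raw spinlock so the irqfd atomic path can use it. */
struct vgic_its_cache_entry {
	u32			devid;
	u32			eventid;
	struct vgic_irq		*irq;	/* NULL means the slot is unused */
};

#define LPI_CACHE_NR	8

struct vgic_its_cache {
	raw_spinlock_t			lock;
	struct vgic_its_cache_entry	slot[LPI_CACHE_NR];
};

/* Fast path: may run in atomic context, never sleeps. */
static struct vgic_irq *its_cache_lookup(struct vgic_its_cache *c,
					 u32 devid, u32 eventid)
{
	struct vgic_irq *irq = NULL;
	unsigned long flags;
	int i;

	raw_spin_lock_irqsave(&c->lock, flags);
	for (i = 0; i < LPI_CACHE_NR; i++) {
		if (c->slot[i].irq && c->slot[i].devid == devid &&
		    c->slot[i].eventid == eventid) {
			irq = c->slot[i].irq;
			break;
		}
	}
	raw_spin_unlock_irqrestore(&c->lock, flags);
	return irq;
}

/* Called for DISCARD, INV, INVALL, MAPC/MAPD with V=0, MOVALL, MOVI. */
static void its_cache_invalidate_all(struct vgic_its_cache *c)
{
	unsigned long flags;

	raw_spin_lock_irqsave(&c->lock, flags);
	memset(c->slot, 0, sizeof(c->slot));
	raw_spin_unlock_irqrestore(&c->lock, flags);
}

Entries would be filled on the slow path after a successful translation;
a hit allows direct injection without taking the ITS mutex.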


Thanks for all of your explanations, especially Marc's suggestions!
It took me a long time to figure out my mistakes, since I am not very
familiar with the locking stuff, so I apologize for the noise.

As for the its-translation-cache code (really good news for us), we
have had a rough look at it and are starting to test it now!


thanks,

zenghui



Thanks,

M.





Re: [RFC PATCH] KVM: arm/arm64: Enable direct irqfd MSI injection

2019-03-18 Thread Raslan, KarimAllah
On Sun, 2019-03-17 at 14:36 +0000, Zenghui Yu wrote:
> Currently, IRQFD on arm still uses the deferred workqueue mechanism
> to inject interrupts into the guest, which will likely lead to busy
> context-switching from/to the kworker thread. This overhead is for
> no purpose (only in my view ...) and will result in interrupt
> performance degradation.
> 
> Implement kvm_arch_set_irq_inatomic() for arm/arm64 to support direct
> irqfd MSI injection, by which we can get rid of the annoying latency.
> As a result, irqfd MSI intensive scenarios (e.g., DPDK with high packet
> processing workloads) will benefit from it.
> 
> Signed-off-by: Zenghui Yu 
> ---
> 
> It seems that only MSI will follow the IRQFD path; did I miss something?
> 
> This patch is still under test and sent out for early feedback. If I have
> any misunderstanding, please fix me up and let me know. Thanks!
> 
> ---
>  virt/kvm/arm/vgic/trace.h      | 22 ++++++++++++++++++++++
>  virt/kvm/arm/vgic/vgic-irqfd.c | 21 +++++++++++++++++++++
>  2 files changed, 43 insertions(+)
> 
> diff --git a/virt/kvm/arm/vgic/trace.h b/virt/kvm/arm/vgic/trace.h
> index 55fed77..bc1f4db 100644
> --- a/virt/kvm/arm/vgic/trace.h
> +++ b/virt/kvm/arm/vgic/trace.h
> @@ -27,6 +27,28 @@
> __entry->vcpu_id, __entry->irq, __entry->level)
>  );
>  
> +TRACE_EVENT(kvm_arch_set_irq_inatomic,
> +	TP_PROTO(u32 gsi, u32 type, int level, int irq_source_id),
> +	TP_ARGS(gsi, type, level, irq_source_id),
> +
> +	TP_STRUCT__entry(
> +		__field(	u32,	gsi		)
> +		__field(	u32,	type		)
> +		__field(	int,	level		)
> +		__field(	int,	irq_source_id	)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->gsi		= gsi;
> +		__entry->type		= type;
> +		__entry->level		= level;
> +		__entry->irq_source_id	= irq_source_id;
> +	),
> +
> +	TP_printk("gsi %u type %u level %d source %d", __entry->gsi,
> +		  __entry->type, __entry->level, __entry->irq_source_id)
> +);
> +
>  #endif /* _TRACE_VGIC_H */
>  
>  #undef TRACE_INCLUDE_PATH
> diff --git a/virt/kvm/arm/vgic/vgic-irqfd.c b/virt/kvm/arm/vgic/vgic-irqfd.c
> index 99e026d..4cfc3f4 100644
> --- a/virt/kvm/arm/vgic/vgic-irqfd.c
> +++ b/virt/kvm/arm/vgic/vgic-irqfd.c
> @@ -19,6 +19,7 @@
>  #include 
>  #include 
>  #include "vgic.h"
> +#include "trace.h"
>  
>  /**
>   * vgic_irqfd_set_irq: inject the IRQ corresponding to the
> @@ -105,6 +106,26 @@ int kvm_set_msi(struct kvm_kernel_irq_routing_entry *e,
> 	return vgic_its_inject_msi(kvm, &msi);
>  }
>  
> +/**
> + * kvm_arch_set_irq_inatomic: fast-path for irqfd injection
> + *
> + * Currently only direct MSI injection is supported.
> + */
> +int kvm_arch_set_irq_inatomic(struct kvm_kernel_irq_routing_entry *e,
> +			      struct kvm *kvm, int irq_source_id, int level,
> +			      bool line_status)
> +{
> +	int ret;
> +
> +	trace_kvm_arch_set_irq_inatomic(e->gsi, e->type, level, irq_source_id);
> +
> +	if (unlikely(e->type != KVM_IRQ_ROUTING_MSI))
> +		return -EWOULDBLOCK;
> +
> +	ret = kvm_set_msi(e, kvm, irq_source_id, level, line_status);

The implementation of kvm_set_msi is not atomic. There is a mutex held in one
of the execution paths. That is why it cannot be used directly in this atomic
context.
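
For reference, here is roughly where the sleeping lock sits on that path
(paraphrased from virt/kvm/arm/vgic/vgic-its.c of that era; error mapping
and other details elided):

/* Paraphrased sketch -- not verbatim kernel code. */
int vgic_its_inject_msi(struct kvm *kvm, struct kvm_msi *msi)
{
	struct vgic_its *its;
	int ret;

	its = vgic_msi_to_its(kvm, msi);
	if (IS_ERR(its))
		return PTR_ERR(its);

	mutex_lock(&its->its_lock);	/* sleeping lock: forbidden in atomic context */
	ret = vgic_its_trigger_msi(kvm, its, msi->devid, msi->data);
	mutex_unlock(&its->its_lock);

	return ret;
}

Returning -EWOULDBLOCK from kvm_arch_set_irq_inatomic() makes the irqfd
core fall back to the workqueue path, where sleeping is permitted.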

> + return ret;
> +}
> +
>  int kvm_vgic_setup_default_irq_routing(struct kvm *kvm)
>  {
>   struct kvm_irq_routing_entry *entries;



Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrer: Christian Schlaeger, Ralf Herbrich
Ust-ID: DE 289 237 879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B



Re: [RFC] Question about TLB flush while set Stage-2 huge pages

2019-03-18 Thread Suzuki K Poulose
Hi !
On Sun, Mar 17, 2019 at 09:34:11PM +0800, Zenghui Yu wrote:
> Hi Suzuki,
> 
> On 2019/3/15 22:56, Suzuki K Poulose wrote:
> >Hi Zhengui,
> 
> s/Zhengui/Zheng/
> 
> (I think you must have wanted to say "Hi" to Zheng :-) )
> 

Sorry about that.

> 
> I have looked into your patch and the kernel log, and I believe that
> your patch has already addressed this issue. But I think we can do it
> a little better - two more points need to be handled with caution.
> 
> Take PMD hugepage (PMD_SIZE == 2M) for example:
>

...

> >That's bad, we seem to be making up to 4 unbalanced put_page() calls.
> >
> ---
>  virt/kvm/arm/mmu.c | 51 +++++++++++++++++++++++++++++++++++----------------
>  1 file changed, 35 insertions(+), 16 deletions(-)
> 
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index 66e0fbb5..04b0f9b 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -1076,24 +1076,38 @@ static int stage2_set_pmd_huge(struct kvm *kvm, struct kvm_mmu_memory_cache
>  		 * Skip updating the page table if the entry is
>  		 * unchanged.
>  		 */
> -		if (pmd_val(old_pmd) == pmd_val(*new_pmd))
> +		if (pmd_val(old_pmd) == pmd_val(*new_pmd)) {
>  			return 0;
> -
> +		} else if (WARN_ON_ONCE(!pmd_thp_or_huge(old_pmd))) {
>  			/*
> -			 * Mapping in huge pages should only happen through a
> -			 * fault.  If a page is merged into a transparent huge
> -			 * page, the individual subpages of that huge page
> -			 * should be unmapped through MMU notifiers before we
> -			 * get here.
> -			 *
> -			 * Merging of CompoundPages is not supported; they
> -			 * should become splitting first, unmapped, merged,
> -			 * and mapped back in on-demand.
> +			 * If we have PTE level mapping for this block,
> +			 * we must unmap it to avoid inconsistent TLB
> +			 * state. We could end up in this situation if
> +			 * the memory slot was marked for dirty logging
> +			 * and was reverted, leaving PTE level mappings
> +			 * for the pages accessed during the period.
> +			 * Normal THP split/merge follows mmu_notifier
> +			 * callbacks and do get handled accordingly.
>  			 */
> -		VM_BUG_ON(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));
> +			unmap_stage2_range(kvm, (addr & S2_PMD_MASK), S2_PMD_SIZE);
> 
> First, using unmap_stage2_range() here is not quite appropriate. Suppose
> we've only accessed one 2M page in the HPA range [x, x+1]GiB, with all
> other pages unaccessed.  What happens on unmap_stage2_range(this_2M_page)?
> We'll unexpectedly reach clear_stage2_pud_entry(), and things are going
> to get really bad.  So we'd better use unmap_stage2_ptes() here, since we
> only want to unmap a 2M range.

Yes, you're right. If this PMD entry is the only entry in the parent PUD
table, then the PUD table may get freed, and we may end up installing the
entry in a table which is no longer plugged into the page tables.

> 
> 
> Second, consider the function stack below:
> 
>   unmap_stage2_ptes()
>     clear_stage2_pmd_entry()
>       put_page(virt_to_page(pmd))
> 
> It seems that we have one "redundant" put_page() here (thus the bad
> kernel log ...), but actually we do not.  After stage2_set_pmd_huge(),
> the PMD table entry will point to a 2M block (where it originally pointed
> to a PTE table), so the _refcount of this PMD-level table page should _not_
> change after unmap_stage2_ptes().  What we really should do is add a
> get_page() after unmapping, to keep the _refcount in balance!

Yes, we need an additional refcount on the new huge PMD table if we are
tearing down the PTE-level table.
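
Put together, a minimal sketch of the combined fix (my reading of the
discussion above; names follow the quoted patch, but this is not the
final version):

	} else if (WARN_ON_ONCE(!pmd_thp_or_huge(old_pmd))) {
		/*
		 * Tear down only the PTE table covered by this PMD
		 * entry rather than the whole range, then take a
		 * reference to rebalance the put_page() done by
		 * clear_stage2_pmd_entry() on the table page.
		 */
		unmap_stage2_ptes(kvm, pmd, addr & S2_PMD_MASK,
				  (addr & S2_PMD_MASK) + S2_PMD_SIZE);
		get_page(virt_to_page(pmd));
	}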

> 
> 
> thoughts ? A simple patch below (based on yours) for details.
> 
> 
> thanks,
> 
> zenghui
> 
> 
> >>
> >>It seems that kvm decreases the _refcount of the page twice in
> >>transparent_hugepage_adjust()
> >>and unmap_stage2_range().
> >
> >But I thought we should be doing that on the head_page already, as this is
> >THP.
> >I will take a look and get back to you on this. Btw, is it possible for you
> >to turn on CONFIG_DEBUG_VM and re-run with the above patch?
> >
> >Kind regards
> >Suzuki
> >
> 
> ---8<---
> 
> test: kvm: arm: Maybe two more fixes
> 
> Applied based on Suzuki's patch.
> 
> Signed-off-by: Zenghui Yu 
> ---
>  virt/kvm/arm/mmu.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index 05765df..ccd5d5d 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -1089,7 +1089,9 @@ static int stage2_set_pmd_huge(struct kvm *kvm, struct kvm_mmu_memory_cache
>* Normal THP split/merge follows mmu_notifier
>* callbacks and do get handled accordingly.
>*/
> - 

Re: [RFC PATCH] KVM: arm/arm64: Enable direct irqfd MSI injection

2019-03-18 Thread Marc Zyngier
On Sun, 17 Mar 2019 19:35:48 +0000
Marc Zyngier  wrote:

[...]

> A first approach would be to keep a small cache of the last few
> successful translations for this ITS, cache that could be looked-up by
> holding a spinlock instead. A hit in this cache could directly be
> injected. Any command that invalidates or changes anything (DISCARD,
> INV, INVALL, MAPC with V=0, MAPD with V=0, MOVALL, MOVI) should nuke
> the cache altogether.

And to explain what I meant with this, I've pushed a branch[1] with a
basic prototype. It is good enough to get a VM to boot, but I wouldn't
trust it for anything serious just yet.

If anyone feels like giving it a go and checking whether it brings any
performance benefit, please do so.

Thanks,

M.

[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/log/?h=kvm-arm64/its-translation-cache
-- 
Without deviation from the norm, progress is not possible.


Re: [PATCH v5 05/22] iommu: Introduce cache_invalidate API

2019-03-18 Thread Auger Eric
Hi Jean,

On 3/18/19 12:01 PM, Jean-Philippe Brucker wrote:
> On 17/03/2019 16:43, Auger Eric wrote:
>>>> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
>>>> index 532a64075f23..e4c6a447e85a 100644
>>>> --- a/include/uapi/linux/iommu.h
>>>> +++ b/include/uapi/linux/iommu.h
>>>> @@ -159,4 +159,75 @@ struct iommu_pasid_table_config {
>>>> 	};
>>>>  };
>>>>  
>>>> +/* defines the granularity of the invalidation */
>>>> +enum iommu_inv_granularity {
>>>> +	IOMMU_INV_GRANU_DOMAIN,	/* domain-selective invalidation */
>>>> +	IOMMU_INV_GRANU_PASID,	/* pasid-selective invalidation */
>>>> +	IOMMU_INV_GRANU_ADDR,	/* page-selective invalidation */
>>>> +};
>>>> +
>>>> +/**
>>>> + * Address Selective Invalidation Structure
>>>> + *
>>>> + * @flags indicates the granularity of the address-selective invalidation
>>>> + * - if PASID bit is set, @pasid field is populated and the invalidation
>>>> + *   relates to cache entries tagged with this PASID and matching the
>>>> + *   address range.
>>>> + * - if ARCHID bit is set, @archid is populated and the invalidation relates
>>>> + *   to cache entries tagged with this architecture specific id and matching
>>>> + *   the address range.
>>>> + * - Both PASID and ARCHID can be set as they may tag different caches.
>>>> + * - if neither PASID or ARCHID is set, global addr invalidation applies
>>>> + * - LEAF flag indicates whether only the leaf PTE caching needs to be
>>>> + *   invalidated and other paging structure caches can be preserved.
>>>> + * @pasid: process address space id
>>>> + * @archid: architecture-specific id
>>>> + * @addr: first stage/level input address
>>>> + * @granule_size: page/block size of the mapping in bytes
>>>> + * @nb_granules: number of contiguous granules to be invalidated
>>>> + */
>>>> +struct iommu_inv_addr_info {
>>>> +#define IOMMU_INV_ADDR_FLAGS_PASID	(1 << 0)
>>>> +#define IOMMU_INV_ADDR_FLAGS_ARCHID	(1 << 1)
>>>> +#define IOMMU_INV_ADDR_FLAGS_LEAF	(1 << 2)
>>>> +	__u32	flags;
>>>> +	__u32	archid;
>>>> +	__u64	pasid;
>>>> +	__u64	addr;
>>>> +	__u64	granule_size;
>>>> +	__u64	nb_granules;
>>>> +};
>>>> +
>>>> +/**
>>>> + * First level/stage invalidation information
>>>> + * @cache: bitfield that allows to select which caches to invalidate
>>>> + * @granularity: defines the lowest granularity used for the invalidation:
>>>> + *     domain > pasid > addr
>>>> + *
>>>> + * Not all the combinations of cache/granularity make sense:
>>>> + *
>>>> + * type        |   DEV_IOTLB   |     IOTLB     |     PASID     |
>>>> + * granularity |               |               |     cache     |
>>>> + * ------------+---------------+---------------+---------------+
>>>> + * DOMAIN      |      N/A      |       Y       |       Y       |
>>>> + * PASID       |       Y       |       Y       |       Y       |
>>>> + * ADDR        |       Y       |       Y       |      N/A      |
>>>> + */
>>>> +struct iommu_cache_invalidate_info {
>>>> +#define IOMMU_CACHE_INVALIDATE_INFO_VERSION_1 1
>>>> +	__u32	version;
>>>> +/* IOMMU paging structure cache */
>>>> +#define IOMMU_CACHE_INV_TYPE_IOTLB	(1 << 0) /* IOMMU IOTLB */
>>>> +#define IOMMU_CACHE_INV_TYPE_DEV_IOTLB	(1 << 1) /* Device IOTLB */
>>>> +#define IOMMU_CACHE_INV_TYPE_PASID	(1 << 2) /* PASID cache */
>>>> +	__u8	cache;
>>>> +	__u8	granularity;
>>>> +	__u8	padding[2];
>>>> +	union {
>>>> +		__u64	pasid;
>>> just realized there is already a pasid field in the addr_info, do we
>>> still need this?
>> I think so. Either you do a PASID-based invalidation and you directly
>> use the pasid field, or you do an address-based invalidation and you use
>> the addr_info, where the pasid may or may not be passed.
> 
> I guess a comment would be useful?
> 
> - Invalidations by %IOMMU_INV_GRANU_ADDR use field @addr_info.
> - Invalidations by %IOMMU_INV_GRANU_PASID use field @pasid.
> - Invalidations by %IOMMU_INV_GRANU_DOMAIN don't take an argument.

OK. I will add those comments in v7.

Thanks

Eric
> 
> Thanks,
> Jean
> 
>>
>> Thanks
>>
>> Eric
>>>> +		struct iommu_inv_addr_info addr_info;
>>>> +	};
>>>> +};
>>>> +
>>>> +
>>>>  #endif /* _UAPI_IOMMU_H */
>>>
>>> [Jacob Pan]
>>>
> 


Re: Question: KVM: Failed to bind vfio with PCI-e / SMMU on Juno-r2

2019-03-18 Thread Robin Murphy

On 16/03/2019 04:56, Leo Yan wrote:

Hi Robin,

On Fri, Mar 15, 2019 at 12:54:10PM +0000, Robin Murphy wrote:

Hi Leo,

Sorry for the delay - I'm on holiday this week, but since I've made the
mistake of glancing at my inbox I should probably save you from wasting any
more time...


Sorry for disturbing you during your holiday, and I appreciate your help.
There's no rush to reply.


On 2019-03-15 11:03 am, Auger Eric wrote:

Hi Leo,

+ Jean-Philippe

On 3/15/19 10:37 AM, Leo Yan wrote:

Hi Eric, Robin,

On Wed, Mar 13, 2019 at 11:24:25AM +0100, Auger Eric wrote:

[...]


If the NIC supports MSIs they logically are used. This can easily be
checked on the host by issuing "cat /proc/interrupts | grep vfio". Can you
check whether the guest received any interrupt? I remember that Robin
said in the past that on Juno, the MSI doorbell was in the PCI host
bridge window and possibly transactions towards the doorbell could not
reach it since they were considered peer-to-peer.


I found Robin's explanation again. It was not related to the MSI IOVA
being within the PCI host bridge window, but to the RAM GPA colliding
with the host PCI config space?

"MSI doorbells integral to PCIe root complexes (and thus untranslatable)
typically have a programmable address, so could be anywhere. In the more
general category of "special hardware addresses", QEMU's default ARM
guest memory map puts RAM starting at 0x40000000; on the ARM Juno
platform, that happens to be where PCI config space starts; as Juno's
PCIe doesn't support ACS, peer-to-peer or anything clever, if you assign
the PCI bus to a guest (all of it, given the lack of ACS), the root
complex just sees the guest's attempts to DMA to "memory" as the device
attempting to access config space and aborts them."


Below is some follow-up investigation from my side:

Firstly, I must admit that I don't fully understand the paragraph above;
based on the description, I am wondering whether I can use INTx mode and
whether that would be enough to avoid this hardware pitfall.


The problem above is that during the assignment process, the virtualizer
maps the whole guest RAM through the IOMMU (+ the MSI doorbell on ARM) to
allow the device, programmed in GPA, to access the whole guest RAM.
Unfortunately if the device emits a DMA request with IOVA 0x40000000,
this IOVA is interpreted by the Juno RC as a transaction
towards the PCIe config space. So this DMA request will not go beyond
the RC, will never reach the IOMMU and will never reach the guest RAM.
So globally the device is not able to reach part of the guest RAM.
That's how I interpret the above statement. Then I don't know the
details of the collision, I don't have access to this HW. I don't know
either if this problem still exists on the r2 HW.


Thanks a lot for rephrasing, Eric :)


The short answer is that if you want PCI passthrough to work on Juno, the
guest memory map has to look like a Juno.

The PCIe root complex uses an internal lookup table to generate appropriate
AXI attributes for outgoing PCIe transactions; unfortunately this has no
notion of 'default' attributes, so addresses *must* match one of the
programmed windows in order to be valid. From memory, EDK2 sets up a 2GB
window covering the lower DRAM bank, an 8GB window covering the upper DRAM
bank, and a 1MB (or thereabouts) window covering the GICv2m region with
Device attributes.


I checked kernel memory blocks info, it gives out below result:

root@debian:~# cat /sys/kernel/debug/memblock/memory
   0: 0x0000000080000000..0x00000000feffffff
   1: 0x0000000880000000..0x00000009ffffffff

So I think the lower 2GB DRAM window is [0x8000_0000..0xfeff_ffff]
and the high DRAM window is [0x8_8000_0000..0x9_ffff_ffff].

BTW, now I am using U-Boot rather than UEFI, so I'm not sure whether
U-Boot has programmed the memory windows for PCIe.  Could you point me
to the registers that should be set in UEFI, so that I can also check
the related configuration in U-Boot?


U-Boot does the same thing[1] - you can confirm that by whether PCIe 
works at all on the host ;)



Any PCIe transactions to addresses not within one of
those windows will be aborted by the RC without ever going out to the AXI
side where the SMMU lies (and I think anything matching the config space or
I/O space windows or a region claimed by a BAR will be aborted even earlier
as a peer-to-peer attempt regardless of the AXI Translation Table setup).

You could potentially modify the firmware to change the window
configuration, but the alignment restrictions make it awkward. I've only
ever tested passthrough on Juno using kvmtool, which IIRC already has guest
RAM in an appropriate place (and is trivially easy to hack if not) - I don't
remember if I ever actually tried guest MSI with that.


I made several attempts with kvmtool, tweaking the memory regions, but
had no luck.  Since the host uses [0x8000_0000..0xfeff_ffff] as the
first valid memory window for PCIe, I tried to move all memory/IO
regions into this window with the changes below, but still no luck:

diff --git 

Re: [PATCH v5 05/22] iommu: Introduce cache_invalidate API

2019-03-18 Thread Jean-Philippe Brucker
On 17/03/2019 16:43, Auger Eric wrote:
>>> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
>>> index 532a64075f23..e4c6a447e85a 100644
>>> --- a/include/uapi/linux/iommu.h
>>> +++ b/include/uapi/linux/iommu.h
>>> @@ -159,4 +159,75 @@ struct iommu_pasid_table_config {
>>> };
>>>  };
>>>  
>>> +/* defines the granularity of the invalidation */
>>> +enum iommu_inv_granularity {
>>> +	IOMMU_INV_GRANU_DOMAIN,	/* domain-selective invalidation */
>>> +	IOMMU_INV_GRANU_PASID,	/* pasid-selective invalidation */
>>> +	IOMMU_INV_GRANU_ADDR,	/* page-selective invalidation */
>>> +};
>>> +
>>> +/**
>>> + * Address Selective Invalidation Structure
>>> + *
>>> + * @flags indicates the granularity of the address-selective invalidation
>>> + * - if PASID bit is set, @pasid field is populated and the invalidation
>>> + *   relates to cache entries tagged with this PASID and matching the
>>> + *   address range.
>>> + * - if ARCHID bit is set, @archid is populated and the invalidation relates
>>> + *   to cache entries tagged with this architecture specific id and matching
>>> + *   the address range.
>>> + * - Both PASID and ARCHID can be set as they may tag different caches.
>>> + * - if neither PASID or ARCHID is set, global addr invalidation applies
>>> + * - LEAF flag indicates whether only the leaf PTE caching needs to be
>>> + *   invalidated and other paging structure caches can be preserved.
>>> + * @pasid: process address space id
>>> + * @archid: architecture-specific id
>>> + * @addr: first stage/level input address
>>> + * @granule_size: page/block size of the mapping in bytes
>>> + * @nb_granules: number of contiguous granules to be invalidated
>>> + */
>>> +struct iommu_inv_addr_info {
>>> +#define IOMMU_INV_ADDR_FLAGS_PASID	(1 << 0)
>>> +#define IOMMU_INV_ADDR_FLAGS_ARCHID	(1 << 1)
>>> +#define IOMMU_INV_ADDR_FLAGS_LEAF	(1 << 2)
>>> +	__u32	flags;
>>> +	__u32	archid;
>>> +	__u64	pasid;
>>> +	__u64	addr;
>>> +	__u64	granule_size;
>>> +	__u64	nb_granules;
>>> +};
>>> +
>>> +/**
>>> + * First level/stage invalidation information
>>> + * @cache: bitfield that allows to select which caches to invalidate
>>> + * @granularity: defines the lowest granularity used for the invalidation:
>>> + *     domain > pasid > addr
>>> + *
>>> + * Not all the combinations of cache/granularity make sense:
>>> + *
>>> + * type        |   DEV_IOTLB   |     IOTLB     |     PASID     |
>>> + * granularity |               |               |     cache     |
>>> + * ------------+---------------+---------------+---------------+
>>> + * DOMAIN      |      N/A      |       Y       |       Y       |
>>> + * PASID       |       Y       |       Y       |       Y       |
>>> + * ADDR        |       Y       |       Y       |      N/A      |
>>> + */
>>> +struct iommu_cache_invalidate_info {
>>> +#define IOMMU_CACHE_INVALIDATE_INFO_VERSION_1 1
>>> +	__u32	version;
>>> +/* IOMMU paging structure cache */
>>> +#define IOMMU_CACHE_INV_TYPE_IOTLB	(1 << 0) /* IOMMU IOTLB */
>>> +#define IOMMU_CACHE_INV_TYPE_DEV_IOTLB	(1 << 1) /* Device IOTLB */
>>> +#define IOMMU_CACHE_INV_TYPE_PASID	(1 << 2) /* PASID cache */
>>> +	__u8	cache;
>>> +	__u8	granularity;
>>> +	__u8	padding[2];
>>> +	union {
>>> +		__u64	pasid;
>> just realized there is already a pasid field in the addr_info, do we
>> still need this?
> I think so. Either you do a PASID-based invalidation and you directly
> use the pasid field, or you do an address-based invalidation and you use
> the addr_info, where the pasid may or may not be passed.

I guess a comment would be useful?

- Invalidations by %IOMMU_INV_GRANU_ADDR use field @addr_info.
- Invalidations by %IOMMU_INV_GRANU_PASID use field @pasid.
- Invalidations by %IOMMU_INV_GRANU_DOMAIN don't take an argument.
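
For instance, a hypothetical caller would fill the structure like this
(field names from the patch above; the values are made up):

void example_invalidations(void)
{
	struct iommu_cache_invalidate_info info = {
		.version = IOMMU_CACHE_INVALIDATE_INFO_VERSION_1,
		.cache   = IOMMU_CACHE_INV_TYPE_IOTLB,
	};

	/* Domain-selective: the granularity alone, no argument. */
	info.granularity = IOMMU_INV_GRANU_DOMAIN;

	/* PASID-selective: the top-level pasid field carries the argument. */
	info.granularity = IOMMU_INV_GRANU_PASID;
	info.pasid = 42;

	/* Address-selective: addr_info carries the arguments; its pasid
	 * is only considered when IOMMU_INV_ADDR_FLAGS_PASID is set. */
	info.granularity = IOMMU_INV_GRANU_ADDR;
	info.addr_info.flags        = IOMMU_INV_ADDR_FLAGS_PASID;
	info.addr_info.pasid        = 42;
	info.addr_info.addr         = 0xffff0000;
	info.addr_info.granule_size = 4096;
	info.addr_info.nb_granules  = 16;
}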

Thanks,
Jean

> 
> Thanks
> 
> Eric
>>> +		struct iommu_inv_addr_info addr_info;
>>> +	};
>>> +};
>>> +
>>> +
>>>  #endif /* _UAPI_IOMMU_H */
>>
>> [Jacob Pan]
>>
