[Intel-gfx] [PATCH] drm/i915/gvt: Fix bug in getting msg length in AUX CH registers handler

2023-07-31 Thread Yan Zhao
Msg length should be obtained from the value written to the AUX_CH_CTL
register rather than from the enum type of the register.

Commit 0cad796a2269 ("drm/i915: Use REG_BIT() & co. for AUX CH registers")
incorrectly calculates msg_length from the register type and yields the below
warning in intel_gvt_i2c_handle_aux_ch_write():
"i915 :00:02.0: drm_WARN_ON(msg_length != 4)".
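
For illustration, the difference can be sketched with simplified stand-ins for
the driver's helpers (a sketch only: the mask below assumes the AUX CH
message-size field sits at bits 24:20, mirroring DP_AUX_CH_CTL_MESSAGE_SIZE_MASK
at the time of this patch; the real REG_FIELD_GET()/REG_GENMASK() macros differ
in form but extract the field the same way):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simplified stand-in for DP_AUX_CH_CTL_MESSAGE_SIZE_MASK: assumes the
 * message-size field occupies bits 24:20 of the AUX_CH_CTL value.
 */
#define AUX_MSG_SIZE_MASK (0x1fu << 20)

static uint32_t field_get(uint32_t mask, uint32_t val)
{
	/* Shift the masked bits down to bit 0, like REG_FIELD_GET(). */
	return (val & mask) >> __builtin_ctz(mask);
}

/*
 * msg_length must come from the value the guest wrote to AUX_CH_CTL,
 * not from the register offset ("reg"), which only identifies the channel.
 */
static uint32_t aux_msg_length(uint32_t value_written)
{
	return field_get(AUX_MSG_SIZE_MASK, value_written);
}
```

Passing the register offset instead of the written value yields an essentially
arbitrary field, which is what tripped the drm_WARN_ON() above.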

Fixes: 0cad796a2269 ("drm/i915: Use REG_BIT() & co. for AUX CH registers")
Signed-off-by: Yan Zhao 
---
 drivers/gpu/drm/i915/gvt/edid.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gvt/edid.c b/drivers/gpu/drm/i915/gvt/edid.c
index 2a0438f12a14..af9afdb53c7f 100644
--- a/drivers/gpu/drm/i915/gvt/edid.c
+++ b/drivers/gpu/drm/i915/gvt/edid.c
@@ -491,7 +491,7 @@ void intel_gvt_i2c_handle_aux_ch_write(struct intel_vgpu *vgpu,
return;
}
 
-   msg_length = REG_FIELD_GET(DP_AUX_CH_CTL_MESSAGE_SIZE_MASK, reg);
+   msg_length = REG_FIELD_GET(DP_AUX_CH_CTL_MESSAGE_SIZE_MASK, value);
 
// check the msg in DATA register.
msg = vgpu_vreg(vgpu, offset + 4);
-- 
2.17.1



Re: [Intel-gfx] [PATCH v4 06/29] drm/i915/gvt: Explicitly check that vGPU is attached before shadowing

2023-07-31 Thread Yan Zhao
On Fri, Jul 28, 2023 at 06:35:12PM -0700, Sean Christopherson wrote:
> Move the check that a vGPU is attacked from is_2MB_gtt_possible() all the
typo: "attacked" --> "attached"

> way up to shadow_ppgtt_mm() to avoid unnecessary work, and to make it more
This commit message does not match what the patch does.
The check in the patch is in ppgtt_populate_shadow_entry().

Is something like below what you want?

@@ -1796,6 +1797,9 @@ static int shadow_ppgtt_mm(struct intel_vgpu_mm *mm)
if (mm->ppgtt_mm.shadowed)
return 0;

+   if (!test_bit(INTEL_VGPU_STATUS_ATTACHED, vgpu->status))
+           return -EINVAL;
+
mm->ppgtt_mm.shadowed = true;

for (index = 0; index < ARRAY_SIZE(mm->ppgtt_mm.guest_pdps); index++) {
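
For reference, the semantics of the test_bit() guard in the hunk above can be
modeled as below (a minimal single-threaded sketch; the real kernel helper is
bitop-safe, and the bit number here is illustrative rather than the real
enum intel_vgpu_status value):

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Illustrative bit number; the real value comes from enum intel_vgpu_status. */
#define INTEL_VGPU_STATUS_ATTACHED_DEMO 0

/* Minimal model of the kernel's test_bit(): return bit 'nr' of the bitmap. */
static int test_bit_model(unsigned int nr, const unsigned long *addr)
{
	return (addr[nr / BITS_PER_LONG] >> (nr % BITS_PER_LONG)) & 1UL;
}

/* Mirror of the suggested guard: bail out before doing any shadowing work
 * when the vGPU is not attached. */
static int shadow_allowed(const unsigned long *status)
{
	return test_bit_model(INTEL_VGPU_STATUS_ATTACHED_DEMO, status);
}
```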

> obvious that a future cleanup of is_2MB_gtt_possible() isn't introducing a
> bug.
> 
> is_2MB_gtt_possible() has only one caller, ppgtt_populate_shadow_entry(),
> and all paths in ppgtt_populate_shadow_entry() eventually check for
> attachment by way of intel_gvt_dma_map_guest_page().
> 
> And of the paths that lead to ppgtt_populate_shadow_entry(),
> shadow_ppgtt_mm() is the only one that doesn't already check for
> INTEL_VGPU_STATUS_ACTIVE or INTEL_VGPU_STATUS_ATTACHED.

...

> diff --git a/drivers/gpu/drm/i915/gvt/gtt.c b/drivers/gpu/drm/i915/gvt/gtt.c
> index 5426a27c1b71..2aed31b497c9 100644
> --- a/drivers/gpu/drm/i915/gvt/gtt.c
> +++ b/drivers/gpu/drm/i915/gvt/gtt.c
> @@ -1163,8 +1163,6 @@ static int is_2MB_gtt_possible(struct intel_vgpu *vgpu,
>   if (!HAS_PAGE_SIZES(vgpu->gvt->gt->i915, I915_GTT_PAGE_SIZE_2M))
>   return 0;
>  
> - if (!test_bit(INTEL_VGPU_STATUS_ATTACHED, vgpu->status))
> - return -EINVAL;
>   pfn = gfn_to_pfn(vgpu->vfio_device.kvm, ops->get_pfn(entry));
>   if (is_error_noslot_pfn(pfn))
>   return -EINVAL;
> @@ -1277,6 +1275,9 @@ static int ppgtt_populate_shadow_entry(struct intel_vgpu *vgpu,
>   if (!pte_ops->test_present(ge))
>   return 0;
>  
> + if (!test_bit(INTEL_VGPU_STATUS_ATTACHED, vgpu->status))
> + return -EINVAL;
> +
>   gfn = pte_ops->get_pfn(ge);
>  
>   switch (ge->type) {
> -- 
> 2.41.0.487.g6d72f3e995-goog
> 


Re: [Intel-gfx] [PATCH v4 07/29] drm/i915/gvt: Error out on an attempt to shadowing an unknown GTT entry type

2023-07-31 Thread Yan Zhao
Reviewed-by: Yan Zhao 

On Fri, Jul 28, 2023 at 06:35:13PM -0700, Sean Christopherson wrote:
> Bail from ppgtt_populate_shadow_entry() if an unexpected GTT entry type
> is encountered instead of subtly falling through to the common "direct
> shadow" path.  Eliminating the default/error path's reliance on the common
> handling will allow hoisting intel_gvt_dma_map_guest_page() into the case
> statements so that the 2MiB case can try intel_gvt_dma_map_guest_page()
> and fallback to splitting the entry on failure.
> 
> Reviewed-by: Zhi Wang 
> Tested-by: Yongwei Ma 
> Signed-off-by: Sean Christopherson 
> ---
>  drivers/gpu/drm/i915/gvt/gtt.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/gpu/drm/i915/gvt/gtt.c b/drivers/gpu/drm/i915/gvt/gtt.c
> index 2aed31b497c9..61e38acee2d5 100644
> --- a/drivers/gpu/drm/i915/gvt/gtt.c
> +++ b/drivers/gpu/drm/i915/gvt/gtt.c
> @@ -1306,6 +1306,7 @@ static int ppgtt_populate_shadow_entry(struct intel_vgpu *vgpu,
>   return -EINVAL;
>   default:
>   GEM_BUG_ON(1);
> + return -EINVAL;
>   }
>  
>   /* direct shadow */
> -- 
> 2.41.0.487.g6d72f3e995-goog
> 


Re: [Intel-gfx] [PATCH v4 03/29] drm/i915/gvt: Verify hugepages are contiguous in physical address space

2023-07-31 Thread Yan Zhao
Reviewed-by: Yan Zhao 

On Fri, Jul 28, 2023 at 06:35:09PM -0700, Sean Christopherson wrote:
> When shadowing a GTT entry with a 2M page, verify that the pfns are
> contiguous, not just that the struct page pointers are contiguous.  The
> memory map is virtually contiguous if "CONFIG_FLATMEM=y ||
> CONFIG_SPARSEMEM_VMEMMAP=y", but not for "CONFIG_SPARSEMEM=y &&
> CONFIG_SPARSEMEM_VMEMMAP=n", so theoretically KVMGT could encounter struct
> pages that are virtually contiguous, but not physically contiguous.
> 
> In practice, this flaw is likely a non-issue as it would cause functional
> problems iff a section isn't 2M aligned _and_ is directly adjacent to
> another section with discontiguous pfns.
> 
> Tested-by: Yongwei Ma 
> Signed-off-by: Sean Christopherson 
> ---
>  drivers/gpu/drm/i915/gvt/kvmgt.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/gvt/kvmgt.c 
> b/drivers/gpu/drm/i915/gvt/kvmgt.c
> index de675d799c7d..429f0f993a13 100644
> --- a/drivers/gpu/drm/i915/gvt/kvmgt.c
> +++ b/drivers/gpu/drm/i915/gvt/kvmgt.c
> @@ -161,7 +161,7 @@ static int gvt_pin_guest_page(struct intel_vgpu *vgpu, unsigned long gfn,
>  
>   if (npage == 0)
>   base_page = cur_page;
> - else if (base_page + npage != cur_page) {
> + else if (page_to_pfn(base_page) + npage != page_to_pfn(cur_page)) {
>   gvt_vgpu_err("The pages are not continuous\n");
>   ret = -EINVAL;
>   npage++;
> -- 
> 2.41.0.487.g6d72f3e995-goog
> 
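
As a sketch of why the fix compares pfns rather than struct page pointers
(toy types standing in for struct page and page_to_pfn(); not the real kernel
API):

```c
#include <assert.h>

/* Toy stand-in for struct page: just records which pfn it describes. */
struct toy_page { unsigned long pfn; };

static unsigned long toy_page_to_pfn(const struct toy_page *p)
{
	return p->pfn;
}

/*
 * The corrected check: pages are physically contiguous only if their pfns
 * are sequential. Pointer adjacency of the struct pages is not enough when
 * the memory map itself is not pfn-ordered across section boundaries.
 */
static int toy_pages_contiguous(const struct toy_page *base, long npage,
				const struct toy_page *cur)
{
	return toy_page_to_pfn(base) + npage == toy_page_to_pfn(cur);
}
```

Two toy pages adjacent in an array (virtually contiguous) can still describe
non-adjacent pfns, which is exactly the SPARSEMEM corner case the patch guards
against.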


Re: [Intel-gfx] [PATCH v4.1] drm/i915/gvt: Explicitly check that vGPU is attached before shadowing

2023-08-01 Thread Yan Zhao
Reviewed-by: Yan Zhao 
Tested-by: Yan Zhao 

On Tue, Aug 01, 2023 at 04:05:21PM -0700, Sean Christopherson wrote:
> Move the check that a vGPU is attached from is_2MB_gtt_possible() all the
> way up to shadow_ppgtt_mm() to avoid unnecessary work, and to make it more
> obvious that a future cleanup of is_2MB_gtt_possible() isn't introducing a
> bug.
> 
> is_2MB_gtt_possible() has only one caller, ppgtt_populate_shadow_entry(),
> and all paths in ppgtt_populate_shadow_entry() eventually check for
> attachment by way of intel_gvt_dma_map_guest_page().
> 
> And of the paths that lead to ppgtt_populate_shadow_entry(),
> shadow_ppgtt_mm() is the only one that doesn't already check for
> INTEL_VGPU_STATUS_ACTIVE or INTEL_VGPU_STATUS_ATTACHED.
> 
>   workload_thread() <= pick_next_workload() => INTEL_VGPU_STATUS_ACTIVE
>   |
>   -> dispatch_workload()
>  |
>  |-> prepare_workload()
>  |
>  -> intel_vgpu_sync_oos_pages()
>  |  |
>  |  |-> ppgtt_set_guest_page_sync()
>  |  |
>  |  |-> sync_oos_page()
>  |  |
>  |  |-> ppgtt_populate_shadow_entry()
>  |
>  |-> intel_vgpu_flush_post_shadow()
>  |
>   1: |-> ppgtt_handle_guest_write_page_table()
>  |
>  |-> ppgtt_handle_guest_entry_add()
>  |
>   2: | -> ppgtt_populate_spt_by_guest_entry()
>  ||
>  ||-> ppgtt_populate_spt()
>  ||
>  ||-> ppgtt_populate_shadow_entry()
>  ||
>  ||-> ppgtt_populate_spt_by_guest_entry() [see 2]
>  |
>  |-> ppgtt_populate_shadow_entry()
> 
>   kvmgt_page_track_write()  <= KVM callback => INTEL_VGPU_STATUS_ATTACHED
>   |
>   |-> intel_vgpu_page_track_handler()
>   |
>   |-> ppgtt_write_protection_handler()
>   |
>   |-> ppgtt_handle_guest_write_page_table_bytes()
>   |
>   |-> ppgtt_handle_guest_write_page_table() [see 1]
> 
> Signed-off-by: Sean Christopherson 
> ---
> 
> v4.1:
> 
>  - Actually make the code do what the changelog says. [Yan]
>  - Fix a typo in the changelog. [Yan]
> 
>  drivers/gpu/drm/i915/gvt/gtt.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gvt/gtt.c b/drivers/gpu/drm/i915/gvt/gtt.c
> index 5426a27c1b71..de6a484090d7 100644
> --- a/drivers/gpu/drm/i915/gvt/gtt.c
> +++ b/drivers/gpu/drm/i915/gvt/gtt.c
> @@ -1163,8 +1163,6 @@ static int is_2MB_gtt_possible(struct intel_vgpu *vgpu,
>   if (!HAS_PAGE_SIZES(vgpu->gvt->gt->i915, I915_GTT_PAGE_SIZE_2M))
>   return 0;
>  
> - if (!test_bit(INTEL_VGPU_STATUS_ATTACHED, vgpu->status))
> - return -EINVAL;
>   pfn = gfn_to_pfn(vgpu->vfio_device.kvm, ops->get_pfn(entry));
>   if (is_error_noslot_pfn(pfn))
>   return -EINVAL;
> @@ -1827,6 +1825,9 @@ static int shadow_ppgtt_mm(struct intel_vgpu_mm *mm)
>   if (mm->ppgtt_mm.shadowed)
>   return 0;
>  
> + if (!test_bit(INTEL_VGPU_STATUS_ATTACHED, vgpu->status))
> + return -EINVAL;
> +
>   mm->ppgtt_mm.shadowed = true;
>  
>   for (index = 0; index < ARRAY_SIZE(mm->ppgtt_mm.guest_pdps); index++) {
> 
> base-commit: 03e8f77e106ba1d2fd980f8b38339dad3a07
> -- 
> 2.41.0.585.gd2178a4bd4-goog
> 


Re: [Intel-gfx] [PATCH 19/27] KVM: x86/mmu: Use page-track notifiers iff there are external users

2023-08-08 Thread Yan Zhao
On Mon, Aug 07, 2023 at 10:19:07AM -0700, Sean Christopherson wrote:
> On Mon, Aug 07, 2023, Like Xu wrote:
> > On 23/12/2022 8:57 am, Sean Christopherson wrote:
> > > +static inline void kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa,
> > > + const u8 *new, int bytes)
> > > +{
> > > + __kvm_page_track_write(vcpu, gpa, new, bytes);
> > > +
> > > + kvm_mmu_track_write(vcpu, gpa, new, bytes);
> > > +}
> > 
> > The kvm_mmu_track_write() is only used for x86, where the incoming parameter
> > "u8 *new" has not been required since 0e0fee5c539b ("kvm: mmu: Fix race in
> > emulated page table writes"), please help confirm if it's still needed ? 
> > Thanks.
> > A minor clean up is proposed.
> 
> Hmm, unless I'm misreading things, KVMGT ultimately doesn't consume @new 
> either.
> So I think we can remove @new from kvm_page_track_write() entirely.
Sorry for the late reply.
Yes, KVMGT does not consume @new and it reads the guest PTE again in the
page track write handler.

But I have a couple of questions related to the mentioned commit, as
below:

(1) If "re-reading the current value of the guest PTE after the MMU lock has
been acquired", then should KVMGT acquire the MMU lock too?
If so, could we move the MMU lock/unlock into kvm_page_track_write(),
since it's common code?

(2) Even if KVMGT consumes @new,
will kvm_page_track_write() be called once or twice if there are two
concurrent emulated writes?


commit 0e0fee5c539b61fdd098332e0e2cc375d9073706
Author: Junaid Shahid 
Date:   Wed Oct 31 14:53:57 2018 -0700

kvm: mmu: Fix race in emulated page table writes

When a guest page table is updated via an emulated write,
kvm_mmu_pte_write() is called to update the shadow PTE using the just
written guest PTE value. But if two emulated guest PTE writes happened
concurrently, it is possible that the guest PTE and the shadow PTE end
up being out of sync. Emulated writes do not mark the shadow page as
unsync-ed, so this inconsistency will not be resolved even by a guest TLB
flush (unless the page was marked as unsync-ed at some other point).

This is fixed by re-reading the current value of the guest PTE after the
MMU lock has been acquired instead of just using the value that was
written prior to calling kvm_mmu_pte_write().
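
The shape of that fix can be modeled in a few lines (a toy single-threaded
sketch, not kernel code; the _model names are invented for illustration):

```c
#include <assert.h>
#include <stdint.h>

static uint64_t guest_pte;  /* the guest-owned PTE */
static uint64_t shadow_pte; /* KVM's shadow of it */

/*
 * Model of the fixed kvm_mmu_pte_write(): the shadow sync uses the
 * re-read current guest PTE (read after taking the modeled mmu_lock),
 * not the possibly-stale value passed in by the emulator.
 */
static void kvm_mmu_pte_write_model(uint64_t stale_new)
{
	(void)stale_new;        /* deliberately unused since 0e0fee5c539b */
	shadow_pte = guest_pte; /* re-read under the lock */
}
```

Because the stale parameter is ignored, two racing emulated writes both end up
syncing whichever guest value was written last.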


Re: [Intel-gfx] [PATCH 19/27] KVM: x86/mmu: Use page-track notifiers iff there are external users

2023-08-09 Thread Yan Zhao
On Wed, Aug 09, 2023 at 07:33:45AM -0700, Sean Christopherson wrote:
> On Wed, Aug 09, 2023, Yan Zhao wrote:
> > On Mon, Aug 07, 2023 at 10:19:07AM -0700, Sean Christopherson wrote:
> > > On Mon, Aug 07, 2023, Like Xu wrote:
> > > > On 23/12/2022 8:57 am, Sean Christopherson wrote:
> > > > > +static inline void kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa,
> > > > > + const u8 *new, int bytes)
> > > > > +{
> > > > > + __kvm_page_track_write(vcpu, gpa, new, bytes);
> > > > > +
> > > > > + kvm_mmu_track_write(vcpu, gpa, new, bytes);
> > > > > +}
> > > > 
> > > > The kvm_mmu_track_write() is only used for x86, where the incoming 
> > > > parameter
> > > > "u8 *new" has not been required since 0e0fee5c539b ("kvm: mmu: Fix race 
> > > > in
> > > > emulated page table writes"), please help confirm if it's still needed 
> > > > ? Thanks.
> > > > A minor clean up is proposed.
> > > 
> > > Hmm, unless I'm misreading things, KVMGT ultimately doesn't consume @new 
> > > either.
> > > So I think we can remove @new from kvm_page_track_write() entirely.
> > Sorry for the late reply.
> > Yes, KVMGT does not consume @new and it reads the guest PTE again in the
> > page track write handler.
> > 
> > But I have a couple of questions related to the mentioned commit as
> > below:
> > 
> > (1) If "re-reading the current value of the guest PTE after the MMU lock has
> > been acquired", then should KVMGT also acquire the MMU lock too?
> 
> No.  If applicable, KVMGT should read the new/current value after acquiring
> whatever lock protects the generation (or update) of the shadow entries.  I
> suspect KVMGT already does this, but I don't have time to confirm that at this
I think the mutex lock and unlock of info->vgpu_lock you added in
kvmgt_page_track_write() is the counterpart :)

> exact moment.
> 
> The race that was fixed in KVM was:
> 
>   vCPU0                          vCPU1
>   write X
>                                  write Y
>                                  sync SPTE w/ Y
>   sync SPTE w/ X
> 
> Reading the value after acquiring mmu_lock ensures that both vCPUs will see 
> whatever
> value "loses" the race, i.e. whatever written value is processed second ('Y' 
> in the
> above sequence).
I suspect that vCPU0 may still generate a wrong SPTE if vCPU1 wrote 4
bytes while vCPU0 wrote 8 bytes, though the chances are very low.


> 
> > If so, could we move the MMU lock and unlock into kvm_page_track_write()
> > as it's common.
> > 
> > (2) Even if KVMGT consumes @new,
> > will kvm_page_track_write() be called once or twice if there are two
> > concurrent emulated writes?
> 
> Twice, kvm_page_track_write() is wired up directly to the emulation of the 
> write,
> i.e. there is no batching.


Re: [Intel-gfx] [PATCH 19/27] KVM: x86/mmu: Use page-track notifiers iff there are external users

2023-08-09 Thread Yan Zhao
On Thu, Aug 10, 2023 at 07:21:03AM +0800, Yan Zhao wrote:
> On Wed, Aug 09, 2023 at 07:33:45AM -0700, Sean Christopherson wrote:
> > On Wed, Aug 09, 2023, Yan Zhao wrote:
> > > On Mon, Aug 07, 2023 at 10:19:07AM -0700, Sean Christopherson wrote:
> > > > On Mon, Aug 07, 2023, Like Xu wrote:
> > > > > On 23/12/2022 8:57 am, Sean Christopherson wrote:
> > > > > > +static inline void kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa,
> > > > > > +   const u8 *new, int bytes)
> > > > > > +{
> > > > > > +   __kvm_page_track_write(vcpu, gpa, new, bytes);
> > > > > > +
> > > > > > +   kvm_mmu_track_write(vcpu, gpa, new, bytes);
> > > > > > +}
> > > > > 
> > > > > The kvm_mmu_track_write() is only used for x86, where the incoming 
> > > > > parameter
> > > > > "u8 *new" has not been required since 0e0fee5c539b ("kvm: mmu: Fix 
> > > > > race in
> > > > > emulated page table writes"), please help confirm if it's still 
> > > > > needed ? Thanks.
> > > > > A minor clean up is proposed.
> > > > 
> > > > Hmm, unless I'm misreading things, KVMGT ultimately doesn't consume 
> > > > @new either.
> > > > So I think we can remove @new from kvm_page_track_write() entirely.
> > > Sorry for the late reply.
> > > Yes, KVMGT does not consume @new and it reads the guest PTE again in the
> > > page track write handler.
> > > 
> > > But I have a couple of questions related to the mentioned commit as
> > > below:
> > > 
> > > (1) If "re-reading the current value of the guest PTE after the MMU lock 
> > > has
> > > been acquired", then should KVMGT also acquire the MMU lock too?
> > 
> > No.  If applicable, KVMGT should read the new/current value after acquiring
> > whatever lock protects the generation (or update) of the shadow entries.  I
> > suspect KVMGT already does this, but I don't have time to confirm that at 
> > this
> I think the mutex lock and unlock of info->vgpu_lock you added in
> kvmgt_page_track_write() is the counterpart :)
> 
> > exact moment.
> > 
> > The race that was fixed in KVM was:
> > 
> >   vCPU0                          vCPU1
> >   write X
> >                                  write Y
> >                                  sync SPTE w/ Y
> >   sync SPTE w/ X
> > 
> > Reading the value after acquiring mmu_lock ensures that both vCPUs will see 
> > whatever
> > value "loses" the race, i.e. whatever written value is processed second 
> > ('Y' in the
> > above sequence).
> I suspect that vCPU0 may still generate a wrong SPTE if vCPU1 wrote 4
> bytes while vCPU0 wrote 8 bytes, though the chances are very low.
> 
This could happen in the below sequence:
vCPU0 updates a PTE to AABBCCDD;
vCPU1 updates a PTE to EEFFGGHH in two writes.
(each character stands for a byte)

vCPU0                          vCPU1
write AABBCCDD
                               write GGHH
                               detect 4-byte write and hold off sync
sync SPTE w/ AABBGGHH
                               write EEFF
                               sync SPTE w/ EEFFGGHH
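
The torn intermediate value can be reproduced with plain integer arithmetic
(a sketch only; the hex constants below stand in for the letter bytes above,
with vCPU1's two 4-byte halves written low half first):

```c
#include <assert.h>
#include <stdint.h>

/* Model vCPU1's split update of an 8-byte PTE as two 4-byte stores. */
static uint64_t write_low32(uint64_t pte, uint32_t val)
{
	return (pte & 0xffffffff00000000ull) | val;
}

static uint64_t write_high32(uint64_t pte, uint32_t val)
{
	return (pte & 0x00000000ffffffffull) | ((uint64_t)val << 32);
}
```

A sync that lands between the two halves observes a value that neither vCPU
ever wrote in full, which is the AABBGGHH case above.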


Do you think the below serialization work is worthwhile?

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a915e23d61fa..51cd0ab73529 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1445,6 +1445,8 @@ struct kvm_arch {
 */
 #define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1)
struct kvm_mmu_memory_cache split_desc_cache;
+
+   struct xarray track_writing_range;
 };

 struct kvm_vm_stat {
diff --git a/arch/x86/kvm/mmu/page_track.c b/arch/x86/kvm/mmu/page_track.c
index fd04e618ad2d..4b271701dcf6 100644
--- a/arch/x86/kvm/mmu/page_track.c
+++ b/arch/x86/kvm/mmu/page_track.c
@@ -142,12 +142,14 @@ void kvm_page_track_cleanup(struct kvm *kvm)

head = &kvm->arch.track_notifier_head;
cleanup_srcu_struct(&head->track_srcu);
+   xa_destroy(&kvm->arch.track_writing_range);
 }

 int kvm_page_track_init(struct kvm *kvm)
 {
struct kvm_page_track_notifier_head *head;

+   xa_init(&kvm->arch.track_writing_range);
head = &kvm->arch.track_notifier_head;
INIT_HLIST_HEAD(&head->track_notifier_list);
return init_srcu_struct(&head->track_srcu);
diff --git a/arch/x86/kvm/mmu/page_track.h b/arch/x86/kvm/mmu/page_track.h
index 62f98c6c5af3..1829792b9892 100644
--- a/arch/x86/kvm/mmu/page_trac

Re: [Intel-gfx] [PATCH 19/27] KVM: x86/mmu: Use page-track notifiers iff there are external users

2023-08-10 Thread Yan Zhao
On Thu, Aug 10, 2023 at 08:41:14AM -0700, Sean Christopherson wrote:
> On Thu, Aug 10, 2023, Yan Zhao wrote:
> > On Thu, Aug 10, 2023 at 07:21:03AM +0800, Yan Zhao wrote:
> > > > Reading the value after acquiring mmu_lock ensures that both vCPUs will 
> > > > see whatever
> > > > value "loses" the race, i.e. whatever written value is processed second 
> > > > ('Y' in the
> > > > above sequence).
> > > I suspect that vCPU0 may still generate a wrong SPTE if vCPU1 wrote 4
> > > bytes while vCPU0 wrote 8 bytes, though the chances are very low.
> > > 
> > This could happen in below sequence:
> > vCPU0 updates a PTE to AABBCCDD;
> > vCPU1 updates a PTE to EEFFGGHH in two writes.
> > (each character stands for a byte)
> > 
> > vCPU0                          vCPU1
> > write AABBCCDD
> >                                write GGHH
> >                                detect 4 bytes write and hold on sync
> > sync SPTE w/ AABBGGHH
> >                                write EEFF
> >                                sync SPTE w/ EEFFGGHH
> > 
> > 
> > Do you think it worth below serialization work?
> 
> No, because I don't see any KVM bugs with the above sequence.  If the guest 
> doesn't
> ensure *all* writes from vCPU0 and vCPU1 are fully serialized, then it is 
> completely
> legal for hardware (KVM in this case) to consume AABBGGHH as a PTE.  The only 
> thing
> the guest shouldn't see is EEFFCCDD, but I don't see how that can happen.
Ok, though it still feels a little odd that while a 1st cmpxchg instruction on
a GPA is still under emulation, a 2nd or 3rd... cmpxchg instruction to the same
GPA may have returned, and they all succeeded :)


Re: [Intel-gfx] [PATCH v2 25/27] KVM: x86/mmu: Drop @slot param from exported/external page-track APIs

2023-05-03 Thread Yan Zhao
On Wed, May 03, 2023 at 04:16:10PM -0700, Sean Christopherson wrote:
> Finally getting back to this series...
> 
> On Thu, Mar 23, 2023, Yan Zhao wrote:
> > On Fri, Mar 17, 2023 at 04:28:56PM +0800, Yan Zhao wrote:
> > > On Fri, Mar 10, 2023 at 04:22:56PM -0800, Sean Christopherson wrote:
> > > ...
> > > > +int kvm_write_track_add_gfn(struct kvm *kvm, gfn_t gfn)
> > > > +{
> > > > +   struct kvm_memory_slot *slot;
> > > > +   int idx;
> > > > +
> > > > +   idx = srcu_read_lock(&kvm->srcu);
> > > > +
> > > > +   slot = gfn_to_memslot(kvm, gfn);
> > > > +   if (!slot) {
> > > > +   srcu_read_unlock(&kvm->srcu, idx);
> > > > +   return -EINVAL;
> > > > +   }
> > > > +
> > > Also fail if slot->flags & KVM_MEMSLOT_INVALID is true?
> > > There should exist a window for external users to see an invalid slot
> > > when a slot is about to get deleted/moved.
> > > (It happens before MOVE is rejected in kvm_arch_prepare_memory_region()).
> > 
> > Or using
> > if (!kvm_is_visible_memslot(slot)) {
> > srcu_read_unlock(&kvm->srcu, idx);
> > return -EINVAL;
> > }
> 
> Hrm.  If the DELETE/MOVE succeeds, then the funky accounting is ok (by the end
> of the series) as the tracking disappears on DELETE, KVMGT will reject MOVE, 
> and
> KVM proper zaps SPTEs and resets accounting on MOVE (account_shadowed() runs 
> under
> mmu_lock and thus ensures all previous SPTEs are zapped before the "flush" 
> from
> kvm_arch_flush_shadow_memslot() can run).
> 
> If kvm_prepare_memory_region() fails though...
> 
> Ah, KVM itself is safe because of the aforementioned 
> kvm_arch_flush_shadow_memslot().
> Any accounting done on a temporarily invalid memslot will be unwound when the 
> SPTEs
> are zapped.  So for KVM, ignoring invalid memslots is correct _and necessary_.
> We could clean that up by having accounted_shadowed() use the @slot from the 
> fault,
> which would close the window where the fault starts with a valid memslot but 
> then
> sees an invalid memslot when accounting a new shadow page.  But I don't think 
> there
> is a bug there.
> 
> Right, and DELETE can't actually fail in the current code base, and we've 
> established
> that MOVE can't possibly work.  So even if this is problematic in theory, 
> there are
> no _unknown_ bugs, and the known bugs are fixed by the end of the series.
> 
> And at the end of the series, KVMGT drops its tracking only when the DELETE is
> committed.  So I _think_ allowing external trackers to add and remove gfns for
> write-tracking in an invalid slot is actually desirable/correct.  I'm pretty 
> sure
> removal should be allowed as that can lead to dangling write-protection in a
> rollback scenario.   And I can't think of anything that will break (in the 
> kernel)
> if write-tracking a gfn in an invalid slot is allowed, so I don't see any 
> harm in
> allowing the extremely theoretical case of KVMGT shadowing a gfn in a 
> to-be-deleted
> memslot _and_ the deletion being rolled back.
Yes, you are right!
I previously thought that invalid_slot->arch.gfn_write_track and
old->arch.gfn_write_track were pointing to different places, but I was wrong.
Yes, allowing an INVALID slot here is more desirable for deletion rollback.



Re: [Intel-gfx] [PATCH v2 20/27] KVM: x86/mmu: Use page-track notifiers iff there are external users

2023-05-05 Thread Yan Zhao
On Thu, May 04, 2023 at 12:54:40PM -0700, Sean Christopherson wrote:
> On Wed, Mar 15, 2023, Sean Christopherson wrote:
> > On Wed, Mar 15, 2023, Yan Zhao wrote:
> > > On Fri, Mar 10, 2023 at 04:22:51PM -0800, Sean Christopherson wrote:
> > > > Disable the page-track notifier code at compile time if there are no
> > > > external users, i.e. if CONFIG_KVM_EXTERNAL_WRITE_TRACKING=n.  KVM 
> > > > itself
> > > > now hooks emulated writes directly instead of relying on the page-track
> > > > mechanism.
> > > > 
> > > > Signed-off-by: Sean Christopherson 
> > > > ---
> > > >  arch/x86/include/asm/kvm_host.h   |  2 ++
> > > >  arch/x86/include/asm/kvm_page_track.h |  2 ++
> > > >  arch/x86/kvm/mmu/page_track.c |  9 -
> > > >  arch/x86/kvm/mmu/page_track.h | 29 +++
> > > >  4 files changed, 33 insertions(+), 9 deletions(-)
> > > > 
> > > > diff --git a/arch/x86/include/asm/kvm_host.h 
> > > > b/arch/x86/include/asm/kvm_host.h
> > > > index 1a4225237564..a3423711e403 100644
> > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > @@ -1265,7 +1265,9 @@ struct kvm_arch {
> > > >  * create an NX huge page (without hanging the guest).
> > > >  */
> > > > struct list_head possible_nx_huge_pages;
> > > > +#ifdef CONFIG_KVM_EXTERNAL_WRITE_TRACKING
> > > > struct kvm_page_track_notifier_head track_notifier_head;
> > > > +#endif
> > > > /*
> > > >  * Protects marking pages unsync during page faults, as TDP MMU 
> > > > page
> > > >  * faults only take mmu_lock for read.  For simplicity, the 
> > > > unsync
> > > > diff --git a/arch/x86/include/asm/kvm_page_track.h 
> > > > b/arch/x86/include/asm/kvm_page_track.h
> > > > index deece45936a5..53c2adb25a07 100644
> > > > --- a/arch/x86/include/asm/kvm_page_track.h
> > > > +++ b/arch/x86/include/asm/kvm_page_track.h
> > > The "#ifdef CONFIG_KVM_EXTERNAL_WRITE_TRACKING" can be moved to the
> > > front of this file?
> > > All the structures are only exposed for external users now.
> > 
> > Huh.  I've no idea why I didn't do that.  IIRC, the entire reason past me 
> > wrapped
> > track_notifier_head in an #ifdef was to allow this change in 
> > kvm_page_track.h.
> > 
> > I'll do this in the next version unless I discover an edge case I'm 
> > overlooking.
> 
> Ah, deja vu.  I tried this first time around, and got yelled at by the kernel 
> test
> robot.  Unsuprisingly, my second attempt yielded the same result :-)
> 
>   HDRTEST drivers/gpu/drm/i915/gvt/gvt.h
> In file included from :
> gpu/drivers/gpu/drm/i915/gvt/gvt.h:236:45: error: field ‘track_node’ has 
> incomplete type
>   236 | struct kvm_page_track_notifier_node track_node;
>   | ^~
> 
> The problem is direct header inclusion.  Nothing in the kernel includes gvt.h
> when CONFIG_DRM_I915_GVT=n, but the header include guard tests include headers
> directly on the command line.  I think I'll define a "stub" specifically to 
> play
> nice with this sort of testing.  Guarding the guts of gvt.h with 
> CONFIG_DRM_I915_GVT
> would just propagate the problem, and guarding the node definition in "struct
> intel_vgpu" would be confusing since the guard would be dead code for all 
> intents
> and purposes.
> 
> The obvious alternative would be to leave kvm_page_track_notifier_node 
> outside of
> the #ifdef, but I really want to bury kvm_page_track_notifier_head for KVM's 
> sake,
> and having "head" buried but not "node" would also be weird and confusing.
> 
> diff --git a/arch/x86/include/asm/kvm_page_track.h 
> b/arch/x86/include/asm/kvm_page_track.h
> index 33f087437209..3d040741044b 100644
> --- a/arch/x86/include/asm/kvm_page_track.h
> +++ b/arch/x86/include/asm/kvm_page_track.h
> @@ -51,6 +51,12 @@ void kvm_page_track_unregister_notifier(struct kvm *kvm,
>  
>  int kvm_write_track_add_gfn(struct kvm *kvm, gfn_t gfn);
>  int kvm_write_track_remove_gfn(struct kvm *kvm, gfn_t gfn);
> +#else
> +/*
> + * Allow defining a node in a structure even if page tracking is disabled, 
> e.g.
> + * to play nice with testing headers via direct inclusion from the command 
> line.
> + */
> +struct kv

Re: [Intel-gfx] [PATCH v2 05/27] drm/i915/gvt: Verify VFIO-pinned page is THP when shadowing 2M gtt entry

2023-05-06 Thread Yan Zhao
> > Maybe the checking of PageTransHuge(cur_page) and bailing out is not 
> > necessary.
> > If a page is not transparent huge, but there are 512 contiguous 4K
> > pages, I think it's still good to map them in IOMMU in 2M.
> > See vfio_pin_map_dma() who does similar things.
> 
> I agree that bailing isn't strictly necessary, and processing "blindly" should
> Just Work for HugeTLB and other hugepage types.  I was going to argue that it
> would be safer to add this and then drop it at the end, but I think that's a
> specious argument.  If not checking the page type is unsafe, then the existing
> code is buggy, and this changelog literally states that the check for 
> contiguous
> pages guards against any such problems.
> 
> I do think there's a (very, very theoretical) issue though.  For 
> "CONFIG_SPARSEMEM=y
> && CONFIG_SPARSEMEM_VMEMMAP=n", struct pages aren't virtually contiguous with 
> respect
> to their pfns, i.e. it's possible (again, very theoretically) that two struct 
> pages
> could be virtually contiguous but physically discontiguous.  I suspect I'm 
> being
> ridiculously paranoid, but for the efficient cases where pages are guaranteed 
> to
> be contiguous, the extra page_to_pfn() checks should be optimized away by the
> compiler, i.e. there's no meaningful downside to the paranoia.
To make sure I understand it correctly:
There are 3 conditions:
(1) Two struct pages aren't virtually contiguous, but their PFNs are contiguous.
(2) Two struct pages are virtually contiguous but their PFNs aren't contiguous.
(It looks like this cannot happen?)
(3) Two struct pages are virtually contiguous, and their PFNs are contiguous, 
too.
But they have different backends, e.g.
PFN 1 and PFN 2 are contiguous, while PFN 1 belongs to RAM, and PFN 2
belongs to DEVMEM.

I think you mean condition (3) is problematic, am I right?
> 
> TL;DR: My plan is to drop this patch and instead harden the continuity check.

So you want to check page zone?


Re: [Intel-gfx] [PATCH v2 05/27] drm/i915/gvt: Verify VFIO-pinned page is THP when shadowing 2M gtt entry

2023-05-06 Thread Yan Zhao
On Sat, May 06, 2023 at 02:35:41PM +0800, Yan Zhao wrote:
> > > Maybe the checking of PageTransHuge(cur_page) and bailing out is not 
> > > necessary.
> > > If a page is not transparent huge, but there are 512 contiguous 4K
> > > pages, I think it's still good to map them in IOMMU in 2M.
> > > See vfio_pin_map_dma() who does similar things.
> > 
> > I agree that bailing isn't strictly necessary, and processing "blindly" 
> > should
> > Just Work for HugeTLB and other hugepage types.  I was going to argue that 
> > it
> > would be safer to add this and then drop it at the end, but I think that's a
> > specious argument.  If not checking the page type is unsafe, then the 
> > existing
> > code is buggy, and this changelog literally states that the check for 
> > contiguous
> > pages guards against any such problems.
> > 
> > I do think there's a (very, very theoretical) issue though.  For 
> > "CONFIG_SPARSEMEM=y
> > && CONFIG_SPARSEMEM_VMEMMAP=n", struct pages aren't virtually contiguous 
> > with respect
> > to their pfns, i.e. it's possible (again, very theoretically) that two 
> > struct pages
> > could be virtually contiguous but physically discontiguous.  I suspect I'm 
> > being
> > ridiculously paranoid, but for the efficient cases where pages are 
> > guaranteed to
> > be contiguous, the extra page_to_pfn() checks should be optimized away by 
> > the
> > compiler, i.e. there's no meaningful downside to the paranoia.
> To make sure I understand it correctly:
> There are 3 conditions:
> (1) Two struct pages aren't virtually contiguous, but their PFNs are 
> contiguous.
> (2) Two struct pages are virtually contiguous but their PFNs aren't 
> contiguous.
> (Looks this will not happen?)
> (3) Two struct pages are virtually contiguous, and their PFNs are contiguous, 
> too.
> But they have different backends, e.g.
> PFN 1 and PFN 2 are contiguous, while PFN 1 belongs to RAM, and PFN 2
> belongs to DEVMEM.
> 
> I think you mean condition (3) is problematic, am I right?
Oh, I got it now.
You are talking about condition (2), with "CONFIG_SPARSEMEM=y &&
CONFIG_SPARSEMEM_VMEMMAP=n".
Two struct pages are contiguous if one is at one section's tail and another at
another section's head, but the two sections aren't for contiguous PFNs.

> > 
> > TL;DR: My plan is to drop this patch and instead harden the continuity 
> > check.
> 
> So you want to check page zone?


Re: [Intel-gfx] [PATCH v2 25/27] KVM: x86/mmu: Drop @slot param from exported/external page-track APIs

2023-05-07 Thread Yan Zhao
On Thu, May 04, 2023 at 10:17:20AM +0800, Yan Zhao wrote:
> On Wed, May 03, 2023 at 04:16:10PM -0700, Sean Christopherson wrote:
> > Finally getting back to this series...
> > 
> > On Thu, Mar 23, 2023, Yan Zhao wrote:
> > > On Fri, Mar 17, 2023 at 04:28:56PM +0800, Yan Zhao wrote:
> > > > On Fri, Mar 10, 2023 at 04:22:56PM -0800, Sean Christopherson wrote:
> > > > ...
> > > > > +int kvm_write_track_add_gfn(struct kvm *kvm, gfn_t gfn)
> > > > > +{
> > > > > + struct kvm_memory_slot *slot;
> > > > > + int idx;
> > > > > +
> > > > > + idx = srcu_read_lock(&kvm->srcu);
> > > > > +
> > > > > + slot = gfn_to_memslot(kvm, gfn);
> > > > > + if (!slot) {
> > > > > + srcu_read_unlock(&kvm->srcu, idx);
> > > > > + return -EINVAL;
> > > > > + }
> > > > > +
> > > > Also fail if slot->flags & KVM_MEMSLOT_INVALID is true?
> > > > There should exist a window for external users to see an invalid slot
> > > > when a slot is about to get deleted/moved.
> > > > (It happens before MOVE is rejected in 
> > > > kvm_arch_prepare_memory_region()).
> > > 
> > > Or using
> > > if (!kvm_is_visible_memslot(slot)) {
> > >   srcu_read_unlock(&kvm->srcu, idx);
> > >   return -EINVAL;
> > >   }
> > 
Hi Sean,
After more thoughts, do you think checking KVM internal memslot is necessary?

slot = gfn_to_memslot(kvm, gfn);
if (!slot || slot->id >= KVM_USER_MEM_SLOTS) {
srcu_read_unlock(&kvm->srcu, idx);
return -EINVAL;
}

Do we allow write tracking to APIC access page when APIC-write VM exit
is not desired?

Thanks
Yan

> > Hrm.  If the DELETE/MOVE succeeds, then the funky accounting is ok (by the 
> > end
> > of the series) as the tracking disappears on DELETE, KVMGT will reject 
> > MOVE, and
> > KVM proper zaps SPTEs and resets accounting on MOVE (account_shadowed() 
> > runs under
> > mmu_lock and thus ensures all previous SPTEs are zapped before the "flush" 
> > from
> > kvm_arch_flush_shadow_memslot() can run).
> > 
> > If kvm_prepare_memory_region() fails though...
> > 
> > Ah, KVM itself is safe because of the aforementioned 
> > kvm_arch_flush_shadow_memslot().
> > Any accounting done on a temporarily invalid memslot will be unwound when 
> > the SPTEs
> > are zapped.  So for KVM, ignoring invalid memslots is correct _and 
> > necessary_.
> > We could clean that up by having accounted_shadowed() use the @slot from 
> > the fault,
> > which would close the window where the fault starts with a valid memslot 
> > but then
> > sees an invalid memslot when accounting a new shadow page.  But I don't 
> > think there
> > is a bug there.
> > 
> > Right, and DELETE can't actually fail in the current code base, and we've 
> > established
> > that MOVE can't possibly work.  So even if this is problematic in theory, 
> > there are
> > no _unknown_ bugs, and the known bugs are fixed by the end of the series.
> > 
> > And at the end of the series, KVMGT drops its tracking only when the DELETE 
> > is
> > committed.  So I _think_ allowing external trackers to add and remove gfns 
> > for
> > write-tracking in an invalid slot is actually desirable/correct.  I'm 
> > pretty sure
> > removal should be allowed as that can lead to dangling write-protection in a
> > rollback scenario.   And I can't think of anything that will break (in the 
> > kernel)
> > if write-tracking a gfn in an invalid slot is allowed, so I don't see any 
> > harm in
> > allowing the extremely theoretical case of KVMGT shadowing a gfn in a 
> > to-be-deleted
> > memslot _and_ the deletion being rolled back.
> Yes, you are right!
> I previously thought that
> invalid_slot->arch.gfn_write_track and old->arch.gfn_write_track are
> pointing to different places. But I'm wrong.
> Yes, allowing INVALID slot here is more desired for deletion rolling-back.
> 


Re: [Intel-gfx] [PATCH v2 25/27] KVM: x86/mmu: Drop @slot param from exported/external page-track APIs

2023-05-11 Thread Yan Zhao
> > Hi Sean,
> > After more thoughts, do you think checking KVM internal memslot is 
> > necessary?
> 
> I don't think it's necessary per se, but I also can't think of any reason to 
> allow
> it.
> 
> > slot = gfn_to_memslot(kvm, gfn);
> > if (!slot || slot->id >= KVM_USER_MEM_SLOTS) {
> > srcu_read_unlock(&kvm->srcu, idx);
> > return -EINVAL;
> > }
> > 
> > Do we allow write tracking to APIC access page when APIC-write VM exit
> > is not desired?
> 
> Allow?  Yes.
>  
> But KVM doesn't use write-tracking for anything APICv related, e.g. to disable
> APICv, KVM instead zaps the SPTEs for the APIC access page and on page fault 
> goes
> straight to MMIO emulation.
> 
> Theoretically, the guest could create an intermediate PTE in the APIC access 
> page
> and AFAICT KVM would shadow the access and write-protect the APIC access page.
> But that's benign as the resulting emulation would be handled just like 
> emulated
> APIC MMIO.
> 
> FWIW, the other internal memslots, TSS and idenity mapped page tables, are 
> used
> if and only if paging is disabled in the guest, i.e. there are no guest PTEs 
> for
> KVM to shadow (and paging must be enabled to enable VMX, so nested EPT is also
> ruled out).  So this is theoretically possible only for the APIC access page.
> That changes with KVMGT, but that again should not be problematic.  KVM will
> emulate in response to the write-protected page and things go on.  E.g. it's
> arguably much weirder that the guest can read/write the identity mapped page
> tables that are used for EPT without unrestricted guest.
> 
> There's no sane reason to allow creating PTEs in the APIC page, but I'm also 
> not
> all that motivated to "fix" things.   account_shadowed() isn't expected to 
> fail,
> so KVM would need to check further up the stack, e.g. in walk_addr_generic() 
> by
> open coding a form of kvm_vcpu_gfn_to_hva_prot().
> 
> I _think_ that's the only place KVM would need to add a check, as KVM already
> checks that the root, i.e. CR3, is in a "visible" memslot.  I suppose KVM 
> could
> just synthesize triple fault, like it does for the root/CR3 case, but I don't
> like making up behavior.
> 
> In other words, I'm not opposed to disallowing write-tracking internal 
> memslots,
> but I can't think of anything that will break, and so for me personally at 
> least,
> the ROI isn't sufficient to justify writing tests and dealing with any 
> fallout.

It makes sense. Thanks for the explanation.



Re: [Intel-gfx] [PATCH v3 03/28] drm/i915/gvt: Verify hugepages are contiguous in physical address space

2023-05-16 Thread Yan Zhao
hi Sean

Do you think it's necessary to double check that struct page pointers
are also contiguous?

And do you like to also include a fix as below, which is to remove the
warning in vfio_device_container_unpin_pages() when npage is 0?

@@ -169,7 +173,8 @@ static int gvt_pin_guest_page(struct intel_vgpu *vgpu, 
unsigned long gfn,
*page = base_page;
return 0;
 err:
-   gvt_unpin_guest_page(vgpu, gfn, npage * PAGE_SIZE);
+   if (npage)
+   gvt_unpin_guest_page(vgpu, gfn, npage * PAGE_SIZE);
return ret;
 }

BTW, I've sent a separate fix for vfio_iommu_type1_pin_pages() to ensure
struct page is a valid address.
https://lore.kernel.org/lkml/20230516093007.15234-1-yan.y.z...@intel.com/

On Fri, May 12, 2023 at 05:35:35PM -0700, Sean Christopherson wrote:
> When shadowing a GTT entry with a 2M page, verify that the pfns are
> contiguous, not just that the struct page pointers are contiguous.  The
> memory map is virtual contiguous if "CONFIG_FLATMEM=y ||
> CONFIG_SPARSEMEM_VMEMMAP=y", but not for "CONFIG_SPARSEMEM=y &&
> CONFIG_SPARSEMEM_VMEMMAP=n", so theoretically KVMGT could encounter struct
> pages that are virtually contiguous, but not physically contiguous.
> 
> In practice, this flaw is likely a non-issue as it would cause functional
> problems iff a section isn't 2M aligned _and_ is directly adjacent to
> another section with discontiguous pfns.
> 
> Signed-off-by: Sean Christopherson 
> ---
>  drivers/gpu/drm/i915/gvt/kvmgt.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/gvt/kvmgt.c 
> b/drivers/gpu/drm/i915/gvt/kvmgt.c
> index de675d799c7d..429f0f993a13 100644
> --- a/drivers/gpu/drm/i915/gvt/kvmgt.c
> +++ b/drivers/gpu/drm/i915/gvt/kvmgt.c
> @@ -161,7 +161,7 @@ static int gvt_pin_guest_page(struct intel_vgpu *vgpu, 
> unsigned long gfn,
>  
>   if (npage == 0)
>   base_page = cur_page;
> - else if (base_page + npage != cur_page) {
> + else if (page_to_pfn(base_page) + npage != 
> page_to_pfn(cur_page)) {
>   gvt_vgpu_err("The pages are not continuous\n");
>   ret = -EINVAL;
>   npage++;
> -- 
> 2.40.1.606.ga4b1b128d6-goog
> 


Re: [Intel-gfx] [PATCH v3 07/28] drm/i915/gvt: Don't rely on KVM's gfn_to_pfn() to query possible 2M GTT

2023-05-16 Thread Yan Zhao
Reviewed-by: Yan Zhao 
Tested-by: Yan Zhao 

On Fri, May 12, 2023 at 05:35:39PM -0700, Sean Christopherson wrote:
> Now that gvt_pin_guest_page() explicitly verifies the pinned PFN is a
> transparent hugepage page, don't use KVM's gfn_to_pfn() to pre-check if a
> 2MiB GTT entry is possible and instead just try to map the GFN with a 2MiB
> entry.  Using KVM to query pfn that is ultimately managed through VFIO is
> odd, and KVM's gfn_to_pfn() is not intended for non-KVM consumption; it's
> exported only because of KVM vendor modules (x86 and PPC).
> 
> Open code the check on 2MiB support instead of keeping
> is_2MB_gtt_possible() around for a single line of code.
> 
> Move the call to intel_gvt_dma_map_guest_page() for a 4KiB entry into its
> case statement, i.e. fork the common path into the 4KiB and 2MiB "direct"
> shadow paths.  Keeping the call in the "common" path is arguably more in
> the spirit of "one change per patch", but retaining the local "page_size"
> variable is silly, i.e. the call site will be changed either way, and
> jumping around the no-longer-common code is more subtle and rather odd,
> i.e. would just need to be immediately cleaned up.
> 
> Drop the error message from gvt_pin_guest_page() when KVMGT attempts to
> shadow a 2MiB guest page that isn't backed by a compatible hugepage in the
> host.  Dropping the pre-check on a THP makes it much more likely that the
> "error" will be encountered in normal operation.
> 
> Signed-off-by: Sean Christopherson 
> ---
>  drivers/gpu/drm/i915/gvt/gtt.c   | 49 ++--
>  drivers/gpu/drm/i915/gvt/kvmgt.c |  1 -
>  2 files changed, 8 insertions(+), 42 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gvt/gtt.c b/drivers/gpu/drm/i915/gvt/gtt.c
> index 61e38acee2d5..f505be9e647a 100644
> --- a/drivers/gpu/drm/i915/gvt/gtt.c
> +++ b/drivers/gpu/drm/i915/gvt/gtt.c
> @@ -1145,36 +1145,6 @@ static inline void ppgtt_generate_shadow_entry(struct 
> intel_gvt_gtt_entry *se,
>   ops->set_pfn(se, s->shadow_page.mfn);
>  }
>  
> -/*
> - * Check if can do 2M page
> - * @vgpu: target vgpu
> - * @entry: target pfn's gtt entry
> - *
> - * Return 1 if 2MB huge gtt shadowing is possible, 0 if miscondition,
> - * negative if found err.
> - */
> -static int is_2MB_gtt_possible(struct intel_vgpu *vgpu,
> - struct intel_gvt_gtt_entry *entry)
> -{
> - const struct intel_gvt_gtt_pte_ops *ops = vgpu->gvt->gtt.pte_ops;
> - kvm_pfn_t pfn;
> - int ret;
> -
> - if (!HAS_PAGE_SIZES(vgpu->gvt->gt->i915, I915_GTT_PAGE_SIZE_2M))
> - return 0;
> -
> - pfn = gfn_to_pfn(vgpu->vfio_device.kvm, ops->get_pfn(entry));
> - if (is_error_noslot_pfn(pfn))
> - return -EINVAL;
> -
> - if (!pfn_valid(pfn))
> - return -EINVAL;
> -
> - ret = PageTransHuge(pfn_to_page(pfn));
> - kvm_release_pfn_clean(pfn);
> - return ret;
> -}
> -
>  static int split_2MB_gtt_entry(struct intel_vgpu *vgpu,
>   struct intel_vgpu_ppgtt_spt *spt, unsigned long index,
>   struct intel_gvt_gtt_entry *se)
> @@ -1268,7 +1238,7 @@ static int ppgtt_populate_shadow_entry(struct 
> intel_vgpu *vgpu,
>  {
>   const struct intel_gvt_gtt_pte_ops *pte_ops = vgpu->gvt->gtt.pte_ops;
>   struct intel_gvt_gtt_entry se = *ge;
> - unsigned long gfn, page_size = PAGE_SIZE;
> + unsigned long gfn;
>   dma_addr_t dma_addr;
>   int ret;
>  
> @@ -1283,6 +1253,9 @@ static int ppgtt_populate_shadow_entry(struct 
> intel_vgpu *vgpu,
>   switch (ge->type) {
>   case GTT_TYPE_PPGTT_PTE_4K_ENTRY:
>   gvt_vdbg_mm("shadow 4K gtt entry\n");
> + ret = intel_gvt_dma_map_guest_page(vgpu, gfn, PAGE_SIZE, 
> &dma_addr);
> + if (ret)
> + return -ENXIO;
>   break;
>   case GTT_TYPE_PPGTT_PTE_64K_ENTRY:
>   gvt_vdbg_mm("shadow 64K gtt entry\n");
> @@ -1294,12 +1267,10 @@ static int ppgtt_populate_shadow_entry(struct 
> intel_vgpu *vgpu,
>   return split_64KB_gtt_entry(vgpu, spt, index, &se);
>   case GTT_TYPE_PPGTT_PTE_2M_ENTRY:
>   gvt_vdbg_mm("shadow 2M gtt entry\n");
> - ret = is_2MB_gtt_possible(vgpu, ge);
> - if (ret == 0)
> + if (!HAS_PAGE_SIZES(vgpu->gvt->gt->i915, I915_GTT_PAGE_SIZE_2M) 
> ||
> + intel_gvt_dma_map_guest_page(vgpu, gfn,
> +  I915_GTT_PAGE_SIZE_2M, 
> &dma_addr))
> 

Re: [Intel-gfx] [PATCH v3 12/28] KVM: x86/mmu: Don't rely on page-track mechanism to flush on memslot change

2023-05-16 Thread Yan Zhao
Reviewed-by: Yan Zhao 

On Fri, May 12, 2023 at 05:35:44PM -0700, Sean Christopherson wrote:
> Call kvm_mmu_zap_all_fast() directly when flushing a memslot instead of
> bouncing through the page-track mechanism.  KVM (unfortunately) needs to
> zap and flush all page tables on memslot DELETE/MOVE irrespective of
> whether KVM is shadowing guest page tables.
> 
> This will allow changing KVM to register a page-track notifier on the
> first shadow root allocation, and will also allow deleting the misguided
> kvm_page_track_flush_slot() hook itself once KVM-GT also moves to a
> different method for reacting to memslot changes.
> 
> No functional change intended.
> 
> Cc: Yan Zhao 
> Link: https://lore.kernel.org/r/20221110014821.1548347-2-sea...@google.com
> Signed-off-by: Sean Christopherson 
> ---
>  arch/x86/kvm/mmu/mmu.c | 10 ++
>  1 file changed, 2 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 2e4476d38377..23a79723031b 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6184,13 +6184,6 @@ static bool kvm_has_zapped_obsolete_pages(struct kvm 
> *kvm)
>   return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages));
>  }
>  
> -static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
> - struct kvm_memory_slot *slot,
> - struct kvm_page_track_notifier_node *node)
> -{
> - kvm_mmu_zap_all_fast(kvm);
> -}
> -
>  int kvm_mmu_init_vm(struct kvm *kvm)
>  {
>   struct kvm_page_track_notifier_node *node = &kvm->arch.mmu_sp_tracker;
> @@ -6208,7 +6201,6 @@ int kvm_mmu_init_vm(struct kvm *kvm)
>   }
>  
>   node->track_write = kvm_mmu_pte_write;
> - node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
>   kvm_page_track_register_notifier(kvm, node);
>  
>   kvm->arch.split_page_header_cache.kmem_cache = mmu_page_header_cache;
> @@ -6750,6 +6742,8 @@ void kvm_arch_flush_shadow_all(struct kvm *kvm)
>  void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
>  struct kvm_memory_slot *slot)
>  {
> + kvm_mmu_zap_all_fast(kvm);
> +
>   kvm_page_track_flush_slot(kvm, slot);
>  }
>  
> -- 
> 2.40.1.606.ga4b1b128d6-goog
> 


Re: [Intel-gfx] [PATCH v3 21/28] KVM: x86/mmu: Use page-track notifiers iff there are external users

2023-05-16 Thread Yan Zhao
Reviewed-by: Yan Zhao 

On Fri, May 12, 2023 at 05:35:53PM -0700, Sean Christopherson wrote:
> Disable the page-track notifier code at compile time if there are no
> external users, i.e. if CONFIG_KVM_EXTERNAL_WRITE_TRACKING=n.  KVM itself
> now hooks emulated writes directly instead of relying on the page-track
> mechanism.
> 
> Provide a stub for "struct kvm_page_track_notifier_node" so that including
> headers directly from the command line, e.g. for testing include guards,
> doesn't fail due to a struct having an incomplete type.
> 
> Signed-off-by: Sean Christopherson 
> ---
>  arch/x86/include/asm/kvm_host.h   |  2 ++
>  arch/x86/include/asm/kvm_page_track.h | 22 +---
>  arch/x86/kvm/mmu/page_track.c | 10 -
>  arch/x86/kvm/mmu/page_track.h | 29 +++
>  4 files changed, 47 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 113598d3e886..5ce06a75d3de 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1247,7 +1247,9 @@ struct kvm_arch {
>* create an NX huge page (without hanging the guest).
>*/
>   struct list_head possible_nx_huge_pages;
> +#ifdef CONFIG_KVM_EXTERNAL_WRITE_TRACKING
>   struct kvm_page_track_notifier_head track_notifier_head;
> +#endif
>   /*
>* Protects marking pages unsync during page faults, as TDP MMU page
>* faults only take mmu_lock for read.  For simplicity, the unsync
> diff --git a/arch/x86/include/asm/kvm_page_track.h 
> b/arch/x86/include/asm/kvm_page_track.h
> index 76c0070dfe2a..61adb07b5927 100644
> --- a/arch/x86/include/asm/kvm_page_track.h
> +++ b/arch/x86/include/asm/kvm_page_track.h
> @@ -9,6 +9,14 @@ enum kvm_page_track_mode {
>   KVM_PAGE_TRACK_MAX,
>  };
>  
> +void kvm_slot_page_track_add_page(struct kvm *kvm,
> +   struct kvm_memory_slot *slot, gfn_t gfn,
> +   enum kvm_page_track_mode mode);
> +void kvm_slot_page_track_remove_page(struct kvm *kvm,
> +  struct kvm_memory_slot *slot, gfn_t gfn,
> +  enum kvm_page_track_mode mode);
> +
> +#ifdef CONFIG_KVM_EXTERNAL_WRITE_TRACKING
>  /*
>   * The notifier represented by @kvm_page_track_notifier_node is linked into
>   * the head which will be notified when guest is triggering the track event.
> @@ -48,18 +56,18 @@ struct kvm_page_track_notifier_node {
>   struct kvm_page_track_notifier_node *node);
>  };
>  
> -void kvm_slot_page_track_add_page(struct kvm *kvm,
> -   struct kvm_memory_slot *slot, gfn_t gfn,
> -   enum kvm_page_track_mode mode);
> -void kvm_slot_page_track_remove_page(struct kvm *kvm,
> -  struct kvm_memory_slot *slot, gfn_t gfn,
> -  enum kvm_page_track_mode mode);
> -
>  void
>  kvm_page_track_register_notifier(struct kvm *kvm,
>struct kvm_page_track_notifier_node *n);
>  void
>  kvm_page_track_unregister_notifier(struct kvm *kvm,
>  struct kvm_page_track_notifier_node *n);
> +#else
> +/*
> + * Allow defining a node in a structure even if page tracking is disabled, 
> e.g.
> + * to play nice with testing headers via direct inclusion from the command 
> line.
> + */
> +struct kvm_page_track_notifier_node {};
> +#endif /* CONFIG_KVM_EXTERNAL_WRITE_TRACKING */
>  
>  #endif
> diff --git a/arch/x86/kvm/mmu/page_track.c b/arch/x86/kvm/mmu/page_track.c
> index e15329d48f95..b20aad7ac3fe 100644
> --- a/arch/x86/kvm/mmu/page_track.c
> +++ b/arch/x86/kvm/mmu/page_track.c
> @@ -194,6 +194,7 @@ bool kvm_slot_page_track_is_active(struct kvm *kvm,
>   return !!READ_ONCE(slot->arch.gfn_track[mode][index]);
>  }
>  
> +#ifdef CONFIG_KVM_EXTERNAL_WRITE_TRACKING
>  void kvm_page_track_cleanup(struct kvm *kvm)
>  {
>   struct kvm_page_track_notifier_head *head;
> @@ -255,14 +256,13 @@ EXPORT_SYMBOL_GPL(kvm_page_track_unregister_notifier);
>   * The node should figure out if the written page is the one that node is
>   * interested in by itself.
>   */
> -void kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
> -   int bytes)
> +void __kvm_page_track_write(struct kvm *kvm, gpa_t gpa, const u8 *new, int 
> bytes)
>  {
>   struct kvm_page_track_notifier_head *head;
>   struct kvm_page_track_notifier_node *n;
>   int idx;
>  
> - head = &vcpu->kvm->arch

Re: [Intel-gfx] [PATCH v3 03/28] drm/i915/gvt: Verify hugepages are contiguous in physical address space

2023-05-18 Thread Yan Zhao
On Wed, May 17, 2023 at 07:50:26AM -0700, Sean Christopherson wrote:
> On Tue, May 16, 2023, Yan Zhao wrote:
> > hi Sean
> > 
> > Do you think it's necessary to double check that struct page pointers
> > are also contiguous?
> 
> No, the virtual address space should be irrelevant.  The only way it would be
> problematic is if something in dma_map_page() expected to be able to access 
> the
> entire chunk of memory by getting the virtual address of only the first page,
> but I can't imagine that code is reading or writing memory, let alone doing so
> across a huge range of memory.
Yes, I do find the arm_iommu version of dma_map_page() accesses the memory by
getting the virtual addresses of the pages passed in, but it's implemented
page by page, not only from the first page.

dma_map_page
  dma_map_page_attrs
ops->map_page
  arm_iommu_map_page
 __dma_page_cpu_to_dev
   dma_cache_maint_page


Just a little worried about the case where PFNs are contiguous
while they belong to different backends, e.g. one from system memory and
one from MMIO.
But I don't know how to avoid this without complicated checks,
and this condition might not happen in practice.


> 
> > And do you like to also include a fix as below, which is to remove the
> > warning in vfio_device_container_unpin_pages() when npage is 0?
> > 
> > @@ -169,7 +173,8 @@ static int gvt_pin_guest_page(struct intel_vgpu *vgpu, 
> > unsigned long gfn,
> > *page = base_page;
> > return 0;
> >  err:
> > -   gvt_unpin_guest_page(vgpu, gfn, npage * PAGE_SIZE);
> > +   if (npage)
> > +   gvt_unpin_guest_page(vgpu, gfn, npage * PAGE_SIZE);
> > return ret;
> >  }
> 
> Sure.  Want to give your SoB?  I'll write a changelog.
>
Thanks!
It's just a small code piece. Whatever is convenient for you :)


Re: [Intel-gfx] [PATCH v3 03/28] drm/i915/gvt: Verify hugepages are contiguous in physical address space

2023-05-18 Thread Yan Zhao
On Thu, May 18, 2023 at 11:04:46AM -0700, Sean Christopherson wrote:
> On Thu, May 18, 2023, Yan Zhao wrote:
> > On Wed, May 17, 2023 at 07:50:26AM -0700, Sean Christopherson wrote:
> > > On Tue, May 16, 2023, Yan Zhao wrote:
> > > > hi Sean
> > > > 
> > > > Do you think it's necessary to double check that struct page pointers
> > > > are also contiguous?
> > > 
> > > No, the virtual address space should be irrelevant.  The only way it 
> > > would be
> > > problematic is if something in dma_map_page() expected to be able to 
> > > access the
> > > entire chunk of memory by getting the virtual address of only the first 
> > > page,
> > > but I can't imagine that code is reading or writing memory, let alone 
> > > doing so
> > > across a huge range of memory.
> > Yes, I do find the arm_iommu version of dma_map_page() accesses the memory
> > by getting the virtual addresses of the pages passed in, but it's
> > implemented page by page, not only from the first page.
> > 
> > dma_map_page
> >   dma_map_page_attrs
> > ops->map_page
> >   arm_iommu_map_page
> 
> Heh, thankfully this is ARM specific, which presumably doesn't collide with 
> KVMGT.

Actually, this is fine with KVMGT (owing to the page-by-page access), isn't it?
:)

> 
> >  __dma_page_cpu_to_dev
> >dma_cache_maint_page
> > 
> > Just a little worried about the case where PFNs are contiguous
> > while they belong to different backends, e.g. one from system memory and
> > one from MMIO.
> > But I don't know how to avoid this without complicated checks,
> > and this condition might not happen in practice.
> 
> IMO, assuming that contiguous pfns are virtually contiguous is wrong, i.e.
> would be a bug in the other code.  The above dma_cache_maint_page() gets
> this right, and even has a well written comment to boot.
Right.


Re: [Intel-gfx] [PATCH v4 16/19] vfio: Add VFIO_DEVICE_BIND_IOMMUFD

2023-02-26 Thread Yan Zhao
On Fri, Feb 24, 2023 at 10:31:35AM -0400, Jason Gunthorpe wrote:
> On Fri, Feb 24, 2023 at 12:58:22PM +0800, Yan Zhao wrote:
> > On Wed, Feb 22, 2023 at 08:59:51AM -0400, Jason Gunthorpe wrote:
> > > On Wed, Feb 22, 2023 at 07:44:12AM +, Liu, Yi L wrote:
> > > > > From: Tian, Kevin 
> > > > > Sent: Wednesday, February 22, 2023 3:40 PM
> > > > > 
> > > > > > From: Liu, Yi L 
> > > > > > Sent: Tuesday, February 21, 2023 11:48 AM
> > > > > >
> > > > > > +
> > > > > > +void vfio_device_cdev_close(struct vfio_device_file *df)
> > > > > > +{
> > > > > > +   struct vfio_device *device = df->device;
> > > > > > +
> > > > > > +   mutex_lock(&device->dev_set->lock);
> > > > > > +   if (!smp_load_acquire(&df->access_granted)) {
> > > > > 
> > > > > there is no contention with another one changing this flag at this
> > > > > point so directly accessing it is fine.
> > > > 
> > > > make sense. 
> > > 
> > > Have to use READ_ONCE though
> > >
> > Just a curious question:
> > given df->access_granted is now written with device->dev_set->lock held and
> > also read with this lock held in vfio_device_cdev_close(), is READ_ONCE
> > still required? And what about df->iommufd?
> 
> No, if the writer is under a lock held by the reader then it is always
> OK to use naked read. Best to document it with a comment
>
Thanks for the clarification!


Re: [Intel-gfx] [PATCH v2 00/27] drm/i915/gvt: KVM: KVMGT fixes and page-track cleanups

2023-03-13 Thread Yan Zhao
On Fri, Mar 10, 2023 at 04:22:31PM -0800, Sean Christopherson wrote:
> Fix a variety of found-by-inspection bugs in KVMGT, and overhaul KVM's
> page-track APIs to provide a leaner and cleaner interface.  The motivation
> for this series is to (significantly) reduce the number of KVM APIs that
> KVMGT uses, with a long-term goal of making all kvm_host.h headers
> KVM-internal.
> 
> As was the case in v1, the KVMGT changes are compile-tested only.
> 
> Based on "git://git.kernel.org/pub/scm/virt/kvm/kvm.git next".
> 
> v2:
>  - Reuse vgpu_lock to protect gfn hash instead of introducing a new (and
>buggy) mutex. [Yan]
>  - Remove a spurious return from kvm_page_track_init(). [Yan]
>  - Take @kvm directly in the inner __kvm_page_track_write(). [Yan]
>  - Delete the gfn sanity check that relies on kvm_is_visible_gfn() instead
>of providing a dedicated interface. [Yan]
> 
> base-commit: 45dd9bc75d9adc9483f0c7d662ba6e73ed698a0b
> -- 
Thanks for the update!
It passed basic tests (gvt in a single vm) on my side.
Will do detailed review tomorrow.

Thanks
Yan


Re: [Intel-gfx] [PATCH v2 04/27] drm/i915/gvt: Incorporate KVM memslot info into check for 2MiB GTT entry

2023-03-13 Thread Yan Zhao
On Fri, Mar 10, 2023 at 04:22:35PM -0800, Sean Christopherson wrote:
> Honor KVM's max allowed page size when determining whether or not a 2MiB
> GTT shadow page can be created for the guest.  Querying KVM's max allowed
> size is somewhat odd as there's no strict requirement that KVM's memslots
> and VFIO's mappings are configured with the same gfn=>hva mapping, but
> the check will be accurate if userspace wants to have a functional guest,
> and at the very least checking KVM's memslots guarantees that the entire
> 2MiB range has been exposed to the guest.
>
hi Sean,
I remember in our last discussion, the conclusion was that
we can safely just use VFIO ABI (which is intel_gvt_dma_map_guest_page()
introduced in patch 7) to check max mapping size. [1][2]

"Though checking kvm_page_track_max_mapping_level() is also fine, it makes DMA
mapping size unnecessarily smaller."
This is especially true when the guest page is page-tracked by KVM internally
for nested VMs, but is not page-tracked by KVMGT as PPGTT page table pages.
For this reason, KVMGT is OK to map those pages as huge even when 4K is
returned by kvm_page_track_max_mapping_level().


"I'm totally fine if KVMGT's ABI is that VFIO is the source of truth for 
mappings
and permissions, and that the only requirement for KVM memslots is that GTT page
tables need to be visible in KVM's memslots.  But if that's the ABI, then
intel_gvt_is_valid_gfn() should be probing VFIO, not KVM (commit cc753fbe1ac4
("drm/i915/gvt: validate gfn before set shadow page entry").

In other words, pick either VFIO or KVM.  Checking that X is valid according to
KVM and then mapping X through VFIO is confusing and makes assumptions about how
userspace configures KVM and VFIO.  It works because QEMU always configures KVM
and VFIO as expected, but IMO it's unnecessarily fragile and again confusing for
unaware readers because the code is technically flawed.
"

[1] https://lore.kernel.org/all/y7y+759in2dh5...@yzhao56-desk.sh.intel.com/
[2] https://lore.kernel.org/all/y7clklumcy+xl...@google.com/

> Note, KVM may also restrict the mapping size for reasons that aren't
> relevant to KVMGT, e.g. for KVM's iTLB multi-hit workaround or if the gfn
> is write-tracked (KVM's write-tracking only handles writes from vCPUs).
> However, such scenarios are unlikely to occur with a well-behaved guest,
> and at worst will result in sub-optimal performance.

As confirmed in [3], there's no risk of iTLB multi-hit even for
not-well-behaved guests if the IOMMU page tables for DMA mappings are in a
separate set of tables from EPT/NPT (which is currently the default
condition).

[3] https://lore.kernel.org/all/y7%2ffzpizeyial...@yzhao56-desk.sh.intel.com/

So, I'm fine with exporting this kvm_page_track_max_mapping_level()
interface, but I don't think KVMGT is a user of it. 
> 
> Fixes: b901b252b6cf ("drm/i915/gvt: Add 2M huge gtt support")
> Signed-off-by: Yan Zhao 
> Signed-off-by: Sean Christopherson 
> ---
>  arch/x86/include/asm/kvm_page_track.h |  2 ++
>  arch/x86/kvm/mmu/page_track.c | 18 ++
>  drivers/gpu/drm/i915/gvt/gtt.c| 10 +-
>  3 files changed, 29 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/kvm_page_track.h 
> b/arch/x86/include/asm/kvm_page_track.h
> index eb186bc57f6a..3f72c7a172fc 100644
> --- a/arch/x86/include/asm/kvm_page_track.h
> +++ b/arch/x86/include/asm/kvm_page_track.h
> @@ -51,6 +51,8 @@ void kvm_page_track_cleanup(struct kvm *kvm);
>  
>  bool kvm_page_track_write_tracking_enabled(struct kvm *kvm);
>  int kvm_page_track_write_tracking_alloc(struct kvm_memory_slot *slot);
> +enum pg_level kvm_page_track_max_mapping_level(struct kvm *kvm, gfn_t gfn,
> +enum pg_level max_level);
>  
>  void kvm_page_track_free_memslot(struct kvm_memory_slot *slot);
>  int kvm_page_track_create_memslot(struct kvm *kvm,
> diff --git a/arch/x86/kvm/mmu/page_track.c b/arch/x86/kvm/mmu/page_track.c
> index 0a2ac438d647..e739dcc3375c 100644
> --- a/arch/x86/kvm/mmu/page_track.c
> +++ b/arch/x86/kvm/mmu/page_track.c
> @@ -301,3 +301,21 @@ void kvm_page_track_flush_slot(struct kvm *kvm, struct 
> kvm_memory_slot *slot)
>   n->track_flush_slot(kvm, slot, n);
>   srcu_read_unlock(&head->track_srcu, idx);
>  }
> +
> +enum pg_level kvm_page_track_max_mapping_level(struct kvm *kvm, gfn_t gfn,
> +enum pg_level max_level)
> +{
> + struct kvm_memory_slot *slot;
> + int idx;
> +
> + idx = srcu_read_lock(&kvm->srcu);
> + slot = gfn_to_memslot(kvm, gfn);
> + if (!slot || slot->flags & KVM_MEMSL

Re: [Intel-gfx] [PATCH v2 11/27] KVM: x86/mmu: Don't rely on page-track mechanism to flush on memslot change

2023-03-14 Thread Yan Zhao
On Fri, Mar 10, 2023 at 04:22:42PM -0800, Sean Christopherson wrote:
...
> -static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
> - struct kvm_memory_slot *slot,
> - struct kvm_page_track_notifier_node *node)
> -{
> - kvm_mmu_zap_all_fast(kvm);
> -}
> -
>  int kvm_mmu_init_vm(struct kvm *kvm)
>  {
>   struct kvm_page_track_notifier_node *node = &kvm->arch.mmu_sp_tracker;
> @@ -6110,7 +6103,6 @@ int kvm_mmu_init_vm(struct kvm *kvm)
>   }
>  
>   node->track_write = kvm_mmu_pte_write;
> - node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
>   kvm_page_track_register_notifier(kvm, node);
>  
>   kvm->arch.split_page_header_cache.kmem_cache = mmu_page_header_cache;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index f706621c35b8..29dd6c97d145 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12662,6 +12662,8 @@ void kvm_arch_flush_shadow_all(struct kvm *kvm)
>  void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
>  struct kvm_memory_slot *slot)
>  {
> + kvm_mmu_zap_all_fast(kvm);
Could we still call kvm_mmu_invalidate_zap_pages_in_memslot() here?
As far as I know, TDX's version of
kvm_mmu_invalidate_zap_pages_in_memslot() looks like this:

static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
struct kvm_memory_slot *slot,
struct kvm_page_track_notifier_node *node)
{
if (kvm_gfn_shared_mask(kvm))
kvm_mmu_zap_memslot(kvm, slot);
else
kvm_mmu_zap_all_fast(kvm);
}

Maybe this kind of judgment is better confined to mmu.c?

Thanks
Yan

> +
>   kvm_page_track_flush_slot(kvm, slot);
>  }
>  


Re: [Intel-gfx] [PATCH v2 14/27] KVM: x86: Reject memslot MOVE operations if KVMGT is attached

2023-03-15 Thread Yan Zhao
On Fri, Mar 10, 2023 at 04:22:45PM -0800, Sean Christopherson wrote:
> Disallow moving memslots if the VM has external page-track users, i.e. if
> KVMGT is being used to expose a virtual GPU to the guest, as KVM doesn't
> correctly handle moving memory regions.
> 
> Note, this is potential ABI breakage!  E.g. userspace could move regions
> that aren't shadowed by KVMGT without harming the guest.  However, the
> only known user of KVMGT is QEMU, and QEMU doesn't move generic memory
> regions.  KVM's own support for moving memory regions was also broken for
> multiple years (albeit for an edge case, but arguably moving RAM is
> itself an edge case), e.g. see commit edd4fa37baa6 ("KVM: x86: Allocate
> new rmap and large page tracking when moving memslot").
> 
> Signed-off-by: Sean Christopherson 
...
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 29dd6c97d145..47ac9291cd43 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12484,6 +12484,13 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>  struct kvm_memory_slot *new,
>  enum kvm_mr_change change)
>  {
> + /*
> +  * KVM doesn't support moving memslots when there are external page
> +  * trackers attached to the VM, i.e. if KVMGT is in use.
> +  */
> + if (change == KVM_MR_MOVE && kvm_page_track_has_external_user(kvm))
> + return -EINVAL;
Hmm, will page track work correctly on moving memslots when there are no
external users?

in case of KVM_MR_MOVE,
kvm_prepare_memory_region(kvm, old, new, change)
  |->kvm_arch_prepare_memory_region(kvm, old, new, change)
   |->kvm_alloc_memslot_metadata(kvm, new)
|->memset(&slot->arch, 0, sizeof(slot->arch));
|->kvm_page_track_create_memslot(kvm, slot, npages)
The new->arch.arch.gfn_write_track will be fresh empty.


kvm_arch_commit_memory_region(kvm, old, new, change);
  |->kvm_arch_free_memslot(kvm, old);
   |->kvm_page_track_free_memslot(slot);
The old->arch.gfn_write_track is freed afterwards.

So, in theory, the new GFNs are not write tracked though the old ones are.

Is that acceptable for the internal page-track user?

>   if (change == KVM_MR_CREATE || change == KVM_MR_MOVE) {
>   if ((new->base_gfn + new->npages - 1) > kvm_mmu_max_gfn())
>   return -EINVAL;
> -- 
> 2.40.0.rc1.284.g88254d51c5-goog
> 


Re: [Intel-gfx] [PATCH v2 19/27] KVM: x86/mmu: Move KVM-only page-track declarations to internal header

2023-03-15 Thread Yan Zhao
On Fri, Mar 10, 2023 at 04:22:50PM -0800, Sean Christopherson wrote:
> Bury the declaration of the page-track helpers that are intended only for
> internal KVM use in a "private" header.  In addition to guarding against
> unwanted usage of the internal-only helpers, dropping their definitions
> avoids exposing other structures that should be KVM-internal, e.g. for
> memslots.  This is a baby step toward making kvm_host.h a KVM-internal
> header in the very distant future.
> 
> Signed-off-by: Sean Christopherson 
> ---
>  arch/x86/include/asm/kvm_page_track.h | 26 -
>  arch/x86/kvm/mmu/mmu.c|  3 ++-
>  arch/x86/kvm/mmu/page_track.c |  8 +--
>  arch/x86/kvm/mmu/page_track.h | 33 +++
>  arch/x86/kvm/x86.c|  1 +
>  5 files changed, 42 insertions(+), 29 deletions(-)
>  create mode 100644 arch/x86/kvm/mmu/page_track.h
> 
> diff --git a/arch/x86/include/asm/kvm_page_track.h 
> b/arch/x86/include/asm/kvm_page_track.h
> index e5eb98ca4fce..deece45936a5 100644
> --- a/arch/x86/include/asm/kvm_page_track.h
> +++ b/arch/x86/include/asm/kvm_page_track.h

A curious question:
are arch/x86/include/asm/kvm_*.h all expected to be externally accessible?

Thanks
Yan



Re: [Intel-gfx] [PATCH v2 20/27] KVM: x86/mmu: Use page-track notifiers iff there are external users

2023-03-15 Thread Yan Zhao
Nit: there is a typo in the commit header: "iff" -> "if"

> -void kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
> -   int bytes)
> +void __kvm_page_track_write(struct kvm *kvm, gpa_t gpa, const u8 *new, int 
> bytes)
Line length is 81 characters. A little longer than 80 :)

> +static inline bool kvm_page_track_has_external_user(struct kvm *kvm) { 
> return false; }
This line is also too long.



Re: [Intel-gfx] [PATCH v2 20/27] KVM: x86/mmu: Use page-track notifiers iff there are external users

2023-03-15 Thread Yan Zhao
On Fri, Mar 10, 2023 at 04:22:51PM -0800, Sean Christopherson wrote:
> Disable the page-track notifier code at compile time if there are no
> external users, i.e. if CONFIG_KVM_EXTERNAL_WRITE_TRACKING=n.  KVM itself
> now hooks emulated writes directly instead of relying on the page-track
> mechanism.
> 
> Signed-off-by: Sean Christopherson 
> ---
>  arch/x86/include/asm/kvm_host.h   |  2 ++
>  arch/x86/include/asm/kvm_page_track.h |  2 ++
>  arch/x86/kvm/mmu/page_track.c |  9 -
>  arch/x86/kvm/mmu/page_track.h | 29 +++
>  4 files changed, 33 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 1a4225237564..a3423711e403 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1265,7 +1265,9 @@ struct kvm_arch {
>* create an NX huge page (without hanging the guest).
>*/
>   struct list_head possible_nx_huge_pages;
> +#ifdef CONFIG_KVM_EXTERNAL_WRITE_TRACKING
>   struct kvm_page_track_notifier_head track_notifier_head;
> +#endif
>   /*
>* Protects marking pages unsync during page faults, as TDP MMU page
>* faults only take mmu_lock for read.  For simplicity, the unsync
> diff --git a/arch/x86/include/asm/kvm_page_track.h 
> b/arch/x86/include/asm/kvm_page_track.h
> index deece45936a5..53c2adb25a07 100644
> --- a/arch/x86/include/asm/kvm_page_track.h
> +++ b/arch/x86/include/asm/kvm_page_track.h
Could the "#ifdef CONFIG_KVM_EXTERNAL_WRITE_TRACKING" be moved to the
front of this file?
All of the structures are now only exposed for external users.

> @@ -55,6 +55,7 @@ void kvm_slot_page_track_remove_page(struct kvm *kvm,
>struct kvm_memory_slot *slot, gfn_t gfn,
>enum kvm_page_track_mode mode);
>  
> +#ifdef CONFIG_KVM_EXTERNAL_WRITE_TRACKING
>  enum pg_level kvm_page_track_max_mapping_level(struct kvm *kvm, gfn_t gfn,
>  enum pg_level max_level);
>  
> @@ -64,5 +65,6 @@ kvm_page_track_register_notifier(struct kvm *kvm,
>  void
>  kvm_page_track_unregister_notifier(struct kvm *kvm,
>  struct kvm_page_track_notifier_node *n);
> +#endif /* CONFIG_KVM_EXTERNAL_WRITE_TRACKING */
>  
>  #endif


Re: [Intel-gfx] [PATCH v2 19/27] KVM: x86/mmu: Move KVM-only page-track declarations to internal header

2023-03-16 Thread Yan Zhao
On Wed, Mar 15, 2023 at 08:13:37AM -0700, Sean Christopherson wrote:
> > A curious question:
> > are arch/x86/include/asm/kvm_*.h all expected to be externally accessible?
> 
> Depends on what you mean by "expected".  Currently, yes, everything in there 
> is
> globally visible.  But the vast majority of structs, defines, functions, etc. 
> aren't
> intended for external non-KVM consumption, things ended up being globally 
> visible
> largely through carelessness and/or a lack of a forcing function.
> 
> E.g. there is absolutely no reason anything outside of KVM should need
> arch/x86/include/asm/kvm-x86-ops.h, but it landed in asm/ because, at the 
> time it
> was added, nothing would be harmed by making kvm-x86-ops.h "public" and we 
> didn't
> scrutinize the patches well enough.
> 
> My primary motivation for this series is to (eventually) get to a state where 
> only
> select symbols/defines/etc. are exposed by KVM to the outside world, and 
> everything
> else is internal only.  The end goal of tightly restricting KVM's global API 
> is to
> allow concurrently loading multiple instances of kvm.ko so that userspace can
> upgrade/rollback KVM without needed to move VMs off the host, i.e. by 
> performing
> intrahost migration between differenate instances of KVM on the same host.  
> To do
> that safely, anything that is visible outside of KVM needs to be compatible 
> across
> different instances of KVM, e.g. if kvm_vcpu is "public" then a KVM 
> upgrade/rollback
> wouldn't be able to touch "struct kvm_vcpu" in any way.  We'll definitely 
> want to be
> able to modify things like the vCPU structures, thus the push to restrict the 
> API.
> 
> But even if we never realize that end goal, IMO drastically reducing KVM's 
> "public"
> API surface is worthy goal in and of itself.
Got it. Thanks for the explanation!


Re: [Intel-gfx] [PATCH v2 14/27] KVM: x86: Reject memslot MOVE operations if KVMGT is attached

2023-03-16 Thread Yan Zhao
On Wed, Mar 15, 2023 at 08:43:54AM -0700, Sean Christopherson wrote:
> > So, in theory, the new GFNs are not write tracked though the old ones are.
> > 
> > Is that acceptable for the internal page-track user?
> 
> It works because KVM zaps all SPTEs when a memslot is moved, i.e. the fact 
> that
Oh, yes!
And KVM will not shadow SPTEs for an invalid memslot, so there's no
problem.
Thanks~

> KVM loses the write-tracking counts is benign.  I suspect no VMM actually does
> does KVM_MR_MOVE in conjunction with shadow paging, but the ongoing 
> maintenance
> cost of supporting KVM_MR_MOVE is quite low at this point, so trying to rip it
> out isn't worth the pain of having to deal with potential ABI breakage.
> 
> Though in hindsight I wish I had tried disallowed moving memslots instead of
> fixing the various bugs a few years back. :-(


Re: [Intel-gfx] [PATCH v2 20/27] KVM: x86/mmu: Use page-track notifiers iff there are external users

2023-03-16 Thread Yan Zhao
On Wed, Mar 15, 2023 at 09:21:34AM -0700, Sean Christopherson wrote:
> On Wed, Mar 15, 2023, Yan Zhao wrote:
> > Nit: there is a typo in the commit header: "iff" -> "if"
> > 
> > > -void kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 
> > > *new,
> > > -   int bytes)
> > > +void __kvm_page_track_write(struct kvm *kvm, gpa_t gpa, const u8 *new, 
> > > int bytes)
> > Line length is 81 characters. A little longer than 80 :)
> > 
> > > +static inline bool kvm_page_track_has_external_user(struct kvm *kvm) { 
> > > return false; }
> > This line is also too long.
> 
> The 80 character limit is a "soft" limit these days, e.g. checkpatch only 
> complains
> if a line is 100+.  In KVM x86, the preferred style is to treat the 80 char 
> limit
> as "firm", for lack of a better word.  E.g. let a line run over if it's just a
> char or two and there's no other wrapping in the declaration, but don't 
> create long
> lines just because checkpatch no longer yells.
> 
Got it. That's very helpful!

> There's obviously a fair bit of subjectivity, but the guideline has worked 
> well
> so far (hopefully I didn't just jinx us).




Re: [Intel-gfx] [PATCH v2 01/27] drm/i915/gvt: Verify pfn is "valid" before dereferencing "struct page"

2023-03-16 Thread Yan Zhao
Reviewed-by: Yan Zhao 

On Fri, Mar 10, 2023 at 04:22:32PM -0800, Sean Christopherson wrote:
> Check that the pfn found by gfn_to_pfn() is actually backed by "struct
> page" memory prior to retrieving and dereferencing the page.  KVM
> supports backing guest memory with VM_PFNMAP, VM_IO, etc., and so
> there is no guarantee the pfn returned by gfn_to_pfn() has an associated
> "struct page".
> 
> Fixes: b901b252b6cf ("drm/i915/gvt: Add 2M huge gtt support")
> Signed-off-by: Sean Christopherson 
> ---
>  drivers/gpu/drm/i915/gvt/gtt.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/gvt/gtt.c b/drivers/gpu/drm/i915/gvt/gtt.c
> index 4ec85308379a..58b9b316ae46 100644
> --- a/drivers/gpu/drm/i915/gvt/gtt.c
> +++ b/drivers/gpu/drm/i915/gvt/gtt.c
> @@ -1183,6 +1183,10 @@ static int is_2MB_gtt_possible(struct intel_vgpu *vgpu,
>   pfn = gfn_to_pfn(vgpu->vfio_device.kvm, ops->get_pfn(entry));
>   if (is_error_noslot_pfn(pfn))
>   return -EINVAL;
> +
> + if (!pfn_valid(pfn))
> + return -EINVAL;
> +
>   return PageTransHuge(pfn_to_page(pfn));
>  }
>  
> -- 
> 2.40.0.rc1.284.g88254d51c5-goog
> 


Re: [Intel-gfx] [PATCH v2 03/27] drm/i915/gvt: remove interface intel_gvt_is_valid_gfn

2023-03-16 Thread Yan Zhao
Tested-by: Yan Zhao 

On Fri, Mar 10, 2023 at 04:22:34PM -0800, Sean Christopherson wrote:
> From: Yan Zhao 
> 
> Currently intel_gvt_is_valid_gfn() is called in two places:
> (1) shadowing guest GGTT entry
> (2) shadowing guest PPGTT leaf entry,
> which was introduced in commit cc753fbe1ac4
> ("drm/i915/gvt: validate gfn before set shadow page entry").
> 
> However, now it's not necessary to call this interface any more, because
> a. GGTT partial write issue has been fixed by
>commit bc0686ff5fad
>("drm/i915/gvt: support inconsecutive partial gtt entry write")
>commit 510fe10b6180
>("drm/i915/gvt: fix a bug of partially write ggtt enties")
> b. PPGTT resides in normal guest RAM and we only treat 8-byte writes
>as valid page table writes. Any invalid GPA found is regarded as
>an error, either due to guest misbehavior/attack or bug in host
>shadow code.
>So,rather than do GFN pre-checking and replace invalid GFNs with
>scratch GFN and continue silently, just remove the pre-checking and
>abort PPGTT shadowing on error detected.
> c. GFN validity check is still performed in
>intel_gvt_dma_map_guest_page() --> gvt_pin_guest_page().
>It's more desirable to call VFIO interface to do both validity check
>and mapping.
>Calling intel_gvt_is_valid_gfn() to do GFN validity check from KVM side
>while later mapping the GFN through VFIO interface is unnecessarily
>fragile and confusing for unaware readers.
> 
> Signed-off-by: Yan Zhao 
> [sean: remove now-unused local variables]
> Signed-off-by: Sean Christopherson 
> ---
>  drivers/gpu/drm/i915/gvt/gtt.c | 36 +-
>  1 file changed, 1 insertion(+), 35 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gvt/gtt.c b/drivers/gpu/drm/i915/gvt/gtt.c
> index 58b9b316ae46..f30922c55a0c 100644
> --- a/drivers/gpu/drm/i915/gvt/gtt.c
> +++ b/drivers/gpu/drm/i915/gvt/gtt.c
> @@ -49,22 +49,6 @@
>  static bool enable_out_of_sync = false;
>  static int preallocated_oos_pages = 8192;
>  
> -static bool intel_gvt_is_valid_gfn(struct intel_vgpu *vgpu, unsigned long 
> gfn)
> -{
> - struct kvm *kvm = vgpu->vfio_device.kvm;
> - int idx;
> - bool ret;
> -
> - if (!test_bit(INTEL_VGPU_STATUS_ATTACHED, vgpu->status))
> - return false;
> -
> - idx = srcu_read_lock(&kvm->srcu);
> - ret = kvm_is_visible_gfn(kvm, gfn);
> - srcu_read_unlock(&kvm->srcu, idx);
> -
> - return ret;
> -}
> -
>  /*
>   * validate a gm address and related range size,
>   * translate it to host gm address
> @@ -1333,11 +1317,9 @@ static int ppgtt_populate_shadow_entry(struct 
> intel_vgpu *vgpu,
>  static int ppgtt_populate_spt(struct intel_vgpu_ppgtt_spt *spt)
>  {
>   struct intel_vgpu *vgpu = spt->vgpu;
> - struct intel_gvt *gvt = vgpu->gvt;
> - const struct intel_gvt_gtt_pte_ops *ops = gvt->gtt.pte_ops;
>   struct intel_vgpu_ppgtt_spt *s;
>   struct intel_gvt_gtt_entry se, ge;
> - unsigned long gfn, i;
> + unsigned long i;
>   int ret;
>  
>   trace_spt_change(spt->vgpu->id, "born", spt,
> @@ -1354,13 +1336,6 @@ static int ppgtt_populate_spt(struct 
> intel_vgpu_ppgtt_spt *spt)
>   ppgtt_generate_shadow_entry(&se, s, &ge);
>   ppgtt_set_shadow_entry(spt, &se, i);
>   } else {
> - gfn = ops->get_pfn(&ge);
> - if (!intel_gvt_is_valid_gfn(vgpu, gfn)) {
> - ops->set_pfn(&se, gvt->gtt.scratch_mfn);
> - ppgtt_set_shadow_entry(spt, &se, i);
> - continue;
> - }
> -
>   ret = ppgtt_populate_shadow_entry(vgpu, spt, i, &ge);
>   if (ret)
>   goto fail;
> @@ -2335,14 +2310,6 @@ static int emulate_ggtt_mmio_write(struct intel_vgpu 
> *vgpu, unsigned int off,
>   m.val64 = e.val64;
>   m.type = e.type;
>  
> - /* one PTE update may be issued in multiple writes and the
> -  * first write may not construct a valid gfn
> -  */
> - if (!intel_gvt_is_valid_gfn(vgpu, gfn)) {
> - ops->set_pfn(&m, gvt->gtt.scratch_mfn);
> - goto out;
> - }
> -
>   ret = intel_gvt_dma_map_guest_page(vgpu, gfn, PAGE_SIZE,
>  &dma_addr);
>   if (ret) {
> @@ -2359,7 +2326,6 @@ static int emulate_ggtt_mmio_write(struct intel_vgpu 
> *vgpu, unsigned int off,
>   ops->clear_present(&m);
>   }
>  
> -out:
>   ggtt_set_guest_entry(ggtt_mm, &e, g_gtt_index);
>  
>   ggtt_get_host_entry(ggtt_mm, &e, g_gtt_index);
> -- 
> 2.40.0.rc1.284.g88254d51c5-goog
> 


Re: [Intel-gfx] [PATCH v2 05/27] drm/i915/gvt: Verify VFIO-pinned page is THP when shadowing 2M gtt entry

2023-03-16 Thread Yan Zhao
On Fri, Mar 10, 2023 at 04:22:36PM -0800, Sean Christopherson wrote:
> When shadowing a GTT entry with a 2M page, explicitly verify that the
> first page pinned by VFIO is a transparent hugepage instead of assuming
> that page observed by is_2MB_gtt_possible() is the same page pinned by
> vfio_pin_pages().  E.g. if userspace is doing something funky with the
> guest's memslots, or if the page is demoted between is_2MB_gtt_possible()
> and vfio_pin_pages().
> 
> This is more of a performance optimization than a bug fix as the check
> for contiguous struct pages should guard against incorrect mapping (even
> though assuming struct pages are virtually contiguous is wrong).
> 
> The real motivation for explicitly checking for a transparent hugepage
> after pinning is that it will reduce the risk of introducing a bug in a
> future fix for a page refcount leak (KVMGT doesn't put the reference
> acquired by gfn_to_pfn()), and eventually will allow KVMGT to stop using
> KVM's gfn_to_pfn() altogether.
> 
> Signed-off-by: Sean Christopherson 
> ---
>  drivers/gpu/drm/i915/gvt/kvmgt.c | 18 --
>  1 file changed, 16 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gvt/kvmgt.c 
> b/drivers/gpu/drm/i915/gvt/kvmgt.c
> index 8ae7039b3683..90997cc385b4 100644
> --- a/drivers/gpu/drm/i915/gvt/kvmgt.c
> +++ b/drivers/gpu/drm/i915/gvt/kvmgt.c
> @@ -159,11 +159,25 @@ static int gvt_pin_guest_page(struct intel_vgpu *vgpu, 
> unsigned long gfn,
>   goto err;
>   }
>  
> - if (npage == 0)
> - base_page = cur_page;
> + if (npage == 0) {
> + /*
> +  * Bail immediately to avoid unnecessary pinning when
> +  * trying to shadow a 2M page and the host page isn't
> +  * a transparent hugepage.
> +  *
> +  * TODO: support other type hugepages, e.g. HugeTLB.
> +  */
> + if (size == I915_GTT_PAGE_SIZE_2M &&
> + !PageTransHuge(cur_page))
Maybe the check of PageTransHuge(cur_page) and bailing out is not necessary.
If a page is not transparent huge, but there are 512 contiguous 4K
pages, I think it's still good to map them in the IOMMU as 2M.
See vfio_pin_map_dma(), which does similar things.

> + ret = -EIO;
> + else
> + base_page = cur_page;
> + }
>   else if (base_page + npage != cur_page) {
>   gvt_vgpu_err("The pages are not continuous\n");
>   ret = -EINVAL;
> + }
> + if (ret < 0) {
>   npage++;
>   goto err;
>   }
> -- 
> 2.40.0.rc1.284.g88254d51c5-goog
> 


Re: [Intel-gfx] [PATCH v2 07/27] drm/i915/gvt: Don't rely on KVM's gfn_to_pfn() to query possible 2M GTT

2023-03-16 Thread Yan Zhao
On Fri, Mar 10, 2023 at 04:22:38PM -0800, Sean Christopherson wrote:
>  /*
> - * Check if can do 2M page
> + * Try to map a 2M gtt entry.
>   * @vgpu: target vgpu
>   * @entry: target pfn's gtt entry
>   *
> - * Return 1 if 2MB huge gtt shadowing is possible, 0 if miscondition,
> - * negative if found err.
> + * Return 1 if 2MB huge gtt shadow was creation, 0 if the entry needs to be
> + * split, negative if found err.
>   */
> -static int is_2MB_gtt_possible(struct intel_vgpu *vgpu,
> - struct intel_gvt_gtt_entry *entry)
> +static int try_map_2MB_gtt_entry(struct intel_vgpu *vgpu,
> + struct intel_gvt_gtt_entry *entry, dma_addr_t *dma_addr)
>  {
>   const struct intel_gvt_gtt_pte_ops *ops = vgpu->gvt->gtt.pte_ops;
>   unsigned long gfn = ops->get_pfn(entry);
> - kvm_pfn_t pfn;
>   int max_level;
> - int ret;
>  
>   if (!HAS_PAGE_SIZES(vgpu->gvt->gt->i915, I915_GTT_PAGE_SIZE_2M))
>   return 0;
> @@ -1173,16 +1171,7 @@ static int is_2MB_gtt_possible(struct intel_vgpu *vgpu,
>   if (max_level < PG_LEVEL_2M)
>   return 0;
>  
> - pfn = gfn_to_pfn(vgpu->vfio_device.kvm, gfn);
> - if (is_error_noslot_pfn(pfn))
> - return -EINVAL;
> -
> - if (!pfn_valid(pfn))
> - return -EINVAL;
> -
> - ret = PageTransHuge(pfn_to_page(pfn));
> - kvm_release_pfn_clean(pfn);
> - return ret;
> + return intel_gvt_dma_map_guest_page(vgpu, gfn, I915_GTT_PAGE_SIZE_2M, 
> dma_addr);
intel_gvt_dma_map_guest_page() returns 0 on success, which is inconsistent
with the expected return value of this function, i.e.
"
Return 1 if 2MB huge gtt shadow was creation, 0 if the entry needs to be
split, negative if found err.
"


Re: [Intel-gfx] [PATCH v2 06/27] drm/i915/gvt: Put the page reference obtained by KVM's gfn_to_pfn()

2023-03-16 Thread Yan Zhao
Reviewed-by: Yan Zhao 

On Fri, Mar 10, 2023 at 04:22:37PM -0800, Sean Christopherson wrote:
> Put the struct page reference acquired by gfn_to_pfn(), KVM's API is that
> the caller is ultimately responsible for dropping any reference.
> 
> Note, kvm_release_pfn_clean() ensures the pfn is actually a refcounted
> struct page before trying to put any references.
> 
> Fixes: b901b252b6cf ("drm/i915/gvt: Add 2M huge gtt support")
> Signed-off-by: Sean Christopherson 
> ---
>  drivers/gpu/drm/i915/gvt/gtt.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/gvt/gtt.c b/drivers/gpu/drm/i915/gvt/gtt.c
> index d59c7ab9d224..15848b041a0d 100644
> --- a/drivers/gpu/drm/i915/gvt/gtt.c
> +++ b/drivers/gpu/drm/i915/gvt/gtt.c
> @@ -1160,6 +1160,7 @@ static int is_2MB_gtt_possible(struct intel_vgpu *vgpu,
>   unsigned long gfn = ops->get_pfn(entry);
>   kvm_pfn_t pfn;
>   int max_level;
> + int ret;
>  
>   if (!HAS_PAGE_SIZES(vgpu->gvt->gt->i915, I915_GTT_PAGE_SIZE_2M))
>   return 0;
> @@ -1179,7 +1180,9 @@ static int is_2MB_gtt_possible(struct intel_vgpu *vgpu,
>   if (!pfn_valid(pfn))
>   return -EINVAL;
>  
> - return PageTransHuge(pfn_to_page(pfn));
> + ret = PageTransHuge(pfn_to_page(pfn));
> + kvm_release_pfn_clean(pfn);
> + return ret;
>  }
>  
>  static int split_2MB_gtt_entry(struct intel_vgpu *vgpu,
> -- 
> 2.40.0.rc1.284.g88254d51c5-goog
> 


Re: [Intel-gfx] [PATCH v2 08/27] drm/i915/gvt: Use an "unsigned long" to iterate over memslot gfns

2023-03-16 Thread Yan Zhao
Reviewed-by: Yan Zhao 

On Fri, Mar 10, 2023 at 04:22:39PM -0800, Sean Christopherson wrote:
> Use an "unsigned long" instead of an "int" when iterating over the gfns
> in a memslot.  The number of pages in the memslot is tracked as an
> "unsigned long", e.g. KVMGT could theoretically break if a KVM memslot
> larger than 16TiB were deleted (2^32 * 4KiB).
> 
> Signed-off-by: Sean Christopherson 
> ---
>  drivers/gpu/drm/i915/gvt/kvmgt.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/gvt/kvmgt.c 
> b/drivers/gpu/drm/i915/gvt/kvmgt.c
> index 90997cc385b4..68be66395598 100644
> --- a/drivers/gpu/drm/i915/gvt/kvmgt.c
> +++ b/drivers/gpu/drm/i915/gvt/kvmgt.c
> @@ -1634,7 +1634,7 @@ static void kvmgt_page_track_flush_slot(struct kvm *kvm,
>   struct kvm_memory_slot *slot,
>   struct kvm_page_track_notifier_node *node)
>  {
> - int i;
> + unsigned long i;
>   gfn_t gfn;
>   struct intel_vgpu *info =
>   container_of(node, struct intel_vgpu, track_node);
> -- 
> 2.40.0.rc1.284.g88254d51c5-goog
> 


Re: [Intel-gfx] [PATCH v2 09/27] drm/i915/gvt: Drop unused helper intel_vgpu_reset_gtt()

2023-03-16 Thread Yan Zhao
Reviewed-by: Yan Zhao 

On Fri, Mar 10, 2023 at 04:22:40PM -0800, Sean Christopherson wrote:
> Drop intel_vgpu_reset_gtt() as it no longer has any callers.  In addition
> to eliminating dead code, this eliminates the last possible scenario where
> __kvmgt_protect_table_find() can be reached without holding vgpu_lock.
> Requiring vgpu_lock to be held when calling __kvmgt_protect_table_find()
> will allow a protecting the gfn hash with vgpu_lock without too much fuss.
> 
> No functional change intended.
> 
> Fixes: ba25d977571e ("drm/i915/gvt: Do not destroy ppgtt_mm during vGPU 
> D3->D0.")
> Signed-off-by: Sean Christopherson 
> ---
>  drivers/gpu/drm/i915/gvt/gtt.c | 18 --
>  drivers/gpu/drm/i915/gvt/gtt.h |  1 -
>  2 files changed, 19 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gvt/gtt.c b/drivers/gpu/drm/i915/gvt/gtt.c
> index e60bcce241f8..293bb2292021 100644
> --- a/drivers/gpu/drm/i915/gvt/gtt.c
> +++ b/drivers/gpu/drm/i915/gvt/gtt.c
> @@ -2845,24 +2845,6 @@ void intel_vgpu_reset_ggtt(struct intel_vgpu *vgpu, 
> bool invalidate_old)
>   ggtt_invalidate(gvt->gt);
>  }
>  
> -/**
> - * intel_vgpu_reset_gtt - reset the all GTT related status
> - * @vgpu: a vGPU
> - *
> - * This function is called from vfio core to reset reset all
> - * GTT related status, including GGTT, PPGTT, scratch page.
> - *
> - */
> -void intel_vgpu_reset_gtt(struct intel_vgpu *vgpu)
> -{
> - /* Shadow pages are only created when there is no page
> -  * table tracking data, so remove page tracking data after
> -  * removing the shadow pages.
> -  */
> - intel_vgpu_destroy_all_ppgtt_mm(vgpu);
> - intel_vgpu_reset_ggtt(vgpu, true);
> -}
> -
>  /**
>   * intel_gvt_restore_ggtt - restore all vGPU's ggtt entries
>   * @gvt: intel gvt device
> diff --git a/drivers/gpu/drm/i915/gvt/gtt.h b/drivers/gpu/drm/i915/gvt/gtt.h
> index a3b0f59ec8bd..4cb183e06e95 100644
> --- a/drivers/gpu/drm/i915/gvt/gtt.h
> +++ b/drivers/gpu/drm/i915/gvt/gtt.h
> @@ -224,7 +224,6 @@ void intel_vgpu_reset_ggtt(struct intel_vgpu *vgpu, bool 
> invalidate_old);
>  void intel_vgpu_invalidate_ppgtt(struct intel_vgpu *vgpu);
>  
>  int intel_gvt_init_gtt(struct intel_gvt *gvt);
> -void intel_vgpu_reset_gtt(struct intel_vgpu *vgpu);
>  void intel_gvt_clean_gtt(struct intel_gvt *gvt);
>  
>  struct intel_vgpu_mm *intel_gvt_find_ppgtt_mm(struct intel_vgpu *vgpu,
> -- 
> 2.40.0.rc1.284.g88254d51c5-goog
> 


Re: [Intel-gfx] [PATCH v2 10/27] drm/i915/gvt: Protect gfn hash table with vgpu_lock

2023-03-16 Thread Yan Zhao
Reviewed-by: Yan Zhao 

On Fri, Mar 10, 2023 at 04:22:41PM -0800, Sean Christopherson wrote:
> Use vgpu_lock instead of KVM's mmu_lock to protect accesses to the hash
> table used to track which gfns are write-protected when shadowing the
> guest's GTT, and hoist the acquisition of vgpu_lock from
> intel_vgpu_page_track_handler() out to its sole caller,
> kvmgt_page_track_write().
> 
> This fixes a bug where kvmgt_page_track_write(), which doesn't hold
> kvm->mmu_lock, could race with intel_gvt_page_track_remove() and trigger
> a use-after-free.
> 
> Fixing kvmgt_page_track_write() by taking kvm->mmu_lock is not an option
> as mmu_lock is a r/w spinlock, and intel_vgpu_page_track_handler() might
> sleep when acquiring vgpu->cache_lock deep down the callstack:
> 
>   intel_vgpu_page_track_handler()
>   |
>   |->  page_track->handler / ppgtt_write_protection_handler()
>|
>|-> ppgtt_handle_guest_write_page_table_bytes()
>|
>|->  ppgtt_handle_guest_write_page_table()
> |
> |-> ppgtt_handle_guest_entry_removal()
> |
> |-> ppgtt_invalidate_pte()
> |
> |-> intel_gvt_dma_unmap_guest_page()
> |
> |-> mutex_lock(&vgpu->cache_lock);
> 
> Signed-off-by: Sean Christopherson 
> ---
>  drivers/gpu/drm/i915/gvt/kvmgt.c  | 55 +++
>  drivers/gpu/drm/i915/gvt/page_track.c | 10 +
>  2 files changed, 33 insertions(+), 32 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gvt/kvmgt.c 
> b/drivers/gpu/drm/i915/gvt/kvmgt.c
> index 68be66395598..9824d075562e 100644
> --- a/drivers/gpu/drm/i915/gvt/kvmgt.c
> +++ b/drivers/gpu/drm/i915/gvt/kvmgt.c
> @@ -366,6 +366,8 @@ __kvmgt_protect_table_find(struct intel_vgpu *info, gfn_t 
> gfn)
>  {
>   struct kvmgt_pgfn *p, *res = NULL;
>  
> + lockdep_assert_held(&info->vgpu_lock);
> +
>   hash_for_each_possible(info->ptable, p, hnode, gfn) {
>   if (gfn == p->gfn) {
>   res = p;
> @@ -1567,6 +1569,9 @@ int intel_gvt_page_track_add(struct intel_vgpu *info, 
> u64 gfn)
>   if (!test_bit(INTEL_VGPU_STATUS_ATTACHED, info->status))
>   return -ESRCH;
>  
> + if (kvmgt_gfn_is_write_protected(info, gfn))
> + return 0;
> +
>   idx = srcu_read_lock(&kvm->srcu);
>   slot = gfn_to_memslot(kvm, gfn);
>   if (!slot) {
> @@ -1575,16 +1580,12 @@ int intel_gvt_page_track_add(struct intel_vgpu *info, 
> u64 gfn)
>   }
>  
>   write_lock(&kvm->mmu_lock);
> -
> - if (kvmgt_gfn_is_write_protected(info, gfn))
> - goto out;
> -
>   kvm_slot_page_track_add_page(kvm, slot, gfn, KVM_PAGE_TRACK_WRITE);
> + write_unlock(&kvm->mmu_lock);
> +
> + srcu_read_unlock(&kvm->srcu, idx);
> +
>   kvmgt_protect_table_add(info, gfn);
> -
> -out:
> - write_unlock(&kvm->mmu_lock);
> - srcu_read_unlock(&kvm->srcu, idx);
>   return 0;
>  }
>  
> @@ -1597,24 +1598,22 @@ int intel_gvt_page_track_remove(struct intel_vgpu 
> *info, u64 gfn)
>   if (!test_bit(INTEL_VGPU_STATUS_ATTACHED, info->status))
>   return -ESRCH;
>  
> - idx = srcu_read_lock(&kvm->srcu);
> - slot = gfn_to_memslot(kvm, gfn);
> - if (!slot) {
> - srcu_read_unlock(&kvm->srcu, idx);
> - return -EINVAL;
> - }
> -
> - write_lock(&kvm->mmu_lock);
> -
>   if (!kvmgt_gfn_is_write_protected(info, gfn))
> - goto out;
> + return 0;
>  
> + idx = srcu_read_lock(&kvm->srcu);
> + slot = gfn_to_memslot(kvm, gfn);
> + if (!slot) {
> + srcu_read_unlock(&kvm->srcu, idx);
> + return -EINVAL;
> + }
> +
> + write_lock(&kvm->mmu_lock);
>   kvm_slot_page_track_remove_page(kvm, slot, gfn, KVM_PAGE_TRACK_WRITE);
> + write_unlock(&kvm->mmu_lock);
> + srcu_read_unlock(&kvm->srcu, idx);
> +
>   kvmgt_protect_table_del(info, gfn);
> -
> -out:
> - write_unlock(&kvm->mmu_lock);
> - srcu_read_unlock(&kvm->srcu, idx);
>   return 0;
>  }
>  
> @@ -1625,9 +1624,13 @@ static void kvmgt_page_track_write(struct kvm_vcpu 
> *vcpu, gpa_t gpa,
>   struct intel_vgpu *info =
>   container_of(node, struct intel_vgpu, track_node);
>  
> + mutex_lock(&info->vgpu_lock);
> +
>   

Re: [Intel-gfx] [PATCH v2 12/27] KVM: x86/mmu: Don't bounce through page-track mechanism for guest PTEs

2023-03-17 Thread Yan Zhao
Reviewed-by: Yan Zhao 

On Fri, Mar 10, 2023 at 04:22:43PM -0800, Sean Christopherson wrote:
> Don't use the generic page-track mechanism to handle writes to guest PTEs
> in KVM's MMU.  KVM's MMU needs access to information that should not be
> exposed to external page-track users, e.g. KVM needs (for some definitions
> of "need") the vCPU to query the current paging mode, whereas external
> users, i.e. KVMGT, have no ties to the current vCPU and so should never
> need the vCPU.
> 
> Moving away from the page-track mechanism will allow dropping use of the
> page-track mechanism for KVM's own MMU, and will also allow simplifying
> and cleaning up the page-track APIs.
> 
> Signed-off-by: Sean Christopherson 
> ---
>  arch/x86/include/asm/kvm_host.h |  1 -
>  arch/x86/kvm/mmu.h  |  2 ++
>  arch/x86/kvm/mmu/mmu.c  | 13 ++---
>  arch/x86/kvm/mmu/page_track.c   |  2 ++
>  4 files changed, 6 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 17281d6825c9..1a4225237564 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1265,7 +1265,6 @@ struct kvm_arch {
>* create an NX huge page (without hanging the guest).
>*/
>   struct list_head possible_nx_huge_pages;
> - struct kvm_page_track_notifier_node mmu_sp_tracker;
>   struct kvm_page_track_notifier_head track_notifier_head;
>   /*
>* Protects marking pages unsync during page faults, as TDP MMU page
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 168c46fd8dd1..b8bde42f6037 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -119,6 +119,8 @@ void kvm_mmu_unload(struct kvm_vcpu *vcpu);
>  void kvm_mmu_free_obsolete_roots(struct kvm_vcpu *vcpu);
>  void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu);
>  void kvm_mmu_sync_prev_roots(struct kvm_vcpu *vcpu);
> +void kvm_mmu_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
> +  int bytes);
>  
>  static inline int kvm_mmu_reload(struct kvm_vcpu *vcpu)
>  {
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 409dabec69df..4f2f83d8322e 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5603,9 +5603,8 @@ static u64 *get_written_sptes(struct kvm_mmu_page *sp, 
> gpa_t gpa, int *nspte)
>   return spte;
>  }
>  
> -static void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
> -   const u8 *new, int bytes,
> -   struct kvm_page_track_notifier_node *node)
> +void kvm_mmu_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
> +  int bytes)
>  {
>   gfn_t gfn = gpa >> PAGE_SHIFT;
>   struct kvm_mmu_page *sp;
> @@ -6088,7 +6087,6 @@ static bool kvm_has_zapped_obsolete_pages(struct kvm 
> *kvm)
>  
>  int kvm_mmu_init_vm(struct kvm *kvm)
>  {
> - struct kvm_page_track_notifier_node *node = &kvm->arch.mmu_sp_tracker;
>   int r;
>  
>   INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
> @@ -6102,9 +6100,6 @@ int kvm_mmu_init_vm(struct kvm *kvm)
>   return r;
>   }
>  
> - node->track_write = kvm_mmu_pte_write;
> - kvm_page_track_register_notifier(kvm, node);
> -
>   kvm->arch.split_page_header_cache.kmem_cache = mmu_page_header_cache;
>   kvm->arch.split_page_header_cache.gfp_zero = __GFP_ZERO;
>  
> @@ -6125,10 +6120,6 @@ static void mmu_free_vm_memory_caches(struct kvm *kvm)
>  
>  void kvm_mmu_uninit_vm(struct kvm *kvm)
>  {
> - struct kvm_page_track_notifier_node *node = &kvm->arch.mmu_sp_tracker;
> -
> - kvm_page_track_unregister_notifier(kvm, node);
> -
>   if (tdp_mmu_enabled)
>   kvm_mmu_uninit_tdp_mmu(kvm);
>  
> diff --git a/arch/x86/kvm/mmu/page_track.c b/arch/x86/kvm/mmu/page_track.c
> index e739dcc3375c..f39f190ad4ae 100644
> --- a/arch/x86/kvm/mmu/page_track.c
> +++ b/arch/x86/kvm/mmu/page_track.c
> @@ -274,6 +274,8 @@ void kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t 
> gpa, const u8 *new,
>   if (n->track_write)
>   n->track_write(vcpu, gpa, new, bytes, n);
>   srcu_read_unlock(&head->track_srcu, idx);
> +
> + kvm_mmu_track_write(vcpu, gpa, new, bytes);
>  }
>  
>  /*
> -- 
> 2.40.0.rc1.284.g88254d51c5-goog
> 


Re: [Intel-gfx] [PATCH v2 13/27] KVM: drm/i915/gvt: Drop @vcpu from KVM's ->track_write() hook

2023-03-17 Thread Yan Zhao
Reviewed-by: Yan Zhao 

On Fri, Mar 10, 2023 at 04:22:44PM -0800, Sean Christopherson wrote:
> Drop @vcpu from KVM's ->track_write() hook provided for external users of
> the page-track APIs now that KVM itself doesn't use the page-track
> mechanism.
> 
> Signed-off-by: Sean Christopherson 
> ---
>  arch/x86/include/asm/kvm_page_track.h |  5 ++---
>  arch/x86/kvm/mmu/page_track.c |  2 +-
>  drivers/gpu/drm/i915/gvt/kvmgt.c  | 10 --
>  3 files changed, 7 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_page_track.h 
> b/arch/x86/include/asm/kvm_page_track.h
> index 3f72c7a172fc..0d65ae203fd6 100644
> --- a/arch/x86/include/asm/kvm_page_track.h
> +++ b/arch/x86/include/asm/kvm_page_track.h
> @@ -26,14 +26,13 @@ struct kvm_page_track_notifier_node {
>* It is called when guest is writing the write-tracked page
>* and write emulation is finished at that time.
>*
> -  * @vcpu: the vcpu where the write access happened.
>* @gpa: the physical address written by guest.
>* @new: the data was written to the address.
>* @bytes: the written length.
>* @node: this node
>*/
> - void (*track_write)(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
> - int bytes, struct kvm_page_track_notifier_node 
> *node);
> + void (*track_write)(gpa_t gpa, const u8 *new, int bytes,
> + struct kvm_page_track_notifier_node *node);
>   /*
>* It is called when memory slot is being moved or removed
>* users can drop write-protection for the pages in that memory slot
> diff --git a/arch/x86/kvm/mmu/page_track.c b/arch/x86/kvm/mmu/page_track.c
> index f39f190ad4ae..39a0863af8b4 100644
> --- a/arch/x86/kvm/mmu/page_track.c
> +++ b/arch/x86/kvm/mmu/page_track.c
> @@ -272,7 +272,7 @@ void kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t 
> gpa, const u8 *new,
>   hlist_for_each_entry_srcu(n, &head->track_notifier_list, node,
>   srcu_read_lock_held(&head->track_srcu))
>   if (n->track_write)
> - n->track_write(vcpu, gpa, new, bytes, n);
> + n->track_write(gpa, new, bytes, n);
>   srcu_read_unlock(&head->track_srcu, idx);
>  
>   kvm_mmu_track_write(vcpu, gpa, new, bytes);
> diff --git a/drivers/gpu/drm/i915/gvt/kvmgt.c 
> b/drivers/gpu/drm/i915/gvt/kvmgt.c
> index 9824d075562e..292750dc819f 100644
> --- a/drivers/gpu/drm/i915/gvt/kvmgt.c
> +++ b/drivers/gpu/drm/i915/gvt/kvmgt.c
> @@ -106,9 +106,8 @@ struct gvt_dma {
>  #define vfio_dev_to_vgpu(vfio_dev) \
>   container_of((vfio_dev), struct intel_vgpu, vfio_device)
>  
> -static void kvmgt_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa,
> - const u8 *val, int len,
> - struct kvm_page_track_notifier_node *node);
> +static void kvmgt_page_track_write(gpa_t gpa, const u8 *val, int len,
> +struct kvm_page_track_notifier_node *node);
>  static void kvmgt_page_track_flush_slot(struct kvm *kvm,
>   struct kvm_memory_slot *slot,
>   struct kvm_page_track_notifier_node *node);
> @@ -1617,9 +1616,8 @@ int intel_gvt_page_track_remove(struct intel_vgpu 
> *info, u64 gfn)
>   return 0;
>  }
>  
> -static void kvmgt_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa,
> - const u8 *val, int len,
> - struct kvm_page_track_notifier_node *node)
> +static void kvmgt_page_track_write(gpa_t gpa, const u8 *val, int len,
> +struct kvm_page_track_notifier_node *node)
>  {
>   struct intel_vgpu *info =
>   container_of(node, struct intel_vgpu, track_node);
> -- 
> 2.40.0.rc1.284.g88254d51c5-goog
> 


Re: [Intel-gfx] [PATCH v2 14/27] KVM: x86: Reject memslot MOVE operations if KVMGT is attached

2023-03-17 Thread Yan Zhao
Reviewed-by: Yan Zhao 

On Fri, Mar 10, 2023 at 04:22:45PM -0800, Sean Christopherson wrote:
> Disallow moving memslots if the VM has external page-track users, i.e. if
> KVMGT is being used to expose a virtual GPU to the guest, as KVM doesn't
> correctly handle moving memory regions.
> 
> Note, this is potential ABI breakage!  E.g. userspace could move regions
> that aren't shadowed by KVMGT without harming the guest.  However, the
> only known user of KVMGT is QEMU, and QEMU doesn't move generic memory
> regions.  KVM's own support for moving memory regions was also broken for
> multiple years (albeit for an edge case, but arguably moving RAM is
> itself an edge case), e.g. see commit edd4fa37baa6 ("KVM: x86: Allocate
> new rmap and large page tracking when moving memslot").
> 
> Signed-off-by: Sean Christopherson 
> ---
>  arch/x86/include/asm/kvm_page_track.h | 3 +++
>  arch/x86/kvm/mmu/page_track.c | 5 +
>  arch/x86/kvm/x86.c| 7 +++
>  3 files changed, 15 insertions(+)
> 
> diff --git a/arch/x86/include/asm/kvm_page_track.h 
> b/arch/x86/include/asm/kvm_page_track.h
> index 0d65ae203fd6..6a287bcbe8a9 100644
> --- a/arch/x86/include/asm/kvm_page_track.h
> +++ b/arch/x86/include/asm/kvm_page_track.h
> @@ -77,4 +77,7 @@ kvm_page_track_unregister_notifier(struct kvm *kvm,
>  void kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
> int bytes);
>  void kvm_page_track_flush_slot(struct kvm *kvm, struct kvm_memory_slot 
> *slot);
> +
> +bool kvm_page_track_has_external_user(struct kvm *kvm);
> +
>  #endif
> diff --git a/arch/x86/kvm/mmu/page_track.c b/arch/x86/kvm/mmu/page_track.c
> index 39a0863af8b4..1cfc0a0ccc23 100644
> --- a/arch/x86/kvm/mmu/page_track.c
> +++ b/arch/x86/kvm/mmu/page_track.c
> @@ -321,3 +321,8 @@ enum pg_level kvm_page_track_max_mapping_level(struct kvm 
> *kvm, gfn_t gfn,
>   return max_level;
>  }
>  EXPORT_SYMBOL_GPL(kvm_page_track_max_mapping_level);
> +
> +bool kvm_page_track_has_external_user(struct kvm *kvm)
> +{
> + return hlist_empty(&kvm->arch.track_notifier_head.track_notifier_list);
> +}
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 29dd6c97d145..47ac9291cd43 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12484,6 +12484,13 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>  struct kvm_memory_slot *new,
>  enum kvm_mr_change change)
>  {
> + /*
> +  * KVM doesn't support moving memslots when there are external page
> +  * trackers attached to the VM, i.e. if KVMGT is in use.
> +  */
> + if (change == KVM_MR_MOVE && kvm_page_track_has_external_user(kvm))
> + return -EINVAL;
> +
>   if (change == KVM_MR_CREATE || change == KVM_MR_MOVE) {
>   if ((new->base_gfn + new->npages - 1) > kvm_mmu_max_gfn())
>   return -EINVAL;
> -- 
> 2.40.0.rc1.284.g88254d51c5-goog
> 


Re: [Intel-gfx] [PATCH v2 15/27] drm/i915/gvt: Don't bother removing write-protection on to-be-deleted slot

2023-03-17 Thread Yan Zhao
Reviewed-by: Yan Zhao 

On Fri, Mar 10, 2023 at 04:22:46PM -0800, Sean Christopherson wrote:
> When handling a slot "flush", don't call back into KVM to drop write
> protection for gfns in the slot.  Now that KVM rejects attempts to move
> memory slots while KVMGT is attached, the only time a slot is "flushed"
> is when it's being removed, i.e. the memslot and all its write-tracking
> metadata is about to be deleted.
> 
> Signed-off-by: Sean Christopherson 
> ---
>  drivers/gpu/drm/i915/gvt/kvmgt.c | 8 +---
>  1 file changed, 1 insertion(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gvt/kvmgt.c 
> b/drivers/gpu/drm/i915/gvt/kvmgt.c
> index 292750dc819f..577712ea4893 100644
> --- a/drivers/gpu/drm/i915/gvt/kvmgt.c
> +++ b/drivers/gpu/drm/i915/gvt/kvmgt.c
> @@ -1644,14 +1644,8 @@ static void kvmgt_page_track_flush_slot(struct kvm 
> *kvm,
>  
>   for (i = 0; i < slot->npages; i++) {
>   gfn = slot->base_gfn + i;
> - if (kvmgt_gfn_is_write_protected(info, gfn)) {
> - write_lock(&kvm->mmu_lock);
> - kvm_slot_page_track_remove_page(kvm, slot, gfn,
> - KVM_PAGE_TRACK_WRITE);
> - write_unlock(&kvm->mmu_lock);
> -
> + if (kvmgt_gfn_is_write_protected(info, gfn))
>   kvmgt_protect_table_del(info, gfn);
> - }
>   }
>   mutex_unlock(&info->vgpu_lock);
>  }
> -- 
> 2.40.0.rc1.284.g88254d51c5-goog
> 


Re: [Intel-gfx] [PATCH v2 16/27] KVM: x86: Add a new page-track hook to handle memslot deletion

2023-03-17 Thread Yan Zhao
On Fri, Mar 10, 2023 at 04:22:47PM -0800, Sean Christopherson wrote:
> From: Yan Zhao 
> 
> Add a new page-track hook, track_remove_region(), that is called when a
> memslot DELETE operation is about to be committed.  The "remove" hook
> will be used by KVMGT and will effectively replace the existing
> track_flush_slot() altogether now that KVM itself doesn't rely on the
> "flush" hook either.
> 
> The "flush" hook is flawed as it's invoked before the memslot operation
> is guaranteed to succeed, i.e. KVM might ultimately keep the existing
> memslot without notifying external page track users, a.k.a. KVMGT.  In
> practice, this can't currently happen on x86, but there are no guarantees
> that won't change in the future, not to mention that "flush" does a very
> poor job of describing what is happening.
> 
> Pass in the gfn+nr_pages instead of the slot itself so external users,
> i.e. KVMGT, don't need to exposed to KVM internals (memslots).  This will
> help set the stage for additional cleanups to the page-track APIs.
> 
> Cc: Zhenyu Wang 
> Signed-off-by: Yan Zhao 
> Co-developed-by: Sean Christopherson 
> Signed-off-by: Sean Christopherson 
...

> +void kvm_page_track_delete_slot(struct kvm *kvm, struct kvm_memory_slot 
> *slot)
> +{
> + struct kvm_page_track_notifier_head *head;
> + struct kvm_page_track_notifier_node *n;
> + int idx;
> +
> + head = &kvm->arch.track_notifier_head;
> +
> + if (hlist_empty(&head->track_notifier_list))
> + return;
> +
> + idx = srcu_read_lock(&head->track_srcu);
> + hlist_for_each_entry_srcu(n, &head->track_notifier_list, node,
> + srcu_read_lock_held(&head->track_srcu))
Sorry, I'm not sure why the alignment here is not right.
Patchwork just sent me a mail complaining about it.
Would you mind fixing it in the next version?

Thanks a lot!

> + if (n->track_remove_region)
> + n->track_remove_region(slot->base_gfn, slot->npages, n);
> + srcu_read_unlock(&head->track_srcu, idx);
> +}
> +


Re: [Intel-gfx] [PATCH v2 17/27] drm/i915/gvt: switch from ->track_flush_slot() to ->track_remove_region()

2023-03-17 Thread Yan Zhao
Tested-by: Yan Zhao 

On Fri, Mar 10, 2023 at 04:22:48PM -0800, Sean Christopherson wrote:
> From: Yan Zhao 
> 
> Switch from the poorly named and flawed ->track_flush_slot() to the newly
> introduced ->track_remove_region().  From KVMGT's perspective, the two
> hooks are functionally equivalent, the only difference being that
> ->track_remove_region() is called only when KVM is 100% certain the
> memory region will be removed, i.e. is invoked slightly later in KVM's
> memslot modification flow.
> 
> Cc: Zhenyu Wang 
> Suggested-by: Sean Christopherson 
> Signed-off-by: Yan Zhao 
> [sean: handle name change, massage changelog, rebase]
> Signed-off-by: Sean Christopherson 
> ---
>  drivers/gpu/drm/i915/gvt/kvmgt.c | 21 +
>  1 file changed, 9 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gvt/kvmgt.c 
> b/drivers/gpu/drm/i915/gvt/kvmgt.c
> index 577712ea4893..9f188b6c3edf 100644
> --- a/drivers/gpu/drm/i915/gvt/kvmgt.c
> +++ b/drivers/gpu/drm/i915/gvt/kvmgt.c
> @@ -108,9 +108,8 @@ struct gvt_dma {
>  
>  static void kvmgt_page_track_write(gpa_t gpa, const u8 *val, int len,
>  struct kvm_page_track_notifier_node *node);
> -static void kvmgt_page_track_flush_slot(struct kvm *kvm,
> - struct kvm_memory_slot *slot,
> - struct kvm_page_track_notifier_node *node);
> +static void kvmgt_page_track_remove_region(gfn_t gfn, unsigned long nr_pages,
> +struct kvm_page_track_notifier_node 
> *node);
>  
>  static ssize_t intel_vgpu_show_description(struct mdev_type *mtype, char 
> *buf)
>  {
> @@ -680,7 +679,7 @@ static int intel_vgpu_open_device(struct vfio_device 
> *vfio_dev)
>   return -EEXIST;
>  
>   vgpu->track_node.track_write = kvmgt_page_track_write;
> - vgpu->track_node.track_flush_slot = kvmgt_page_track_flush_slot;
> + vgpu->track_node.track_remove_region = kvmgt_page_track_remove_region;
>   kvm_get_kvm(vgpu->vfio_device.kvm);
>   kvm_page_track_register_notifier(vgpu->vfio_device.kvm,
>&vgpu->track_node);
> @@ -1631,22 +1630,20 @@ static void kvmgt_page_track_write(gpa_t gpa, const 
> u8 *val, int len,
>   mutex_unlock(&info->vgpu_lock);
>  }
>  
> -static void kvmgt_page_track_flush_slot(struct kvm *kvm,
> - struct kvm_memory_slot *slot,
> - struct kvm_page_track_notifier_node *node)
> +static void kvmgt_page_track_remove_region(gfn_t gfn, unsigned long nr_pages,
> +struct kvm_page_track_notifier_node 
> *node)
>  {
>   unsigned long i;
> - gfn_t gfn;
>   struct intel_vgpu *info =
>   container_of(node, struct intel_vgpu, track_node);
>  
>   mutex_lock(&info->vgpu_lock);
>  
> - for (i = 0; i < slot->npages; i++) {
> - gfn = slot->base_gfn + i;
> - if (kvmgt_gfn_is_write_protected(info, gfn))
> - kvmgt_protect_table_del(info, gfn);
> + for (i = 0; i < nr_pages; i++) {
> + if (kvmgt_gfn_is_write_protected(info, gfn + i))
> + kvmgt_protect_table_del(info, gfn + i);
>   }
> +
>   mutex_unlock(&info->vgpu_lock);
>  }
>  
> -- 
> 2.40.0.rc1.284.g88254d51c5-goog
> 


Re: [Intel-gfx] [PATCH v2 23/27] KVM: x86/mmu: Assert that correct locks are held for page write-tracking

2023-03-17 Thread Yan Zhao
Tested-by: Yan Zhao 

On Fri, Mar 10, 2023 at 04:22:54PM -0800, Sean Christopherson wrote:
> When adding/removing gfns to/from write-tracking, assert that mmu_lock
> is held for write, and that either slots_lock or kvm->srcu is held.
> mmu_lock must be held for write to protect gfn_write_track's refcount,
> and SRCU or slots_lock must be held to protect the memslot itself.
> 
> Signed-off-by: Sean Christopherson 
> ---
>  arch/x86/kvm/mmu/page_track.c | 17 +++--
>  1 file changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/page_track.c b/arch/x86/kvm/mmu/page_track.c
> index 1993db4578e5..ffcd7ac66f9e 100644
> --- a/arch/x86/kvm/mmu/page_track.c
> +++ b/arch/x86/kvm/mmu/page_track.c
> @@ -12,6 +12,7 @@
>   */
>  #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>  
> +#include 
>  #include 
>  #include 
>  
> @@ -77,9 +78,6 @@ static void update_gfn_write_track(struct kvm_memory_slot 
> *slot, gfn_t gfn,
>   * add guest page to the tracking pool so that corresponding access on that
>   * page will be intercepted.
>   *
> - * It should be called under the protection both of mmu-lock and kvm->srcu
> - * or kvm->slots_lock.
> - *
>   * @kvm: the guest instance we are interested in.
>   * @slot: the @gfn belongs to.
>   * @gfn: the guest page.
> @@ -87,6 +85,11 @@ static void update_gfn_write_track(struct kvm_memory_slot 
> *slot, gfn_t gfn,
>  void kvm_write_track_add_gfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>gfn_t gfn)
>  {
> + lockdep_assert_held_write(&kvm->mmu_lock);
> +
> + lockdep_assert_once(lockdep_is_held(&kvm->slots_lock) ||
> + srcu_read_lock_held(&kvm->srcu));
> +
>   if (WARN_ON(!kvm_page_track_write_tracking_enabled(kvm)))
>   return;
>  
> @@ -107,9 +110,6 @@ EXPORT_SYMBOL_GPL(kvm_write_track_add_gfn);
>   * remove the guest page from the tracking pool which stops the interception
>   * of corresponding access on that page.
>   *
> - * It should be called under the protection both of mmu-lock and kvm->srcu
> - * or kvm->slots_lock.
> - *
>   * @kvm: the guest instance we are interested in.
>   * @slot: the @gfn belongs to.
>   * @gfn: the guest page.
> @@ -117,6 +117,11 @@ EXPORT_SYMBOL_GPL(kvm_write_track_add_gfn);
>  void kvm_write_track_remove_gfn(struct kvm *kvm,
>   struct kvm_memory_slot *slot, gfn_t gfn)
>  {
> + lockdep_assert_held_write(&kvm->mmu_lock);
> +
> + lockdep_assert_once(lockdep_is_held(&kvm->slots_lock) ||
> + srcu_read_lock_held(&kvm->srcu));
> +
>   if (WARN_ON(!kvm_page_track_write_tracking_enabled(kvm)))
>   return;
>  
> -- 
> 2.40.0.rc1.284.g88254d51c5-goog
> 


Re: [Intel-gfx] [PATCH v2 25/27] KVM: x86/mmu: Drop @slot param from exported/external page-track APIs

2023-03-17 Thread Yan Zhao
On Fri, Mar 10, 2023 at 04:22:56PM -0800, Sean Christopherson wrote:
...
> +int kvm_write_track_add_gfn(struct kvm *kvm, gfn_t gfn)
> +{
> + struct kvm_memory_slot *slot;
> + int idx;
> +
> + idx = srcu_read_lock(&kvm->srcu);
> +
> + slot = gfn_to_memslot(kvm, gfn);
> + if (!slot) {
> + srcu_read_unlock(&kvm->srcu, idx);
> + return -EINVAL;
> + }
> +
Should this also fail if slot->flags & KVM_MEMSLOT_INVALID is set?
There is a window in which external users can see an invalid slot
while that slot is about to get deleted/moved.
(It happens before MOVE is rejected in kvm_arch_prepare_memory_region().)

> + write_lock(&kvm->mmu_lock);
> + __kvm_write_track_add_gfn(kvm, slot, gfn);
> + write_unlock(&kvm->mmu_lock);
> +
> + srcu_read_unlock(&kvm->srcu, idx);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(kvm_write_track_add_gfn);
> +
> +/*
> + * remove the guest page from the tracking pool which stops the interception
> + * of corresponding access on that page.
> + *
> + * @kvm: the guest instance we are interested in.
> + * @gfn: the guest page.
> + */
> +int kvm_write_track_remove_gfn(struct kvm *kvm, gfn_t gfn)
> +{
> + struct kvm_memory_slot *slot;
> + int idx;
> +
> + idx = srcu_read_lock(&kvm->srcu);
> +
> + slot = gfn_to_memslot(kvm, gfn);
> + if (!slot) {
> + srcu_read_unlock(&kvm->srcu, idx);
> + return -EINVAL;
> + }
> +
Ditto.

> + write_lock(&kvm->mmu_lock);
> + __kvm_write_track_remove_gfn(kvm, slot, gfn);
> + write_unlock(&kvm->mmu_lock);
> +
> + srcu_read_unlock(&kvm->srcu, idx);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(kvm_write_track_remove_gfn);



Re: [Intel-gfx] [PATCH v2 26/27] KVM: x86/mmu: Handle KVM bookkeeping in page-track APIs, not callers

2023-03-17 Thread Yan Zhao
Reviewed-by: Yan Zhao 

On Fri, Mar 10, 2023 at 04:22:57PM -0800, Sean Christopherson wrote:
> Get/put references to KVM when a page-track notifier is (un)registered
> instead of relying on the caller to do so.  Forcing the caller to do the
> bookkeeping is unnecessary and adds one more thing for users to get
> wrong, e.g. see commit 9ed1fdee9ee3 ("drm/i915/gvt: Get reference to KVM
> iff attachment to VM is successful").
Just realized that "iff" stands for "if and only if" :) 



Re: [Intel-gfx] [PATCH v2 27/27] drm/i915/gvt: Drop final dependencies on KVM internal details

2023-03-17 Thread Yan Zhao
Reviewed-by: Yan Zhao 

On Fri, Mar 10, 2023 at 04:22:58PM -0800, Sean Christopherson wrote:
> Open code gpa_to_gfn() in kvmgt_page_track_write() and drop KVMGT's
> dependency on kvm_host.h, i.e. include only kvm_page_track.h.  KVMGT
> assumes "gfn == gpa >> PAGE_SHIFT" all over the place, including a few
> lines below in the same function with the same gpa, i.e. there's no
> reason to use KVM's helper for this one case.
> 
> No functional change intended.
> 
> Signed-off-by: Sean Christopherson 
> ---
>  drivers/gpu/drm/i915/gvt/gvt.h   | 3 ++-
>  drivers/gpu/drm/i915/gvt/kvmgt.c | 2 +-
>  2 files changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gvt/gvt.h b/drivers/gpu/drm/i915/gvt/gvt.h
> index 2d65800d8e93..53a0a42a50db 100644
> --- a/drivers/gpu/drm/i915/gvt/gvt.h
> +++ b/drivers/gpu/drm/i915/gvt/gvt.h
> @@ -34,10 +34,11 @@
>  #define _GVT_H_
>  
>  #include 
> -#include 
>  #include 
>  #include 
>  
> +#include 
> +
>  #include "i915_drv.h"
>  #include "intel_gvt.h"
>  
> diff --git a/drivers/gpu/drm/i915/gvt/kvmgt.c 
> b/drivers/gpu/drm/i915/gvt/kvmgt.c
> index d16aced134b4..798d04481f03 100644
> --- a/drivers/gpu/drm/i915/gvt/kvmgt.c
> +++ b/drivers/gpu/drm/i915/gvt/kvmgt.c
> @@ -1599,7 +1599,7 @@ static void kvmgt_page_track_write(gpa_t gpa, const u8 
> *val, int len,
>  
>   mutex_lock(&info->vgpu_lock);
>  
> - if (kvmgt_gfn_is_write_protected(info, gpa_to_gfn(gpa)))
> + if (kvmgt_gfn_is_write_protected(info, gpa >> PAGE_SHIFT))
>   intel_vgpu_page_track_handler(info, gpa,
>(void *)val, len);
>  
> -- 
> 2.40.0.rc1.284.g88254d51c5-goog
> 


Re: [Intel-gfx] [PATCH v2 25/27] KVM: x86/mmu: Drop @slot param from exported/external page-track APIs

2023-03-23 Thread Yan Zhao
On Fri, Mar 17, 2023 at 04:28:56PM +0800, Yan Zhao wrote:
> On Fri, Mar 10, 2023 at 04:22:56PM -0800, Sean Christopherson wrote:
> ...
> > +int kvm_write_track_add_gfn(struct kvm *kvm, gfn_t gfn)
> > +{
> > +   struct kvm_memory_slot *slot;
> > +   int idx;
> > +
> > +   idx = srcu_read_lock(&kvm->srcu);
> > +
> > +   slot = gfn_to_memslot(kvm, gfn);
> > +   if (!slot) {
> > +   srcu_read_unlock(&kvm->srcu, idx);
> > +   return -EINVAL;
> > +   }
> > +
> Also fail if slot->flags & KVM_MEMSLOT_INVALID is true?
> There should exist a window for external users to see an invalid slot
> when a slot is about to get deleted/moved.
> (It happens before MOVE is rejected in kvm_arch_prepare_memory_region()).

Or use:
if (!kvm_is_visible_memslot(slot)) {
srcu_read_unlock(&kvm->srcu, idx);
return -EINVAL;
}

> 
> > +   write_lock(&kvm->mmu_lock);
> > +   __kvm_write_track_add_gfn(kvm, slot, gfn);
> > +   write_unlock(&kvm->mmu_lock);
> > +
> > +   srcu_read_unlock(&kvm->srcu, idx);
> > +
> > +   return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(kvm_write_track_add_gfn);
> > +
> > +/*
> > + * remove the guest page from the tracking pool which stops the 
> > interception
> > + * of corresponding access on that page.
> > + *
> > + * @kvm: the guest instance we are interested in.
> > + * @gfn: the guest page.
> > + */
> > +int kvm_write_track_remove_gfn(struct kvm *kvm, gfn_t gfn)
> > +{
> > +   struct kvm_memory_slot *slot;
> > +   int idx;
> > +
> > +   idx = srcu_read_lock(&kvm->srcu);
> > +
> > +   slot = gfn_to_memslot(kvm, gfn);
> > +   if (!slot) {
> > +   srcu_read_unlock(&kvm->srcu, idx);
> > +   return -EINVAL;
> > +   }
> > +
> Ditto.
> 
> > +   write_lock(&kvm->mmu_lock);
> > +   __kvm_write_track_remove_gfn(kvm, slot, gfn);
> > +   write_unlock(&kvm->mmu_lock);
> > +
> > +   srcu_read_unlock(&kvm->srcu, idx);
> > +
> > +   return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(kvm_write_track_remove_gfn);
> 


Re: [Intel-gfx] [PATCH v2 13/14] vfio: Add ioctls for device cdev using iommufd

2023-02-06 Thread Yan Zhao
On Mon, Feb 06, 2023 at 01:05:31AM -0800, Yi Liu wrote:
...

> +void vfio_device_cdev_close(struct vfio_device_file *df)
> +{
> + mutex_lock(&df->device->dev_set->lock);
> + vfio_device_close(df);
> + vfio_device_put_kvm(df->device);
> + mutex_unlock(&df->device->dev_set->lock);
> +}
> +

...

> +long vfio_device_ioctl_bind_iommufd(struct vfio_device_file *df,
> + unsigned long arg)
> +{
> + struct vfio_device *device = df->device;
> + struct vfio_device_bind_iommufd bind;
> + struct iommufd_ctx *iommufd = NULL;
> + unsigned long minsz;
> + struct fd f;
> + int ret;
> +
> + minsz = offsetofend(struct vfio_device_bind_iommufd, iommufd);
> +
> + if (copy_from_user(&bind, (void __user *)arg, minsz))
> + return -EFAULT;
> +
> + if (bind.argsz < minsz || bind.flags)
> + return -EINVAL;
> +
> + if (!device->ops->bind_iommufd)
> + return -ENODEV;
> +
> + mutex_lock(&device->dev_set->lock);
> + /*
> +  * If already been bound to an iommufd, or already set noiommu
> +  * then fail it.
> +  */
> + if (df->iommufd || df->noiommu) {
> + ret = -EINVAL;
> + goto out_unlock;
> + }
> +
> + /* iommufd < 0 means noiommu mode */
> + if (bind.iommufd < 0) {
> + if (!capable(CAP_SYS_RAWIO)) {
> + ret = -EPERM;
> + goto out_unlock;
> + }
> + df->noiommu = true;
> + } else {
> + f = fdget(bind.iommufd);
> + if (!f.file) {
> + ret = -EBADF;
> + goto out_unlock;
> + }
> + iommufd = iommufd_ctx_from_file(f.file);
> + if (IS_ERR(iommufd)) {
> + ret = PTR_ERR(iommufd);
> + goto out_put_file;
> + }
> + }
> +
> + /*
> +  * Before the first device open, get the KVM pointer currently
> +  * associated with the device file (if there is) and obtain a
> +  * reference. This reference is held until device closed. Save
> +  * the pointer in the device for use by drivers.
> +  */
> + vfio_device_get_kvm_safe(df);
> +
> + df->iommufd = iommufd;
> + ret = vfio_device_open(df, &bind.out_devid, NULL);
> + if (ret)
> + goto out_put_kvm;
> +
> + ret = copy_to_user((void __user *)arg + minsz,
> +&bind.out_devid,
> +sizeof(bind.out_devid)) ? -EFAULT : 0;
> + if (ret)
> + goto out_close_device;
> +
> + if (iommufd)
> + fdput(f);
> + else if (df->noiommu)
> + dev_warn(device->dev, "vfio-noiommu device used by user "
> +  "(%s:%d)\n", current->comm, task_pid_nr(current));
> + mutex_unlock(&device->dev_set->lock);
> + return 0;
> +
> +out_close_device:
> + vfio_device_close(df);
vfio_device_close() is called here if any error occurs after vfio_device_open().
But it will also be called unconditionally in vfio_device_cdev_close(), which
leaves device->open_count with a wrong value.

df->access_granted in patch 07 can also be left incorrectly set to true after
this vfio_device_close().


> +out_put_kvm:
> + vfio_device_put_kvm(device);
> +out_put_file:
> + if (iommufd)
> + fdput(f);
> +out_unlock:
> + df->iommufd = NULL;
> + df->noiommu = false;
> + mutex_unlock(&device->dev_set->lock);
> + return ret;
> +}
> +


Re: [Intel-gfx] [PATCH v3 14/15] vfio: Add ioctls for device cdev using iommufd

2023-02-14 Thread Yan Zhao
On Mon, Feb 13, 2023 at 07:13:47AM -0800, Yi Liu wrote:
...
> +long vfio_device_ioctl_bind_iommufd(struct vfio_device_file *df,
> + unsigned long arg)
> +{
> + struct vfio_device *device = df->device;
> + struct vfio_device_bind_iommufd bind;
> + struct iommufd_ctx *iommufd = NULL;
> + struct fd f;
> + unsigned long minsz;
> + int ret;
> +
> + minsz = offsetofend(struct vfio_device_bind_iommufd, out_devid);
> +
> + if (copy_from_user(&bind, (void __user *)arg, minsz))
> + return -EFAULT;
> +
> + if (bind.argsz < minsz || bind.flags)
> + return -EINVAL;
> +
> + if (!device->ops->bind_iommufd)
> + return -ENODEV;
> +
> + ret = vfio_device_claim_group(device);
> + if (ret)
> + return ret;
> +
> + mutex_lock(&device->dev_set->lock);
> + /*
> +  * If already been bound to an iommufd, or already set noiommu
> +  * then fail it.
> +  */
> + if (df->iommufd || df->noiommu) {
> + ret = -EINVAL;
> + goto out_unlock;
> + }
> +
> + /* iommufd < 0 means noiommu mode */
> + if (bind.iommufd < 0) {
> + if (!capable(CAP_SYS_RAWIO)) {
> + ret = -EPERM;
> + goto out_unlock;
> + }
> + df->noiommu = true;
> + } else {
> + f = fdget(bind.iommufd);
> + if (!f.file) {
> + ret = -EBADF;
> + goto out_unlock;
> + }
> + iommufd = iommufd_ctx_from_file(f.file);
> + if (IS_ERR(iommufd)) {
> + ret = PTR_ERR(iommufd);
> + goto out_put_file;
> + }
> + }
> +
> + /*
> +  * Before the device open, get the KVM pointer currently
> +  * associated with the device file (if there is) and obtain a
> +  * reference. This reference is held until device closed. Save
> +  * the pointer in the device for use by drivers.
> +  */
> + vfio_device_get_kvm_safe(df);
> +
> + df->iommufd = iommufd;
> + ret = vfio_device_open(df, &bind.out_devid, NULL);
> + if (ret)
> + goto out_put_kvm;
> +
> + ret = copy_to_user((void __user *)arg +
> +offsetofend(struct vfio_device_bind_iommufd, 
> iommufd),
> +&bind.out_devid,
> +sizeof(bind.out_devid)) ? -EFAULT : 0;
> + if (ret)
> + goto out_close_device;
> +
> + if (iommufd)
> + fdput(f);
> + else if (df->noiommu)
> + dev_warn(device->dev, "vfio-noiommu device used by user "
> +  "(%s:%d)\n", current->comm, task_pid_nr(current));

IMHO, the "smp_store_release(&df->access_granted, true);" in patch 7
should be moved here, so it only runs when bind has indeed succeeded.


> + mutex_unlock(&device->dev_set->lock);
> + return 0;
> +
> +out_close_device:
> + vfio_device_close(df);
> +out_put_kvm:
> + df->iommufd = NULL;
> + df->noiommu = false;
> + vfio_device_put_kvm(device);
> +out_put_file:
> + if (iommufd)
> + fdput(f);
> +out_unlock:
> + mutex_unlock(&device->dev_set->lock);
> + vfio_device_release_group(device);
> + return ret;
> +}
...


Re: [Intel-gfx] [PATCH v3 14/15] vfio: Add ioctls for device cdev using iommufd

2023-02-16 Thread Yan Zhao
On Mon, Feb 13, 2023 at 07:13:47AM -0800, Yi Liu wrote:
...

> +long vfio_device_ioctl_bind_iommufd(struct vfio_device_file *df,
> + unsigned long arg)
> +{
> + struct vfio_device *device = df->device;
> + struct vfio_device_bind_iommufd bind;
> + struct iommufd_ctx *iommufd = NULL;
> + struct fd f;
> + unsigned long minsz;
> + int ret;
> +
> + minsz = offsetofend(struct vfio_device_bind_iommufd, out_devid);
> +
> + if (copy_from_user(&bind, (void __user *)arg, minsz))
> + return -EFAULT;
> +
> + if (bind.argsz < minsz || bind.flags)
> + return -EINVAL;
> +
> + if (!device->ops->bind_iommufd)
> + return -ENODEV;
> +
> + ret = vfio_device_claim_group(device);
> + if (ret)
> + return ret;
> +
> + mutex_lock(&device->dev_set->lock);
> + /*
> +  * If already been bound to an iommufd, or already set noiommu
> +  * then fail it.
> +  */
> + if (df->iommufd || df->noiommu) {
> + ret = -EINVAL;
> + goto out_unlock;
> + }
> +
> + /* iommufd < 0 means noiommu mode */
> + if (bind.iommufd < 0) {
> + if (!capable(CAP_SYS_RAWIO)) {
> + ret = -EPERM;
> + goto out_unlock;
> + }
> + df->noiommu = true;
> + } else {
> + f = fdget(bind.iommufd);
Here, the iommufd file reference count is incremented (+1),

> + if (!f.file) {
> + ret = -EBADF;
> + goto out_unlock;
> + }
> + iommufd = iommufd_ctx_from_file(f.file);
The iommufd file reference count is incremented again (+1).

> + if (IS_ERR(iommufd)) {
> + ret = PTR_ERR(iommufd);
> + goto out_put_file;
> + }
> + }
> +
> + /*
> +  * Before the device open, get the KVM pointer currently
> +  * associated with the device file (if there is) and obtain a
> +  * reference. This reference is held until device closed. Save
> +  * the pointer in the device for use by drivers.
> +  */
> + vfio_device_get_kvm_safe(df);
> +
> + df->iommufd = iommufd;
> + ret = vfio_device_open(df, &bind.out_devid, NULL);
The iommufd file reference count is incremented by 1 in iommufd_device_bind() on first open.

> + if (ret)
> + goto out_put_kvm;
> +
> + ret = copy_to_user((void __user *)arg +
> +offsetofend(struct vfio_device_bind_iommufd, 
> iommufd),
> +&bind.out_devid,
> +sizeof(bind.out_devid)) ? -EFAULT : 0;
> + if (ret)
> + goto out_close_device;
> +
> + if (iommufd)
> + fdput(f);
But only one file reference count is put.

A pairing iommufd_ctx_put() is needed after a successful iommufd_ctx_from_file()
above to avoid iommufd_fops_release() never being called.

e.g.

@@ -1222,11 +1226,13 @@ static long vfio_device_ioctl_bind_iommufd(struct 
vfio_device_file *df,
ret = -EBADF;
goto out_unlock;
}
iommufd = iommufd_ctx_from_file(f.file);
if (IS_ERR(iommufd)) {
ret = PTR_ERR(iommufd);
goto out_put_file;
}
+   iommufd_ctx_put(iommufd);
}

/* df->kvm is supposed to be set in vfio_device_file_set_kvm() */

> + else if (df->noiommu)
> + dev_warn(device->dev, "vfio-noiommu device used by user "
> +  "(%s:%d)\n", current->comm, task_pid_nr(current));
> + mutex_unlock(&device->dev_set->lock);
> + return 0;
> +
> +out_close_device:
> + vfio_device_close(df);
> +out_put_kvm:
> + df->iommufd = NULL;
> + df->noiommu = false;
> + vfio_device_put_kvm(device);
> +out_put_file:
> + if (iommufd)
> + fdput(f);
> +out_unlock:
> + mutex_unlock(&device->dev_set->lock);
> + vfio_device_release_group(device);
> + return ret;
> +}
> +


Re: [Intel-gfx] [PATCH v3 14/15] vfio: Add ioctls for device cdev using iommufd

2023-02-16 Thread Yan Zhao
On Thu, Feb 16, 2023 at 05:10:06PM +0800, Liu, Yi L wrote:
> > From: Zhao, Yan Y 
> > Sent: Thursday, February 16, 2023 4:24 PM
> > 
> > On Mon, Feb 13, 2023 at 07:13:47AM -0800, Yi Liu wrote:
> > ...
> > 
> > > +long vfio_device_ioctl_bind_iommufd(struct vfio_device_file *df,
> > > + unsigned long arg)
> > > +{
> > > + struct vfio_device *device = df->device;
> > > + struct vfio_device_bind_iommufd bind;
> > > + struct iommufd_ctx *iommufd = NULL;
> > > + struct fd f;
> > > + unsigned long minsz;
> > > + int ret;
> > > +
> > > + minsz = offsetofend(struct vfio_device_bind_iommufd, out_devid);
> > > +
> > > + if (copy_from_user(&bind, (void __user *)arg, minsz))
> > > + return -EFAULT;
> > > +
> > > + if (bind.argsz < minsz || bind.flags)
> > > + return -EINVAL;
> > > +
> > > + if (!device->ops->bind_iommufd)
> > > + return -ENODEV;
> > > +
> > > + ret = vfio_device_claim_group(device);
> > > + if (ret)
> > > + return ret;
> > > +
> > > + mutex_lock(&device->dev_set->lock);
> > > + /*
> > > +  * If already been bound to an iommufd, or already set noiommu
> > > +  * then fail it.
> > > +  */
> > > + if (df->iommufd || df->noiommu) {
> > > + ret = -EINVAL;
> > > + goto out_unlock;
> > > + }
> > > +
> > > + /* iommufd < 0 means noiommu mode */
> > > + if (bind.iommufd < 0) {
> > > + if (!capable(CAP_SYS_RAWIO)) {
> > > + ret = -EPERM;
> > > + goto out_unlock;
> > > + }
> > > + df->noiommu = true;
> > > + } else {
> > > + f = fdget(bind.iommufd);
> > Here, the iommufd file count + 1,
> > 
> > > + if (!f.file) {
> > > + ret = -EBADF;
> > > + goto out_unlock;
> > > + }
> > > + iommufd = iommufd_ctx_from_file(f.file);
> > iommufd file count + 1, again
> > 
> > > + if (IS_ERR(iommufd)) {
> > > + ret = PTR_ERR(iommufd);
> > > + goto out_put_file;
> > > + }
> > > + }
> > > +
> > > + /*
> > > +  * Before the device open, get the KVM pointer currently
> > > +  * associated with the device file (if there is) and obtain a
> > > +  * reference. This reference is held until device closed. Save
> > > +  * the pointer in the device for use by drivers.
> > > +  */
> > > + vfio_device_get_kvm_safe(df);
> > > +
> > > + df->iommufd = iommufd;
> > > + ret = vfio_device_open(df, &bind.out_devid, NULL);
> > iommufd file count + 1 in iommufd_device_bind for first open.
> > 
> > > + if (ret)
> > > + goto out_put_kvm;
> > > +
> > > + ret = copy_to_user((void __user *)arg +
> > > +offsetofend(struct vfio_device_bind_iommufd,
> > iommufd),
> > > +&bind.out_devid,
> > > +sizeof(bind.out_devid)) ? -EFAULT : 0;
> > > + if (ret)
> > > + goto out_close_device;
> > > +
> > > + if (iommufd)
> > > + fdput(f);
> > But, only one file count is put.
> 
> Good catch! Yes it is missed. And needs to call iommufd_ctx_put()
> in vfio_device_cdev_close() as well.
If I read correctly, iommufd_device_bind() in the first open will
get the reference through iommufd_ctx_get(ictx) and 
iommufd_device_destroy() in the last close will do the iommufd_ctx_put().

As vfio_device_ioctl_bind_iommufd() isn't paired with
vfio_device_cdev_close(), I think the fix below is simpler :)

> > Need a paring iommufd_ctx_put() after a successful
> > iommufd_ctx_from_file()
> > above to avoid iommufd_fops_release() never being called.
> > 
> > e.g.
> > 
> > @@ -1222,11 +1226,13 @@ static long
> > vfio_device_ioctl_bind_iommufd(struct vfio_device_file *df,
> > ret = -EBADF;
> > goto out_unlock;
> > }
> > iommufd = iommufd_ctx_from_file(f.file);
> > if (IS_ERR(iommufd)) {
> > ret = PTR_ERR(iommufd);
> > goto out_put_file;
> > }
> > +   iommufd_ctx_put(iommufd);
> 
> Since iommufd is recorded in df, so needs to hold refcount till
> df->iommufd=NULL;
> 
> 
> > }
> > 
> > /* df->kvm is supposed to be set in vfio_device_file_set_kvm() */
> > 
> > > + else if (df->noiommu)
> > > + dev_warn(device->dev, "vfio-noiommu device used by user
> > "
> > > +  "(%s:%d)\n", current->comm,
> > task_pid_nr(current));
> > > + mutex_unlock(&device->dev_set->lock);
> > > + return 0;
> > > +
> > > +out_close_device:
> > > + vfio_device_close(df);
> > > +out_put_kvm:
> > > + df->iommufd = NULL;
> > > + df->noiommu = false;
> > > + vfio_device_put_kvm(device);
> > > +out_put_file:
> > > + if (iommufd)
> > > + fdput(f);
> > > +out_unlock:
> > > + mutex_unlock(&device->dev_set->lock);
> > > + vfio_device_release_group(device);
> > > + return ret;
> > > +}
> > > +


Re: [Intel-gfx] [PATCH v4 16/19] vfio: Add VFIO_DEVICE_BIND_IOMMUFD

2023-02-22 Thread Yan Zhao
On Mon, Feb 20, 2023 at 07:48:09PM -0800, Yi Liu wrote:
> This adds ioctl for userspace to bind device cdev fd to iommufd.
> 
> VFIO_DEVICE_BIND_IOMMUFD: bind device to an iommufd, hence gain DMA
> control provided by the iommufd. open_device
> op is called after bind_iommufd op.
> VFIO no iommu mode is indicated by passing
> a negative iommufd value.
> 
> Signed-off-by: Yi Liu 
> ---
>  drivers/vfio/device_cdev.c | 137 +
>  drivers/vfio/vfio.h|  17 -
>  drivers/vfio/vfio_main.c   |  30 ++--
>  include/linux/iommufd.h|   6 ++
>  include/uapi/linux/vfio.h  |  34 +
>  5 files changed, 219 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/vfio/device_cdev.c b/drivers/vfio/device_cdev.c
> index 9e2c1ecaaf4f..be62f0a5687e 100644
> --- a/drivers/vfio/device_cdev.c
> +++ b/drivers/vfio/device_cdev.c
> @@ -3,6 +3,7 @@
>   * Copyright (c) 2023 Intel Corporation.
>   */
>  #include 
> +#include 
>  
>  #include "vfio.h"
>  
> @@ -45,6 +46,142 @@ int vfio_device_fops_cdev_open(struct inode *inode, 
> struct file *filep)
>   return ret;
>  }
>  
> +static void vfio_device_get_kvm_safe(struct vfio_device_file *df)
> +{
> + spin_lock(&df->kvm_ref_lock);
> + if (!df->kvm)
> + goto unlock;
> +
> + _vfio_device_get_kvm_safe(df->device, df->kvm);
> +
> +unlock:
> + spin_unlock(&df->kvm_ref_lock);
> +}
> +
> +void vfio_device_cdev_close(struct vfio_device_file *df)
> +{
> + struct vfio_device *device = df->device;
> +
> + mutex_lock(&device->dev_set->lock);
> + if (!smp_load_acquire(&df->access_granted)) {
> + mutex_unlock(&device->dev_set->lock);
> + return;
> + }
> + vfio_device_close(df);
> + vfio_device_put_kvm(device);
> + if (df->iommufd)
> + iommufd_ctx_put(df->iommufd);
> + mutex_unlock(&device->dev_set->lock);
> + vfio_device_unblock_group(device);
> +}
> +
> +long vfio_device_ioctl_bind_iommufd(struct vfio_device_file *df,
> + unsigned long arg)
> +{
> + struct vfio_device *device = df->device;
> + struct vfio_device_bind_iommufd bind;
> + struct iommufd_ctx *iommufd = NULL;
> + struct fd f;
> + unsigned long minsz;
> + int ret;
> +
> + minsz = offsetofend(struct vfio_device_bind_iommufd, out_devid);
> +
> + if (copy_from_user(&bind, (void __user *)arg, minsz))
> + return -EFAULT;
> +
> + if (bind.argsz < minsz || bind.flags)
> + return -EINVAL;
> +
> + if (!device->ops->bind_iommufd)
> + return -ENODEV;
> +
> + ret = vfio_device_block_group(device);
> + if (ret)
> + return ret;
> +
> + mutex_lock(&device->dev_set->lock);
> + /*
> +  * If already been bound to an iommufd, or already set noiommu
> +  * then fail it.
> +  */
> + if (df->iommufd || df->noiommu) {
> + ret = -EINVAL;
> + goto out_unlock;
> + }
> +
> + /* iommufd < 0 means noiommu mode */
> + if (bind.iommufd < 0) {
> + if (!capable(CAP_SYS_RAWIO)) {
> + ret = -EPERM;
> + goto out_unlock;
> + }
> + df->noiommu = true;
> + } else {
> + f = fdget(bind.iommufd);
> + if (!f.file) {
> + ret = -EBADF;
> + goto out_unlock;
> + }
> + iommufd = iommufd_ctx_from_file(f.file);
> + if (IS_ERR(iommufd)) {
> + ret = PTR_ERR(iommufd);
> + goto out_put_file;
> + }
> + }
> +
> + /*
> +  * Before the device open, get the KVM pointer currently
> +  * associated with the device file (if there is) and obtain a
> +  * reference. This reference is held until device closed. Save
> +  * the pointer in the device for use by drivers.
> +  */
> + vfio_device_get_kvm_safe(df);
> +
> + df->iommufd = iommufd;
> + ret = vfio_device_open(df, &bind.out_devid, NULL);
> + if (ret)
> + goto out_put_kvm;
> +
> + ret = copy_to_user((void __user *)arg +
> +offsetofend(struct vfio_device_bind_iommufd, 
> iommufd),
> +&bind.out_devid,
> +sizeof(bind.out_devid)) ? -EFAULT : 0;
> + if (ret)
> + goto out_close_device;
> +
> + if (iommufd)
> + fdput(f);
> + else if (df->noiommu)
> + dev_warn(device->dev, "vfio-noiommu device used by user "
> +  "(%s:%d)\n", current->comm, task_pid_nr(current));
> +
> + /*
> +  * Paired with smp_load_acquire() in vfio_device_fops::ioctl/
> +  * read/write/mmap
> +  */
> + smp_store_release(&df->access_granted, true);
> + mutex_unlock(&device->dev_set-

Re: [Intel-gfx] [PATCH v4 07/19] vfio: Block device access via device fd until device is opened

2023-02-22 Thread Yan Zhao
On Mon, Feb 20, 2023 at 07:48:00PM -0800, Yi Liu wrote:
> Allow the vfio_device file to be in a state where the device FD is
> opened but the device cannot be used by userspace (i.e. its .open_device()
> hasn't been called). This inbetween state is not used when the device
> FD is spawned from the group FD, however when we create the device FD
> directly by opening a cdev it will be opened in the blocked state.
> 
> The reason for the inbetween state is that userspace only gets a FD but
> doesn't gain access permission until binding the FD to an iommufd. So in
> the blocked state, only the bind operation is allowed. Completing bind
> will allow user to further access the device.
> 
> This is implemented by adding a flag in struct vfio_device_file to mark
> the blocked state and using a simple smp_load_acquire() to obtain the
> flag value and serialize all the device setup with the thread accessing
> this device.
> 
> Following this lockless scheme, it can safely handle the device FD
> unbound->bound but it cannot handle bound->unbound. To allow this we'd
> need to add a lock on all the vfio ioctls which seems costly. So once
> device FD is bound, it remains bound until the FD is closed.
> 
> Suggested-by: Jason Gunthorpe 
> Signed-off-by: Yi Liu 
> Reviewed-by: Kevin Tian 
> ---
>  drivers/vfio/group.c |  6 ++
>  drivers/vfio/vfio.h  |  1 +
>  drivers/vfio/vfio_main.c | 16 
>  3 files changed, 23 insertions(+)
> 
> diff --git a/drivers/vfio/group.c b/drivers/vfio/group.c
> index 2abf55c69281..14e29525e354 100644
> --- a/drivers/vfio/group.c
> +++ b/drivers/vfio/group.c
> @@ -197,6 +197,12 @@ static int vfio_device_group_open(struct 
> vfio_device_file *df)
>   if (device->open_count == 0)
>   vfio_device_put_kvm(device);
>  
> + /*
> +  * Paired with smp_load_acquire() in vfio_device_fops::ioctl/
> +  * read/write/mmap
> +  */
> + smp_store_release(&df->access_granted, true);
> +
>   mutex_unlock(&device->dev_set->lock);
>  
>  out_unlock:
> diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
> index 11e56fe079a1..d56cdb114024 100644
> --- a/drivers/vfio/vfio.h
> +++ b/drivers/vfio/vfio.h
> @@ -18,6 +18,7 @@ struct vfio_container;
>  
>  struct vfio_device_file {
>   struct vfio_device *device;
> + bool access_granted;
>   spinlock_t kvm_ref_lock; /* protect kvm field */
>   struct kvm *kvm;
>   struct iommufd_ctx *iommufd; /* protected by struct 
> vfio_device_set::lock */
> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> index ea507a61e3b7..91c8f25393db 100644
> --- a/drivers/vfio/vfio_main.c
> +++ b/drivers/vfio/vfio_main.c
> @@ -1106,6 +1106,10 @@ static long vfio_device_fops_unl_ioctl(struct file 
> *filep,
>   struct vfio_device *device = df->device;
>   int ret;
>  
> + /* Paired with smp_store_release() in vfio_device_open() */

Nit pick: not vfio_device_open() now :)

> + if (!smp_load_acquire(&df->access_granted))
> + return -EINVAL;
> +
>   ret = vfio_device_pm_runtime_get(device);
>   if (ret)
>   return ret;
> @@ -1133,6 +1137,10 @@ static ssize_t vfio_device_fops_read(struct file 
> *filep, char __user *buf,
>   struct vfio_device_file *df = filep->private_data;
>   struct vfio_device *device = df->device;
>  
> + /* Paired with smp_store_release() in vfio_device_open() */
> + if (!smp_load_acquire(&df->access_granted))
> + return -EINVAL;
> +
>   if (unlikely(!device->ops->read))
>   return -EINVAL;
>  
> @@ -1146,6 +1154,10 @@ static ssize_t vfio_device_fops_write(struct file 
> *filep,
>   struct vfio_device_file *df = filep->private_data;
>   struct vfio_device *device = df->device;
>  
> + /* Paired with smp_store_release() in vfio_device_open() */
> + if (!smp_load_acquire(&df->access_granted))
> + return -EINVAL;
> +
>   if (unlikely(!device->ops->write))
>   return -EINVAL;
>  
> @@ -1157,6 +1169,10 @@ static int vfio_device_fops_mmap(struct file *filep, 
> struct vm_area_struct *vma)
>   struct vfio_device_file *df = filep->private_data;
>   struct vfio_device *device = df->device;
>  
> + /* Paired with smp_store_release() in vfio_device_open() */
> + if (!smp_load_acquire(&df->access_granted))
> + return -EINVAL;
> +
>   if (unlikely(!device->ops->mmap))
>   return -EINVAL;
>  
> -- 
> 2.34.1
> 


Re: [Intel-gfx] [PATCH v4 16/19] vfio: Add VFIO_DEVICE_BIND_IOMMUFD

2023-02-23 Thread Yan Zhao
On Wed, Feb 22, 2023 at 08:59:51AM -0400, Jason Gunthorpe wrote:
> On Wed, Feb 22, 2023 at 07:44:12AM +, Liu, Yi L wrote:
> > > From: Tian, Kevin 
> > > Sent: Wednesday, February 22, 2023 3:40 PM
> > > 
> > > > From: Liu, Yi L 
> > > > Sent: Tuesday, February 21, 2023 11:48 AM
> > > >
> > > > +
> > > > +void vfio_device_cdev_close(struct vfio_device_file *df)
> > > > +{
> > > > +   struct vfio_device *device = df->device;
> > > > +
> > > > +   mutex_lock(&device->dev_set->lock);
> > > > +   if (!smp_load_acquire(&df->access_granted)) {
> > > 
> > > there is no contention with another one changing this flag at this
> > > point so directly accessing it is fine.
> > 
> > make sense. 
> 
> Have to use READ_ONCE though
>
Just a curious question:
given that df->access_granted is now written with device->dev_set->lock held and
also read with this lock held in vfio_device_cdev_close(), is READ_ONCE()
still required? And what about df->iommufd?

Thanks
Yan


Re: [Intel-gfx] [PATCH 03/27] drm/i915/gvt: Incorporate KVM memslot info into check for 2MiB GTT entry

2023-02-23 Thread Yan Zhao
On Thu, Feb 23, 2023 at 12:41:28PM -0800, Sean Christopherson wrote:
> Apologies for the super slow reply, I put this series on the backburner while 
> I
> caught up on other stuff and completely missed your questions.
>
Never mind :)

> On Thu, Jan 19, 2023, Yan Zhao wrote:
> > On Thu, Jan 19, 2023 at 10:58:42AM +0800, Zhenyu Wang wrote:
> > > Current KVMGT usage is mostly in controlled mode, either user is own host 
> > > admin,
> > > or host admin would pre-configure specific limited number of VMs for 
> > > KVMGT use.
> > > I think printk on error should be fine, we don't need rate limit, and 
> > > adding
> > > extra trace monitor for admin might not be necessary. So I'm towards to 
> > > keep to
> > > use current error message.
> > > 
> > 
> > Thanks, Sean and Zhenyu.
> > So, could I just post the final fix as below?
> 
> No objection here.
> 
> > And, Sean, would you like to include it in this series or should I send it 
> > out
> > first?
> 
> I'd like to include it in this series as it's necessary (for some definitions 
> of
> necessary) to clean up KVM's APIs, and the main benefactor is KVM, i.e. 
> getting
> the patch merged sooner than later doesn't really benefit KVMGT itself.
> 
> Thanks much!

Then please include it, and I can help test once you send out the next
version.

Thanks
Yan


Re: [Intel-gfx] [PATCH 2/6] i915/gvt/kvm: a NULL ->mm does not mean a thread is a kthread

2020-04-06 Thread Yan Zhao
On Sat, Apr 04, 2020 at 11:40:57AM +0200, Christoph Hellwig wrote:
> Use the proper API instead.
> 
> Fixes: f440c8a572d7 ("drm/i915/gvt/kvmgt: read/write GPA via KVM API")
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/gpu/drm/i915/gvt/kvmgt.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/gvt/kvmgt.c 
> b/drivers/gpu/drm/i915/gvt/kvmgt.c
> index 074c4efb58eb..5848400620b4 100644
> --- a/drivers/gpu/drm/i915/gvt/kvmgt.c
> +++ b/drivers/gpu/drm/i915/gvt/kvmgt.c
> @@ -2037,7 +2037,7 @@ static int kvmgt_rw_gpa(unsigned long handle, unsigned 
> long gpa,
>   struct kvmgt_guest_info *info;
>   struct kvm *kvm;
>   int idx, ret;
> - bool kthread = current->mm == NULL;
> + bool kthread = (current->flags & PF_KTHREAD);
>  
>   if (!handle_valid(handle))
>   return -ESRCH;
> -- 
> 2.25.1
>
Hi,
we are removing this code; see
https://lore.kernel.org/kvm/20200313031109.7989-1-yan.y.z...@intel.com/

The implementation of vfio_dma_rw() is already in the vfio next tree:
https://github.com/awilliam/linux-vfio/commit/8d46c0cca5f4dc0538173d62cd36b1119b5105bc

In vfio_dma_rw(), we still use
bool kthread = current->mm == NULL;
because if current->mm != NULL and current->flags & PF_KTHREAD, instead
of calling use_mm(), we first check whether current->mm == mm and allow
copy_to_user() if it is true.

Do you think it's all right?

Thanks
Yan



> ___
> intel-gvt-dev mailing list
> intel-gvt-...@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH 2/6] i915/gvt/kvm: a NULL ->mm does not mean a thread is a kthread

2020-04-13 Thread Yan Zhao
On Mon, Apr 13, 2020 at 03:27:30PM +0200, Christoph Hellwig wrote:
> On Mon, Apr 06, 2020 at 11:08:46PM -0400, Yan Zhao wrote:
> > hi
> > we were removing this code. see
> > https://lore.kernel.org/kvm/20200313031109.7989-1-yan.y.z...@intel.com/
> 
> This didn't make 5.7-rc1.
> 
> > The implementation of vfio_dma_rw() has been in vfio next tree.
> > https://github.com/awilliam/linux-vfio/commit/8d46c0cca5f4dc0538173d62cd36b1119b5105bc
> 
> 
> This made 5.7-rc1, so I'll update the series to take it into account.
> 
> T
> > in vfio_dma_rw(),  we still use
> > bool kthread = current->mm == NULL.
> > because if current->mm != NULL and current->flags & PF_KTHREAD, instead
> > of calling use_mm(), we first check if (current->mm == mm) and allow 
> > copy_to_user() if it's true.
> > 
> > Do you think it's all right?
> 
> I can't think of another way for a kernel thread to have a mm indeed.
For example, before calling vfio_dma_rw(), a kernel thread may have already
called use_mm(); then its current->mm is not NULL and it has the PF_KTHREAD
flag. In this case, we just want to allow the copy_to_user() directly if
current->mm == mm, rather than calling use_mm() again.

Do you think it makes sense?

Thanks
Yan



Re: [Intel-gfx] [PATCH 2/6] i915/gvt/kvm: a NULL ->mm does not mean a thread is a kthread

2020-04-14 Thread Yan Zhao
On Tue, Apr 14, 2020 at 09:00:13AM +0200, Christoph Hellwig wrote:
> On Mon, Apr 13, 2020 at 08:04:10PM -0400, Yan Zhao wrote:
> > > I can't think of another way for a kernel thread to have a mm indeed.
> > for example, before calling to vfio_dma_rw(), a kernel thread has already
> > called use_mm(), then its current->mm is not null, and it has flag
> > PF_KTHREAD.
> > in this case, we just want to allow the copy_to_user() directly if
> > current->mm == mm, rather than call another use_mm() again.
> > 
> > do you think it makes sense?
> 
> I mean no other way than using use_mm.  That being said nesting
> potentional use_mm callers sounds like a rather bad idea, and we
> should avoid that.
Yes, agree.
I was explaining why we use "current->mm == NULL"
(not "current->flags & PF_KTHREAD") as the criterion for calling use_mm()
in vfio_dma_rw(), which you might ask about when you take that part into your
series. :)


[Intel-gfx] [PATCH 0/3] add track_remove_slot and remove track_flush_slot

2022-11-11 Thread Yan Zhao
This series is based on Sean's series
https://lore.kernel.org/all/20221110014821.1548347-1-sea...@google.com/,
which allows KVM internal user of page track not to rely on the page track
hook .track_flush_slot.

Page track hook track_flush_slot is for notification of a slot flush and
is called while a slot DELETE/MOVE is ongoing.

Page track hook track_remove_slot is for notification of slot removal
and is called after the slot DELETE/MOVE has been committed.

As KVMGT, the only external user of page track, only cares about when
slot removal actually happens, this series switches KVMGT to the new
hook .track_remove_slot.
And as there are no users of .track_flush_slot any more, that hook is
removed.
 
Yan Zhao (3):
  KVM: x86: add a new page track hook track_remove_slot
  drm/i915/gvt: switch from track_flush_slot to track_remove_slot
  KVM: x86: Remove the unused page track hook track_flush_slot

 arch/x86/include/asm/kvm_page_track.h | 8 
 arch/x86/kvm/mmu/page_track.c | 8 
 arch/x86/kvm/x86.c| 5 +++--
 drivers/gpu/drm/i915/gvt/kvmgt.c  | 6 +++---
 4 files changed, 14 insertions(+), 13 deletions(-)

-- 
2.17.1



[Intel-gfx] [PATCH 1/3] KVM: x86: add a new page track hook track_remove_slot

2022-11-11 Thread Yan Zhao
Page track hook track_remove_slot is used to notify users that a slot
has been removed and is called when a slot DELETE/MOVE is about to be
completed.

Users of this hook can drop write protections in the removed slot.

Note:
Since KVM_MR_MOVE currently never actually happens in KVM/QEMU, and has
never been properly supported in the external page track user, we just
use the hook track_remove_slot to notify users of the old slot being
removed.

Cc: Zhenyu Wang 
Suggested-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Yan Zhao 
---
 arch/x86/include/asm/kvm_page_track.h | 11 +++
 arch/x86/kvm/mmu/page_track.c | 26 ++
 arch/x86/kvm/x86.c|  3 +++
 3 files changed, 40 insertions(+)

diff --git a/arch/x86/include/asm/kvm_page_track.h 
b/arch/x86/include/asm/kvm_page_track.h
index eb186bc57f6a..046b024d1813 100644
--- a/arch/x86/include/asm/kvm_page_track.h
+++ b/arch/x86/include/asm/kvm_page_track.h
@@ -44,6 +44,16 @@ struct kvm_page_track_notifier_node {
 */
void (*track_flush_slot)(struct kvm *kvm, struct kvm_memory_slot *slot,
struct kvm_page_track_notifier_node *node);
+   /*
+* It is called when memory slot is moved or removed
+* users can drop write-protection for the pages in that memory slot
+*
+* @kvm: the kvm where memory slot being moved or removed
+* @slot: the memory slot being moved or removed
+* @node: this node
+*/
+   void (*track_remove_slot)(struct kvm *kvm, struct kvm_memory_slot *slot,
+ struct kvm_page_track_notifier_node *node);
 };
 
 int kvm_page_track_init(struct kvm *kvm);
@@ -76,4 +86,5 @@ kvm_page_track_unregister_notifier(struct kvm *kvm,
 void kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
  int bytes);
 void kvm_page_track_flush_slot(struct kvm *kvm, struct kvm_memory_slot *slot);
+void kvm_page_track_remove_slot(struct kvm *kvm, struct kvm_memory_slot *slot);
 #endif
diff --git a/arch/x86/kvm/mmu/page_track.c b/arch/x86/kvm/mmu/page_track.c
index 2e09d1b6249f..4d6bab1d61c9 100644
--- a/arch/x86/kvm/mmu/page_track.c
+++ b/arch/x86/kvm/mmu/page_track.c
@@ -300,3 +300,29 @@ void kvm_page_track_flush_slot(struct kvm *kvm, struct 
kvm_memory_slot *slot)
n->track_flush_slot(kvm, slot, n);
srcu_read_unlock(&head->track_srcu, idx);
 }
+
+/*
+ * Notify the node that memory slot is removed or moved so that it can
+ * drop write-protection for the pages in the memory slot.
+ *
+ * The node should figure out it has any write-protected pages in this slot
+ * by itself.
+ */
+void kvm_page_track_remove_slot(struct kvm *kvm, struct kvm_memory_slot *slot)
+{
+   struct kvm_page_track_notifier_head *head;
+   struct kvm_page_track_notifier_node *n;
+   int idx;
+
+   head = &kvm->arch.track_notifier_head;
+
+   if (hlist_empty(&head->track_notifier_list))
+   return;
+
+   idx = srcu_read_lock(&head->track_srcu);
+   hlist_for_each_entry_srcu(n, &head->track_notifier_list, node,
+   srcu_read_lock_held(&head->track_srcu))
+   if (n->track_remove_slot)
+   n->track_remove_slot(kvm, slot, n);
+   srcu_read_unlock(&head->track_srcu, idx);
+}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 916ebbc81e52..a24a4a2ad1a0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12844,6 +12844,9 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
const struct kvm_memory_slot *new,
enum kvm_mr_change change)
 {
+   if (change == KVM_MR_DELETE || change == KVM_MR_MOVE)
+   kvm_page_track_remove_slot(kvm, old);
+
if (!kvm->arch.n_requested_mmu_pages &&
(change == KVM_MR_CREATE || change == KVM_MR_DELETE)) {
unsigned long nr_mmu_pages;
-- 
2.17.1



[Intel-gfx] [PATCH 2/3] drm/i915/gvt: switch from track_flush_slot to track_remove_slot

2022-11-11 Thread Yan Zhao
KVMGT only cares about when a slot is actually removed.
So switch to track_remove_slot, which is called when a slot is removed.

Cc: Zhenyu Wang 
Suggested-by: Sean Christopherson 
Signed-off-by: Yan Zhao 
---
 drivers/gpu/drm/i915/gvt/kvmgt.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/gvt/kvmgt.c b/drivers/gpu/drm/i915/gvt/kvmgt.c
index 714221f9a131..9582d047471f 100644
--- a/drivers/gpu/drm/i915/gvt/kvmgt.c
+++ b/drivers/gpu/drm/i915/gvt/kvmgt.c
@@ -109,7 +109,7 @@ struct gvt_dma {
 static void kvmgt_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa,
const u8 *val, int len,
struct kvm_page_track_notifier_node *node);
-static void kvmgt_page_track_flush_slot(struct kvm *kvm,
+static void kvmgt_page_track_remove_slot(struct kvm *kvm,
struct kvm_memory_slot *slot,
struct kvm_page_track_notifier_node *node);
 
@@ -673,7 +673,7 @@ static int intel_vgpu_open_device(struct vfio_device 
*vfio_dev)
gvt_cache_init(vgpu);
 
vgpu->track_node.track_write = kvmgt_page_track_write;
-   vgpu->track_node.track_flush_slot = kvmgt_page_track_flush_slot;
+   vgpu->track_node.track_remove_slot = kvmgt_page_track_remove_slot;
kvm_get_kvm(vgpu->vfio_device.kvm);
kvm_page_track_register_notifier(vgpu->vfio_device.kvm,
 &vgpu->track_node);
@@ -1617,7 +1617,7 @@ static void kvmgt_page_track_write(struct kvm_vcpu *vcpu, 
gpa_t gpa,
 (void *)val, len);
 }
 
-static void kvmgt_page_track_flush_slot(struct kvm *kvm,
+static void kvmgt_page_track_remove_slot(struct kvm *kvm,
struct kvm_memory_slot *slot,
struct kvm_page_track_notifier_node *node)
 {
-- 
2.17.1



[Intel-gfx] [PATCH 3/3] KVM: x86: Remove the unused page track hook track_flush_slot

2022-11-11 Thread Yan Zhao
There are no users of the hook track_flush_slot any more, and no external
page track user cares about slot flush.
So remove this hook.

Cc: Zhenyu Wang 
Suggested-by: Sean Christopherson 
Signed-off-by: Yan Zhao 
---
 arch/x86/include/asm/kvm_page_track.h | 11 ---
 arch/x86/kvm/mmu/page_track.c | 26 --
 arch/x86/kvm/x86.c|  2 --
 3 files changed, 39 deletions(-)

diff --git a/arch/x86/include/asm/kvm_page_track.h 
b/arch/x86/include/asm/kvm_page_track.h
index 046b024d1813..4f1d3c91fdc7 100644
--- a/arch/x86/include/asm/kvm_page_track.h
+++ b/arch/x86/include/asm/kvm_page_track.h
@@ -34,16 +34,6 @@ struct kvm_page_track_notifier_node {
 */
void (*track_write)(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
int bytes, struct kvm_page_track_notifier_node 
*node);
-   /*
-* It is called when memory slot is being moved or removed
-* users can drop write-protection for the pages in that memory slot
-*
-* @kvm: the kvm where memory slot being moved or removed
-* @slot: the memory slot being moved or removed
-* @node: this node
-*/
-   void (*track_flush_slot)(struct kvm *kvm, struct kvm_memory_slot *slot,
-   struct kvm_page_track_notifier_node *node);
/*
 * It is called when memory slot is moved or removed
 * users can drop write-protection for the pages in that memory slot
@@ -85,6 +75,5 @@ kvm_page_track_unregister_notifier(struct kvm *kvm,
   struct kvm_page_track_notifier_node *n);
 void kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
  int bytes);
-void kvm_page_track_flush_slot(struct kvm *kvm, struct kvm_memory_slot *slot);
 void kvm_page_track_remove_slot(struct kvm *kvm, struct kvm_memory_slot *slot);
 #endif
diff --git a/arch/x86/kvm/mmu/page_track.c b/arch/x86/kvm/mmu/page_track.c
index 4d6bab1d61c9..f783aea618f8 100644
--- a/arch/x86/kvm/mmu/page_track.c
+++ b/arch/x86/kvm/mmu/page_track.c
@@ -275,32 +275,6 @@ void kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t 
gpa, const u8 *new,
srcu_read_unlock(&head->track_srcu, idx);
 }
 
-/*
- * Notify the node that memory slot is being removed or moved so that it can
- * drop write-protection for the pages in the memory slot.
- *
- * The node should figure out it has any write-protected pages in this slot
- * by itself.
- */
-void kvm_page_track_flush_slot(struct kvm *kvm, struct kvm_memory_slot *slot)
-{
-   struct kvm_page_track_notifier_head *head;
-   struct kvm_page_track_notifier_node *n;
-   int idx;
-
-   head = &kvm->arch.track_notifier_head;
-
-   if (hlist_empty(&head->track_notifier_list))
-   return;
-
-   idx = srcu_read_lock(&head->track_srcu);
-   hlist_for_each_entry_srcu(n, &head->track_notifier_list, node,
-   srcu_read_lock_held(&head->track_srcu))
-   if (n->track_flush_slot)
-   n->track_flush_slot(kvm, slot, n);
-   srcu_read_unlock(&head->track_srcu, idx);
-}
-
 /*
  * Notify the node that memory slot is removed or moved so that it can
  * drop write-protection for the pages in the memory slot.
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a24a4a2ad1a0..260288f4d741 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12872,8 +12872,6 @@ void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
   struct kvm_memory_slot *slot)
 {
kvm_mmu_zap_all_fast(kvm);
-
-   kvm_page_track_flush_slot(kvm, slot);
 }
 
 static inline bool kvm_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
-- 
2.17.1



[Intel-gfx] [PATCH v2 0/3] add track_remove_slot and remove track_flush_slot

2022-11-11 Thread Yan Zhao
This series is based on Sean's series
https://lore.kernel.org/all/20221110014821.1548347-1-sea...@google.com/,
which allows KVM internal user of page track not to rely on the page track
hook .track_flush_slot.

Page track hook track_flush_slot is for notification of a slot flush and
is called while a slot DELETE/MOVE is ongoing.

Page track hook track_remove_slot is for notification of slot removal
and is called when the slot DELETE/MOVE has been committed.

As KVMGT, the only external user of page track, only cares about when a
slot removal actually happens, this series switches KVMGT to the new hook
.track_remove_slot.
Since that leaves .track_flush_slot without any users, the hook is
removed.

v2:
Corrected wrong email address of Sean. sorry.
 
Yan Zhao (3):
  KVM: x86: add a new page track hook track_remove_slot
  drm/i915/gvt: switch from track_flush_slot to track_remove_slot
  KVM: x86: Remove the unused page track hook track_flush_slot

 arch/x86/include/asm/kvm_page_track.h | 8 
 arch/x86/kvm/mmu/page_track.c | 8 
 arch/x86/kvm/x86.c| 5 +++--
 drivers/gpu/drm/i915/gvt/kvmgt.c  | 6 +++---
 4 files changed, 14 insertions(+), 13 deletions(-)

-- 
2.17.1



[Intel-gfx] [PATCH v2 1/3] KVM: x86: add a new page track hook track_remove_slot

2022-11-11 Thread Yan Zhao
Page track hook track_remove_slot is used to notify users that a slot
has been removed and is called when a slot DELETE/MOVE is about to be
completed.

Users of this hook can drop write protections in the removed slot.

Note:
Since KVM_MR_MOVE currently never actually happens in KVM/QEMU, and has
never been properly supported in the external page track user, we just
use the hook track_remove_slot to notify users of the old slot being
removed.

Cc: Zhenyu Wang 
Suggested-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Yan Zhao 
---
 arch/x86/include/asm/kvm_page_track.h | 11 +++
 arch/x86/kvm/mmu/page_track.c | 26 ++
 arch/x86/kvm/x86.c|  3 +++
 3 files changed, 40 insertions(+)

diff --git a/arch/x86/include/asm/kvm_page_track.h b/arch/x86/include/asm/kvm_page_track.h
index eb186bc57f6a..046b024d1813 100644
--- a/arch/x86/include/asm/kvm_page_track.h
+++ b/arch/x86/include/asm/kvm_page_track.h
@@ -44,6 +44,16 @@ struct kvm_page_track_notifier_node {
 */
void (*track_flush_slot)(struct kvm *kvm, struct kvm_memory_slot *slot,
struct kvm_page_track_notifier_node *node);
+   /*
+* It is called when memory slot is moved or removed
+* users can drop write-protection for the pages in that memory slot
+*
+* @kvm: the kvm where memory slot being moved or removed
+* @slot: the memory slot being moved or removed
+* @node: this node
+*/
+   void (*track_remove_slot)(struct kvm *kvm, struct kvm_memory_slot *slot,
+ struct kvm_page_track_notifier_node *node);
 };
 
 int kvm_page_track_init(struct kvm *kvm);
@@ -76,4 +86,5 @@ kvm_page_track_unregister_notifier(struct kvm *kvm,
 void kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
  int bytes);
 void kvm_page_track_flush_slot(struct kvm *kvm, struct kvm_memory_slot *slot);
+void kvm_page_track_remove_slot(struct kvm *kvm, struct kvm_memory_slot *slot);
 #endif
diff --git a/arch/x86/kvm/mmu/page_track.c b/arch/x86/kvm/mmu/page_track.c
index 2e09d1b6249f..4d6bab1d61c9 100644
--- a/arch/x86/kvm/mmu/page_track.c
+++ b/arch/x86/kvm/mmu/page_track.c
@@ -300,3 +300,29 @@ void kvm_page_track_flush_slot(struct kvm *kvm, struct kvm_memory_slot *slot)
n->track_flush_slot(kvm, slot, n);
srcu_read_unlock(&head->track_srcu, idx);
 }
+
+/*
+ * Notify the node that memory slot is removed or moved so that it can
+ * drop write-protection for the pages in the memory slot.
+ *
+ * The node should figure out it has any write-protected pages in this slot
+ * by itself.
+ */
+void kvm_page_track_remove_slot(struct kvm *kvm, struct kvm_memory_slot *slot)
+{
+   struct kvm_page_track_notifier_head *head;
+   struct kvm_page_track_notifier_node *n;
+   int idx;
+
+   head = &kvm->arch.track_notifier_head;
+
+   if (hlist_empty(&head->track_notifier_list))
+   return;
+
+   idx = srcu_read_lock(&head->track_srcu);
+   hlist_for_each_entry_srcu(n, &head->track_notifier_list, node,
+   srcu_read_lock_held(&head->track_srcu))
+   if (n->track_remove_slot)
+   n->track_remove_slot(kvm, slot, n);
+   srcu_read_unlock(&head->track_srcu, idx);
+}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 916ebbc81e52..a24a4a2ad1a0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12844,6 +12844,9 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
const struct kvm_memory_slot *new,
enum kvm_mr_change change)
 {
+   if (change == KVM_MR_DELETE || change == KVM_MR_MOVE)
+   kvm_page_track_remove_slot(kvm, old);
+
if (!kvm->arch.n_requested_mmu_pages &&
(change == KVM_MR_CREATE || change == KVM_MR_DELETE)) {
unsigned long nr_mmu_pages;
-- 
2.17.1



[Intel-gfx] [PATCH v2 2/3] drm/i915/gvt: switch from track_flush_slot to track_remove_slot

2022-11-11 Thread Yan Zhao
KVMGT only cares about when a slot is actually removed, so switch to
track_remove_slot, which is called when a slot is removed.

Cc: Zhenyu Wang 
Suggested-by: Sean Christopherson 
Signed-off-by: Yan Zhao 
---
 drivers/gpu/drm/i915/gvt/kvmgt.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/gvt/kvmgt.c b/drivers/gpu/drm/i915/gvt/kvmgt.c
index 714221f9a131..9582d047471f 100644
--- a/drivers/gpu/drm/i915/gvt/kvmgt.c
+++ b/drivers/gpu/drm/i915/gvt/kvmgt.c
@@ -109,7 +109,7 @@ struct gvt_dma {
 static void kvmgt_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa,
const u8 *val, int len,
struct kvm_page_track_notifier_node *node);
-static void kvmgt_page_track_flush_slot(struct kvm *kvm,
+static void kvmgt_page_track_remove_slot(struct kvm *kvm,
struct kvm_memory_slot *slot,
struct kvm_page_track_notifier_node *node);
 
@@ -673,7 +673,7 @@ static int intel_vgpu_open_device(struct vfio_device *vfio_dev)
gvt_cache_init(vgpu);
 
vgpu->track_node.track_write = kvmgt_page_track_write;
-   vgpu->track_node.track_flush_slot = kvmgt_page_track_flush_slot;
+   vgpu->track_node.track_remove_slot = kvmgt_page_track_remove_slot;
kvm_get_kvm(vgpu->vfio_device.kvm);
kvm_page_track_register_notifier(vgpu->vfio_device.kvm,
 &vgpu->track_node);
@@ -1617,7 +1617,7 @@ static void kvmgt_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 (void *)val, len);
 }
 
-static void kvmgt_page_track_flush_slot(struct kvm *kvm,
+static void kvmgt_page_track_remove_slot(struct kvm *kvm,
struct kvm_memory_slot *slot,
struct kvm_page_track_notifier_node *node)
 {
-- 
2.17.1



[Intel-gfx] [PATCH v2 3/3] KVM: x86: Remove the unused page track hook track_flush_slot

2022-11-11 Thread Yan Zhao
There are no users of the hook track_flush_slot anymore, and no external
page track user cares about slot flushes.
So remove this hook.

Cc: Zhenyu Wang 
Suggested-by: Sean Christopherson 
Signed-off-by: Yan Zhao 
---
 arch/x86/include/asm/kvm_page_track.h | 11 ---
 arch/x86/kvm/mmu/page_track.c | 26 --
 arch/x86/kvm/x86.c|  2 --
 3 files changed, 39 deletions(-)

diff --git a/arch/x86/include/asm/kvm_page_track.h b/arch/x86/include/asm/kvm_page_track.h
index 046b024d1813..4f1d3c91fdc7 100644
--- a/arch/x86/include/asm/kvm_page_track.h
+++ b/arch/x86/include/asm/kvm_page_track.h
@@ -34,16 +34,6 @@ struct kvm_page_track_notifier_node {
 */
void (*track_write)(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
			int bytes, struct kvm_page_track_notifier_node *node);
-   /*
-* It is called when memory slot is being moved or removed
-* users can drop write-protection for the pages in that memory slot
-*
-* @kvm: the kvm where memory slot being moved or removed
-* @slot: the memory slot being moved or removed
-* @node: this node
-*/
-   void (*track_flush_slot)(struct kvm *kvm, struct kvm_memory_slot *slot,
-   struct kvm_page_track_notifier_node *node);
/*
 * It is called when memory slot is moved or removed
 * users can drop write-protection for the pages in that memory slot
@@ -85,6 +75,5 @@ kvm_page_track_unregister_notifier(struct kvm *kvm,
   struct kvm_page_track_notifier_node *n);
 void kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
  int bytes);
-void kvm_page_track_flush_slot(struct kvm *kvm, struct kvm_memory_slot *slot);
 void kvm_page_track_remove_slot(struct kvm *kvm, struct kvm_memory_slot *slot);
 #endif
diff --git a/arch/x86/kvm/mmu/page_track.c b/arch/x86/kvm/mmu/page_track.c
index 4d6bab1d61c9..f783aea618f8 100644
--- a/arch/x86/kvm/mmu/page_track.c
+++ b/arch/x86/kvm/mmu/page_track.c
@@ -275,32 +275,6 @@ void kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
srcu_read_unlock(&head->track_srcu, idx);
 }
 
-/*
- * Notify the node that memory slot is being removed or moved so that it can
- * drop write-protection for the pages in the memory slot.
- *
- * The node should figure out it has any write-protected pages in this slot
- * by itself.
- */
-void kvm_page_track_flush_slot(struct kvm *kvm, struct kvm_memory_slot *slot)
-{
-   struct kvm_page_track_notifier_head *head;
-   struct kvm_page_track_notifier_node *n;
-   int idx;
-
-   head = &kvm->arch.track_notifier_head;
-
-   if (hlist_empty(&head->track_notifier_list))
-   return;
-
-   idx = srcu_read_lock(&head->track_srcu);
-   hlist_for_each_entry_srcu(n, &head->track_notifier_list, node,
-   srcu_read_lock_held(&head->track_srcu))
-   if (n->track_flush_slot)
-   n->track_flush_slot(kvm, slot, n);
-   srcu_read_unlock(&head->track_srcu, idx);
-}
-
 /*
  * Notify the node that memory slot is removed or moved so that it can
  * drop write-protection for the pages in the memory slot.
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a24a4a2ad1a0..260288f4d741 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12872,8 +12872,6 @@ void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
   struct kvm_memory_slot *slot)
 {
kvm_mmu_zap_all_fast(kvm);
-
-   kvm_page_track_flush_slot(kvm, slot);
 }
 
 static inline bool kvm_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
-- 
2.17.1



Re: [Intel-gfx] [PATCH v2 1/3] KVM: x86: add a new page track hook track_remove_slot

2022-11-11 Thread Yan Zhao
On Fri, Nov 11, 2022 at 06:19:15PM +, Sean Christopherson wrote:
> TL;DR: I'm going to try to add more aggressive patches for this into my 
> series to
> clean up the KVM side of things, along with many more patches to clean up the 
> page
> track APIs.
> 
> I'll post patches next week if things go well (fingers crossed), and if not 
> I'll
> give an update 
> 
> On Fri, Nov 11, 2022, Yan Zhao wrote:
> > Page track hook track_remove_slot is used to notify users that a slot
> > has been removed and is called when a slot DELETE/MOVE is about to be
> > completed.
> 
> Phrase this as a command, and explain _why_ the new hook is being added, e.g.
> 
>   Add a new page track hook, track_remove_slot(), that is called when a
>   memslot DELETE/MOVE operation is about to be committed.  The "remove"
>   hook will be used by KVMGT and will effectively replace the existing
>   track_flush_slot() altogether now that KVM itself doesn't rely on the
>   "flush" hook either.
> 
>   The "flush" hook is flawed as it's invoked before the memslot operation
>   is guaranteed, i.e. KVM might ultimately keep the existing memslot without
>   notifying external page track users, a.k.a. KVMGT.
> 
> > Users of this hook can drop write protections in the removed slot.
> 
> Hmm, actually, on second thought, after thinking about what KVGT is doing in
> response to the memslot update, I think we should be more aggressive and 
> actively
> prevent MOVE if there are external page trackers, i.e. if KVMGT is attached.
> 
> Dropping write protections when a memslot is being deleted is a waste of 
> cycles.
> The memslot and thus gfn_track is literally being deleted immediately after 
> invoking
> the hook, updating gfn_track from KVMGT is completely unecessary.

> I.e. if we kill off the MOVE path, then KVMGT just needs to delete its hash 
> table
> entry.
> 
> Oooh!  Looking at this code again made me realize that the larger page track 
> cleanup
> that I want to do might actually work.  Long story short, I want to stop 
> forcing
> KVMGT to poke into KVM internals, but I thought there was a lock inversion 
> problem.
> 
> But AFAICT, there is no such problem.  And the cleanup I want to do will 
> actually
> fix an existing KVMGT bug: kvmgt_page_track_write() invokes 
> kvmgt_gfn_is_write_protected()
> without holding mmu_lock, and thus could consume garbage when walking the hash
> table.
> 
>   static void kvmgt_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa,
>   const u8 *val, int len,
>   struct kvm_page_track_notifier_node *node)
>   {
>   struct intel_vgpu *info =
>   container_of(node, struct intel_vgpu, track_node);
> 
>   if (kvmgt_gfn_is_write_protected(info, gpa_to_gfn(gpa)))
>   intel_vgpu_page_track_handler(info, gpa,
>(void *)val, len);
>   }
> 
> Acquiring mmu_lock isn't an option as intel_vgpu_page_track_handler() might 
> sleep,
> e.g. when acquiring vgpu_lock.
>
I totally agree with you and actually had the same feeling when I
examined the code yesterday, but I thought I'd better send this series
first and do the cleanup later :)
I'm also not sure whether slots_arch_lock is required for
kvm_slot_page_track_add_page() and kvm_slot_page_track_remove_page().
(Though it doesn't matter for a slot being removed, and we actually
needn't call kvm_slot_page_track_remove_page() in response to a slot
removal, the two interfaces are still required elsewhere.)


> Let me see if the clean up I have in mind will actually work.  If it does, I 
> think
> the end result will be quite nice for both KVM and KVMGT.
Yes, it would be great.

Thanks
Yan


Re: [Intel-gfx] [PATCH v2 2/3] drm/i915/gvt: switch from track_flush_slot to track_remove_slot

2022-11-11 Thread Yan Zhao
On Fri, Nov 11, 2022 at 05:07:35PM +, Sean Christopherson wrote:
> On Fri, Nov 11, 2022, Yan Zhao wrote:
> > KVMGT only cares about when a slot is indeed removed.
> > So switch to use track_remove_slot which is called when a slot is removed.
> 
> This should capture the original motivation, i.e. that the existing
> ->track_flush_slot() hook is theoretically flawed.  I think it also makes 
> sense
> to call out that KVMGT undoubtedly does the wrong thing if a memslot is moved,
> but that (a) KVMGT has never supported moving memslots and (b) there's no sane
> use case for moving memslots that might be used by the guest for the GTT.
> 
> Bonus points if you can figure out a way to capture the restriction in the 
> docs,
> e.g. somewhere in gpu/i915.rst?
> 
> Lastly, provide a link to the original discussion which provides even more 
> context.
> 
> Link: https://lore.kernel.org/all/20221108084416.11447-1-yan.y.z...@intel.com
>
Got it. I'll do it next time!

Thanks
Yan

> > Cc: Zhenyu Wang 
> > Suggested-by: Sean Christopherson 
> > Signed-off-by: Yan Zhao 
> > ---


Re: [Intel-gfx] [PATCH v2 1/3] KVM: x86: add a new page track hook track_remove_slot

2022-11-13 Thread Yan Zhao
On Sat, Nov 12, 2022 at 12:43:07AM +, Sean Christopherson wrote:
> On Sat, Nov 12, 2022, Yan Zhao wrote:
> > And I'm also not sure if a slots_arch_lock is required for
> > kvm_slot_page_track_add_page() and kvm_slot_page_track_remove_page().
> 
> It's not required.  slots_arch_lock protects interaction between memslot 
> updates
In kvm_slot_page_track_add_page() and kvm_slot_page_track_remove_page(),
slot->arch.gfn_track[mode][index] is updated in update_gfn_track(),
do you know which lock is used to protect it?

Thanks
Yan

> mmu_first_shadow_root_alloc().  When CONFIG_KVM_EXTERNAL_WRITE_TRACKING=y, 
> then
> the mmu_first_shadow_root_alloc() doesn't touch the memslots because 
> everything
> is pre-allocated:
> 
> bool kvm_page_track_write_tracking_enabled(struct kvm *kvm)
> {
>   return IS_ENABLED(CONFIG_KVM_EXTERNAL_WRITE_TRACKING) ||
>  !tdp_enabled || kvm_shadow_root_allocated(kvm);
> }
> 
> int kvm_page_track_create_memslot(struct kvm *kvm,
> struct kvm_memory_slot *slot,
> unsigned long npages)
> {
>   if (!kvm_page_track_write_tracking_enabled(kvm)) <== always true
>   return 0;
> 
>   return __kvm_page_track_write_tracking_alloc(slot, npages);
> }
> 
> Though now that you point it out, it's tempting to #ifdef out some of those 
> hooks
> so that's basically impossible for mmu_first_shadow_root_alloc() to cause 
> problems.
> Not sure the extra #ideffery would be worth while though.
> 
> slots_arch_lock also protects shadow_root_allocated, but that's a KVM-internal
> detail that isn't relevant to the page-tracking machinery when
> CONFIG_KVM_EXTERNAL_WRITE_TRACKING=y.


Re: [Intel-gfx] [PATCH v2 1/3] KVM: x86: add a new page track hook track_remove_slot

2022-11-14 Thread Yan Zhao
On Mon, Nov 14, 2022 at 04:32:34PM +, Sean Christopherson wrote:
> On Mon, Nov 14, 2022, Yan Zhao wrote:
> > On Sat, Nov 12, 2022 at 12:43:07AM +, Sean Christopherson wrote:
> > > On Sat, Nov 12, 2022, Yan Zhao wrote:
> > > > And I'm also not sure if a slots_arch_lock is required for
> > > > kvm_slot_page_track_add_page() and kvm_slot_page_track_remove_page().
> > > 
> > > It's not required.  slots_arch_lock protects interaction between memslot 
> > > updates
> > In kvm_slot_page_track_add_page() and kvm_slot_page_track_remove_page(),
> > slot->arch.gfn_track[mode][index] is updated in update_gfn_track(),
> > do you know which lock is used to protect it?
> 
> mmu_lock protects the count, kvm->srcu protects the slot, and 
> shadow_root_allocated
> protects that validity of gfn_track, i.e. shadow_root_allocated ensures that 
> KVM
> allocates gfn_track for all memslots when shadow paging is activated.
Hmm, thanks for the reply.
But in direct_page_fault(),
if (page_fault_handle_page_track(vcpu, fault))
return RET_PF_EMULATE;

slot->arch.gfn_track is read without mmu_lock held.
> 
> The cleanup series I'm prepping adds lockdep assertions for the relevant 
> paths, e.g. 
> 
> $ git grep -B 8 -E "update_gfn_write_track.*;"
> arch/x86/kvm/mmu/page_track.c-void __kvm_write_track_add_gfn(struct kvm *kvm, 
> struct kvm_memory_slot *slot,
> arch/x86/kvm/mmu/page_track.c- gfn_t gfn)
> arch/x86/kvm/mmu/page_track.c-{
> arch/x86/kvm/mmu/page_track.c-  lockdep_assert_held_write(&kvm->mmu_lock);
> arch/x86/kvm/mmu/page_track.c-
> arch/x86/kvm/mmu/page_track.c-  if 
> (KVM_BUG_ON(!kvm_page_track_write_tracking_enabled(kvm), kvm))
> arch/x86/kvm/mmu/page_track.c-  return;
> arch/x86/kvm/mmu/page_track.c-
> arch/x86/kvm/mmu/page_track.c:  update_gfn_write_track(slot, gfn, 1);
> --
> arch/x86/kvm/mmu/page_track.c-void __kvm_write_track_remove_gfn(struct kvm 
> *kvm,
> arch/x86/kvm/mmu/page_track.c-struct 
> kvm_memory_slot *slot, gfn_t gfn)
> arch/x86/kvm/mmu/page_track.c-{
> arch/x86/kvm/mmu/page_track.c-  lockdep_assert_held_write(&kvm->mmu_lock);
> arch/x86/kvm/mmu/page_track.c-
> arch/x86/kvm/mmu/page_track.c-  if 
> (KVM_BUG_ON(!kvm_page_track_write_tracking_enabled(kvm), kvm))
> arch/x86/kvm/mmu/page_track.c-  return;
> arch/x86/kvm/mmu/page_track.c-
> arch/x86/kvm/mmu/page_track.c:  update_gfn_write_track(slot, gfn, -1);
Yes, it will be helpful.

Besides, would using WRITE_ONCE() or atomic_add() in
update_gfn_write_track() to update slot->arch.gfn_track be better?

Thanks
Yan


Re: [Intel-gfx] [PATCH v2 1/3] KVM: x86: add a new page track hook track_remove_slot

2022-11-14 Thread Yan Zhao
On Mon, Nov 14, 2022 at 11:24:16PM +, Sean Christopherson wrote:
> On Tue, Nov 15, 2022, Yan Zhao wrote:
> > On Mon, Nov 14, 2022 at 04:32:34PM +, Sean Christopherson wrote:
> > > On Mon, Nov 14, 2022, Yan Zhao wrote:
> > > > On Sat, Nov 12, 2022 at 12:43:07AM +, Sean Christopherson wrote:
> > > > > On Sat, Nov 12, 2022, Yan Zhao wrote:
> > > > > > And I'm also not sure if a slots_arch_lock is required for
> > > > > > kvm_slot_page_track_add_page() and 
> > > > > > kvm_slot_page_track_remove_page().
> > > > > 
> > > > > It's not required.  slots_arch_lock protects interaction between 
> > > > > memslot updates
> > > > In kvm_slot_page_track_add_page() and kvm_slot_page_track_remove_page(),
> > > > slot->arch.gfn_track[mode][index] is updated in update_gfn_track(),
> > > > do you know which lock is used to protect it?
> > > 
> > > mmu_lock protects the count, kvm->srcu protects the slot, and 
> > > shadow_root_allocated
> > > protects that validity of gfn_track, i.e. shadow_root_allocated ensures 
> > > that KVM
> > > allocates gfn_track for all memslots when shadow paging is activated.
> > Hmm, thanks for the reply.
> > but in direct_page_fault(),
> > if (page_fault_handle_page_track(vcpu, fault))
> > return RET_PF_EMULATE;
> > 
> > slot->arch.gfn_track is read without any mmu_lock is held.
> 
> That's a fast path that deliberately reads out of mmu_lock.  A false positive
> only results in unnecessary emulation, and any false positive is inherently 
> prone
> to races anyways, e.g. fault racing with zap.
What about a false negative?
If the fast path reads a count of 0, no page-track write callback will be
called, and write protection will be removed in the slow path.


Thanks
Yan
> 
> > > arch/x86/kvm/mmu/page_track.c-void __kvm_write_track_remove_gfn(struct 
> > > kvm *kvm,
> > > arch/x86/kvm/mmu/page_track.c-struct 
> > > kvm_memory_slot *slot, gfn_t gfn)
> > > arch/x86/kvm/mmu/page_track.c-{
> > > arch/x86/kvm/mmu/page_track.c-  lockdep_assert_held_write(&kvm->mmu_lock);
> > > arch/x86/kvm/mmu/page_track.c-
> > > arch/x86/kvm/mmu/page_track.c-  if 
> > > (KVM_BUG_ON(!kvm_page_track_write_tracking_enabled(kvm), kvm))
> > > arch/x86/kvm/mmu/page_track.c-  return;
> > > arch/x86/kvm/mmu/page_track.c-
> > > arch/x86/kvm/mmu/page_track.c:  update_gfn_write_track(slot, gfn, -1);
> > yes, it will be helpful.
> > 
> > Besides, will WRITE_ONCE or atomic_add in update_gfn_write_track() to
> > update slot->arch.gfn_track be better?
> 
> WRITE_ONCE() won't suffice, it needs to be atomic.  Switching to 
> atomic_inc/dec
> isn't worth it so long as KVM's shadow MMU takes mmu_lock for write, i.e. 
> while
> the accounting is mutually exclusive for other reasons in both KVM and KVMGT.


Re: [Intel-gfx] [PATCH v2 1/3] KVM: x86: add a new page track hook track_remove_slot

2022-11-14 Thread Yan Zhao
On Tue, Nov 15, 2022 at 12:55:42AM +, Sean Christopherson wrote:
> On Tue, Nov 15, 2022, Yan Zhao wrote:
> > On Mon, Nov 14, 2022 at 11:24:16PM +, Sean Christopherson wrote:
> > > On Tue, Nov 15, 2022, Yan Zhao wrote:
> > > > On Mon, Nov 14, 2022 at 04:32:34PM +, Sean Christopherson wrote:
> > > > > On Mon, Nov 14, 2022, Yan Zhao wrote:
> > > > > > On Sat, Nov 12, 2022 at 12:43:07AM +, Sean Christopherson wrote:
> > > > > > > On Sat, Nov 12, 2022, Yan Zhao wrote:
> > > > > > > > And I'm also not sure if a slots_arch_lock is required for
> > > > > > > > kvm_slot_page_track_add_page() and 
> > > > > > > > kvm_slot_page_track_remove_page().
> > > > > > > 
> > > > > > > It's not required.  slots_arch_lock protects interaction between 
> > > > > > > memslot updates
> > > > > > In kvm_slot_page_track_add_page() and 
> > > > > > kvm_slot_page_track_remove_page(),
> > > > > > slot->arch.gfn_track[mode][index] is updated in update_gfn_track(),
> > > > > > do you know which lock is used to protect it?
> > > > > 
> > > > > mmu_lock protects the count, kvm->srcu protects the slot, and 
> > > > > shadow_root_allocated
> > > > > protects that validity of gfn_track, i.e. shadow_root_allocated 
> > > > > ensures that KVM
> > > > > allocates gfn_track for all memslots when shadow paging is activated.
> > > > Hmm, thanks for the reply.
> > > > but in direct_page_fault(),
> > > > if (page_fault_handle_page_track(vcpu, fault))
> > > > return RET_PF_EMULATE;
> > > > 
> > > > slot->arch.gfn_track is read without any mmu_lock is held.
> > > 
> > > That's a fast path that deliberately reads out of mmu_lock.  A false 
> > > positive
> > > only results in unnecessary emulation, and any false positive is 
> > > inherently prone
> > > to races anyways, e.g. fault racing with zap.
> > what about false negative?
> > If the fast path read 0 count, no page track write callback will be called 
> > and write
> > protection will be removed in the slow path.
> 
> No.  For a false negative to occur, a different task would have to create a 
> SPTE
> and write-protect the GFN _while holding mmu_lock_.  And then after acquiring
> mmu_lock, the vCPU that got the false negative would call make_spte(), which 
> would
> detect that making the SPTE writable is disallowed due to the GFN being 
> write-protected.
> 
>   if (pte_access & ACC_WRITE_MASK) {
>   spte |= PT_WRITABLE_MASK | shadow_mmu_writable_mask;
> 
>   /*
>* Optimization: for pte sync, if spte was writable the hash
>* lookup is unnecessary (and expensive). Write protection
>* is responsibility of kvm_mmu_get_page / kvm_mmu_sync_roots.
>* Same reasoning can be applied to dirty page accounting.
>*/
>   if (is_writable_pte(old_spte))
>   goto out;
> 
>   /*
>* Unsync shadow pages that are reachable by the new, writable
>* SPTE.  Write-protect the SPTE if the page can't be unsync'd,
>* e.g. it's write-tracked (upper-level SPs) or has one or more
>* shadow pages and unsync'ing pages is not allowed.
>*/
>   if (mmu_try_to_unsync_pages(vcpu->kvm, slot, gfn, can_unsync, 
> prefetch)) {
>   pgprintk("%s: found shadow page for %llx, marking ro\n",
>__func__, gfn);
>   wrprot = true;
>   pte_access &= ~ACC_WRITE_MASK;
>   spte &= ~(PT_WRITABLE_MASK | shadow_mmu_writable_mask);
>   }
>   }
> 
> 
> 
> int mmu_try_to_unsync_pages(struct kvm *kvm, const struct kvm_memory_slot 
> *slot,
>   gfn_t gfn, bool can_unsync, bool prefetch)
> {
>   struct kvm_mmu_page *sp;
>   bool locked = false;
> 
>   /*
>* Force write-protection if the page is being tracked.  Note, the page
>* track machinery is used to write-protect upper-level shadow pages,
>* i.e. this guards the role.level == 4K assertion below!
>*/
>   if (kvm_slot_page_track_is_active(kvm, slot, gfn, KVM_PAGE_TRACK_WRITE))
>   return -EPERM;
> 
>   ...
> }

Oh, you are right! I thought mmu_try_to_unsync_pages() was only for the
shadow MMU, and overlooked that the TDP MMU also goes through it.

Thanks for the detailed explanation.

Thanks
Yan


[Intel-gfx] [PATCH] drm/i915/gt:fix raw-wakeref not held calltrace in vGPU

2020-08-10 Thread Yan Zhao
UNCORE_HAS_FORCEWAKE is not turned on when intel_vgpu_active() returns
true, so the guest mmio writes go to gen2_write32().

 [ cut here ]
 RPM raw-wakeref not held
 WARNING: CPU: 1 PID: 6375 at drivers/gpu/drm/i915/intel_runtime_pm.h:106
 gen2_write32+0x10b/0x140 [i915]
 Modules linked in: intel_rapl_msr intel_rapl_common kvm_intel crct10dif_pclmul
 crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper
 intel_rapl_perf joydev input_leds qemu_fw_cfg mac_hid serio_raw sch_fq_codel
 parport_pc ppdev lp parport ip_tables x_tables autofs4 vfio_mdev mdev kvm
 hid_generic usbhid hid i915 video i2c_algo_bit drm_kms_helper syscopyarea
 psmouse sysfillrect sysimgblt fb_sys_fops cec rc_core floppy drm e1000
 pata_acpi i2c_piix4
 CPU: 1 PID: 6375 Comm: Xorg.wrap Not tainted 5.7.0-050700-generic
 #202005312130
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
 BIOS rel-1.13.0-0-gf21b5a4 04/01/2014
 RIP: 0010:gen2_write32+0x10b/0x140 [i915]
 Code: 76 f4 f8 0f 0b e9 46 ff ff ff 80 3d 82 18 1a 00 00 0f 85 43 ff ff ff 48
 c7 c7 42 81 4b c0 c6 05 6e 18 1a 00 01 e8 ba 76 f4 f8 <0f> 0b e9 29 ff ff ff
 80 3d 5a 18 1a 00 00 0f 85 26 ff ff ff 48 c7
 RSP: 0018:b67400aef9e8 EFLAGS: 00010282
 RAX:  RBX: 00078838 RCX: 0007
 RDX: 0007 RSI: 0082 RDI: 9ae47bd19c80
 RBP: b67400aefa10 R08: 02b9 R09: 0004
 R10:  R11: 0001 R12: 9ae46e3e07b8
 R13:  R14: 9ae46e3e R15: 35cd5000
 FS:  7f9ec20c4740() GS:9ae47bd0() knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2: 7f9ec2125760 CR3: 00010e120003 CR4: 00360ee0
 DR0:  DR1:  DR2: 
 DR3:  DR6: fffe0ff0 DR7: 0400
 Call Trace:
  gen8_ppgtt_notify_vgt+0x95/0x190 [i915]
  gen8_ppgtt_create+0x3c7/0x440 [i915]
  i915_ppgtt_create+0x7d/0x90 [i915]
  i915_gem_create_context+0x2c0/0x3a8 [i915]
  i915_gem_context_open+0x59/0xc0 [i915]
  i915_gem_open+0x8b/0xc0 [i915]
  i915_driver_open+0xe/0x10 [i915]
  drm_file_alloc+0x199/0x260 [drm]
  drm_open+0xcf/0x260 [drm]
  drm_stub_open+0xaa/0xe0 [drm]
  chrdev_open+0xd3/0x1c0
  ? cdev_default_release+0x20/0x20
  do_dentry_open+0x143/0x3a0
  vfs_open+0x2d/0x30
  path_openat+0xb1c/0x10f0
  ? filemap_map_pages+0x22f/0x370
  do_filp_open+0x91/0x100
  ? __alloc_fd+0xb8/0x150
  do_sys_openat2+0x210/0x2d0
  do_sys_open+0x46/0x80
  __x64_sys_openat+0x20/0x30
  do_syscall_64+0x57/0x1d0
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7f9ec22cda5b
 Code: 25 00 00 41 00 3d 00 00 41 00 74 4b 64 8b 04 25 18 00 00 00
85 c0 75 67 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0
ff ff 0f 87 91 00 00 00 48 8b 4c 24 28 64 48 33 0c 25
 RSP: 002b:7ffd2f2a3960 EFLAGS: 0246 ORIG_RAX: 0101
 RAX: ffda RBX:  RCX: 7f9ec22cda5b
 RDX: 0002 RSI: 7ffd2f2a3ae0 RDI: ff9c
 RBP: 7ffd2f2a3ae0 R08:  R09: 7ffd2f2a3870
 R10:  R11: 0246 R12: 0002
 R13:  R14:  R15: 7ffd2f2a3a00
 ---[ end trace 110b629471bd16dd ]---

Signed-off-by: Yan Zhao 
---
 drivers/gpu/drm/i915/gt/gen8_ppgtt.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/gen8_ppgtt.c b/drivers/gpu/drm/i915/gt/gen8_ppgtt.c
index 699125928272..3fc3488c127f 100644
--- a/drivers/gpu/drm/i915/gt/gen8_ppgtt.c
+++ b/drivers/gpu/drm/i915/gt/gen8_ppgtt.c
@@ -55,6 +55,7 @@ static void gen8_ppgtt_notify_vgt(struct i915_ppgtt *ppgtt, bool create)
struct drm_i915_private *i915 = ppgtt->vm.i915;
struct intel_uncore *uncore = ppgtt->vm.gt->uncore;
enum vgt_g2v_type msg;
+   intel_wakeref_t wakeref;
int i;
 
if (create)
@@ -62,6 +63,7 @@ static void gen8_ppgtt_notify_vgt(struct i915_ppgtt *ppgtt, bool create)
else
atomic_dec(px_used(ppgtt->pd));
 
+   wakeref = intel_runtime_pm_get(uncore->rpm);
mutex_lock(&i915->vgpu.lock);
 
if (i915_vm_is_4lvl(&ppgtt->vm)) {
@@ -96,6 +98,7 @@ static void gen8_ppgtt_notify_vgt(struct i915_ppgtt *ppgtt, bool create)
intel_uncore_write(uncore, vgtif_reg(g2v_notify), msg);
 
mutex_unlock(&i915->vgpu.lock);
+   intel_runtime_pm_put(uncore->rpm, wakeref);
 }
 
 /* Index shifts into the pagetable are offset by GEN8_PTE_SHIFT [12] */
-- 
2.17.1

___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH] drm/i915/gt:fix raw-wakeref not held calltrace in vGPU

2020-08-11 Thread Yan Zhao
On Tue, Aug 11, 2020 at 09:26:11AM +0100, Chris Wilson wrote:
> Quoting Yan Zhao (2020-08-11 04:02:09)
> > UNCORE_HAS_FORCEWAKE is not turned on when intel_vgpu_active() returns
> > true, so the guest mmio writes go to gen2_write32().
> 
> The vgpu writes are just memory traps, the device wakeref should not be
> required. It seems like we've accidentally added the assert to the wrong
> io accessors.
Yes. If vGPU writes do not go to gen2_write, or if gen2_write does not
need such an assert, it would be better.

Thanks
Yan


Re: [Intel-gfx] [PATCH 00/27] drm/i915/gvt: KVM: KVMGT fixes and page-track cleanups

2022-12-23 Thread Yan Zhao
On Fri, Dec 23, 2022 at 12:57:12AM +, Sean Christopherson wrote:
> Fix a variety of found-by-inspection bugs in KVMGT, and overhaul KVM's
> page-track APIs to provide a leaner and cleaner interface.  The motivation
> for this series is to (significantly) reduce the number of KVM APIs that
> KVMGT uses, with a long-term goal of making all kvm_host.h headers
> KVM-internal.  That said, I think the cleanup itself is worthwhile,
> e.g. KVMGT really shouldn't be touching kvm->mmu_lock.
> 
> Note!  The KVMGT changes are compile tested only as I don't have the
> necessary hardware (AFAIK).  Testing, and lots of it, on the KVMGT side
> of things is needed and any help on that front would be much appreciated.
hi Sean,
Thanks for the patch!
Could you also provide the commit id that this series is based on?
I applied them on top of latest master branch (6.1.0+,
8395ae05cb5a2e31d36106e8c85efa11cda849be) in repo
https://github.com/torvalds/linux.git, yet met some conflicts and I
fixed them manually. (patch 11 and patch 25).

A rough test shows that the mutex_init below is missing.
But even with this fix, I still hit a guest hang while the guest boots up.
I will look into it and do a detailed review next week.

diff --git a/drivers/gpu/drm/i915/gvt/vgpu.c b/drivers/gpu/drm/i915/gvt/vgpu.c
index a7ac2ec00196..c274b6a0 100644
--- a/drivers/gpu/drm/i915/gvt/vgpu.c
+++ b/drivers/gpu/drm/i915/gvt/vgpu.c
@@ -331,6 +331,7 @@ int intel_gvt_create_vgpu(struct intel_vgpu *vgpu,
vgpu->id = ret;
vgpu->sched_ctl.weight = conf->weight;
mutex_init(&vgpu->vgpu_lock);
+   mutex_init(&vgpu->gfn_lock);
mutex_init(&vgpu->dmabuf_lock);
INIT_LIST_HEAD(&vgpu->dmabuf_obj_list_head);
INIT_RADIX_TREE(&vgpu->page_track_tree, GFP_KERNEL);


Thanks
Yan


Re: [Intel-gfx] [PATCH 09/27] drm/i915/gvt: Protect gfn hash table with dedicated mutex

2022-12-27 Thread Yan Zhao
On Fri, Dec 23, 2022 at 12:57:21AM +, Sean Christopherson wrote:
> Add and use a new mutex, gfn_lock, to protect accesses to the hash table
> used to track which gfns are write-protected when shadowing the guest's
> GTT.  This fixes a bug where kvmgt_page_track_write(), which doesn't hold
> kvm->mmu_lock, could race with intel_gvt_page_track_remove() and trigger
> a use-after-free.
> 
> Fixing kvmgt_page_track_write() by taking kvm->mmu_lock is not an option
> as mmu_lock is a r/w spinlock, and intel_vgpu_page_track_handler() might
> sleep when acquiring vgpu->cache_lock deep down the callstack:
> 
>   intel_vgpu_page_track_handler()
>   |
>   |->  page_track->handler / ppgtt_write_protection_handler()
>|
>|-> ppgtt_handle_guest_write_page_table_bytes()
>|
>|->  ppgtt_handle_guest_write_page_table()
> |
> |-> ppgtt_handle_guest_entry_removal()
> |
> |-> ppgtt_invalidate_pte()
> |
> |-> intel_gvt_dma_unmap_guest_page()
> |
> |-> mutex_lock(&vgpu->cache_lock);
> 
This gfn_lock could lead to a deadlock in the sequence below.

(1) kvm_write_track_add_gfn() to GFN 1
(2) kvmgt_page_track_write() for GFN 1
kvmgt_page_track_write()
|
|->mutex_lock(&info->vgpu_lock)
|->intel_vgpu_page_track_handler (as is kvmgt_gfn_is_write_protected)
   |
   |->page_track->handler() (ppgtt_write_protection_handler())
  | 
  |->ppgtt_handle_guest_write_page_table_bytes()
 |
 |->ppgtt_handle_guest_write_page_table()
|
|->ppgtt_handle_guest_entry_add() --> new_present
   |
   |->ppgtt_populate_spt_by_guest_entry()
  |
  |->intel_vgpu_enable_page_track() --> for GFN 2
 |
 |->intel_gvt_page_track_add()
|
|->mutex_lock(&info->gfn_lock) ===>deadlock
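The nested acquisition can be modeled in userspace (hypothetical function
names; a pthread mutex stands in for the kernel mutex, and a trylock stands
in for the second lock attempt so the sketch reports the deadlock instead of
hanging):

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t gfn_lock = PTHREAD_MUTEX_INITIALIZER;

/* models intel_gvt_page_track_add(): wants gfn_lock */
static int page_track_add(void)
{
	/* trylock instead of lock: a real nested mutex_lock() would hang here */
	if (pthread_mutex_trylock(&gfn_lock) != 0)
		return -1; /* gfn_lock already held on this call path: deadlock */
	pthread_mutex_unlock(&gfn_lock);
	return 0;
}

/* models kvmgt_page_track_write(): takes gfn_lock, then the ppgtt_* call
 * chain ends up back in page_track_add() for another GFN */
static int track_write(void)
{
	int ret;

	pthread_mutex_lock(&gfn_lock);
	ret = page_track_add();	/* second acquisition of the same mutex */
	pthread_mutex_unlock(&gfn_lock);
	return ret;
}
```

With a plain (non-recursive) mutex_lock() the inner acquisition would block
the thread on itself; the trylock variant just makes the failure observable.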


The fix below, based on this patch, reuses vgpu_lock to protect the hash table
info->ptable.
Please check whether it looks good.


diff --git a/drivers/gpu/drm/i915/gvt/kvmgt.c b/drivers/gpu/drm/i915/gvt/kvmgt.c
index b924ed079ad4..526bd973e784 100644
--- a/drivers/gpu/drm/i915/gvt/kvmgt.c
+++ b/drivers/gpu/drm/i915/gvt/kvmgt.c
@@ -364,7 +364,7 @@ __kvmgt_protect_table_find(struct intel_vgpu *info, gfn_t 
gfn)
 {
struct kvmgt_pgfn *p, *res = NULL;

-   lockdep_assert_held(&info->gfn_lock);
+   lockdep_assert_held(&info->vgpu_lock);

hash_for_each_possible(info->ptable, p, hnode, gfn) {
if (gfn == p->gfn) {
@@ -388,7 +388,7 @@ static void kvmgt_protect_table_add(struct intel_vgpu 
*info, gfn_t gfn)
 {
struct kvmgt_pgfn *p;

-   lockdep_assert_held(&info->gfn_lock);
+   lockdep_assert_held(&info->vgpu_lock);

if (kvmgt_gfn_is_write_protected(info, gfn))
return;
@@ -1572,7 +1572,7 @@ int intel_gvt_page_track_add(struct intel_vgpu *info, u64 
gfn)
if (!info->attached)
return -ESRCH;

-   mutex_lock(&info->gfn_lock);
+   lockdep_assert_held(&info->vgpu_lock);

if (kvmgt_gfn_is_write_protected(info, gfn))
goto out;
@@ -1581,7 +1581,6 @@ int intel_gvt_page_track_add(struct intel_vgpu *info, u64 
gfn)
if (!ret)
kvmgt_protect_table_add(info, gfn);
 out:
-   mutex_unlock(&info->gfn_lock);
return ret;
 }

@@ -1592,7 +1591,7 @@ int intel_gvt_page_track_remove(struct intel_vgpu *info, 
u64 gfn)
if (!info->attached)
return 0;

-   mutex_lock(&info->gfn_lock);
+   lockdep_assert_held(&info->vgpu_lock);

if (!kvmgt_gfn_is_write_protected(info, gfn))
goto out;
@@ -1601,7 +1600,6 @@ int intel_gvt_page_track_remove(struct intel_vgpu *info, 
u64 gfn)
if (!ret)
kvmgt_protect_table_del(info, gfn);
 out:
-   mutex_unlock(&info->gfn_lock);
return ret;
 }

@@ -1612,13 +1610,15 @@ static void kvmgt_page_track_write(gpa_t gpa, const u8 
*val, int len,
container_of(node, struct intel_vgpu, track_node);

mutex_lock(&info->vgpu_lock);
-   mutex_lock(&info->gfn_lock);

if (kvmgt_gfn_is_write_protected(info, gpa >> PAGE_SHIFT))
intel_vgpu_page_track_handler(info, gpa,
 (void *)val, len);
}

-   mutex_unlock(&info->gfn_lock);
mutex_unlock(&info->vgpu_lock);
 }
@@ -1629,12 +1629,11 @@ static void kvmgt_page_track_remove_region(gfn_t gfn, 
unsigned long nr_pages,
struct intel_vgpu *info =
container_of(node, struct intel_vgpu, track_node);
 
-   mutex_lock(&info->gfn_lock);
+   lockdep_assert_held(&info->vgpu_lock);
for (i = 0; i < nr_pages; i++) {
if (kvmgt_gfn_is_write_protected(info, gfn + i))
 

Re: [Intel-gfx] [PATCH 03/27] drm/i915/gvt: Incorporate KVM memslot info into check for 2MiB GTT entry

2022-12-27 Thread Yan Zhao
On Fri, Dec 23, 2022 at 12:57:15AM +, Sean Christopherson wrote:
> Honor KVM's max allowed page size when determining whether or not a 2MiB
> GTT shadow page can be created for the guest.  Querying KVM's max allowed
> size is somewhat odd as there's no strict requirement that KVM's memslots
> and VFIO's mappings are configured with the same gfn=>hva mapping, but
Without vIOMMU, VFIO's mappings are configured the same as KVM's
memslots, i.e. with the same GFN=>HVA mapping.


> the check will be accurate if userspace wants to have a functional guest,
> and at the very least checking KVM's memslots guarantees that the entire
> 2MiB range has been exposed to the guest.

I think just checking that the entire 2MiB GFN range is within a KVM memslot
is enough.
If for some reason KVM maps a 2MiB range in 4K sizes, KVMGT can still map
it in the IOMMU in 2MiB size as long as the PFNs are contiguous and the
whole range is exposed to the guest.
Actually, normal device passthrough with VFIO-PCI also maps GFNs in a
similar way, i.e. it maps a guest-visible range in as large a size as
possible as long as the PFNs are contiguous.
> 
> Note, KVM may also restrict the mapping size for reasons that aren't
> relevant to KVMGT, e.g. for KVM's iTLB multi-hit workaround or if the gfn
Will iTLB multi-hit affect DMA?
AFAIK, IOMMU mappings currently never set the exec bit (and I'm told this bit
is under discussion to be removed).


> is write-tracked (KVM's write-tracking only handles writes from vCPUs).
> However, such scenarios are unlikely to occur with a well-behaved guest,
> and at worst will result in sub-optimal performance.
> Fixes: b901b252b6cf ("drm/i915/gvt: Add 2M huge gtt support")
> Signed-off-by: Sean Christopherson 
> ---
>  arch/x86/include/asm/kvm_page_track.h |  2 ++
>  arch/x86/kvm/mmu/page_track.c | 18 ++
>  drivers/gpu/drm/i915/gvt/gtt.c| 10 +-
>  3 files changed, 29 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/kvm_page_track.h 
> b/arch/x86/include/asm/kvm_page_track.h
> index eb186bc57f6a..3f72c7a172fc 100644
> --- a/arch/x86/include/asm/kvm_page_track.h
> +++ b/arch/x86/include/asm/kvm_page_track.h
> @@ -51,6 +51,8 @@ void kvm_page_track_cleanup(struct kvm *kvm);
>  
>  bool kvm_page_track_write_tracking_enabled(struct kvm *kvm);
>  int kvm_page_track_write_tracking_alloc(struct kvm_memory_slot *slot);
> +enum pg_level kvm_page_track_max_mapping_level(struct kvm *kvm, gfn_t gfn,
> +enum pg_level max_level);
>  
>  void kvm_page_track_free_memslot(struct kvm_memory_slot *slot);
>  int kvm_page_track_create_memslot(struct kvm *kvm,
> diff --git a/arch/x86/kvm/mmu/page_track.c b/arch/x86/kvm/mmu/page_track.c
> index 2e09d1b6249f..69ea16c31859 100644
> --- a/arch/x86/kvm/mmu/page_track.c
> +++ b/arch/x86/kvm/mmu/page_track.c
> @@ -300,3 +300,21 @@ void kvm_page_track_flush_slot(struct kvm *kvm, struct 
> kvm_memory_slot *slot)
>   n->track_flush_slot(kvm, slot, n);
>   srcu_read_unlock(&head->track_srcu, idx);
>  }
> +
> +enum pg_level kvm_page_track_max_mapping_level(struct kvm *kvm, gfn_t gfn,
> +enum pg_level max_level)
> +{
> + struct kvm_memory_slot *slot;
> + int idx;
> +
> + idx = srcu_read_lock(&kvm->srcu);
> + slot = gfn_to_memslot(kvm, gfn);
> + if (!slot || slot->flags & KVM_MEMSLOT_INVALID)
> + max_level = PG_LEVEL_4K;
> + else
> + max_level = kvm_mmu_max_slot_mapping_level(slot, gfn, 
> max_level);
> + srcu_read_unlock(&kvm->srcu, idx);
> +
> + return max_level;
> +}
> +EXPORT_SYMBOL_GPL(kvm_page_track_max_mapping_level);
> diff --git a/drivers/gpu/drm/i915/gvt/gtt.c b/drivers/gpu/drm/i915/gvt/gtt.c
> index d0fca53a3563..6736d7bd94ea 100644
> --- a/drivers/gpu/drm/i915/gvt/gtt.c
> +++ b/drivers/gpu/drm/i915/gvt/gtt.c
> @@ -1178,14 +1178,22 @@ static int is_2MB_gtt_possible(struct intel_vgpu 
> *vgpu,
>   struct intel_gvt_gtt_entry *entry)
>  {
>   const struct intel_gvt_gtt_pte_ops *ops = vgpu->gvt->gtt.pte_ops;
> + unsigned long gfn = ops->get_pfn(entry);
>   kvm_pfn_t pfn;
> + int max_level;
>  
>   if (!HAS_PAGE_SIZES(vgpu->gvt->gt->i915, I915_GTT_PAGE_SIZE_2M))
>   return 0;
>  
>   if (!vgpu->attached)
>   return -EINVAL;
> - pfn = gfn_to_pfn(vgpu->vfio_device.kvm, ops->get_pfn(entry));
> +
> + max_level = kvm_page_track_max_mapping_level(vgpu->vfio_device.kvm,
> +  gfn, PG_LEVEL_2M);
> + if (max_level < PG_LEVEL_2M)
> + return 0;
> +
> + pfn = gfn_to_pfn(vgpu->vfio_device.kvm, gfn);
>   if (is_error_noslot_pfn(pfn))
>   return -EINVAL;
>  
> -- 
> 2.39.0.314.g84b9a713c41-goog
> 


Re: [Intel-gfx] [PATCH 19/27] KVM: x86/mmu: Use page-track notifiers iff there are external users

2022-12-27 Thread Yan Zhao
On Fri, Dec 23, 2022 at 12:57:31AM +, Sean Christopherson wrote:
> Disable the page-track notifier code at compile time if there are no
> external users, i.e. if CONFIG_KVM_EXTERNAL_WRITE_TRACKING=n.  KVM itself
> now hooks emulated writes directly instead of relying on the page-track
> mechanism.
> 
> Signed-off-by: Sean Christopherson 
> ---
>  arch/x86/include/asm/kvm_host.h   |  2 ++
>  arch/x86/include/asm/kvm_page_track.h |  2 ++
>  arch/x86/kvm/mmu/page_track.c |  9 
>  arch/x86/kvm/mmu/page_track.h | 30 +++
>  4 files changed, 35 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index eec424fac0ba..e8f8e1bd96c7 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1223,7 +1223,9 @@ struct kvm_arch {
>* create an NX huge page (without hanging the guest).
>*/
>   struct list_head possible_nx_huge_pages;
> +#ifdef CONFIG_KVM_EXTERNAL_WRITE_TRACKING
>   struct kvm_page_track_notifier_head track_notifier_head;
> +#endif
>   /*
>* Protects marking pages unsync during page faults, as TDP MMU page
>* faults only take mmu_lock for read.  For simplicity, the unsync
> diff --git a/arch/x86/include/asm/kvm_page_track.h 
> b/arch/x86/include/asm/kvm_page_track.h
> index deece45936a5..53c2adb25a07 100644
> --- a/arch/x86/include/asm/kvm_page_track.h
> +++ b/arch/x86/include/asm/kvm_page_track.h
> @@ -55,6 +55,7 @@ void kvm_slot_page_track_remove_page(struct kvm *kvm,
>struct kvm_memory_slot *slot, gfn_t gfn,
>enum kvm_page_track_mode mode);
>  
> +#ifdef CONFIG_KVM_EXTERNAL_WRITE_TRACKING
>  enum pg_level kvm_page_track_max_mapping_level(struct kvm *kvm, gfn_t gfn,
>  enum pg_level max_level);
>  
> @@ -64,5 +65,6 @@ kvm_page_track_register_notifier(struct kvm *kvm,
>  void
>  kvm_page_track_unregister_notifier(struct kvm *kvm,
>  struct kvm_page_track_notifier_node *n);
> +#endif /* CONFIG_KVM_EXTERNAL_WRITE_TRACKING */
>  
>  #endif
> diff --git a/arch/x86/kvm/mmu/page_track.c b/arch/x86/kvm/mmu/page_track.c
> index 2b302fd2c5dd..f932909aa9b5 100644
> --- a/arch/x86/kvm/mmu/page_track.c
> +++ b/arch/x86/kvm/mmu/page_track.c
> @@ -193,6 +193,7 @@ bool kvm_slot_page_track_is_active(struct kvm *kvm,
>   return !!READ_ONCE(slot->arch.gfn_track[mode][index]);
>  }
>  
> +#ifdef CONFIG_KVM_EXTERNAL_WRITE_TRACKING
>  void kvm_page_track_cleanup(struct kvm *kvm)
>  {
>   struct kvm_page_track_notifier_head *head;
> @@ -208,6 +209,7 @@ int kvm_page_track_init(struct kvm *kvm)
>   head = &kvm->arch.track_notifier_head;
>   INIT_HLIST_HEAD(&head->track_notifier_list);
>   return init_srcu_struct(&head->track_srcu);
> + return 0;
Double "return"s.


>  }
>  
>  /*
> @@ -254,8 +256,8 @@ EXPORT_SYMBOL_GPL(kvm_page_track_unregister_notifier);
>   * The node should figure out if the written page is the one that node is
>   * interested in by itself.
>   */
> -void kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
> -   int bytes)
> +void __kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
> + int bytes)
>  {
>   struct kvm_page_track_notifier_head *head;
>   struct kvm_page_track_notifier_node *n;
> @@ -272,8 +274,6 @@ void kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t 
> gpa, const u8 *new,
>   if (n->track_write)
>   n->track_write(gpa, new, bytes, n);
>   srcu_read_unlock(&head->track_srcu, idx);
> -
> - kvm_mmu_track_write(vcpu, gpa, new, bytes);
>  }
>  
>  /*
> @@ -316,3 +316,4 @@ enum pg_level kvm_page_track_max_mapping_level(struct kvm 
> *kvm, gfn_t gfn,
>   return max_level;
>  }
>  EXPORT_SYMBOL_GPL(kvm_page_track_max_mapping_level);
> +#endif
> diff --git a/arch/x86/kvm/mmu/page_track.h b/arch/x86/kvm/mmu/page_track.h
> index 89712f123ad3..1b363784aa4a 100644
> --- a/arch/x86/kvm/mmu/page_track.h
> +++ b/arch/x86/kvm/mmu/page_track.h
> @@ -6,8 +6,6 @@
>  
>  #include 
>  
> -int kvm_page_track_init(struct kvm *kvm);
> -void kvm_page_track_cleanup(struct kvm *kvm);
>  
>  bool kvm_page_track_write_tracking_enabled(struct kvm *kvm);
>  int kvm_page_track_write_tracking_alloc(struct kvm_memory_slot *slot);
> @@ -21,13 +19,37 @@ bool kvm_slot_page_track_is_active(struct kvm *kvm,
>  const struct kvm_memory_slot *slot,
>  gfn_t gfn, enum kvm_page_track_mode mode);
>  
> -void kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
> -   int bytes);
> +#ifdef CONFIG_KVM_EXTERNAL_WRITE_TRACKING
> +int kvm_page_track_init(struct kvm *kvm);
> +void kvm_page_track_cleanup(struct kvm *kvm);
> +
> +void __kvm_page_track_wr

Re: [Intel-gfx] [PATCH 26/27] KVM: x86/mmu: Add page-track API to query if a gfn is valid

2022-12-28 Thread Yan Zhao
On Fri, Dec 23, 2022 at 12:57:38AM +, Sean Christopherson wrote:
> Add a page-track API to query if a gfn is "valid", i.e. is backed by a
> memslot and is visible to the guest.  This is one more step toward
> removing KVM internal details from the page-track APIs.
> 
> Add a FIXME to call out that intel_gvt_is_valid_gfn() is broken with
> respect to 2MiB (or larger) guest entries, e.g. if the starting gfn is
> valid but a 2MiB page starting at the gfn covers "invalid" memory due
> to running beyond the memslot.
> 
> No functional change intended.
> 
> Signed-off-by: Sean Christopherson 
> ---
>  arch/x86/include/asm/kvm_page_track.h |  1 +
>  arch/x86/kvm/mmu/page_track.c | 13 +
>  drivers/gpu/drm/i915/gvt/gtt.c| 11 ++-
>  3 files changed, 16 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_page_track.h 
> b/arch/x86/include/asm/kvm_page_track.h
> index 66a0d7c34311..99e1d6eeb0fb 100644
> --- a/arch/x86/include/asm/kvm_page_track.h
> +++ b/arch/x86/include/asm/kvm_page_track.h
> @@ -52,6 +52,7 @@ int kvm_page_track_register_notifier(struct kvm *kvm,
>  void kvm_page_track_unregister_notifier(struct kvm *kvm,
>   struct kvm_page_track_notifier_node *n);
>  
> +bool kvm_page_track_is_valid_gfn(struct kvm *kvm, gfn_t gfn);
>  int kvm_write_track_add_gfn(struct kvm *kvm, gfn_t gfn);
>  int kvm_write_track_remove_gfn(struct kvm *kvm, gfn_t gfn);
>  #endif /* CONFIG_KVM_EXTERNAL_WRITE_TRACKING */
> diff --git a/arch/x86/kvm/mmu/page_track.c b/arch/x86/kvm/mmu/page_track.c
> index 1af431a41f71..9da071a514b3 100644
> --- a/arch/x86/kvm/mmu/page_track.c
> +++ b/arch/x86/kvm/mmu/page_track.c
> @@ -264,6 +264,19 @@ enum pg_level kvm_page_track_max_mapping_level(struct 
> kvm *kvm, gfn_t gfn,
>  }
>  EXPORT_SYMBOL_GPL(kvm_page_track_max_mapping_level);
>  
> +bool kvm_page_track_is_valid_gfn(struct kvm *kvm, gfn_t gfn)
> +{
> + bool ret;
> + int idx;
> +
> + idx = srcu_read_lock(&kvm->srcu);
> + ret = kvm_is_visible_gfn(kvm, gfn);
> + srcu_read_unlock(&kvm->srcu, idx);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(kvm_page_track_is_valid_gfn);
This implementation only checks whether a GFN is within a visible
KVM memslot. So why is this helper function named kvm_page_track_xxx()?
I don't think it's related to page tracking at all, and not all of its callers
in KVMGT use it for page tracking.

Thanks
Yan

> +
>  /*
>   * add guest page to the tracking pool so that corresponding access on that
>   * page will be intercepted.
> diff --git a/drivers/gpu/drm/i915/gvt/gtt.c b/drivers/gpu/drm/i915/gvt/gtt.c
> index 59ba6639e622..43c4fc23205d 100644
> --- a/drivers/gpu/drm/i915/gvt/gtt.c
> +++ b/drivers/gpu/drm/i915/gvt/gtt.c
> @@ -51,18 +51,11 @@ static int preallocated_oos_pages = 8192;
>  
>  static bool intel_gvt_is_valid_gfn(struct intel_vgpu *vgpu, unsigned long 
> gfn)
>  {
> - struct kvm *kvm = vgpu->vfio_device.kvm;
> - int idx;
> - bool ret;
> -
>   if (!vgpu->attached)
>   return false;
>  
> - idx = srcu_read_lock(&kvm->srcu);
> - ret = kvm_is_visible_gfn(kvm, gfn);
> - srcu_read_unlock(&kvm->srcu, idx);
> -
> - return ret;
> + /* FIXME: This doesn't properly handle guest entries larger than 4K. */
> + return kvm_page_track_is_valid_gfn(vgpu->vfio_device.kvm, gfn);
>  }
>  
>  /*
> -- 
> 2.39.0.314.g84b9a713c41-goog
> 


Re: [Intel-gfx] [PATCH 09/27] drm/i915/gvt: Protect gfn hash table with dedicated mutex

2023-01-04 Thread Yan Zhao
On Tue, Jan 03, 2023 at 08:43:17PM +, Sean Christopherson wrote:
> On Wed, Dec 28, 2022, Yan Zhao wrote:
> > On Fri, Dec 23, 2022 at 12:57:21AM +, Sean Christopherson wrote:
> > > Add and use a new mutex, gfn_lock, to protect accesses to the hash table
> > > used to track which gfns are write-protected when shadowing the guest's
> > > GTT.  This fixes a bug where kvmgt_page_track_write(), which doesn't hold
> > > kvm->mmu_lock, could race with intel_gvt_page_track_remove() and trigger
> > > a use-after-free.
> > > 
> > > Fixing kvmgt_page_track_write() by taking kvm->mmu_lock is not an option
> > > as mmu_lock is a r/w spinlock, and intel_vgpu_page_track_handler() might
> > > sleep when acquiring vgpu->cache_lock deep down the callstack:
> > > 
> > >   intel_vgpu_page_track_handler()
> > >   |
> > >   |->  page_track->handler / ppgtt_write_protection_handler()
> > >|
> > >|-> ppgtt_handle_guest_write_page_table_bytes()
> > >|
> > >|->  ppgtt_handle_guest_write_page_table()
> > > |
> > > |-> ppgtt_handle_guest_entry_removal()
> > > |
> > > |-> ppgtt_invalidate_pte()
> > > |
> > > |-> intel_gvt_dma_unmap_guest_page()
> > > |
> > > |-> mutex_lock(&vgpu->cache_lock);
> > > 
> > This gfn_lock could lead to deadlock in below sequence.
> > 
> > (1) kvm_write_track_add_gfn() to GFN 1
> > (2) kvmgt_page_track_write() for GFN 1
> > kvmgt_page_track_write()
> > |
> > |->mutex_lock(&info->vgpu_lock)
> > |->intel_vgpu_page_track_handler (as is kvmgt_gfn_is_write_protected)
> >|
> >|->page_track->handler() (ppgtt_write_protection_handler())
> >   | 
> >   |->ppgtt_handle_guest_write_page_table_bytes()
> >  |
> >  |->ppgtt_handle_guest_write_page_table()
> > |
> > |->ppgtt_handle_guest_entry_add() --> new_present
> >|
> >|->ppgtt_populate_spt_by_guest_entry()
> >   |
> >   |->intel_vgpu_enable_page_track() --> for GFN 2
> >  |
> >  |->intel_gvt_page_track_add()
> > |
> > |->mutex_lock(&info->gfn_lock) ===>deadlock
> 
> Or even more simply, 
> 
>   kvmgt_page_track_write()
>   |
>   -> intel_vgpu_page_track_handler()
>  |
>  -> intel_gvt_page_track_remove()
>
yes.

> > 
> > Below fix based on this patch is to reuse vgpu_lock to protect the hash 
> > table
> > info->ptable.
> > Please check if it's good.
> > 
> > 
> > diff --git a/drivers/gpu/drm/i915/gvt/kvmgt.c 
> > b/drivers/gpu/drm/i915/gvt/kvmgt.c
> > index b924ed079ad4..526bd973e784 100644
> > --- a/drivers/gpu/drm/i915/gvt/kvmgt.c
> > +++ b/drivers/gpu/drm/i915/gvt/kvmgt.c
> > @@ -364,7 +364,7 @@ __kvmgt_protect_table_find(struct intel_vgpu *info, 
> > gfn_t gfn)
> >  {
> > struct kvmgt_pgfn *p, *res = NULL;
> > 
> > -   lockdep_assert_held(&info->gfn_lock);
> > +   lockdep_assert_held(&info->vgpu_lock);
> > 
> > hash_for_each_possible(info->ptable, p, hnode, gfn) {
> > if (gfn == p->gfn) {
> > @@ -388,7 +388,7 @@ static void kvmgt_protect_table_add(struct intel_vgpu 
> > *info, gfn_t gfn)
> >  {
> > struct kvmgt_pgfn *p;
> > 
> > -   lockdep_assert_held(&info->gfn_lock);
> > +   lockdep_assert_held(&info->vgpu_lock);
> 
> I'll just delete these assertions, the one in __kvmgt_protect_table_find() 
> should
> cover everything and is ultimately the assert that matters.
> 
> > @@ -1629,12 +1629,11 @@ static void kvmgt_page_track_remove_region(gfn_t 
> > gfn, unsigned long nr_pages,
> > struct intel_vgpu *info =
> > container_of(node, struct intel_vgpu, track_node);
> >  
> > -   mutex_lock(&info->gfn_lock);
> > +   lockdep_assert_held(&info->vgpu_lock);
> 
> This path needs to manually take vgpu_lock as it's called from KVM.  IIRC, 
> this
> is the main reason I tried adding a new lock.  That and I had a hell of a time
> figuring out whether or not vgpu_lock would

Re: [Intel-gfx] [PATCH 03/27] drm/i915/gvt: Incorporate KVM memslot info into check for 2MiB GTT entry

2023-01-04 Thread Yan Zhao
On Tue, Jan 03, 2023 at 09:13:54PM +, Sean Christopherson wrote:
> On Wed, Dec 28, 2022, Yan Zhao wrote:
> > On Fri, Dec 23, 2022 at 12:57:15AM +, Sean Christopherson wrote:
> > > Honor KVM's max allowed page size when determining whether or not a 2MiB
> > > GTT shadow page can be created for the guest.  Querying KVM's max allowed
> > > size is somewhat odd as there's no strict requirement that KVM's memslots
> > > and VFIO's mappings are configured with the same gfn=>hva mapping, but
> > Without vIOMMU, VFIO's mapping is configured with the same as KVM's
> > memslots, i.e. with the same gfn==>HVA mapping
> 
> But that's controlled by userspace, correct?

Yes, controlled by QEMU.
VFIO in the kernel has no idea whether vIOMMU is enabled or not.
KVMGT is currently known not to work with vIOMMU in shadow mode
(in this mode, VFIO maps gIOVA ==> HVA ==> HPA).

> 
> > > the check will be accurate if userspace wants to have a functional guest,
> > > and at the very least checking KVM's memslots guarantees that the entire
> > > 2MiB range has been exposed to the guest.
> > 
> > I think just check the entrie 2MiB GFN range are all within KVM memslot is
> > enough.
> 
> Strictly speaking, no.  E.g. if a 2MiB region is covered with multiple 
> memslots
> and the memslots have different properties.
> 
> > If for some reason, KVM maps a 2MiB range in 4K sizes, KVMGT can still map
> > it in IOMMU size in 2MiB size as long as the PFNs are continous and the
> > whole range is all exposed to guest.
> 
> I agree that practically speaking this will hold true, but if KVMGT wants to 
> honor
> KVM's memslots then checking that KVM allows a hugepage is correct.  Hrm, but 
> on
> the flip side, KVMGT ignores read-only memslot flags, so KVMGT is already 
> ignoring
> pieces of KVM's memslots.
KVMGT calls dma_map_page() with DMA_BIDIRECTIONAL after gvt_pin_guest_page()
succeeds.
Though for a read-only memslot DMA_TO_DEVICE should be used instead
(see dma_info_to_prot()),
gvt_pin_guest_page() checks (IOMMU_READ | IOMMU_WRITE) permission for each
page,
so it actually ensures that the pinned GFN is not in a read-only memslot.
Therefore, it should be fine.
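The relationship between the pin permissions and the DMA direction can be
sketched like this (hypothetical enum and flag names standing in for the
kernel's DMA directions and IOMMU prot bits; the real translation lives in
dma_info_to_prot() and friends):

```c
#include <assert.h>

enum dma_dir { DIR_BIDIRECTIONAL, DIR_TO_DEVICE, DIR_FROM_DEVICE, DIR_NONE };

#define PROT_READ_F  (1 << 0)	/* stands in for IOMMU_READ  */
#define PROT_WRITE_F (1 << 1)	/* stands in for IOMMU_WRITE */

/* a read-only mapping only allows the device to read guest memory,
 * i.e. DMA toward the device; read+write maps to bidirectional DMA */
static enum dma_dir prot_to_dir(int prot)
{
	if ((prot & PROT_READ_F) && (prot & PROT_WRITE_F))
		return DIR_BIDIRECTIONAL;
	if (prot & PROT_READ_F)
		return DIR_TO_DEVICE;	/* device may only read */
	if (prot & PROT_WRITE_F)
		return DIR_FROM_DEVICE;	/* device may only write */
	return DIR_NONE;
}
```

Because the pin path demands both READ and WRITE, only the bidirectional
case can be reached, which is why DMA_BIDIRECTIONAL ends up being safe here.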

> 
> I have no objection to KVMGT defining its ABI such that KVMGT is allowed to 
> create
> 2MiB so long as (a) the GFN is contiguous according to VFIO, and (b) that the 
> entire
> 2MiB range is exposed to the guest.
> 
Sorry, I may not have put it clearly enough.
For a normal device pass-through via VFIO-PCI, VFIO sets up IOMMU mappings in
this way:

(a) fault in PFNs in a GFN range within the same memslot (VFIO saves dma_list,
which is the same as the memslot list when vIOMMU is not on or not in shadow
mode).
(b) map contiguous PFNs into the IOMMU driver (honouring the RO attribute; the
mapping can be larger than 2MiB as long as the PFNs are contiguous).
(c) the IOMMU driver decides whether to map in 2MiB or in 4KiB according to
its setting.

For KVMGT, gvt_dma_map_page() first calls gvt_pin_guest_page(), which
(a) calls vfio_pin_pages() to check that each GFN is within the allowed
dma_list with (IOMMU_READ | IOMMU_WRITE) permission and to fault in the page,
and (b) checks that the PFNs are contiguous across the 2MiB range.

Though checking kvm_page_track_max_mapping_level() is also fine, it makes the
DMA mapping size smaller than necessary.
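The contiguity check in step (b) can be sketched in userspace (hypothetical
names; 512 4KiB pages make up one 2MiB range):

```c
#include <assert.h>
#include <stddef.h>

#define PAGES_PER_2MB 512

/* returns 1 if the npages PFNs starting at pfn[0] are physically contiguous */
static int pfns_contiguous(const unsigned long *pfn, size_t npages)
{
	size_t i;

	for (i = 1; i < npages; i++)
		if (pfn[i] != pfn[0] + i)
			return 0;
	return 1;
}

/* pick the largest mapping size: 2MiB if all 512 pinned PFNs line up,
 * otherwise fall back to 4KiB mappings */
static size_t mapping_size(const unsigned long *pfn, size_t npages)
{
	if (npages >= PAGES_PER_2MB && pfns_contiguous(pfn, PAGES_PER_2MB))
		return 2ul << 20;
	return 4ul << 10;
}
```

This mirrors the idea that the IOMMU mapping size is driven by PFN
contiguity, independent of how KVM happens to map the same range.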

> That said, being fully permissive also seems wasteful, e.g. KVM would need to
> explicitly support straddling multiple memslots.
> 
> As a middle ground, what about tweaking kvm_page_track_is_valid_gfn() to take 
> a
> range, and then checking that the range is contained in a single memslot?
> 
> E.g. something like:
> 
> bool kvm_page_track_is_contiguous_gfn_range(struct kvm *kvm, gfn_t gfn,
>   unsigned long nr_pages)
> {
>   struct kvm_memory_slot *memslot;
>   bool ret;
>   int idx;
> 
>   idx = srcu_read_lock(&kvm->srcu);
>   memslot = gfn_to_memslot(kvm, gfn);
>   ret = kvm_is_visible_memslot(memslot) &&
> gfn + nr_pages <= memslot->base_gfn + memslot->npages;
>   srcu_read_unlock(&kvm->srcu, idx);
> 
>   return ret;
> }

Yes, that looks good.
But as explained above, gvt_dma_map_page() already performs an equivalent
check. Maybe checking kvm_page_track_is_contiguous_gfn_range() is not
required either?
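The containment test in the proposed helper boils down to simple interval
arithmetic; a userspace sketch (hypothetical struct; the lower-bound check is
added here for completeness, whereas in the proposed helper gfn_to_memslot()
already guarantees gfn lies in the slot):

```c
#include <assert.h>
#include <stdint.h>

struct memslot {
	uint64_t base_gfn;
	uint64_t npages;
};

/* 1 if [gfn, gfn + nr_pages) lies entirely inside the slot */
static int range_in_slot(const struct memslot *s, uint64_t gfn,
			 uint64_t nr_pages)
{
	return gfn >= s->base_gfn &&
	       gfn + nr_pages <= s->base_gfn + s->npages;
}
```

A range that starts in the slot but runs past its end fails the check, which
is exactly the straddling case the proposed helper is meant to reject.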
> 
> > Actually normal device passthrough with VFIO-PCI also maps GFNs in a
> > similar way, i.e. maps a guest visible range in as large size as
> > possible as long as the PFN is continous. 
> > > 
> > > Note, KVM may also restrict the mapping size for reasons that aren't
> > > relevant to KVMGT, e.g. for KVM's iTLB multi-hit workaround or if the gfn
> > Will iTLB multi-hit affect DMA?
> 
> I highly doubt it, I can't imagine an IOMMU would have a dedicated instruction
> TLB :-)
I can double-check with IOMMU hardware experts.
But if DMA could tamper with the instruction TLB, it would have been reported
as an issue with normal VFIO pass-through, wouldn't it?

> > AFAIK, IOMMU mappings currently never sets exec bit (and I'm told this bit 
> > is
> > under discussion to be removed).


Re: [Intel-gfx] [PATCH 26/27] KVM: x86/mmu: Add page-track API to query if a gfn is valid

2023-01-04 Thread Yan Zhao
On Tue, Jan 03, 2023 at 09:19:01PM +, Sean Christopherson wrote:
> On Wed, Dec 28, 2022, Yan Zhao wrote:
> > On Fri, Dec 23, 2022 at 12:57:38AM +, Sean Christopherson wrote:
> > > +bool kvm_page_track_is_valid_gfn(struct kvm *kvm, gfn_t gfn)
> > > +{
> > > + bool ret;
> > > + int idx;
> > > +
> > > + idx = srcu_read_lock(&kvm->srcu);
> > > + ret = kvm_is_visible_gfn(kvm, gfn);
> > > + srcu_read_unlock(&kvm->srcu, idx);
> > > +
> > > + return ret;
> > > +}
> > > +EXPORT_SYMBOL_GPL(kvm_page_track_is_valid_gfn);
> > This implementation is only to check whether a GFN is within a visible
> > kvm memslot. So, why this helper function is named kvm_page_track_xxx()?
> > Don't think it's anything related to page track, and not all of its callers
> > in KVMGT are for page tracking.
> 
> KVMGT is the only user of kvm_page_track_is_valid_gfn().  kvm_is_visible_gfn()
> has other users, just not in x86.  And long term, my goal is to allow building
> KVM x86 without any exports.  Killing off KVM's "internal" (for vendor 
> modules)
> exports for select Kconfigs is easy enough, add adding a dedicated page-track 
> API
> solves the KVMGT angle.
Understood!
But personally, I don't like merging this API into the page-track API, as
it obviously has nothing to do with page-track stuff, and KVMGT also calls it
for non-page-track purposes.
non-page-track purpuse.



Re: [Intel-gfx] [PATCH 00/27] drm/i915/gvt: KVM: KVMGT fixes and page-track cleanups

2023-01-04 Thread Yan Zhao
On Wed, Jan 04, 2023 at 01:01:13AM +, Sean Christopherson wrote:
> On Fri, Dec 23, 2022, Yan Zhao wrote:
> > On Fri, Dec 23, 2022 at 12:57:12AM +, Sean Christopherson wrote:
> > > Fix a variety of found-by-inspection bugs in KVMGT, and overhaul KVM's
> > > page-track APIs to provide a leaner and cleaner interface.  The motivation
> > > for this series is to (significantly) reduce the number of KVM APIs that
> > > KVMGT uses, with a long-term goal of making all kvm_host.h headers
> > > KVM-internal.  That said, I think the cleanup itself is worthwhile,
> > > e.g. KVMGT really shouldn't be touching kvm->mmu_lock.
> > > 
> > > Note!  The KVMGT changes are compile tested only as I don't have the
> > > necessary hardware (AFAIK).  Testing, and lots of it, on the KVMGT side
> > > of things is needed and any help on that front would be much appreciated.
> > hi Sean,
> > Thanks for the patch!
> > Could you also provide the commit id that this series is based on?
> 
> The commit ID is provided in the cover letter:
> 
>   base-commit: 9d75a3251adfbcf444681474511b58042a364863
> 
> Though you might have a hard time finding that commit as it's from an old
> version of kvm/queue that's probably since been force pushed.
> 
> > I applied them on top of latest master branch (6.1.0+,
> > 8395ae05cb5a2e31d36106e8c85efa11cda849be) in repo
> > https://github.com/torvalds/linux.git, yet met some conflicts and I
> > fixed them manually. (patch 11 and patch 25).
> > 
> > A rough test shows that below mutex_init is missing.
> > But even with this fix, I still met guest hang during guest boots up.
> > Will look into it and have a detailed review next week.
> 
> Thanks again for the reviews and testing!  I'll get a v2 out in the next week 
> or
> so (catching up from holidays) and will be more explicit in documenting the 
> base
> version.
That's fine, and it's my pleasure :)


Re: [Intel-gfx] [PATCH 03/27] drm/i915/gvt: Incorporate KVM memslot info into check for 2MiB GTT entry

2023-01-05 Thread Yan Zhao
On Thu, Jan 05, 2023 at 05:40:32PM +, Sean Christopherson wrote:
> On Thu, Jan 05, 2023, Yan Zhao wrote:
> > On Tue, Jan 03, 2023 at 09:13:54PM +, Sean Christopherson wrote:
> > > On Wed, Dec 28, 2022, Yan Zhao wrote:
> > > > On Fri, Dec 23, 2022 at 12:57:15AM +, Sean Christopherson wrote:
> > > > > Honor KVM's max allowed page size when determining whether or not a 
> > > > > 2MiB
> > > > > GTT shadow page can be created for the guest.  Querying KVM's max 
> > > > > allowed
> > > > > size is somewhat odd as there's no strict requirement that KVM's 
> > > > > memslots
> > > > > and VFIO's mappings are configured with the same gfn=>hva mapping, but
> > > > Without vIOMMU, VFIO's mapping is configured with the same as KVM's
> > > > memslots, i.e. with the same gfn==>HVA mapping
> > > 
> > > But that's controlled by userspace, correct?
> > 
> > Yes, controlled by QEMU.
> 
> ...
> 
> > > Strictly speaking, no.  E.g. if a 2MiB region is covered with multiple 
> > > memslots
> > > and the memslots have different properties.
> > > 
> > > > If for some reason, KVM maps a 2MiB range in 4K sizes, KVMGT can still 
> > > > map
> > > > it in IOMMU size in 2MiB size as long as the PFNs are continous and the
> > > > whole range is all exposed to guest.
> > > 
> > > I agree that practically speaking this will hold true, but if KVMGT wants 
> > > to honor
> > > KVM's memslots then checking that KVM allows a hugepage is correct.  Hrm, 
> > > but on
> > > the flip side, KVMGT ignores read-only memslot flags, so KVMGT is already 
> > > ignoring
> > > pieces of KVM's memslots.
> > KVMGT calls dma_map_page() with DMA_BIDIRECTIONAL after checking 
> > gvt_pin_guest_page().
> > Though for a read-only memslot, DMA_TO_DEVICE should be used instead
> > (see dma_info_to_prot()),
> > as gvt_pin_guest_page() checks (IOMMU_READ | IOMMU_WRITE) permission for 
> > each page,
> > it actually ensures that the pinned GFN is not in a read-only memslot.
> > So, it should be fine.
> > 
> > > 
> > > I have no objection to KVMGT defining its ABI such that KVMGT is allowed 
> > > to create
> > > 2MiB so long as (a) the GFN is contiguous according to VFIO, and (b) that 
> > > the entire
> > > 2MiB range is exposed to the guest.
> > > 
> > Sorry, I may not have put it clearly enough.
> > for a normal device pass-through via VFIO-PCI, VFIO maps IOMMU mappings in 
> > this way:
> > 
> > (a) fault in PFNs in a GFN range within the same memslot (VFIO saves 
> > dma_list, which is
> > the same as memslot list when vIOMMU is not on or not in shadow mode).
> > (b) map contiguous PFNs into the iommu driver (honour the ro attribute; the
> > mapping can be > 2MiB as long as the PFNs are contiguous).
> > (c) IOMMU driver decides to map in 2MiB or in 4KiB according to its setting.
> > 
> > For KVMGT, gvt_dma_map_page() first calls gvt_pin_guest_page() which
> > (a) calls vfio_pin_pages() to check each GFN is within allowed dma_list with
> > (IOMMU_READ | IOMMU_WRITE) permission and fault-in page. 
> > (b) checks PFNs are continuous in 2MiB,
> > 
> > Though checking kvm_page_track_max_mapping_level() is also fine, it makes 
> > DMA
> > mapping size unnecessarily smaller.
> 
> Yeah, I got all that.  What I'm trying to say, and why I asked about whether 
> or
> not userspace controls the mappings, is that AFAIK there is nothing in the 
> kernel
> that coordinates mappings between VFIO and KVM.  So, very technically, 
> userspace
> could map a 2MiB range contiguous in VFIO but not in KVM, or RW in VFIO but 
> RO in KVM.
> 
> I can't imagine there's a real use case for doing so, and arguably there's no
> requirement that KVMGT honor KVM's memslot.  But because KVMGT taps into KVM's
> page-tracking, KVMGT _does_ honor KVM's memslots to some extent because KVMGT
> needs to know whether or not a given GFN can be write-protected.
> 
> I'm totally fine if KVMGT's ABI is that VFIO is the source of truth for 
> mappings
> and permissions, and that the only requirement for KVM memslots is that GTT 
> page
> tables need to be visible in KVM's memslots.  But if that's the ABI, then
> intel_gvt_is_valid_gfn() should be probing VFIO, not KVM (commit cc753fbe1ac4
> ("drm/i915/gvt: validate gfn before set shadow page entry").

Re: [Intel-gfx] [PATCH 03/27] drm/i915/gvt: Incorporate KVM memslot info into check for 2MiB GTT entry

2023-01-09 Thread Yan Zhao
On Fri, Jan 06, 2023 at 11:01:53PM +, Sean Christopherson wrote:
> On Fri, Jan 06, 2023, Yan Zhao wrote:
> > On Thu, Jan 05, 2023 at 05:40:32PM +, Sean Christopherson wrote:
> > > On Thu, Jan 05, 2023, Yan Zhao wrote:
> > > I'm totally fine if KVMGT's ABI is that VFIO is the source of truth for 
> > > mappings
> > > and permissions, and that the only requirement for KVM memslots is that 
> > > GTT page
> > > tables need to be visible in KVM's memslots.  But if that's the ABI, then
> > > intel_gvt_is_valid_gfn() should be probing VFIO, not KVM (commit 
> > > cc753fbe1ac4
> > > ("drm/i915/gvt: validate gfn before set shadow page entry").
> > > 
> > > In other words, pick either VFIO or KVM.  Checking that X is valid 
> > > according to
> > > KVM and then mapping X through VFIO is confusing and makes assumptions 
> > > about how
> > > userspace configures KVM and VFIO.  It works because QEMU always 
> > > configures KVM
> > > and VFIO as expected, but IMO it's unnecessarily fragile and again 
> > > confusing for
> > > unaware readers because the code is technically flawed.
> > >
> > Agreed. 
> > Then after some further thought, I think maybe we can just remove
> > intel_gvt_is_valid_gfn() in KVMGT, because
> > 
> > (1) both intel_gvt_is_valid_gfn() in emulate_ggtt_mmio_write() and
> > ppgtt_populate_spt() are not for page track purpose, but to validate bogus
> > GFN.
> > (2) gvt_pin_guest_page() with gfn and size can do the validity checking,
> > which is called in intel_gvt_dma_map_guest_page(). So, we can move the
> > mapping of scratch page to the error path after 
> > intel_gvt_dma_map_guest_page().
> 
> IIUC, that will re-introduce the problem commit cc753fbe1ac4 ("drm/i915/gvt: 
> validate
> gfn before set shadow page entry") solved by poking into KVM.  Lack of 
> pre-validation
> means that bogus GFNs will trigger error messages, e.g.
> 
>   gvt_vgpu_err("vfio_pin_pages failed for iova %pad, ret 
> %d\n",
>&cur_iova, ret);
> 
> and
> 
>   gvt_vgpu_err("fail to populate guest ggtt entry\n");

Thanks for pointing it out.
I checked this commit message and found below original intentions to introduce
pre-validation:
   "GVT may receive partial write on one guest PTE update. Validate gfn
not to translate incomplete gfn. This avoids some unnecessary error
messages incurred by the incomplete gfn translating. Also fix the
bug that the whole PPGTT shadow page update is aborted on any invalid
gfn entry"

(1) first intention -- unnecessary error messages came from GGTT partial writes.
For guest GGTT writes, the guest calls writeq to an MMIO GPA, which is
8 bytes in length, while QEMU splits the MMIO write into 2 4-byte writes.
The two split writes can cause an invalid GFN to be found.

But this partial write issue has been fixed by the two follow-up commits:
bc0686ff5fad drm/i915/gvt: support inconsecutive partial gtt entry write
510fe10b6180 drm/i915/gvt: fix a bug of partially write ggtt enties

so pre-validation to reduce noise is not necessary any more here.

(2) the second intention -- "the whole PPGTT shadow page update is aborted on 
any
invalid gfn entry"
As PPGTT resides in normal guest RAM and we only treat 8-byte writes
as valid page table writes, any invalid GPA found is regarded as
an error, either due to guest misbehavior/attack or a bug in the host
shadow code.
So a direct abort looks good too, like below:

@@ -1340,13 +1338,6 @@ static int ppgtt_populate_spt(struct intel_vgpu_ppgtt_spt *spt)
ppgtt_generate_shadow_entry(&se, s, &ge);
ppgtt_set_shadow_entry(spt, &se, i);
} else {
-   gfn = ops->get_pfn(&ge);
-   if (!intel_gvt_is_valid_gfn(vgpu, gfn)) {
-   ops->set_pfn(&se, gvt->gtt.scratch_mfn);
-   ppgtt_set_shadow_entry(spt, &se, i);
-   continue;
-   }
-
ret = ppgtt_populate_shadow_entry(vgpu, spt, i, &ge);
if (ret)
goto fail;

(I actually found that the original code prints an "invalid entry type"
warning, which indicates it has been broken for a while due to lack of
testing of this invalid gfn path)


> One thought would be to turn those printks into tracepoints to eliminate 
> unwanted
> noise, and to prevent the guest from spamming the host kernel log by
> programming garbage into the GTT (gvt_vgpu_err() isn't ratelimited).

Re: [Intel-gfx] [PATCH 03/27] drm/i915/gvt: Incorporate KVM memslot info into check for 2MiB GTT entry

2023-01-12 Thread Yan Zhao
> > > > Note, KVM may also restrict the mapping size for reasons that aren't
> > > > relevant to KVMGT, e.g. for KVM's iTLB multi-hit workaround or if the 
> > > > gfn
> > > Will iTLB multi-hit affect DMA?
> > 
> > I highly doubt it, I can't imagine an IOMMU would have a dedicated 
> > instruction
> > TLB :-)
> I can double check it with IOMMU hardware experts.
> But if DMA could tamper with the instruction TLB, it should have been reported
> as an issue with normal VFIO pass-through?

hi Sean,
This is the feedback:

- CPU Instruction TLB is only filled when CPU fetches an instruction.
- IOMMU uses IOTLB to cache IOVA translation.
  A remapping hardware may implement multiple IOTLBs, and some of these may
  be for special purposes, e.g., only for instruction fetches.
  There is no way for software to be aware that multiple
  translations for smaller pages have been used for a large page. If software
  modifies the paging structures so that the page size used for a 4-KByte range
  of input-addresses changes, the IOTLBs may subsequently contain multiple
  translations for the address range (one for each page size).
  A reference to an input-address in the address range may use any of these
  translations. Which translation is used may vary from one execution to
  another, and the choice may be implementation-specific.
- There's no similar bug reported on the IOMMU side related to DMA requests
  for instruction fetches hitting multiple IOTLB entries.
  The X bit in IOMMU paging structure is to be removed in future and is
  currently always unset.

Thanks
Yan


Re: [Intel-gfx] [PATCH 03/27] drm/i915/gvt: Incorporate KVM memslot info into check for 2MiB GTT entry

2023-01-18 Thread Yan Zhao
On Thu, Jan 19, 2023 at 10:58:42AM +0800, Zhenyu Wang wrote:
> On 2023.01.11 17:55:04 +, Sean Christopherson wrote:
> > On Mon, Jan 09, 2023, Yan Zhao wrote:
> > > On Fri, Jan 06, 2023 at 11:01:53PM +, Sean Christopherson wrote:
> > > > On Fri, Jan 06, 2023, Yan Zhao wrote:
> > > > > On Thu, Jan 05, 2023 at 05:40:32PM +, Sean Christopherson wrote:
> > > > > > On Thu, Jan 05, 2023, Yan Zhao wrote:
> > > > > > I'm totally fine if KVMGT's ABI is that VFIO is the source of truth 
> > > > > > for mappings
> > > > > > and permissions, and that the only requirement for KVM memslots is 
> > > > > > that GTT page
> > > > > > tables need to be visible in KVM's memslots.  But if that's the 
> > > > > > ABI, then
> > > > > > intel_gvt_is_valid_gfn() should be probing VFIO, not KVM (commit 
> > > > > > cc753fbe1ac4
> > > > > > ("drm/i915/gvt: validate gfn before set shadow page entry").
> > > > > > 
> > > > > > In other words, pick either VFIO or KVM.  Checking that X is valid 
> > > > > > according to
> > > > > > KVM and then mapping X through VFIO is confusing and makes 
> > > > > > assumptions about how
> > > > > > userspace configures KVM and VFIO.  It works because QEMU always 
> > > > > > configures KVM
> > > > > > and VFIO as expected, but IMO it's unnecessarily fragile and again 
> > > > > > confusing for
> > > > > > unaware readers because the code is technically flawed.
> > > > > >
> > > > > Agreed. 
> > > > > Then after some further thought, I think maybe we can just remove
> > > > > intel_gvt_is_valid_gfn() in KVMGT, because
> > > > > 
> > > > > (1) both intel_gvt_is_valid_gfn() in emulate_ggtt_mmio_write() and
> > > > > ppgtt_populate_spt() are not for page track purpose, but to validate 
> > > > > bogus
> > > > > GFN.
> > > > > (2) gvt_pin_guest_page() with gfn and size can do the validity 
> > > > > checking,
> > > > > which is called in intel_gvt_dma_map_guest_page(). So, we can move the
> > > > > mapping of scratch page to the error path after 
> > > > > intel_gvt_dma_map_guest_page().
> > > > 
> > > > IIUC, that will re-introduce the problem commit cc753fbe1ac4 
> > > > ("drm/i915/gvt: validate
> > > > gfn before set shadow page entry") solved by poking into KVM.  Lack of 
> > > > pre-validation
> > > > means that bogus GFNs will trigger error messages, e.g.
> > > > 
> > > > gvt_vgpu_err("vfio_pin_pages failed for iova 
> > > > %pad, ret %d\n",
> > > >  &cur_iova, ret);
> > > > 
> > > > and
> > > > 
> > > > gvt_vgpu_err("fail to populate guest ggtt 
> > > > entry\n");
> > > 
> > > Thanks for pointing it out.
> > > I checked this commit message and found below original intentions to 
> > > introduce
> > > pre-validation:
> > 
> > ...
> > 
> > > (I actually found that the original code will print "invalid entry type"
> > > warning which indicates it's broken for a while due to lack of test in
> > > this invalid gfn path)
> > > 
> > > 
> > > > One thought would be to turn those printks into tracepoints to 
> > > > eliminate unwanted
> > > > noise, and to prevent the guest from spamming the host kernel log by 
> > > > programming
> > > > garbage into the GTT (gvt_vgpu_err() isn't ratelimited).
> > > As those printks would not happen in normal conditions and printks may 
> > > have
> > > some advantages to discover the attack or bug, could we just convert
> > > gvt_vgpu_err() to be ratelimited ?
> > 
> > That's ultimately a decision that needs to be made by the GVT maintainers, 
> > as the
> > answer depends on the use case.  E.g. if most users of KVMGT run a single 
> > VM and
> > the guest user is also the host admin, then pr_err_ratelimited() is likely 
> > an
> > acceptable/preferable choice as there's a decent chance a human will see 
> > the errors
> > in the host kernel log.