Re: [PATCH v2 09/11] KVM: MMU: simplify mmu_need_write_protect

2015-12-24 Thread Kai Huang



On 12/24/2015 05:11 PM, Xiao Guangrong wrote:



On 12/24/2015 04:36 PM, Kai Huang wrote:



On 12/23/2015 07:25 PM, Xiao Guangrong wrote:

Now that all non-leaf shadow pages are page tracked, if a gfn is not tracked
then no non-leaf shadow page of that gfn exists, so we can directly
make the shadow pages of the gfn unsync.

Signed-off-by: Xiao Guangrong 
---
  arch/x86/kvm/mmu.c | 26 --
  1 file changed, 8 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 5a2ca73..f89e77f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2461,41 +2461,31 @@ static void __kvm_unsync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 	kvm_mmu_mark_parents_unsync(sp);
 }
 
-static void kvm_unsync_pages(struct kvm_vcpu *vcpu,  gfn_t gfn)
+static bool kvm_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn,
+			     bool can_unsync)
 {
 	struct kvm_mmu_page *s;
 
 	for_each_gfn_indirect_valid_sp(vcpu->kvm, s, gfn) {
+		if (!can_unsync)
+			return true;
+
 		if (s->unsync)
 			continue;
 		WARN_ON(s->role.level != PT_PAGE_TABLE_LEVEL);
 		__kvm_unsync_page(vcpu, s);
 	}
+
+	return false;
 }
I hate to say it, but it looks odd to me that kvm_unsync_pages takes a
bool parameter called can_unsync and returns a bool (which looks like it
indicates whether the unsync succeeded or not). How about calling
__kvm_unsync_page directly in mmu_need_write_protect and leaving
kvm_unsync_pages unchanged (or even removing it, as it seems to be used
nowhere else)? But again, it's up to you and Paolo.




Makes sense, the updated version is attached, could you review it?

Sure and it looks good to me.

Thanks,
-Kai
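For reference, a minimal sketch of the agreed direction (folding the walk into
mmu_need_write_protect and calling __kvm_unsync_page directly, so no
can_unsync parameter needs to be passed down) could look roughly like the code
below. This is illustrative only, not the actual attached patch:

static bool mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
				   bool can_unsync)
{
	struct kvm_mmu_page *s;

	if (kvm_page_track_check_mode(vcpu, gfn, KVM_PAGE_TRACK_WRITE))
		return true;

	for_each_gfn_indirect_valid_sp(vcpu->kvm, s, gfn) {
		/* a shadow page exists but the caller forbids unsync */
		if (!can_unsync)
			return true;

		if (s->unsync)
			continue;

		WARN_ON(s->role.level != PT_PAGE_TABLE_LEVEL);
		__kvm_unsync_page(vcpu, s);
	}

	return false;
}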


Re: [PATCH v2 09/11] KVM: MMU: simplify mmu_need_write_protect

2015-12-24 Thread Kai Huang



On 12/23/2015 07:25 PM, Xiao Guangrong wrote:

Now that all non-leaf shadow pages are page tracked, if a gfn is not tracked
then no non-leaf shadow page of that gfn exists, so we can directly
make the shadow pages of the gfn unsync.

Signed-off-by: Xiao Guangrong 
---
  arch/x86/kvm/mmu.c | 26 --
  1 file changed, 8 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 5a2ca73..f89e77f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2461,41 +2461,31 @@ static void __kvm_unsync_page(struct kvm_vcpu *vcpu, 
struct kvm_mmu_page *sp)
kvm_mmu_mark_parents_unsync(sp);
  }
  
-static void kvm_unsync_pages(struct kvm_vcpu *vcpu,  gfn_t gfn)

+static bool kvm_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn,
+bool can_unsync)
  {
struct kvm_mmu_page *s;
  
  	for_each_gfn_indirect_valid_sp(vcpu->kvm, s, gfn) {

+   if (!can_unsync)
+   return true;
+
if (s->unsync)
continue;
WARN_ON(s->role.level != PT_PAGE_TABLE_LEVEL);
__kvm_unsync_page(vcpu, s);
}
+
+   return false;
  }
I hate to say it, but it looks odd to me that kvm_unsync_pages takes a bool
parameter called can_unsync and returns a bool (which looks like it
indicates whether the unsync succeeded or not). How about calling
__kvm_unsync_page directly in mmu_need_write_protect and leaving
kvm_unsync_pages unchanged (or even removing it, as it seems to be used
nowhere else)? But again, it's up to you and Paolo.


Thanks,
-Kai
  
  static bool mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,

   bool can_unsync)
  {
-   struct kvm_mmu_page *s;
-   bool need_unsync = false;
-
if (kvm_page_track_check_mode(vcpu, gfn, KVM_PAGE_TRACK_WRITE))
return true;
  
-	for_each_gfn_indirect_valid_sp(vcpu->kvm, s, gfn) {

-   if (!can_unsync)
-   return true;
-
-   if (s->role.level != PT_PAGE_TABLE_LEVEL)
-   return true;
-
-   if (!s->unsync)
-   need_unsync = true;
-   }
-   if (need_unsync)
-   kvm_unsync_pages(vcpu, gfn);
-
-   return false;
+   return kvm_unsync_pages(vcpu, gfn, can_unsync);
  }
  
  static bool kvm_is_mmio_pfn(pfn_t pfn)




Re: [PATCH 09/11] KVM: MMU: simplify mmu_need_write_protect

2015-12-16 Thread Kai Huang



On 12/16/2015 04:48 PM, Xiao Guangrong wrote:



On 12/16/2015 04:05 PM, Kai Huang wrote:



On 12/15/2015 05:25 PM, Xiao Guangrong wrote:



On 12/15/2015 04:43 PM, Kai Huang wrote:



On 12/01/2015 02:26 AM, Xiao Guangrong wrote:

Now that all non-leaf shadow pages are page tracked, if a gfn is not tracked
then no non-leaf shadow page of that gfn exists, so we can directly
make the shadow pages of the gfn unsync.

Signed-off-by: Xiao Guangrong 
---
  arch/x86/kvm/mmu.c | 26 --
  1 file changed, 8 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 5a2ca73..f89e77f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2461,41 +2461,31 @@ static void __kvm_unsync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
  kvm_mmu_mark_parents_unsync(sp);
  }
-static void kvm_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn)
+static bool kvm_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn,
+ bool can_unsync)
  {
  struct kvm_mmu_page *s;
  for_each_gfn_indirect_valid_sp(vcpu->kvm, s, gfn) {
+if (!can_unsync)
+return true;
How about moving this right before for_each_gfn_indirect_valid_sp? As
can_unsync is passed as a parameter, there's no point in checking it several
times.



We cannot do this. What we are doing here is checking whether we have a
shadow page mapping for 'gfn':
a) if not, it can be writable.
I think in this case you should also check whether the GFN is being
write-protection tracked. For example, if the spte never existed when you
added the GFN to write-protection tracking, then in that case I think
mmu_need_write_protect should also report true. Right?


We have already checked it:

static bool mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
   bool can_unsync)
{
if (kvm_page_track_check_mode(vcpu, gfn, KVM_PAGE_TRACK_WRITE))
return true;

return kvm_unsync_pages(vcpu, gfn, can_unsync);
}

Oh sorry I missed this :)






b) if yes, check 'can_unsync' to see if these shadow pages can be made
'unsync'.

Your suggestion would break point a).
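To spell out point a) with a small hypothetical rearrangement (not code from
the series): hoisting the check above the loop,

	if (!can_unsync)
		return true;	/* write protect unconditionally */

	for_each_gfn_indirect_valid_sp(vcpu->kvm, s, gfn) {
		...
	}

would report that a gfn needs write protection whenever can_unsync is false,
even when no indirect valid shadow page maps that gfn and the loop would find
nothing, i.e. even when the page could simply stay writable.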

A further thought: can we move it to mmu_need_write_protect? Passing
can_unsync as a parameter to kvm_unsync_pages sounds a little bit odd.


+
  if (s->unsync)
  continue;
  WARN_ON(s->role.level != PT_PAGE_TABLE_LEVEL);
What about large page mappings? For example, if the guest uses a 2M mapping
and its shadow is indirect, does the above WARN_ON still hold? You removed
the PT level check from mmu_need_write_protect.


The larger mappings are on the non-leaf shadow pages, which are caught by
kvm_page_track_check_mode() before we call this function.
Actually I do not quite understand how large page mapping is implemented. I
see in kvm_mmu_get_page that, when the sp is allocated, large page mapping is
disabled, but I think we do support large shadow mappings, right? I mean,
theoretically, if the guest uses a 2M mapping then the shadow mapping can
certainly use a 2M mapping as well, and the 2M shadow mapping can also be
'unsynced' (as a leaf mapping table). But in your series I see that if we
write protect some GFN, the shadow large page mapping is always disabled.

Am I wrong?


If the large page contains a page which is used as a page table, kvm does not
map a large page for it; the reason is that we track the 4K page instead of
the whole large page to reduce write emulation.
I don't know why breaking the large page into 4K mappings reduces write
emulation, but this explanation works for me. I guess KVM-GT doesn't care
about it either. :)


Thanks,
-Kai







Re: [PATCH 08/11] KVM: MMU: use page track for non-leaf shadow pages

2015-12-16 Thread Kai Huang



On 12/16/2015 04:39 PM, Xiao Guangrong wrote:



On 12/16/2015 03:51 PM, Kai Huang wrote:



On 12/15/2015 05:10 PM, Xiao Guangrong wrote:



On 12/15/2015 03:52 PM, Kai Huang wrote:


  static bool __mmu_gfn_lpage_is_disallowed(gfn_t gfn, int level,
@@ -2140,12 +2150,18 @@ static struct kvm_mmu_page 
*kvm_mmu_get_page(struct kvm_vcpu *vcpu,

  hlist_add_head(&sp->hash_link,
&vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)]);
  if (!direct) {
-if (rmap_write_protect(vcpu, gfn))
+/*
+ * we should do write protection before syncing pages
+ * otherwise the content of the synced shadow page may
+ * be inconsistent with guest page table.
+ */
+account_shadowed(vcpu->kvm, sp);
+
+if (level == PT_PAGE_TABLE_LEVEL &&
+  rmap_write_protect(vcpu, gfn))
  kvm_flush_remote_tlbs(vcpu->kvm);
I think your modification is good, but I am a little bit confused here. In
account_shadowed, if sp->role.level > PT_PAGE_TABLE_LEVEL, the sp->gfn is
write protected, and this is reasonable. So why write protect the gfn at
PT_PAGE_TABLE_LEVEL here?


Because the shadow page will become 'sync', which means the shadow page will
be synced with the page table in the guest. So the shadow page needs to be
write-protected to prevent the guest page table from being changed while we
do the 'sync'.

The shadow page needs to be write-protected so that the guest page table
cannot be changed while we are syncing the shadow page table. See
kvm_sync_pages() after doing rmap_write_protect().
I see. So why do you treat the PT_PAGE_TABLE_LEVEL gfn separately here? Why
can't this be done in account_shadowed, as you did for the upper level sp?


Non-leaf shadow pages are kept write-protected, so the page fault handler
cannot fix write access on them. Leaf shadow pages are not.
My point is that the original code didn't separate the two cases, so I am not
sure why you need to separate them. Perhaps you want account_shadowed to
imply that the non-leaf guest page table is write-protected while the leaf
page table is not.


Thanks,
-Kai
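To recap the scheme under discussion (an illustrative summary, not text from
the patch):

/*
 * Protection policy implied by the quoted patch:
 *
 * - non-leaf guest page tables (sp->role.level > PT_PAGE_TABLE_LEVEL):
 *   write protection is requested via the page-track counters in
 *   account_shadowed() and stays for the lifetime of the shadow page;
 *   write faults on them are always emulated.
 *
 * - leaf guest page tables (level == PT_PAGE_TABLE_LEVEL):
 *   rmap_write_protect() is used in kvm_mmu_get_page() only so the guest
 *   table cannot change while the shadow page is synced; it may later
 *   become unsync again.
 */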

Actually I am wondering whether account_shadowed is overdoing things. From
its name it should only *account* the shadow sp, but now it also does write
protection and disables large page mapping.



Hmm... disabling large page mapping is already in the current code... I think
account_shadowed() can be understood as taking the new page into account, so
the protection work is needed there.


But I am not good at naming functions and my English is not good enough
either; any better name is welcome. ;)






Re: [PATCH 09/11] KVM: MMU: simplify mmu_need_write_protect

2015-12-16 Thread Kai Huang



On 12/15/2015 05:25 PM, Xiao Guangrong wrote:



On 12/15/2015 04:43 PM, Kai Huang wrote:



On 12/01/2015 02:26 AM, Xiao Guangrong wrote:

Now that all non-leaf shadow pages are page tracked, if a gfn is not tracked
then no non-leaf shadow page of that gfn exists, so we can directly
make the shadow pages of the gfn unsync.

Signed-off-by: Xiao Guangrong 
---
  arch/x86/kvm/mmu.c | 26 --
  1 file changed, 8 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 5a2ca73..f89e77f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2461,41 +2461,31 @@ static void __kvm_unsync_page(struct 
kvm_vcpu *vcpu, struct kvm_mmu_page *sp)

  kvm_mmu_mark_parents_unsync(sp);
  }
-static void kvm_unsync_pages(struct kvm_vcpu *vcpu,  gfn_t gfn)
+static bool kvm_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn,
+ bool can_unsync)
  {
  struct kvm_mmu_page *s;
  for_each_gfn_indirect_valid_sp(vcpu->kvm, s, gfn) {
+if (!can_unsync)
+return true;
How about moving this right before for_each_gfn_indirect_valid_sp? As
can_unsync is passed as a parameter, there's no point in checking it several
times.



We cannot do this. What we are doing here is checking whether we have a
shadow page mapping for 'gfn':
a) if not, it can be writable.
I think in this case you should also check whether the GFN is being
write-protection tracked. For example, if the spte never existed when you
added the GFN to write-protection tracking, then in that case I think
mmu_need_write_protect should also report true. Right?


b) if yes, check 'can_unsync' to see if these shadow pages can be made
'unsync'.

Your suggestion would break point a).

A further thought: can we move it to mmu_need_write_protect? Passing
can_unsync as a parameter to kvm_unsync_pages sounds a little bit odd.


+
  if (s->unsync)
  continue;
  WARN_ON(s->role.level != PT_PAGE_TABLE_LEVEL);
What about large page mappings? For example, if the guest uses a 2M mapping
and its shadow is indirect, does the above WARN_ON still hold? You removed
the PT level check from mmu_need_write_protect.


The larger mappings are on the non-leaf shadow pages, which are caught by
kvm_page_track_check_mode() before we call this function.
Actually I do not quite understand how large page mapping is implemented. I
see in kvm_mmu_get_page that, when the sp is allocated, large page mapping is
disabled, but I think we do support large shadow mappings, right? I mean,
theoretically, if the guest uses a 2M mapping then the shadow mapping can
certainly use a 2M mapping as well, and the 2M shadow mapping can also be
'unsynced' (as a leaf mapping table). But in your series I see that if we
write protect some GFN, the shadow large page mapping is always disabled.

Am I wrong?

Thanks,
-Kai







Re: [PATCH 08/11] KVM: MMU: use page track for non-leaf shadow pages

2015-12-15 Thread Kai Huang



On 12/15/2015 05:10 PM, Xiao Guangrong wrote:



On 12/15/2015 03:52 PM, Kai Huang wrote:


  static bool __mmu_gfn_lpage_is_disallowed(gfn_t gfn, int level,
@@ -2140,12 +2150,18 @@ static struct kvm_mmu_page 
*kvm_mmu_get_page(struct kvm_vcpu *vcpu,

  hlist_add_head(&sp->hash_link,
&vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)]);
  if (!direct) {
-if (rmap_write_protect(vcpu, gfn))
+/*
+ * we should do write protection before syncing pages
+ * otherwise the content of the synced shadow page may
+ * be inconsistent with guest page table.
+ */
+account_shadowed(vcpu->kvm, sp);
+
+if (level == PT_PAGE_TABLE_LEVEL &&
+  rmap_write_protect(vcpu, gfn))
  kvm_flush_remote_tlbs(vcpu->kvm);
I think your modification is good, but I am a little bit confused here. In
account_shadowed, if sp->role.level > PT_PAGE_TABLE_LEVEL, the sp->gfn is
write protected, and this is reasonable. So why write protect the gfn at
PT_PAGE_TABLE_LEVEL here?


Because the shadow page will become 'sync', which means the shadow page will
be synced with the page table in the guest. So the shadow page needs to be
write-protected to prevent the guest page table from being changed while we
do the 'sync'.

The shadow page needs to be write-protected so that the guest page table
cannot be changed while we are syncing the shadow page table. See
kvm_sync_pages() after doing rmap_write_protect().
I see. So why do you treat the PT_PAGE_TABLE_LEVEL gfn separately here? Why
can't this be done in account_shadowed, as you did for the upper level sp?
Actually I am wondering whether account_shadowed is overdoing things. From
its name it should only *account* the shadow sp, but now it also does write
protection and disables large page mapping.


Thanks,
-Kai



  /*
   * remove the guest page from the tracking pool which stops the 
interception
   * of corresponding access on that page. It is the opposed 
operation of
@@ -134,20 +160,12 @@ void kvm_page_track_remove_page(struct kvm 
*kvm, gfn_t gfn,

  struct kvm_memory_slot *slot;
  int i;
-WARN_ON(!check_mode(mode));
-
  for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
  slots = __kvm_memslots(kvm, i);
  slot = __gfn_to_memslot(slots, gfn);
  spin_lock(&kvm->mmu_lock);
-update_gfn_track(slot, gfn, mode, -1);
-
-/*
- * allow large page mapping for the tracked page
- * after the tracker is gone.
- */
-kvm_mmu_gfn_allow_lpage(slot, gfn);
+kvm_slot_page_track_remove_page_nolock(kvm, slot, gfn, mode);

Looks like you need to merge this part with patch 1, as you are modifying
kvm_page_track_{add,remove}_page here, which are introduced in your patch 1.


Indeed, it is better.






Re: [PATCH 04/11] KVM: page track: add the framework of guest page tracking

2015-12-15 Thread Kai Huang



On 12/15/2015 04:46 PM, Xiao Guangrong wrote:



On 12/15/2015 03:06 PM, Kai Huang wrote:

Hi Guangrong,

I am starting to review this series and will have some comments or
questions; you can determine whether they are valuable :)


Thank you very much for your review and for breaking the silence on this
patchset. ;)




+static void page_track_slot_free(struct kvm_memory_slot *slot)
+{
+int i;
+
+for (i = 0; i < KVM_PAGE_TRACK_MAX; i++)
+if (slot->arch.gfn_track[i]) {
+kvfree(slot->arch.gfn_track[i]);
+slot->arch.gfn_track[i] = NULL;
+}
+}
+
+int kvm_page_track_create_memslot(struct kvm_memory_slot *slot,
+  unsigned long npages)
+{
+int  i, pages = gfn_to_index(slot->base_gfn + npages - 1,
+  slot->base_gfn, PT_PAGE_TABLE_LEVEL) + 1;
+
+for (i = 0; i < KVM_PAGE_TRACK_MAX; i++) {
+slot->arch.gfn_track[i] = kvm_kvzalloc(pages *
+sizeof(*slot->arch.gfn_track[i]));
+if (!slot->arch.gfn_track[i])
+goto track_free;
+}
+
+return 0;
+
+track_free:
+page_track_slot_free(slot);
+return -ENOMEM;
+}
Is it necessary to use the 'unsigned long npages' parameter? In my
understanding you are going to


The type, 'int', is used here because I followed the style of 'struct
kvm_lpage_info'.

4 bytes should be enough to track all users, and a signed type is good for
catching overflow.

track all GFNs in the memory slot anyway, right? If you want to keep npages,
I think it's better to add a base_gfn as well; otherwise you are assuming you
are going to track the npages GFNs starting from slot->base_gfn.


Yes, any page in the memslot may be tracked, so there is an index for
every page.
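For readers following the index math: with PT_PAGE_TABLE_LEVEL, gfn_to_index()
reduces to a plain offset from the slot base, so the array ends up with one
counter per 4K guest page (a rough illustration, not kernel code):

	/*
	 * gfn_to_index(gfn, slot->base_gfn, PT_PAGE_TABLE_LEVEL) is
	 * effectively (gfn - slot->base_gfn), so gfn_track[mode] holds
	 * one counter per 4K page of the slot; a value > 0 means the
	 * page is currently tracked in that mode.
	 */
	index = gfn - slot->base_gfn;
	slot->arch.gfn_track[mode][index] += count;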




+
+void kvm_page_track_free_memslot(struct kvm_memory_slot *free,
+ struct kvm_memory_slot *dont)
+{
+if (!dont || free->arch.gfn_track != dont->arch.gfn_track)
+page_track_slot_free(free);
+}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c04987e..ad4888a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7838,6 +7838,8 @@ void kvm_arch_free_memslot(struct kvm *kvm, 
struct kvm_memory_slot *free,

  free->arch.lpage_info[i - 1] = NULL;
  }
  }
+
+kvm_page_track_free_memslot(free, dont);
  }
  int kvm_arch_create_memslot(struct kvm *kvm, struct 
kvm_memory_slot *slot,
@@ -7886,6 +7888,9 @@ int kvm_arch_create_memslot(struct kvm *kvm, 
struct kvm_memory_slot *slot,

  }
  }
+if (kvm_page_track_create_memslot(slot, npages))
+goto out_free;
+
It looks like you are essentially allocating one int for every GFN of the
slot unconditionally. In my understanding, most memory slots are not going to
be tracked, so isn't that wasteful of memory?



Yes, hmm... maybe we can make the index an "unsigned short"; then 1G of
memory only needs a 512K index buffer. That is not so unacceptable.
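As a quick check on that estimate (assuming 4K pages and one 2-byte counter
per page per track mode):

	1 GiB / 4 KiB          = 262,144 pages
	262,144 * sizeof(u16)  = 512 KiB of index buffer per mode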

Those comments are really minor, so don't worry about this :)

Thanks,
-Kai



Re: [PATCH 06/11] KVM: MMU: let page fault handler be aware tracked page

2015-12-15 Thread Kai Huang



On 12/15/2015 05:03 PM, Xiao Guangrong wrote:



On 12/15/2015 04:11 PM, Kai Huang wrote:



On 12/01/2015 02:26 AM, Xiao Guangrong wrote:

A page fault caused by write access to a write-tracked page cannot be fixed;
it always needs to be emulated. page_fault_handle_page_track() is the fast
path we introduce here to skip holding the mmu-lock and walking the shadow
page table.
Why can it be outside of the mmu-lock? Is it OK if some other thread removes
tracking of this page simultaneously? Should we assume the emulation code
handles this case?




What you mentioned is the worst case; if that happens the vcpu will spend a
longer time emulating the access rather than retrying it. That is bad, but it
is a rare contention. It is worth speeding up the common case.
My concern is whether, when this case happens, emulating the access is still
the right behavior after the other thread has removed the GFN from tracking.
And since the notifier's track_write callback will be called in the emulation
code, won't there be a problem if the GFN has been removed from tracking by
another thread?


Thanks,
-Kai


Even if it works for write protection, is it OK for other tracking purposes
(as your tracking framework can be extended for other purposes)?


We only need to make sure that no track event is lost, i.e., we cannot miss
the case where the index changes from 0 to 1.

If we see that the index is 0, the vcpu takes the mmu-lock and goes to the
slow path anyway, so no track event will be lost.
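Put another way, the argument sketched above (an illustrative recap, not code
from the series) is:

	/*
	 * Fast path, checked without mmu_lock:
	 *   if (kvm_page_track_check_mode(vcpu, gfn, KVM_PAGE_TRACK_WRITE))
	 *           emulate the access;   // count > 0: never fixable here
	 *
	 * Otherwise the normal fault path runs and re-checks the tracking
	 * state under mmu_lock (via mmu_need_write_protect) before making
	 * a spte writable, so a count racing from 0 to 1 is not missed;
	 * a count racing from 1 to 0 only costs one extra emulation.
	 */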



However, if the page table is not present, it is worth making the page table
entry present and read-only to keep the read access happy.
It's fine for tracking writes from the guest. But what if I want to track
every read from the guest?

Probably I am exaggerating :)



Then we do not go to the real page fault handler, just keep the shadow
page entry non-present.





Re: [PATCH 09/11] KVM: MMU: simplify mmu_need_write_protect

2015-12-15 Thread Kai Huang



On 12/15/2015 04:43 PM, Kai Huang wrote:



On 12/01/2015 02:26 AM, Xiao Guangrong wrote:

Now that all non-leaf shadow pages are page tracked, if a gfn is not tracked
then no non-leaf shadow page of that gfn exists, so we can directly
make the shadow pages of the gfn unsync.

Signed-off-by: Xiao Guangrong 
---
  arch/x86/kvm/mmu.c | 26 --
  1 file changed, 8 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 5a2ca73..f89e77f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2461,41 +2461,31 @@ static void __kvm_unsync_page(struct kvm_vcpu 
*vcpu, struct kvm_mmu_page *sp)

  kvm_mmu_mark_parents_unsync(sp);
  }
  -static void kvm_unsync_pages(struct kvm_vcpu *vcpu,  gfn_t gfn)
+static bool kvm_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn,
+ bool can_unsync)
  {
  struct kvm_mmu_page *s;
for_each_gfn_indirect_valid_sp(vcpu->kvm, s, gfn) {
+if (!can_unsync)
+return true;
How about moving this right before for_each_gfn_indirect_valid_sp? As
can_unsync is passed as a parameter, there's no point in checking it several
times.


A further thought: can we move it to mmu_need_write_protect? Passing
can_unsync as a parameter to kvm_unsync_pages sounds a little bit odd.



+
  if (s->unsync)
  continue;
  WARN_ON(s->role.level != PT_PAGE_TABLE_LEVEL);
What about large page mappings? For example, if the guest uses a 2M mapping
and its shadow is indirect, does the above WARN_ON still hold? You removed
the PT level check from mmu_need_write_protect.


Thanks,
-Kai

Btw I also think this patch can be merged with patch 6.

Thanks,
-Kai

  __kvm_unsync_page(vcpu, s);
  }
+
+return false;
  }



static bool mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
 bool can_unsync)
  {
-struct kvm_mmu_page *s;
-bool need_unsync = false;
-
  if (kvm_page_track_check_mode(vcpu, gfn, KVM_PAGE_TRACK_WRITE))
  return true;
  -for_each_gfn_indirect_valid_sp(vcpu->kvm, s, gfn) {
-if (!can_unsync)
-return true;
-
-if (s->role.level != PT_PAGE_TABLE_LEVEL)
-return true;
-
-if (!s->unsync)
-need_unsync = true;
-}
-if (need_unsync)
-kvm_unsync_pages(vcpu, gfn);
-
-return false;
+return kvm_unsync_pages(vcpu, gfn, can_unsync);
  }
static bool kvm_is_mmio_pfn(pfn_t pfn)







Re: [PATCH 09/11] KVM: MMU: simplify mmu_need_write_protect

2015-12-15 Thread Kai Huang



On 12/01/2015 02:26 AM, Xiao Guangrong wrote:

Now that all non-leaf shadow pages are page tracked, if a gfn is not tracked
then no non-leaf shadow page of that gfn exists, so we can directly
make the shadow pages of the gfn unsync.

Signed-off-by: Xiao Guangrong 
---
  arch/x86/kvm/mmu.c | 26 --
  1 file changed, 8 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 5a2ca73..f89e77f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2461,41 +2461,31 @@ static void __kvm_unsync_page(struct kvm_vcpu *vcpu, 
struct kvm_mmu_page *sp)
kvm_mmu_mark_parents_unsync(sp);
  }
  
-static void kvm_unsync_pages(struct kvm_vcpu *vcpu,  gfn_t gfn)

+static bool kvm_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn,
+bool can_unsync)
  {
struct kvm_mmu_page *s;
  
  	for_each_gfn_indirect_valid_sp(vcpu->kvm, s, gfn) {

+   if (!can_unsync)
+   return true;
How about moving this right before for_each_gfn_indirect_valid_sp? As
can_unsync is passed as a parameter, there's no point in checking it several
times.


A further thought: can we move it to mmu_need_write_protect? Passing
can_unsync as a parameter to kvm_unsync_pages sounds a little bit odd.



+
if (s->unsync)
continue;
WARN_ON(s->role.level != PT_PAGE_TABLE_LEVEL);
What about large page mappings? For example, if the guest uses a 2M mapping
and its shadow is indirect, does the above WARN_ON still hold? You removed
the PT level check from mmu_need_write_protect.


Thanks,
-Kai

__kvm_unsync_page(vcpu, s);
}
+
+   return false;
  }


  
  static bool mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,

   bool can_unsync)
  {
-   struct kvm_mmu_page *s;
-   bool need_unsync = false;
-
if (kvm_page_track_check_mode(vcpu, gfn, KVM_PAGE_TRACK_WRITE))
return true;
  
-	for_each_gfn_indirect_valid_sp(vcpu->kvm, s, gfn) {

-   if (!can_unsync)
-   return true;
-
-   if (s->role.level != PT_PAGE_TABLE_LEVEL)
-   return true;
-
-   if (!s->unsync)
-   need_unsync = true;
-   }
-   if (need_unsync)
-   kvm_unsync_pages(vcpu, gfn);
-
-   return false;
+   return kvm_unsync_pages(vcpu, gfn, can_unsync);
  }
  
  static bool kvm_is_mmio_pfn(pfn_t pfn)




Re: [PATCH 06/11] KVM: MMU: let page fault handler be aware tracked page

2015-12-15 Thread Kai Huang



On 12/01/2015 02:26 AM, Xiao Guangrong wrote:

A page fault caused by write access to a write-tracked page cannot be fixed;
it always needs to be emulated. page_fault_handle_page_track() is the fast
path we introduce here to skip holding the mmu-lock and walking the shadow
page table.
Why can it be outside of the mmu-lock? Is it OK if some other thread removes
tracking of this page simultaneously? Should we assume the emulation code
handles this case?


Even if it works for write protection, is it OK for other tracking purposes
(as your tracking framework can be extended for other purposes)?


However, if the page table is not present, it is worth making the page table
entry present and read-only to keep the read access happy.
It's fine for tracking writes from the guest. But what if I want to track
every read from the guest? Probably I am exaggerating :)


Thanks,
-Kai


mmu_need_write_protect() needs to be adjusted to avoid the page becoming
writable when making the page table present or when syncing/prefetching
shadow page table entries.

Signed-off-by: Xiao Guangrong 
---
  arch/x86/include/asm/kvm_page_track.h |  2 ++
  arch/x86/kvm/mmu.c| 44 +--
  arch/x86/kvm/page_track.c | 14 +++
  arch/x86/kvm/paging_tmpl.h|  3 +++
  4 files changed, 56 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm_page_track.h 
b/arch/x86/include/asm/kvm_page_track.h
index 9cc17c6..f223201 100644
--- a/arch/x86/include/asm/kvm_page_track.h
+++ b/arch/x86/include/asm/kvm_page_track.h
@@ -15,4 +15,6 @@ void kvm_page_track_add_page(struct kvm *kvm, gfn_t gfn,
 enum kvm_page_track_mode mode);
  void kvm_page_track_remove_page(struct kvm *kvm, gfn_t gfn,
enum kvm_page_track_mode mode);
+bool kvm_page_track_check_mode(struct kvm_vcpu *vcpu, gfn_t gfn,
+  enum kvm_page_track_mode mode);
  #endif
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 39809b8..b23f9fc 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -41,6 +41,7 @@
  #include 
  #include 
  #include 
+#include 
  
  /*

   * When setting this variable to true it enables Two-Dimensional-Paging
@@ -2456,25 +2457,29 @@ static void kvm_unsync_pages(struct kvm_vcpu *vcpu,  
gfn_t gfn)
}
  }
  
-static int mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,

- bool can_unsync)
+static bool mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
+  bool can_unsync)
  {
struct kvm_mmu_page *s;
bool need_unsync = false;
  
+	if (kvm_page_track_check_mode(vcpu, gfn, KVM_PAGE_TRACK_WRITE))

+   return true;
+
for_each_gfn_indirect_valid_sp(vcpu->kvm, s, gfn) {
if (!can_unsync)
-   return 1;
+   return true;
  
  		if (s->role.level != PT_PAGE_TABLE_LEVEL)

-   return 1;
+   return true;
  
  		if (!s->unsync)

need_unsync = true;
}
if (need_unsync)
kvm_unsync_pages(vcpu, gfn);
-   return 0;
+
+   return false;
  }
  
  static bool kvm_is_mmio_pfn(pfn_t pfn)

@@ -3388,10 +3393,30 @@ int handle_mmio_page_fault(struct kvm_vcpu *vcpu, u64 
addr, bool direct)
  }
  EXPORT_SYMBOL_GPL(handle_mmio_page_fault);
  
+static bool page_fault_handle_page_track(struct kvm_vcpu *vcpu,

+u32 error_code, gfn_t gfn)
+{
+   if (unlikely(error_code & PFERR_RSVD_MASK))
+   return false;
+
+   if (!(error_code & PFERR_PRESENT_MASK) ||
+ !(error_code & PFERR_WRITE_MASK))
+   return false;
+
+   /*
+* guest is writing the page which is write tracked which can
+* not be fixed by page fault handler.
+*/
+   if (kvm_page_track_check_mode(vcpu, gfn, KVM_PAGE_TRACK_WRITE))
+   return true;
+
+   return false;
+}
+
  static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
u32 error_code, bool prefault)
  {
-   gfn_t gfn;
+   gfn_t gfn = gva >> PAGE_SHIFT;
int r;
  
  	pgprintk("%s: gva %lx error %x\n", __func__, gva, error_code);

@@ -3403,13 +3428,15 @@ static int nonpaging_page_fault(struct kvm_vcpu *vcpu, 
gva_t gva,
return r;
}
  
+	if (page_fault_handle_page_track(vcpu, error_code, gfn))

+   return 1;
+
r = mmu_topup_memory_caches(vcpu);
if (r)
return r;
  
  	MMU_WARN_ON(!VALID_PAGE(vcpu->arch.mmu.root_hpa));
  
-	gfn = gva >> PAGE_SHIFT;
  
  	return nonpaging_map(vcpu, gva & PAGE_MASK,

 error_code, gfn, prefault);
@@ -3493,6 +3520,9 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t 
gpa, u32 error_code,
return r;
}
  
+	if (page_fa

Re: [PATCH 08/11] KVM: MMU: use page track for non-leaf shadow pages

2015-12-15 Thread Kai Huang



On 12/15/2015 03:52 PM, Kai Huang wrote:



On 12/01/2015 02:26 AM, Xiao Guangrong wrote:

Non-leaf shadow pages are always write protected, so they can be users
of page track.

Signed-off-by: Xiao Guangrong 
---
  arch/x86/include/asm/kvm_page_track.h |  8 +
  arch/x86/kvm/mmu.c| 26 +---
  arch/x86/kvm/page_track.c | 58 
+++

  3 files changed, 67 insertions(+), 25 deletions(-)

diff --git a/arch/x86/include/asm/kvm_page_track.h 
b/arch/x86/include/asm/kvm_page_track.h

index 6744234..3447dac 100644
--- a/arch/x86/include/asm/kvm_page_track.h
+++ b/arch/x86/include/asm/kvm_page_track.h
@@ -41,8 +41,16 @@ int kvm_page_track_create_memslot(struct 
kvm_memory_slot *slot,

  void kvm_page_track_free_memslot(struct kvm_memory_slot *free,
   struct kvm_memory_slot *dont);
  +void
+kvm_slot_page_track_add_page_nolock(struct kvm *kvm,
+struct kvm_memory_slot *slot, gfn_t gfn,
+enum kvm_page_track_mode mode);
  void kvm_page_track_add_page(struct kvm *kvm, gfn_t gfn,
   enum kvm_page_track_mode mode);
+void kvm_slot_page_track_remove_page_nolock(struct kvm *kvm,
+struct kvm_memory_slot *slot,
+gfn_t gfn,
+enum kvm_page_track_mode mode);
  void kvm_page_track_remove_page(struct kvm *kvm, gfn_t gfn,
  enum kvm_page_track_mode mode);
  bool kvm_page_track_check_mode(struct kvm_vcpu *vcpu, gfn_t gfn,
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index b23f9fc..5a2ca73 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -806,11 +806,17 @@ static void account_shadowed(struct kvm *kvm, 
struct kvm_mmu_page *sp)

  struct kvm_memory_slot *slot;
  gfn_t gfn;
  +kvm->arch.indirect_shadow_pages++;
  gfn = sp->gfn;
  slots = kvm_memslots_for_spte_role(kvm, sp->role);
  slot = __gfn_to_memslot(slots, gfn);
+
+/* the non-leaf shadow pages are keeping readonly. */
+if (sp->role.level > PT_PAGE_TABLE_LEVEL)
+return kvm_slot_page_track_add_page_nolock(kvm, slot, gfn,
+KVM_PAGE_TRACK_WRITE);
+
  kvm_mmu_gfn_disallow_lpage(slot, gfn);
-kvm->arch.indirect_shadow_pages++;
  }
static void unaccount_shadowed(struct kvm *kvm, struct 
kvm_mmu_page *sp)
@@ -819,11 +825,15 @@ static void unaccount_shadowed(struct kvm *kvm, 
struct kvm_mmu_page *sp)

  struct kvm_memory_slot *slot;
  gfn_t gfn;
  +kvm->arch.indirect_shadow_pages--;
  gfn = sp->gfn;
  slots = kvm_memslots_for_spte_role(kvm, sp->role);
  slot = __gfn_to_memslot(slots, gfn);
+if (sp->role.level > PT_PAGE_TABLE_LEVEL)
+return kvm_slot_page_track_remove_page_nolock(kvm, slot, gfn,
+KVM_PAGE_TRACK_WRITE);
+
  kvm_mmu_gfn_allow_lpage(slot, gfn);
-kvm->arch.indirect_shadow_pages--;
  }
static bool __mmu_gfn_lpage_is_disallowed(gfn_t gfn, int level,
@@ -2140,12 +2150,18 @@ static struct kvm_mmu_page 
*kvm_mmu_get_page(struct kvm_vcpu *vcpu,

  hlist_add_head(&sp->hash_link,
&vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)]);
  if (!direct) {
-if (rmap_write_protect(vcpu, gfn))
+/*
+ * we should do write protection before syncing pages
+ * otherwise the content of the synced shadow page may
+ * be inconsistent with guest page table.
+ */
+account_shadowed(vcpu->kvm, sp);
+
+if (level == PT_PAGE_TABLE_LEVEL &&
+  rmap_write_protect(vcpu, gfn))
  kvm_flush_remote_tlbs(vcpu->kvm);
I think your modification is good, but I am a little bit confused here. In
account_shadowed, if sp->role.level > PT_PAGE_TABLE_LEVEL, the sp->gfn is
write protected, and this is reasonable. So why write protect the gfn at
PT_PAGE_TABLE_LEVEL here?



  if (level > PT_PAGE_TABLE_LEVEL && need_sync)
  kvm_sync_pages(vcpu, gfn);
-
-account_shadowed(vcpu->kvm, sp);
  }
  sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
  init_shadow_page_table(sp);
diff --git a/arch/x86/kvm/page_track.c b/arch/x86/kvm/page_track.c
index 84420df..87554d3 100644
--- a/arch/x86/kvm/page_track.c
+++ b/arch/x86/kvm/page_track.c
@@ -77,6 +77,26 @@ static void update_gfn_track(struct 
kvm_memory_slot *slot, gfn_t gfn,

  WARN_ON(val < 0);
  }
  +void
+kvm_slot_page_track_add_page_nolock(struct kvm *kvm,
+struct kvm_memory_slot *slot, gfn_t gfn,
+enum kvm_page_track_mode mode)
+{
+WARN_ON(!check_mode(mode));
+
+update_gfn_track(slot, gfn, mode, 1);
+
+/*
+ * new track stops large page mapping for the
+ * tracked page.
+ */
+kvm_mmu_gfn_disallow_lpage(slot, gfn);
+
+if (mode == KVM_PAGE_TRACK_WRIT

Re: [PATCH 05/11] KVM: page track: introduce kvm_page_track_{add,remove}_page

2015-12-15 Thread Kai Huang



On 12/15/2015 03:15 PM, Kai Huang wrote:



On 12/01/2015 02:26 AM, Xiao Guangrong wrote:

These two functions are the user APIs:
- kvm_page_track_add_page(): add the page to the tracking pool; afterwards
   the specified access on that page will be tracked

- kvm_page_track_remove_page(): remove the page from the tracking pool;
   the specified access on the page is no longer tracked after the last user
   is gone

Both of these are called under the protection of kvm->srcu or
kvm->slots_lock

Signed-off-by: Xiao Guangrong 
---
  arch/x86/include/asm/kvm_page_track.h |  5 ++
  arch/x86/kvm/page_track.c | 95 
+++

  2 files changed, 100 insertions(+)

diff --git a/arch/x86/include/asm/kvm_page_track.h 
b/arch/x86/include/asm/kvm_page_track.h

index 347d5c9..9cc17c6 100644
--- a/arch/x86/include/asm/kvm_page_track.h
+++ b/arch/x86/include/asm/kvm_page_track.h
@@ -10,4 +10,9 @@ int kvm_page_track_create_memslot(struct 
kvm_memory_slot *slot,

unsigned long npages);
  void kvm_page_track_free_memslot(struct kvm_memory_slot *free,
   struct kvm_memory_slot *dont);
+
+void kvm_page_track_add_page(struct kvm *kvm, gfn_t gfn,
+ enum kvm_page_track_mode mode);
+void kvm_page_track_remove_page(struct kvm *kvm, gfn_t gfn,
+enum kvm_page_track_mode mode);
  #endif
diff --git a/arch/x86/kvm/page_track.c b/arch/x86/kvm/page_track.c
index 0338d36..ad510db 100644
--- a/arch/x86/kvm/page_track.c
+++ b/arch/x86/kvm/page_track.c
@@ -56,3 +56,98 @@ void kvm_page_track_free_memslot(struct 
kvm_memory_slot *free,

  if (!dont || free->arch.gfn_track != dont->arch.gfn_track)
  page_track_slot_free(free);
  }
+
+static bool check_mode(enum kvm_page_track_mode mode)
+{
+if (mode < 0 || mode >= KVM_PAGE_TRACK_MAX)
+return false;
+
+return true;
+}
+
+static void update_gfn_track(struct kvm_memory_slot *slot, gfn_t gfn,
+ enum kvm_page_track_mode mode, int count)
+{
+int index, val;
+
+index = gfn_to_index(gfn, slot->base_gfn, PT_PAGE_TABLE_LEVEL);
+
+slot->arch.gfn_track[mode][index] += count;
+val = slot->arch.gfn_track[mode][index];
+WARN_ON(val < 0);
+}
+
+/*
+ * add guest page to the tracking pool so that corresponding access 
on that

+ * page will be intercepted.
+ *
+ * It should be called under the protection of kvm->srcu or 
kvm->slots_lock

+ *
+ * @kvm: the guest instance we are interested in.
+ * @gfn: the guest page.
+ * @mode: tracking mode, currently only write track is supported.
+ */
+void kvm_page_track_add_page(struct kvm *kvm, gfn_t gfn,
+ enum kvm_page_track_mode mode)
+{
+struct kvm_memslots *slots;
+struct kvm_memory_slot *slot;
+int i;
+
+WARN_ON(!check_mode(mode));
+
+for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+slots = __kvm_memslots(kvm, i);
+slot = __gfn_to_memslot(slots, gfn);
+
+spin_lock(&kvm->mmu_lock);
+update_gfn_track(slot, gfn, mode, 1);
+
+/*
+ * new track stops large page mapping for the
+ * tracked page.
+ */
+kvm_mmu_gfn_disallow_lpage(slot, gfn);
Where is kvm_mmu_gfn_disallow_lpage? I didn't see it in your patch, nor in my
own latest KVM repo without your patch :)



+
+if (mode == KVM_PAGE_TRACK_WRITE)
+if (kvm_mmu_slot_gfn_write_protect(kvm, slot, gfn))
+kvm_flush_remote_tlbs(kvm);

Neither can I find kvm_mmu_slot_gfn_write_protect. Did I miss something?
Forget this reply. It looks like my Thunderbird had something wrong and
couldn't show your whole patch series, and I missed patches 1~3. Now I see
them. Sorry.


Thanks,
-Kai


Thanks,
-Kai

+ spin_unlock(&kvm->mmu_lock);
+}
+}
+
+/*
+ * remove the guest page from the tracking pool which stops the 
interception

+ * of corresponding access on that page. It is the opposed operation of
+ * kvm_page_track_add_page().
+ *
+ * It should be called under the protection of kvm->srcu or 
kvm->slots_lock

+ *
+ * @kvm: the guest instance we are interested in.
+ * @gfn: the guest page.
+ * @mode: tracking mode, currently only write track is supported.
+ */
+void kvm_page_track_remove_page(struct kvm *kvm, gfn_t gfn,
+enum kvm_page_track_mode mode)
+{
+struct kvm_memslots *slots;
+struct kvm_memory_slot *slot;
+int i;
+
+WARN_ON(!check_mode(mode));
+
+for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+slots = __kvm_memslots(kvm, i);
+slot = __gfn_to_memslot(slots, gfn);
+
+spin_lock(&kvm->mmu_lock);
+update_gfn_track(slot, gfn, mode, -1);
+
+/*
+ * allow large page mapping for the tracked page
+ * after the tracker is gone.
+ */
+kvm_mmu_gfn_allow_lpage(slot, gfn);
+spin_unlock(&kvm->mmu_lock);
+}
+}



Re: [PATCH 08/11] KVM: MMU: use page track for non-leaf shadow pages

2015-12-14 Thread Kai Huang



On 12/01/2015 02:26 AM, Xiao Guangrong wrote:

Non-leaf shadow pages are always write protected, so they can be users
of page track.

Signed-off-by: Xiao Guangrong 
---
  arch/x86/include/asm/kvm_page_track.h |  8 +
  arch/x86/kvm/mmu.c| 26 +---
  arch/x86/kvm/page_track.c | 58 +++
  3 files changed, 67 insertions(+), 25 deletions(-)

diff --git a/arch/x86/include/asm/kvm_page_track.h 
b/arch/x86/include/asm/kvm_page_track.h
index 6744234..3447dac 100644
--- a/arch/x86/include/asm/kvm_page_track.h
+++ b/arch/x86/include/asm/kvm_page_track.h
@@ -41,8 +41,16 @@ int kvm_page_track_create_memslot(struct kvm_memory_slot 
*slot,
  void kvm_page_track_free_memslot(struct kvm_memory_slot *free,
 struct kvm_memory_slot *dont);
  
+void

+kvm_slot_page_track_add_page_nolock(struct kvm *kvm,
+   struct kvm_memory_slot *slot, gfn_t gfn,
+   enum kvm_page_track_mode mode);
  void kvm_page_track_add_page(struct kvm *kvm, gfn_t gfn,
 enum kvm_page_track_mode mode);
+void kvm_slot_page_track_remove_page_nolock(struct kvm *kvm,
+   struct kvm_memory_slot *slot,
+   gfn_t gfn,
+   enum kvm_page_track_mode mode);
  void kvm_page_track_remove_page(struct kvm *kvm, gfn_t gfn,
enum kvm_page_track_mode mode);
  bool kvm_page_track_check_mode(struct kvm_vcpu *vcpu, gfn_t gfn,
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index b23f9fc..5a2ca73 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -806,11 +806,17 @@ static void account_shadowed(struct kvm *kvm, struct 
kvm_mmu_page *sp)
struct kvm_memory_slot *slot;
gfn_t gfn;
  
+	kvm->arch.indirect_shadow_pages++;

gfn = sp->gfn;
slots = kvm_memslots_for_spte_role(kvm, sp->role);
slot = __gfn_to_memslot(slots, gfn);
+
+   /* the non-leaf shadow pages are keeping readonly. */
+   if (sp->role.level > PT_PAGE_TABLE_LEVEL)
+   return kvm_slot_page_track_add_page_nolock(kvm, slot, gfn,
+   KVM_PAGE_TRACK_WRITE);
+
kvm_mmu_gfn_disallow_lpage(slot, gfn);
-   kvm->arch.indirect_shadow_pages++;
  }
  
  static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)

@@ -819,11 +825,15 @@ static void unaccount_shadowed(struct kvm *kvm, struct 
kvm_mmu_page *sp)
struct kvm_memory_slot *slot;
gfn_t gfn;
  
+	kvm->arch.indirect_shadow_pages--;

gfn = sp->gfn;
slots = kvm_memslots_for_spte_role(kvm, sp->role);
slot = __gfn_to_memslot(slots, gfn);
+   if (sp->role.level > PT_PAGE_TABLE_LEVEL)
+   return kvm_slot_page_track_remove_page_nolock(kvm, slot, gfn,
+   KVM_PAGE_TRACK_WRITE);
+
kvm_mmu_gfn_allow_lpage(slot, gfn);
-   kvm->arch.indirect_shadow_pages--;
  }
  
  static bool __mmu_gfn_lpage_is_disallowed(gfn_t gfn, int level,

@@ -2140,12 +2150,18 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct 
kvm_vcpu *vcpu,
hlist_add_head(&sp->hash_link,
&vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)]);
if (!direct) {
-   if (rmap_write_protect(vcpu, gfn))
+   /*
+* we should do write protection before syncing pages
+* otherwise the content of the synced shadow page may
+* be inconsistent with guest page table.
+*/
+   account_shadowed(vcpu->kvm, sp);
+
+   if (level == PT_PAGE_TABLE_LEVEL &&
+ rmap_write_protect(vcpu, gfn))
kvm_flush_remote_tlbs(vcpu->kvm);
I think your modification is good, but I am a little bit confused here. In
account_shadowed, if sp->role.level > PT_PAGE_TABLE_LEVEL, the sp->gfn is
write protected, and this is reasonable. So why write protect the gfn at
PT_PAGE_TABLE_LEVEL here?



if (level > PT_PAGE_TABLE_LEVEL && need_sync)
kvm_sync_pages(vcpu, gfn);
-
-   account_shadowed(vcpu->kvm, sp);
}
sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
init_shadow_page_table(sp);
diff --git a/arch/x86/kvm/page_track.c b/arch/x86/kvm/page_track.c
index 84420df..87554d3 100644
--- a/arch/x86/kvm/page_track.c
+++ b/arch/x86/kvm/page_track.c
@@ -77,6 +77,26 @@ static void update_gfn_track(struct kvm_memory_slot *slot, 
gfn_t gfn,
WARN_ON(val < 0);
  }
  
+void

+kvm_slot_page_track_add_page_nolock(struct kvm *kvm,
+   struct kvm_memory_slot *slot, gfn_t gfn,
+   enum kvm_page_track_mode mode)
+{
+   WARN_ON(!check_mode(mode))

Re: [PATCH 05/11] KVM: page track: introduce kvm_page_track_{add,remove}_page

2015-12-14 Thread Kai Huang



On 12/01/2015 02:26 AM, Xiao Guangrong wrote:

These two functions are the user APIs:
- kvm_page_track_add_page(): add the page to the tracking pool; afterwards
   the specified access on that page will be tracked

- kvm_page_track_remove_page(): remove the page from the tracking pool;
   the specified access on the page is no longer tracked after the last user
   is gone

Both of these are called under the protection of kvm->srcu or
kvm->slots_lock

Signed-off-by: Xiao Guangrong 
---
  arch/x86/include/asm/kvm_page_track.h |  5 ++
  arch/x86/kvm/page_track.c | 95 +++
  2 files changed, 100 insertions(+)

diff --git a/arch/x86/include/asm/kvm_page_track.h 
b/arch/x86/include/asm/kvm_page_track.h
index 347d5c9..9cc17c6 100644
--- a/arch/x86/include/asm/kvm_page_track.h
+++ b/arch/x86/include/asm/kvm_page_track.h
@@ -10,4 +10,9 @@ int kvm_page_track_create_memslot(struct kvm_memory_slot 
*slot,
  unsigned long npages);
  void kvm_page_track_free_memslot(struct kvm_memory_slot *free,
 struct kvm_memory_slot *dont);
+
+void kvm_page_track_add_page(struct kvm *kvm, gfn_t gfn,
+enum kvm_page_track_mode mode);
+void kvm_page_track_remove_page(struct kvm *kvm, gfn_t gfn,
+   enum kvm_page_track_mode mode);
  #endif
diff --git a/arch/x86/kvm/page_track.c b/arch/x86/kvm/page_track.c
index 0338d36..ad510db 100644
--- a/arch/x86/kvm/page_track.c
+++ b/arch/x86/kvm/page_track.c
@@ -56,3 +56,98 @@ void kvm_page_track_free_memslot(struct kvm_memory_slot 
*free,
if (!dont || free->arch.gfn_track != dont->arch.gfn_track)
page_track_slot_free(free);
  }
+
+static bool check_mode(enum kvm_page_track_mode mode)
+{
+   if (mode < 0 || mode >= KVM_PAGE_TRACK_MAX)
+   return false;
+
+   return true;
+}
+
+static void update_gfn_track(struct kvm_memory_slot *slot, gfn_t gfn,
+enum kvm_page_track_mode mode, int count)
+{
+   int index, val;
+
+   index = gfn_to_index(gfn, slot->base_gfn, PT_PAGE_TABLE_LEVEL);
+
+   slot->arch.gfn_track[mode][index] += count;
+   val = slot->arch.gfn_track[mode][index];
+   WARN_ON(val < 0);
+}
+
+/*
+ * add guest page to the tracking pool so that corresponding access on that
+ * page will be intercepted.
+ *
+ * It should be called under the protection of kvm->srcu or kvm->slots_lock
+ *
+ * @kvm: the guest instance we are interested in.
+ * @gfn: the guest page.
+ * @mode: tracking mode, currently only write track is supported.
+ */
+void kvm_page_track_add_page(struct kvm *kvm, gfn_t gfn,
+enum kvm_page_track_mode mode)
+{
+   struct kvm_memslots *slots;
+   struct kvm_memory_slot *slot;
+   int i;
+
+   WARN_ON(!check_mode(mode));
+
+   for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+   slots = __kvm_memslots(kvm, i);
+   slot = __gfn_to_memslot(slots, gfn);
+
+   spin_lock(&kvm->mmu_lock);
+   update_gfn_track(slot, gfn, mode, 1);
+
+   /*
+* new track stops large page mapping for the
+* tracked page.
+*/
+   kvm_mmu_gfn_disallow_lpage(slot, gfn);
Where is kvm_mmu_gfn_disallow_lpage? I didn't see it in your patch, nor in my
own latest KVM repo without your patch :)



+
+   if (mode == KVM_PAGE_TRACK_WRITE)
+   if (kvm_mmu_slot_gfn_write_protect(kvm, slot, gfn))
+   kvm_flush_remote_tlbs(kvm);

Neither can I find kvm_mmu_slot_gfn_write_protect. Did I miss something?

Thanks,
-Kai
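For context, a hypothetical consumer of this pair of APIs might look like the
sketch below; everything except the two tracking calls and
KVM_PAGE_TRACK_WRITE is made up for illustration:

	/* caller must hold kvm->srcu or kvm->slots_lock, per the comments above */
	static void demo_start_write_tracking(struct kvm *kvm, gfn_t gfn)
	{
		kvm_page_track_add_page(kvm, gfn, KVM_PAGE_TRACK_WRITE);
	}

	static void demo_stop_write_tracking(struct kvm *kvm, gfn_t gfn)
	{
		kvm_page_track_remove_page(kvm, gfn, KVM_PAGE_TRACK_WRITE);
	}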

+   spin_unlock(&kvm->mmu_lock);
+   }
+}
+
+/*
+ * remove the guest page from the tracking pool which stops the interception
+ * of corresponding access on that page. It is the opposed operation of
+ * kvm_page_track_add_page().
+ *
+ * It should be called under the protection of kvm->srcu or kvm->slots_lock
+ *
+ * @kvm: the guest instance we are interested in.
+ * @gfn: the guest page.
+ * @mode: tracking mode, currently only write track is supported.
+ */
+void kvm_page_track_remove_page(struct kvm *kvm, gfn_t gfn,
+   enum kvm_page_track_mode mode)
+{
+   struct kvm_memslots *slots;
+   struct kvm_memory_slot *slot;
+   int i;
+
+   WARN_ON(!check_mode(mode));
+
+   for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+   slots = __kvm_memslots(kvm, i);
+   slot = __gfn_to_memslot(slots, gfn);
+
+   spin_lock(&kvm->mmu_lock);
+   update_gfn_track(slot, gfn, mode, -1);
+
+   /*
+* allow large page mapping for the tracked page
+* after the tracker is gone.
+*/
+   kvm_mmu_gfn_allow_lpage(slot, gfn);
+   spin_unlock(&kvm->mmu_lock);
+   }
+}



Re: [PATCH 04/11] KVM: page track: add the framework of guest page tracking

2015-12-14 Thread Kai Huang

Hi Guangrong,

I am starting to review this series and will have some comments or
questions; you can determine whether they are valuable :)


See  below.

On 12/01/2015 02:26 AM, Xiao Guangrong wrote:

The array, gfn_track[mode][gfn], is introduced in memory slot for every
guest page, this is the tracking count for the gust page on different
modes. If the page is tracked then the count is increased, the page is
not tracked after the count reaches zero

Two callbacks, kvm_page_track_create_memslot() and
kvm_page_track_free_memslot() are implemented in this patch, they are
internally used to initialize and reclaim the memory of the array

Currently, only write track mode is supported

Signed-off-by: Xiao Guangrong 
---
  arch/x86/include/asm/kvm_host.h   |  2 ++
  arch/x86/include/asm/kvm_page_track.h | 13 
  arch/x86/kvm/Makefile |  3 +-
  arch/x86/kvm/page_track.c | 58 +++
  arch/x86/kvm/x86.c|  5 +++
  5 files changed, 80 insertions(+), 1 deletion(-)
  create mode 100644 arch/x86/include/asm/kvm_page_track.h
  create mode 100644 arch/x86/kvm/page_track.c

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5aa2dcc..afff1f1 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -32,6 +32,7 @@
  #include 
  #include 
  #include 
+#include 
  
  #define KVM_MAX_VCPUS 255

  #define KVM_SOFT_MAX_VCPUS 160
@@ -612,6 +613,7 @@ struct kvm_lpage_info {
  struct kvm_arch_memory_slot {
struct kvm_rmap_head *rmap[KVM_NR_PAGE_SIZES];
struct kvm_lpage_info *lpage_info[KVM_NR_PAGE_SIZES - 1];
+   int *gfn_track[KVM_PAGE_TRACK_MAX];
  };
  
  /*

diff --git a/arch/x86/include/asm/kvm_page_track.h 
b/arch/x86/include/asm/kvm_page_track.h
new file mode 100644
index 000..347d5c9
--- /dev/null
+++ b/arch/x86/include/asm/kvm_page_track.h
@@ -0,0 +1,13 @@
+#ifndef _ASM_X86_KVM_PAGE_TRACK_H
+#define _ASM_X86_KVM_PAGE_TRACK_H
+
+enum kvm_page_track_mode {
+   KVM_PAGE_TRACK_WRITE,
+   KVM_PAGE_TRACK_MAX,
+};
+
+int kvm_page_track_create_memslot(struct kvm_memory_slot *slot,
+ unsigned long npages);
+void kvm_page_track_free_memslot(struct kvm_memory_slot *free,
+struct kvm_memory_slot *dont);
+#endif
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index a1ff508..464fa47 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -13,9 +13,10 @@ kvm-$(CONFIG_KVM_ASYNC_PF)   += $(KVM)/async_pf.o
  
  kvm-y			+= x86.o mmu.o emulate.o i8259.o irq.o lapic.o \

   i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \
-  hyperv.o
+  hyperv.o page_track.o
  
  kvm-$(CONFIG_KVM_DEVICE_ASSIGNMENT)	+= assigned-dev.o iommu.o

+
  kvm-intel-y   += vmx.o pmu_intel.o
  kvm-amd-y += svm.o pmu_amd.o
  
diff --git a/arch/x86/kvm/page_track.c b/arch/x86/kvm/page_track.c

new file mode 100644
index 000..0338d36
--- /dev/null
+++ b/arch/x86/kvm/page_track.c
@@ -0,0 +1,58 @@
+/*
+ * Support KVM guest page tracking
+ *
+ * This feature allows us to track page access in guest. Currently, only
+ * write access is tracked.
+ *
+ * Copyright(C) 2015 Intel Corporation.
+ *
+ * Author:
+ *   Xiao Guangrong 
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ */
+
+#include 
+#include 
+#include 
+
+#include "mmu.h"
+
+static void page_track_slot_free(struct kvm_memory_slot *slot)
+{
+   int i;
+
+   for (i = 0; i < KVM_PAGE_TRACK_MAX; i++)
+   if (slot->arch.gfn_track[i]) {
+   kvfree(slot->arch.gfn_track[i]);
+   slot->arch.gfn_track[i] = NULL;
+   }
+}
+
+int kvm_page_track_create_memslot(struct kvm_memory_slot *slot,
+ unsigned long npages)
+{
+   int  i, pages = gfn_to_index(slot->base_gfn + npages - 1,
+ slot->base_gfn, PT_PAGE_TABLE_LEVEL) + 1;
+
+   for (i = 0; i < KVM_PAGE_TRACK_MAX; i++) {
+   slot->arch.gfn_track[i] = kvm_kvzalloc(pages *
+   sizeof(*slot->arch.gfn_track[i]));
+   if (!slot->arch.gfn_track[i])
+   goto track_free;
+   }
+
+   return 0;
+
+track_free:
+   page_track_slot_free(slot);
+   return -ENOMEM;
+}
Is it necessary to use the 'unsigned long npages' parameter? In my
understanding you are going to track all GFNs in the memory slot anyway,
right? If you want to keep npages, I think it's better to add a base_gfn as
well; otherwise you are assuming you are going to track the npages GFNs
starting from slot->base_gfn.



+
+void kvm_page_track_free_memslot(struct kvm_memory_slot *free,
+struct kvm_memory_slot *dont)
+{
+   

Re: [v2] KVM: VMX: Fix commit which broke PML

2015-11-04 Thread Kai Huang

Hi Paolo,

Thanks for applying! I am really sorry that I forgot to delete the line that
clears the SECONDARY_EXEC_ENABLE_PML bit in vmx_disable_pml, which is now
renamed to vmx_destroy_pml_buffer.
It won't impact functionality, but to keep the function consistent, would you
also apply the change below? Sorry for the negligence!


diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 89f4fa2..ef4ca76 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -7826,8 +7826,6 @@ static void vmx_destroy_pml_buffer(struct vcpu_vmx 
*vmx)

ASSERT(vmx->pml_pg);
__free_page(vmx->pml_pg);
vmx->pml_pg = NULL;
-
-   vmcs_clear_bits(SECONDARY_VM_EXEC_CONTROL, 
SECONDARY_EXEC_ENABLE_PML);

 }

Thanks,
-Kai

On 11/04/2015 08:00 PM, Paolo Bonzini wrote:


On 04/11/2015 06:46, Kai Huang wrote:

I found that PML has been broken since the commit below:

commit feda805fe7c4ed9cf78158e73b1218752e3b4314
Author: Xiao Guangrong 
Date:   Wed Sep 9 14:05:55 2015 +0800

KVM: VMX: unify SECONDARY_VM_EXEC_CONTROL update

Unify the update in vmx_cpuid_update()

Signed-off-by: Xiao Guangrong 
[Rewrite to use vmcs_set_secondary_exec_control. - Paolo]
Signed-off-by: Paolo Bonzini 

The reason is that in the above commit vmx_cpuid_update calls vmx_secondary_exec_control,
in which the SECONDARY_EXEC_ENABLE_PML bit is currently cleared unconditionally (as
PML is enabled when creating the vcpu). Therefore, if vmx_cpuid_update is called after
the vcpu is created, PML will be disabled unexpectedly while the log-dirty code still
thinks PML is in use.

Fix this by clearing SECONDARY_EXEC_ENABLE_PML in vmx_secondary_exec_control
only when PML is not supported or not enabled (!enable_pml). This is more
reasonable as PML is currently either always enabled or always disabled. With this,
explicitly updating SECONDARY_EXEC_ENABLE_PML in vmx_enable{disable}_pml is no longer
needed, so also rename vmx_enable{disable}_pml to vmx_create{destroy}_pml_buffer.

Signed-off-by: Kai Huang 

---

v1->v2: Fix this by following Paolo's suggestion. It's better not to clear
SECONDARY_EXEC_ENABLE_PML in vmx_secondary_exec_control unconditionally but only
to clear it when PML is not supported or not enabled.

---
  arch/x86/kvm/vmx.c | 15 +++
  1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 2ac11641..89f4fa2 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4718,8 +4718,9 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx 
*vmx)
   a current VMCS12
*/
exec_control &= ~SECONDARY_EXEC_SHADOW_VMCS;
-   /* PML is enabled/disabled in creating/destorying vcpu */
-   exec_control &= ~SECONDARY_EXEC_ENABLE_PML;
+
+   if (!enable_pml)
+   exec_control &= ~SECONDARY_EXEC_ENABLE_PML;
  
  	/* Currently, we allow L1 guest to directly run pcommit instruction. */

exec_control &= ~SECONDARY_EXEC_PCOMMIT;
@@ -7804,7 +7805,7 @@ static void vmx_get_exit_info(struct kvm_vcpu *vcpu, u64 
*info1, u64 *info2)
*info2 = vmcs_read32(VM_EXIT_INTR_INFO);
  }
  
-static int vmx_enable_pml(struct vcpu_vmx *vmx)

+static int vmx_create_pml_buffer(struct vcpu_vmx *vmx)
  {
struct page *pml_pg;
  
@@ -7817,12 +7818,10 @@ static int vmx_enable_pml(struct vcpu_vmx *vmx)

vmcs_write64(PML_ADDRESS, page_to_phys(vmx->pml_pg));
vmcs_write16(GUEST_PML_INDEX, PML_ENTITY_NUM - 1);
  
-	vmcs_set_bits(SECONDARY_VM_EXEC_CONTROL, SECONDARY_EXEC_ENABLE_PML);

-
return 0;
  }
  
-static void vmx_disable_pml(struct vcpu_vmx *vmx)

+static void vmx_destroy_pml_buffer(struct vcpu_vmx *vmx)
  {
ASSERT(vmx->pml_pg);
__free_page(vmx->pml_pg);
@@ -8706,7 +8705,7 @@ static void vmx_free_vcpu(struct kvm_vcpu *vcpu)
struct vcpu_vmx *vmx = to_vmx(vcpu);
  
  	if (enable_pml)

-   vmx_disable_pml(vmx);
+   vmx_destroy_pml_buffer(vmx);
free_vpid(vmx->vpid);
leave_guest_mode(vcpu);
vmx_load_vmcs01(vcpu);
@@ -8790,7 +8789,7 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, 
unsigned int id)
 * for the guest, etc.
 */
if (enable_pml) {
-   err = vmx_enable_pml(vmx);
+   err = vmx_create_pml_buffer(vmx);
if (err)
goto free_vmcs;
}



Applied, thanks!

Paolo


[RESEND PATCH v2] KVM: VMX: Fix commit which broke PML

2015-11-03 Thread Kai Huang
I found PML was broken since below commit:

commit feda805fe7c4ed9cf78158e73b1218752e3b4314
Author: Xiao Guangrong 
Date:   Wed Sep 9 14:05:55 2015 +0800

KVM: VMX: unify SECONDARY_VM_EXEC_CONTROL update

Unify the update in vmx_cpuid_update()

Signed-off-by: Xiao Guangrong 
[Rewrite to use vmcs_set_secondary_exec_control. - Paolo]
Signed-off-by: Paolo Bonzini 

The reason is that after the above commit, vmx_cpuid_update calls
vmx_secondary_exec_control, which currently clears the SECONDARY_EXEC_ENABLE_PML
bit unconditionally (PML is enabled when the vcpu is created). Therefore, if
vmx_cpuid_update is called after the vcpu is created, PML will be disabled
unexpectedly while the log-dirty code still thinks PML is in use.

Fix this by clearing SECONDARY_EXEC_ENABLE_PML in vmx_secondary_exec_control
only when PML is not supported or not enabled (!enable_pml). This is more
reasonable as PML is currently either always enabled or always disabled. With
this, explicitly updating SECONDARY_EXEC_ENABLE_PML in vmx_enable{disable}_pml
is no longer needed, so also rename vmx_enable{disable}_pml to
vmx_create{destroy}_pml_buffer.

Signed-off-by: Kai Huang 
---

Sorry, the previous patch missed the PATCH subject prefix. Resending with that fixed.

v1->v2: Fix this by following Paolo's suggestion. It's better not to clear
SECONDARY_EXEC_ENABLE_PML in vmx_secondary_exec_control unconditionally, but
only to clear it when PML is not supported or not enabled.

---
 arch/x86/kvm/vmx.c | 15 +++
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 2ac11641..89f4fa2 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4718,8 +4718,9 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx 
*vmx)
   a current VMCS12
*/
exec_control &= ~SECONDARY_EXEC_SHADOW_VMCS;
-   /* PML is enabled/disabled in creating/destorying vcpu */
-   exec_control &= ~SECONDARY_EXEC_ENABLE_PML;
+
+   if (!enable_pml)
+   exec_control &= ~SECONDARY_EXEC_ENABLE_PML;
 
/* Currently, we allow L1 guest to directly run pcommit instruction. */
exec_control &= ~SECONDARY_EXEC_PCOMMIT;
@@ -7804,7 +7805,7 @@ static void vmx_get_exit_info(struct kvm_vcpu *vcpu, u64 
*info1, u64 *info2)
*info2 = vmcs_read32(VM_EXIT_INTR_INFO);
 }
 
-static int vmx_enable_pml(struct vcpu_vmx *vmx)
+static int vmx_create_pml_buffer(struct vcpu_vmx *vmx)
 {
struct page *pml_pg;
 
@@ -7817,12 +7818,10 @@ static int vmx_enable_pml(struct vcpu_vmx *vmx)
vmcs_write64(PML_ADDRESS, page_to_phys(vmx->pml_pg));
vmcs_write16(GUEST_PML_INDEX, PML_ENTITY_NUM - 1);
 
-   vmcs_set_bits(SECONDARY_VM_EXEC_CONTROL, SECONDARY_EXEC_ENABLE_PML);
-
return 0;
 }
 
-static void vmx_disable_pml(struct vcpu_vmx *vmx)
+static void vmx_destroy_pml_buffer(struct vcpu_vmx *vmx)
 {
ASSERT(vmx->pml_pg);
__free_page(vmx->pml_pg);
@@ -8706,7 +8705,7 @@ static void vmx_free_vcpu(struct kvm_vcpu *vcpu)
struct vcpu_vmx *vmx = to_vmx(vcpu);
 
if (enable_pml)
-   vmx_disable_pml(vmx);
+   vmx_destroy_pml_buffer(vmx);
free_vpid(vmx->vpid);
leave_guest_mode(vcpu);
vmx_load_vmcs01(vcpu);
@@ -8790,7 +8789,7 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, 
unsigned int id)
 * for the guest, etc.
 */
if (enable_pml) {
-   err = vmx_enable_pml(vmx);
+   err = vmx_create_pml_buffer(vmx);
if (err)
goto free_vmcs;
}
-- 
2.5.0





Re: [PATCH] KVM: VMX: Fix commit which broke PML

2015-11-03 Thread Kai Huang



On 11/03/2015 05:59 PM, Paolo Bonzini wrote:


On 03/11/2015 06:49, Kai Huang wrote:

I found PML was broken since below commit:

commit feda805fe7c4ed9cf78158e73b1218752e3b4314
Author: Xiao Guangrong 
Date:   Wed Sep 9 14:05:55 2015 +0800

KVM: VMX: unify SECONDARY_VM_EXEC_CONTROL update

Unify the update in vmx_cpuid_update()

Signed-off-by: Xiao Guangrong 
[Rewrite to use vmcs_set_secondary_exec_control. - Paolo]
Signed-off-by: Paolo Bonzini 

The reason is that after the above commit, vmx_cpuid_update calls
vmx_secondary_exec_control, in which PML is disabled unconditionally (PML is
enabled when the vcpu is created). Therefore, if vmx_cpuid_update is called
after the vcpu is created, PML will be disabled unexpectedly while the
log-dirty code still thinks PML is in use. In fact, calling
vmx_secondary_exec_control in vmx_cpuid_update is likely to break any VMX
feature that is enabled/disabled on demand by updating
SECONDARY_VM_EXEC_CONTROL, if vmx_cpuid_update is called between the point the
feature is enabled and the point it is disabled.

Fix this by calling vmcs_read32 to read out SECONDARY_VM_EXEC_CONTROL directly.

vmx_cpuid_update() is meant to be mostly idempotent; the parts that
depend on the current VMCS configuration are hidden in
vmcs_set_secondary_control.  So a better fix would be to add
SECONDARY_EXEC_ENABLE_PML to vmcs_set_secondary_exec_control's
"mask" variable.  However, you can see from the comment:

/*
 * These bits in the secondary execution controls field
 * are dynamic, the others are mostly based on the hypervisor
 * architecture and the guest's CPUID. Do not touch the
 * dynamic bits.
 */

that even this is not the optimal fix.  SECONDARY_EXEC_ENABLE_PML is
either always set or always clear, so it shouldn't be in "mask".

Instead, it should be in vmcs_config.cpu_based_2nd_exec_ctrl.  It isn't
because my review didn't notice this remnant of your original
implementation, which dynamically enabled/disabled PML.

In fact, cpu_has_vmx_pml() expects SECONDARY_EXEC_ENABLE_PML to be set
in vmcs_config.cpu_based_2nd_exec_ctrl, so it is a bit confusing to
remove the bit unconditionally in vmx_secondary_exec_control!

So I think SECONDARY_EXEC_ENABLE_PML should not be removed unconditionally
from exec_control in vmx_secondary_exec_control; the removal should be
conditional on !enable_pml, like we do for e.g. EPT or VPID.  If you do this,
vmx_enable_pml and vmx_disable_pml need not touch SECONDARY_VM_EXEC_CONTROL
anymore.  Do you agree?  If so, can you prepare a patch along these lines?
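
For reference, a rough sketch of the vmcs_set_secondary_exec_control
semantics described above; the exact set of dynamic bits kept in "mask" is
abbreviated here and may not match vmx.c exactly:

static void vmcs_set_secondary_exec_control(u32 new_ctl)
{
	/*
	 * Dynamic bits are preserved from the live VMCS; everything else
	 * comes from the freshly computed control value.
	 */
	u32 mask = SECONDARY_EXEC_SHADOW_VMCS |
		   SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE |
		   SECONDARY_EXEC_APIC_REGISTER_VIRT;
	u32 cur_ctl = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);

	vmcs_write32(SECONDARY_VM_EXEC_CONTROL,
		     (new_ctl & ~mask) | (cur_ctl & mask));
}

With PML out of "mask" and instead gated by enable_pml in
vmx_secondary_exec_control, vmx_cpuid_update can keep calling this helper
without ever clearing SECONDARY_EXEC_ENABLE_PML behind the log-dirty code's
back.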

Thanks Paolo for your comments.

Sure, I agree. I will send out the v2 patch following what you suggested.



(Since you are at it, perhaps you can rename vmx_enable_pml and
vmx_disable_pml, since they will only allocate and free the PML page).
I intend to rename vmx_enable{disable}_pml to vmx_create{destroy}_pml_buffer,
as besides allocating the buffer, we also need to write the buffer address
and the PML index to the VMCS.


Thanks,
-Kai


Thanks for reporting the issue!

Paolo


[PATCH] KVM: VMX: Fix commit which broke PML

2015-11-02 Thread Kai Huang
I found PML was broken since below commit:

commit feda805fe7c4ed9cf78158e73b1218752e3b4314
Author: Xiao Guangrong 
Date:   Wed Sep 9 14:05:55 2015 +0800

KVM: VMX: unify SECONDARY_VM_EXEC_CONTROL update

Unify the update in vmx_cpuid_update()

Signed-off-by: Xiao Guangrong 
[Rewrite to use vmcs_set_secondary_exec_control. - Paolo]
Signed-off-by: Paolo Bonzini 

The reason is that after the above commit, vmx_cpuid_update calls
vmx_secondary_exec_control, in which PML is disabled unconditionally (PML is
enabled when the vcpu is created). Therefore, if vmx_cpuid_update is called
after the vcpu is created, PML will be disabled unexpectedly while the
log-dirty code still thinks PML is in use. In fact, calling
vmx_secondary_exec_control in vmx_cpuid_update is likely to break any VMX
feature that is enabled/disabled on demand by updating
SECONDARY_VM_EXEC_CONTROL, if vmx_cpuid_update is called between the point the
feature is enabled and the point it is disabled.

Fix this by calling vmcs_read32 to read out SECONDARY_VM_EXEC_CONTROL directly.

Signed-off-by: Kai Huang 
---
 arch/x86/kvm/vmx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 4d0aa31..4525c0a7 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -8902,7 +8902,7 @@ static void vmx_cpuid_update(struct kvm_vcpu *vcpu)
 {
struct kvm_cpuid_entry2 *best;
struct vcpu_vmx *vmx = to_vmx(vcpu);
-   u32 secondary_exec_ctl = vmx_secondary_exec_control(vmx);
+   u32 secondary_exec_ctl = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
 
if (vmx_rdtscp_supported()) {
bool rdtscp_enabled = guest_cpuid_has_rdtscp(vcpu);
-- 
2.5.0



Re: [PATCH 6/6] KVM: VMX: Add PML support in VMX

2015-02-05 Thread Kai Huang


On 02/05/2015 11:04 PM, Radim Krčmář wrote:

2015-02-05 14:23+0800, Kai Huang:

On 02/03/2015 11:18 PM, Radim Krčmář wrote:

You have it protected by CONFIG_X86_64, but use it unconditionally.

Thanks for catching. This has been fixed by another patch, and the fix has
also been merged by Paolo.

(And I haven't noticed the followup either ...)

I don't know of any present bugs in this patch and all questions have
been cleared,

So far I see no other bug reported :)

Thanks,
-Kai


Thanks.


+   exec_control &= ~SECONDARY_EXEC_ENABLE_PML;

What is the harm of enabling it here?

(SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES seems similar and does it.)

Because the PML feature detection is unconditional (meaning
SECONDARY_EXEC_ENABLE_PML is always in vmcs_config.cpu_based_2nd_exec_ctrl),
while the PML buffer is only created when the vcpu is created and is
controlled by the 'enable_pml' module parameter, if we always enable
SECONDARY_EXEC_ENABLE_PML here, no PML buffer will be created when PML is
disabled by the 'enable_pml' parameter,

I meant

   if (!enable_pml)
 exec_control &= ~SECONDARY_EXEC_ENABLE_PML

here and no exec_control operations in vmx_create_vcpu().


 so it's better to enable it along with
creating PML buffer.

I think the reason why KVM split the setup into vmx_create_vcpu() and
vmx_secondary_exec_control() is to simplify nested virtualization.


+   if (!(to_vmx(vcpu)->idt_vectoring_info & VECTORING_INFO_VALID_MASK) &&
+   cpu_has_virtual_nmis() &&
+   (exit_qualification & INTR_INFO_UNBLOCK_NMI))
+   vmcs_set_bits(GUEST_INTERRUPTIBILITY_INFO,
+   GUEST_INTR_STATE_NMI);

Relevant part of the specification is pasted from 27.2.2 (Information
for VM Exits Due to Vectored Events), which makes me think that this bit
is mirrored to IDT-vectoring information field ...

Isn't vmx_recover_nmi_blocking() enough?

(I see the same code in handle_ept_violation(), but wasn't that needed
  just because of a hardware error?)

This needs to be handled in both EPT violation and PML. If you look at the
PML specification (the link is in the cover letter), you can see the definition
of bit 12 of exit_qualification (section 1.5), which explains the above code.
The same definition of exit_qualification is in the EPT violation part of
Intel's SDM.

True ... IDT-vectoring doesn't set bit 12 because PML exit isn't a fault
and 27.2.2: "— For all other relevant VM exits, bit 12 is cleared to 0."
(This was humourously pasted to PML as well.)


+   /* PML index always points to next available PML buffer entity */
+   if (pml_idx >= PML_ENTITY_NUM)
+   pml_idx = 0;
+   else
+   pml_idx++;

If the pml_idx is >= PML_ENTITY_NUM and < 0xffff, the log is empty,

In this case, the log is full, actually. PML index is set to PML_ENTITY_NUM
- 1 initially, and the hardware decreases it after a GPA is logged.

Below is the pseudocode of PML hardware logic.

IF (PML Index[15:9] ≠ 0)
 THEN VM exit;
FI;

pml_idx >= PML_ENTITY_NUM exits without modifying the buffer,


set accessed and dirty flags for EPT;
IF (a dirty flag was updated from 0 to 1)
THEN
 PML address[PML index] ← 4-KByte-aligned guest-physical address;
 PML index is decremented;

0xffff is the only value that specifies a full buffer; the rest means
that we incorrectly set up the initial index and VMX exited without
changing the buffer => we shouldn't read it.
(I should have said "untouched" instead of "empty".)


FI;




Re: [PATCH 6/6] KVM: VMX: Add PML support in VMX

2015-02-05 Thread Kai Huang


On 02/05/2015 11:04 PM, Radim Krčmář wrote:

2015-02-05 14:23+0800, Kai Huang:

On 02/03/2015 11:18 PM, Radim Krčmář wrote:

You have it protected by CONFIG_X86_64, but use it unconditionally.

Thanks for catching. This has been fixed by another patch, and the fix has
also been merged by Paolo.

(And I haven't noticed the followup either ...)

I don't know of any present bugs in this patch and all questions have
been cleared,

Thanks.


+   exec_control &= ~SECONDARY_EXEC_ENABLE_PML;

What is the harm of enabling it here?

(SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES seems similar and does it.)

Because the PML feature detection is unconditional (meaning
SECONDARY_EXEC_ENABLE_PML is always in vmcs_config.cpu_based_2nd_exec_ctrl),
while the PML buffer is only created when the vcpu is created and is
controlled by the 'enable_pml' module parameter, if we always enable
SECONDARY_EXEC_ENABLE_PML here, no PML buffer will be created when PML is
disabled by the 'enable_pml' parameter,

I meant

   if (!enable_pml)
 exec_control &= ~SECONDARY_EXEC_ENABLE_PML

here and no exec_control operations in vmx_create_vcpu().
This also works. I originally thought putting the enabling part and the
buffer allocation together was more straightforward.



 so it's better to enable it along with
creating PML buffer.

I think the reason why KVM split the setup into vmx_create_vcpu() and
vmx_secondary_exec_control() is to simplify nested virtualization.

This is a good point and I'll think about it :)




+   if (!(to_vmx(vcpu)->idt_vectoring_info & VECTORING_INFO_VALID_MASK) &&
+   cpu_has_virtual_nmis() &&
+   (exit_qualification & INTR_INFO_UNBLOCK_NMI))
+   vmcs_set_bits(GUEST_INTERRUPTIBILITY_INFO,
+   GUEST_INTR_STATE_NMI);

Relevant part of the specification is pasted from 27.2.2 (Information
for VM Exits Due to Vectored Events), which makes me think that this bit
is mirrored to IDT-vectoring information field ...

Isn't vmx_recover_nmi_blocking() enough?

(I see the same code in handle_ept_violation(), but wasn't that needed
  just because of a hardware error?)

This needs to be handled in both EPT violation and PML. If you look at the
PML specification (the link is in the cover letter), you can see the definition
of bit 12 of exit_qualification (section 1.5), which explains the above code.
The same definition of exit_qualification is in the EPT violation part of
Intel's SDM.

True ... IDT-vectoring doesn't set bit 12 because PML exit isn't a fault
and 27.2.2: "— For all other relevant VM exits, bit 12 is cleared to 0."
(This was humourously pasted to PML as well.)


+   /* PML index always points to next available PML buffer entity */
+   if (pml_idx >= PML_ENTITY_NUM)
+   pml_idx = 0;
+   else
+   pml_idx++;

If the pml_idx is >= PML_ENTITY_NUM and < 0xffff, the log is empty,

In this case, the log is full, actually. PML index is set to PML_ENTITY_NUM
- 1 initially, and the hardware decreases it after a GPA is logged.

Below is the pseudocode of PML hardware logic.

IF (PML Index[15:9] ≠ 0)
 THEN VM exit;
FI;

pml_idx >= PML_ENTITY_NUM exits without modifying the buffer,

Yes.



set accessed and dirty flags for EPT;
IF (a dirty flag was updated from 0 to 1)
THEN
 PML address[PML index] ← 4-KByte-aligned guest-physical address;
 PML index is decremented;

0xffff is the only value that specifies a full buffer; the rest means
that we incorrectly set up the initial index and VMX exited without
changing the buffer => we shouldn't read it.
(I should have said "untouched" instead of "empty".)
Yes, theoretically, according to the pseudocode, 0xffff is the only
possible value that specifies a full buffer. Probably a better way is to
check whether pml_idx is in the range [-1, 511] at the beginning and give a
warning if it's not, or just ASSERT it. Then we can simply do a ++pml_idx
(changing pml_idx to a signed short type). But the current code should also
work anyway.
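
A minimal sketch of that variant, reusing pml_pg and PML_ENTITY_NUM from the
posted patch; the function name is made up for the example and the flush call
is simplified:

static void vmx_flush_pml_buffer_alt(struct kvm_vcpu *vcpu)
{
	struct vcpu_vmx *vmx = to_vmx(vcpu);
	u64 *pml_buf = page_address(vmx->pml_pg);
	s16 pml_idx = (s16)vmcs_read16(GUEST_PML_INDEX);

	/* -1 (0xffff) means "buffer full", 511 means "nothing logged yet". */
	if (WARN_ON_ONCE(pml_idx < -1 || pml_idx >= PML_ENTITY_NUM))
		goto reset;

	/* The index points at the next free entry; step to the last used one. */
	for (++pml_idx; pml_idx < PML_ENTITY_NUM; ++pml_idx) {
		u64 gpa = pml_buf[pml_idx];

		/* Logged GPAs are 4K-aligned, so this is simply the gfn. */
		mark_page_dirty(vcpu->kvm, gpa >> PAGE_SHIFT);
	}

reset:
	/* Let the hardware start logging from the top of the buffer again. */
	vmcs_write16(GUEST_PML_INDEX, PML_ENTITY_NUM - 1);
}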


Thanks,
-Kai



FI;




Re: [PATCH 5/6] KVM: x86: Add new dirty logging kvm_x86_ops for PML

2015-02-04 Thread Kai Huang


On 02/03/2015 11:53 PM, Radim Krčmář wrote:

2015-01-28 10:54+0800, Kai Huang:

This patch adds new kvm_x86_ops dirty logging hooks to enable/disable dirty
logging for a particular memory slot, and to flush potentially logged dirty
GPAs before reporting slot->dirty_bitmap to userspace.

KVM x86 common code calls these hooks when they are available, so the PML
logic can be kept VMX-specific. Other architectures won't be impacted, as
these hooks are NULL for them.

Signed-off-by: Kai Huang 
---
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -802,6 +802,31 @@ struct kvm_x86_ops {
+
+   /*
+* Arch-specific dirty logging hooks. These hooks are only supposed to
+* be valid if the specific arch has hardware-accelerated dirty logging
+* mechanism. Currently only for PML on VMX.
+*
+*  - slot_enable_log_dirty:
+*  called when enabling log dirty mode for the slot.

(I guess that "log dirty mode" isn't the meaning that people will think
  after seeing 'log_dirty' ...
  I'd at least change 'log_dirty' to 'dirty_log' in these names.)


+*  - slot_disable_log_dirty:
+*  called when disabling log dirty mode for the slot.
+*  also called when slot is created with log dirty disabled.
+*  - flush_log_dirty:
+*  called before reporting dirty_bitmap to userspace.
+*  - enable_log_dirty_pt_masked:
+*  called when reenabling log dirty for the GFNs in the mask after
+*  corresponding bits are cleared in slot->dirty_bitmap.

This name is very confusing ... I think we should hint that this is
called after we learn that the page has been written to and would like
to monitor it again.

Using something like collected/refresh?  (I'd have to do horrible things
to come up with a good name, sorry.)


--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3780,6 +3780,12 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct 
kvm_dirty_log *log)
  
  	mutex_lock(&kvm->slots_lock);
  
+	/*

+* Flush potentially hardware-cached dirty pages to dirty_bitmap.
+*/
+   if (kvm_x86_ops->flush_log_dirty)
+   kvm_x86_ops->flush_log_dirty(kvm);

(Flushing would make more sense in kvm_get_dirty_log_protect().)


+
r = kvm_get_dirty_log_protect(kvm, log, &is_dirty);
  
  	/*

@@ -7533,6 +7539,56 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
return 0;
  }
  
+static void kvm_mmu_slot_apply_flags(struct kvm *kvm,

+struct kvm_memory_slot *new)
+{
+   /* Still write protect RO slot */
+   if (new->flags & KVM_MEM_READONLY) {
+   kvm_mmu_slot_remove_write_access(kvm, new);

We didn't write protect RO slots before, does this patch depend on it?

No, PML doesn't depend on it to work. It was suggested by Paolo.

Thanks,
-Kai



@@ -7562,16 +7618,15 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
-   if ((change != KVM_MR_DELETE) && (new->flags & KVM_MEM_LOG_DIRTY_PAGES))
-   kvm_mmu_slot_remove_write_access(kvm, new);
+   if (change != KVM_MR_DELETE)
+   kvm_mmu_slot_apply_flags(kvm, new);




Re: [PATCH 6/6] KVM: VMX: Add PML support in VMX

2015-02-04 Thread Kai Huang


On 02/03/2015 11:18 PM, Radim Krčmář wrote:

2015-01-28 10:54+0800, Kai Huang:

This patch adds PML support in VMX. A new module parameter 'enable_pml' is added

(+module_param_named(pml, enable_pml, bool, S_IRUGO);)


to allow user to enable/disable it manually.

Signed-off-by: Kai Huang 
---
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 587149b..a139977 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -846,6 +846,24 @@ TRACE_EVENT(kvm_track_tsc,
+TRACE_EVENT(kvm_pml_full,
+   TP_PROTO(unsigned int vcpu_id),
+   TP_ARGS(vcpu_id),
+
+   TP_STRUCT__entry(
+   __field(unsigned int,   vcpu_id )
+   ),
+
+   TP_fast_assign(
+   __entry->vcpu_id = vcpu_id;
+   ),
+
+   TP_printk("vcpu %d: PML full", __entry->vcpu_id)
+);
+
  #endif /* CONFIG_X86_64 */

You have it protected by CONFIG_X86_64, but use it unconditionally.
Thanks for catching. This has been fixed by another patch, and the fix 
has also been merged by Paolo.


Thanks,
-Kai


(Also, we can get all this information from kvm_exit tracepoint;
  I'd just drop it.)


diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index c987374..de5ce82 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -516,6 +519,10 @@ struct vcpu_vmx {
+   /* Support for PML */
+#define PML_ENTITY_NUM 512

(Readers of this struct likely aren't interested in this number, so it
  would be nicer to define it outside.  I thought it is here to hint the
  amount of allocated pages, but PML_ENTITY_NUM / 512 isn't obvious.)


+   struct page *pml_pg;
@@ -4355,6 +4368,9 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx 
*vmx)
   a current VMCS12
*/
exec_control &= ~SECONDARY_EXEC_SHADOW_VMCS;
+   /* PML is enabled/disabled in creating/destorying vcpu */
+   exec_control &= ~SECONDARY_EXEC_ENABLE_PML;

What is the harm of enabling it here?

(SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES seems similar and does it.)
Because the PML feature detection is unconditional (meaning
SECONDARY_EXEC_ENABLE_PML is always in vmcs_config.cpu_based_2nd_exec_ctrl),
while the PML buffer is only created when the vcpu is created and is
controlled by the 'enable_pml' module parameter, if we always enable
SECONDARY_EXEC_ENABLE_PML here, no PML buffer will be created when PML is
disabled by the 'enable_pml' parameter. So it's better to enable it along
with creating the PML buffer.





+
return exec_control;
  }
@@ -6971,6 +7001,31 @@ static bool vmx_test_pir(struct kvm_vcpu *vcpu, int 
vector)
+static int handle_pml_full(struct kvm_vcpu *vcpu)
+{
+   unsigned long exit_qualification;
+
+   trace_kvm_pml_full(vcpu->vcpu_id);
+
+   exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+
+   /*
+* PML buffer FULL happened while executing iret from NMI,
+* "blocked by NMI" bit has to be set before next VM entry.
+*/
+   if (!(to_vmx(vcpu)->idt_vectoring_info & VECTORING_INFO_VALID_MASK) &&
+   cpu_has_virtual_nmis() &&
+   (exit_qualification & INTR_INFO_UNBLOCK_NMI))
+   vmcs_set_bits(GUEST_INTERRUPTIBILITY_INFO,
+   GUEST_INTR_STATE_NMI);

Relevant part of the specification is pasted from 27.2.2 (Information
for VM Exits Due to Vectored Events), which makes me think that this bit
is mirrored to IDT-vectoring information field ...

Isn't vmx_recover_nmi_blocking() enough?

(I see the same code in handle_ept_violation(), but wasn't that needed
  just because of a hardware error?)
This needs to be handled in both EPT violation and PML. If you look at
the PML specification (the link is in the cover letter), you can see the
definition of bit 12 of exit_qualification (section 1.5), which explains
the above code. The same definition of exit_qualification is in the EPT
violation part of Intel's SDM.



+
+   /*
+* PML buffer already flushed at beginning of VMEXIT. Nothing to do
+* here.., and there's no userspace involvement needed for PML.
+*/
+   return 1;
+}
@@ -7325,6 +7381,89 @@ static void vmx_get_exit_info(struct kvm_vcpu *vcpu, u64 
*info1, u64 *info2)
+static int vmx_enable_pml(struct vcpu_vmx *vmx)
+{
+   struct page *pml_pg;
+   u32 exec_control;
+
+   pml_pg = alloc_page(GFP_KERNEL | __GFP_ZERO);
+   if (!pml_pg)
+   return -ENOMEM;
+
+   vmx->pml_pg = pml_pg;

(It's safe to use vmx->pml_pg directly.)


+
+   vmcs_write64(PML_ADDRESS, page_to_phys(vmx->pml_pg));
+   vmcs_write16(GUEST_PML_INDEX, PML_ENTITY_NUM - 1);
+
+   exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
+   exec_control |= SECONDARY_EXEC_ENABLE_PML;
+   vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control);
+
+ 

Re: [PATCH 2/6] KVM: MMU: Add mmu help functions to support PML

2015-02-04 Thread Kai Huang


On 02/04/2015 01:34 AM, Radim Krčmář wrote:

2015-01-28 10:54+0800, Kai Huang:

This patch adds new MMU-layer functions to clear/set the D-bit for a memory
slot, and to write protect superpages for a memory slot.

In the case of PML, the CPU automatically logs the dirty GPA to the PML buffer
when it updates the D-bit from 0 to 1; therefore we don't have to write protect
4K pages. Instead, we only need to clear the D-bit in order to log that GPA.

For superpages, we still write protect them and let the page fault code handle
dirty page logging, as we still need to split superpages into 4K pages under
PML.

As PML is always enabled during the guest's lifetime, to eliminate unnecessary
PML GPA logging we set the D-bit manually for slots with dirty logging
disabled.

Signed-off-by: Kai Huang 

This message contains just several  stylistic tips, I wasn't able to
learn enough about large pages to review today.

Hi Radim,

Thanks for your review and sorry for the late reply. I was working on 
something else these days.


There are two reasons we still write protect large pages in the case of
PML. One is that we still need to split large pages into 4K pages in dirty
logging mode, and PML doesn't split large pages automatically. The second
is that PML only logs the dirty GPA and doesn't distinguish whether the
dirty GPA is in a 4K page or a large page.


For the rest of your comments: as this patch series is already in Paolo's
queue branch, I think I can send further patches to enhance it, if Paolo
agrees the enhancement is needed.


Thanks,
-Kai



--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index b18e65c..c438224 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4368,6 +4448,121 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, 
int slot)
+void kvm_mmu_slot_largepage_remove_write_access(struct kvm *kvm,
+   struct kvm_memory_slot *memslot)
+{
+   gfn_t last_gfn;
+   int i;
+   bool flush = false;
+
+   last_gfn = memslot->base_gfn + memslot->npages - 1;
+
+   spin_lock(&kvm->mmu_lock);
+
+   for (i = PT_PAGE_TABLE_LEVEL + 1; /* skip rmap for 4K page */

("+ 1" is the only difference from kvm_mmu_slot_remove_write_access();
  new argument, initial level, would be better.

  Btw, PT_PAGE_TABLE_LEVEL + 1 == PT_DIRECTORY_LEVEL)


+i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
+   unsigned long *rmapp;
+   unsigned long last_index, index;
+
+   rmapp = memslot->arch.rmap[i - PT_PAGE_TABLE_LEVEL];
+   last_index = gfn_to_index(last_gfn, memslot->base_gfn, i);
+
+   for (index = 0; index <= last_index; ++index, ++rmapp) {
+   if (*rmapp)
+   flush |= __rmap_write_protect(kvm, rmapp,
+   false);
+
+   if (need_resched() || spin_needbreak(&kvm->mmu_lock))
+   cond_resched_lock(&kvm->mmu_lock);
+   }
+   }
+   spin_unlock(&kvm->mmu_lock);
+
+   /* see kvm_mmu_slot_remove_write_access */
+   lockdep_assert_held(&kvm->slots_lock);
+
+   if (flush)
+   kvm_flush_remote_tlbs(kvm);
+}
+void kvm_mmu_slot_set_dirty(struct kvm *kvm,
+   struct kvm_memory_slot *memslot)
+{
+   gfn_t last_gfn;
+   int i;
+   bool flush = false;
+
+   last_gfn = memslot->base_gfn + memslot->npages - 1;
+
+   spin_lock(&kvm->mmu_lock);
+
+   for (i = PT_PAGE_TABLE_LEVEL;
+i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
+   unsigned long *rmapp;
+   unsigned long last_index, index;
+
+   rmapp = memslot->arch.rmap[i - PT_PAGE_TABLE_LEVEL];
+   last_index = gfn_to_index(last_gfn, memslot->base_gfn, i);
+
+   for (index = 0; index <= last_index; ++index, ++rmapp) {
+   if (*rmapp)
+   flush |= __rmap_set_dirty(kvm, rmapp);

(And yet another similar walker ... We can either pass a worker function
  accepting 'kvm' and 'rmapp', use a 'switch' with a new operations enum,
  or have code duplication.  Paolo?)
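
One possible shape for that common walker (slot_handle_level() and
slot_rmap_handler are names assumed for this sketch, not part of the posted
series):

/* Caller must hold kvm->mmu_lock. */
typedef bool (*slot_rmap_handler)(struct kvm *kvm, unsigned long *rmapp);

static bool slot_handle_level(struct kvm *kvm, struct kvm_memory_slot *memslot,
			      slot_rmap_handler fn, int start_level, int end_level)
{
	gfn_t last_gfn = memslot->base_gfn + memslot->npages - 1;
	bool flush = false;
	int i;

	for (i = start_level; i <= end_level; ++i) {
		unsigned long *rmapp = memslot->arch.rmap[i - PT_PAGE_TABLE_LEVEL];
		unsigned long last_index = gfn_to_index(last_gfn, memslot->base_gfn, i);
		unsigned long index;

		for (index = 0; index <= last_index; ++index, ++rmapp) {
			if (*rmapp)
				flush |= fn(kvm, rmapp);

			if (need_resched() || spin_needbreak(&kvm->mmu_lock))
				cond_resched_lock(&kvm->mmu_lock);
		}
	}

	return flush;
}

kvm_mmu_slot_set_dirty(), kvm_mmu_slot_leaf_clear_dirty() and the
write-protect variants would then shrink to lock/walk/flush wrappers that
pass __rmap_set_dirty, __rmap_clear_dirty or a thin wrapper around
__rmap_write_protect, plus the starting and ending levels.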


+
+   if (need_resched() || spin_needbreak(&kvm->mmu_lock))
+   cond_resched_lock(&kvm->mmu_lock);
+   }
+   }
+
+   spin_unlock(&kvm->mmu_lock);
+
+   lockdep_assert_held(&kvm->slots_lock);
+
+   /* see kvm_mmu_slot_leaf_clear_dirty */
+   if (flush)
+   kvm_flush_remote_tlbs(kvm);
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_slot_set_dirty);
+void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
+  struct kvm_memory_slot *memslot)
+{
+  

[PATCH] KVM: trace: fix trace_kvm_pml_full build error on i386

2015-01-29 Thread Kai Huang
Fix the trace_kvm_pml_full build error on i386, introduced by the commit below:

commit 19cf3eb32410 ("KVM: VMX: Add PML support in VMX")

In the above commit, trace_kvm_pml_full was only defined under CONFIG_X86_64,
which breaks the build on i386. Fix it by moving the trace_kvm_pml_full
definition out of CONFIG_X86_64.

Tested with Fengguang's .config, and also did sanity test on x86_64.

Signed-off-by: Kai Huang 
---
 arch/x86/kvm/trace.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index a139977..7c7bc8b 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -846,6 +846,8 @@ TRACE_EVENT(kvm_track_tsc,
  __print_symbolic(__entry->host_clock, host_clocks))
 );
 
+#endif /* CONFIG_X86_64 */
+
 /*
  * Tracepoint for PML full VMEXIT.
  */
@@ -864,8 +866,6 @@ TRACE_EVENT(kvm_pml_full,
TP_printk("vcpu %d: PML full", __entry->vcpu_id)
 );
 
-#endif /* CONFIG_X86_64 */
-
 TRACE_EVENT(kvm_ple_window,
TP_PROTO(bool grow, unsigned int vcpu_id, int new, int old),
TP_ARGS(grow, vcpu_id, new, old),
-- 
2.1.0



[PATCH 1/6] KVM: Rename kvm_arch_mmu_write_protect_pt_masked to be more generic for log dirty

2015-01-27 Thread Kai Huang
We don't have to write protect guest memory for dirty logging if the
architecture supports hardware dirty logging, such as PML on VMX, so rename
the function to be more generic.

Signed-off-by: Kai Huang 
---
 arch/arm/kvm/mmu.c   | 18 --
 arch/x86/kvm/mmu.c   | 21 +++--
 include/linux/kvm_host.h |  2 +-
 virt/kvm/kvm_main.c  |  2 +-
 4 files changed, 37 insertions(+), 6 deletions(-)

diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 74aeaba..6034697 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -1081,7 +1081,7 @@ void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot)
 }
 
 /**
- * kvm_arch_mmu_write_protect_pt_masked() - write protect dirty pages
+ * kvm_mmu_write_protect_pt_masked() - write protect dirty pages
  * @kvm:   The KVM pointer
  * @slot:  The memory slot associated with mask
  * @gfn_offset:The gfn offset in memory slot
@@ -1091,7 +1091,7 @@ void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot)
  * Walks bits set in mask write protects the associated pte's. Caller must
  * acquire kvm_mmu_lock.
  */
-void kvm_arch_mmu_write_protect_pt_masked(struct kvm *kvm,
+static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
struct kvm_memory_slot *slot,
gfn_t gfn_offset, unsigned long mask)
 {
@@ -1102,6 +1102,20 @@ void kvm_arch_mmu_write_protect_pt_masked(struct kvm 
*kvm,
stage2_wp_range(kvm, start, end);
 }
 
+/*
+ * kvm_arch_mmu_enable_log_dirty_pt_masked - enable dirty logging for selected
+ * dirty pages.
+ *
+ * It calls kvm_mmu_write_protect_pt_masked to write protect selected pages to
+ * enable dirty logging for them.
+ */
+void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
+   struct kvm_memory_slot *slot,
+   gfn_t gfn_offset, unsigned long mask)
+{
+   kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
+}
+
 static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
  struct kvm_memory_slot *memslot, unsigned long hva,
  unsigned long fault_status)
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 0ed9f79..b18e65c 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1216,7 +1216,7 @@ static bool __rmap_write_protect(struct kvm *kvm, 
unsigned long *rmapp,
 }
 
 /**
- * kvm_arch_mmu_write_protect_pt_masked - write protect selected PT level pages
+ * kvm_mmu_write_protect_pt_masked - write protect selected PT level pages
  * @kvm: kvm instance
  * @slot: slot to protect
  * @gfn_offset: start of the BITS_PER_LONG pages we care about
@@ -1225,7 +1225,7 @@ static bool __rmap_write_protect(struct kvm *kvm, 
unsigned long *rmapp,
  * Used when we do not need to care about huge page mappings: e.g. during dirty
  * logging we do not have any such mappings.
  */
-void kvm_arch_mmu_write_protect_pt_masked(struct kvm *kvm,
+static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
 struct kvm_memory_slot *slot,
 gfn_t gfn_offset, unsigned long mask)
 {
@@ -1241,6 +1241,23 @@ void kvm_arch_mmu_write_protect_pt_masked(struct kvm 
*kvm,
}
 }
 
+/**
+ * kvm_arch_mmu_enable_log_dirty_pt_masked - enable dirty logging for selected
+ * PT level pages.
+ *
+ * It calls kvm_mmu_write_protect_pt_masked to write protect selected pages to
+ * enable dirty logging for them.
+ *
+ * Used when we do not need to care about huge page mappings: e.g. during dirty
+ * logging we do not have any such mappings.
+ */
+void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
+   struct kvm_memory_slot *slot,
+   gfn_t gfn_offset, unsigned long mask)
+{
+   kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
+}
+
 static bool rmap_write_protect(struct kvm *kvm, u64 gfn)
 {
struct kvm_memory_slot *slot;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7d67195..32d0575 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -615,7 +615,7 @@ int kvm_get_dirty_log(struct kvm *kvm,
 int kvm_get_dirty_log_protect(struct kvm *kvm,
struct kvm_dirty_log *log, bool *is_dirty);
 
-void kvm_arch_mmu_write_protect_pt_masked(struct kvm *kvm,
+void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
struct kvm_memory_slot *slot,
gfn_t gfn_offset,
unsigned long mask);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index a8490f0..0c28176 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1059,7 +1059,7 @@ int kvm_get_dirty_log_protect(struct kvm *kvm,
dirty_bitmap_buffer[i] = mask;
 
offset = i * BITS_PER_LONG;
-   kvm_arch_mmu_write_protect_pt_masked(kvm, memsl

[PATCH 5/6] KVM: x86: Add new dirty logging kvm_x86_ops for PML

2015-01-27 Thread Kai Huang
This patch adds new kvm_x86_ops dirty logging hooks to enable/disable dirty
logging for a particular memory slot, and to flush potentially logged dirty
GPAs before reporting slot->dirty_bitmap to userspace.

KVM x86 common code calls these hooks when they are available, so the PML
logic can be kept VMX-specific. Other architectures won't be impacted, as
these hooks are NULL for them.

Signed-off-by: Kai Huang 
---
 arch/x86/include/asm/kvm_host.h | 25 +++
 arch/x86/kvm/mmu.c  |  6 +++-
 arch/x86/kvm/x86.c  | 71 -
 3 files changed, 93 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 67a98d7..57916ec 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -802,6 +802,31 @@ struct kvm_x86_ops {
int (*check_nested_events)(struct kvm_vcpu *vcpu, bool external_intr);
 
void (*sched_in)(struct kvm_vcpu *kvm, int cpu);
+
+   /*
+* Arch-specific dirty logging hooks. These hooks are only supposed to
+* be valid if the specific arch has hardware-accelerated dirty logging
+* mechanism. Currently only for PML on VMX.
+*
+*  - slot_enable_log_dirty:
+*  called when enabling log dirty mode for the slot.
+*  - slot_disable_log_dirty:
+*  called when disabling log dirty mode for the slot.
+*  also called when slot is created with log dirty disabled.
+*  - flush_log_dirty:
+*  called before reporting dirty_bitmap to userspace.
+*  - enable_log_dirty_pt_masked:
+*  called when reenabling log dirty for the GFNs in the mask after
+*  corresponding bits are cleared in slot->dirty_bitmap.
+*/
+   void (*slot_enable_log_dirty)(struct kvm *kvm,
+ struct kvm_memory_slot *slot);
+   void (*slot_disable_log_dirty)(struct kvm *kvm,
+  struct kvm_memory_slot *slot);
+   void (*flush_log_dirty)(struct kvm *kvm);
+   void (*enable_log_dirty_pt_masked)(struct kvm *kvm,
+  struct kvm_memory_slot *slot,
+  gfn_t offset, unsigned long mask);
 };
 
 struct kvm_arch_async_pf {
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 6c24af3..c5833ca 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1335,7 +1335,11 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm 
*kvm,
struct kvm_memory_slot *slot,
gfn_t gfn_offset, unsigned long mask)
 {
-   kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
+   if (kvm_x86_ops->enable_log_dirty_pt_masked)
+   kvm_x86_ops->enable_log_dirty_pt_masked(kvm, slot, gfn_offset,
+   mask);
+   else
+   kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
 }
 
 static bool rmap_write_protect(struct kvm *kvm, u64 gfn)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3a7fcff..442ee7d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3780,6 +3780,12 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct 
kvm_dirty_log *log)
 
mutex_lock(&kvm->slots_lock);
 
+   /*
+* Flush potentially hardware-cached dirty pages to dirty_bitmap.
+*/
+   if (kvm_x86_ops->flush_log_dirty)
+   kvm_x86_ops->flush_log_dirty(kvm);
+
r = kvm_get_dirty_log_protect(kvm, log, &is_dirty);
 
/*
@@ -7533,6 +7539,56 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
return 0;
 }
 
+static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
+struct kvm_memory_slot *new)
+{
+   /* Still write protect RO slot */
+   if (new->flags & KVM_MEM_READONLY) {
+   kvm_mmu_slot_remove_write_access(kvm, new);
+   return;
+   }
+
+   /*
+* Call kvm_x86_ops dirty logging hooks when they are valid.
+*
+* kvm_x86_ops->slot_disable_log_dirty is called when:
+*
+*  - KVM_MR_CREATE with dirty logging is disabled
+*  - KVM_MR_FLAGS_ONLY with dirty logging is disabled in new flag
+*
+* The reason is, in case of PML, we need to set D-bit for any slots
+* with dirty logging disabled in order to eliminate unnecessary GPA
+* logging in PML buffer (and potential PML buffer full VMEXIT). This
+* guarantees leaving PML enabled during guest's lifetime won't have
+* any additional overhead from PML when guest is running with dirty
+* logging disabled for memory slots.
+*
+* kvm_x86_ops->slot_enable_log_dirty is called when switching new slot
+* to dirty logging mode.
+*
+
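
The VMX side that plugs into these hooks is in patch 6/6; roughly, it would
look like the sketch below. The vmx_* helper names here are assumptions, and
the flush helper is taken to simply walk each vcpu's PML buffer as discussed
elsewhere in the series:

static void vmx_slot_enable_log_dirty(struct kvm *kvm,
				      struct kvm_memory_slot *slot)
{
	/* Clear D-bits for 4K mappings, write protect the large mappings. */
	kvm_mmu_slot_leaf_clear_dirty(kvm, slot);
	kvm_mmu_slot_largepage_remove_write_access(kvm, slot);
}

static void vmx_slot_disable_log_dirty(struct kvm *kvm,
				       struct kvm_memory_slot *slot)
{
	/* Set D-bits again so normal guest writes stop hitting the PML buffer. */
	kvm_mmu_slot_set_dirty(kvm, slot);
}

static void vmx_enable_log_dirty_pt_masked(struct kvm *kvm,
					   struct kvm_memory_slot *memslot,
					   gfn_t offset, unsigned long mask)
{
	kvm_mmu_clear_dirty_pt_masked(kvm, memslot, offset, mask);
}

static void vmx_flush_log_dirty(struct kvm *kvm)
{
	struct kvm_vcpu *vcpu;
	int i;

	/* Pull whatever each vcpu's hardware buffer has logged so far. */
	kvm_for_each_vcpu(i, vcpu, kvm)
		vmx_flush_pml_buffer(vcpu);	/* per-vcpu PML buffer walk */
}

with the kvm_x86_ops instance in vmx.c filling in .slot_enable_log_dirty,
.slot_disable_log_dirty, .flush_log_dirty and .enable_log_dirty_pt_masked
accordingly when enable_pml is set.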

[PATCH 2/6] KVM: MMU: Add mmu help functions to support PML

2015-01-27 Thread Kai Huang
This patch adds new MMU-layer functions to clear/set the D-bit for a memory
slot, and to write protect superpages for a memory slot.

In the case of PML, the CPU automatically logs the dirty GPA to the PML buffer
when it updates the D-bit from 0 to 1; therefore we don't have to write protect
4K pages. Instead, we only need to clear the D-bit in order to log that GPA.

For superpages, we still write protect them and let the page fault code handle
dirty page logging, as we still need to split superpages into 4K pages under
PML.

As PML is always enabled during the guest's lifetime, to eliminate unnecessary
PML GPA logging we set the D-bit manually for slots with dirty logging
disabled.

Signed-off-by: Kai Huang 
---
 arch/x86/include/asm/kvm_host.h |   9 ++
 arch/x86/kvm/mmu.c  | 195 
 2 files changed, 204 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 843bea0..4f6369b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -835,6 +835,15 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 
accessed_mask,
 
 void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
 void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot);
+void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
+  struct kvm_memory_slot *memslot);
+void kvm_mmu_slot_largepage_remove_write_access(struct kvm *kvm,
+   struct kvm_memory_slot *memslot);
+void kvm_mmu_slot_set_dirty(struct kvm *kvm,
+   struct kvm_memory_slot *memslot);
+void kvm_mmu_clear_dirty_pt_masked(struct kvm *kvm,
+  struct kvm_memory_slot *slot,
+  gfn_t gfn_offset, unsigned long mask);
 void kvm_mmu_zap_all(struct kvm *kvm);
 void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm);
 unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm);
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index b18e65c..c438224 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1215,6 +1215,60 @@ static bool __rmap_write_protect(struct kvm *kvm, 
unsigned long *rmapp,
return flush;
 }
 
+static bool spte_clear_dirty(struct kvm *kvm, u64 *sptep)
+{
+   u64 spte = *sptep;
+
+   rmap_printk("rmap_clear_dirty: spte %p %llx\n", sptep, *sptep);
+
+   spte &= ~shadow_dirty_mask;
+
+   return mmu_spte_update(sptep, spte);
+}
+
+static bool __rmap_clear_dirty(struct kvm *kvm, unsigned long *rmapp)
+{
+   u64 *sptep;
+   struct rmap_iterator iter;
+   bool flush = false;
+
+   for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
+   BUG_ON(!(*sptep & PT_PRESENT_MASK));
+
+   flush |= spte_clear_dirty(kvm, sptep);
+   sptep = rmap_get_next(&iter);
+   }
+
+   return flush;
+}
+
+static bool spte_set_dirty(struct kvm *kvm, u64 *sptep)
+{
+   u64 spte = *sptep;
+
+   rmap_printk("rmap_set_dirty: spte %p %llx\n", sptep, *sptep);
+
+   spte |= shadow_dirty_mask;
+
+   return mmu_spte_update(sptep, spte);
+}
+
+static bool __rmap_set_dirty(struct kvm *kvm, unsigned long *rmapp)
+{
+   u64 *sptep;
+   struct rmap_iterator iter;
+   bool flush = false;
+
+   for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
+   BUG_ON(!(*sptep & PT_PRESENT_MASK));
+
+   flush |= spte_set_dirty(kvm, sptep);
+   sptep = rmap_get_next(&iter);
+   }
+
+   return flush;
+}
+
 /**
  * kvm_mmu_write_protect_pt_masked - write protect selected PT level pages
  * @kvm: kvm instance
@@ -1242,6 +1296,32 @@ static void kvm_mmu_write_protect_pt_masked(struct kvm 
*kvm,
 }
 
 /**
+ * kvm_mmu_clear_dirty_pt_masked - clear MMU D-bit for PT level pages
+ * @kvm: kvm instance
+ * @slot: slot to clear D-bit
+ * @gfn_offset: start of the BITS_PER_LONG pages we care about
+ * @mask: indicates which pages we should clear D-bit
+ *
+ * Used for PML to re-log the dirty GPAs after userspace querying dirty_bitmap.
+ */
+void kvm_mmu_clear_dirty_pt_masked(struct kvm *kvm,
+struct kvm_memory_slot *slot,
+gfn_t gfn_offset, unsigned long mask)
+{
+   unsigned long *rmapp;
+
+   while (mask) {
+   rmapp = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
+ PT_PAGE_TABLE_LEVEL, slot);
+   __rmap_clear_dirty(kvm, rmapp);
+
+   /* clear the first set bit */
+   mask &= mask - 1;
+   }
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_clear_dirty_pt_masked);
+
+/**
  * kvm_arch_mmu_enable_log_dirty_pt_masked - enable dirty logging for selected
  * PT level pages.
  *
@@ -4368,6 +4448,121 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, 
int slot)
kvm_flush_remote_tl

[PATCH 4/6] KVM: x86: Change parameter of kvm_mmu_slot_remove_write_access

2015-01-27 Thread Kai Huang
This patch changes the second parameter of kvm_mmu_slot_remove_write_access
from 'slot id' to 'struct kvm_memory_slot *' to align with the kvm_x86_ops
dirty logging hooks, which will be introduced in a later patch.

A better way would be to change the second parameter of
kvm_arch_commit_memory_region from 'struct kvm_userspace_memory_region *' to
'struct kvm_memory_slot *new', but that requires changes to other non-x86
architectures too, so avoid it for now.

Signed-off-by: Kai Huang 
---
 arch/x86/include/asm/kvm_host.h |  3 ++-
 arch/x86/kvm/mmu.c  |  5 ++---
 arch/x86/kvm/x86.c  | 10 +++---
 3 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4f6369b..67a98d7 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -834,7 +834,8 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
u64 dirty_mask, u64 nx_mask, u64 x_mask);
 
 void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
-void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot);
+void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
+ struct kvm_memory_slot *memslot);
 void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
   struct kvm_memory_slot *memslot);
 void kvm_mmu_slot_largepage_remove_write_access(struct kvm *kvm,
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index fb35535..6c24af3 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4410,14 +4410,13 @@ void kvm_mmu_setup(struct kvm_vcpu *vcpu)
init_kvm_mmu(vcpu);
 }
 
-void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
+void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
+ struct kvm_memory_slot *memslot)
 {
-   struct kvm_memory_slot *memslot;
gfn_t last_gfn;
int i;
bool flush = false;
 
-   memslot = id_to_memslot(kvm->memslots, slot);
last_gfn = memslot->base_gfn + memslot->npages - 1;
 
spin_lock(&kvm->mmu_lock);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1e10e3f..3a7fcff 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7538,7 +7538,7 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
const struct kvm_memory_slot *old,
enum kvm_mr_change change)
 {
-
+   struct kvm_memory_slot *new;
int nr_mmu_pages = 0;
 
if ((mem->slot >= KVM_USER_MEM_SLOTS) && (change == KVM_MR_DELETE)) {
@@ -7557,6 +7557,10 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
 
if (nr_mmu_pages)
kvm_mmu_change_mmu_pages(kvm, nr_mmu_pages);
+
+   /* It's OK to get 'new' slot here as it has already been installed */
+   new = id_to_memslot(kvm->memslots, mem->slot);
+
/*
 * Write protect all pages for dirty logging.
 *
@@ -7566,8 +7570,8 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
 *
 * See the comments in fast_page_fault().
 */
-   if ((change != KVM_MR_DELETE) && (mem->flags & KVM_MEM_LOG_DIRTY_PAGES))
-   kvm_mmu_slot_remove_write_access(kvm, mem->slot);
+   if ((change != KVM_MR_DELETE) && (new->flags & KVM_MEM_LOG_DIRTY_PAGES))
+   kvm_mmu_slot_remove_write_access(kvm, new);
 }
 
 void kvm_arch_flush_shadow_all(struct kvm *kvm)
-- 
2.1.0



[PATCH 6/6] KVM: VMX: Add PML support in VMX

2015-01-27 Thread Kai Huang
This patch adds PML support in VMX. A new module parameter 'enable_pml' is
added to allow the user to enable/disable it manually.
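(Assuming the parameter is registered under the name "pml" via
module_param_named(pml, ...) as in the diff below, it can be turned off at
load time with something like "modprobe kvm-intel pml=0".)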

Signed-off-by: Kai Huang 
---
 arch/x86/include/asm/vmx.h  |   4 +
 arch/x86/include/uapi/asm/vmx.h |   1 +
 arch/x86/kvm/trace.h|  18 
 arch/x86/kvm/vmx.c  | 195 +++-
 arch/x86/kvm/x86.c  |   1 +
 5 files changed, 218 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 45afaee..da772ed 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -69,6 +69,7 @@
 #define SECONDARY_EXEC_PAUSE_LOOP_EXITING  0x0400
 #define SECONDARY_EXEC_ENABLE_INVPCID  0x1000
 #define SECONDARY_EXEC_SHADOW_VMCS  0x4000
+#define SECONDARY_EXEC_ENABLE_PML   0x0002
 #define SECONDARY_EXEC_XSAVES  0x0010
 
 
@@ -121,6 +122,7 @@ enum vmcs_field {
GUEST_LDTR_SELECTOR = 0x080c,
GUEST_TR_SELECTOR   = 0x080e,
GUEST_INTR_STATUS   = 0x0810,
+   GUEST_PML_INDEX = 0x0812,
HOST_ES_SELECTOR= 0x0c00,
HOST_CS_SELECTOR= 0x0c02,
HOST_SS_SELECTOR= 0x0c04,
@@ -140,6 +142,8 @@ enum vmcs_field {
VM_EXIT_MSR_LOAD_ADDR_HIGH  = 0x2009,
VM_ENTRY_MSR_LOAD_ADDR  = 0x200a,
VM_ENTRY_MSR_LOAD_ADDR_HIGH = 0x200b,
+   PML_ADDRESS = 0x200e,
+   PML_ADDRESS_HIGH= 0x200f,
TSC_OFFSET  = 0x2010,
TSC_OFFSET_HIGH = 0x2011,
VIRTUAL_APIC_PAGE_ADDR  = 0x2012,
diff --git a/arch/x86/include/uapi/asm/vmx.h b/arch/x86/include/uapi/asm/vmx.h
index ff2b8e2..c5f1a1d 100644
--- a/arch/x86/include/uapi/asm/vmx.h
+++ b/arch/x86/include/uapi/asm/vmx.h
@@ -73,6 +73,7 @@
 #define EXIT_REASON_XSETBV  55
 #define EXIT_REASON_APIC_WRITE  56
 #define EXIT_REASON_INVPCID 58
+#define EXIT_REASON_PML_FULL62
 #define EXIT_REASON_XSAVES  63
 #define EXIT_REASON_XRSTORS 64
 
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 587149b..a139977 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -846,6 +846,24 @@ TRACE_EVENT(kvm_track_tsc,
  __print_symbolic(__entry->host_clock, host_clocks))
 );
 
+/*
+ * Tracepoint for PML full VMEXIT.
+ */
+TRACE_EVENT(kvm_pml_full,
+   TP_PROTO(unsigned int vcpu_id),
+   TP_ARGS(vcpu_id),
+
+   TP_STRUCT__entry(
+   __field(unsigned int,   vcpu_id )
+   ),
+
+   TP_fast_assign(
+   __entry->vcpu_id= vcpu_id;
+   ),
+
+   TP_printk("vcpu %d: PML full", __entry->vcpu_id)
+);
+
 #endif /* CONFIG_X86_64 */
 
 TRACE_EVENT(kvm_ple_window,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index c987374..de5ce82 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -101,6 +101,9 @@ module_param(nested, bool, S_IRUGO);
 
 static u64 __read_mostly host_xss;
 
+static bool __read_mostly enable_pml = 1;
+module_param_named(pml, enable_pml, bool, S_IRUGO);
+
 #define KVM_GUEST_CR0_MASK (X86_CR0_NW | X86_CR0_CD)
 #define KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST (X86_CR0_WP | X86_CR0_NE)
 #define KVM_VM_CR0_ALWAYS_ON   \
@@ -516,6 +519,10 @@ struct vcpu_vmx {
/* Dynamic PLE window. */
int ple_window;
bool ple_window_dirty;
+
+   /* Support for PML */
+#define PML_ENTITY_NUM 512
+   struct page *pml_pg;
 };
 
 enum segment_cache_field {
@@ -1068,6 +1075,11 @@ static inline bool cpu_has_vmx_shadow_vmcs(void)
SECONDARY_EXEC_SHADOW_VMCS;
 }
 
+static inline bool cpu_has_vmx_pml(void)
+{
+   return vmcs_config.cpu_based_2nd_exec_ctrl & SECONDARY_EXEC_ENABLE_PML;
+}
+
 static inline bool report_flexpriority(void)
 {
return flexpriority_enabled;
@@ -2924,7 +2936,8 @@ static __init int setup_vmcs_config(struct vmcs_config 
*vmcs_conf)
SECONDARY_EXEC_APIC_REGISTER_VIRT |
SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |
SECONDARY_EXEC_SHADOW_VMCS |
-   SECONDARY_EXEC_XSAVES;
+   SECONDARY_EXEC_XSAVES |
+   SECONDARY_EXEC_ENABLE_PML;
if (adjust_vmx_controls(min2, opt2,
MSR_IA32_VMX_PROCBASED_CTLS2,
&_cpu_based_2nd_exec_control) < 0)
@@ -4355,6 +4368,9 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx 
*vmx)
   a current VMCS12
*/
exec_control &= ~SECONDARY_EXEC_SHADOW_VMCS;
+ 

[PATCH 3/6] KVM: MMU: Explicitly set D-bit for writable spte.

2015-01-27 Thread Kai Huang
This patch avoids unnecessary dirty GPA logging to the PML buffer in the EPT
violation path by setting the D-bit manually prior to the occurrence of the
write from the guest.

We only set the D-bit manually in set_spte, and leave the fast_page_fault path
unchanged, as fast_page_fault is very unlikely to happen in the case of PML.

For the hva <-> pa change case, the spte is either updated to read-only (host
pte is read-only) or dropped (host pte is writable), and both cases are
handled by the above changes, therefore no further change is necessary.

Signed-off-by: Kai Huang 
---
 arch/x86/kvm/mmu.c | 18 +-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index c438224..fb35535 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2597,8 +2597,14 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
}
}
 
-   if (pte_access & ACC_WRITE_MASK)
+   if (pte_access & ACC_WRITE_MASK) {
mark_page_dirty(vcpu->kvm, gfn);
+   /*
+* Explicitly set dirty bit. It is used to eliminate unnecessary
+* dirty GPA logging in case of PML is enabled on VMX.
+*/
+   spte |= shadow_dirty_mask;
+   }
 
 set_pte:
if (mmu_spte_update(sptep, spte))
@@ -2914,6 +2920,16 @@ fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 */
gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->spt);
 
+   /*
+* Theoretically we could also set the dirty bit (and flush the TLB)
+* here to eliminate the unnecessary PML logging. See the comments in
+* set_spte. But since fast_page_fault is very unlikely to happen when
+* PML is in use, we leave it unchanged. This may cause the same GPA
+* to be logged in the PML buffer again when the write really happens,
+* and mark_page_dirty to eventually be called twice for it, but that
+* is harmless. It also avoids the TLB flush needed after setting the
+* dirty bit, so non-PML cases won't be impacted.
+*/
if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) == spte)
mark_page_dirty(vcpu->kvm, gfn);
 
-- 
2.1.0



[PATCH 0/6] KVM: VMX: Page Modification Logging (PML) support

2015-01-27 Thread Kai Huang
3 (== PML score in test a)

percent: 96.17% 91.99%  100%

performance gain:   96.17% - 91.99% = 4.18%

Booting a guest without a graphics window (--nographic)

PML WP  No monitoring thread

104778  98967
104856  99380
103783  99406
105210  100638
106218  99763
105475  99287

avg:105053  99573   109756 (== PML score in test a)

percent: 95.72% 90.72%  100%

performance gain:  95.72% - 90.72% = 5%

So there is a noticeable performance gain (around 4%~5%) for PML compared to
Write Protection.


Kai Huang (6):
  KVM: Rename kvm_arch_mmu_write_protect_pt_masked to be more generic
for log dirty
  KVM: MMU: Add mmu help functions to support PML
  KVM: MMU: Explicitly set D-bit for writable spte.
  KVM: x86: Change parameter of kvm_mmu_slot_remove_write_access
  KVM: x86: Add new dirty logging kvm_x86_ops for PML
  KVM: VMX: Add PML support in VMX

 arch/arm/kvm/mmu.c  |  18 ++-
 arch/x86/include/asm/kvm_host.h |  37 +-
 arch/x86/include/asm/vmx.h  |   4 +
 arch/x86/include/uapi/asm/vmx.h |   1 +
 arch/x86/kvm/mmu.c  | 243 +++-
 arch/x86/kvm/trace.h|  18 +++
 arch/x86/kvm/vmx.c  | 195 +++-
 arch/x86/kvm/x86.c  |  78 +++--
 include/linux/kvm_host.h|   2 +-
 virt/kvm/kvm_main.c |   2 +-
 10 files changed, 577 insertions(+), 21 deletions(-)

-- 
2.1.0



[PATCH] Optimize TLB flush in kvm_mmu_slot_remove_write_access.

2015-01-11 Thread Kai Huang
Apparently no TLB flush is needed when there is no valid rmap in the memory slot.

Signed-off-by: Kai Huang 
---
 arch/x86/kvm/mmu.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f83fc6c..d43bf50 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4309,6 +4309,7 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, 
int slot)
struct kvm_memory_slot *memslot;
gfn_t last_gfn;
int i;
+   bool flush = false;
 
memslot = id_to_memslot(kvm->memslots, slot);
last_gfn = memslot->base_gfn + memslot->npages - 1;
@@ -4325,7 +4326,8 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
 
for (index = 0; index <= last_index; ++index, ++rmapp) {
if (*rmapp)
-   __rmap_write_protect(kvm, rmapp, false);
+   flush |= __rmap_write_protect(kvm, rmapp,
+   false);
 
if (need_resched() || spin_needbreak(&kvm->mmu_lock))
cond_resched_lock(&kvm->mmu_lock);
@@ -4352,7 +4354,8 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
 * instead of PT_WRITABLE_MASK, that means it does not depend
 * on PT_WRITABLE_MASK anymore.
 */
-   kvm_flush_remote_tlbs(kvm);
+   if (flush)
+   kvm_flush_remote_tlbs(kvm);
 }
 
 #define BATCH_ZAP_PAGES10
-- 
2.1.0



[PATCH] Flush TLB when D bit is manually changed.

2015-01-09 Thread Kai Huang
When software changes the D bit (either from 1 to 0, or from 0 to 1), the
corresponding TLB entry in the hardware won't be updated immediately. We should
flush it to guarantee the consistency of the D bit between the TLB and the MMU
page table in memory. This is required if some specific hardware feature relies
on the D-bit status to do specific things.

A sanity test was done on my machine with an Intel processor.

Signed-off-by: Kai Huang 
---
 arch/x86/kvm/mmu.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 978f402..1feac0c 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -547,6 +547,11 @@ static bool spte_is_bit_cleared(u64 old_spte, u64 new_spte, u64 bit_mask)
return (old_spte & bit_mask) && !(new_spte & bit_mask);
 }
 
+static bool spte_is_bit_changed(u64 old_spte, u64 new_spte, u64 bit_mask)
+{
+   return (old_spte & bit_mask) != (new_spte & bit_mask);
+}
+
 /* Rules for using mmu_spte_set:
  * Set the sptep from nonpresent to present.
  * Note: the sptep being assigned *must* be either not present
@@ -597,6 +602,13 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
if (!shadow_accessed_mask)
return ret;
 
+   /*
+* We also need to flush the TLB when the D-bit is changed by software
+* to guarantee D-bit consistency between the TLB and the MMU page table.
+*/
+   if (spte_is_bit_changed(old_spte, new_spte, shadow_dirty_mask))
+   ret = true;
+
if (spte_is_bit_cleared(old_spte, new_spte, shadow_accessed_mask))
kvm_set_pfn_accessed(spte_to_pfn(old_spte));
if (spte_is_bit_cleared(old_spte, new_spte, shadow_dirty_mask))
-- 
2.1.0



Re: Is there any consumer of virtio-balloon now?

2014-03-22 Thread Kai Huang

Just saw the reply. Thanks a lot!

Thanks,
-Kai
On 2014-03-21 17:03, Kashyap Chamarthy wrote:

On Fri, Mar 21, 2014 at 12:22:54PM +0800, Kai Huang wrote:

Hi,

I see that virtio-balloon is designed for automatic memory ballooning between
the KVM host and guest, but from the latest mainline Linux kernel code it looks
like there is currently no consumer actually using it. Could you let me know
who the consumer is, if there is one?

From a quick look-up, I see that at least there are some users[1] on the
libvirt-users list. And here's a nice explanation/usage of it[2]:


   [1] https://www.redhat.com/archives/libvirt-users/2011-October/msg00059.html
   [2] http://rwmj.wordpress.com/2010/07/17/virtio-balloon/





Re: Is there any consumer of virtio-balloon now?

2014-03-22 Thread Kai Huang
Thanks Paolo. What's the userspace tool / command to trigger the
virtio_balloon functionality? Basically I am looking for the whole code
path that triggers virtio_balloon.


Thanks,
-Kai
On 2014-03-21 16:51, Paolo Bonzini wrote:

On 21/03/2014 05:22, Kai Huang wrote:

Hi,

I see that virtio-balloon is designed for automatic memory ballooning between
the KVM host and guest, but from the latest mainline Linux kernel code it looks
like there is currently no consumer actually using it. Could you let me know
who the consumer is, if there is one?


The virtio-balloon driver is in drivers/virtio/virtio_balloon.c. Right 
now it only does manual ballooning though, not automatic.


Paolo





Is there any consumer of virtio-balloon now?

2014-03-20 Thread Kai Huang
Hi,

I see that virtio-balloon is designed for automatic memory ballooning between
the KVM host and guest, but from the latest mainline Linux kernel code it looks
like there is currently no consumer actually using it. Could you let me know
who the consumer is, if there is one?

Thanks,
-Kai


Re: [PULL 7/7] vfio: fix mapping of MSIX bar

2014-01-19 Thread Kai Huang
On Sun, Jan 19, 2014 at 10:11 PM, Alex Williamson
 wrote:
> On Sun, 2014-01-19 at 22:03 +0800, Kai Huang wrote:
>> On Sat, Jan 18, 2014 at 3:25 AM, Alex Williamson
>>  wrote:
>> > From: Alexey Kardashevskiy 
>> >
>> > VFIO virtualizes MSIX table for the guest but not mapping the part of
>> > a BAR which contains an MSIX table. Since vfio_mmap_bar() mmaps chunks
>> > before and after the MSIX table, they have to be aligned to the host
>> > page size which may be TARGET_PAGE_MASK (4K) or 64K in case of PPC64.
>> >
>> > This fixes boundaries calculations to use the real host page size.
>> >
>> > Without the patch, the chunk before MSIX table may overlap with the MSIX
>> > table and mmap will fail in the host kernel. The result will be serious
>> > slowdown as the whole BAR will be emulated by QEMU.
>> >
>> > Signed-off-by: Alexey Kardashevskiy 
>> > Signed-off-by: Alex Williamson 
>> > ---
>> >  hw/misc/vfio.c |6 +++---
>> >  1 file changed, 3 insertions(+), 3 deletions(-)
>> >
>> > diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
>> > index 432547c..8a1f1a1 100644
>> > --- a/hw/misc/vfio.c
>> > +++ b/hw/misc/vfio.c
>> > @@ -2544,7 +2544,7 @@ static void vfio_map_bar(VFIODevice *vdev, int nr)
>> >   * potentially insert a direct-mapped subregion before and after it.
>> >   */
>> >  if (vdev->msix && vdev->msix->table_bar == nr) {
>> > -size = vdev->msix->table_offset & TARGET_PAGE_MASK;
>> > +size = vdev->msix->table_offset & qemu_host_page_mask;
>> >  }
>> >
>> >  strncat(name, " mmap", sizeof(name) - strlen(name) - 1);
>> > @@ -2556,8 +2556,8 @@ static void vfio_map_bar(VFIODevice *vdev, int nr)
>> >  if (vdev->msix && vdev->msix->table_bar == nr) {
>> >  unsigned start;
>> >
>> > -start = TARGET_PAGE_ALIGN(vdev->msix->table_offset +
>> > -  (vdev->msix->entries * PCI_MSIX_ENTRY_SIZE));
>> > +start = HOST_PAGE_ALIGN(vdev->msix->table_offset +
>> > +(vdev->msix->entries * PCI_MSIX_ENTRY_SIZE));
>> >
>> Hi Alex,
>>
>> I am new to vfio and qemu, and have some questions. Does MSI-X have a
>> dedicated BAR when QEMU emulates the device? It looks like your code maps
>> both the content before and after the MSI-X table. If MSI-X has a
>> dedicated BAR, I think we can just skip that BAR, so why do we need
>> to map the content before and after the MSI-X table?
>
> vfio is used to pass through existing physical devices.  We don't get to
> define the MSI-X layout of those devices.  Therefore we must be prepared
> to handle any possible layout.  The BAR may be dedicated to the MSI-X
> table or it may also include memory mapped register space for the
> device.  Thanks,
>
> Alex
>

This sounds reasonable. So if a BAR contains both the MSI-X table and
registers, will the register accesses be trapped and emulated?

Btw, does vfio_mmap_bar consider the pending bit array (PBA)? I don't
think it can be accessed directly by userspace either.

Thanks,
-Kai
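
To make the chunking discussed above concrete, here is a minimal, self-contained
C sketch of the boundary arithmetic the patch changes. The BAR size, table
offset, entry count, and the helper name show_mmap_chunks are hypothetical,
chosen purely for illustration; only PCI_MSIX_ENTRY_SIZE (16 bytes per entry)
comes from the patch context.

#include <stdio.h>

#define PCI_MSIX_ENTRY_SIZE 16   /* each MSI-X table entry is 16 bytes */

/* Hypothetical layout: a 256KB BAR with a 16-entry MSI-X table at offset
 * 0x1000; the rest of the BAR is memory-mapped device register space. */
#define BAR_SIZE          0x40000UL
#define MSIX_TABLE_OFFSET 0x1000UL
#define MSIX_ENTRIES      16UL

static void show_mmap_chunks(unsigned long host_page_size)
{
    unsigned long host_page_mask = ~(host_page_size - 1);

    /* Chunk before the table: truncated down to a host-page boundary so the
     * mmap cannot cover the (trapped) page holding the MSI-X table. */
    unsigned long before_size = MSIX_TABLE_OFFSET & host_page_mask;

    /* Chunk after the table: starts at the first host-page boundary past
     * the end of the table. */
    unsigned long table_end = MSIX_TABLE_OFFSET + MSIX_ENTRIES * PCI_MSIX_ENTRY_SIZE;
    unsigned long after_start = (table_end + host_page_size - 1) & host_page_mask;
    unsigned long after_size = after_start < BAR_SIZE ? BAR_SIZE - after_start : 0;

    printf("host page 0x%lx: mmap [0, 0x%lx) and [0x%lx, 0x%lx)\n",
           host_page_size, before_size, after_start, after_start + after_size);
}

int main(void)
{
    show_mmap_chunks(0x1000);  /* 4K host pages: both chunks are usable */
    show_mmap_chunks(0x10000); /* 64K host pages (e.g. PPC64): the chunk before
                                * the table collapses to zero bytes */
    return 0;
}

With 4K alignment forced on a 64K-page host, the "before" chunk would have been
0x1000 bytes and would share a 64K page with the MSI-X table, which is exactly
the overlapping-mmap failure described in the commit message above.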


Re: [PULL 7/7] vfio: fix mapping of MSIX bar

2014-01-19 Thread Kai Huang
On Sat, Jan 18, 2014 at 3:25 AM, Alex Williamson
 wrote:
> From: Alexey Kardashevskiy 
>
> VFIO virtualizes MSIX table for the guest but not mapping the part of
> a BAR which contains an MSIX table. Since vfio_mmap_bar() mmaps chunks
> before and after the MSIX table, they have to be aligned to the host
> page size which may be TARGET_PAGE_MASK (4K) or 64K in case of PPC64.
>
> This fixes boundaries calculations to use the real host page size.
>
> Without the patch, the chunk before MSIX table may overlap with the MSIX
> table and mmap will fail in the host kernel. The result will be serious
> slowdown as the whole BAR will be emulated by QEMU.
>
> Signed-off-by: Alexey Kardashevskiy 
> Signed-off-by: Alex Williamson 
> ---
>  hw/misc/vfio.c |6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
> index 432547c..8a1f1a1 100644
> --- a/hw/misc/vfio.c
> +++ b/hw/misc/vfio.c
> @@ -2544,7 +2544,7 @@ static void vfio_map_bar(VFIODevice *vdev, int nr)
>   * potentially insert a direct-mapped subregion before and after it.
>   */
>  if (vdev->msix && vdev->msix->table_bar == nr) {
> -size = vdev->msix->table_offset & TARGET_PAGE_MASK;
> +size = vdev->msix->table_offset & qemu_host_page_mask;
>  }
>
>  strncat(name, " mmap", sizeof(name) - strlen(name) - 1);
> @@ -2556,8 +2556,8 @@ static void vfio_map_bar(VFIODevice *vdev, int nr)
>  if (vdev->msix && vdev->msix->table_bar == nr) {
>  unsigned start;
>
> -start = TARGET_PAGE_ALIGN(vdev->msix->table_offset +
> -  (vdev->msix->entries * PCI_MSIX_ENTRY_SIZE));
> +start = HOST_PAGE_ALIGN(vdev->msix->table_offset +
> +(vdev->msix->entries * PCI_MSIX_ENTRY_SIZE));
>
Hi Alex,

I am new to vfio and qemu, and have some questions. Does MSI-X have a
dedicated BAR when QEMU emulates the device? It looks like your code maps
both the content before and after the MSI-X table. If MSI-X has a
dedicated BAR, I think we can just skip that BAR, so why do we need
to map the content before and after the MSI-X table?

Thanks,
-Kai
>  size = start < bar->size ? bar->size - start : 0;
>  strncat(name, " msix-hi", sizeof(name) - strlen(name) - 1);
>


Re: [PATCH v4 2/7] iommu/core: split mapping to page sizes as supported by the hardware

2011-11-09 Thread Kai Huang
>
> -int iommu_unmap(struct iommu_domain *domain, unsigned long iova, int gfp_order)
> +size_t iommu_unmap(struct iommu_domain *domain, unsigned long iova, size_t size)
>  {
> -       size_t size, unmapped;
> +       size_t unmapped_page, unmapped = 0;
> +       unsigned int min_pagesz;
>
>        if (unlikely(domain->ops->unmap == NULL))
>                return -ENODEV;
>
> -       size         = PAGE_SIZE << gfp_order;
> -
> -       BUG_ON(!IS_ALIGNED(iova, size));
> -
> -       unmapped = domain->ops->unmap(domain, iova, size);
> -
> -       return get_order(unmapped);
> +       /* find out the minimum page size supported */
> +       min_pagesz = 1 << __ffs(domain->ops->pgsize_bitmap);
> +
> +       /*
> +        * The virtual address, as well as the size of the mapping, must be
> +        * aligned (at least) to the size of the smallest page supported
> +        * by the hardware
> +        */
> +       if (!IS_ALIGNED(iova | size, min_pagesz)) {
> +               pr_err("unaligned: iova 0x%lx size 0x%lx min_pagesz 0x%x\n",
> +                                       iova, (unsigned long)size, min_pagesz);
> +               return -EINVAL;
> +       }
> +
> +       pr_debug("unmap this: iova 0x%lx size 0x%lx\n", iova,
> +                                                       (unsigned long)size);
> +
> +       /*
> +        * Keep iterating until we either unmap 'size' bytes (or more)
> +        * or we hit an area that isn't mapped.
> +        */
> +       while (unmapped < size) {
> +               size_t left = size - unmapped;
> +
> +               unmapped_page = domain->ops->unmap(domain, iova, left);
> +               if (!unmapped_page)
> +                       break;
> +
> +               pr_debug("unmapped: iova 0x%lx size %lx\n", iova,
> +                                       (unsigned long)unmapped_page);
> +
> +               iova += unmapped_page;
> +               unmapped += unmapped_page;
> +       }
> +
> +       return unmapped;
>  }
>  EXPORT_SYMBOL_GPL(iommu_unmap);
>

It seems the unmap function doesn't take phys as a parameter; does this mean
domain->ops->unmap will walk through the page table to find out the
actual page size?

And another question: have we considered the IOTLB flush operation? I
think we need to implement similar logic when flushing the DVMA range.
Intel VT-d's manual says software needs to specify the appropriate
mask value to flush large pages, but it does not say we need to
exactly match the page size as it was mapped. I guess that's not
necessary for the Intel IOMMU, but other vendors' IOMMUs may have such
a limitation (or some other limitations). In my understanding the current
implementation does not provide page size information for particular
DVMA ranges that have been mapped, so it's not flexible to implement
IOTLB flush code (e.g., we may need to walk through the page table to find
out the actual page size). Maybe we could also add iommu_ops->flush_iotlb?

-cody
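
To illustrate the iommu_ops->flush_iotlb suggestion above, here is a minimal
compilable C sketch. The callback, its signature, and the iommu_flush_iotlb
wrapper are hypothetical -- they are not part of the iommu_ops API in this
series -- and the struct definitions below are simplified stand-ins for the
real ones in include/linux/iommu.h.

#include <stddef.h>

struct iommu_domain;

struct iommu_ops {
    /* real callbacks (map, unmap, ...) omitted for brevity */

    /*
     * Hypothetical addition: flush the IOTLB for [iova, iova + size),
     * letting each driver round the range to whatever invalidation
     * granularity its hardware supports.
     */
    void (*flush_iotlb)(struct iommu_domain *domain,
                        unsigned long iova, size_t size);
};

struct iommu_domain {
    const struct iommu_ops *ops;   /* simplified stand-in for the real field */
};

/* Hypothetical core wrapper in the style of iommu_map()/iommu_unmap(). */
static void iommu_flush_iotlb(struct iommu_domain *domain,
                              unsigned long iova, size_t size)
{
    if (domain->ops && domain->ops->flush_iotlb)
        domain->ops->flush_iotlb(domain, iova, size);
    /* Drivers without the callback would keep flushing inside ->unmap(). */
}

With something like this, a caller that just unmapped a range could invalidate
it in one call without the core needing to know which page sizes the driver
actually used for the mapping.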


Re: What's the usage model (purpose) of interrupt remapping in IOMMU?

2011-11-02 Thread Kai Huang
Clear, thanks!

-- Forwarded message --
From: Alex Williamson 
Date: Wed, Nov 2, 2011 at 11:31 PM
Subject: Re: What's the usage model (purpose) of interrupt remapping in IOMMU?
To: Kai Huang 
Cc: kvm@vger.kernel.org, linux-...@ger.kernel.org


On Wed, 2011-11-02 at 13:26 +0800, Kai Huang wrote:
> Hi,
>
> In the case of direct I/O, without interrupt remapping in the IOMMU (Intel
> VT-d or AMD IOMMU), the hypervisor needs to inject interrupts for the guest
> when the guest is scheduled onto a specific CPU. At the beginning I
> thought that with the IOMMU's interrupt remapping the hardware could directly
> forward the interrupt to the guest without trapping into the hypervisor when
> the interrupt happens, but after reading Intel's VT-d manual I
> found the interrupt remapping feature just adds another mapping which
> allows software to control (mainly) the destination and vector, and we
> still need the hypervisor to inject the interrupt when the guest is
> scheduled, since the target CPU is only known once the guest has been
> scheduled. If my understanding is correct, it seems interrupt remapping
> does not bring any performance improvement. So what is the benefit of
> the IOMMU's interrupt remapping? Can someone explain the usage model of
> interrupt remapping in the IOMMU?

Interrupt remapping provides isolation and compatibility, not
performance.  The hypervisor being able to direct interrupts to a target
CPU also gives it the ability to filter interrupts and prevent the
device from signaling spurious or malicious interrupts.  This is
particularly important with message signaled interrupts since any device
capable of DMA is able to inject random MSIs into the host.  The
compatibility side is a feature of Intel platforms supporting x2apic.
The interrupt remapper provides a translation layer to allow xapic aware
hardware, such as ioapics, to function when the processors are switched
to x2apic mode.  Thanks,

Alex


What's the usage model (purpose) of interrupt remapping in IOMMU?

2011-11-01 Thread Kai Huang
Hi,

In the case of direct I/O, without interrupt remapping in the IOMMU (Intel
VT-d or AMD IOMMU), the hypervisor needs to inject interrupts for the guest
when the guest is scheduled onto a specific CPU. At the beginning I
thought that with the IOMMU's interrupt remapping the hardware could directly
forward the interrupt to the guest without trapping into the hypervisor when
the interrupt happens, but after reading Intel's VT-d manual I
found the interrupt remapping feature just adds another mapping which
allows software to control (mainly) the destination and vector, and we
still need the hypervisor to inject the interrupt when the guest is
scheduled, since the target CPU is only known once the guest has been
scheduled. If my understanding is correct, it seems interrupt remapping
does not bring any performance improvement. So what is the benefit of
the IOMMU's interrupt remapping? Can someone explain the usage model of
interrupt remapping in the IOMMU?

Thanks,
cody


Cache flush questions of Intel IOMMU

2011-08-23 Thread Kai Huang
Hi all,

I am working on Intel IOMMU stuff and I have two questions -- I am just
sending them to the kvm list as I am not sure which mailing list I should
send them to, and it would be much appreciated if you could help forward
them to the related mailing list. Thank you!

1) I see in Intel's IOMMU manual that caching behavior is reported by the CM
bit in the Capability register and the Coherency bit in the Extended
Capability register. My question is, from the hardware's point of view, in
the case of CM=0, when some specific address mapping is changed (e.g. setting
up a new mapping, or modifying or freeing an existing mapping), do we need to
flush the IOTLB cache for that address range, considering both C=0 and C=1?
I guess if C=0 we need to flush, and if C=1 we don't have to, right? And I
guess this should also be true for the context cache?

Caching Mode (CM) bit in the Capability register:

0: Not-present and erroneous entries are
not cached in any of the remapping caches.
Invalidations are not required for
modifications to individual not present or
invalid entries. However, any modifications
that result in decreasing the effective
permissions or partial permission increases
require invalidations for them to be
effective.

1: Not-present and erroneous mappings
may be cached in the remapping caches.
Any software updates to the remapping
structures (including updates to “not-present” or erroneous entries) require
explicit invalidation

Coherency bit in Extended Capability register:

This field indicates if hardware access to the
root, context, page-table and interrupt-
remap structures are coherent (snooped) or
not.
• 0: Indicates hardware accesses to
  remapping structures are non-coherent.
• 1: Indicates hardware accesses to
  remapping structures are coherent.

2) I see that in domain_flush_cache, when !domain->iommu_coherency
(which actually reflects the Coherency bit in the Extended Capability
register), clflush_cache_range is called. But clflush flushes the cache
in the CPU, not in the IOMMU, so why flush the CPU cache rather than the
IOMMU cache here? (See the sketch after the two snippets below.)

static int domain_init(struct dmar_domain *domain, int guest_width)
{

if (ecap_coherent(iommu->ecap))
domain->iommu_coherency = 1;
else
domain->iommu_coherency = 0;
...
}

static void domain_flush_cache(struct dmar_domain *domain,
   void *addr, int size)
{
if (!domain->iommu_coherency)
clflush_cache_range(addr, size);
}
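
To clarify what question 2 is getting at: the two flushes act on different
structures. clflush pushes the CPU's writes to the in-memory remapping tables
out of the CPU cache so a non-coherent IOMMU walker can see them, while stale
translations cached inside the IOMMU are dropped by a separate IOTLB
invalidation. A self-contained C sketch of the typical ordering follows; every
helper other than domain_flush_cache/clflush_cache_range is a hypothetical
stand-in for the driver's real code, and the struct is simplified.

#include <stddef.h>
#include <stdint.h>

struct dmar_domain { int iommu_coherency; };   /* simplified stand-in */

static void clflush_cache_range(void *addr, unsigned int size)
{
    (void)addr; (void)size;   /* the kernel issues clflush over the range */
}

/* Hypothetical stand-in for register- or queue-based IOTLB invalidation. */
static void iotlb_invalidate(struct dmar_domain *domain,
                             unsigned long iova, size_t size)
{
    (void)domain; (void)iova; (void)size;
}

static void domain_flush_cache(struct dmar_domain *domain, void *addr, int size)
{
    /* C=0: the IOMMU does not snoop the CPU cache when walking the tables,
     * so the CPU's write must be flushed out to memory.  C=1: skip. */
    if (!domain->iommu_coherency)
        clflush_cache_range(addr, size);
}

/* Typical ordering when a present mapping for [iova, iova + size) changes. */
static void change_mapping(struct dmar_domain *domain, uint64_t *pte,
                           uint64_t new_val, unsigned long iova, size_t size)
{
    *pte = new_val;                                 /* 1. update the PTE in memory */
    domain_flush_cache(domain, pte, sizeof(*pte));  /* 2. CPU-cache flush, only if C=0 */
    iotlb_invalidate(domain, iova, size);           /* 3. IOTLB flush, regardless of C;
                                                       CM only adds not-present entries
                                                       to what must be invalidated */
}

So the clflush in domain_flush_cache is about the CPU cache only; the IOMMU's
own caches are invalidated through the separate IOTLB/context-cache
invalidation commands.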

Thanks,
Cody