Re: [PATCH v3] kvm: mmu: lazy collapse small sptes into large sptes

2015-04-14 Thread Andres Lagar-Cavilla
On Mon, Apr 13, 2015 at 10:25 PM, Wanpeng Li wanpeng...@linux.intel.com wrote:
 Hi Andres,
 On Fri, Apr 10, 2015 at 11:05:26AM -0700, Andres Lagar-Cavilla wrote:
 [...]
 +   if (sp->role.direct &&
 +   !kvm_is_reserved_pfn(pfn) &&
 +   PageTransCompound(pfn_to_page(pfn))) {

Not your fault, but PageTransCompound is unhappily named, as it also
yields true for PageHuge pages. Suggestion: either document that this
check also covers static hugetlbfs, or switch to a PageCompound() check.

A slightly bolder approach would be to refactor and reuse the nearly
identical check done in transparent_hugepage_adjust, instead of
open-coding it here. In essence this code is asking for the same check,
plus the out-of-band check for static hugepages.

 A PageCompound() check still returns true for both transparent huge pages
 and hugetlbfs pages; a !PageHuge(page) && PageTransHuge(page) check is
 guaranteed to catch only transparent huge pages, just as in my old commit
 e76d30e20be5fc (mm/hwpoison: fix test for a transparent huge page).
 I will send a patch to fix this.

Why would you want to fix it that way? Aren't static hugepages supported?

(PageAnon is an inline check and much cheaper than !PageHuge(), which
is an actual function call)
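
For example (sketch only, assuming one really did want to single out THP
and exclude hugetlbfs, which is exactly what I'm questioning above):

        struct page *page = pfn_to_page(pfn);
        /* Both are inline page-flag tests; no out-of-line PageHuge() call. */
        bool thp_backed = PageAnon(page) && PageTransHuge(page);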

Please consider my suggestion about refactoring the similar checks in
transparent_hugepage_adjust.
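
Purely as an illustration of what I mean (the helper name and the exact
condition set below are hypothetical, not lifted from the current
transparent_hugepage_adjust), both call sites could share a predicate
along these lines:

        /*
         * Illustrative sketch only: true for THP as well as static
         * hugetlbfs, since PageCompound() covers both kinds of
         * compound page.
         */
        static bool kvm_pfn_is_huge_backed(pfn_t pfn)
        {
                if (kvm_is_reserved_pfn(pfn))
                        return false;

                return PageCompound(pfn_to_page(pfn));
        }

kvm_mmu_zap_collapsible_spte() would then test sp->role.direct &&
kvm_pfn_is_huge_backed(pfn), and transparent_hugepage_adjust() could reuse
the same predicate when deciding whether to bump the mapping level.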

Thanks a ton
Andres


 +   drop_spte(kvm, sptep);
 +   sptep = rmap_get_first(*rmapp, &iter);
 +   need_tlb_flush = 1;
 +   } else
 +   sptep = rmap_get_next(&iter);
 +   }
 +
 +   return need_tlb_flush;
 +}
 +
 +void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
 +   struct kvm_memory_slot *memslot)
 +{
 +   bool flush = false;
 +   unsigned long *rmapp;
 +   unsigned long last_index, index;
 +   gfn_t gfn_start, gfn_end;
 +
 +   spin_lock(&kvm->mmu_lock);
 +
 +   gfn_start = memslot->base_gfn;
 +   gfn_end = memslot->base_gfn + memslot->npages - 1;
 +
 +   if (gfn_start >= gfn_end)
 +   goto out;

I don't understand the value of this check here. Are we looking for a
broken memslot? Shouldn't this be a BUG_ON? Is this the place to care
about these things? npages is capped to KVM_MEM_MAX_NR_PAGES, i.e.
2^31. A 64 bit overflow would be caused by a gigantic gfn_start which
would be trouble in many other ways.

All this to say: please remove the above 5 lines and make the code simpler.
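
Roughly, a sketch (untested, variable names as in the v3 patch) of what the
simplified opening would look like, with gfn_start/gfn_end and the out:
label gone entirely:

        spin_lock(&kvm->mmu_lock);

        rmapp = memslot->arch.rmap[0];
        last_index = gfn_to_index(memslot->base_gfn + memslot->npages - 1,
                                  memslot->base_gfn, PT_PAGE_TABLE_LEVEL);
        /* ... scan loop and final flush unchanged, ending in spin_unlock() ... */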

 I will send a patch to clean it up. Thanks for your review. :)

 Regards,
 Wanpeng Li




-- 
Andres Lagar-Cavilla | Google Kernel Team | andre...@google.com


Re: [PATCH v3] kvm: mmu: lazy collapse small sptes into large sptes

2015-04-14 Thread Wanpeng Li
On Mon, Apr 13, 2015 at 11:06:25PM -0700, Andres Lagar-Cavilla wrote:
On Mon, Apr 13, 2015 at 10:25 PM, Wanpeng Li wanpeng...@linux.intel.com 
wrote:
 Hi Andres,
 On Fri, Apr 10, 2015 at 11:05:26AM -0700, Andres Lagar-Cavilla wrote:
 [...]
 +   if (sp->role.direct &&
 +   !kvm_is_reserved_pfn(pfn) &&
 +   PageTransCompound(pfn_to_page(pfn))) {

Not your fault, but PageTransCompound is unhappily named, as it also
yields true for PageHuge pages. Suggestion: either document that this
check also covers static hugetlbfs, or switch to a PageCompound() check.

A slightly bolder approach would be to refactor and reuse the nearly
identical check done in transparent_hugepage_adjust, instead of
open-coding it here. In essence this code is asking for the same check,
plus the out-of-band check for static hugepages.

 A PageCompound() check still returns true for both transparent huge pages
 and hugetlbfs pages; a !PageHuge(page) && PageTransHuge(page) check is
 guaranteed to catch only transparent huge pages, just as in my old commit
 e76d30e20be5fc (mm/hwpoison: fix test for a transparent huge page).
 I will send a patch to fix this.

Why would you want to fix it that way? Aren't static hugepages supported?

(PageAnon is an inline check and much cheaper than !PageHuge(), which
is an actual function call)

Please consider my suggestion about refactoring the similar checks in
transparent_hugepage_adjust.

Ok, will do. :)

Regards,
Wanpeng Li 


Thanks a ton
Andres


 +   drop_spte(kvm, sptep);
 +   sptep = rmap_get_first(*rmapp, &iter);
 +   need_tlb_flush = 1;
 +   } else
 +   sptep = rmap_get_next(&iter);
 +   }
 +
 +   return need_tlb_flush;
 +}
 +
 +void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
 +   struct kvm_memory_slot *memslot)
 +{
 +   bool flush = false;
 +   unsigned long *rmapp;
 +   unsigned long last_index, index;
 +   gfn_t gfn_start, gfn_end;
 +
 +   spin_lock(&kvm->mmu_lock);
 +
 +   gfn_start = memslot->base_gfn;
 +   gfn_end = memslot->base_gfn + memslot->npages - 1;
 +
 +   if (gfn_start >= gfn_end)
 +   goto out;

I don't understand the value of this check here. Are we looking for a
broken memslot? Shouldn't this be a BUG_ON? Is this the place to care
about these things? npages is capped to KVM_MEM_MAX_NR_PAGES, i.e.
2^31. A 64 bit overflow would be caused by a gigantic gfn_start which
would be trouble in many other ways.

All this to say: please remove the above 5 lines and make the code simpler.

 I will send a patch to clean it up. Thanks for your review. :)

 Regards,
 Wanpeng Li




-- 
Andres Lagar-Cavilla | Google Kernel Team | andre...@google.com


Re: [PATCH v3] kvm: mmu: lazy collapse small sptes into large sptes

2015-04-13 Thread Wanpeng Li
Hi Andres,
On Fri, Apr 10, 2015 at 11:05:26AM -0700, Andres Lagar-Cavilla wrote:
[...]
 +   if (sp->role.direct &&
 +   !kvm_is_reserved_pfn(pfn) &&
 +   PageTransCompound(pfn_to_page(pfn))) {

Not your fault, but PageTransCompound is unhappily named, as it also
yields true for PageHuge pages. Suggestion: either document that this
check also covers static hugetlbfs, or switch to a PageCompound() check.

A slightly bolder approach would be to refactor and reuse the nearly
identical check done in transparent_hugepage_adjust, instead of
open-coding it here. In essence this code is asking for the same check,
plus the out-of-band check for static hugepages.

A PageCompound() check still returns true for both transparent huge pages
and hugetlbfs pages; a !PageHuge(page) && PageTransHuge(page) check is
guaranteed to catch only transparent huge pages, just as in my old commit
e76d30e20be5fc (mm/hwpoison: fix test for a transparent huge page).
I will send a patch to fix this.



 +   drop_spte(kvm, sptep);
 +   sptep = rmap_get_first(*rmapp, &iter);
 +   need_tlb_flush = 1;
 +   } else
 +   sptep = rmap_get_next(&iter);
 +   }
 +
 +   return need_tlb_flush;
 +}
 +
 +void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
 +   struct kvm_memory_slot *memslot)
 +{
 +   bool flush = false;
 +   unsigned long *rmapp;
 +   unsigned long last_index, index;
 +   gfn_t gfn_start, gfn_end;
 +
 +   spin_lock(&kvm->mmu_lock);
 +
 +   gfn_start = memslot->base_gfn;
 +   gfn_end = memslot->base_gfn + memslot->npages - 1;
 +
 +   if (gfn_start >= gfn_end)
 +   goto out;

I don't understand the value of this check here. Are we looking for a
broken memslot? Shouldn't this be a BUG_ON? Is this the place to care
about these things? npages is capped to KVM_MEM_MAX_NR_PAGES, i.e.
2^31. A 64 bit overflow would be caused by a gigantic gfn_start which
would be trouble in many other ways.

All this to say: please remove the above 5 lines and make the code simpler.

I will send a patch to clean it up. Thanks for your review. :)

Regards,
Wanpeng Li



Re: [PATCH v3] kvm: mmu: lazy collapse small sptes into large sptes

2015-04-13 Thread Andres Lagar-Cavilla
On Sun, Apr 12, 2015 at 6:45 PM, Xiao Guangrong
guangrong.x...@linux.intel.com wrote:


 On 04/11/2015 02:05 AM, Andres Lagar-Cavilla wrote:

 On Fri, Apr 3, 2015 at 12:40 AM, Wanpeng Li wanpeng...@linux.intel.com
 wrote:


 There are two scenarios that require collapsing small sptes into large sptes.
 - Dirty logging tracks sptes at 4K granularity, so large sptes are split.
   The large sptes will be reallocated on the destination machine, and the
   guest on the source machine will be destroyed, when live migration
   completes successfully. However, the guest on the source machine will
   continue to run if live migration fails for some reason, and its sptes
   stay small, which leads to bad performance.
 - Our customers write tools that track the dirty speed of guests via the
   EPT D bit/PML in order to determine the most appropriate one to live
   migrate; however, the sptes will still stay small after the dirty speed
   has been tracked.

 This patch introduces lazy collapsing of small sptes into large sptes: the
 memory region is scanned in the ioctl context when dirty logging is stopped,
 the sptes which can be collapsed into large pages are dropped during the
 scan, and it depends on later #PFs to reallocate all the large sptes.

 Reviewed-by: Xiao Guangrong guangrong.x...@linux.intel.com
 Signed-off-by: Wanpeng Li wanpeng...@linux.intel.com


 Hi, apologies for late review (vacation), but wanted to bring
 attention to a few matters:


 No problem, your comments are really valuable to us. :)

Glad to know, thanks.





 ---
 v2 -> v3:
   * update comments
   * fix infinite for loop
 v1 -> v2:
   * use 'bool' instead of 'int'
   * add more comments
   * fix failure to get the next spte after dropping the current spte

   arch/x86/include/asm/kvm_host.h |  2 ++
   arch/x86/kvm/mmu.c              | 73 +++++++++++++++++++++++++++++++++++++++++
   arch/x86/kvm/x86.c              | 19 +++++++++++++++
   3 files changed, 94 insertions(+)

 diff --git a/arch/x86/include/asm/kvm_host.h
 b/arch/x86/include/asm/kvm_host.h
 index 30b28dc..91b5bdb 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -854,6 +854,8 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64
 accessed_mask,
   void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
   void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
struct kvm_memory_slot *memslot);
 +void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
 +   struct kvm_memory_slot *memslot);
   void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
 struct kvm_memory_slot *memslot);
   void kvm_mmu_slot_largepage_remove_write_access(struct kvm *kvm,
 diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
 index cee7592..ba002a0 100644
 --- a/arch/x86/kvm/mmu.c
 +++ b/arch/x86/kvm/mmu.c
 @@ -4465,6 +4465,79 @@ void kvm_mmu_slot_remove_write_access(struct kvm
 *kvm,
  kvm_flush_remote_tlbs(kvm);
   }

 +static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
 +   unsigned long *rmapp)
 +{
 +   u64 *sptep;
 +   struct rmap_iterator iter;
 +   int need_tlb_flush = 0;
 +   pfn_t pfn;
 +   struct kvm_mmu_page *sp;
 +
 +   for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
 +   BUG_ON(!(*sptep & PT_PRESENT_MASK));
 +
 +   sp = page_header(__pa(sptep));
 +   pfn = spte_to_pfn(*sptep);
 +
 +   /*
 +* Lets support EPT only for now, there still needs to figure
 +* out an efficient way to let these codes be aware what mapping
 +* level used in guest.
 +*/
 +   if (sp->role.direct &&
 +   !kvm_is_reserved_pfn(pfn) &&
 +   PageTransCompound(pfn_to_page(pfn))) {


 Not your fault, but PageTransCompound is unhappily named, as it also
 yields true for PageHuge pages. Suggestion: either document that this
 check also covers static hugetlbfs, or switch to a PageCompound() check.

 A slightly bolder approach would be to refactor and reuse the nearly
 identical check done in transparent_hugepage_adjust, instead of
 open-coding it here. In essence this code is asking for the same check,
 plus the out-of-band check for static hugepages.


 I agree.




 +   drop_spte(kvm, sptep);
 +   sptep = rmap_get_first(*rmapp, &iter);
 +   need_tlb_flush = 1;
 +   } else
 +   sptep = rmap_get_next(&iter);
 +   }
 +
 +   return need_tlb_flush;
 +}
 +
 +void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
 +   struct kvm_memory_slot *memslot)
 +{
 +   bool flush = false;
 +   unsigned long *rmapp;
 +   unsigned long last_index, index;
 +   gfn_t gfn_start, gfn_end;
 +
 +   spin_lock(&kvm->mmu_lock);
 +
 +   gfn_start = memslot->base_gfn;
 +   gfn_end = memslot->base_gfn + 

Re: [PATCH v3] kvm: mmu: lazy collapse small sptes into large sptes

2015-04-13 Thread Wanpeng Li
On Mon, Apr 13, 2015 at 09:45:39AM +0800, Xiao Guangrong wrote:


On 04/11/2015 02:05 AM, Andres Lagar-Cavilla wrote:
On Fri, Apr 3, 2015 at 12:40 AM, Wanpeng Li wanpeng...@linux.intel.com 
wrote:

There are two scenarios that require collapsing small sptes into large sptes.
- Dirty logging tracks sptes at 4K granularity, so large sptes are split.
  The large sptes will be reallocated on the destination machine, and the
  guest on the source machine will be destroyed, when live migration
  completes successfully. However, the guest on the source machine will
  continue to run if live migration fails for some reason, and its sptes
  stay small, which leads to bad performance.
- Our customers write tools that track the dirty speed of guests via the
  EPT D bit/PML in order to determine the most appropriate one to live
  migrate; however, the sptes will still stay small after the dirty speed
  has been tracked.

This patch introduces lazy collapsing of small sptes into large sptes: the
memory region is scanned in the ioctl context when dirty logging is stopped,
the sptes which can be collapsed into large pages are dropped during the
scan, and it depends on later #PFs to reallocate all the large sptes.

Reviewed-by: Xiao Guangrong guangrong.x...@linux.intel.com
Signed-off-by: Wanpeng Li wanpeng...@linux.intel.com

Hi, apologies for late review (vacation), but wanted to bring
attention to a few matters:

No problem, your comments are really valuable to us. :)



---
v2 -> v3:
  * update comments
  * fix infinite for loop
v1 -> v2:
  * use 'bool' instead of 'int'
  * add more comments
  * fix failure to get the next spte after dropping the current spte

  arch/x86/include/asm/kvm_host.h |  2 ++
  arch/x86/kvm/mmu.c              | 73 +++++++++++++++++++++++++++++++++++++++++
  arch/x86/kvm/x86.c              | 19 +++++++++++++++
  3 files changed, 94 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h 
b/arch/x86/include/asm/kvm_host.h
index 30b28dc..91b5bdb 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -854,6 +854,8 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 
accessed_mask,
  void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
  void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
   struct kvm_memory_slot *memslot);
+void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
+   struct kvm_memory_slot *memslot);
  void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
struct kvm_memory_slot *memslot);
  void kvm_mmu_slot_largepage_remove_write_access(struct kvm *kvm,
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index cee7592..ba002a0 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4465,6 +4465,79 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 kvm_flush_remote_tlbs(kvm);
  }

+static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
+   unsigned long *rmapp)
+{
+   u64 *sptep;
+   struct rmap_iterator iter;
+   int need_tlb_flush = 0;
+   pfn_t pfn;
+   struct kvm_mmu_page *sp;
+
+   for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
+   BUG_ON(!(*sptep & PT_PRESENT_MASK));
+
+   sp = page_header(__pa(sptep));
+   pfn = spte_to_pfn(*sptep);
+
+   /*
+* Lets support EPT only for now, there still needs to figure
+* out an efficient way to let these codes be aware what mapping
+* level used in guest.
+*/
+   if (sp->role.direct &&
+   !kvm_is_reserved_pfn(pfn) &&
+   PageTransCompound(pfn_to_page(pfn))) {

Not your fault, but PageTransCompound is unhappily named, as it also
yields true for PageHuge pages. Suggestion: either document that this
check also covers static hugetlbfs, or switch to a PageCompound() check.

A slightly bolder approach would be to refactor and reuse the nearly
identical check done in transparent_hugepage_adjust, instead of
open-coding it here. In essence this code is asking for the same check,
plus the out-of-band check for static hugepages.

I agree.



+   drop_spte(kvm, sptep);
+   sptep = rmap_get_first(*rmapp, &iter);
+   need_tlb_flush = 1;
+   } else
+   sptep = rmap_get_next(&iter);
+   }
+
+   return need_tlb_flush;
+}
+
+void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
+   struct kvm_memory_slot *memslot)
+{
+   bool flush = false;
+   unsigned long *rmapp;
+   unsigned long last_index, index;
+   gfn_t gfn_start, gfn_end;
+
+   spin_lock(&kvm->mmu_lock);
+
+   gfn_start = memslot->base_gfn;
+   gfn_end = memslot->base_gfn + memslot->npages - 1;
+
+   if (gfn_start >= gfn_end)
+   goto out;

I don't understand the value of this check here. Are we looking for a
broken memslot? 

Re: [PATCH v3] kvm: mmu: lazy collapse small sptes into large sptes

2015-04-12 Thread Xiao Guangrong



On 04/11/2015 02:05 AM, Andres Lagar-Cavilla wrote:

On Fri, Apr 3, 2015 at 12:40 AM, Wanpeng Li wanpeng...@linux.intel.com wrote:


There are two scenarios that require collapsing small sptes into large sptes.
- Dirty logging tracks sptes at 4K granularity, so large sptes are split.
  The large sptes will be reallocated on the destination machine, and the
  guest on the source machine will be destroyed, when live migration
  completes successfully. However, the guest on the source machine will
  continue to run if live migration fails for some reason, and its sptes
  stay small, which leads to bad performance.
- Our customers write tools that track the dirty speed of guests via the
  EPT D bit/PML in order to determine the most appropriate one to live
  migrate; however, the sptes will still stay small after the dirty speed
  has been tracked.

This patch introduces lazy collapsing of small sptes into large sptes: the
memory region is scanned in the ioctl context when dirty logging is stopped,
the sptes which can be collapsed into large pages are dropped during the
scan, and it depends on later #PFs to reallocate all the large sptes.

Reviewed-by: Xiao Guangrong guangrong.x...@linux.intel.com
Signed-off-by: Wanpeng Li wanpeng...@linux.intel.com


Hi, apologies for late review (vacation), but wanted to bring
attention to a few matters:


No problem, your comments are really valuable to us. :)





---
v2 -> v3:
  * update comments
  * fix infinite for loop
v1 -> v2:
  * use 'bool' instead of 'int'
  * add more comments
  * fix failure to get the next spte after dropping the current spte

  arch/x86/include/asm/kvm_host.h |  2 ++
  arch/x86/kvm/mmu.c              | 73 +++++++++++++++++++++++++++++++++++++++++
  arch/x86/kvm/x86.c              | 19 +++++++++++++++
  3 files changed, 94 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 30b28dc..91b5bdb 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -854,6 +854,8 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
  void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
  void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
   struct kvm_memory_slot *memslot);
+void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
+   struct kvm_memory_slot *memslot);
  void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
struct kvm_memory_slot *memslot);
  void kvm_mmu_slot_largepage_remove_write_access(struct kvm *kvm,
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index cee7592..ba002a0 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4465,6 +4465,79 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 kvm_flush_remote_tlbs(kvm);
  }

+static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
+   unsigned long *rmapp)
+{
+   u64 *sptep;
+   struct rmap_iterator iter;
+   int need_tlb_flush = 0;
+   pfn_t pfn;
+   struct kvm_mmu_page *sp;
+
+   for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
+   BUG_ON(!(*sptep & PT_PRESENT_MASK));
+
+   sp = page_header(__pa(sptep));
+   pfn = spte_to_pfn(*sptep);
+
+   /*
+* Lets support EPT only for now, there still needs to figure
+* out an efficient way to let these codes be aware what mapping
+* level used in guest.
+*/
+   if (sp->role.direct &&
+   !kvm_is_reserved_pfn(pfn) &&
+   PageTransCompound(pfn_to_page(pfn))) {


Not your fault, but PageTransCompound is unhappily named, as it also
yields true for PageHuge pages. Suggestion: either document that this
check also covers static hugetlbfs, or switch to a PageCompound() check.

A slightly bolder approach would be to refactor and reuse the nearly
identical check done in transparent_hugepage_adjust, instead of
open-coding it here. In essence this code is asking for the same check,
plus the out-of-band check for static hugepages.


I agree.





+   drop_spte(kvm, sptep);
+   sptep = rmap_get_first(*rmapp, &iter);
+   need_tlb_flush = 1;
+   } else
+   sptep = rmap_get_next(&iter);
+   }
+
+   return need_tlb_flush;
+}
+
+void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
+   struct kvm_memory_slot *memslot)
+{
+   bool flush = false;
+   unsigned long *rmapp;
+   unsigned long last_index, index;
+   gfn_t gfn_start, gfn_end;
+
+   spin_lock(&kvm->mmu_lock);
+
+   gfn_start = memslot->base_gfn;
+   gfn_end = memslot->base_gfn + memslot->npages - 1;
+
+   if (gfn_start >= gfn_end)
+   goto out;


I don't understand the value of this check here. Are we looking for a
broken memslot? Shouldn't this be a BUG_ON? Is this the place to care
about these 

Re: [PATCH v3] kvm: mmu: lazy collapse small sptes into large sptes

2015-04-07 Thread Paolo Bonzini


On 03/04/2015 09:40, Wanpeng Li wrote:
 There are two scenarios that require collapsing small sptes into large sptes.
 - Dirty logging tracks sptes at 4K granularity, so large sptes are split.
   The large sptes will be reallocated on the destination machine, and the
   guest on the source machine will be destroyed, when live migration
   completes successfully. However, the guest on the source machine will
   continue to run if live migration fails for some reason, and its sptes
   stay small, which leads to bad performance.
 - Our customers write tools that track the dirty speed of guests via the
   EPT D bit/PML in order to determine the most appropriate one to live
   migrate; however, the sptes will still stay small after the dirty speed
   has been tracked.
 
 This patch introduces lazy collapsing of small sptes into large sptes: the
 memory region is scanned in the ioctl context when dirty logging is stopped,
 the sptes which can be collapsed into large pages are dropped during the
 scan, and it depends on later #PFs to reallocate all the large sptes.
 
 Reviewed-by: Xiao Guangrong guangrong.x...@linux.intel.com
 Signed-off-by: Wanpeng Li wanpeng...@linux.intel.com
 ---
 v2 -> v3:
  * update comments
  * fix infinite for loop
 v1 -> v2:
  * use 'bool' instead of 'int'
  * add more comments
  * fix failure to get the next spte after dropping the current spte
 
  arch/x86/include/asm/kvm_host.h |  2 ++
  arch/x86/kvm/mmu.c              | 73 +++++++++++++++++++++++++++++++++++++++++
  arch/x86/kvm/x86.c              | 19 +++++++++++++++
  3 files changed, 94 insertions(+)
 
 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
 index 30b28dc..91b5bdb 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -854,6 +854,8 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 
 accessed_mask,
  void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
  void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 struct kvm_memory_slot *memslot);
 +void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
 + struct kvm_memory_slot *memslot);
  void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
  struct kvm_memory_slot *memslot);
  void kvm_mmu_slot_largepage_remove_write_access(struct kvm *kvm,
 diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
 index cee7592..ba002a0 100644
 --- a/arch/x86/kvm/mmu.c
 +++ b/arch/x86/kvm/mmu.c
 @@ -4465,6 +4465,79 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
   kvm_flush_remote_tlbs(kvm);
  }
  
 +static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
 + unsigned long *rmapp)
 +{
 + u64 *sptep;
 + struct rmap_iterator iter;
 + int need_tlb_flush = 0;
 + pfn_t pfn;
 + struct kvm_mmu_page *sp;
 +
 + for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
 + BUG_ON(!(*sptep & PT_PRESENT_MASK));
 +
 + sp = page_header(__pa(sptep));
 + pfn = spte_to_pfn(*sptep);
 +
 + /*
 +  * Lets support EPT only for now, there still needs to figure
 +  * out an efficient way to let these codes be aware what mapping
 +  * level used in guest.
 +  */
 + if (sp->role.direct &&
 + !kvm_is_reserved_pfn(pfn) &&
 + PageTransCompound(pfn_to_page(pfn))) {
 + drop_spte(kvm, sptep);
 + sptep = rmap_get_first(*rmapp, &iter);
 + need_tlb_flush = 1;
 + } else
 + sptep = rmap_get_next(&iter);
 + }
 +
 + return need_tlb_flush;
 +}
 +
 +void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
 + struct kvm_memory_slot *memslot)
 +{
 + bool flush = false;
 + unsigned long *rmapp;
 + unsigned long last_index, index;
 + gfn_t gfn_start, gfn_end;
 +
 + spin_lock(&kvm->mmu_lock);
 +
 + gfn_start = memslot->base_gfn;
 + gfn_end = memslot->base_gfn + memslot->npages - 1;
 +
 + if (gfn_start >= gfn_end)
 + goto out;
 +
 + rmapp = memslot->arch.rmap[0];
 + last_index = gfn_to_index(gfn_end, memslot->base_gfn,
 + PT_PAGE_TABLE_LEVEL);
 +
 + for (index = 0; index <= last_index; ++index, ++rmapp) {
 + if (*rmapp)
 + flush |= kvm_mmu_zap_collapsible_spte(kvm, rmapp);
 +
 + if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
 + if (flush) {
 + kvm_flush_remote_tlbs(kvm);
 + flush = false;
 + }
 + cond_resched_lock(&kvm->mmu_lock);
 + }
 + }
 +
 + if (flush)
 + kvm_flush_remote_tlbs(kvm);
 +
 +out:
 + spin_unlock(&kvm->mmu_lock);
 +}
 +
  void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
  

[PATCH v3] kvm: mmu: lazy collapse small sptes into large sptes

2015-04-03 Thread Wanpeng Li
There are two scenarios that require collapsing small sptes into large sptes.
- Dirty logging tracks sptes at 4K granularity, so large sptes are split.
  The large sptes will be reallocated on the destination machine, and the
  guest on the source machine will be destroyed, when live migration
  completes successfully. However, the guest on the source machine will
  continue to run if live migration fails for some reason, and its sptes
  stay small, which leads to bad performance.
- Our customers write tools that track the dirty speed of guests via the
  EPT D bit/PML in order to determine the most appropriate one to live
  migrate; however, the sptes will still stay small after the dirty speed
  has been tracked.

This patch introduces lazy collapsing of small sptes into large sptes: the
memory region is scanned in the ioctl context when dirty logging is stopped,
the sptes which can be collapsed into large pages are dropped during the
scan, and it depends on later #PFs to reallocate all the large sptes.

Reviewed-by: Xiao Guangrong guangrong.x...@linux.intel.com
Signed-off-by: Wanpeng Li wanpeng...@linux.intel.com
---
v2 -> v3:
 * update comments
 * fix infinite for loop
v1 -> v2:
 * use 'bool' instead of 'int'
 * add more comments
 * fix failure to get the next spte after dropping the current spte

 arch/x86/include/asm/kvm_host.h |  2 ++
 arch/x86/kvm/mmu.c              | 73 +++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c              | 19 +++++++++++++++
 3 files changed, 94 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 30b28dc..91b5bdb 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -854,6 +854,8 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
 void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
 void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
  struct kvm_memory_slot *memslot);
+void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
+   struct kvm_memory_slot *memslot);
 void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
   struct kvm_memory_slot *memslot);
 void kvm_mmu_slot_largepage_remove_write_access(struct kvm *kvm,
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index cee7592..ba002a0 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4465,6 +4465,79 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
kvm_flush_remote_tlbs(kvm);
 }
 
+static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
+   unsigned long *rmapp)
+{
+   u64 *sptep;
+   struct rmap_iterator iter;
+   int need_tlb_flush = 0;
+   pfn_t pfn;
+   struct kvm_mmu_page *sp;
+
+   for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
+   BUG_ON(!(*sptep & PT_PRESENT_MASK));
+
+   sp = page_header(__pa(sptep));
+   pfn = spte_to_pfn(*sptep);
+
+   /*
+* Lets support EPT only for now, there still needs to figure
+* out an efficient way to let these codes be aware what mapping
+* level used in guest.
+*/
+   if (sp->role.direct &&
+   !kvm_is_reserved_pfn(pfn) &&
+   PageTransCompound(pfn_to_page(pfn))) {
+   drop_spte(kvm, sptep);
+   sptep = rmap_get_first(*rmapp, &iter);
+   need_tlb_flush = 1;
+   } else
+   sptep = rmap_get_next(&iter);
+   }
+
+   return need_tlb_flush;
+}
+
+void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
+   struct kvm_memory_slot *memslot)
+{
+   bool flush = false;
+   unsigned long *rmapp;
+   unsigned long last_index, index;
+   gfn_t gfn_start, gfn_end;
+
+   spin_lock(&kvm->mmu_lock);
+
+   gfn_start = memslot->base_gfn;
+   gfn_end = memslot->base_gfn + memslot->npages - 1;
+
+   if (gfn_start >= gfn_end)
+   goto out;
+
+   rmapp = memslot->arch.rmap[0];
+   last_index = gfn_to_index(gfn_end, memslot->base_gfn,
+   PT_PAGE_TABLE_LEVEL);
+
+   for (index = 0; index <= last_index; ++index, ++rmapp) {
+   if (*rmapp)
+   flush |= kvm_mmu_zap_collapsible_spte(kvm, rmapp);
+
+   if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
+   if (flush) {
+   kvm_flush_remote_tlbs(kvm);
+   flush = false;
+   }
+   cond_resched_lock(&kvm->mmu_lock);
+   }
+   }
+
+   if (flush)
+   kvm_flush_remote_tlbs(kvm);
+
+out:
+   spin_unlock(&kvm->mmu_lock);
+}
+
 void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
   struct kvm_memory_slot *memslot)
 {
diff --git a/arch/x86/kvm/x86.c