[PATCH v3 00/15] KVM: MMU: fast zap all shadow pages

2013-04-16 Thread Xiao Guangrong
This patchset is based on my previous two patchset:
[PATCH 0/2] KVM: x86: avoid potential soft lockup and unneeded mmu reload
(https://lkml.org/lkml/2013/4/1/2)

[PATCH v2 0/6] KVM: MMU: fast invalid all mmio sptes
(https://lkml.org/lkml/2013/4/1/134)

Changelog:
V3:
  completely redesign the algorithm, please see below.

V2:
  - do not reset n_requested_mmu_pages and n_max_mmu_pages
  - batch free root shadow pages to reduce vcpu notification and mmu-lock
contention
  - remove the first patch that introduced kvm->arch.mmu_cache, since we only
'memset zero' the hashtable rather than all mmu cache members in this
version
  - remove unnecessary kvm_reload_remote_mmus after kvm_mmu_zap_all

* Issue
The current kvm_mmu_zap_all is really slow - it holds mmu-lock while
walking and zapping all shadow pages one by one, and it also needs to zap
every guest page's rmap and every shadow page's parent spte list. Things
become particularly bad when the guest uses more memory or vcpus. It does
not scale.

* Idea
KVM maintains a global mmu invalid generation-number, stored in
kvm->arch.mmu_valid_gen, and every shadow page stores the current global
generation-number in sp->mmu_valid_gen when it is created.

When KVM needs to zap all shadow page sptes, it simply increases the
global generation-number and then reloads the root shadow pages on all
vcpus. Each vcpu will create a new shadow page table according to kvm's
current generation-number, which ensures the old pages are never used
again.

The invalid-gen pages (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen)
are kept in the mmu-cache until the page allocator reclaims them.
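The generation check itself is tiny; here is a minimal, self-contained sketch of the idea, with simplified stand-in types (`struct vm`, `struct sp`) and illustrative names (`sp_is_obsolete`, `fast_zap_all`) that are not KVM's own:

```c
/* Simplified stand-ins for struct kvm and struct kvm_mmu_page. */
struct sp { unsigned long mmu_valid_gen; };
struct vm { unsigned long mmu_valid_gen; };

/* Stamp the current global generation into a newly created page. */
static void sp_init(const struct vm *vm, struct sp *sp)
{
	sp->mmu_valid_gen = vm->mmu_valid_gen;
}

/* A page created before the last bump is obsolete and must not be used. */
static int sp_is_obsolete(const struct vm *vm, const struct sp *sp)
{
	return sp->mmu_valid_gen != vm->mmu_valid_gen;
}

/*
 * "Zap all" becomes O(1): bump the generation; in the real code this is
 * followed by kvm_reload_remote_mmus() so vcpus rebuild their roots.
 */
static void fast_zap_all(struct vm *vm)
{
	vm->mmu_valid_gen++;
}
```

Zapping becomes a single increment plus a remote-mmu reload; the cost of actually freeing obsolete pages is deferred to the reclaim path.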

* Challenges
Page invalidation is requested when a memslot is moved or deleted, and
when kvm is being destroyed; these paths call zap_all_pages, which deletes
all shadow pages using their rmap and lpage-info. After zap_all_pages is
called, the rmap and lpage-info will be freed, so we should implement a
fast way to delete shadow pages from the rmap and lpage-info.

For the lpage-info, we clear all lpage counts when doing zap-all-pages;
invalid shadow pages are then no longer counted in lpage-info, after which
the lpage-info of the invalid memslot can be safely freed. This is also
good for performance - it allows the guest to use hugepages as much as
possible.

For the rmap, we introduce a way to unmap rmap out of mmu-lock.
In order to do that, we should resolve these problems:
1) do not corrupt the rmap
2) keep pte-list-descs available
3) keep shadow page available

Resolve 1):
We make the invalid rmap remove-only, meaning we only delete and clear
sptes from it; no new sptes can be added.
This is reasonable since kvm cannot do address translation on an invalid
rmap (gfn_to_pfn fails on an invalid memslot) and the sptes on an invalid
rmap can never be reused (they belong to invalid shadow pages).

Resolve 2):
We use a placeholder (PTE_LIST_SPTE_SKIP) to indicate that an spte has been
deleted from the rmap, instead of freeing pte-list-descs and moving sptes.
The pte-list-desc entries therefore stay available while the rmap is being
unmapped concurrently.
The pte-list-descs are freed once the memslot is no longer visible to any vcpu.
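As a rough illustration of the remove-only scheme, here is a sketch with a simplified pte-list-desc and an assumed `PTE_LIST_SPTE_SKIP` sentinel value (the real code also needs proper memory ordering for the lock-less reader; this only shows the in-place deletion):

```c
#include <stdint.h>

/* Assumed sentinel marking a deleted entry; the real value may differ. */
#define PTE_LIST_SPTE_SKIP ((uint64_t *)(uintptr_t)-1)

#define DESC_NR 4
struct pte_list_desc {
	uint64_t *sptes[DESC_NR];
	struct pte_list_desc *more;
};

/*
 * Remove-only deletion: replace the slot with a skip marker instead of
 * compacting the desc or freeing it, so a concurrent lock-less walker
 * never sees entries move underneath it.
 */
static int pte_list_clear_entry(struct pte_list_desc *desc, uint64_t *spte)
{
	for (; desc; desc = desc->more) {
		for (int i = 0; i < DESC_NR; i++) {
			if (desc->sptes[i] == spte) {
				desc->sptes[i] = PTE_LIST_SPTE_SKIP;
				return 1;
			}
		}
	}
	return 0;
}
```

A walker simply skips entries equal to `PTE_LIST_SPTE_SKIP`; nothing is freed or moved until the memslot is no longer visible.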

Resolve 3):
We protect the lifecycle of the shadow page with this algorithm:

unmap-rmap-out-of-mmu-lock():
for-each-rmap-in-slot:
  preempt_disable
  kvm->arch.being_unmapped_rmap = rmapp

  clear sptes and reset rmap entries

  kvm->arch.being_unmapped_rmap = NULL
  preempt_enable

Other paths, such as zap-sp and mmu-notify, which are protected
by mmu-lock:

  clear sptes and reset rmap entries
retry:
  if (kvm->arch.being_unmapped_rmap == rmapp)
goto retry
(the wait is very rare and clearing one rmap is very fast, so
it is not bad even if a wait is needed)

Then, we can be sure that the sptes are always available while the rmap is
being unmapped concurrently.
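The handshake above can be sketched with a C11 atomic standing in for kvm->arch.being_unmapped_rmap (the real code relies on preempt_disable() and mmu-lock rather than a bare busy-wait on an atomic, so this is only a model of the ordering, not the implementation):

```c
#include <stdatomic.h>
#include <stddef.h>

/* Stand-in for kvm->arch.being_unmapped_rmap. */
static _Atomic(unsigned long *) being_unmapped_rmap;

/* Lock-less side: publish which rmap we are clearing, clear it, unpublish. */
static void unmap_rmap_nolock(unsigned long *rmapp)
{
	atomic_store(&being_unmapped_rmap, rmapp);
	/* ... clear sptes and reset rmap entries ... */
	atomic_store(&being_unmapped_rmap, NULL);
}

/*
 * mmu-lock side (zap-sp, mmu-notify): after clearing its own entries,
 * wait until the lock-less walker has finished with this rmap.
 */
static void wait_for_nolock_unmapper(unsigned long *rmapp)
{
	while (atomic_load(&being_unmapped_rmap) == rmapp)
		;  /* rare and short: clearing one rmap is fast */
}
```

The key property is that the mmu-lock side never frees anything while the published pointer still equals the rmap it touched.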


* TODO
Use a better algorithm to free the pte-list-descs; for example, we could
link them together through desc->more.
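A possible shape for that TODO, sketched with a stripped-down desc type: freed descs are threaded onto a list through their `more` pointer and reclaimed in one batch later (an assumed simplification; in the real code the reclaim point would be when the memslot is no longer visible to any vcpu):

```c
#include <stddef.h>

/* Minimal desc: only the link matters for this sketch. */
struct desc { struct desc *more; };

/* Defer freeing: push onto a free list via the existing more pointer. */
static void desc_defer_free(struct desc **list, struct desc *d)
{
	d->more = *list;
	*list = d;
}

/* Detach the whole batch once no lock-less walker can still see it. */
static struct desc *desc_reclaim_all(struct desc **list)
{
	struct desc *batch = *list;

	*list = NULL;
	return batch;  /* caller walks ->more and frees each element */
}
```

This avoids per-desc bookkeeping: deferral is a pointer push, and reclaim is a single list detach.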

* Performance
This observably reduces mmu-lock contention and makes the invalidation
preemptible.

Xiao Guangrong (15):
  KVM: x86: clean up and optimize for kvm_arch_free_memslot
  KVM: fold kvm_arch_create_memslot into kvm_arch_prepare_memory_region
  KVM: x86: do not reuse rmap when memslot is moved
  KVM: MMU: abstract memslot rmap related operations
  KVM: MMU: allow per-rmap operations
  KVM: MMU: allow concurrently clearing spte on remove-only pte-list
  KVM: MMU: introduce invalid rmap handlers
  KVM: MMU: allow unmap invalid rmap out of mmu-lock
  KVM: MMU: introduce free_meslot_rmap_desc_nolock
  KVM: x86: introduce memslot_set_lpage_disallowed
  KVM: MMU: introduce kvm_clear_all_lpage_info
  KVM: MMU: fast invalid all shadow pages
  KVM: x86: use the fast way to invalid all pages
  KVM: move srcu_read_lock/srcu_read_unlock to arch-specified code
  KVM: MMU: replace kvm_zap_all with kvm_mmu_invalid_all_pages

 arch/arm/kvm/arm.c  |5 -
 arch/ia64/kvm/kvm-ia64.c|5 -
 arch/powerpc/kvm/powerpc.c  |8 +-
 arch/s390/kvm/kvm-s390.c|5 -
 

[PATCH v3 02/15] KVM: fold kvm_arch_create_memslot into kvm_arch_prepare_memory_region

2013-04-16 Thread Xiao Guangrong
This removes an arch-specific interface and also removes the unnecessary
empty implementations on some architectures.

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/arm/kvm/arm.c |5 -
 arch/ia64/kvm/kvm-ia64.c   |5 -
 arch/powerpc/kvm/powerpc.c |8 ++--
 arch/s390/kvm/kvm-s390.c   |5 -
 arch/x86/kvm/x86.c |7 ++-
 include/linux/kvm_host.h   |1 -
 virt/kvm/kvm_main.c|8 ++--
 7 files changed, 14 insertions(+), 25 deletions(-)

diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index e4ad0bb..c76e63e 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -159,11 +159,6 @@ void kvm_arch_free_memslot(struct kvm_memory_slot *free,
 {
 }
 
-int kvm_arch_create_memslot(struct kvm_memory_slot *slot, unsigned long npages)
-{
-   return 0;
-}
-
 /**
  * kvm_arch_destroy_vm - destroy the VM data structure
  * @kvm:   pointer to the KVM struct
diff --git a/arch/ia64/kvm/kvm-ia64.c b/arch/ia64/kvm/kvm-ia64.c
index 7a54455..fcfb03b 100644
--- a/arch/ia64/kvm/kvm-ia64.c
+++ b/arch/ia64/kvm/kvm-ia64.c
@@ -1553,11 +1553,6 @@ void kvm_arch_free_memslot(struct kvm_memory_slot *free,
 {
 }
 
-int kvm_arch_create_memslot(struct kvm_memory_slot *slot, unsigned long npages)
-{
-   return 0;
-}
-
 int kvm_arch_prepare_memory_region(struct kvm *kvm,
struct kvm_memory_slot *memslot,
struct kvm_userspace_memory_region *mem,
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 16b4595..aab8039 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -405,9 +405,9 @@ void kvm_arch_free_memslot(struct kvm_memory_slot *free,
kvmppc_core_free_memslot(free, dont);
 }
 
-int kvm_arch_create_memslot(struct kvm_memory_slot *slot, unsigned long npages)
+static int kvm_arch_create_memslot(struct kvm_memory_slot *slot)
 {
-   return kvmppc_core_create_memslot(slot, npages);
+   return kvmppc_core_create_memslot(slot, slot->npages);
 }
 
 int kvm_arch_prepare_memory_region(struct kvm *kvm,
@@ -415,6 +415,10 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
   struct kvm_userspace_memory_region *mem,
   enum kvm_mr_change change)
 {
+   if (change == KVM_MR_CREATE)
+   if (kvm_arch_create_memslot(memslot))
+   return -ENOMEM;
+
return kvmppc_core_prepare_memory_region(kvm, memslot, mem);
 }
 
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 33161b4..7bfd6f6 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -967,11 +967,6 @@ void kvm_arch_free_memslot(struct kvm_memory_slot *free,
 {
 }
 
-int kvm_arch_create_memslot(struct kvm_memory_slot *slot, unsigned long npages)
-{
-   return 0;
-}
-
 /* Section: memory related */
 int kvm_arch_prepare_memory_region(struct kvm *kvm,
   struct kvm_memory_slot *memslot,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b0be7ec..447789c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6875,8 +6875,9 @@ void kvm_arch_free_memslot(struct kvm_memory_slot *free,
}
 }
 
-int kvm_arch_create_memslot(struct kvm_memory_slot *slot, unsigned long npages)
+static int kvm_arch_create_memslot(struct kvm_memory_slot *slot)
 {
+   unsigned long npages = slot->npages;
int i;
 
	for (i = 0; i < KVM_NR_PAGE_SIZES; ++i) {
@@ -6938,6 +6939,10 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
struct kvm_userspace_memory_region *mem,
enum kvm_mr_change change)
 {
+   if (change == KVM_MR_CREATE)
+   if (kvm_arch_create_memslot(memslot))
+   return -ENOMEM;
+
/*
 * Only private memory slots need to be mapped here since
 * KVM_SET_MEMORY_REGION ioctl is no longer supported.
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 1c0be23..f39ec18 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -493,7 +493,6 @@ int __kvm_set_memory_region(struct kvm *kvm,
struct kvm_userspace_memory_region *mem);
 void kvm_arch_free_memslot(struct kvm_memory_slot *free,
   struct kvm_memory_slot *dont);
-int kvm_arch_create_memslot(struct kvm_memory_slot *slot, unsigned long 
npages);
 int kvm_arch_prepare_memory_region(struct kvm *kvm,
struct kvm_memory_slot *memslot,
struct kvm_userspace_memory_region *mem,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d21694a..acc9f30 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -825,13 +825,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
	if (!(new.flags & KVM_MEM_LOG_DIRTY_PAGES))
new.dirty_bitmap = NULL;
 
-   r = -ENOMEM;
-   if (change == 

[PATCH v3 01/15] KVM: x86: clean up and optimize for kvm_arch_free_memslot

2013-04-16 Thread Xiao Guangrong
The memslot rmap and lpage-info are never partly reused, and nothing needs
to be freed when a new memslot is created.

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/x86.c |   21 -
 1 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4be4733..b0be7ec 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6856,19 +6856,22 @@ void kvm_arch_free_memslot(struct kvm_memory_slot *free,
 {
int i;
 
+   if (dont && dont->arch.rmap[0] == free->arch.rmap[0])
+   return;
+
+   /* It is an empty memslot. */
+   if (!free->arch.rmap[0])
+   return;
+
	for (i = 0; i < KVM_NR_PAGE_SIZES; ++i) {
-   if (!dont || free->arch.rmap[i] != dont->arch.rmap[i]) {
-   kvm_kvfree(free->arch.rmap[i]);
-   free->arch.rmap[i] = NULL;
-   }
+   kvm_kvfree(free->arch.rmap[i]);
+   free->arch.rmap[i] = NULL;
+
	if (i == 0)
	continue;

-   if (!dont || free->arch.lpage_info[i - 1] !=
-dont->arch.lpage_info[i - 1]) {
-   kvm_kvfree(free->arch.lpage_info[i - 1]);
-   free->arch.lpage_info[i - 1] = NULL;
-   }
+   kvm_kvfree(free->arch.lpage_info[i - 1]);
+   free->arch.lpage_info[i - 1] = NULL;
}
 }
 
-- 
1.7.7.6

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 05/15] KVM: MMU: allow per-rmap operations

2013-04-16 Thread Xiao Guangrong
Introduce rmap_operations to allow an rmap to have different operations;
this lets us handle invalid rmaps specially.

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/include/asm/kvm_host.h |1 +
 arch/x86/kvm/mmu.c  |   31 ---
 arch/x86/kvm/mmu.h  |   16 
 arch/x86/kvm/x86.c  |1 +
 4 files changed, 42 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4e1f7cb..5fd6ed1 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -511,6 +511,7 @@ struct kvm_lpage_info {
 };
 
 struct kvm_arch_memory_slot {
+   struct rmap_operations *ops;
unsigned long *rmap[KVM_NR_PAGE_SIZES];
struct kvm_lpage_info *lpage_info[KVM_NR_PAGE_SIZES - 1];
 };
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 514f5b1..99ad2a4 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1055,13 +1055,13 @@ static int slot_rmap_add(struct kvm_memory_slot *slot,
 struct kvm_vcpu *vcpu, unsigned long *rmapp,
 u64 *spte)
 {
-   return pte_list_add(vcpu, spte, rmapp);
+   return slot->arch.ops->rmap_add(vcpu, spte, rmapp);
 }
 
 static void slot_rmap_remove(struct kvm_memory_slot *slot,
 unsigned long *rmapp, u64 *spte)
 {
-   pte_list_remove(spte, rmapp);
+   slot->arch.ops->rmap_remove(spte, rmapp);
 }
 
 static int rmap_add(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn)
@@ -1238,7 +1238,7 @@ static bool slot_rmap_write_protect(struct 
kvm_memory_slot *slot,
struct kvm *kvm, unsigned long *rmapp,
bool pt_protect)
 {
-   return __rmap_write_protect(kvm, rmapp, pt_protect);
+   return slot->arch.ops->rmap_write_protect(kvm, rmapp, pt_protect);
 }
 
 /**
@@ -1306,7 +1306,7 @@ static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long 
*rmapp)
 static int slot_rmap_unmap(struct kvm *kvm, unsigned long *rmapp,
   struct kvm_memory_slot *slot, unsigned long data)
 {
-   return kvm_unmap_rmapp(kvm, rmapp);
+   return slot->arch.ops->rmap_unmap(kvm, rmapp);
 }
 
 static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned long *rmapp,
@@ -1353,7 +1353,7 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned 
long *rmapp,
 static int slot_rmap_set_pte(struct kvm *kvm, unsigned long *rmapp,
 struct kvm_memory_slot *slot, unsigned long data)
 {
-   return kvm_set_pte_rmapp(kvm, rmapp, (pte_t *)data);
+   return slot->arch.ops->rmap_set_pte(kvm, rmapp, (pte_t *)data);
 }
 
 static int kvm_handle_hva_range(struct kvm *kvm,
@@ -1470,7 +1470,7 @@ out:
 static int slot_rmap_age(struct kvm *kvm, unsigned long *rmapp,
 struct kvm_memory_slot *slot, unsigned long data)
 {
-   int young = kvm_age_rmapp(kvm, rmapp);
+   int young = slot->arch.ops->rmap_age(kvm, rmapp);
 
/* @data has hva passed to kvm_age_hva(). */
trace_kvm_age_page(data, slot, young);
@@ -1508,7 +1508,7 @@ static int slot_rmap_test_age(struct kvm *kvm, unsigned 
long *rmapp,
  struct kvm_memory_slot *slot,
  unsigned long data)
 {
-   return kvm_test_age_rmapp(kvm, rmapp);
+   return slot->arch.ops->rmap_test_age(kvm, rmapp);
 }
 
 #define RMAP_RECYCLE_THRESHOLD 1000
@@ -1537,6 +1537,23 @@ int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
return kvm_handle_hva(kvm, hva, 0, slot_rmap_test_age);
 }
 
+static struct rmap_operations normal_rmap_ops = {
+   .rmap_add = pte_list_add,
+   .rmap_remove = pte_list_remove,
+
+   .rmap_write_protect = __rmap_write_protect,
+
+   .rmap_set_pte = kvm_set_pte_rmapp,
+   .rmap_age = kvm_age_rmapp,
+   .rmap_test_age = kvm_test_age_rmapp,
+   .rmap_unmap = kvm_unmap_rmapp
+};
+
+void init_memslot_rmap_ops(struct kvm_memory_slot *slot)
+{
+   slot->arch.ops = &normal_rmap_ops;
+}
+
 #ifdef MMU_DEBUG
 static int is_empty_shadow_page(u64 *spt)
 {
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index ffd40d1..bb2b22e 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -114,4 +114,20 @@ static inline bool permission_fault(struct kvm_mmu *mmu, 
unsigned pte_access,
	return (mmu->permissions[pfec >> 1] >> pte_access) & 1;
 }
 
+struct rmap_operations {
+   int (*rmap_add)(struct kvm_vcpu *vcpu, u64 *spte,
+   unsigned long *rmap);
+   void (*rmap_remove)(u64 *spte, unsigned long *rmap);
+
+   bool (*rmap_write_protect)(struct kvm *kvm, unsigned long *rmap,
+  bool pt_protect);
+
+   int (*rmap_set_pte)(struct kvm *kvm, unsigned long *rmap,
+   pte_t *ptep);
+   int (*rmap_age)(struct kvm *kvm, unsigned long *rmap);
+   int 

[PATCH v3 04/15] KVM: MMU: abstract memslot rmap related operations

2013-04-16 Thread Xiao Guangrong
Introduce slot_rmap_* functions to abstract the memslot rmap related
operations, which makes the later patches clearer.

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/mmu.c   |  108 +-
 arch/x86/kvm/mmu_audit.c |   10 +++--
 2 files changed, 84 insertions(+), 34 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index dcc059c..514f5b1 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1033,14 +1033,14 @@ static unsigned long *__gfn_to_rmap(gfn_t gfn, int 
level,
 }
 
 /*
- * Take gfn and return the reverse mapping to it.
+ * Take gfn and return the memslot and reverse mapping to it.
  */
-static unsigned long *gfn_to_rmap(struct kvm *kvm, gfn_t gfn, int level)
+static unsigned long *gfn_to_rmap(struct kvm *kvm,
+ struct kvm_memory_slot **slot,
+ gfn_t gfn, int level)
 {
-   struct kvm_memory_slot *slot;
-
-   slot = gfn_to_memslot(kvm, gfn);
-   return __gfn_to_rmap(gfn, level, slot);
+   *slot = gfn_to_memslot(kvm, gfn);
+   return __gfn_to_rmap(gfn, level, *slot);
 }
 
 static bool rmap_can_add(struct kvm_vcpu *vcpu)
@@ -1051,27 +1051,42 @@ static bool rmap_can_add(struct kvm_vcpu *vcpu)
return mmu_memory_cache_free_objects(cache);
 }
 
+static int slot_rmap_add(struct kvm_memory_slot *slot,
+struct kvm_vcpu *vcpu, unsigned long *rmapp,
+u64 *spte)
+{
+   return pte_list_add(vcpu, spte, rmapp);
+}
+
+static void slot_rmap_remove(struct kvm_memory_slot *slot,
+unsigned long *rmapp, u64 *spte)
+{
+   pte_list_remove(spte, rmapp);
+}
+
 static int rmap_add(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn)
 {
+   struct kvm_memory_slot *slot;
struct kvm_mmu_page *sp;
unsigned long *rmapp;
 
sp = page_header(__pa(spte));
	kvm_mmu_page_set_gfn(sp, spte - sp->spt, gfn);
-   rmapp = gfn_to_rmap(vcpu->kvm, gfn, sp->role.level);
-   return pte_list_add(vcpu, spte, rmapp);
+   rmapp = gfn_to_rmap(vcpu->kvm, &slot, gfn, sp->role.level);
+   return slot_rmap_add(slot, vcpu, rmapp, spte);
 }
 
 static void rmap_remove(struct kvm *kvm, u64 *spte)
 {
+   struct kvm_memory_slot *slot;
struct kvm_mmu_page *sp;
gfn_t gfn;
unsigned long *rmapp;
 
sp = page_header(__pa(spte));
	gfn = kvm_mmu_page_get_gfn(sp, spte - sp->spt);
-   rmapp = gfn_to_rmap(kvm, gfn, sp->role.level);
-   pte_list_remove(spte, rmapp);
+   rmapp = gfn_to_rmap(kvm, &slot, gfn, sp->role.level);
+   slot_rmap_remove(slot, rmapp, spte);
 }
 
 /*
@@ -1219,6 +1234,13 @@ static bool __rmap_write_protect(struct kvm *kvm, 
unsigned long *rmapp,
return flush;
 }
 
+static bool slot_rmap_write_protect(struct kvm_memory_slot *slot,
+   struct kvm *kvm, unsigned long *rmapp,
+   bool pt_protect)
+{
+   return __rmap_write_protect(kvm, rmapp, pt_protect);
+}
+
 /**
  * kvm_mmu_write_protect_pt_masked - write protect selected PT level pages
  * @kvm: kvm instance
@@ -1238,7 +1260,7 @@ void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
while (mask) {
	rmapp = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
  PT_PAGE_TABLE_LEVEL, slot);
-   __rmap_write_protect(kvm, rmapp, false);
+   slot_rmap_write_protect(slot, kvm, rmapp, false);
 
/* clear the first set bit */
mask = mask - 1;
@@ -1257,14 +1279,14 @@ static bool rmap_write_protect(struct kvm *kvm, u64 gfn)
for (i = PT_PAGE_TABLE_LEVEL;
	 i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
rmapp = __gfn_to_rmap(gfn, i, slot);
-   write_protected |= __rmap_write_protect(kvm, rmapp, true);
+   write_protected |= slot_rmap_write_protect(slot, kvm, rmapp,
+  true);
}
 
return write_protected;
 }
 
-static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp,
-  struct kvm_memory_slot *slot, unsigned long data)
+static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp)
 {
u64 *sptep;
struct rmap_iterator iter;
@@ -1281,14 +1303,19 @@ static int kvm_unmap_rmapp(struct kvm *kvm, unsigned 
long *rmapp,
return need_tlb_flush;
 }
 
+static int slot_rmap_unmap(struct kvm *kvm, unsigned long *rmapp,
+  struct kvm_memory_slot *slot, unsigned long data)
+{
+   return kvm_unmap_rmapp(kvm, rmapp);
+}
+
 static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned long *rmapp,
-struct kvm_memory_slot *slot, unsigned long data)
+pte_t *ptep)
 {
u64 *sptep;
struct rmap_iterator iter;
 

[PATCH v3 10/15] KVM: x86: introduce memslot_set_lpage_disallowed

2013-04-16 Thread Xiao Guangrong
It is used to mark large pages disallowed at the specified level; it will
be used in a later patch.

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/x86.c |   53 ++-
 1 files changed, 35 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index bec83cd..0c5bb2c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6875,13 +6875,46 @@ void kvm_arch_free_memslot(struct kvm_memory_slot *free,
}
 }
 
+static void memslot_set_lpage_disallowed(struct kvm_memory_slot *slot,
+unsigned long npages,
+int lpage_size, int lpages)
+{
+   struct kvm_lpage_info *lpage_info;
+   unsigned long ugfn;
+   int level = lpage_size + 1;
+
+   WARN_ON(!lpage_size);
+
+   lpage_info = slot->arch.lpage_info[lpage_size - 1];
+
+   if (slot->base_gfn & (KVM_PAGES_PER_HPAGE(level) - 1))
+   lpage_info[0].write_count = 1;
+
+   if ((slot->base_gfn + npages) & (KVM_PAGES_PER_HPAGE(level) - 1))
+   lpage_info[lpages - 1].write_count = 1;
+
+   ugfn = slot->userspace_addr >> PAGE_SHIFT;
+
+   /*
+* If the gfn and userspace address are not aligned wrt each
+* other, or if explicitly asked to, disable large page
+* support for this slot
+*/
+   if ((slot->base_gfn ^ ugfn) & (KVM_PAGES_PER_HPAGE(level) - 1) ||
+ !kvm_largepages_enabled()) {
+   unsigned long j;
+
+   for (j = 0; j < lpages; ++j)
+   lpage_info[j].write_count = 1;
+   }
+}
+
 static int kvm_arch_create_memslot(struct kvm_memory_slot *slot)
 {
unsigned long npages = slot-npages;
int i;
 
	for (i = 0; i < KVM_NR_PAGE_SIZES; ++i) {
-   unsigned long ugfn;
int lpages;
int level = i + 1;
 
@@ -6900,23 +6933,7 @@ static int kvm_arch_create_memslot(struct 
kvm_memory_slot *slot)
	if (!slot->arch.lpage_info[i - 1])
goto out_free;
 
-   if (slot->base_gfn & (KVM_PAGES_PER_HPAGE(level) - 1))
-   slot->arch.lpage_info[i - 1][0].write_count = 1;
-   if ((slot->base_gfn + npages) & (KVM_PAGES_PER_HPAGE(level) - 1))
-   slot->arch.lpage_info[i - 1][lpages - 1].write_count = 1;
-   ugfn = slot->userspace_addr >> PAGE_SHIFT;
-   /*
-* If the gfn and userspace address are not aligned wrt each
-* other, or if explicitly asked to, disable large page
-* support for this slot
-*/
-   if ((slot->base_gfn ^ ugfn) & (KVM_PAGES_PER_HPAGE(level) - 1) ||
-   !kvm_largepages_enabled()) {
-   unsigned long j;
-
-   for (j = 0; j < lpages; ++j)
-   slot->arch.lpage_info[i - 1][j].write_count = 1;
-   }
+   memslot_set_lpage_disallowed(slot, npages, i, lpages);
}
 
init_memslot_rmap_ops(slot);
-- 
1.7.7.6



[PATCH v3 15/15] KVM: MMU: replace kvm_zap_all with kvm_mmu_invalid_all_pages

2013-04-16 Thread Xiao Guangrong
Use kvm_mmu_invalid_all_pages in kvm_arch_flush_shadow_all, and
rename kvm_zap_all to kvm_mmu_free_all, which is used to free all
memory used by the kvm mmu when the vm is being destroyed. At that
time, no vcpu exists and mmu-notify has been unregistered, so we can
free the shadow pages out of mmu-lock.

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/include/asm/kvm_host.h |2 +-
 arch/x86/kvm/mmu.c  |   15 ++-
 arch/x86/kvm/x86.c  |9 -
 3 files changed, 7 insertions(+), 19 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6f8ee18..a336055 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -771,7 +771,7 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int 
slot);
 void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
 struct kvm_memory_slot *slot,
 gfn_t gfn_offset, unsigned long mask);
-void kvm_mmu_zap_all(struct kvm *kvm);
+void kvm_mmu_free_all(struct kvm *kvm);
 void kvm_arch_init_generation(struct kvm *kvm);
 void kvm_mmu_invalid_mmio_sptes(struct kvm *kvm);
 unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm);
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 12129b7..10c43ea 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4639,28 +4639,17 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, 
int slot)
	spin_unlock(&kvm->mmu_lock);
 }
 
-void kvm_mmu_zap_all(struct kvm *kvm)
+void kvm_mmu_free_all(struct kvm *kvm)
 {
struct kvm_mmu_page *sp, *node;
LIST_HEAD(invalid_list);
 
-   might_sleep();
-
-   spin_lock(&kvm->mmu_lock);
 restart:
-   list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link) {
+   list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link)
	if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
	goto restart;

-   if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
-   kvm_mmu_commit_zap_page(kvm, &invalid_list);
-   cond_resched_lock(&kvm->mmu_lock);
-   goto restart;
-   }
-   }
-
	kvm_mmu_commit_zap_page(kvm, &invalid_list);
-   spin_unlock(&kvm->mmu_lock);
 }
 
 static void kvm_mmu_zap_mmio_sptes(struct kvm *kvm)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d3dd0d5..4bb88f5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6840,6 +6840,7 @@ void kvm_arch_sync_events(struct kvm *kvm)
 
 void kvm_arch_destroy_vm(struct kvm *kvm)
 {
+   kvm_mmu_free_all(kvm);
kvm_iommu_unmap_guest(kvm);
	kfree(kvm->arch.vpic);
	kfree(kvm->arch.vioapic);
@@ -7056,11 +7057,9 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
 
 void kvm_arch_flush_shadow_all(struct kvm *kvm)
 {
-   int idx;
-
-   idx = srcu_read_lock(&kvm->srcu);
-   kvm_mmu_zap_all(kvm);
-   srcu_read_unlock(&kvm->srcu, idx);
+   mutex_lock(&kvm->slots_lock);
+   kvm_mmu_invalid_memslot_pages(kvm, INVALID_ALL_SLOTS);
+   mutex_unlock(&kvm->slots_lock);
 }
 
 void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
-- 
1.7.7.6



[PATCH v3 12/15] KVM: MMU: fast invalid all shadow pages

2013-04-16 Thread Xiao Guangrong
The current kvm_mmu_zap_all is really slow - it holds mmu-lock while
walking and zapping all shadow pages one by one, and it also needs to zap
every guest page's rmap and every shadow page's parent spte list. Things
become particularly bad when the guest uses more memory or vcpus. It does
not scale.

In this patch, we introduce a faster way to invalidate all shadow pages.
KVM maintains a global mmu invalid generation-number, stored in
kvm->arch.mmu_valid_gen, and every shadow page stores the current global
generation-number in sp->mmu_valid_gen when it is created.

When KVM needs to zap all shadow page sptes, it simply increases the
global generation-number and then reloads the root shadow pages on all
vcpus. Each vcpu will create a new shadow page table according to kvm's
current generation-number, which ensures the old pages are never used
again.

The invalid-gen pages (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen)
are kept in the mmu-cache until the page allocator reclaims them.

If the invalidation is due to a memslot change, the slot's rmap and
lpage-info will be freed soon; to avoid using invalid memory, we unmap
all sptes on its rmap and always reset the lpage-info of all memslots so
that the rmap and lpage-info can be safely freed.

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/include/asm/kvm_host.h |2 +
 arch/x86/kvm/mmu.c  |   85 +-
 arch/x86/kvm/mmu.h  |4 ++
 arch/x86/kvm/x86.c  |6 +++
 4 files changed, 94 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1ad9a34..6f8ee18 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -223,6 +223,7 @@ struct kvm_mmu_page {
int root_count;  /* Currently serving as active root */
unsigned int unsync_children;
unsigned long parent_ptes;  /* Reverse mapping for parent_pte */
+   unsigned long mmu_valid_gen;
DECLARE_BITMAP(unsync_child_bitmap, 512);
 
 #ifdef CONFIG_X86_32
@@ -531,6 +532,7 @@ struct kvm_arch {
unsigned int n_requested_mmu_pages;
unsigned int n_max_mmu_pages;
unsigned int indirect_shadow_pages;
+   unsigned long mmu_valid_gen;
struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
/*
 * Hash table of struct kvm_mmu_page.
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9ac584f..12129b7 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1732,6 +1732,11 @@ static struct rmap_operations invalid_rmap_ops = {
.rmap_unmap = kvm_unmap_invalid_rmapp
 };
 
+static void init_invalid_memslot_rmap_ops(struct kvm_memory_slot *slot)
+{
+   slot->arch.ops = &invalid_rmap_ops;
+}
+
 typedef void (*handle_rmap_fun)(unsigned long *rmapp, void *data);
 static void walk_memslot_rmap_nolock(struct kvm_memory_slot *slot,
 handle_rmap_fun fun, void *data)
@@ -1812,6 +1817,65 @@ void free_meslot_rmap_desc_nolock(struct kvm_memory_slot 
*slot)
walk_memslot_rmap_nolock(slot, free_rmap_desc_nolock, NULL);
 }
 
+/*
+ * Fast invalidate all shadow pages that belong to @slot.
+ *
+ * @slot != NULL means the invalidation is caused by the memslot specified
+ * by @slot being deleted; in this case, we should ensure that the rmap
+ * and lpage-info of the @slot can not be used after calling the function.
+ * In particular, if @slot is INVALID_ALL_SLOTS, all slots will be deleted
+ * soon; this always happens when kvm is being destroyed.
+ *
+ * @slot == NULL means the invalidation is due to other reasons; we need
+ * not care about rmap and lpage-info since they are still valid after
+ * calling the function.
+ */
+void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
+  struct kvm_memory_slot *slot)
+{
+   struct kvm_memory_slot *each_slot;
+
+   spin_lock(&kvm->mmu_lock);
+   kvm->arch.mmu_valid_gen++;
+
+   if (slot == INVALID_ALL_SLOTS)
+   kvm_for_each_memslot(each_slot, kvm_memslots(kvm))
+   init_invalid_memslot_rmap_ops(each_slot);
+   else if (slot)
+   init_invalid_memslot_rmap_ops(slot);
+
+   /*
+* All shadow pages are invalid; reset the large page info,
+* then we can safely destroy the memslot. It is also good
+* for large page usage.
+*/
+   kvm_clear_all_lpage_info(kvm);
+
+   /*
+* Notify all vcpus to reload their shadow page tables
+* and flush TLB. Then all vcpus will switch to the new
+* shadow page table with the new mmu_valid_gen.
+*
+* Note: we should do this under the protection of
+* mmu-lock, otherwise, vcpu would purge shadow page
+* but miss tlb flush.
+*/
+   kvm_reload_remote_mmus(kvm);
+   spin_unlock(&kvm->mmu_lock);
+
+   if (slot == INVALID_ALL_SLOTS)
+   kvm_for_each_memslot(each_slot, kvm_memslots(kvm))
+   

[PATCH v3 14/15] KVM: move srcu_read_lock/srcu_read_unlock to arch-specified code

2013-04-16 Thread Xiao Guangrong
Move srcu_read_lock/srcu_read_unlock in kvm_mmu_notifier_release
to kvm_arch_flush_shadow_all, since we will hold slots_lock instead
of srcu.

Only ARM, POWERPC and x86 use mmu-notify, and kvm_arch_flush_shadow_all
does nothing on ARM and POWERPC, so we only need to modify the code on
x86.

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/x86.c  |4 
 virt/kvm/kvm_main.c |3 ---
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6e7c85b..d3dd0d5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7056,7 +7056,11 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
 
 void kvm_arch_flush_shadow_all(struct kvm *kvm)
 {
+   int idx;
+
+   idx = srcu_read_lock(&kvm->srcu);
	kvm_mmu_zap_all(kvm);
+   srcu_read_unlock(&kvm->srcu, idx);
 }
 
 void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index acc9f30..f48eef9 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -418,11 +418,8 @@ static void kvm_mmu_notifier_release(struct mmu_notifier 
*mn,
 struct mm_struct *mm)
 {
struct kvm *kvm = mmu_notifier_to_kvm(mn);
-   int idx;
 
-   idx = srcu_read_lock(&kvm->srcu);
	kvm_arch_flush_shadow_all(kvm);
-   srcu_read_unlock(&kvm->srcu, idx);
 }
 
 static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
-- 
1.7.7.6



[PATCH v3 13/15] KVM: x86: use the fast way to invalid all pages

2013-04-16 Thread Xiao Guangrong
Replace kvm_mmu_zap_all with kvm_mmu_invalid_memslot_pages, except on
the mmu_notifier->release() path, which will be replaced in
a later patch

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/x86.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6dbb80c..6e7c85b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5465,7 +5465,7 @@ static int emulator_fix_hypercall(struct x86_emulate_ctxt *ctxt)
 * to ensure that the updated hypercall appears atomically across all
 * VCPUs.
 */
-   kvm_mmu_zap_all(vcpu->kvm);
+   kvm_mmu_invalid_memslot_pages(vcpu->kvm, NULL);
 
	kvm_x86_ops->patch_hypercall(vcpu, instruction);
 
@@ -7062,7 +7062,7 @@ void kvm_arch_flush_shadow_all(struct kvm *kvm)
 void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
				   struct kvm_memory_slot *slot)
 {
-   kvm_arch_flush_shadow_all(kvm);
+   kvm_mmu_invalid_memslot_pages(kvm, slot);
 }
 
 int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
-- 
1.7.7.6



[PATCH v3 11/15] KVM: MMU: introduce kvm_clear_all_lpage_info

2013-04-16 Thread Xiao Guangrong
This function resets the large-page info of all guest pages; it
will be used in a later patch

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/x86.c |   25 +
 arch/x86/kvm/x86.h |2 ++
 2 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0c5bb2c..fc4956c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6909,6 +6909,31 @@ static void memslot_set_lpage_disallowed(struct kvm_memory_slot *slot,
}
 }
 
+static void clear_memslot_lpage_info(struct kvm_memory_slot *slot)
+{
+   int i;
+
+   for (i = 1; i < KVM_NR_PAGE_SIZES; ++i) {
+   int lpages;
+   int level = i + 1;
+
+   lpages = gfn_to_index(slot->base_gfn + slot->npages - 1,
+ slot->base_gfn, level) + 1;
+
+   memset(slot->arch.lpage_info[i - 1], 0,
+  lpages * sizeof(*slot->arch.lpage_info[i - 1]));
+   memslot_set_lpage_disallowed(slot, slot->npages, i, lpages);
+   }
+}
+
+void kvm_clear_all_lpage_info(struct kvm *kvm)
+{
+   struct kvm_memory_slot *slot;
+
+   kvm_for_each_memslot(slot, kvm->memslots)
+   clear_memslot_lpage_info(slot);
+}
+
 static int kvm_arch_create_memslot(struct kvm_memory_slot *slot)
 {
	unsigned long npages = slot->npages;
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index e224f7a..beae540 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -108,6 +108,8 @@ static inline bool vcpu_match_mmio_gpa(struct kvm_vcpu *vcpu, gpa_t gpa)
return false;
 }
 
+void kvm_clear_all_lpage_info(struct kvm *kvm);
+
 void kvm_before_handle_nmi(struct kvm_vcpu *vcpu);
 void kvm_after_handle_nmi(struct kvm_vcpu *vcpu);
 int kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq, int inc_eip);
-- 
1.7.7.6



[PATCH v3 08/15] KVM: MMU: allow unmap invalid rmap out of mmu-lock

2013-04-16 Thread Xiao Guangrong
pte_list_clear_concurrently allows us to reset a pte-desc entry
outside of mmu-lock. We can reset an spte out of mmu-lock if we can
protect the lifecycle of the sp; we use the following way to achieve
that goal:

unmap_memslot_rmap_nolock():
for-each-rmap-in-slot:
  preempt_disable
  kvm->arch.being_unmapped_rmap = rmapp
  clear spte and reset rmap entry
  kvm->arch.being_unmapped_rmap = NULL
  preempt_enable

Other paths, like zap-sp and mmu-notify, which are protected
by mmu-lock:
  clear spte and reset rmap entry
retry:
  if (kvm->arch.being_unmapped_rmap == rmapp)
goto retry
(the wait is very rare and clearing one rmap is very fast; it
is not bad even if the wait is needed)

Then, we can be sure the spte is always available when we do
unmap_memslot_rmap_nolock

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/include/asm/kvm_host.h |2 +
 arch/x86/kvm/mmu.c  |  114 ---
 arch/x86/kvm/mmu.h  |2 +-
 3 files changed, 110 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5fd6ed1..1ad9a34 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -536,6 +536,8 @@ struct kvm_arch {
 * Hash table of struct kvm_mmu_page.
 */
struct list_head active_mmu_pages;
+   unsigned long *being_unmapped_rmap;
+
struct list_head assigned_dev_head;
struct iommu_domain *iommu_domain;
int iommu_flags;
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 2a7a5d0..e6414d2 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1104,10 +1104,10 @@ static int slot_rmap_add(struct kvm_memory_slot *slot,
	return slot->arch.ops->rmap_add(vcpu, spte, rmapp);
 }
 
-static void slot_rmap_remove(struct kvm_memory_slot *slot,
+static void slot_rmap_remove(struct kvm_memory_slot *slot, struct kvm *kvm,
 unsigned long *rmapp, u64 *spte)
 {
-   slot->arch.ops->rmap_remove(spte, rmapp);
+   slot->arch.ops->rmap_remove(kvm, spte, rmapp);
 }
 
 static int rmap_add(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn)
@@ -1132,7 +1132,7 @@ static void rmap_remove(struct kvm *kvm, u64 *spte)
	sp = page_header(__pa(spte));
	gfn = kvm_mmu_page_get_gfn(sp, spte - sp->spt);
	rmapp = gfn_to_rmap(kvm, slot, gfn, sp->role.level);
-   slot_rmap_remove(slot, rmapp, spte);
+   slot_rmap_remove(slot, kvm, rmapp, spte);
 }
 
 /*
@@ -1589,9 +1589,14 @@ int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
return kvm_handle_hva(kvm, hva, 0, slot_rmap_test_age);
 }
 
+static void rmap_remove_spte(struct kvm *kvm, u64 *spte, unsigned long *rmapp)
+{
+   pte_list_remove(spte, rmapp);
+}
+
 static struct rmap_operations normal_rmap_ops = {
.rmap_add = pte_list_add,
-   .rmap_remove = pte_list_remove,
+   .rmap_remove = rmap_remove_spte,
 
.rmap_write_protect = __rmap_write_protect,
 
@@ -1613,9 +1618,27 @@ static int invalid_rmap_add(struct kvm_vcpu *vcpu, u64 *spte,
return 0;
 }
 
-static void invalid_rmap_remove(u64 *spte, unsigned long *rmapp)
+static void sync_being_unmapped_rmap(struct kvm *kvm, unsigned long *rmapp)
+{
+   /*
+* Ensure all the sptes on the rmap have been zapped and
+* the rmap's entries have been reset so that
+* unmap_invalid_rmap_nolock can not get any spte from the
+* rmap after calling sync_being_unmapped_rmap().
+*/
+   smp_mb();
+retry:
+   if (unlikely(ACCESS_ONCE(kvm->arch.being_unmapped_rmap) == rmapp)) {
+   cpu_relax();
+   goto retry;
+   }
+}
+
+static void
+invalid_rmap_remove(struct kvm *kvm, u64 *spte, unsigned long *rmapp)
 {
pte_list_clear_concurrently(spte, rmapp);
+   sync_being_unmapped_rmap(kvm, rmapp);
 }
 
 static bool invalid_rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
@@ -1635,7 +1658,11 @@ static int __kvm_unmap_invalid_rmapp(unsigned long *rmapp)
if (sptep == PTE_LIST_SPTE_SKIP)
continue;
 
-   /* Do not call .rmap_remove(). */
+   /*
+* Do not call .rmap_remove() since we do not want to wait
+* on sync_being_unmapped_rmap() when all sptes should be
+* removed from the rmap.
+*/
if (mmu_spte_clear_track_bits(sptep))
pte_list_clear_concurrently(sptep, rmapp);
}
@@ -1645,7 +1672,10 @@ static int __kvm_unmap_invalid_rmapp(unsigned long *rmapp)
 
 static int kvm_unmap_invalid_rmapp(struct kvm *kvm, unsigned long *rmapp)
 {
-   return __kvm_unmap_invalid_rmapp(rmapp);
+   int ret = __kvm_unmap_invalid_rmapp(rmapp);
+
+   sync_being_unmapped_rmap(kvm, rmapp);
+   return ret;
 }
 
 static int invalid_rmap_set_pte(struct kvm *kvm, unsigned long *rmapp,
@@ -1686,6 +1716,76 @@ 

[PATCH v3 09/15] KVM: MMU: introduce free_meslot_rmap_desc_nolock

2013-04-16 Thread Xiao Guangrong
It frees the pte-list-descs used by a memslot's rmaps after the
memslot update is completed

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/mmu.c |   26 ++
 arch/x86/kvm/mmu.h |1 +
 2 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index e6414d2..9ac584f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1049,6 +1049,22 @@ static void pte_list_clear_concurrently(u64 *spte, unsigned long *pte_list)
return;
 }
 
+static void pte_list_free_desc(unsigned long *pte_list)
+{
+   struct pte_list_desc *desc, *next;
+   unsigned long pte_value = *pte_list;
+
+   if (!(pte_value & 1))
+   return;
+
+   desc = (struct pte_list_desc *)(pte_value & ~1ul);
+   do {
+   next = desc->more;
+   mmu_free_pte_list_desc(desc);
+   desc = next;
+   } while (desc);
+}
+
 typedef void (*pte_list_walk_fn) (u64 *spte);
 static void pte_list_walk(unsigned long *pte_list, pte_list_walk_fn fn)
 {
@@ -1786,6 +1802,16 @@ unmap_memslot_rmap_nolock(struct kvm *kvm, struct kvm_memory_slot *slot)
walk_memslot_rmap_nolock(slot, unmap_invalid_rmap_nolock, kvm);
 }
 
+static void free_rmap_desc_nolock(unsigned long *rmapp, void *data)
+{
+   pte_list_free_desc(rmapp);
+}
+
+void free_meslot_rmap_desc_nolock(struct kvm_memory_slot *slot)
+{
+   walk_memslot_rmap_nolock(slot, free_rmap_desc_nolock, NULL);
+}
+
 #ifdef MMU_DEBUG
 static int is_empty_shadow_page(u64 *spt)
 {
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index d6aa31a..ab434b7 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -130,4 +130,5 @@ struct rmap_operations {
 };
 
 void init_memslot_rmap_ops(struct kvm_memory_slot *slot);
+void free_meslot_rmap_desc_nolock(struct kvm_memory_slot *slot);
 #endif
-- 
1.7.7.6



[PATCH v3 06/15] KVM: MMU: allow concurrently clearing spte on remove-only pte-list

2013-04-16 Thread Xiao Guangrong
This patch introduces PTE_LIST_SPTE_SKIP, a placeholder which is set
in the pte-list after removing an spte so that the other sptes on
this pte_list are not moved and the pte-list-descs on the pte-list
are not freed.

If vcpus can not add sptes to the pte-list (e.g. the rmap of an
invalid memslot) and sptes can not be freed during the pte-list walk,
we can concurrently clear sptes on the pte-list; the worst case is
that we double-zap an spte, which is safe.

This patch only ensures that concurrently zapping a pte-list is safe;
we will keep sptes available during concurrent clearing in the later
patches

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/mmu.c |   62 +++
 1 files changed, 57 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 99ad2a4..850eab5 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -900,6 +900,18 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn)
 }
 
 /*
+ * It is a placeholder which is set in the pte-list after removing
+ * an spte so that the other sptes on this pte_list are not moved and
+ * the pte-list-descs on the pte-list are not freed.
+ *
+ * If vcpus can not add sptes to the pte-list (e.g. the rmap of an
+ * invalid memslot) and sptes can not be freed during the pte-list walk,
+ * we can concurrently clear sptes on the pte-list; the worst case is
+ * that we double-zap an spte, which is safe.
+ */
+#define PTE_LIST_SPTE_SKIP (u64 *)((~0x0ul) & (~1))
+
+/*
  * Pte mapping structures:
  *
  * If pte_list bit zero is zero, then pte_list point to the spte.
@@ -1003,6 +1015,40 @@ static void pte_list_remove(u64 *spte, unsigned long *pte_list)
}
 }
 
+static void pte_list_clear_concurrently(u64 *spte, unsigned long *pte_list)
+{
+   struct pte_list_desc *desc;
+   unsigned long pte_value = *pte_list;
+   int i;
+
+   /* Empty pte list stores nothing. */
+   WARN_ON(!pte_value);
+
+   if (!(pte_value & 1)) {
+   if ((u64 *)pte_value == spte) {
+   *pte_list = (unsigned long)PTE_LIST_SPTE_SKIP;
+   return;
+   }
+
+   /* someone has already cleared it. */
+   WARN_ON(pte_value != (unsigned long)PTE_LIST_SPTE_SKIP);
+   return;
+   }
+
+   desc = (struct pte_list_desc *)(pte_value & ~1ul);
+   while (desc) {
+   for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
+   if (desc->sptes[i] == spte) {
+   desc->sptes[i] = PTE_LIST_SPTE_SKIP;
+   return;
+   }
+
+   desc = desc->more;
+   }
+
+   return;
+}
+
 typedef void (*pte_list_walk_fn) (u64 *spte);
 static void pte_list_walk(unsigned long *pte_list, pte_list_walk_fn fn)
 {
@@ -1214,6 +1260,12 @@ spte_write_protect(struct kvm *kvm, u64 *sptep, bool *flush, bool pt_protect)
return false;
 }
 
+/* PTE_LIST_SPTE_SKIP is only used on invalid rmap. */
+static void check_valid_sptep(u64 *sptep)
+{
+   WARN_ON(sptep == PTE_LIST_SPTE_SKIP || !is_rmap_spte(*sptep));
+}
+
 static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
 bool pt_protect)
 {
@@ -1222,7 +1274,7 @@ static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
	bool flush = false;
 
	for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
-   BUG_ON(!(*sptep & PT_PRESENT_MASK));
+   check_valid_sptep(sptep);
	if (spte_write_protect(kvm, sptep, &flush, pt_protect)) {
	sptep = rmap_get_first(*rmapp, &iter);
	continue;
continue;
@@ -1293,7 +1345,7 @@ static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp)
	int need_tlb_flush = 0;
 
	while ((sptep = rmap_get_first(*rmapp, &iter))) {
-   BUG_ON(!(*sptep & PT_PRESENT_MASK));
+   check_valid_sptep(sptep);
	rmap_printk("kvm_rmap_unmap_hva: spte %p %llx\n", sptep, *sptep);
 
	drop_spte(kvm, sptep);
@@ -1322,7 +1374,7 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned long *rmapp,
	new_pfn = pte_pfn(*ptep);
 
	for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
-   BUG_ON(!is_shadow_present_pte(*sptep));
+   check_valid_sptep(sptep);
	rmap_printk("kvm_set_pte_rmapp: spte %p %llx\n", sptep, *sptep);
 
	need_flush = 1;
@@ -1455,7 +1507,7 @@ static int kvm_age_rmapp(struct kvm *kvm, unsigned long *rmapp)
 
	for (sptep = rmap_get_first(*rmapp, &iter); sptep;
	 sptep = rmap_get_next(&iter)) {
-   BUG_ON(!is_shadow_present_pte(*sptep));
+   check_valid_sptep(sptep);
 
	if (*sptep & shadow_accessed_mask) {
	young = 1;
@@ -1493,7 +1545,7 @@ static int kvm_test_age_rmapp(struct kvm 

[PATCH v3 07/15] KVM: MMU: introduce invalid rmap handlers

2013-04-16 Thread Xiao Guangrong
Invalid rmaps are the rmaps of an invalid memslot which is being
deleted; in particular, we can treat all rmaps as invalid when
kvm is being destroyed, since all memslots will be deleted soon.
The MMU should remove all sptes on these rmaps before the invalid
memslot is fully deleted.

The reason we handle invalid rmaps separately is that we want to
unmap the invalid rmaps outside of mmu-lock, to achieve scalable
performance on guests with intensive memory and vcpu usage.

This patch makes all the operations on an invalid rmap clear the
sptes and reset the rmap's entries. In a later patch, we will
introduce the out-of-mmu-lock path to unmap invalid rmaps

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/mmu.c |   80 
 1 files changed, 80 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 850eab5..2a7a5d0 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1606,6 +1606,86 @@ void init_memslot_rmap_ops(struct kvm_memory_slot *slot)
	slot->arch.ops = &normal_rmap_ops;
 }
 
+static int invalid_rmap_add(struct kvm_vcpu *vcpu, u64 *spte,
+   unsigned long *pte_list)
+{
+   WARN_ON(1);
+   return 0;
+}
+
+static void invalid_rmap_remove(u64 *spte, unsigned long *rmapp)
+{
+   pte_list_clear_concurrently(spte, rmapp);
+}
+
+static bool invalid_rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
+  bool pt_protect)
+{
+   WARN_ON(1);
+   return false;
+}
+
+static int __kvm_unmap_invalid_rmapp(unsigned long *rmapp)
+{
+   u64 *sptep;
+   struct rmap_iterator iter;
+
+   for (sptep = rmap_get_first(*rmapp, &iter); sptep;
+ sptep = rmap_get_next(&iter)) {
+   if (sptep == PTE_LIST_SPTE_SKIP)
+   continue;
+
+   /* Do not call .rmap_remove(). */
+   if (mmu_spte_clear_track_bits(sptep))
+   pte_list_clear_concurrently(sptep, rmapp);
+   }
+
+   return 0;
+}
+
+static int kvm_unmap_invalid_rmapp(struct kvm *kvm, unsigned long *rmapp)
+{
+   return __kvm_unmap_invalid_rmapp(rmapp);
+}
+
+static int invalid_rmap_set_pte(struct kvm *kvm, unsigned long *rmapp,
+   pte_t *ptep)
+{
+   return kvm_unmap_invalid_rmapp(kvm, rmapp);
+}
+
+/*
+ * Invalid rmaps are the rmaps of an invalid memslot which is being
+ * deleted; in particular, we can treat all rmaps as invalid when
+ * kvm is being destroyed since all memslots will be deleted soon.
+ * The MMU should remove all sptes on these rmaps before the invalid
+ * memslot is fully deleted.
+ *
+ * VCPUs can not do address translation on invalid memslots, which
+ * means no sptes can be added to their rmaps and no shadow page
+ * can be created in their memory regions, so rmap_add and
+ * rmap_write_protect on an invalid memslot should never happen.
+ * Any sptes on invalid rmaps are stale and can not be reused;
+ * we drop all sptes on any other operation. So, all handlers
+ * on an invalid rmap do the same thing - remove and zap sptes on
+ * the rmap.
+ *
+ * KVM uses pte_list_clear_concurrently to clear sptes on an invalid
+ * rmap, which resets the rmap's entries but keeps the rmap's memory.
+ * The rmap is fully destroyed when the invalid memslot is freed.
+ */
+static struct rmap_operations invalid_rmap_ops = {
+   .rmap_add = invalid_rmap_add,
+   .rmap_remove = invalid_rmap_remove,
+
+   .rmap_write_protect = invalid_rmap_write_protect,
+
+   .rmap_set_pte = invalid_rmap_set_pte,
+   .rmap_age = kvm_unmap_invalid_rmapp,
+   .rmap_test_age = kvm_unmap_invalid_rmapp,
+   .rmap_unmap = kvm_unmap_invalid_rmapp
+};
+
 #ifdef MMU_DEBUG
 static int is_empty_shadow_page(u64 *spt)
 {
-- 
1.7.7.6



[PATCH v3 03/15] KVM: x86: do not reuse rmap when memslot is moved

2013-04-16 Thread Xiao Guangrong
Make kvm not reuse the rmap of a memslot which is being moved; then
the rmap of a moved or deleted memslot can only be unmapped, and no
new spte can be added to it.

This makes it possible to unmap rmaps out of mmu-lock in the later patches

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/x86.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 447789c..839e666 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6939,7 +6939,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
struct kvm_userspace_memory_region *mem,
enum kvm_mr_change change)
 {
-   if (change == KVM_MR_CREATE)
+   if (change == KVM_MR_CREATE || change == KVM_MR_MOVE)
if (kvm_arch_create_memslot(memslot))
return -ENOMEM;
 
-- 
1.7.7.6



Re: [Virt-test-devel] [virt-test][PATCH 4/7] virt: Adds named variants to Cartesian config.

2013-04-16 Thread Alex Jia

On 03/30/2013 01:14 AM, Jiří Župka wrote:

variants name=tests:
   - wait:
run = wait
variants:
  - long:
 time = short_time
  - short: long
 time = logn_time
   - test2:
run = test1

variants name=virt_system:
   - linux:
   - windows_XP:

variants name=host_os:
   - linux:
image = linux
   - windows_XP:
image = windows

testswait.short:
 shutdown = destroy

only host_oslinux
Jiří, I pasted the above example into demo.cfg and ran it via the
cartesian parser, then I got the error __main__.ParserError: 'variants'
is not allowed inside a conditional block
(libvirt/tests/cfg/demo.cfg:4). Is anything wrong with my usage? Thanks.




Re: [qemu-devel] Bug Report: VM crashed for some kinds of vCPU in nested virtualization

2013-04-16 Thread Jan Kiszka
On 2013-04-16 05:49, 李春奇 Arthur Chunqi Li wrote:
 I changed to the latest version of the kvm kernel but the bug still occurred.
 
 On the startup of L1 VM on the host, the host kern.log will output:
 Apr 16 11:28:22 Blade1-02 kernel: [ 4908.458090] kvm [2808]: vcpu0
 unhandled rdmsr: 0x345
 Apr 16 11:28:22 Blade1-02 kernel: [ 4908.458166] kvm_set_msr_common: 22
 callbacks suppressed
 Apr 16 11:28:22 Blade1-02 kernel: [ 4908.458169] kvm [2808]: vcpu0
 unhandled wrmsr: 0x40 data 0
 Apr 16 11:28:22 Blade1-02 kernel: [ 4908.458176] kvm [2808]: vcpu0
 unhandled wrmsr: 0x60 data 0
 Apr 16 11:28:22 Blade1-02 kernel: [ 4908.458182] kvm [2808]: vcpu0
 unhandled wrmsr: 0x41 data 0
 Apr 16 11:28:22 Blade1-02 kernel: [ 4908.458188] kvm [2808]: vcpu0
 unhandled wrmsr: 0x61 data 0
 Apr 16 11:28:22 Blade1-02 kernel: [ 4908.458194] kvm [2808]: vcpu0
 unhandled wrmsr: 0x42 data 0
 Apr 16 11:28:22 Blade1-02 kernel: [ 4908.458200] kvm [2808]: vcpu0
 unhandled wrmsr: 0x62 data 0
 Apr 16 11:28:22 Blade1-02 kernel: [ 4908.458206] kvm [2808]: vcpu0
 unhandled wrmsr: 0x43 data 0
 Apr 16 11:28:22 Blade1-02 kernel: [ 4908.458211] kvm [2808]: vcpu0
 unhandled wrmsr: 0x63 data 0
 Apr 16 11:28:23 Blade1-02 kernel: [ 4908.471014] kvm [2808]: vcpu1
 unhandled wrmsr: 0x40 data 0
 Apr 16 11:28:23 Blade1-02 kernel: [ 4908.471024] kvm [2808]: vcpu1
 unhandled wrmsr: 0x60 data 0
 
 When L1 VM starts and crashes, its kern.log will output:
 Apr 16 11:28:55 kvm1 kernel: [   33.590101] device tap0 entered promiscuous
 mode
 Apr 16 11:28:55 kvm1 kernel: [   33.590140] br0: port 2(tap0) entered
 forwarding state
 Apr 16 11:28:55 kvm1 kernel: [   33.590146] br0: port 2(tap0) entered
 forwarding state
 Apr 16 11:29:04 kvm1 kernel: [   42.592103] br0: port 2(tap0) entered
 forwarding state
 Apr 16 11:29:19 kvm1 kernel: [   57.752731] kvm [1673]: vcpu0 unhandled
 rdmsr: 0x345
 Apr 16 11:29:19 kvm1 kernel: [   57.797261] kvm [1673]: vcpu0 unhandled
 wrmsr: 0x40 data 0
 Apr 16 11:29:19 kvm1 kernel: [   57.797315] kvm [1673]: vcpu0 unhandled
 wrmsr: 0x60 data 0
 Apr 16 11:29:19 kvm1 kernel: [   57.797366] kvm [1673]: vcpu0 unhandled
 wrmsr: 0x41 data 0
 Apr 16 11:29:19 kvm1 kernel: [   57.797416] kvm [1673]: vcpu0 unhandled
 wrmsr: 0x61 data 0
 Apr 16 11:29:19 kvm1 kernel: [   57.797466] kvm [1673]: vcpu0 unhandled
 wrmsr: 0x42 data 0
 Apr 16 11:29:19 kvm1 kernel: [   57.797516] kvm [1673]: vcpu0 unhandled
 wrmsr: 0x62 data 0
 Apr 16 11:29:19 kvm1 kernel: [   57.797566] kvm [1673]: vcpu0 unhandled
 wrmsr: 0x43 data 0
 Apr 16 11:29:19 kvm1 kernel: [   57.797616] kvm [1673]: vcpu0 unhandled
 wrmsr: 0x63 data 0
 
 The host will output simultaneously:
 Apr 16 11:29:20 Blade1-02 kernel: [ 4966.314742] nested_vmx_run: VMCS
 MSR_{LOAD,STORE} unsupported

That's an important information. KVM is not yet implementing this
feature, but L1 is using it - doomed to fail. This feature gap of nested
VMX needs to be closed at some point.

 
 And the callback trace displayed on the console is the same as the previous
 mail.
 
 Besides, the L1 and L2 guest may sometimes crash and output nothing, while
 sometimes it will output as above.
 
 
 So this indicates that the msr controls may fail for core2duo CPU emulator.
 

Maybe varying the CPU type (try e.g. -cpu kvm64,+vmx) reduces the
likelihood of this scenario with KVM as guest.

 
 For Jan,
 I have traced the code of qemu and KVM and found the code relevant to
 the error "KVM: entry failed, hardware error 0x7". It is in the kernel,
 arch/x86/kvm/vmx.c, function vmx_handle_exit():
 
 if (exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY) {
 vcpu->run->exit_reason = KVM_EXIT_FAIL_ENTRY;
 vcpu->run->fail_entry.hardware_entry_failure_reason
 = exit_reason;
 return 0;
 }
 
 if (unlikely(vmx->fail)) {
 vcpu->run->exit_reason = KVM_EXIT_FAIL_ENTRY;
 vcpu->run->fail_entry.hardware_entry_failure_reason
 = vmcs_read32(VM_INSTRUCTION_ERROR);
 return 0;
 }
 
 The entry failed hardware error may be caused at either of these two
 points; both indicate a failed VMENTRY. Because the macro
 VMX_EXIT_REASONS_FAILED_VMENTRY is 0x80000000 and the reported error is
 0x7, this error comes from the second branch. I'm not very clear on what
 the result of vmcs_read32(VM_INSTRUCTION_ERROR) refers to.

Try to look this up in the Intel manual. It explains what instruction
error 7 means. You will also find it when tracing down the error message
of L0.

Jan






[PATCH] mm: mmu_notifier: re-fix freed page still mapped in secondary MMU

2013-04-16 Thread Xiao Guangrong
Commit 751efd8610d3 (mmu_notifier_unregister NULL Pointer deref
and multiple ->release()) breaks the fix in commit
3ad3d901bbcfb15a5e4690e55350db0899095a68
(mm: mmu_notifier: fix freed page still mapped in secondary MMU)

This patch reverts that commit and simply fixes the bug spotted
by it

This bug is spotted by commit 751efd8610d3:
==
There is a race condition between mmu_notifier_unregister() and
__mmu_notifier_release().

Assume two tasks, one calling mmu_notifier_unregister() as a result of a
filp_close() ->flush() callout (task A), and the other calling
mmu_notifier_release() from an mmput() (task B).

A   B
t1  srcu_read_lock()
t2  if (!hlist_unhashed())
t3  srcu_read_unlock()
t4  srcu_read_lock()
t5  hlist_del_init_rcu()
t6  synchronize_srcu()
t7  srcu_read_unlock()
t8  hlist_del_rcu()  --- NULL pointer deref.
==

This can be fixed by using hlist_del_init_rcu instead of hlist_del_rcu.

The other issue spotted in the commit is multiple ->release()
callouts; we need not care much about it because it is really rare
(e.g., it can not happen on kvm since mmu-notify is unregistered
after exit_mmap()) and a later duplicate ->release() call should be
fast since all the pages have already been released by the first call.

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 mm/mmu_notifier.c |   81 +++--
 1 files changed, 41 insertions(+), 40 deletions(-)

diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index be04122..606777a 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -40,48 +40,45 @@ void __mmu_notifier_release(struct mm_struct *mm)
int id;

/*
-* srcu_read_lock() here will block synchronize_srcu() in
-* mmu_notifier_unregister() until all registered
-* ->release() callouts this function makes have
-* returned.
+* SRCU here will block mmu_notifier_unregister until
+* ->release returns.
 	 */
	id = srcu_read_lock(&srcu);
+   hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist)
+   /*
+* if ->release runs before mmu_notifier_unregister it
+* must be handled as it's the only way for the driver
+* to flush all existing sptes and stop the driver
+* from establishing any more sptes before all the
+* pages in the mm are freed.
+*/
+   if (mn->ops->release)
+   mn->ops->release(mn, mm);
+   srcu_read_unlock(&srcu, id);
+
	spin_lock(&mm->mmu_notifier_mm->lock);
	while (unlikely(!hlist_empty(&mm->mmu_notifier_mm->list))) {
	mn = hlist_entry(mm->mmu_notifier_mm->list.first,
	 struct mmu_notifier,
	 hlist);
-
	/*
-* Unlink.  This will prevent mmu_notifier_unregister()
-* from also making the ->release() callout.
+* We arrived before mmu_notifier_unregister so
+* mmu_notifier_unregister will do nothing other than
+* to wait ->release to finish and
+* mmu_notifier_unregister to return.
 	 */
	hlist_del_init_rcu(&mn->hlist);
-   spin_unlock(&mm->mmu_notifier_mm->lock);
-
-   /*
-* Clear sptes. (see 'release' description in mmu_notifier.h)
-*/
-   if (mn->ops->release)
-   mn->ops->release(mn, mm);
-
-   spin_lock(&mm->mmu_notifier_mm->lock);
	}
	spin_unlock(&mm->mmu_notifier_mm->lock);

/*
-* All callouts to ->release() which we have done are complete.
-* Allow synchronize_srcu() in mmu_notifier_unregister() to complete
-*/
-   srcu_read_unlock(&srcu, id);
-
-   /*
-* mmu_notifier_unregister() may have unlinked a notifier and may
-* still be calling out to it.  Additionally, other notifiers
-* may have been active via vmtruncate() et. al. Block here
-* to ensure that all notifier callouts for this mm have been
-* completed and the sptes are really cleaned up before returning
-* to exit_mmap().
+* synchronize_srcu here prevents mmu_notifier_release to
+* return to exit_mmap (which would proceed freeing all pages
+* in the mm) until the ->release method returns, if it was
+* invoked by mmu_notifier_unregister.
+*
+* The mmu_notifier_mm can't go away from under us because one
+* mm_count is held by exit_mmap.
 */
synchronize_srcu(srcu);
 }
@@ -292,31 +289,35 @@ void 

Re: KVM: x86: fix maintenance of guest/host xcr0 state

2013-04-16 Thread Paolo Bonzini
On 16/04/2013 04:30, Marcelo Tosatti wrote:
 
 ** Untested **.
 
 Emulation of xcr0 writes zeroes the guest_xcr0_loaded variable so that
 a subsequent VM-entry reloads the CPU's xcr0 with the guest's xcr0 value.
 
 However, this is incorrect because the guest_xcr0_loaded variable is 
 read to decide whether to reload the host's xcr0.
 
 In case the vcpu thread is scheduled out after the guest_xcr0_loaded = 0
 assignment, and scheduler decides to preload FPU:
 
 switch_to
 {
   __switch_to
 __math_state_restore
   restore_fpu_checking
 fpu_restore_checking
   if (use_xsave())
   fpu_xrstor_checking
   xrstor64 with CPU's xcr0 == guests xcr0
 
 Fix by properly restoring the host's xcr0 during emulation of xcr0 writes.
 
 Analyzed-by: Ulrich Obergfell uober...@redhat.com
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
 
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index 999d124..222926a 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -555,6 +555,25 @@ void kvm_lmsw(struct kvm_vcpu *vcpu, unsigned long msw)
  }
  EXPORT_SYMBOL_GPL(kvm_lmsw);
  
 +static void kvm_load_guest_xcr0(struct kvm_vcpu *vcpu)
 +{
 + if (kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE) &&
 + !vcpu->guest_xcr0_loaded) {
 + /* kvm_set_xcr() also depends on this */
 + xsetbv(XCR_XFEATURE_ENABLED_MASK, vcpu->arch.xcr0);
 + vcpu->guest_xcr0_loaded = 1;
 + }
 +}
 +
 +static void kvm_put_guest_xcr0(struct kvm_vcpu *vcpu)
 +{
 + if (vcpu->guest_xcr0_loaded) {
 + if (vcpu->arch.xcr0 != host_xcr0)
 + xsetbv(XCR_XFEATURE_ENABLED_MASK, host_xcr0);
 + vcpu->guest_xcr0_loaded = 0;
 + }
 +}
 +
  int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
  {
   u64 xcr0;

This is just code movement...

 @@ -571,8 +590,8 @@ int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
    return 1;
    if (xcr0 & ~host_xcr0)
    return 1;
 + kvm_put_guest_xcr0(vcpu);
    vcpu->arch.xcr0 = xcr0;
 - vcpu->guest_xcr0_loaded = 0;
   return 0;
  }
  

... and this is the bulk of the fix: never set guest_xcr0_loaded, always
go through kvm_load/put_guest_xcr0.

Pending test,

Reviewed-by: Paolo Bonzini pbonz...@redhat.com

Paolo

 @@ -5600,25 +5619,6 @@ static void inject_pending_event(struct kvm_vcpu *vcpu)
   }
  }
  
 -static void kvm_load_guest_xcr0(struct kvm_vcpu *vcpu)
 -{
 - if (kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE) &&
 - !vcpu->guest_xcr0_loaded) {
 - /* kvm_set_xcr() also depends on this */
 - xsetbv(XCR_XFEATURE_ENABLED_MASK, vcpu->arch.xcr0);
 - vcpu->guest_xcr0_loaded = 1;
 - }
 -}
 -
 -static void kvm_put_guest_xcr0(struct kvm_vcpu *vcpu)
 -{
 - if (vcpu->guest_xcr0_loaded) {
 - if (vcpu->arch.xcr0 != host_xcr0)
 - xsetbv(XCR_XFEATURE_ENABLED_MASK, host_xcr0);
 - vcpu->guest_xcr0_loaded = 0;
 - }
 -}
 -
  static void process_nmi(struct kvm_vcpu *vcpu)
  {
   unsigned limit = 2;
 



Re: [User Question] Repeated severe performance problems on guest

2013-04-16 Thread Martin Wawro
On 04/16/2013 07:49 AM, Stefan Hajnoczi wrote:

Hi Stefan,

 Besides the kvm_stat, general performance data from the host is useful
 when dealing with high load averages.

 Do you have vmstat or sar data for periods of time when the machine was
 slow?

 Stefan

We do have a rather exhaustive log on the guest. As for the host, we did
not find anything suspicious except for the kvm_stat output, so we did
not log any more than that.

Here is the output of vmstat 5 5 on the guest:

procs ---memory-- ---swap-- -io -system--
cpu
 r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy
id wa
84  0  19596 104404 60 2193261600   232   11092  7 
2 90  1
80  0  19596  98100 60 2193392000   106   119  854  912 79
21  0  0
89  0  19596  94216 60 2193276400   106   223  864  886 79
21  0  0
87  0  19596  95848 60 21927612008247  856  906 79
21  0  0

Load average at that time: 75 (1:20 AM)

The guest seems to have a hard time scheduling tasks. The log output, which
is triggered by a simple cronjob that executes commands like vmstat or
atop, or simply appends some information from /proc and /sys to the log
(in a sequential manner), is a bit scrambled (i.e. the expected order
of the output is not kept, most likely because the cronjobs get piled up).
This can also be seen in the rather large numbers in the first column
(there is no workload scheduled for that time and virtually no one is
using the system; the whole thing just seems to happen out of thin air).

The huge number of runnable tasks fits the high load average, but we
could not see why the tasks are piling up. There was no apparent I/O
issue on disk or network and no messages from the kernel at that time.
Also, there was no swap activity on either guest or host (the host has
swap disabled).


For comparison, here is the output from 22:55 (10:55 PM), a couple of hours
before the output from above:

procs ---memory-- ---swap-- -io -system--
cpu
 r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy
id wa
 5  0 36  97848 60 2425965600   227   11171  7 
2 91  1
 6  0 36 121532 60 2422034000   106   327  974  539 31
22 46  0
 9  0 36 126108 60 2421275600 6 0 1019  517 19 
7 73  0
 2  0 36 125264 60 2421278000 026 1007  481 25 
6 69  0
 6  0 36 147564 60 2421280800 4   120 1345 1161 25
10 65  0


If you need any more info, just let me know.

Best regards,

Martin


Re: [RFC] provide an API to userspace doing memory snapshot

2013-04-16 Thread Wenchao Xia

On 2013-04-16 13:51, Stefan Hajnoczi wrote:

On Mon, Apr 15, 2013 at 09:03:36PM +0800, Wenchao Xia wrote:

   I'd like to add/export a function that allows a userspace program
to take a snapshot of a region of memory. Since it is not implemented yet
I will describe it as C APIs; it is quite simple now, and if it is worthwhile
I'll improve the interface later:


We talked about a simple approach using fork(2) on IRC yesterday.

Is this email outdated?

Stefan


  No, after the discussion on IRC, I agree that fork() is a simpler
method to do it, and it can come to qemu fast, since users want it.
  On further consideration, I still think a KVM memory snapshot would
be a long-term solution for it:
  The source of the problem is the acceleration module, kvm.ko; when
qemu does not use it, there are no troubles. This means the acceleration
module misses a function that callers require. My instinct is that when an
acceleration module replaces a pure software one, it should try to provide
all the parts, or at least not stop software from filling the gap, and
doing so brings benefits, so I hope to add it.
  My API description is old; the core is COW pages. I may redesign it if
that seems reasonable.


[PATCH v5 0/2] tcm_vhost flush

2013-04-16 Thread Asias He
Asias He (2):
  tcm_vhost: Pass vhost_scsi to vhost_scsi_allocate_cmd
  tcm_vhost: Wait for pending requests in vhost_scsi_flush()

 drivers/vhost/tcm_vhost.c | 106 +++---
 drivers/vhost/tcm_vhost.h |   5 +++
 2 files changed, 104 insertions(+), 7 deletions(-)

-- 
1.8.1.4



[PATCH v5 1/2] tcm_vhost: Pass vhost_scsi to vhost_scsi_allocate_cmd

2013-04-16 Thread Asias He
It is needed in the next patch.

Signed-off-by: Asias He as...@redhat.com
---
 drivers/vhost/tcm_vhost.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/tcm_vhost.c b/drivers/vhost/tcm_vhost.c
index da2021b..4ae6725 100644
--- a/drivers/vhost/tcm_vhost.c
+++ b/drivers/vhost/tcm_vhost.c
@@ -569,6 +569,7 @@ static void vhost_scsi_complete_cmd_work(struct vhost_work *work)
 }
 
 static struct tcm_vhost_cmd *vhost_scsi_allocate_cmd(
+   struct vhost_scsi *vs,
struct tcm_vhost_tpg *tv_tpg,
struct virtio_scsi_cmd_req *v_req,
u32 exp_data_len,
@@ -593,6 +594,7 @@ static struct tcm_vhost_cmd *vhost_scsi_allocate_cmd(
	tv_cmd->tvc_exp_data_len = exp_data_len;
	tv_cmd->tvc_data_direction = data_direction;
	tv_cmd->tvc_nexus = tv_nexus;
+	tv_cmd->tvc_vhost = vs;
 
return tv_cmd;
 }
@@ -848,7 +850,7 @@ static void vhost_scsi_handle_vq(struct vhost_scsi *vs,
	for (i = 0; i < data_num; i++)
		exp_data_len += vq->iov[data_first + i].iov_len;
 
-   tv_cmd = vhost_scsi_allocate_cmd(tv_tpg, v_req,
+   tv_cmd = vhost_scsi_allocate_cmd(vs, tv_tpg, v_req,
exp_data_len, data_direction);
if (IS_ERR(tv_cmd)) {
		vq_err(vq, "vhost_scsi_allocate_cmd failed %ld\n",
@@ -858,7 +860,6 @@ static void vhost_scsi_handle_vq(struct vhost_scsi *vs,
	pr_debug("Allocated tv_cmd: %p exp_data_len: %d, data_direction"
		": %d\n", tv_cmd, exp_data_len, data_direction);
 
-	tv_cmd->tvc_vhost = vs;
	tv_cmd->tvc_vq = vq;
	tv_cmd->tvc_resp = vq->iov[out].iov_base;
 
-- 
1.8.1.4



[PATCH v5 2/2] tcm_vhost: Wait for pending requests in vhost_scsi_flush()

2013-04-16 Thread Asias He
This patch makes vhost_scsi_flush() wait for all the pending requests
issued before the flush operation to be finished.

Changes in v5:
- Use kref and completion
- Fail req if vs->vs_inflight is NULL
- Rename tcm_vhost_alloc_inflight to tcm_vhost_set_inflight

Changes in v4:
- Introduce vhost_scsi_inflight
- Drop array to track flush
- Use RCU to protect vs_inflight explicitly

Changes in v3:
- Rebase
- Drop 'tcm_vhost: Wait for pending requests in
  vhost_scsi_clear_endpoint()' in this series, we already did that in
  'tcm_vhost: Use vq->private_data to indicate if the endpoint is setup'

Changes in v2:
- Increase/Decrease inflight requests in
  vhost_scsi_{allocate,free}_cmd and tcm_vhost_{allocate,free}_evt

Signed-off-by: Asias He as...@redhat.com
---
 drivers/vhost/tcm_vhost.c | 101 +++---
 drivers/vhost/tcm_vhost.h |   5 +++
 2 files changed, 101 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/tcm_vhost.c b/drivers/vhost/tcm_vhost.c
index 4ae6725..ef40a8f 100644
--- a/drivers/vhost/tcm_vhost.c
+++ b/drivers/vhost/tcm_vhost.c
@@ -74,6 +74,11 @@ enum {
 #define VHOST_SCSI_MAX_VQ  128
 #define VHOST_SCSI_MAX_EVENT   128
 
+struct vhost_scsi_inflight {
+   struct completion comp; /* Wait for the flush operation to finish */
+   struct kref kref; /* Refcount for the inflight reqs */
+};
+
 struct vhost_scsi {
	/* Protected by vhost_scsi->dev.mutex */
struct tcm_vhost_tpg **vs_tpg;
@@ -91,6 +96,8 @@ struct vhost_scsi {
struct mutex vs_events_lock; /* protect vs_events_dropped,events_nr */
bool vs_events_dropped; /* any missed events */
int vs_events_nr; /* num of pending events */
+
+   struct vhost_scsi_inflight __rcu *vs_inflight; /* track inflight reqs */
 };
 
 /* Local pointer to allocated TCM configfs fabric module */
@@ -108,6 +115,51 @@ static int iov_num_pages(struct iovec *iov)
	       ((unsigned long)iov->iov_base & PAGE_MASK)) >> PAGE_SHIFT;
 }
 
+static int tcm_vhost_set_inflight(struct vhost_scsi *vs)
+{
+   struct vhost_scsi_inflight *inflight;
+   int ret = -ENOMEM;
+
+   inflight = kzalloc(sizeof(*inflight), GFP_KERNEL);
+   if (inflight) {
+		kref_init(&inflight->kref);
+		init_completion(&inflight->comp);
+   ret = 0;
+   }
+	rcu_assign_pointer(vs->vs_inflight, inflight);
+   synchronize_rcu();
+
+   return ret;
+}
+
+static struct vhost_scsi_inflight *
+tcm_vhost_inc_inflight(struct vhost_scsi *vs)
+{
+   struct vhost_scsi_inflight *inflight;
+
+   rcu_read_lock();
+	inflight = rcu_dereference(vs->vs_inflight);
+	if (inflight)
+		kref_get(&inflight->kref);
+   rcu_read_unlock();
+
+   return inflight;
+}
+
+void tcm_vhost_done_inflight(struct kref *kref)
+{
+   struct vhost_scsi_inflight *inflight;
+
+   inflight = container_of(kref, struct vhost_scsi_inflight, kref);
+	complete(&inflight->comp);
+}
+
+static void tcm_vhost_dec_inflight(struct vhost_scsi_inflight *inflight)
+{
+   if (inflight)
+		kref_put(&inflight->kref, tcm_vhost_done_inflight);
+}
+
 static bool tcm_vhost_check_feature(struct vhost_scsi *vs, int feature)
 {
bool ret = false;
@@ -402,6 +454,7 @@ static int tcm_vhost_queue_tm_rsp(struct se_cmd *se_cmd)
 static void tcm_vhost_free_evt(struct vhost_scsi *vs, struct tcm_vhost_evt *evt)
 {
	mutex_lock(&vs->vs_events_lock);
+	tcm_vhost_dec_inflight(evt->inflight);
	vs->vs_events_nr--;
	kfree(evt);
	mutex_unlock(&vs->vs_events_lock);
@@ -413,21 +466,27 @@ static struct tcm_vhost_evt *tcm_vhost_allocate_evt(struct vhost_scsi *vs,
struct tcm_vhost_evt *evt;
 
	mutex_lock(&vs->vs_events_lock);
-	if (vs->vs_events_nr > VHOST_SCSI_MAX_EVENT) {
-		vs->vs_events_dropped = true;
-		mutex_unlock(&vs->vs_events_lock);
-		return NULL;
-	}
+	if (vs->vs_events_nr > VHOST_SCSI_MAX_EVENT)
+		goto out;
 
	evt = kzalloc(sizeof(*evt), GFP_KERNEL);
	if (evt) {
		evt->event.event = event;
		evt->event.reason = reason;
+		evt->inflight = tcm_vhost_inc_inflight(vs);
+		if (!evt->inflight) {
+			kfree(evt);
+			goto out;
+		}
		vs->vs_events_nr++;
	}
	mutex_unlock(&vs->vs_events_lock);
 
	return evt;
+out:
+	vs->vs_events_dropped = true;
+	mutex_unlock(&vs->vs_events_lock);
+	return NULL;
 }
 
 static void vhost_scsi_free_cmd(struct tcm_vhost_cmd *tv_cmd)
@@ -445,6 +504,8 @@ static void vhost_scsi_free_cmd(struct tcm_vhost_cmd *tv_cmd)
		kfree(tv_cmd->tvc_sgl);
}
 
+	tcm_vhost_dec_inflight(tv_cmd->inflight);
+
kfree(tv_cmd);
 }
 
@@ -595,6 +656,9 @@ static struct tcm_vhost_cmd *vhost_scsi_allocate_cmd(
	tv_cmd->tvc_data_direction = data_direction;
   

Re: [PATCH] mm: mmu_notifier: re-fix freed page still mapped in secondary MMU

2013-04-16 Thread Robin Holt
On Tue, Apr 16, 2013 at 02:39:49PM +0800, Xiao Guangrong wrote:
 The commit 751efd8610d3 (mmu_notifier_unregister NULL Pointer deref
 and multiple ->release()) breaks the fix:
 3ad3d901bbcfb15a5e4690e55350db0899095a68
 (mm: mmu_notifier: fix freed page still mapped in secondary MMU)

Can you describe how the page is still mapped?  I thought I had all
cases covered.  Whichever call hits first, I thought we had one callout
to the registered notifiers.  Are you saying we need multiple callouts?

Also, shouldn't you be asking for a revert commit and then supply a
subsequent commit for the real fix?  I thought that was the process for
doing a revert.

Thanks,
Robin Holt

 
 This patch reverts the commit and simply fix the bug spotted
 by that patch
 
 This bug is spotted by commit 751efd8610d3:
 ==
 There is a race condition between mmu_notifier_unregister() and
 __mmu_notifier_release().
 
 Assume two tasks, one calling mmu_notifier_unregister() as a result of a
filp_close() ->flush() callout (task A), and the other calling
 mmu_notifier_release() from an mmput() (task B).
 
 A   B
 t1  srcu_read_lock()
 t2  if (!hlist_unhashed())
 t3  srcu_read_unlock()
 t4  srcu_read_lock()
 t5  hlist_del_init_rcu()
 t6  synchronize_srcu()
 t7  srcu_read_unlock()
 t8  hlist_del_rcu()  <--- NULL pointer deref.
 ==
 
 This can be fixed by using hlist_del_init_rcu instead of hlist_del_rcu.
 
 The other issue spotted in the commit is
 multiple ->release() callouts; we needn't worry about it too much because
 it is really rare (e.g., it can not happen on kvm since the mmu notifier is
 unregistered after exit_mmap()) and the later call of multiple ->release should be
 fast since all the pages have already been released by the first call.
 
 Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
 ---
  mm/mmu_notifier.c |   81 
 +++--
  1 files changed, 41 insertions(+), 40 deletions(-)
 
 diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
 index be04122..606777a 100644
 --- a/mm/mmu_notifier.c
 +++ b/mm/mmu_notifier.c
 @@ -40,48 +40,45 @@ void __mmu_notifier_release(struct mm_struct *mm)
   int id;
 
   /*
 -  * srcu_read_lock() here will block synchronize_srcu() in
 -  * mmu_notifier_unregister() until all registered
 -  * ->release() callouts this function makes have
 -  * returned.
 +  * SRCU here will block mmu_notifier_unregister until
 +  * ->release returns.
 	 */
 	id = srcu_read_lock(&srcu);
 +	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist)
 +		/*
 +		 * if ->release runs before mmu_notifier_unregister it
 +		 * must be handled as it's the only way for the driver
 +		 * to flush all existing sptes and stop the driver
 +		 * from establishing any more sptes before all the
 +		 * pages in the mm are freed.
 +		 */
 +		if (mn->ops->release)
 +			mn->ops->release(mn, mm);
 +	srcu_read_unlock(&srcu, id);
 +
 	spin_lock(&mm->mmu_notifier_mm->lock);
 	while (unlikely(!hlist_empty(&mm->mmu_notifier_mm->list))) {
 		mn = hlist_entry(mm->mmu_notifier_mm->list.first,
				 struct mmu_notifier,
				 hlist);
 -
 		/*
 -		 * Unlink.  This will prevent mmu_notifier_unregister()
 -		 * from also making the ->release() callout.
 +		 * We arrived before mmu_notifier_unregister so
 +		 * mmu_notifier_unregister will do nothing other than
 +		 * to wait ->release to finish and
 +		 * mmu_notifier_unregister to return.
 		 */
 		hlist_del_init_rcu(&mn->hlist);
 -		spin_unlock(&mm->mmu_notifier_mm->lock);
 -
 -		/*
 -		 * Clear sptes. (see 'release' description in mmu_notifier.h)
 -		 */
 -		if (mn->ops->release)
 -			mn->ops->release(mn, mm);
 -
 -		spin_lock(&mm->mmu_notifier_mm->lock);
 	}
 	spin_unlock(&mm->mmu_notifier_mm->lock);
 
 	/*
 -	 * All callouts to ->release() which we have done are complete.
 -	 * Allow synchronize_srcu() in mmu_notifier_unregister() to complete
 -	 */
 -	srcu_read_unlock(&srcu, id);
 -
 - /*
 -  * mmu_notifier_unregister() may have unlinked a notifier and may
 -  * still be calling out to it.  Additionally, other notifiers
 -  * may have been active via vmtruncate() et. al. Block here
 -  * to ensure that all notifier callouts for this mm have been
 -  * completed and the sptes are really cleaned up before returning
 -  * to exit_mmap().
 +  

Re: [qemu-devel] Bug Report: VM crashed for some kinds of vCPU in nested virtualization

2013-04-16 Thread Jan Kiszka
On 2013-04-16 12:19, 李春奇 Arthur Chunqi Li wrote:
 I looked up the Intel manual for the VM instruction error. Error number 7 means VM
 entry with invalid control field(s), which means that in the process of VM
 switching some control fields are not properly configured.
 
 I wonder why some emulated CPUs (e.g. Nehalem) can run properly without
 nested VMCS MSR support?

MSRs are only switched between host (L0) and guest (L1/L2) if their
values differ. That saves some cycles. Therefore, if either the guest is
not using a specific MSR (due to differences in the virtual CPU feature
set) or it is using it in the same way like the host, there is no
switching, thus no risk to hit this unimplemented feature.

Jan






Re: [PATCH] mm: mmu_notifier: re-fix freed page still mapped in secondary MMU

2013-04-16 Thread Xiao Guangrong
On 04/16/2013 05:31 PM, Robin Holt wrote:
 On Tue, Apr 16, 2013 at 02:39:49PM +0800, Xiao Guangrong wrote:
 The commit 751efd8610d3 (mmu_notifier_unregister NULL Pointer deref
  and multiple ->release()) breaks the fix:
 3ad3d901bbcfb15a5e4690e55350db0899095a68
 (mm: mmu_notifier: fix freed page still mapped in secondary MMU)
 
 Can you describe how the page is still mapped?  I thought I had all
 cases covered.  Whichever call hits first, I thought we had one callout
 to the registered notifiers.  Are you saying we need multiple callouts?

No.

Your patch did this:

hlist_del_init_rcu(&mn->hlist);  <== 1
+   spin_unlock(&mm->mmu_notifier_mm->lock);
+
+   /*
+    * Clear sptes. (see 'release' description in mmu_notifier.h)
+    */
+   if (mn->ops->release)
+   mn->ops->release(mn, mm);  <== 2
+
+   spin_lock(&mm->mmu_notifier_mm->lock);

At point 1, you delete the notifier, but the page is still on the LRU. Another
cpu can reclaim the page but without calling ->invalidate_page().

At point 2, you call ->release(); the secondary MMU makes the page Accessed/Dirty,
but that page is already on the free-list of the page allocator.

 
 Also, shouldn't you be asking for a revert commit and then supply a
 subsequent commit for the real fix?  I thought that was the process for
 doing a revert.

We can not do a pure reversion since your patch moved hlist_for_each_entry_rcu,
which has since been modified.

Should I do the pure reversion + hlist_for_each_entry_rcu update first?



Re: [PATCH] mm: mmu_notifier: re-fix freed page still mapped in secondary MMU

2013-04-16 Thread Robin Holt
On Tue, Apr 16, 2013 at 06:26:36PM +0800, Xiao Guangrong wrote:
 On 04/16/2013 05:31 PM, Robin Holt wrote:
  On Tue, Apr 16, 2013 at 02:39:49PM +0800, Xiao Guangrong wrote:
  The commit 751efd8610d3 (mmu_notifier_unregister NULL Pointer deref
  and multiple ->release()) breaks the fix:
  3ad3d901bbcfb15a5e4690e55350db0899095a68
  (mm: mmu_notifier: fix freed page still mapped in secondary MMU)
  
  Can you describe how the page is still mapped?  I thought I had all
  cases covered.  Whichever call hits first, I thought we had one callout
  to the registered notifiers.  Are you saying we need multiple callouts?
 
 No.
 
 Your patch did this:
 
 hlist_del_init_rcu(&mn->hlist);  <== 1
 +   spin_unlock(&mm->mmu_notifier_mm->lock);
 +
 +   /*
 +    * Clear sptes. (see 'release' description in mmu_notifier.h)
 +    */
 +   if (mn->ops->release)
 +   mn->ops->release(mn, mm);  <== 2
 +
 +   spin_lock(&mm->mmu_notifier_mm->lock);
 
 At point 1, you delete the notifier, but the page is still on the LRU. Another
 cpu can reclaim the page but without calling ->invalidate_page().
 
 At point 2, you call ->release(); the secondary MMU makes the page Accessed/Dirty,
 but that page is already on the free-list of the page allocator.

That expectation on srcu _REALLY_ needs to be documented better.
Maybe I missed it in the comments, but there is an expectation beyond
the synchronize_srcu().  This code has been extremely poorly described
and I think it is time we fix that up.

I do see that in comments for mmu_notifier_unregister, there is an
expectation upon already having all the spte's removed prior to making
this call.  I think that is also a stale comment as it mentions a lock
which I am not sure ever really existed.

  Also, shouldn't you be asking for a revert commit and then supply a
  subsequent commit for the real fix?  I thought that was the process for
  doing a revert.
 
 We can not do a pure reversion since your patch moved hlist_for_each_entry_rcu,
 which has since been modified.
 
 Should I do the pure reversion + hlist_for_each_entry_rcu update first?

Let's not go off without considering this first.

It looks like what we really need to do is ensure there is a method
for ensuring that the mmu_notifier remains on the list while
invalidate_page() callouts are being made, and also a means of ensuring
that only one ->release() callout is made.

First, is it the case that when kvm calls mmu_notifier_unregister(),
it has already cleared the spte's?  (what does spte stand for anyway)?
If so, then we really need to close the hole in __mmu_notifier_release().
I think we would need to modify code in both _unregister and _release,
but the issue is really _release.


I need to get ready and drive into work.  If you want to float something
out there, that is fine.  Otherwise, I will try to work something up
when I get to the office.

Thanks,
Robin


Re: [PATCH -v2] kvm: Emulate MOVBE

2013-04-16 Thread Paolo Bonzini
On 14/04/2013 23:02, Borislav Petkov wrote:
   *(u16 *)&ctxt->dst.val = swab16((u16)ctxt->src.val);
 
   movzwl  112(%rdi), %eax # ctxt_5(D)->src.D.27823.val, tmp82
   rolw    $8, %ax #, tmp82
   movw    %ax, 240(%rdi)  # tmp82, MEM[(u16 *)ctxt_5(D) + 240B]

I think this breaks the C aliasing rules.

Using valptr is okay because it is a char.

Paolo


Re: [PATCH] mm: mmu_notifier: re-fix freed page still mapped in secondary MMU

2013-04-16 Thread Robin Holt
Argh.  Taking a step back helped clear my head.

For the -stable releases, I agree we should just go with your
revert-plus-hlist_del_init_rcu patch.  I will give it a test
when I am in the office.

For the v3.10 release, we should work on making this more
correct and completely documented.

Robin

On Tue, Apr 16, 2013 at 06:25:53AM -0500, Robin Holt wrote:
 On Tue, Apr 16, 2013 at 06:26:36PM +0800, Xiao Guangrong wrote:
  On 04/16/2013 05:31 PM, Robin Holt wrote:
   On Tue, Apr 16, 2013 at 02:39:49PM +0800, Xiao Guangrong wrote:
   The commit 751efd8610d3 (mmu_notifier_unregister NULL Pointer deref
   and multiple ->release()) breaks the fix:
   3ad3d901bbcfb15a5e4690e55350db0899095a68
   (mm: mmu_notifier: fix freed page still mapped in secondary MMU)
   
   Can you describe how the page is still mapped?  I thought I had all
   cases covered.  Whichever call hits first, I thought we had one callout
   to the registered notifiers.  Are you saying we need multiple callouts?
  
  No.
  
  Your patch did this:
  
  hlist_del_init_rcu(&mn->hlist);  <== 1
  +   spin_unlock(&mm->mmu_notifier_mm->lock);
  +
  +   /*
  +    * Clear sptes. (see 'release' description in mmu_notifier.h)
  +    */
  +   if (mn->ops->release)
  +   mn->ops->release(mn, mm);  <== 2
  +
  +   spin_lock(&mm->mmu_notifier_mm->lock);
  
  At point 1, you delete the notifier, but the page is still on the LRU. Another
  cpu can reclaim the page but without calling ->invalidate_page().
  
  At point 2, you call ->release(); the secondary MMU makes the page Accessed/Dirty,
  but that page is already on the free-list of the page allocator.
 
 That expectation on srcu _REALLY_ needs to be documented better.
 Maybe I missed it in the comments, but there is an expectation beyond
 the synchronize_srcu().  This code has been extremely poorly described
 and I think it is time we fix that up.
 
 I do see that in comments for mmu_notifier_unregister, there is an
 expectation upon already having all the spte's removed prior to making
 this call.  I think that is also a stale comment as it mentions a lock
 which I am not sure ever really existed.
 
   Also, shouldn't you be asking for a revert commit and then supply a
   subsequent commit for the real fix?  I thought that was the process for
   doing a revert.
  
  We can not do a pure reversion since your patch moved hlist_for_each_entry_rcu,
  which has since been modified.
  
  Should I do the pure reversion + hlist_for_each_entry_rcu update first?
 
 Let's not go off without considering this first.
 
 It looks like what we really need to do is ensure there is a method
 for ensuring that the mmu_notifier remains on the list while
 invalidate_page() callouts are being made, and also a means of ensuring
 that only one ->release() callout is made.
 
 First, is it the case that when kvm calls mmu_notifier_unregister(),
 it has already cleared the spte's?  (what does spte stand for anyway)?
 If so, then we really need to close the hole in __mmu_notifier_release().
 I think we would need to modify code in both _unregister and _release,
 but the issue is really _release.
 
 
 I need to get ready and drive into work.  If you want to float something
 out there, that is fine.  Otherwise, I will try to work something up
 when I get to the office.
 
 Thanks,
 Robin


Re: [RFC PATCH] Emulate MOVBE

2013-04-16 Thread Paolo Bonzini
On 10/04/2013 12:08, Gleb Natapov wrote:
  What is the opinion from the KVM folks on this? Shall we start to
  emulate instructions the host does not provide? In this particular case
  a relatively simple patch fixes a problem (starting Atom optimized
  kernels on non-Atom machines).
 We can add the emulation, but we should not start announcing the instruction
 availability to a guest if host cpu does not have it by default. This
 may trick a guest into thinking that movbe is the fastest way to do
 something when it is not.
 

This does highlight a weakness in CPU_GET_SUPPORTED_CPUID, but I think
this is not a problem in practice.

With a management layer such as oVirt it's not a problem.  For example,
oVirt has its own library of processors.  It doesn't care if KVM enables
movbe.  If you tell it your datacenter is a mix of Haswells and Sandy
Bridges it will pick the CPUID bits that are common to all.

However, even without a suitable management layer it is also not really
a problem.

The only processors that support MOVBE are Atom and Haswell.  Haswell
adds a whole lot of extra CPUID features, hence -cpu Haswell,enforce
will fail with or without movbe emulation.  -cpu Haswell will disable
all new Haswell features, and movbe will remain slow; that's fine, I
think; anyway it's not what you'd do except to play with CPU models.

Atom is not defined by QEMU; even if it was, it is unlikely to be
specified for a KVM guest since Atom doesn't support hardware
virtualization itself.

The next AMD processor that has MOVBE will probably have at least
another feature that is not in Opteron_G5, thus it will be the same as
Haswell.

Paolo


Re: [Virt-test-devel] [virt-test][PATCH 4/7] virt: Adds named variants to Cartesian config.

2013-04-16 Thread Jiri Zupka
Hi Alex,
  I hope you are using the new version of the Cartesian config from github:
https://github.com/autotest/virt-test/pull/255.
That was an older RFC version of the cart config, and I'm preparing a new version
based on communication with Eduardo and Pablo.

If you aren't, please look at the documentation:
https://github.com/autotest/virt-test/wiki/VirtTestDocumentation#wiki-id26
It describes how the config works now.

regards
  Jiří Župka

- Original Message -
 On 03/30/2013 01:14 AM, Jiří Župka wrote:
  variants name=tests:
 - wait:
  run = wait
  variants:
- long:
   time = short_time
- short: long
   time = logn_time
 - test2:
  run = test1
 
  variants name=virt_system:
 - linux:
 - windows_XP:
 
  variants name=host_os:
 - linux:
  image = linux
 - windows_XP:
  image = windows
 
  testswait.short:
   shutdown = destroy
 
  only host_oslinux
 Jiří, I pasted the above example into demo.cfg and ran it via the
 cartesian parser, then I got the error __main__.ParserError: 'variants'
 is not allowed inside a conditional block
 (libvirt/tests/cfg/demo.cfg:4); anything wrong on my side? thanks.
 
 


Re: [RFC PATCH] Emulate MOVBE

2013-04-16 Thread Borislav Petkov
On Tue, Apr 16, 2013 at 01:47:46PM +0200, Paolo Bonzini wrote:
 Atom is not defined by QEMU;

$ qemu-system-x86_64 -cpu ?

...

x86 n270  Intel(R) Atom(TM) CPU N270   @ 1.60GHz

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.


Re: [RFC PATCH] Emulate MOVBE

2013-04-16 Thread H. Peter Anvin
Note that Atom isn't a CPU but a line of CPUs. Sadly Qemu's N270 model is 
broken.

Borislav Petkov b...@alien8.de wrote:

On Tue, Apr 16, 2013 at 01:47:46PM +0200, Paolo Bonzini wrote:
 Atom is not defined by QEMU;

$ qemu-system-x86_64 -cpu ?

...

x86 n270  Intel(R) Atom(TM) CPU N270   @ 1.60GHz

-- 
Sent from my mobile phone. Please excuse brevity and lack of formatting.


Re: [PATCH] mm: mmu_notifier: re-fix freed page still mapped in secondary MMU

2013-04-16 Thread Xiao Guangrong
On 04/16/2013 07:43 PM, Robin Holt wrote:
 Argh.  Taking a step back helped clear my head.
 
 For the -stable releases, I agree we should just go with your
 revert-plus-hlist_del_init_rcu patch.  I will give it a test
 when I am in the office.

Okay, I'll wait for your test report. Thank you in advance.

 
 For the v3.10 release, we should work on making this more
 correct and completely documented.

Better documentation is always welcome.

Double-calling ->release is not that bad; as I mentioned in the changelog:

it is really rare (e.g., it can not happen on kvm since the mmu notifier is
unregistered after exit_mmap()) and the later call of multiple ->release should
be fast since all the pages have already been released by the first call.

But, of course, it's great if you have a _light_ way to avoid this.



Re: KVM call agenda for 2013-04-16

2013-04-16 Thread Juan Quintela
Juan Quintela quint...@redhat.com wrote:
 Hi

 Please send in any agenda topics you are interested in.

As there are no topics,  call is cancelled.

Have a nice week.

Later,  Juan.


Re: [Virt-test-devel] [virt-test][PATCH 4/7] virt: Adds named variants to Cartesian config.

2013-04-16 Thread Jiri Zupka
Hi Alex,
  thanks again for the review. I see now what you meant; I thought you were
referring to another thread of mails. I tried it again with
https://github.com/autotest/virt-test/pull/255 and the demo example works.

If you are interested in this feature, check the new version which I'll send in
the coming days. There will be some changes in syntax, and a lexer will be added
for better filtering.

regards,
  Jiří Župka



- Original Message -
 Hi Alex,
   I hope you use new version of cart config in github
   https://github.com/autotest/virt-test/pull/255.
 This was older RFC version of vart config. And I'm preparing new version
 based on communication with Eduardo and Pablo.
 
 If you don't please loot at documentation
 https://github.com/autotest/virt-test/wiki/VirtTestDocumentation#wiki-id26
 This documentation says how it works now.
 
 regards
   Jiří Župka
 
 - Original Message -
  On 03/30/2013 01:14 AM, Jiří Župka wrote:
   variants name=tests:
  - wait:
   run = wait
   variants:
 - long:
time = short_time
 - short: long
time = logn_time
  - test2:
   run = test1
  
   variants name=virt_system:
  - linux:
  - windows_XP:
  
   variants name=host_os:
  - linux:
   image = linux
  - windows_XP:
   image = windows
  
   testswait.short:
shutdown = destroy
  
   only host_oslinux
  Jiří, I pasted the above example into demo.cfg and ran it via the
  cartesian parser, then I got the error __main__.ParserError: 'variants'
  is not allowed inside a conditional block
  (libvirt/tests/cfg/demo.cfg:4). Is anything wrong with my usage? thanks.
  
  
 


Re: [PATCH] kvm/ppc: don't call complete_mmio_load when it's a store

2013-04-16 Thread Alexander Graf

On 16.04.2013, at 03:07, Scott Wood wrote:

 complete_mmio_load writes back the mmio result into the
 destination register.  Doing this on a store results in
 register corruption.
 
 Signed-off-by: Scott Wood scottw...@freescale.com

Thanks, applied to kvm-ppc-queue. Since nobody is really using in-kernel
devices yet, I don't think we need to rush this into the next stable kernel
release.


Alex

 ---
 arch/powerpc/kvm/powerpc.c |1 -
 1 file changed, 1 deletion(-)
 
 diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
 index 16b4595..a822659 100644
 --- a/arch/powerpc/kvm/powerpc.c
 +++ b/arch/powerpc/kvm/powerpc.c
 @@ -683,7 +683,6 @@ int kvmppc_handle_store(struct kvm_run *run, struct 
 kvm_vcpu *vcpu,
 
   if (!kvm_io_bus_write(vcpu->kvm, KVM_MMIO_BUS, run->mmio.phys_addr,
 bytes, &run->mmio.data)) {
 - kvmppc_complete_mmio_load(vcpu, run);
   vcpu->mmio_needed = 0;
   return EMULATE_DONE;
   }
 -- 
 1.7.10.4
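The failure mode described in the commit message can be illustrated with a small standalone model (hypothetical structure and function names, not the actual kernel code): on a store there is no destination register, so running the load-completion write-back clobbers whatever register the last load targeted.

```c
#include <assert.h>

/* toy model of the emulator state: one MMIO data buffer shared by
 * loads and stores, plus the register index the last *load* targeted */
struct vcpu {
	unsigned long gpr[32];
	int mmio_dest;           /* destination register of the pending load */
	unsigned long mmio_data; /* MMIO data buffer */
};

/* write-back step that is only valid for loads */
static void complete_mmio_load(struct vcpu *v)
{
	v->gpr[v->mmio_dest] = v->mmio_data;
}

/* buggy store path: calls the load completion and clobbers a register */
static void handle_store_buggy(struct vcpu *v, unsigned long val)
{
	v->mmio_data = val;
	complete_mmio_load(v);   /* the call the patch removes */
}

/* fixed store path: no register write-back */
static void handle_store_fixed(struct vcpu *v, unsigned long val)
{
	v->mmio_data = val;
}

/* returns 1 if the store corrupted a live guest register */
int store_corrupts(int buggy)
{
	struct vcpu v = { .mmio_dest = 3 };
	v.gpr[3] = 0x1234;       /* live guest value in the register */
	if (buggy)
		handle_store_buggy(&v, 0xdead);
	else
		handle_store_fixed(&v, 0xdead);
	return v.gpr[3] != 0x1234;
}
```

With the write-back removed, a store leaves every general-purpose register untouched, which is exactly what the one-line deletion above achieves.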
 
 



Re: [PATCH kvm-unittest] x86/svm: run cr3 read intercept emulate only on SMP

2013-04-16 Thread Paolo Bonzini
On 16/04/2013 07:08, prasadjoshi.li...@gmail.com wrote:
 From: Prasad Joshi prasadjoshi.li...@gmail.com
 
 The SVM test 'cr3 read intercept emulate', when run on a uniprocessor
 system, does not finish and blocks all the tests scheduled to run
 afterwards. Add a check so that the test is only run on an SMP VM.

test->scratch = 1;
while (test->scratch != 2)
barrier();

Indeed.

Reviewed-by: Paolo Bonzini pbonz...@redhat.com
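The quoted busy-wait hangs on a uniprocessor guest because no second CPU ever advances scratch to 2. A minimal user-space model of the same handshake, with pthreads standing in for vCPUs (names are illustrative, not kvm-unit-tests API):

```c
#include <pthread.h>
#include <stdatomic.h>

static atomic_int scratch;

/* the "second CPU": waits for the signal, then acknowledges it */
static void *other_cpu(void *arg)
{
	(void)arg;
	while (atomic_load(&scratch) != 1)
		;                        /* spin, like barrier() in the test */
	atomic_store(&scratch, 2);
	return NULL;
}

/* returns the final scratch value; without the second thread the
 * while loop below would spin forever -- hence the SMP-only guard */
int run_handshake(void)
{
	pthread_t t;

	pthread_create(&t, NULL, other_cpu, NULL);
	atomic_store(&scratch, 1);       /* test->scratch = 1;            */
	while (atomic_load(&scratch) != 2)
		;                        /* while (test->scratch != 2) .. */
	pthread_join(t, NULL);
	return atomic_load(&scratch);
}
```

Skipping the test on a single-CPU VM is therefore the right fix: the handshake simply has no partner to complete it.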


Re: [PATCH] x86: Add a Kconfig shortcut for a kvm-bootable kernel

2013-04-16 Thread Borislav Petkov
On Sun, Apr 14, 2013 at 01:03:20PM +0200, Borislav Petkov wrote:
 On Sun, Apr 14, 2013 at 12:31:12PM +0300, Pekka Enberg wrote:
  I obviously support having something like this in mainline. I wonder
  though if we could just call this default standalone KVM guest
  config instead of emphasizing testing angle.
 
 /me nods agreeingly...
 
 And it should be under HYPERVISOR_GUEST where the rest of this stuff
 resides. Good point.

Sanity check question:

Why not add the select stuff, i.e. this:

select NET
select NETDEVICES
select PCI
select BLOCK
select BLK_DEV
select NETWORK_FILESYSTEMS
select INET
select EXPERIMENTAL
select TTY
select SERIAL_8250
select SERIAL_8250_CONSOLE
select IP_PNP
select IP_PNP_DHCP
select BINFMT_ELF
select PCI_MSI
select HAVE_ARCH_KGDB
select DEBUG_KERNEL
select KGDB
select KGDB_SERIAL_CONSOLE
select VIRTUALIZATION
select VIRTIO
select VIRTIO_RING
select VIRTIO_PCI
select VIRTIO_BLK
select VIRTIO_CONSOLE
select VIRTIO_NET
select 9P_FS
select NET_9P
select NET_9P_VIRTIO

to the KVM_GUEST option below, which we already have? It is, in the same
sense, a KVM guest support matter.

Hmm.

KVM people, any objections?

config KVM_GUEST
bool "KVM Guest support (including kvmclock)"
depends on PARAVIRT
select PARAVIRT_CLOCK
default y
---help---
  This option enables various optimizations for running under the KVM
  hypervisor. It includes a paravirtualized clock, so that instead
  of relying on a PIT (or probably other) emulation by the
  underlying device model, the host provides the guest with
  timing infrastructure such as time of day, and system time

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.


Re: [PATCH v10 6/7] KVM: Let ioapic know the irq line status

2013-04-16 Thread Alexander Graf

On 11.04.2013, at 13:21, Yang Zhang wrote:

 From: Yang Zhang yang.z.zh...@intel.com
 
 Userspace may deliver an RTC interrupt without querying the status. So we
 want to track RTC EOI for this case.
 
 Signed-off-by: Yang Zhang yang.z.zh...@intel.com

This patch breaks ARM host support. Patch following.


Alex



[PATCH] KVM: ARM: Fix kvm_vm_ioctl_irq_line

2013-04-16 Thread Alexander Graf
Commit aa2fbe6d broke the ARM KVM target by introducing a new parameter
to irq handling functions.

Fix the function prototype to get things compiling again and ignore the
parameter, just like we did before.

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/arm/kvm/arm.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index e4ad0bb..678596f 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -805,7 +805,8 @@ static int vcpu_interrupt_line(struct kvm_vcpu *vcpu, int 
number, bool level)
return 0;
 }
 
-int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_level)
+int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_level,
+ bool line_status)
 {
u32 irq = irq_level->irq;
unsigned int irq_type, vcpu_idx, irq_num;
-- 
1.6.0.2



[PATCH 0/7] KVM: irqfd generalization prepare patch set

2013-04-16 Thread Alexander Graf
The concepts of an irqfd and interrupt routing are not particularly tied
to the IOAPIC implementation. In fact, most of the code is already
perfectly generic.

This patch set decouples most bits of the existing irqchip and irqfd
implementation to make it reusable for non-IOAPIC platforms, like the PPC MPIC.

I also have a patch that implements working irqfd support on top of these,
so the concept certainly does work. But that patch requires the in-kernel
MPIC implementation to go upstream first, so I'm holding off on it until
we have settled everything there.

Alex

Alexander Graf (7):
  KVM: Add KVM_IRQCHIP_NUM_PINS in addition to KVM_IOAPIC_NUM_PINS
  KVM: Introduce __KVM_HAVE_IRQCHIP
  KVM: Remove kvm_get_intr_delivery_bitmask
  KVM: Move irq routing to generic code
  KVM: Extract generic irqchip logic into irqchip.c
  KVM: Move irq routing setup to irqchip.c
  KVM: Move irqfd resample cap handling to generic code

 arch/x86/include/asm/kvm_host.h |2 +
 arch/x86/include/uapi/asm/kvm.h |1 +
 arch/x86/kvm/Makefile   |2 +-
 arch/x86/kvm/x86.c  |1 -
 include/linux/kvm_host.h|   14 +--
 include/trace/events/kvm.h  |   12 ++-
 include/uapi/linux/kvm.h|2 +-
 virt/kvm/assigned-dev.c |   30 -
 virt/kvm/eventfd.c  |6 +-
 virt/kvm/irq_comm.c |  193 +---
 virt/kvm/irqchip.c  |  237 +++
 virt/kvm/kvm_main.c |   33 ++
 12 files changed, 297 insertions(+), 236 deletions(-)
 create mode 100644 virt/kvm/irqchip.c



[PATCH 4/7] KVM: Move irq routing to generic code

2013-04-16 Thread Alexander Graf
The IRQ routing set ioctl lives in the hacky device assignment code inside
of KVM today. This is definitely the wrong place for it. Move it to the much
more natural kvm_main.c.

Signed-off-by: Alexander Graf ag...@suse.de
---
 virt/kvm/assigned-dev.c |   30 --
 virt/kvm/kvm_main.c |   30 ++
 2 files changed, 30 insertions(+), 30 deletions(-)

diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
index f4c7f59..8db4370 100644
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -983,36 +983,6 @@ long kvm_vm_ioctl_assigned_device(struct kvm *kvm, 
unsigned ioctl,
goto out;
break;
}
-#ifdef KVM_CAP_IRQ_ROUTING
-   case KVM_SET_GSI_ROUTING: {
-   struct kvm_irq_routing routing;
-   struct kvm_irq_routing __user *urouting;
-   struct kvm_irq_routing_entry *entries;
-
-   r = -EFAULT;
-   if (copy_from_user(&routing, argp, sizeof(routing)))
-   goto out;
-   r = -EINVAL;
-   if (routing.nr >= KVM_MAX_IRQ_ROUTES)
-   goto out;
-   if (routing.flags)
-   goto out;
-   r = -ENOMEM;
-   entries = vmalloc(routing.nr * sizeof(*entries));
-   if (!entries)
-   goto out;
-   r = -EFAULT;
-   urouting = argp;
-   if (copy_from_user(entries, urouting->entries,
-  routing.nr * sizeof(*entries)))
-   goto out_free_irq_routing;
-   r = kvm_set_irq_routing(kvm, entries, routing.nr,
-   routing.flags);
-   out_free_irq_routing:
-   vfree(entries);
-   break;
-   }
-#endif /* KVM_CAP_IRQ_ROUTING */
 #ifdef __KVM_HAVE_MSIX
case KVM_ASSIGN_SET_MSIX_NR: {
struct kvm_assigned_msix_nr entry_nr;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ac3182e..6a71ee3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2273,6 +2273,36 @@ static long kvm_vm_ioctl(struct file *filp,
break;
}
 #endif
+#ifdef KVM_CAP_IRQ_ROUTING
+   case KVM_SET_GSI_ROUTING: {
+   struct kvm_irq_routing routing;
+   struct kvm_irq_routing __user *urouting;
+   struct kvm_irq_routing_entry *entries;
+
+   r = -EFAULT;
+   if (copy_from_user(&routing, argp, sizeof(routing)))
+   goto out;
+   r = -EINVAL;
+   if (routing.nr >= KVM_MAX_IRQ_ROUTES)
+   goto out;
+   if (routing.flags)
+   goto out;
+   r = -ENOMEM;
+   entries = vmalloc(routing.nr * sizeof(*entries));
+   if (!entries)
+   goto out;
+   r = -EFAULT;
+   urouting = argp;
+   if (copy_from_user(entries, urouting->entries,
+  routing.nr * sizeof(*entries)))
+   goto out_free_irq_routing;
+   r = kvm_set_irq_routing(kvm, entries, routing.nr,
+   routing.flags);
+   out_free_irq_routing:
+   vfree(entries);
+   break;
+   }
+#endif /* KVM_CAP_IRQ_ROUTING */
default:
r = kvm_arch_vm_ioctl(filp, ioctl, arg);
if (r == -ENOTTY)
-- 
1.6.0.2



[PATCH 3/7] KVM: Remove kvm_get_intr_delivery_bitmask

2013-04-16 Thread Alexander Graf
The prototype has been stale for a while; I can't spot any real function
definition behind it. Let's just remove it.

Signed-off-by: Alexander Graf ag...@suse.de
---
 include/linux/kvm_host.h |5 -
 1 files changed, 0 insertions(+), 5 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4b30906..48af62a 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -719,11 +719,6 @@ void kvm_unregister_irq_mask_notifier(struct kvm *kvm, int 
irq,
 void kvm_fire_mask_notifiers(struct kvm *kvm, unsigned irqchip, unsigned pin,
 bool mask);
 
-#ifdef __KVM_HAVE_IOAPIC
-void kvm_get_intr_delivery_bitmask(struct kvm_ioapic *ioapic,
-  union kvm_ioapic_redirect_entry *entry,
-  unsigned long *deliver_bitmask);
-#endif
 int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level,
bool line_status);
 int kvm_set_irq_inatomic(struct kvm *kvm, int irq_source_id, u32 irq, int 
level);
-- 
1.6.0.2



[PATCH 5/7] KVM: Extract generic irqchip logic into irqchip.c

2013-04-16 Thread Alexander Graf
The current irq_comm.c file contains pieces of code that are generic
across different irqchip implementations, as well as code that is
fully IOAPIC specific.

Split the generic bits out into irqchip.c.

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/x86/kvm/Makefile  |2 +-
 include/trace/events/kvm.h |   12 +++-
 virt/kvm/irq_comm.c|  117 --
 virt/kvm/irqchip.c |  152 
 4 files changed, 163 insertions(+), 120 deletions(-)
 create mode 100644 virt/kvm/irqchip.c

diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 04d3040..a797b8e 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -7,7 +7,7 @@ CFLAGS_vmx.o := -I.
 
 kvm-y  += $(addprefix ../../../virt/kvm/, kvm_main.o ioapic.o \
coalesced_mmio.o irq_comm.o eventfd.o \
-   assigned-dev.o)
+   assigned-dev.o irqchip.o)
 kvm-$(CONFIG_IOMMU_API)+= $(addprefix ../../../virt/kvm/, iommu.o)
 kvm-$(CONFIG_KVM_ASYNC_PF) += $(addprefix ../../../virt/kvm/, async_pf.o)
 
diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
index 19911dd..2fe2d53 100644
--- a/include/trace/events/kvm.h
+++ b/include/trace/events/kvm.h
@@ -37,7 +37,7 @@ TRACE_EVENT(kvm_userspace_exit,
  __entry->errno < 0 ? -__entry->errno : __entry->reason)
 );
 
-#if defined(__KVM_HAVE_IRQ_LINE)
+#if defined(__KVM_HAVE_IRQ_LINE) || defined(__KVM_HAVE_IRQCHIP)
 TRACE_EVENT(kvm_set_irq,
TP_PROTO(unsigned int gsi, int level, int irq_source_id),
TP_ARGS(gsi, level, irq_source_id),
@@ -122,6 +122,10 @@ TRACE_EVENT(kvm_msi_set_irq,
{KVM_IRQCHIP_PIC_SLAVE, "PIC slave"},   \
{KVM_IRQCHIP_IOAPIC,"IOAPIC"}
 
+#endif /* defined(__KVM_HAVE_IOAPIC) */
+
+#if defined(__KVM_HAVE_IRQCHIP)
+
 TRACE_EVENT(kvm_ack_irq,
TP_PROTO(unsigned int irqchip, unsigned int pin),
TP_ARGS(irqchip, pin),
@@ -136,14 +140,18 @@ TRACE_EVENT(kvm_ack_irq,
__entry->pin = pin;
),
 
+#ifdef kvm_irqchip
TP_printk("irqchip %s pin %u",
  __print_symbolic(__entry->irqchip, kvm_irqchips),
 __entry->pin)
+#else
+   TP_printk("irqchip %d pin %u", __entry->irqchip, __entry->pin)
+#endif
 );
 
+#endif /* defined(__KVM_HAVE_IRQCHIP) */
 
 
-#endif /* defined(__KVM_HAVE_IOAPIC) */
 
 #define KVM_TRACE_MMIO_READ_UNSATISFIED 0
 #define KVM_TRACE_MMIO_READ 1
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index f02659b..3d900ba 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -151,59 +151,6 @@ static int kvm_set_msi_inatomic(struct 
kvm_kernel_irq_routing_entry *e,
return -EWOULDBLOCK;
 }
 
-int kvm_send_userspace_msi(struct kvm *kvm, struct kvm_msi *msi)
-{
-   struct kvm_kernel_irq_routing_entry route;
-
-   if (!irqchip_in_kernel(kvm) || msi->flags != 0)
-   return -EINVAL;
-
-   route.msi.address_lo = msi->address_lo;
-   route.msi.address_hi = msi->address_hi;
-   route.msi.data = msi->data;
-
-   return kvm_set_msi(&route, kvm, KVM_USERSPACE_IRQ_SOURCE_ID, 1, false);
-}
-
-/*
- * Return value:
- *  < 0   Interrupt was ignored (masked or not delivered for other reasons)
- *  = 0   Interrupt was coalesced (previous irq is still pending)
- *  > 0   Number of CPUs interrupt was delivered to
- */
-int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level,
-   bool line_status)
-{
-   struct kvm_kernel_irq_routing_entry *e, irq_set[KVM_NR_IRQCHIPS];
-   int ret = -1, i = 0;
-   struct kvm_irq_routing_table *irq_rt;
-
-   trace_kvm_set_irq(irq, level, irq_source_id);
-
-   /* Not possible to detect if the guest uses the PIC or the
-* IOAPIC.  So set the bit in both. The guest will ignore
-* writes to the unused one.
-*/
-   rcu_read_lock();
-   irq_rt = rcu_dereference(kvm->irq_routing);
-   if (irq < irq_rt->nr_rt_entries)
-   hlist_for_each_entry(e, &irq_rt->map[irq], link)
-   irq_set[i++] = *e;
-   rcu_read_unlock();
-
-   while (i--) {
-   int r;
-   r = irq_set[i].set(&irq_set[i], kvm, irq_source_id, level,
-   line_status);
-   if (r < 0)
-   continue;
-
-   ret = r + ((ret < 0) ? 0 : ret);
-   }
-
-   return ret;
-}
-
 /*
  * Deliver an IRQ in an atomic context if we can, or return a failure,
  * user can retry in a process context.
@@ -241,62 +188,6 @@ int kvm_set_irq_inatomic(struct kvm *kvm, int 
irq_source_id, u32 irq, int level)
return ret;
 }
 
-bool kvm_irq_has_notifier(struct kvm *kvm, unsigned irqchip, unsigned pin)
-{
-   struct kvm_irq_ack_notifier *kian;
-   int gsi;
-
-   rcu_read_lock();
-   gsi = 

[PATCH 6/7] KVM: Move irq routing setup to irqchip.c

2013-04-16 Thread Alexander Graf
Setting up IRQ routes is not IOAPIC specific. Extract everything
that really is generic code into irqchip.c and only leave the ioapic
specific bits to irq_comm.c.

Signed-off-by: Alexander Graf ag...@suse.de
---
 include/linux/kvm_host.h |3 ++
 virt/kvm/irq_comm.c  |   76 ++---
 virt/kvm/irqchip.c   |   85 ++
 3 files changed, 91 insertions(+), 73 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 48af62a..8b25ea2 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -961,6 +961,9 @@ int kvm_set_irq_routing(struct kvm *kvm,
const struct kvm_irq_routing_entry *entries,
unsigned nr,
unsigned flags);
+int kvm_set_routing_entry(struct kvm_irq_routing_table *rt,
+ struct kvm_kernel_irq_routing_entry *e,
+ const struct kvm_irq_routing_entry *ue);
 void kvm_free_irq_routing(struct kvm *kvm);
 
 int kvm_send_userspace_msi(struct kvm *kvm, struct kvm_msi *msi);
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index 3d900ba..c4dbb94 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -272,27 +272,14 @@ void kvm_fire_mask_notifiers(struct kvm *kvm, unsigned 
irqchip, unsigned pin,
rcu_read_unlock();
 }
 
-static int setup_routing_entry(struct kvm_irq_routing_table *rt,
-  struct kvm_kernel_irq_routing_entry *e,
-  const struct kvm_irq_routing_entry *ue)
+int kvm_set_routing_entry(struct kvm_irq_routing_table *rt,
+ struct kvm_kernel_irq_routing_entry *e,
+ const struct kvm_irq_routing_entry *ue)
 {
int r = -EINVAL;
int delta;
unsigned max_pin;
-   struct kvm_kernel_irq_routing_entry *ei;
 
-   /*
-* Do not allow GSI to be mapped to the same irqchip more than once.
-* Allow only one to one mapping between GSI and MSI.
-*/
-   hlist_for_each_entry(ei, &rt->map[ue->gsi], link)
-   if (ei->type == KVM_IRQ_ROUTING_MSI ||
-   ue->type == KVM_IRQ_ROUTING_MSI ||
-   ue->u.irqchip.irqchip == ei->irqchip.irqchip)
-   return r;
-
-   e->gsi = ue->gsi;
-   e->type = ue->type;
switch (ue->type) {
case KVM_IRQ_ROUTING_IRQCHIP:
delta = 0;
@@ -329,68 +316,11 @@ static int setup_routing_entry(struct 
kvm_irq_routing_table *rt,
goto out;
}
 
-   hlist_add_head(&e->link, &rt->map[e->gsi]);
r = 0;
 out:
return r;
 }
 
-int kvm_set_irq_routing(struct kvm *kvm,
-   const struct kvm_irq_routing_entry *ue,
-   unsigned nr,
-   unsigned flags)
-{
-   struct kvm_irq_routing_table *new, *old;
-   u32 i, j, nr_rt_entries = 0;
-   int r;
-
-   for (i = 0; i < nr; ++i) {
-   if (ue[i].gsi >= KVM_MAX_IRQ_ROUTES)
-   return -EINVAL;
-   nr_rt_entries = max(nr_rt_entries, ue[i].gsi);
-   }
-
-   nr_rt_entries += 1;
-
-   new = kzalloc(sizeof(*new) + (nr_rt_entries * sizeof(struct hlist_head))
- + (nr * sizeof(struct kvm_kernel_irq_routing_entry)),
- GFP_KERNEL);
-
-   if (!new)
-   return -ENOMEM;
-
-   new->rt_entries = (void *)&new->map[nr_rt_entries];
-
-   new->nr_rt_entries = nr_rt_entries;
-   for (i = 0; i < 3; i++)
-   for (j = 0; j < KVM_IRQCHIP_NUM_PINS; j++)
-   new->chip[i][j] = -1;
-
-   for (i = 0; i < nr; ++i) {
-   r = -EINVAL;
-   if (ue->flags)
-   goto out;
-   r = setup_routing_entry(new, &new->rt_entries[i], ue);
-   if (r)
-   goto out;
-   ++ue;
-   }
-
-   mutex_lock(&kvm->irq_lock);
-   old = kvm->irq_routing;
-   kvm_irq_routing_update(kvm, new);
-   mutex_unlock(&kvm->irq_lock);
-
-   synchronize_rcu();
-
-   new = old;
-   r = 0;
-
-out:
-   kfree(new);
-   return r;
-}
-
 #define IOAPIC_ROUTING_ENTRY(irq) \
{ .gsi = irq, .type = KVM_IRQ_ROUTING_IRQCHIP,  \
  .u.irqchip.irqchip = KVM_IRQCHIP_IOAPIC, .u.irqchip.pin = (irq) }
diff --git a/virt/kvm/irqchip.c b/virt/kvm/irqchip.c
index d8b06ed..f3a875b 100644
--- a/virt/kvm/irqchip.c
+++ b/virt/kvm/irqchip.c
@@ -150,3 +150,88 @@ void kvm_free_irq_routing(struct kvm *kvm)
   at this stage */
kfree(kvm->irq_routing);
 }
+
+static int setup_routing_entry(struct kvm_irq_routing_table *rt,
+  struct kvm_kernel_irq_routing_entry *e,
+  const struct kvm_irq_routing_entry *ue)
+{
+   int r = -EINVAL;
+   struct kvm_kernel_irq_routing_entry *ei;
+
+   /*
+ 
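The kvm_set_irq_routing() body being moved here follows the classic RCU replace pattern: build a complete new table off to the side, publish the pointer under a lock, wait a grace period (synchronize_rcu()), then free the old table. A single-threaded C sketch of the same shape (the grace-period wait is elided, which is only safe without concurrent readers; all names are illustrative):

```c
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

struct table {
	int nr;
	int map[32];            /* gsi -> pin, fixed-size for the sketch */
};

static _Atomic(struct table *) current_table;

/* reader: load the published pointer and use it (in the kernel this is
 * rcu_dereference() inside rcu_read_lock()/rcu_read_unlock()) */
int lookup(int gsi)
{
	struct table *t = atomic_load(&current_table);

	return (t && gsi >= 0 && gsi < t->nr) ? t->map[gsi] : -1;
}

/* writer: build a full new table, publish it atomically, then free the
 * old one; in the kernel, synchronize_rcu() sits between publish and free */
int set_routing(const int *entries, int nr)
{
	struct table *new, *old;

	if (nr < 0 || nr > 32)
		return -1;
	new = calloc(1, sizeof(*new));
	if (!new)
		return -1;
	new->nr = nr;
	memcpy(new->map, entries, nr * sizeof(int));

	old = atomic_exchange(&current_table, new); /* kvm_irq_routing_update() */
	/* synchronize_rcu() would wait here for all readers to drain */
	free(old);
	return 0;
}
```

The payoff is that lookups (kvm_set_irq's hot path) never take a lock: they only dereference the currently published table.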

[PATCH 7/7] KVM: Move irqfd resample cap handling to generic code

2013-04-16 Thread Alexander Graf
Now that we have most irqfd code completely platform agnostic, let's move
irqfd's resample capability return to generic code as well.

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/x86/kvm/x86.c  |1 -
 virt/kvm/kvm_main.c |3 +++
 2 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ae9744d..9d9904f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2513,7 +2513,6 @@ int kvm_dev_ioctl_check_extension(long ext)
case KVM_CAP_PCI_2_3:
case KVM_CAP_KVMCLOCK_CTRL:
case KVM_CAP_READONLY_MEM:
-   case KVM_CAP_IRQFD_RESAMPLE:
r = 1;
break;
case KVM_CAP_COALESCED_MMIO:
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6a71ee3..6b44ad7 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2432,6 +2432,9 @@ static long kvm_dev_ioctl_check_extension_generic(long 
arg)
 #ifdef CONFIG_HAVE_KVM_MSI
case KVM_CAP_SIGNAL_MSI:
 #endif
+#ifdef CONFIG_HAVE_KVM_IRQCHIP
+   case KVM_CAP_IRQFD_RESAMPLE:
+#endif
return 1;
 #ifdef KVM_CAP_IRQ_ROUTING
case KVM_CAP_IRQ_ROUTING:
-- 
1.6.0.2



[PATCH 2/7] KVM: Introduce __KVM_HAVE_IRQCHIP

2013-04-16 Thread Alexander Graf
Quite a bit of code in KVM has been conditionalized on availability of
IOAPIC emulation. However, most of it is generically applicable to
platforms that don't have an IOAPIC, but a different type of irq chip.

Introduce a new define to distinguish between generic code and IOAPIC
specific code.

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/x86/include/uapi/asm/kvm.h |1 +
 include/linux/kvm_host.h|4 ++--
 include/uapi/linux/kvm.h|2 +-
 virt/kvm/eventfd.c  |6 +++---
 4 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index a65ec29..923478e 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -28,6 +28,7 @@
 /* Select x86 specific features in linux/kvm.h */
 #define __KVM_HAVE_PIT
 #define __KVM_HAVE_IOAPIC
+#define __KVM_HAVE_IRQCHIP
 #define __KVM_HAVE_IRQ_LINE
 #define __KVM_HAVE_DEVICE_ASSIGNMENT
 #define __KVM_HAVE_MSI
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 303d05b..4b30906 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -304,7 +304,7 @@ struct kvm_kernel_irq_routing_entry {
struct hlist_node link;
 };
 
-#ifdef __KVM_HAVE_IOAPIC
+#ifdef __KVM_HAVE_IRQCHIP
 
 struct kvm_irq_routing_table {
int chip[KVM_NR_IRQCHIPS][KVM_IRQCHIP_NUM_PINS];
@@ -432,7 +432,7 @@ void kvm_vcpu_uninit(struct kvm_vcpu *vcpu);
 int __must_check vcpu_load(struct kvm_vcpu *vcpu);
 void vcpu_put(struct kvm_vcpu *vcpu);
 
-#ifdef __KVM_HAVE_IOAPIC
+#ifdef __KVM_HAVE_IRQCHIP
 int kvm_irqfd_init(void);
 void kvm_irqfd_exit(void);
 #else
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 74d0ff3..c38d269 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -579,7 +579,7 @@ struct kvm_ppc_smmu_info {
 #ifdef __KVM_HAVE_PIT
 #define KVM_CAP_REINJECT_CONTROL 24
 #endif
-#ifdef __KVM_HAVE_IOAPIC
+#ifdef __KVM_HAVE_IRQCHIP
 #define KVM_CAP_IRQ_ROUTING 25
 #endif
 #define KVM_CAP_IRQ_INJECT_STATUS 26
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index c5d43ff..0797571 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -35,7 +35,7 @@
 
 #include iodev.h
 
-#ifdef __KVM_HAVE_IOAPIC
+#ifdef __KVM_HAVE_IRQCHIP
 /*
  * 
  * irqfd: Allows an fd to be used to inject an interrupt to the guest
@@ -433,7 +433,7 @@ fail:
 void
 kvm_eventfd_init(struct kvm *kvm)
 {
-#ifdef __KVM_HAVE_IOAPIC
+#ifdef __KVM_HAVE_IRQCHIP
spin_lock_init(&kvm->irqfds.lock);
INIT_LIST_HEAD(&kvm->irqfds.items);
INIT_LIST_HEAD(&kvm->irqfds.resampler_list);
@@ -442,7 +442,7 @@ kvm_eventfd_init(struct kvm *kvm)
INIT_LIST_HEAD(&kvm->ioeventfds);
 }
 
-#ifdef __KVM_HAVE_IOAPIC
+#ifdef __KVM_HAVE_IRQCHIP
 /*
  * shutdown any irqfd's that match fd+gsi
  */
-- 
1.6.0.2



Re: [PULL 0/7] ppc patch queue 2013-03-22

2013-04-16 Thread Alexander Graf

On 12.04.2013, at 22:56, Alexander Graf wrote:

 
 On 12.04.2013, at 22:54, Marcelo Tosatti wrote:
 
 On Thu, Apr 11, 2013 at 03:50:13PM +0200, Alexander Graf wrote:
 
 On 11.04.2013, at 15:45, Marcelo Tosatti wrote:
 
 On Tue, Mar 26, 2013 at 12:59:04PM +1100, Paul Mackerras wrote:
 On Tue, Mar 26, 2013 at 03:33:12AM +0200, Gleb Natapov wrote:
 On Tue, Mar 26, 2013 at 12:35:09AM +0100, Alexander Graf wrote:
 I agree. So if it doesn't hurt to have the same commits in kvm/next and 
 kvm/master, I'd be more than happy to send another pull request with 
 the important fixes against kvm/master as well.
 
 If it results in the same commit showing up twice in Linus' tree in 
 3.10, we cannot do that.
 
 Why not?  In the circumstances it seems perfectly reasonable to me.
 Git should merge the branches without any problem, and even if it
 doesn't, Linus is good at fixing merge conflicts.
 
 Paul.
 
 Yes, we should avoid duplicate commits, but it's not fatal for them to exist.
 
 So I may send a pull request against 3.9 with the 3 commits that already 
 are in kvm/next?
 
 If you decide that the fixes are important enough to justify the
 existence of duplicate commits, I don't see a problem.
 
 Great :). I already sent the pull request out with all patches that fix 
 regressions.

Ping? Did these go to Linus?


Alex



[PATCH 1/7] KVM: Add KVM_IRQCHIP_NUM_PINS in addition to KVM_IOAPIC_NUM_PINS

2013-04-16 Thread Alexander Graf
The concept of routing interrupt lines to an irqchip is not
IOAPIC specific. Every irqchip has a maximum number of pins
that can be linked to irq lines.

So let's add a new define that allows us to reuse generic code for
non-IOAPIC platforms.

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/x86/include/asm/kvm_host.h |2 ++
 include/linux/kvm_host.h|2 +-
 virt/kvm/irq_comm.c |2 +-
 3 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 82f1dc6..6a1871b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -43,6 +43,8 @@
 #define KVM_PIO_PAGE_OFFSET 1
 #define KVM_COALESCED_MMIO_PAGE_OFFSET 2
 
+#define KVM_IRQCHIP_NUM_PINS  KVM_IOAPIC_NUM_PINS
+
 #define CR0_RESERVED_BITS   \
(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
  | X86_CR0_ET | X86_CR0_NE | X86_CR0_WP | X86_CR0_AM \
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4a76aca..303d05b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -307,7 +307,7 @@ struct kvm_kernel_irq_routing_entry {
 #ifdef __KVM_HAVE_IOAPIC
 
 struct kvm_irq_routing_table {
-   int chip[KVM_NR_IRQCHIPS][KVM_IOAPIC_NUM_PINS];
+   int chip[KVM_NR_IRQCHIPS][KVM_IRQCHIP_NUM_PINS];
struct kvm_kernel_irq_routing_entry *rt_entries;
u32 nr_rt_entries;
/*
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index 8efb580..f02659b 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -480,7 +480,7 @@ int kvm_set_irq_routing(struct kvm *kvm,
 
new->nr_rt_entries = nr_rt_entries;
for (i = 0; i < 3; i++)
-   for (j = 0; j < KVM_IOAPIC_NUM_PINS; j++)
+   for (j = 0; j < KVM_IRQCHIP_NUM_PINS; j++)
new->chip[i][j] = -1;
 
for (i = 0; i  nr; ++i) {
-- 
1.6.0.2



Re: [RFC PATCH] Emulate MOVBE

2013-04-16 Thread Gleb Natapov
On Tue, Apr 16, 2013 at 01:47:46PM +0200, Paolo Bonzini wrote:
 On 10/04/2013 12:08, Gleb Natapov wrote:
   What is the opinion from the KVM folks on this? Shall we start to
   emulate instructions the host does not provide? In this particular case
   a relatively simple patch fixes a problem (starting Atom optimized
   kernels on non-Atom machines).
  We can add the emulation, but we should not start announcing the instruction
  availability to a guest if host cpu does not have it by default. This
  may trick a guest into thinking that movbe is the fastest way to do
  something when it is not.
  
 
 This does highlight a weakness in CPU_GET_SUPPORTED_CPUID, but I think
 this is not a problem in practice.
 
 With a management layer such as oVirt it's not a problem.  For example,
 oVirt has its own library of processors.  It doesn't care if KVM enables
 movbe.  If you tell it your datacenter is a mix of Haswells and Sandy
 Bridges it will pick the CPUID bits that are common to all.
 
 However, even without a suitable management layer it is also not really
 a problem.
 
 The only processors that support MOVBE are Atom and Haswell.  Haswell
 adds a whole lot of extra CPUID features, hence -cpu Haswell,enforce
 will fail with or without movbe emulation.  -cpu Haswell will disable
 all the new Haswell features except movbe, which will remain slow; that's
 fine, I think, since it's not what you'd do except to play with CPU models.
 
No, that's not fine. KVM should not trick userspace (QEMU is just one of
them) into nonoptimal configuration. And you forgot about -cpu host in your
analysis.

--
Gleb.


Re: [PATCH -v2] kvm: Emulate MOVBE

2013-04-16 Thread Gleb Natapov
On Sun, Apr 14, 2013 at 07:32:16PM +0200, Borislav Petkov wrote:
 On Sun, Apr 14, 2013 at 10:41:07AM +0300, Gleb Natapov wrote:
  Currently userspace assumes that the cpuid configuration returned by
  KVM_GET_SUPPORTED_CPUID is the optimal one. What we want here is a way
  for KVM to tell userspace that it can emulate movbe though it is not
  optimal.
 
 Ok, I don't understand.
 
  You want to tell userspace: yes, we do support a hw feature, but we
  emulate it?
 
I am contemplating this, yes.

  Userspace alone cannot figure it out. It can check host's cpuid
  directly and assume that if cpuid bit is not present on the host cpu,
  but reported as supported by KVM then it is emulated by KVM and this is
  not optimal, but sometimes emulation is actually desirable (x2apic), so
  such assumption is not always correct.
 
  Right, and this is what we have, AFAICT. And if userspace does what you
  exemplify above, you get exactly that: a feature bit not set in host
  CPUID but reported as set by KVM means it is emulated. There's no room
  for other interpretations here. Which probably also means not optimal,
  because it is not done in hw.
 
This is not true for all emulated CPUID bits. X2APIC is emulated and it
is preferable for a guest to use it for example.

 Or, do you want to have a way to say with KVM_GET_SUPPORTED_CPUID that
 the features I'm reporting to you are those - a subset of them are not
 optimally supported because I'm emulating them.
 
 Am I close?
 
Yes, you are. I am considering such an interface. Adding new specialized
interfaces is a last resort, though. I am open to other ideas.
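The heuristic Gleb describes — a bit present in KVM_GET_SUPPORTED_CPUID but absent from the host's raw CPUID must be emulated — can be sketched for the MOVBE case like this (x86-only; the KVM ioctl plumbing is omitted, and kvm_supported_ecx is assumed to hold the ECX word of leaf 1 returned by KVM_GET_SUPPORTED_CPUID):

```c
#include <cpuid.h>            /* GCC/Clang __get_cpuid() wrapper */

#define MOVBE_BIT (1u << 22)  /* CPUID.01H:ECX[22] = MOVBE */

/* what the physical CPU really implements */
static int host_has_movbe(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return 0;
	return !!(ecx & MOVBE_BIT);
}

/* userspace heuristic from the thread: advertised by KVM but missing
 * on the host means the feature must be emulated (and possibly slow) */
int movbe_is_emulated(unsigned int kvm_supported_ecx)
{
	return (kvm_supported_ecx & MOVBE_BIT) && !host_has_movbe();
}
```

As the thread notes, this heuristic is not always the right policy: an emulated bit like x2apic is still worth exposing to the guest, so "emulated" cannot simply be equated with "undesirable".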

--
Gleb.


Re: [PATCH] KVM: ARM: Fix kvm_vm_ioctl_irq_line

2013-04-16 Thread Christoffer Dall
On Tue, Apr 16, 2013 at 10:21 AM, Alexander Graf ag...@suse.de wrote:
 Commit aa2fbe6d broke the ARM KVM target by introducing a new parameter
 to irq handling functions.

 Fix the function prototype to get things compiling again and ignore the
 parameter just like we did before.

 Signed-off-by: Alexander Graf ag...@suse.de
 ---
  arch/arm/kvm/arm.c |3 ++-
  1 files changed, 2 insertions(+), 1 deletions(-)

 diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
 index e4ad0bb..678596f 100644
 --- a/arch/arm/kvm/arm.c
 +++ b/arch/arm/kvm/arm.c
 @@ -805,7 +805,8 @@ static int vcpu_interrupt_line(struct kvm_vcpu *vcpu, int number, bool level)
 return 0;
  }

 -int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_level)
 +int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_level,
 + bool line_status)
  {
 u32 irq = irq_level->irq;
 unsigned int irq_type, vcpu_idx, irq_num;
 --
 1.6.0.2


Acked-by: Christoffer Dall cd...@cs.columbia.edu


Re: [PATCH] mm: mmu_notifier: re-fix freed page still mapped in secondary MMU

2013-04-16 Thread Robin Holt
On Tue, Apr 16, 2013 at 09:07:20PM +0800, Xiao Guangrong wrote:
 On 04/16/2013 07:43 PM, Robin Holt wrote:
  Argh.  Taking a step back helped clear my head.
  
  For the -stable releases, I agree we should just go with your
  revert-plus-hlist_del_init_rcu patch.  I will give it a test
  when I am in the office.
 
 Okay. Wait for your test report. Thank you in advance.
 
  
  For the v3.10 release, we should work on making this more
  correct and completely documented.
 
 Better document is always welcomed.
 
 A double call of ->release is not bad, as I mentioned in the changelog:
 
 it is really rare (e.g., cannot happen on kvm since mmu-notify is unregistered
 after exit_mmap()) and the later call of multiple ->release should be
 fast since all the pages have already been released by the first call.
 
 But, of course, it's great if you have a _light_ way to avoid this.

Getting my test environment set back up took longer than I would have liked.

Your patch passed.  I got no NULL-pointer derefs.

How would you feel about adding the following to your patch?

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index deca874..ff2fd5f 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -157,6 +157,7 @@ struct mmu_notifier_ops {
 struct mmu_notifier {
struct hlist_node hlist;
const struct mmu_notifier_ops *ops;
+   int released;
 };
 
 static inline int mm_has_notifiers(struct mm_struct *mm)
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 606777a..949704b 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -44,7 +44,8 @@ void __mmu_notifier_release(struct mm_struct *mm)
 * ->release returns.
 */
id = srcu_read_lock(&srcu);
-   hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist)
+   hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
+   int released;
/*
 * if ->release runs before mmu_notifier_unregister it
 * must be handled as it's the only way for the driver
@@ -52,8 +53,10 @@ void __mmu_notifier_release(struct mm_struct *mm)
 * from establishing any more sptes before all the
 * pages in the mm are freed.
 */
-   if (mn->ops->release)
+   released = xchg(&mn->released, 1);
+   if (mn->ops->release && !released)
mn->ops->release(mn, mm);
+   }
srcu_read_unlock(&srcu, id);
 
spin_lock(&mm->mmu_notifier_mm->lock);
@@ -214,6 +217,7 @@ static int do_mmu_notifier_register(struct mmu_notifier *mn,
mm->mmu_notifier_mm = mmu_notifier_mm;
mmu_notifier_mm = NULL;
}
+   mn->released = 0;
atomic_inc(&mm->mm_count);
 
/*
@@ -295,6 +299,7 @@ void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
 * before freeing the pages.
 */
int id;
+   int released;
 
id = srcu_read_lock(&srcu);
/*
@@ -302,7 +307,8 @@ void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
 * guarantee ->release is called before freeing the
 * pages.
 */
-   if (mn->ops->release)
+   released = xchg(&mn->released, 1);
+   if (mn->ops->release && !released)
mn->ops->release(mn, mm);
srcu_read_unlock(&srcu, id);
 


Re: [PATCH v5 2/2] tcm_vhost: Wait for pending requests in vhost_scsi_flush()

2013-04-16 Thread Michael S. Tsirkin
On Tue, Apr 16, 2013 at 05:16:51PM +0800, Asias He wrote:
 This patch makes vhost_scsi_flush() wait for all the pending requests
 issued before the flush operation to be finished.
 
 Changes in v5:
 - Use kref and completion
  - Fail req if vs->vs_inflight is NULL
 - Rename tcm_vhost_alloc_inflight to tcm_vhost_set_inflight
 
 Changes in v4:
 - Introduce vhost_scsi_inflight
 - Drop array to track flush
 - Use RCU to protect vs_inflight explicitly
 
 Changes in v3:
 - Rebase
 - Drop 'tcm_vhost: Wait for pending requests in
   vhost_scsi_clear_endpoint()' in this series, we already did that in
    'tcm_vhost: Use vq->private_data to indicate if the endpoint is setup'
 
 Changes in v2:
 - Increase/Decrease inflight requests in
   vhost_scsi_{allocate,free}_cmd and tcm_vhost_{allocate,free}_evt
 
 Signed-off-by: Asias He as...@redhat.com

OK looks good, except error handling needs to be fixed.

 ---
  drivers/vhost/tcm_vhost.c | 101 +++---
  drivers/vhost/tcm_vhost.h |   5 +++
  2 files changed, 101 insertions(+), 5 deletions(-)
 
 diff --git a/drivers/vhost/tcm_vhost.c b/drivers/vhost/tcm_vhost.c
 index 4ae6725..ef40a8f 100644
 --- a/drivers/vhost/tcm_vhost.c
 +++ b/drivers/vhost/tcm_vhost.c
 @@ -74,6 +74,11 @@ enum {
  #define VHOST_SCSI_MAX_VQ 128
  #define VHOST_SCSI_MAX_EVENT 128
  
 +struct vhost_scsi_inflight {
 + struct completion comp; /* Wait for the flush operation to finish */
 + struct kref kref; /* Refcount for the inflight reqs */
 +};
 +
  struct vhost_scsi {
   /* Protected by vhost_scsi->dev.mutex */
   struct tcm_vhost_tpg **vs_tpg;
 @@ -91,6 +96,8 @@ struct vhost_scsi {
   struct mutex vs_events_lock; /* protect vs_events_dropped,events_nr */
   bool vs_events_dropped; /* any missed events */
   int vs_events_nr; /* num of pending events */
 +
 + struct vhost_scsi_inflight __rcu *vs_inflight; /* track inflight reqs */
  };
  
  /* Local pointer to allocated TCM configfs fabric module */
 @@ -108,6 +115,51 @@ static int iov_num_pages(struct iovec *iov)
   ((unsigned long)iov->iov_base & PAGE_MASK)) >> PAGE_SHIFT;
  }
  
 +static int tcm_vhost_set_inflight(struct vhost_scsi *vs)
 +{
 + struct vhost_scsi_inflight *inflight;
 + int ret = -ENOMEM;
 +
 + inflight = kzalloc(sizeof(*inflight), GFP_KERNEL);

kzalloc is not needed, you initialize all fields.

 + if (inflight) {
 + kref_init(inflight-kref);
 + init_completion(inflight-comp);
 + ret = 0;
 + }
  + rcu_assign_pointer(vs->vs_inflight, inflight);

So if allocation fails, we stop tracking inflights?
This looks strange, and could break guests. Why not the usual
if (!inflight)
return -ENOMEM;

 + synchronize_rcu();

open call is different:
- sync is not needed
- should use RCU_INIT_POINTER and not rcu_assign_pointer

So please move these out and make this function return the struct:
struct vhost_scsi_inflight *inflight
tcm_vhost_alloc_inflight(void)


 +
 + return ret;
 +}
 +
 +static struct vhost_scsi_inflight *
 +tcm_vhost_inc_inflight(struct vhost_scsi *vs)

And then inc will not need to return inflight pointer,
which is really unusual.

 +{
 + struct vhost_scsi_inflight *inflight;
 +
 + rcu_read_lock();
  + inflight = rcu_dereference(vs->vs_inflight);
  + if (inflight)
  + kref_get(&inflight->kref);
 + rcu_read_unlock();
 +
 + return inflight;
 +}
 +
 +void tcm_vhost_done_inflight(struct kref *kref)
 +{
 + struct vhost_scsi_inflight *inflight;
 +
 + inflight = container_of(kref, struct vhost_scsi_inflight, kref);
  + complete(&inflight->comp);
 +}
 +
 +static void tcm_vhost_dec_inflight(struct vhost_scsi_inflight *inflight)
 +{
 + if (inflight)

Here as in other places, inflight must never be NULL.
Pls fix code so that invariant holds.

  + kref_put(&inflight->kref, tcm_vhost_done_inflight);
 +}
 +
  static bool tcm_vhost_check_feature(struct vhost_scsi *vs, int feature)
  {
   bool ret = false;
 @@ -402,6 +454,7 @@ static int tcm_vhost_queue_tm_rsp(struct se_cmd *se_cmd)
  static void tcm_vhost_free_evt(struct vhost_scsi *vs, struct tcm_vhost_evt *evt)
  {
   mutex_lock(&vs->vs_events_lock);
  + tcm_vhost_dec_inflight(evt->inflight);
   vs->vs_events_nr--;
   kfree(evt);
   mutex_unlock(&vs->vs_events_lock);
  @@ -413,21 +466,27 @@ static struct tcm_vhost_evt *tcm_vhost_allocate_evt(struct vhost_scsi *vs,
   struct tcm_vhost_evt *evt;
  
   mutex_lock(&vs->vs_events_lock);
  - if (vs->vs_events_nr > VHOST_SCSI_MAX_EVENT) {
  - vs->vs_events_dropped = true;
  - mutex_unlock(&vs->vs_events_lock);
  - return NULL;
  - }
  + if (vs->vs_events_nr > VHOST_SCSI_MAX_EVENT)
 + goto out;
  
   evt = kzalloc(sizeof(*evt), GFP_KERNEL);

BTW it looks like we should replace this kzalloc with kmalloc.
Should be a separate patch 

[PATCH] kvm: Allow build-time configuration of KVM device assignment

2013-04-16 Thread Alex Williamson
We hope to at some point deprecate KVM legacy device assignment in
favor of VFIO-based assignment.  Towards that end, allow legacy
device assignment to be deconfigured.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---

This depends on Alex Graf's irqfd generalization series to remove
IRQ routing code from assigned-dev.c.

 arch/ia64/include/uapi/asm/kvm.h |1 -
 arch/ia64/kvm/Kconfig|   13 +++--
 arch/ia64/kvm/Makefile   |6 +++---
 arch/ia64/kvm/kvm-ia64.c |2 --
 arch/x86/include/uapi/asm/kvm.h  |1 -
 arch/x86/kvm/Kconfig |   13 +++--
 arch/x86/kvm/Makefile|5 +++--
 arch/x86/kvm/x86.c   |6 --
 include/linux/kvm_host.h |   30 --
 include/uapi/linux/kvm.h |4 
 10 files changed, 40 insertions(+), 41 deletions(-)

diff --git a/arch/ia64/include/uapi/asm/kvm.h b/arch/ia64/include/uapi/asm/kvm.h
index ec6c6b3..99503c2 100644
--- a/arch/ia64/include/uapi/asm/kvm.h
+++ b/arch/ia64/include/uapi/asm/kvm.h
@@ -27,7 +27,6 @@
 /* Select x86 specific features in linux/kvm.h */
 #define __KVM_HAVE_IOAPIC
 #define __KVM_HAVE_IRQ_LINE
-#define __KVM_HAVE_DEVICE_ASSIGNMENT
 
 /* Architectural interrupt line count. */
 #define KVM_NR_INTERRUPTS 256
diff --git a/arch/ia64/kvm/Kconfig b/arch/ia64/kvm/Kconfig
index 2cd225f..e792664 100644
--- a/arch/ia64/kvm/Kconfig
+++ b/arch/ia64/kvm/Kconfig
@@ -21,8 +21,6 @@ config KVM
	tristate "Kernel-based Virtual Machine (KVM) support"
	depends on BROKEN
	depends on HAVE_KVM && MODULES
-   # for device assignment:
-   depends on PCI
depends on BROKEN
select PREEMPT_NOTIFIERS
select ANON_INODES
@@ -50,6 +48,17 @@ config KVM_INTEL
  Provides support for KVM on Itanium 2 processors equipped with the VT
  extensions.
 
+config KVM_DEVICE_ASSIGNMENT
+   bool "KVM legacy PCI device assignment support"
+   depends on KVM && PCI && IOMMU_API
+   default y
+   ---help---
+ Provide support for legacy PCI device assignment through KVM.  The
+ kernel now also supports a full featured userspace device driver
+ framework through VFIO, which supersedes much of this support.
+
+ If unsure, say Y.
+
 source "drivers/vhost/Kconfig"
 
 endif # VIRTUALIZATION
diff --git a/arch/ia64/kvm/Makefile b/arch/ia64/kvm/Makefile
index db3d7c5..1a40537 100644
--- a/arch/ia64/kvm/Makefile
+++ b/arch/ia64/kvm/Makefile
@@ -49,10 +49,10 @@ ccflags-y := -Ivirt/kvm -Iarch/ia64/kvm/
 asflags-y := -Ivirt/kvm -Iarch/ia64/kvm/
 
 common-objs = $(addprefix ../../../virt/kvm/, kvm_main.o ioapic.o \
-   coalesced_mmio.o irq_comm.o assigned-dev.o)
+   coalesced_mmio.o irq_comm.o)
 
-ifeq ($(CONFIG_IOMMU_API),y)
-common-objs += $(addprefix ../../../virt/kvm/, iommu.o)
+ifeq ($(CONFIG_KVM_DEVICE_ASSIGNMENT),y)
+common-objs += $(addprefix ../../../virt/kvm/, assigned-dev.o iommu.o)
 endif
 
 kvm-objs := $(common-objs) kvm-ia64.o kvm_fw.o
diff --git a/arch/ia64/kvm/kvm-ia64.c b/arch/ia64/kvm/kvm-ia64.c
index 7a54455..a21c2c5 100644
--- a/arch/ia64/kvm/kvm-ia64.c
+++ b/arch/ia64/kvm/kvm-ia64.c
@@ -1366,9 +1366,7 @@ void kvm_arch_sync_events(struct kvm *kvm)
 void kvm_arch_destroy_vm(struct kvm *kvm)
 {
kvm_iommu_unmap_guest(kvm);
-#ifdef  KVM_CAP_DEVICE_ASSIGNMENT
kvm_free_all_assigned_devices(kvm);
-#endif
kfree(kvm-arch.vioapic);
kvm_release_vm_pages(kvm);
 }
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 923478e..63e6622 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -30,7 +30,6 @@
 #define __KVM_HAVE_IOAPIC
 #define __KVM_HAVE_IRQCHIP
 #define __KVM_HAVE_IRQ_LINE
-#define __KVM_HAVE_DEVICE_ASSIGNMENT
 #define __KVM_HAVE_MSI
 #define __KVM_HAVE_USER_NMI
 #define __KVM_HAVE_GUEST_DEBUG
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 586f000..46e0832 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -21,8 +21,6 @@ config KVM
	tristate "Kernel-based Virtual Machine (KVM) support"
depends on HAVE_KVM
depends on HIGH_RES_TIMERS
-   # for device assignment:
-   depends on PCI
# for TASKSTATS/TASK_DELAY_ACCT:
depends on NET
select PREEMPT_NOTIFIERS
@@ -82,6 +80,17 @@ config KVM_MMU_AUDIT
 This option adds a R/W kVM module parameter 'mmu_audit', which allows
 audit KVM MMU at runtime.
 
+config KVM_DEVICE_ASSIGNMENT
+   bool "KVM legacy PCI device assignment support"
+   depends on KVM && PCI && IOMMU_API
+   default y
+   ---help---
+ Provide support for legacy PCI device assignment through KVM.  The
+ kernel now also supports a full featured userspace device driver
+ framework through VFIO, which supersedes much of this support.
+
+ If unsure, say Y.
+
 # OK, it's a little counter-intuitive to do 

Perf tuning help?

2013-04-16 Thread Mason Turner
We have an in-house app, written in c, that is not performing as well as we'd 
hoped it would when moving to a VM. We've tried all the common tuning 
recommendations (virtio, tap interface, cpu pinning), without any change in 
performance. Even terminating all of the other VMs on the host doesn't make a 
difference. The VM doesn't appear to be CPU, memory or IO bound. We are trying 
to maximize UDP-based QPS against the in-house app.

I've been running strace against the app and perf kvm against the VM to try 
to identify any bottlenecks. I would say there are a lot of kvm_exits, but I'm 
not sure how to quantify what is acceptable and what is not.

We are trying to maximize UDP queries against the app. I've read a few times 
that the virtio network stack results in a lot of vm_exits. Unfortunately, we 
can't use the direct PCI access with our hardware.

Is there a good resource on inefficient system calls? Things that result in 
higher than normal kvm_exits, or other performance killers?

Thanks for the help.

Our hypervisor is running on
CentOS 6.3: 2.6.32-279.22.1.el6.x86_64
qemu-kvm 0.12.1.2  
libvirt 0.9.10

Our app is running on
Centos 6.1: 2.6.32-131.0.15.el6.x86_64

<domain type='kvm'>
  <name>thing1</name>
  <uuid>abe76ce9-60a0-4727-a7ae-cf572e5c3f21</uuid>
  <memory unit='KiB'>16384000</memory>
  <currentMemory unit='KiB'>16384000</currentMemory>
  <vcpu placement='static'>6</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='0'/>
    <vcpupin vcpu='1' cpuset='2'/>
    <vcpupin vcpu='2' cpuset='4'/>
    <vcpupin vcpu='3' cpuset='6'/>
    <vcpupin vcpu='4' cpuset='8'/>
    <vcpupin vcpu='5' cpuset='10'/>
  </cputune>
  <numatune>
    <memory mode='interleave' nodeset='0,2,4,6,8,10'/>
  </numatune>
  <os>
    <type arch='x86_64' machine='rhel6.0.0'>hvm</type>
    <boot dev='hd'/>
  </os>
  <features>
    <acpi/>
    <apic/>
    <pae/>
  </features>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/libexec/qemu-kvm</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <source file='/var/lib/libvirt/images/thing1-disk0'/>
      <target dev='vda' bus='virtio'/>
      <address type='pci' domain='0x' bus='0x00' slot='0x05' function='0x0'/>
    </disk>
    <controller type='usb' index='0'>
      <address type='pci' domain='0x' bus='0x00' slot='0x01' function='0x2'/>
    </controller>
    <interface type='bridge'>
      <mac address='00:5e:e3:e1:8a:aa'/>
      <source bridge='virbr0'/>
      <model type='virtio'/>
      <address type='pci' domain='0x' bus='0x00' slot='0x04' function='0x0'/>
    </interface>
    <serial type='pty'>
      <target port='0'/>
    </serial>
    <console type='pty'>
      <target type='serial' port='0'/>
    </console>
    <input type='tablet' bus='usb'/>
    <input type='mouse' bus='ps2'/>
    <graphics type='vnc' port='-1' autoport='yes' keymap='en-us'/>
    <video>
      <model type='cirrus' vram='9216' heads='1'/>
      <address type='pci' domain='0x' bus='0x00' slot='0x02' function='0x0'/>
    </video>
    <memballoon model='virtio'>
      <address type='pci' domain='0x' bus='0x00' slot='0x06' function='0x0'/>
    </memballoon>
  </devices>
</domain>



Re: [GIT PULL] KVM/ARM Minor fixes for 3.9

2013-04-16 Thread Marcelo Tosatti
On Mon, Apr 15, 2013 at 01:52:15AM -0700, Christoffer Dall wrote:
 Hi Marcelo and Gleb,
 
 The following changes since commit 41ef2d5678d83af030125550329b6ae8b74618fa:
 
   Linux 3.9-rc7 (2013-04-14 17:45:16 -0700)
 
 are available in the git repository at:
 
   git://github.com/columbia/linux-kvm-arm.git kvm-arm-fixes-for-3.9
 
 for you to fetch changes up to b03e1d4f253f256ef7ace14eb6ac8a421fd5d625:
 
   ARM: KVM: fix L_PTE_S2_RDWR to actually be Read/Write (2013-04-14
 21:52:27 -0700)
 
 I hope these can make it in in time,
 
 Thanks!
 -Christoffer
 
 
 Marc Zyngier (2):
   ARM: KVM: fix KVM_CAP_ARM_SET_DEVICE_ADDR reporting
   ARM: KVM: fix L_PTE_S2_RDWR to actually be Read/Write
 
  arch/arm/include/asm/pgtable-3level.h |2 +-
  arch/arm/kvm/arm.c|1 +
  2 files changed, 2 insertions(+), 1 deletion(-)

Should provide a pull request against kvm.git master branch (which is at
-rc6).

Are there any dependencies between these fixes and -rc7 ?



Re: [PATCH] kvm: Allow build-time configuration of KVM device assignment

2013-04-16 Thread Alexander Graf

On 16.04.2013, at 21:49, Alex Williamson wrote:

 We hope to at some point deprecate KVM legacy device assignment in
 favor of VFIO-based assignment.  Towards that end, allow legacy
 device assignment to be deconfigured.
 
 Signed-off-by: Alex Williamson alex.william...@redhat.com

Definitely a step into the right direction. And it should also fix a build 
error that I experienced with CONFIG_IOMMU and CONFIG_KVM on ppc and arm.


Reviewed-by: Alexander Graf ag...@suse.de



Re: [PATCH v10 0/7] KVM: VMX: Add Posted Interrupt supporting

2013-04-16 Thread Marcelo Tosatti
On Sun, Apr 14, 2013 at 12:40:00PM +0300, Gleb Natapov wrote:
 On Thu, Apr 11, 2013 at 07:25:09PM +0800, Yang Zhang wrote:
  From: Yang Zhang yang.z.zh...@intel.com
  
   The following patches add Posted Interrupt support to KVM:
   The first patch enables the feature 'acknowledge interrupt on vmexit'.
   Since it is required by Posted Interrupt, we need to enable it first.
   
   And the subsequent patches add the posted interrupt support:
   Posted Interrupt allows APIC interrupts to be injected into the guest
   directly without any vmexit.
   
   - When delivering an interrupt to the guest, if the target vcpu is running,
     update the Posted-interrupt requests bitmap and send a notification event
     to the vcpu. Then the vcpu will handle this interrupt automatically,
     without any software involvement.
   
   - If the target vcpu is not running or there is already a notification event
     pending in the vcpu, do nothing. The interrupt will be handled by the
     next vm entry.
  
 Reviewed-by: Gleb Natapov g...@redhat.com

Applied, thanks.



Re: [PATCH] KVM: ARM: Fix kvm_vm_ioctl_irq_line

2013-04-16 Thread Marcelo Tosatti
On Tue, Apr 16, 2013 at 11:07:40AM -0700, Christoffer Dall wrote:
 On Tue, Apr 16, 2013 at 10:21 AM, Alexander Graf ag...@suse.de wrote:
  Commit aa2fbe6d broke the ARM KVM target by introducing a new parameter
  to irq handling functions.
 
  Fix the function prototype to get things compiling again and ignore the
  parameter just like we did before.
 
  Signed-off-by: Alexander Graf ag...@suse.de
  ---
   arch/arm/kvm/arm.c |3 ++-
   1 files changed, 2 insertions(+), 1 deletions(-)
 
  diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
  index e4ad0bb..678596f 100644
  --- a/arch/arm/kvm/arm.c
  +++ b/arch/arm/kvm/arm.c
  @@ -805,7 +805,8 @@ static int vcpu_interrupt_line(struct kvm_vcpu *vcpu, int number, bool level)
  return 0;
   }
 
  -int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_level)
  +int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_level,
  + bool line_status)
   {
  u32 irq = irq_level->irq;
  unsigned int irq_type, vcpu_idx, irq_num;
  --
  1.6.0.2
 
 
 Acked-by: Christoffer Dall cd...@cs.columbia.edu

Applied, thanks.



Re: [PATCH v2] kvm: nVMX: check vmcs12 for valid activity state

2013-04-16 Thread Marcelo Tosatti
On Mon, Apr 15, 2013 at 03:00:27PM +0200, Paolo Bonzini wrote:
 KVM does not use the activity state VMCS field, and does not support
 it in nested VMX either (the corresponding bits in the misc VMX feature
 MSR are zero).  Fail entry if the activity state is set to anything but
 active.
 
 Since the value will always be the same for L1 and L2, we do not need
 to read and write the corresponding VMCS field on L1/L2 transitions,
 either.
 
 Signed-off-by: Paolo Bonzini pbonz...@redhat.com

Applied, thanks.



Re: [PULL 0/7] ppc patch queue 2013-03-22

2013-04-16 Thread Marcelo Tosatti
On Tue, Apr 16, 2013 at 07:26:35PM +0200, Alexander Graf wrote:
  So I may send a pull request against 3.9 with the 3 commits that already 
  are in kvm/next?
  
  If you decide that the fixes are important enough to justify the
   existence of duplicate commits, I don't see a problem.
  
  Great :). I already sent the pull request out with all patches that fix 
  regressions.
 
 Ping? Did these go to Linus?

Waiting on Christoffer Dall to generate ARM fixes against kvm.git master
branch.



Re: [PATCH] KVM: VMX: Fix check guest state validity if a guest is in VM86 mode

2013-04-16 Thread Marcelo Tosatti
On Sun, Apr 14, 2013 at 04:07:37PM +0300, Gleb Natapov wrote:
 If guest vcpu is in VM86 mode the vcpu state should be checked as if in
 real mode.
 
 Signed-off-by: Gleb Natapov g...@redhat.com

Applied, thanks.



Re: [GIT PULL] KVM/ARM Minor fixes for 3.9

2013-04-16 Thread Christoffer Dall
On Tue, Apr 16, 2013 at 2:03 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
 On Mon, Apr 15, 2013 at 01:52:15AM -0700, Christoffer Dall wrote:
 Hi Marcelo and Gleb,

 The following changes since commit 41ef2d5678d83af030125550329b6ae8b74618fa:

   Linux 3.9-rc7 (2013-04-14 17:45:16 -0700)

 are available in the git repository at:

   git://github.com/columbia/linux-kvm-arm.git kvm-arm-fixes-for-3.9

 for you to fetch changes up to b03e1d4f253f256ef7ace14eb6ac8a421fd5d625:

   ARM: KVM: fix L_PTE_S2_RDWR to actually be Read/Write (2013-04-14
 21:52:27 -0700)

 I hope these can make it in in time,

 Thanks!
 -Christoffer

 
 Marc Zyngier (2):
   ARM: KVM: fix KVM_CAP_ARM_SET_DEVICE_ADDR reporting
   ARM: KVM: fix L_PTE_S2_RDWR to actually be Read/Write

  arch/arm/include/asm/pgtable-3level.h |2 +-
  arch/arm/kvm/arm.c|1 +
  2 files changed, 2 insertions(+), 1 deletion(-)

 Should provide a pull request against kvm.git master branch (which is at
 -rc6).

 Are there any dependencies between these fixes and -rc7 ?



nope, I'll send a new pull request.

-Christoffer


[GIT PULL v2] KVM/ARM Minor fixes for 3.9

2013-04-16 Thread Christoffer Dall
The following changes since commit 31880c37c11e28cb81c70757e38392b42e695dc6:

  Linux 3.9-rc6 (2013-04-07 20:49:54 -0700)

are available in the git repository at:

  git://github.com/columbia/linux-kvm-arm.git kvm-arm-fixes-3.9

for you to fetch changes up to 865499ea90d399e0682bcce3ae7af24277633699:

  ARM: KVM: fix L_PTE_S2_RDWR to actually be Read/Write (2013-04-16
16:21:25 -0700)


Marc Zyngier (2):
  ARM: KVM: fix KVM_CAP_ARM_SET_DEVICE_ADDR reporting
  ARM: KVM: fix L_PTE_S2_RDWR to actually be Read/Write

 arch/arm/include/asm/pgtable-3level.h |2 +-
 arch/arm/kvm/arm.c|1 +
 2 files changed, 2 insertions(+), 1 deletion(-)


[GIT PULL] KVM fixes for 3.9-rc7

2013-04-16 Thread Marcelo Tosatti

Linus,

Please pull from

git://git.kernel.org/pub/scm/virt/kvm/kvm.git master

To receive the following PPC and ARM KVM fixes

Marc Zyngier (2):
  ARM: KVM: fix KVM_CAP_ARM_SET_DEVICE_ADDR reporting
  ARM: KVM: fix L_PTE_S2_RDWR to actually be Read/Write

Marcelo Tosatti (1):
  Merge branch 'kvm-arm-fixes-3.9' of 
git://github.com/columbia/linux-kvm-arm

Scott Wood (4):
  kvm/powerpc/e500mc: fix tlb invalidation on cpu migration
  kvm/ppc/e500: h2g_tlb1_rmap: esel 0 is valid
  kvm/ppc/e500: g2h_tlb1_map: clear old bit before setting new bit
  kvm/ppc/e500: eliminate tlb_refs

 arch/arm/include/asm/pgtable-3level.h |2 
 arch/arm/kvm/arm.c|1 
 arch/powerpc/kvm/e500.h   |   24 +++--
 arch/powerpc/kvm/e500_mmu_host.c  |   86 +++---
 arch/powerpc/kvm/e500mc.c |7 ++
 5 files changed, 44 insertions(+), 76 deletions(-)


Re: [PATCH] x86: Add a Kconfig shortcut for a kvm-bootable kernel

2013-04-16 Thread Sasha Levin
On 04/16/2013 12:18 PM, Borislav Petkov wrote:
 On Sun, Apr 14, 2013 at 01:03:20PM +0200, Borislav Petkov wrote:
 On Sun, Apr 14, 2013 at 12:31:12PM +0300, Pekka Enberg wrote:
 I obviously support having something like this in mainline. I wonder
 though if we could just call this default standalone KVM guest
 config instead of emphasizing testing angle.

 /me nods agreeingly...

 And it should be unter HYPERVISOR_GUEST where the rest of this stuff
 resides. Good point.
 
 Sanity check question:
 
 Why not add the select stuff, i.e. this:
 
   select NET
   select NETDEVICES
   select PCI
   select BLOCK
   select BLK_DEV
   select NETWORK_FILESYSTEMS
   select INET
   select EXPERIMENTAL
   select TTY
   select SERIAL_8250
   select SERIAL_8250_CONSOLE
   select IP_PNP
   select IP_PNP_DHCP
   select BINFMT_ELF
   select PCI_MSI
   select HAVE_ARCH_KGDB
   select DEBUG_KERNEL
   select KGDB
   select KGDB_SERIAL_CONSOLE
   select VIRTUALIZATION
   select VIRTIO
   select VIRTIO_RING
   select VIRTIO_PCI
   select VIRTIO_BLK
   select VIRTIO_CONSOLE
   select VIRTIO_NET
   select 9P_FS
   select NET_9P
   select NET_9P_VIRTIO
 
 to the option below which we already have. It is in the same sense a KVM
 guest support deal.
 
 Hmm.
 
 KVM people, any objections?
 
 config KVM_GUEST
  bool "KVM Guest support (including kvmclock)"
 depends on PARAVIRT
 select PARAVIRT_CLOCK
 default y
 ---help---
   This option enables various optimizations for running under the KVM
   hypervisor. It includes a paravirtualized clock, so that instead
   of relying on a PIT (or probably other) emulation by the
   underlying device model, the host provides the guest with
   timing infrastructure such as time of day, and system time

KVM guests don't need a serial device, KGDB, DEBUG_KERNEL or 9p in particular.


Thanks,
Sasha


[PATCH] KVM: ia64: Fix kvm_vm_ioctl_irq_line

2013-04-16 Thread Yang Zhang
From: Yang Zhang yang.z.zh...@intel.com

Fix the compile error with kvm_vm_ioctl_irq_line.

Signed-off-by: Yang Zhang yang.z.zh...@intel.com
---
 arch/ia64/kvm/kvm-ia64.c |6 --
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/ia64/kvm/kvm-ia64.c b/arch/ia64/kvm/kvm-ia64.c
index 7a54455..032c54d 100644
--- a/arch/ia64/kvm/kvm-ia64.c
+++ b/arch/ia64/kvm/kvm-ia64.c
@@ -924,13 +924,15 @@ int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
return 0;
 }
 
-int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_event)
+int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_event,
+   bool line_status)
 {
if (!irqchip_in_kernel(kvm))
return -ENXIO;
 
irq_event->status = kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID,
-   irq_event->irq, irq_event->level);
+   irq_event->irq, irq_event->level,
+   line_status);
return 0;
 }
 
-- 
1.7.1



Re: [PATCH v5 2/2] tcm_vhost: Wait for pending requests in vhost_scsi_flush()

2013-04-16 Thread Asias He
On Tue, Apr 16, 2013 at 08:58:27PM +0300, Michael S. Tsirkin wrote:
 On Tue, Apr 16, 2013 at 05:16:51PM +0800, Asias He wrote:
  This patch makes vhost_scsi_flush() wait for all the pending requests
  issued before the flush operation to be finished.
  
  Changes in v5:
  - Use kref and completion
   - Fail req if vs->vs_inflight is NULL
  - Rename tcm_vhost_alloc_inflight to tcm_vhost_set_inflight
  
  Changes in v4:
  - Introduce vhost_scsi_inflight
  - Drop array to track flush
  - Use RCU to protect vs_inflight explicitly
  
  Changes in v3:
  - Rebase
  - Drop 'tcm_vhost: Wait for pending requests in
vhost_scsi_clear_endpoint()' in this series, we already did that in
     'tcm_vhost: Use vq->private_data to indicate if the endpoint is setup'
  
  Changes in v2:
  - Increase/Decrease inflight requests in
vhost_scsi_{allocate,free}_cmd and tcm_vhost_{allocate,free}_evt
  
  Signed-off-by: Asias He as...@redhat.com
 
 OK looks good, except error handling needs to be fixed.
 
  ---
    drivers/vhost/tcm_vhost.c | 101 +++---
   drivers/vhost/tcm_vhost.h |   5 +++
   2 files changed, 101 insertions(+), 5 deletions(-)
  
  diff --git a/drivers/vhost/tcm_vhost.c b/drivers/vhost/tcm_vhost.c
  index 4ae6725..ef40a8f 100644
  --- a/drivers/vhost/tcm_vhost.c
  +++ b/drivers/vhost/tcm_vhost.c
  @@ -74,6 +74,11 @@ enum {
   #define VHOST_SCSI_MAX_VQ  128
   #define VHOST_SCSI_MAX_EVENT   128
   
  +struct vhost_scsi_inflight {
  +   struct completion comp; /* Wait for the flush operation to finish */
  +   struct kref kref; /* Refcount for the inflight reqs */
  +};
  +
   struct vhost_scsi {
/* Protected by vhost_scsi->dev.mutex */
  struct tcm_vhost_tpg **vs_tpg;
  @@ -91,6 +96,8 @@ struct vhost_scsi {
  struct mutex vs_events_lock; /* protect vs_events_dropped,events_nr */
  bool vs_events_dropped; /* any missed events */
  int vs_events_nr; /* num of pending events */
  +
  +   struct vhost_scsi_inflight __rcu *vs_inflight; /* track inflight reqs */
   };
   
   /* Local pointer to allocated TCM configfs fabric module */
  @@ -108,6 +115,51 @@ static int iov_num_pages(struct iovec *iov)
((unsigned long)iov->iov_base & PAGE_MASK)) >> PAGE_SHIFT;
   }
   
  +static int tcm_vhost_set_inflight(struct vhost_scsi *vs)
  +{
  +   struct vhost_scsi_inflight *inflight;
  +   int ret = -ENOMEM;
  +
  +   inflight = kzalloc(sizeof(*inflight), GFP_KERNEL);
 
 kzalloc is not needed, you initialize all fields.

okay.

  +   if (inflight) {
  +   kref_init(&inflight->kref);
  +   init_completion(&inflight->comp);
  +   ret = 0;
  +   }
  +   rcu_assign_pointer(vs->vs_inflight, inflight);
 
 So if allocation fails, we stop tracking inflights?

 This looks strange, and could break guests. Why not the usual
   if (!inflight)
   return -ENOMEM;

If allocation fails, we abort further reqs; no need to track them.
Why would it break the guest, and how?

  +   synchronize_rcu();
 
 open call is different:
   - sync is not needed
   - should use RCU_INIT_POINTER and not rcu_assign_pointer
 
 So please move these out and make this function return the struct:
   struct vhost_scsi_inflight *inflight
   tcm_vhost_alloc_inflight(void)

synchronize_rcu is actually needed. 

   tcm_vhost_inc_inflight
   {
   
   rcu_read_lock();
   inflight = rcu_dereference(vs->vs_inflight); 
   
  /* 
   * Possible race window here:
   * if inflight points to the old inflight and
   * wait_for_completion runs before we call kref_get here,
   * we may free the old inflight.
   * However, there is still one request in flight which should be 
   * tracked by the old inflight.
   */
   
   kref_get(&inflight->kref);
   rcu_read_unlock();
   
   return inflight;
   }

 
  +
  +   return ret;
  +}
  +
  +static struct vhost_scsi_inflight *
  +tcm_vhost_inc_inflight(struct vhost_scsi *vs)
 
 And then inc will not need to return inflight pointer,
 which is really unusual.

No, you still need to return inflight; you need it for each tcm_vhost_cmd or
tcm_vhost_evt. 
 
  +{
  +   struct vhost_scsi_inflight *inflight;
  +
  +   rcu_read_lock();
  +   inflight = rcu_dereference(vs->vs_inflight);
  +   if (inflight)
  +   kref_get(&inflight->kref);
  +   rcu_read_unlock();
  +
  +   return inflight;
  +}
  +
  +void tcm_vhost_done_inflight(struct kref *kref)
  +{
  +   struct vhost_scsi_inflight *inflight;
  +
  +   inflight = container_of(kref, struct vhost_scsi_inflight, kref);
  +   complete(&inflight->comp);
  +}
  +
  +static void tcm_vhost_dec_inflight(struct vhost_scsi_inflight *inflight)
  +{
  +   if (inflight)
 
 Here as in other places, inflight must never be NULL.
 Pls fix code so that invariant holds.
 
  +   kref_put(&inflight->kref, tcm_vhost_done_inflight);
  +}
  +
   static bool tcm_vhost_check_feature(struct vhost_scsi *vs, int 

Re: [Virt-test-devel] [virt-test][PATCH 4/7] virt: Adds named variants to Cartesian config.

2013-04-16 Thread Alex Jia
Jiří, okay, got it and thanks.

-- 
Regards, 
Alex


- Original Message -
From: Jiri Zupka jzu...@redhat.com
To: Alex Jia a...@redhat.com
Cc: virt-test-de...@redhat.com, kvm@vger.kernel.org, kvm-autot...@redhat.com, 
l...@redhat.com, ldok...@redhat.com, ehabk...@redhat.com, pbonz...@redhat.com
Sent: Tuesday, April 16, 2013 10:16:57 PM
Subject: Re: [Virt-test-devel] [virt-test][PATCH 4/7] virt: Adds named variants 
to Cartesian config.

Hi Alex,
  thanks again for the review. I realize now what you meant; I thought you
were referring to another thread of mails. I tried it again with
https://github.com/autotest/virt-test/pull/255 and the demo example works.

If you are interested in this feature, check the new version, which I'll send
in the coming days. There will be some changes in syntax, and a lexer will be
added for better filtering.

regards,
  Jiří Župka



- Original Message -
 Hi Alex,
   I hope you are using the new version of the Cartesian config from github
   https://github.com/autotest/virt-test/pull/255.
 What you tested was an older RFC version of the Cartesian config, and I'm
 preparing a new version based on communication with Eduardo and Pablo.
 
 If you aren't, please look at the documentation
 https://github.com/autotest/virt-test/wiki/VirtTestDocumentation#wiki-id26
 which describes how it works now.
 
 regards
   Jiří Župka
 
 - Original Message -
  On 03/30/2013 01:14 AM, Jiří Župka wrote:
   variants name=tests:
  - wait:
   run = wait
   variants:
 - long:
time = short_time
 - short: long
time = logn_time
  - test2:
   run = test1
  
   variants name=virt_system:
  - linux:
  - windows_XP:
  
   variants name=host_os:
  - linux:
   image = linux
  - windows_XP:
   image = windows
  
   testswait.short:
shutdown = destroy
  
   only host_oslinux
  Jiří, I pasted the above example into demo.cfg and ran it through the
  Cartesian parser, and I got the error __main__.ParserError: 'variants'
  is not allowed inside a conditional block
  (libvirt/tests/cfg/demo.cfg:4). Did I do anything wrong? Thanks.
  
  
  
  
 


Re: [PATCH] mm: mmu_notifier: re-fix freed page still mapped in secondary MMU

2013-04-16 Thread Xiao Guangrong
On 04/17/2013 02:08 AM, Robin Holt wrote:
 On Tue, Apr 16, 2013 at 09:07:20PM +0800, Xiao Guangrong wrote:
 On 04/16/2013 07:43 PM, Robin Holt wrote:
 Argh.  Taking a step back helped clear my head.

 For the -stable releases, I agree we should just go with your
 revert-plus-hlist_del_init_rcu patch.  I will give it a test
 when I am in the office.

 Okay, I'll wait for your test report. Thank you in advance.


 For the v3.10 release, we should work on making this more
 correct and completely documented.

 Better documentation is always welcome.

 Double call of ->release is not bad, as I mentioned in the changelog:

 it is really rare (e.g., it cannot happen on kvm since mmu-notify is 
 unregistered
 after exit_mmap()) and the later calls of multiple ->release should be
 fast since all the pages have already been released by the first call.

 But, of course, it's great if you have a _light_ way to avoid this.
 
 Getting my test environment set back up took longer than I would have liked.
 
 Your patch passed.  I got no NULL-pointer derefs.

Thanks again for your testing.

 
 How would you feel about adding the following to your patch?

I prefer to make these changes as a separate patch; this change is an
improvement, so please do not mix it with the bugfix.

You can make a patchset (comment improvements and this change) based on
my fix.




[PATCH 0/7] KVM: irqfd generalization prepare patch set

2013-04-16 Thread Alexander Graf
The concepts of an irqfd and interrupt routing are not particularly tied
to the IOAPIC implementation. In fact, most of the code is already perfectly
generic.

This patch set decouples most bits of the existing irqchip and irqfd
implementation to make it reusable for non-IOAPIC platforms, like the PPC MPIC.

I also have a patch that implements working irqfd support on top of these,
but it requires the in-kernel MPIC implementation to go upstream first, so
I'm holding off on it until we have settled everything there. The concept
certainly does work, though.

Alex

Alexander Graf (7):
  KVM: Add KVM_IRQCHIP_NUM_PINS in addition to KVM_IOAPIC_NUM_PINS
  KVM: Introduce __KVM_HAVE_IRQCHIP
  KVM: Remove kvm_get_intr_delivery_bitmask
  KVM: Move irq routing to generic code
  KVM: Extract generic irqchip logic into irqchip.c
  KVM: Move irq routing setup to irqchip.c
  KVM: Move irqfd resample cap handling to generic code

 arch/x86/include/asm/kvm_host.h |2 +
 arch/x86/include/uapi/asm/kvm.h |1 +
 arch/x86/kvm/Makefile   |2 +-
 arch/x86/kvm/x86.c  |1 -
 include/linux/kvm_host.h|   14 +--
 include/trace/events/kvm.h  |   12 ++-
 include/uapi/linux/kvm.h|2 +-
 virt/kvm/assigned-dev.c |   30 -
 virt/kvm/eventfd.c  |6 +-
 virt/kvm/irq_comm.c |  193 +---
 virt/kvm/irqchip.c  |  237 +++
 virt/kvm/kvm_main.c |   33 ++
 12 files changed, 297 insertions(+), 236 deletions(-)
 create mode 100644 virt/kvm/irqchip.c



[PATCH 4/7] KVM: Move irq routing to generic code

2013-04-16 Thread Alexander Graf
The IRQ routing set ioctl lives in the hacky device assignment code inside
of KVM today. This is definitely the wrong place for it. Move it to the much
more natural kvm_main.c.

Signed-off-by: Alexander Graf ag...@suse.de
---
 virt/kvm/assigned-dev.c |   30 --
 virt/kvm/kvm_main.c |   30 ++
 2 files changed, 30 insertions(+), 30 deletions(-)

diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
index f4c7f59..8db4370 100644
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -983,36 +983,6 @@ long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
goto out;
break;
}
-#ifdef KVM_CAP_IRQ_ROUTING
-   case KVM_SET_GSI_ROUTING: {
-   struct kvm_irq_routing routing;
-   struct kvm_irq_routing __user *urouting;
-   struct kvm_irq_routing_entry *entries;
-
-   r = -EFAULT;
-   if (copy_from_user(&routing, argp, sizeof(routing)))
-   goto out;
-   r = -EINVAL;
-   if (routing.nr >= KVM_MAX_IRQ_ROUTES)
-   goto out;
-   if (routing.flags)
-   goto out;
-   r = -ENOMEM;
-   entries = vmalloc(routing.nr * sizeof(*entries));
-   if (!entries)
-   goto out;
-   r = -EFAULT;
-   urouting = argp;
-   if (copy_from_user(entries, urouting->entries,
-  routing.nr * sizeof(*entries)))
-   goto out_free_irq_routing;
-   r = kvm_set_irq_routing(kvm, entries, routing.nr,
-   routing.flags);
-   out_free_irq_routing:
-   vfree(entries);
-   break;
-   }
-#endif /* KVM_CAP_IRQ_ROUTING */
 #ifdef __KVM_HAVE_MSIX
case KVM_ASSIGN_SET_MSIX_NR: {
struct kvm_assigned_msix_nr entry_nr;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ac3182e..6a71ee3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2273,6 +2273,36 @@ static long kvm_vm_ioctl(struct file *filp,
break;
}
 #endif
+#ifdef KVM_CAP_IRQ_ROUTING
+   case KVM_SET_GSI_ROUTING: {
+   struct kvm_irq_routing routing;
+   struct kvm_irq_routing __user *urouting;
+   struct kvm_irq_routing_entry *entries;
+
+   r = -EFAULT;
+   if (copy_from_user(&routing, argp, sizeof(routing)))
+   goto out;
+   r = -EINVAL;
+   if (routing.nr >= KVM_MAX_IRQ_ROUTES)
+   goto out;
+   if (routing.flags)
+   goto out;
+   r = -ENOMEM;
+   entries = vmalloc(routing.nr * sizeof(*entries));
+   if (!entries)
+   goto out;
+   r = -EFAULT;
+   urouting = argp;
+   if (copy_from_user(entries, urouting->entries,
+  routing.nr * sizeof(*entries)))
+   goto out_free_irq_routing;
+   r = kvm_set_irq_routing(kvm, entries, routing.nr,
+   routing.flags);
+   out_free_irq_routing:
+   vfree(entries);
+   break;
+   }
+#endif /* KVM_CAP_IRQ_ROUTING */
default:
r = kvm_arch_vm_ioctl(filp, ioctl, arg);
if (r == -ENOTTY)
-- 
1.6.0.2



[PATCH 3/7] KVM: Remove kvm_get_intr_delivery_bitmask

2013-04-16 Thread Alexander Graf
The prototype has been stale for a while; I can't spot any real function
definition behind it. Let's just remove it.

Signed-off-by: Alexander Graf ag...@suse.de
---
 include/linux/kvm_host.h |5 -
 1 files changed, 0 insertions(+), 5 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4b30906..48af62a 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -719,11 +719,6 @@ void kvm_unregister_irq_mask_notifier(struct kvm *kvm, int irq,
 void kvm_fire_mask_notifiers(struct kvm *kvm, unsigned irqchip, unsigned pin,
 bool mask);
 
-#ifdef __KVM_HAVE_IOAPIC
-void kvm_get_intr_delivery_bitmask(struct kvm_ioapic *ioapic,
-  union kvm_ioapic_redirect_entry *entry,
-  unsigned long *deliver_bitmask);
-#endif
 int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level,
bool line_status);
int kvm_set_irq_inatomic(struct kvm *kvm, int irq_source_id, u32 irq, int level);
-- 
1.6.0.2



[PATCH 6/7] KVM: Move irq routing setup to irqchip.c

2013-04-16 Thread Alexander Graf
Setting up IRQ routes is not IOAPIC specific. Extract everything
that really is generic code into irqchip.c and leave only the ioapic
specific bits in irq_comm.c.

Signed-off-by: Alexander Graf ag...@suse.de
---
 include/linux/kvm_host.h |3 ++
 virt/kvm/irq_comm.c  |   76 ++---
 virt/kvm/irqchip.c   |   85 ++
 3 files changed, 91 insertions(+), 73 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 48af62a..8b25ea2 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -961,6 +961,9 @@ int kvm_set_irq_routing(struct kvm *kvm,
const struct kvm_irq_routing_entry *entries,
unsigned nr,
unsigned flags);
+int kvm_set_routing_entry(struct kvm_irq_routing_table *rt,
+ struct kvm_kernel_irq_routing_entry *e,
+ const struct kvm_irq_routing_entry *ue);
 void kvm_free_irq_routing(struct kvm *kvm);
 
 int kvm_send_userspace_msi(struct kvm *kvm, struct kvm_msi *msi);
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index 3d900ba..c4dbb94 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -272,27 +272,14 @@ void kvm_fire_mask_notifiers(struct kvm *kvm, unsigned irqchip, unsigned pin,
rcu_read_unlock();
 }
 
-static int setup_routing_entry(struct kvm_irq_routing_table *rt,
-  struct kvm_kernel_irq_routing_entry *e,
-  const struct kvm_irq_routing_entry *ue)
+int kvm_set_routing_entry(struct kvm_irq_routing_table *rt,
+ struct kvm_kernel_irq_routing_entry *e,
+ const struct kvm_irq_routing_entry *ue)
 {
int r = -EINVAL;
int delta;
unsigned max_pin;
-   struct kvm_kernel_irq_routing_entry *ei;
 
-   /*
-* Do not allow GSI to be mapped to the same irqchip more than once.
-* Allow only one to one mapping between GSI and MSI.
-*/
-   hlist_for_each_entry(ei, &rt->map[ue->gsi], link)
-   if (ei->type == KVM_IRQ_ROUTING_MSI ||
-   ue->type == KVM_IRQ_ROUTING_MSI ||
-   ue->u.irqchip.irqchip == ei->irqchip.irqchip)
-   return r;
-
-   e->gsi = ue->gsi;
-   e->type = ue->type;
	switch (ue->type) {
case KVM_IRQ_ROUTING_IRQCHIP:
delta = 0;
@@ -329,68 +316,11 @@ static int setup_routing_entry(struct kvm_irq_routing_table *rt,
goto out;
}
 
-   hlist_add_head(&e->link, &rt->map[e->gsi]);
r = 0;
 out:
return r;
 }
 
-int kvm_set_irq_routing(struct kvm *kvm,
-   const struct kvm_irq_routing_entry *ue,
-   unsigned nr,
-   unsigned flags)
-{
-   struct kvm_irq_routing_table *new, *old;
-   u32 i, j, nr_rt_entries = 0;
-   int r;
-
-   for (i = 0; i < nr; ++i) {
-   if (ue[i].gsi >= KVM_MAX_IRQ_ROUTES)
-   return -EINVAL;
-   nr_rt_entries = max(nr_rt_entries, ue[i].gsi);
-   }
-
-   nr_rt_entries += 1;
-
-   new = kzalloc(sizeof(*new) + (nr_rt_entries * sizeof(struct hlist_head))
- + (nr * sizeof(struct kvm_kernel_irq_routing_entry)),
- GFP_KERNEL);
-
-   if (!new)
-   return -ENOMEM;
-
-   new->rt_entries = (void *)&new->map[nr_rt_entries];
-
-   new->nr_rt_entries = nr_rt_entries;
-   for (i = 0; i < 3; i++)
-   for (j = 0; j < KVM_IRQCHIP_NUM_PINS; j++)
-   new->chip[i][j] = -1;
-
-   for (i = 0; i < nr; ++i) {
-   r = -EINVAL;
-   if (ue->flags)
-   goto out;
-   r = setup_routing_entry(new, &new->rt_entries[i], ue);
-   if (r)
-   goto out;
-   ++ue;
-   }
-
-   mutex_lock(&kvm->irq_lock);
-   old = kvm->irq_routing;
-   kvm_irq_routing_update(kvm, new);
-   mutex_unlock(&kvm->irq_lock);
-
-   synchronize_rcu();
-
-   new = old;
-   r = 0;
-
-out:
-   kfree(new);
-   return r;
-}
-
 #define IOAPIC_ROUTING_ENTRY(irq) \
{ .gsi = irq, .type = KVM_IRQ_ROUTING_IRQCHIP,  \
  .u.irqchip.irqchip = KVM_IRQCHIP_IOAPIC, .u.irqchip.pin = (irq) }
diff --git a/virt/kvm/irqchip.c b/virt/kvm/irqchip.c
index d8b06ed..f3a875b 100644
--- a/virt/kvm/irqchip.c
+++ b/virt/kvm/irqchip.c
@@ -150,3 +150,88 @@ void kvm_free_irq_routing(struct kvm *kvm)
   at this stage */
	kfree(kvm->irq_routing);
 }
+
+static int setup_routing_entry(struct kvm_irq_routing_table *rt,
+  struct kvm_kernel_irq_routing_entry *e,
+  const struct kvm_irq_routing_entry *ue)
+{
+   int r = -EINVAL;
+   struct kvm_kernel_irq_routing_entry *ei;
+
+   /*
+ 

[PATCH 5/7] KVM: Extract generic irqchip logic into irqchip.c

2013-04-16 Thread Alexander Graf
The current irq_comm.c file contains pieces of code that are generic
across different irqchip implementations, as well as code that is
fully IOAPIC specific.

Split the generic bits out into irqchip.c.

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/x86/kvm/Makefile  |2 +-
 include/trace/events/kvm.h |   12 +++-
 virt/kvm/irq_comm.c|  117 --
 virt/kvm/irqchip.c |  152 
 4 files changed, 163 insertions(+), 120 deletions(-)
 create mode 100644 virt/kvm/irqchip.c

diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 04d3040..a797b8e 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -7,7 +7,7 @@ CFLAGS_vmx.o := -I.
 
 kvm-y  += $(addprefix ../../../virt/kvm/, kvm_main.o ioapic.o \
coalesced_mmio.o irq_comm.o eventfd.o \
-   assigned-dev.o)
+   assigned-dev.o irqchip.o)
 kvm-$(CONFIG_IOMMU_API)+= $(addprefix ../../../virt/kvm/, iommu.o)
 kvm-$(CONFIG_KVM_ASYNC_PF) += $(addprefix ../../../virt/kvm/, async_pf.o)
 
diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
index 19911dd..2fe2d53 100644
--- a/include/trace/events/kvm.h
+++ b/include/trace/events/kvm.h
@@ -37,7 +37,7 @@ TRACE_EVENT(kvm_userspace_exit,
  __entry->errno < 0 ? -__entry->errno : __entry->reason)
 );
 
-#if defined(__KVM_HAVE_IRQ_LINE)
+#if defined(__KVM_HAVE_IRQ_LINE) || defined(__KVM_HAVE_IRQCHIP)
 TRACE_EVENT(kvm_set_irq,
TP_PROTO(unsigned int gsi, int level, int irq_source_id),
TP_ARGS(gsi, level, irq_source_id),
@@ -122,6 +122,10 @@ TRACE_EVENT(kvm_msi_set_irq,
	{KVM_IRQCHIP_PIC_SLAVE, "PIC slave"},   \
	{KVM_IRQCHIP_IOAPIC,    "IOAPIC"}
 
+#endif /* defined(__KVM_HAVE_IOAPIC) */
+
+#if defined(__KVM_HAVE_IRQCHIP)
+
 TRACE_EVENT(kvm_ack_irq,
TP_PROTO(unsigned int irqchip, unsigned int pin),
TP_ARGS(irqchip, pin),
@@ -136,14 +140,18 @@ TRACE_EVENT(kvm_ack_irq,
	__entry->pin = pin;
),
 
+#ifdef kvm_irqchip
	TP_printk("irqchip %s pin %u",
		  __print_symbolic(__entry->irqchip, kvm_irqchips),
		  __entry->pin)
+#else
+	TP_printk("irqchip %d pin %u", __entry->irqchip, __entry->pin)
+#endif
 );
 
+#endif /* defined(__KVM_HAVE_IRQCHIP) */
 
 
-#endif /* defined(__KVM_HAVE_IOAPIC) */
 
 #define KVM_TRACE_MMIO_READ_UNSATISFIED 0
 #define KVM_TRACE_MMIO_READ 1
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index f02659b..3d900ba 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -151,59 +151,6 @@ static int kvm_set_msi_inatomic(struct kvm_kernel_irq_routing_entry *e,
return -EWOULDBLOCK;
 }
 
-int kvm_send_userspace_msi(struct kvm *kvm, struct kvm_msi *msi)
-{
-   struct kvm_kernel_irq_routing_entry route;
-
-   if (!irqchip_in_kernel(kvm) || msi->flags != 0)
-   return -EINVAL;
-
-   route.msi.address_lo = msi->address_lo;
-   route.msi.address_hi = msi->address_hi;
-   route.msi.data = msi->data;
-
-   return kvm_set_msi(&route, kvm, KVM_USERSPACE_IRQ_SOURCE_ID, 1, false);
-}
-
-/*
- * Return value:
- *  < 0   Interrupt was ignored (masked or not delivered for other reasons)
- *  = 0   Interrupt was coalesced (previous irq is still pending)
- *  > 0   Number of CPUs interrupt was delivered to
- */
-int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level,
-   bool line_status)
-{
-   struct kvm_kernel_irq_routing_entry *e, irq_set[KVM_NR_IRQCHIPS];
-   int ret = -1, i = 0;
-   struct kvm_irq_routing_table *irq_rt;
-
-   trace_kvm_set_irq(irq, level, irq_source_id);
-
-   /* Not possible to detect if the guest uses the PIC or the
-* IOAPIC.  So set the bit in both. The guest will ignore
-* writes to the unused one.
-*/
-   rcu_read_lock();
-   irq_rt = rcu_dereference(kvm->irq_routing);
-   if (irq < irq_rt->nr_rt_entries)
-   hlist_for_each_entry(e, &irq_rt->map[irq], link)
-   irq_set[i++] = *e;
-   rcu_read_unlock();
-
-   while(i--) {
-   int r;
-   r = irq_set[i].set(irq_set[i], kvm, irq_source_id, level,
-   line_status);
-   if (r < 0)
-   continue;
-
-   ret = r + ((ret < 0) ? 0 : ret);
-   }
-
-   return ret;
-}
-
 /*
  * Deliver an IRQ in an atomic context if we can, or return a failure,
  * user can retry in a process context.
@@ -241,62 +188,6 @@ int kvm_set_irq_inatomic(struct kvm *kvm, int irq_source_id, u32 irq, int level)
return ret;
 }
 
-bool kvm_irq_has_notifier(struct kvm *kvm, unsigned irqchip, unsigned pin)
-{
-   struct kvm_irq_ack_notifier *kian;
-   int gsi;
-
-   rcu_read_lock();
-   gsi = 

[PATCH 7/7] KVM: Move irqfd resample cap handling to generic code

2013-04-16 Thread Alexander Graf
Now that we have most irqfd code completely platform agnostic, let's move
irqfd's resample capability return to generic code as well.

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/x86/kvm/x86.c  |1 -
 virt/kvm/kvm_main.c |3 +++
 2 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ae9744d..9d9904f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2513,7 +2513,6 @@ int kvm_dev_ioctl_check_extension(long ext)
case KVM_CAP_PCI_2_3:
case KVM_CAP_KVMCLOCK_CTRL:
case KVM_CAP_READONLY_MEM:
-   case KVM_CAP_IRQFD_RESAMPLE:
r = 1;
break;
case KVM_CAP_COALESCED_MMIO:
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6a71ee3..6b44ad7 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2432,6 +2432,9 @@ static long kvm_dev_ioctl_check_extension_generic(long arg)
 #ifdef CONFIG_HAVE_KVM_MSI
case KVM_CAP_SIGNAL_MSI:
 #endif
+#ifdef CONFIG_HAVE_KVM_IRQCHIP
+   case KVM_CAP_IRQFD_RESAMPLE:
+#endif
return 1;
 #ifdef KVM_CAP_IRQ_ROUTING
case KVM_CAP_IRQ_ROUTING:
-- 
1.6.0.2



[PATCH 2/7] KVM: Introduce __KVM_HAVE_IRQCHIP

2013-04-16 Thread Alexander Graf
Quite a bit of code in KVM has been conditionalized on availability of
IOAPIC emulation. However, most of it is generically applicable to
platforms that don't have an IOAPIC, but a different type of irq chip.

Introduce a new define to distinguish between generic code and IOAPIC
specific code.

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/x86/include/uapi/asm/kvm.h |1 +
 include/linux/kvm_host.h|4 ++--
 include/uapi/linux/kvm.h|2 +-
 virt/kvm/eventfd.c  |6 +++---
 4 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index a65ec29..923478e 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -28,6 +28,7 @@
 /* Select x86 specific features in linux/kvm.h */
 #define __KVM_HAVE_PIT
 #define __KVM_HAVE_IOAPIC
+#define __KVM_HAVE_IRQCHIP
 #define __KVM_HAVE_IRQ_LINE
 #define __KVM_HAVE_DEVICE_ASSIGNMENT
 #define __KVM_HAVE_MSI
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 303d05b..4b30906 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -304,7 +304,7 @@ struct kvm_kernel_irq_routing_entry {
struct hlist_node link;
 };
 
-#ifdef __KVM_HAVE_IOAPIC
+#ifdef __KVM_HAVE_IRQCHIP
 
 struct kvm_irq_routing_table {
int chip[KVM_NR_IRQCHIPS][KVM_IRQCHIP_NUM_PINS];
@@ -432,7 +432,7 @@ void kvm_vcpu_uninit(struct kvm_vcpu *vcpu);
 int __must_check vcpu_load(struct kvm_vcpu *vcpu);
 void vcpu_put(struct kvm_vcpu *vcpu);
 
-#ifdef __KVM_HAVE_IOAPIC
+#ifdef __KVM_HAVE_IRQCHIP
 int kvm_irqfd_init(void);
 void kvm_irqfd_exit(void);
 #else
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 74d0ff3..c38d269 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -579,7 +579,7 @@ struct kvm_ppc_smmu_info {
 #ifdef __KVM_HAVE_PIT
 #define KVM_CAP_REINJECT_CONTROL 24
 #endif
-#ifdef __KVM_HAVE_IOAPIC
+#ifdef __KVM_HAVE_IRQCHIP
 #define KVM_CAP_IRQ_ROUTING 25
 #endif
 #define KVM_CAP_IRQ_INJECT_STATUS 26
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index c5d43ff..0797571 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -35,7 +35,7 @@
 
 #include "iodev.h"
 
-#ifdef __KVM_HAVE_IOAPIC
+#ifdef __KVM_HAVE_IRQCHIP
 /*
  * 
  * irqfd: Allows an fd to be used to inject an interrupt to the guest
@@ -433,7 +433,7 @@ fail:
 void
 kvm_eventfd_init(struct kvm *kvm)
 {
-#ifdef __KVM_HAVE_IOAPIC
+#ifdef __KVM_HAVE_IRQCHIP
	spin_lock_init(&kvm->irqfds.lock);
	INIT_LIST_HEAD(&kvm->irqfds.items);
	INIT_LIST_HEAD(&kvm->irqfds.resampler_list);
@@ -442,7 +442,7 @@ kvm_eventfd_init(struct kvm *kvm)
	INIT_LIST_HEAD(&kvm->ioeventfds);
 }
 
-#ifdef __KVM_HAVE_IOAPIC
+#ifdef __KVM_HAVE_IRQCHIP
 /*
  * shutdown any irqfd's that match fd+gsi
  */
-- 
1.6.0.2



Re: [PULL 0/7] ppc patch queue 2013-03-22

2013-04-16 Thread Alexander Graf

On 12.04.2013, at 22:56, Alexander Graf wrote:

 
 On 12.04.2013, at 22:54, Marcelo Tosatti wrote:
 
 On Thu, Apr 11, 2013 at 03:50:13PM +0200, Alexander Graf wrote:
 
 On 11.04.2013, at 15:45, Marcelo Tosatti wrote:
 
 On Tue, Mar 26, 2013 at 12:59:04PM +1100, Paul Mackerras wrote:
 On Tue, Mar 26, 2013 at 03:33:12AM +0200, Gleb Natapov wrote:
 On Tue, Mar 26, 2013 at 12:35:09AM +0100, Alexander Graf wrote:
 I agree. So if it doesn't hurt to have the same commits in kvm/next and 
 kvm/master, I'd be more than happy to send another pull request with 
 the important fixes against kvm/master as well.
 
  If it results in the same commit showing up twice in the Linus tree in 
  3.10, we cannot do that.
 
 Why not?  In the circumstances it seems perfectly reasonable to me.
 Git should merge the branches without any problem, and even if it
 doesn't, Linus is good at fixing merge conflicts.
 
 Paul.
 
  Yes, we should avoid duplicate commits, but it's not fatal for them to exist.
 
 So I may send a pull request against 3.9 with the 3 commits that already 
 are in kvm/next?
 
  If you decide that the fixes are important enough to justify the
  existence of duplicate commits, I don't see a problem.
 
 Great :). I already sent the pull request out with all patches that fix 
 regressions.

Ping? Did these go to Linus?


Alex
