date:20130205

Re: [PATCH 2/2] x86, apicv: Add Posted Interrupt supporting

2013-02-05 Thread Gleb Natapov

On Tue, Feb 05, 2013 at 05:57:14AM +, Zhang, Yang Z wrote:
 Marcelo Tosatti wrote on 2013-02-05:
  On Mon, Feb 04, 2013 at 05:59:52PM -0200, Marcelo Tosatti wrote:
  On Mon, Feb 04, 2013 at 07:13:01PM +0200, Gleb Natapov wrote:
  On Mon, Feb 04, 2013 at 12:43:45PM -0200, Marcelo Tosatti wrote:
  Any example how software relies on such
  two-interrupts-queued-in-IRR/ISR behaviour?
  Don't know about guests, but KVM relies on it to detect interrupt
  coalescing. So if interrupt is set in IRR but not in PIR interrupt will
  not be reported as coalesced, but it will be coalesced during PIR-IRR
  merge.
  
  Yes, so:
  
  1. IRR=1, ISR=0, PIR=0. Event: set_irq, coalesced=no.
  2. IRR=0, ISR=1, PIR=0. Event: IRR-ISR transfer.
  3. vcpu outside of guest mode.
  4. IRR=1, ISR=1, PIR=0. Event: set_irq, coalesced=no.
  5. vcpu enters guest mode.
  6. IRR=1, ISR=1, PIR=1. Event: set_irq, coalesced=no.
  7. HW transfers PIR into IRR.
  
  set_irq return value at 7 is incorrect, interrupt event was _not_
  queued.
  Not sure I understand the flow of events in your description correctly. 
  As I
  understand it at 4 set_irq() will return incorrect result. Basically
  when PIR is set to 1 while IRR has 1 for the vector the value of
  set_irq() will be incorrect.
  
  At 4 it has not been coalesced: it has been queued to IRR.
  At 6 it has been coalesced: PIR bit merged into IRR bit.
  
  Frankly I do not see how it can be fixed
  without any race with present HW PIR design.
  
  At kvm_accept_apic_interrupt, check IRR before setting PIR bit, if IRR
  already set, don't set PIR.
  
  Or:
  
  apic_accept_interrupt() {
  
  1. Read ORIG_PIR=PIR, ORIG_IRR=IRR.
  Never set IRR when HWAPIC enabled, even if outside of guest mode.
  2. Set PIR and let HW or SW VM-entry transfer it to IRR.
  3. set_irq return value: (ORIG_PIR or ORIG_IRR set).
  }
  
  Two or more concurrent set_irq can race with each other, though. Can
  either document the race or add a lock.
 According the SDM, software should not touch the IRR when target vcpu is 
 running. Instead, use locked way to access PIR. So your solution may wrong.
Then your apicv patches are broken, because they do exactly that.

 The only problem is the step 6, but at that point, there already an interrupt 
 pending in IRR. This means the interrupt will be handled not lost. And even 
 in real hardware, this case do exist. So I think it should not be a problem. 
 
This is not the problem we are trying to fix. Sometimes we need to make
sure that each interrupt device generates result in an interrupt handler
invocation in a guest. If interrupt is coalesced (meaning it will not
correspond to separate invocation of a guest interrupt handler) it needs
to be re-injected. With PIR detection of such condition is broken.

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO-PCI devices

2013-02-05 Thread Gleb Natapov

On Sun, Feb 03, 2013 at 04:36:11PM +, Blue Swirl wrote:
 On Sun, Feb 3, 2013 at 2:10 PM, Pandarathil, Vijaymohan R
 vijaymohan.pandarat...@hp.com wrote:
  - Create eventfd per vfio device assigned to a guest and register an
event handler
 
  - This fd is passed to the vfio_pci driver through the SET_IRQ ioctl
 
  - When the device encounters an error, the eventfd is signalled
and the qemu eventfd handler gets invoked.
 
  - In the handler decide what action to take. Current action taken
is to terminate the guest.
 
 Usually this is not OK, but I guess this is not guest triggerable.
 
Still not OK. Why not stop a guest with appropriate stop reason?

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

I may have found a bug(unlikely) with Qemu-kvm 1.1

2013-02-05 Thread Veruca Salt



We are upgrading, carefully, from 1.0 to 1.1.
We welcome the improved AC97 support and the loss of the ehci warning message 
on startup.

I am finding an issue getting through to the monitor however.
Neither ctrl-alt-shift-2 nor ctrl-alt-2 exposes the monitor.

I have even tried using the -mon option code examples to force a monitor to 
stdio, but the monitor stubbornly refuses to appear.

Any advice would be greatly appreciated.

Simon

  --
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: win8 installation iso can not boot on rhel6.2 kernel?

2013-02-05 Thread Gleb Natapov

On Tue, Feb 05, 2013 at 02:55:07PM +0800, ya su wrote:
 I use the following cmd on rhel6.2 kernel 2.6.32-220.17.1:
 x86_64-softmmu/qemu-system-x86_64 -hda win8.img -cdrom
 window_8_pro.iso -m 2048 -L pc-bios -cpu host, it will display the
 following error:
 Your PC needs to restart.
 Please hold down the power button.
 Error Code: 0x005D
 Parameters:
 0x03100A00
 0x68747541
 0x69746E65
 0x444D4163
 
 I also tried the newest rhel6 kernel version: 2.6.32-279.19.1, it
 bring out the same result.
 
 if I try standard kernel 2.6.32 version, it can boot normally.
 
 Any suggestions? thanks.
 
You need latest RHEL6.4 kernel.

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 6/6] Revert KVM: MMU: split kvm_mmu_free_page

2013-02-05 Thread Gleb Natapov

On Mon, Feb 04, 2013 at 11:24:01PM -0200, Marcelo Tosatti wrote:
 On Wed, Jan 30, 2013 at 04:45:05PM +0200, Gleb Natapov wrote:
  This reverts commit bd4c86eaa6ff10abc4e00d0f45d2a28b10b09df4.
  
  There is not user for kvm_mmu_isolate_page() any more.
  
  Signed-off-by: Gleb Natapov g...@redhat.com
  ---
   arch/x86/kvm/mmu.c |   21 +++--
 
 Applied all except this - Takuya's using it.
He actually revers this commit, but in non-obvious way. Reverting it
explicitly makes for more clear commit history.

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: I may have found a bug(unlikely) with Qemu-kvm 1.1

2013-02-05 Thread Markus Armbruster

Veruca Salt verucasal...@hotmail.co.uk writes:

 We are upgrading, carefully, from 1.0 to 1.1.

Keep going; we're about to release 1.4 ;)

 We welcome the improved AC97 support and the loss of the ehci warning
 message on startup.

 I am finding an issue getting through to the monitor however.
 Neither ctrl-alt-shift-2 nor ctrl-alt-2 exposes the monitor.

I never use that one, but it should work.

 I have even tried using the -mon option code examples to force a
 monitor to stdio, but the monitor stubbornly refuses to appear.

 Any advice would be greatly appreciated.

These all work for me:

* Shorthand syntax

-monitor stdio

* Equivalent longhand

-chardev stdio,id=mon0 -mon chardev=mon0,mode=readline,default=on

* Same as config file snippet, for use with -readconfig

[chardev mon0]
  backend = stdio

[mon]
  chardev = mon0
  mode = readline
  default = on
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v3 0/5] KVM: MMU: unify and cleanup the code of walking pte list

2013-02-05 Thread Xiao Guangrong

Current code has two ways to walk pte_list, the one is pte_list_walk and
the another way is rmap_get_first and rmap_get_next, they have the same logic.
This patchset tries to unify the code and also make the code more tidy.

Patch 1: KVM: MMU: introduce mmu_spte_establish, which tries to eliminates
the different between walking parent pte list and rmap, prepare for the
later patch.

Patch 2: KVM: MMU: clarify the logic in kvm_set_pte_rmapp, which prepares for
the next patch, no logic changed.

Patch 3: KVM: MMU: unify the code of walking pte list, unify the walking code.

Patch 4: KVM: MMU: fix spte assertion, fix a minor bug and remove the duplicate
code.

Patch 5: KVM: MMU: fast drop all spte on the pte_list, optimize for dropping
all sptes on rmap and remove all the goto restart pattern introduced by
the Patch 3.

Marcelo, Gleb, please apply them after applying the patchset of
[PATCH v3 0/3] KVM: MMU: simple cleanups.

Changelog:
v3:
- address Gleb's comments, remove the remained goto restart in
  kvm_set_pte_rmapp
- improve the changelog

Thanks!

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v3 1/5] KVM: MMU: introduce mmu_spte_establish

2013-02-05 Thread Xiao Guangrong

There is little different between walking parent pte and walking ramp:
all spte in rmap must be present but this is not true on parent pte list,
in kvm_mmu_alloc_page, we always link the parent list before set the spte
to present

This patch introduce mmu_spte_establish which set the spte before linking
it to parent list to eliminates the different then it is possible to unify
the code of walking pte list

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/mmu.c |   81 ++-
 arch/x86/kvm/paging_tmpl.h |   16 -
 2 files changed, 48 insertions(+), 49 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 8041454..68d4d5f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1482,9 +1482,6 @@ static unsigned kvm_page_table_hashfn(gfn_t gfn)
 static void mmu_page_add_parent_pte(struct kvm_vcpu *vcpu,
struct kvm_mmu_page *sp, u64 *parent_pte)
 {
-   if (!parent_pte)
-   return;
-
pte_list_add(vcpu, parent_pte, sp-parent_ptes);
 }

@@ -1502,7 +1499,7 @@ static void drop_parent_pte(struct kvm_mmu_page *sp,
 }

 static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu,
-  u64 *parent_pte, int direct)
+  int direct)
 {
struct kvm_mmu_page *sp;
sp = mmu_memory_cache_alloc(vcpu-arch.mmu_page_header_cache);
@@ -1512,7 +1509,6 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct 
kvm_vcpu *vcpu,
set_page_private(virt_to_page(sp-spt), (unsigned long)sp);
list_add(sp-link, vcpu-kvm-arch.active_mmu_pages);
sp-parent_ptes = 0;
-   mmu_page_add_parent_pte(vcpu, sp, parent_pte);
kvm_mod_used_mmu_pages(vcpu-kvm, +1);
return sp;
 }
@@ -1845,8 +1841,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct 
kvm_vcpu *vcpu,
 gva_t gaddr,
 unsigned level,
 int direct,
-unsigned access,
-u64 *parent_pte)
+unsigned access)
 {
union kvm_mmu_page_role role;
unsigned quadrant;
@@ -1876,19 +1871,15 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct 
kvm_vcpu *vcpu,
if (sp-unsync  kvm_sync_page_transient(vcpu, sp))
break;

-   mmu_page_add_parent_pte(vcpu, sp, parent_pte);
-   if (sp-unsync_children) {
+   if (sp-unsync_children)
kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
-   kvm_mmu_mark_parents_unsync(sp);
-   } else if (sp-unsync)
-   kvm_mmu_mark_parents_unsync(sp);

__clear_sp_write_flooding_count(sp);
trace_kvm_mmu_get_page(sp, false);
return sp;
}
++vcpu-kvm-stat.mmu_cache_miss;
-   sp = kvm_mmu_alloc_page(vcpu, parent_pte, direct);
+   sp = kvm_mmu_alloc_page(vcpu, direct);
if (!sp)
return sp;
sp-gfn = gfn;
@@ -1908,6 +1899,35 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct 
kvm_vcpu *vcpu,
return sp;
 }

+static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp)
+{
+   u64 spte;
+
+   spte = __pa(sp-spt) | PT_PRESENT_MASK | PT_WRITABLE_MASK |
+  shadow_user_mask | shadow_x_mask | shadow_accessed_mask;
+
+   mmu_spte_set(sptep, spte);
+}
+
+static struct kvm_mmu_page *
+mmu_spte_establish(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn, gva_t gaddr,
+  unsigned level, int direct, unsigned access)
+{
+   struct kvm_mmu_page *sp;
+
+   WARN_ON(is_shadow_present_pte(*spte));
+
+   sp = kvm_mmu_get_page(vcpu, gfn, gaddr, level, direct, access);
+
+   link_shadow_page(spte, sp);
+   mmu_page_add_parent_pte(vcpu, sp, spte);
+
+   if (sp-unsync_children || sp-unsync)
+   kvm_mmu_mark_parents_unsync(sp);
+
+   return sp;
+}
+
 static void shadow_walk_init(struct kvm_shadow_walk_iterator *iterator,
 struct kvm_vcpu *vcpu, u64 addr)
 {
@@ -1957,16 +1977,6 @@ static void shadow_walk_next(struct 
kvm_shadow_walk_iterator *iterator)
return __shadow_walk_next(iterator, *iterator-sptep);
 }

-static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp)
-{
-   u64 spte;
-
-   spte = __pa(sp-spt) | PT_PRESENT_MASK | PT_WRITABLE_MASK |
-  shadow_user_mask | shadow_x_mask | shadow_accessed_mask;
-
-   mmu_spte_set(sptep, spte);
-}
-
 static void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep,
   unsigned direct_access)
 {
@@ -2023,11 +2033,6 @@ static void kvm_mmu_page_unlink_children(struct

[PATCH v3 2/5] KVM: MMU: clarify the logic in kvm_set_pte_rmapp

2013-02-05 Thread Xiao Guangrong

In kvm_set_pte_rmapp, if the new mapping is writable, we need to remove
all spte pointing to that page otherwisewe we only need to adjust the sptes
to let them point to the new page. This patch clarifys the logic and makes
the later patch more clean

[ Impact: no logic changed ]

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/mmu.c |   17 -
 1 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 68d4d5f..a0dc0d7 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1225,16 +1225,16 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned 
long *rmapp,
WARN_ON(pte_huge(*ptep));
new_pfn = pte_pfn(*ptep);

-   for (sptep = rmap_get_first(*rmapp, iter); sptep;) {
-   BUG_ON(!is_shadow_present_pte(*sptep));
-   rmap_printk(kvm_set_pte_rmapp: spte %p %llx\n, sptep, *sptep);
+   if (pte_write(*ptep))
+   need_flush = kvm_unmap_rmapp(kvm, rmapp, slot, data);
+   else
+   for (sptep = rmap_get_first(*rmapp, iter); sptep;) {
+   BUG_ON(!is_shadow_present_pte(*sptep));
+   rmap_printk(kvm_set_pte_rmapp: spte %p %llx\n,
+   sptep, *sptep);

-   need_flush = 1;
+   need_flush = 1;

-   if (pte_write(*ptep)) {
-   drop_spte(kvm, sptep);
-   sptep = rmap_get_first(*rmapp, iter);
-   } else {
new_spte = *sptep  ~PT64_BASE_ADDR_MASK;
new_spte |= (u64)new_pfn  PAGE_SHIFT;

@@ -1246,7 +1246,6 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned 
long *rmapp,
mmu_spte_set(sptep, new_spte);
sptep = rmap_get_next(iter);
}
-   }

if (need_flush)
kvm_flush_remote_tlbs(kvm);
-- 
1.7.7.6

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v3 3/5] KVM: MMU: unify the code of walking pte list

2013-02-05 Thread Xiao Guangrong

Current code has two ways to walk pte_list, the one is pte_list_walk and
the another way is rmap_get_first and rmap_get_next, they have the same logic.

This patch introduces for_each_spte_in_pte_list to integrate their code

[ Impact: no logic changed, most of the change is function/struct rename ]

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/mmu.c   |  178 ++
 arch/x86/kvm/mmu_audit.c |5 +-
 2 files changed, 86 insertions(+), 97 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index a0dc0d7..2291ea3 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -945,26 +945,75 @@ static void pte_list_remove(u64 *spte, unsigned long 
*pte_list)
}
 }

-typedef void (*pte_list_walk_fn) (u64 *spte);
-static void pte_list_walk(unsigned long *pte_list, pte_list_walk_fn fn)
+/*
+ * Used by the following functions to iterate through the sptes linked by a
+ * pte_list.  All fields are private and not assumed to be used outside.
+ */
+struct pte_list_iterator {
+   /* private fields */
+   struct pte_list_desc *desc; /* holds the sptep if not NULL */
+   int pos;/* index of the sptep */
+};
+
+/*
+ * Iteration must be started by this function.  This should also be used after
+ * removing/dropping sptes from the pte_list link because in such cases the
+ * information in the itererator may not be valid.
+ *
+ * Returns sptep if found, NULL otherwise.
+ */
+static u64 *pte_list_get_first(unsigned long pte_list,
+  struct pte_list_iterator *iter)
 {
-   struct pte_list_desc *desc;
-   int i;
+   if (!pte_list)
+   return NULL;

-   if (!*pte_list)
-   return;
+   if (!(pte_list  1)) {
+   iter-desc = NULL;
+   return (u64 *)pte_list;
+   }
+
+   iter-desc = (struct pte_list_desc *)(pte_list  ~1ul);
+   iter-pos = 0;
+   return iter-desc-sptes[iter-pos];
+}
+
+/*
+ * Must be used with a valid iterator: e.g. after pte_list_get_next().
+ *
+ * Returns sptep if found, NULL otherwise.
+ */
+static u64 *pte_list_get_next(struct pte_list_iterator *iter)
+{
+   if (iter-desc) {
+   if (iter-pos  PTE_LIST_EXT - 1) {
+   u64 *sptep;
+
+   ++iter-pos;
+   sptep = iter-desc-sptes[iter-pos];
+   if (sptep)
+   return sptep;
+   }

-   if (!(*pte_list  1))
-   return fn((u64 *)*pte_list);
+   iter-desc = iter-desc-more;

-   desc = (struct pte_list_desc *)(*pte_list  ~1ul);
-   while (desc) {
-   for (i = 0; i  PTE_LIST_EXT  desc-sptes[i]; ++i)
-   fn(desc-sptes[i]);
-   desc = desc-more;
+   if (iter-desc) {
+   iter-pos = 0;
+   /* desc-sptes[0] cannot be NULL */
+   return iter-desc-sptes[iter-pos];
+   }
}
+
+   return NULL;
 }

+#define for_each_spte_in_pte_list(pte_list, iter, spte)\
+  for (spte = pte_list_get_first(pte_list, (iter));   \
+ spte != NULL; spte = pte_list_get_next((iter)))
+
+#define for_each_spte_in_rmap(rmap, iter, spte)\
+  for_each_spte_in_pte_list(rmap, iter, spte)
+
 static unsigned long *__gfn_to_rmap(gfn_t gfn, int level,
struct kvm_memory_slot *slot)
 {
@@ -1016,67 +1065,6 @@ static void rmap_remove(struct kvm *kvm, u64 *spte)
pte_list_remove(spte, rmapp);
 }

-/*
- * Used by the following functions to iterate through the sptes linked by a
- * rmap.  All fields are private and not assumed to be used outside.
- */
-struct rmap_iterator {
-   /* private fields */
-   struct pte_list_desc *desc; /* holds the sptep if not NULL */
-   int pos;/* index of the sptep */
-};
-
-/*
- * Iteration must be started by this function.  This should also be used after
- * removing/dropping sptes from the rmap link because in such cases the
- * information in the itererator may not be valid.
- *
- * Returns sptep if found, NULL otherwise.
- */
-static u64 *rmap_get_first(unsigned long rmap, struct rmap_iterator *iter)
-{
-   if (!rmap)
-   return NULL;
-
-   if (!(rmap  1)) {
-   iter-desc = NULL;
-   return (u64 *)rmap;
-   }
-
-   iter-desc = (struct pte_list_desc *)(rmap  ~1ul);
-   iter-pos = 0;
-   return iter-desc-sptes[iter-pos];
-}
-
-/*
- * Must be used with a valid iterator: e.g. after rmap_get_first().
- *
- * Returns sptep if found, NULL otherwise.
- */
-static u64 *rmap_get_next(struct rmap_iterator *iter)
-{
-   if (iter-desc) {
-   if (iter-pos  PTE_LIST_EXT - 1) {
-   u64 *sptep;
-
-

[PATCH v3 4/5] KVM: MMU: fix spte assertion

2013-02-05 Thread Xiao Guangrong

PT_PRESENT_MASK bit is not enough to see the spte has already been mapped
into pte-list for mmio spte also set this bit. Use is_shadow_present_pte
instead to fix it

Also, this patch move many assertions to the common place to clean up the
code

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/mmu.c |   20 ++--
 1 files changed, 6 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 2291ea3..58f813a 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1009,7 +1009,9 @@ static u64 *pte_list_get_next(struct pte_list_iterator 
*iter)

 #define for_each_spte_in_pte_list(pte_list, iter, spte)\
   for (spte = pte_list_get_first(pte_list, (iter));   \
- spte != NULL; spte = pte_list_get_next((iter)))
+ spte != NULL\
+ ({WARN_ON(!is_shadow_present_pte(*(spte))); 1; });\
+  spte = pte_list_get_next(iter))

 #define for_each_spte_in_rmap(rmap, iter, spte)\
   for_each_spte_in_pte_list(rmap, iter, spte)
@@ -1128,11 +1130,8 @@ static bool __rmap_write_protect(struct kvm *kvm, 
unsigned long *rmapp,
struct pte_list_iterator iter;
bool flush = false;

-   for_each_spte_in_rmap(*rmapp, iter, sptep) {
-   BUG_ON(!(*sptep  PT_PRESENT_MASK));
-
+   for_each_spte_in_rmap(*rmapp, iter, sptep)
flush |= spte_write_protect(kvm, sptep, pt_protect);
-   }

return flush;
 }
@@ -1215,7 +1214,6 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned 
long *rmapp,
need_flush = kvm_unmap_rmapp(kvm, rmapp, slot, data);
else
for_each_spte_in_rmap(*rmapp, iter, sptep) {
-   BUG_ON(!is_shadow_present_pte(*sptep));
rmap_printk(kvm_set_pte_rmapp: spte %p %llx\n,
sptep, *sptep);

@@ -1336,15 +1334,12 @@ static int kvm_age_rmapp(struct kvm *kvm, unsigned long 
*rmapp,
goto out;
}

-   for_each_spte_in_rmap(*rmapp, iter, sptep) {
-   BUG_ON(!is_shadow_present_pte(*sptep));
-
+   for_each_spte_in_rmap(*rmapp, iter, sptep)
if (*sptep  shadow_accessed_mask) {
young = 1;
clear_bit((ffs(shadow_accessed_mask) - 1),
 (unsigned long *)sptep);
}
-   }
 out:
/* @data has hva passed to kvm_age_hva(). */
trace_kvm_age_page(data, slot, young);
@@ -1366,14 +1361,11 @@ static int kvm_test_age_rmapp(struct kvm *kvm, unsigned 
long *rmapp,
if (!shadow_accessed_mask)
goto out;

-   for_each_spte_in_rmap(*rmapp, iter, sptep) {
-   BUG_ON(!is_shadow_present_pte(*sptep));
-
+   for_each_spte_in_rmap(*rmapp, iter, sptep)
if (*sptep  shadow_accessed_mask) {
young = 1;
break;
}
-   }
 out:
return young;
 }
-- 
1.7.7.6

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v3 5/5] KVM: MMU: fast drop all spte on the pte_list

2013-02-05 Thread Xiao Guangrong

If a shadow page is being zapped or a host page is going to be freed, kvm
will drop all the reverse-mappings on the shadow page or the gfn. Currently,
it drops the reverse-mapping one by one - it deletes the first reverse mapping,
then moves other reverse-mapping between the description-table. When the
last description-table become empty, it will be freed.

It works well if we only have a few reverse-mappings, but some pte_lists are
very long, during my tracking, i saw some gfns have more than 200 sptes listed
on its pte-list (1G memory in guest on softmmu). Optimization for dropping such
long pte-list is worthwhile, at lease it is good for deletion memslots and
ksm/thp merge pages.

This patch introduce a better way to optimize for this case, it walks all the
reverse-mappings and clear them, then free all description-tables together.

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/mmu.c |   36 +++-
 1 files changed, 27 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 58f813a..aa7a887 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -945,6 +945,25 @@ static void pte_list_remove(u64 *spte, unsigned long 
*pte_list)
}
 }

+static void pte_list_destroy(unsigned long *pte_list)
+{
+   struct pte_list_desc *desc;
+   unsigned long list_value = *pte_list;
+
+   *pte_list = 0;
+
+   if (!(list_value  1))
+   return;
+
+   desc = (struct pte_list_desc *)(list_value  ~1ul);
+   while (desc) {
+   struct pte_list_desc *next_desc = desc-more;
+
+   mmu_free_pte_list_desc(desc);
+   desc = next_desc;
+   }
+}
+
 /*
  * Used by the following functions to iterate through the sptes linked by a
  * pte_list.  All fields are private and not assumed to be used outside.
@@ -1183,17 +1202,17 @@ static bool rmap_write_protect(struct kvm *kvm, u64 gfn)
 static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp,
   struct kvm_memory_slot *slot, unsigned long data)
 {
-   u64 *sptep;
struct pte_list_iterator iter;
+   u64 *sptep;
int need_tlb_flush = 0;

-restart:
for_each_spte_in_rmap(*rmapp, iter, sptep) {
-   drop_spte(kvm, sptep);
+   mmu_spte_clear_track_bits(sptep);
need_tlb_flush = 1;
-   goto restart;
}

+   pte_list_destroy(rmapp);
+
return need_tlb_flush;
 }

@@ -2016,11 +2035,10 @@ static void kvm_mmu_unlink_parents(struct kvm *kvm, 
struct kvm_mmu_page *sp)
u64 *sptep;
struct pte_list_iterator iter;

-restart:
-   for_each_spte_in_pte_list(sp-parent_ptes, iter, sptep) {
-   drop_parent_pte(sp, sptep);
-   goto restart;
-   }
+   for_each_spte_in_pte_list(sp-parent_ptes, iter, sptep)
+   mmu_spte_clear_no_track(sptep);
+
+   pte_list_destroy(sp-parent_ptes);
 }

 static int mmu_zap_unsync_children(struct kvm *kvm,
-- 
1.7.7.6

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: win8 installation iso can not boot on rhel6.2 kernel?

2013-02-05 Thread ya su

Gleb:

would you pls tell where to find RHEL 6.4 kernel, as the current
latest office release is 6.3?

and would you figure out what the root cause that produce the problem?

thanks.


Regards.

Suya

2013/2/5 ya su suya94...@gmail.com:
 Gleb:

 would you pls tell where to find RHEL 6.4 kernel, as the current
 latest office release is 6.3?

 and would you figure out what the root cause that produce the problem?

 thanks.

 Regards.

 Suya

 2013/2/5 Gleb Natapov g...@redhat.com:
 On Tue, Feb 05, 2013 at 02:55:07PM +0800, ya su wrote:
 I use the following cmd on rhel6.2 kernel 2.6.32-220.17.1:
 x86_64-softmmu/qemu-system-x86_64 -hda win8.img -cdrom
 window_8_pro.iso -m 2048 -L pc-bios -cpu host, it will display the
 following error:
 Your PC needs to restart.
 Please hold down the power button.
 Error Code: 0x005D
 Parameters:
 0x03100A00
 0x68747541
 0x69746E65
 0x444D4163

 I also tried the newest rhel6 kernel version: 2.6.32-279.19.1, it
 bring out the same result.

 if I try standard kernel 2.6.32 version, it can boot normally.

 Any suggestions? thanks.

 You need latest RHEL6.4 kernel.

 --
 Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: win8 installation iso can not boot on rhel6.2 kernel?

2013-02-05 Thread Gleb Natapov

On Tue, Feb 05, 2013 at 04:55:57PM +0800, ya su wrote:
 Gleb:
 
 would you pls tell where to find RHEL 6.4 kernel, as the current
 latest office release is 6.3?
 
 and would you figure out what the root cause that produce the problem?
 
 thanks.
 
Through your RHEL subscription.

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

RE: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO-PCI devices

2013-02-05 Thread Pandarathil, Vijaymohan R



 -Original Message-
 From: Gleb Natapov [mailto:g...@redhat.com]
 Sent: Tuesday, February 05, 2013 12:05 AM
 To: Blue Swirl
 Cc: Pandarathil, Vijaymohan R; Alex Williamson; Bjorn Helgaas; Ortiz, Lance
 E; kvm@vger.kernel.org; qemu-de...@nongnu.org; linux-...@vger.kernel.org;
 linux-ker...@vger.kernel.org
 Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO-
 PCI devices
 
 On Sun, Feb 03, 2013 at 04:36:11PM +, Blue Swirl wrote:
  On Sun, Feb 3, 2013 at 2:10 PM, Pandarathil, Vijaymohan R
  vijaymohan.pandarat...@hp.com wrote:
   - Create eventfd per vfio device assigned to a guest and
 register an
 event handler
  
   - This fd is passed to the vfio_pci driver through the SET_IRQ
 ioctl
  
   - When the device encounters an error, the eventfd is signalled
 and the qemu eventfd handler gets invoked.
  
   - In the handler decide what action to take. Current action
 taken
 is to terminate the guest.
 
  Usually this is not OK, but I guess this is not guest triggerable.
 
 Still not OK. Why not stop a guest with appropriate stop reason?

The thinking was that since this is a hardware error, we would want to stop the 
guest at the earliest. The hw_error() routine which aborts the qemu process was 
suggested by Alex and that seemed appropriate. Earlier I was using 
qemu_system_shutdown_request().  Any suggestions ?

Thanks

Vijay
 
 --
   Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3] KVM: MMU: lazily drop large spte

2013-02-05 Thread Gleb Natapov

On Tue, Feb 05, 2013 at 03:11:09PM +0800, Xiao Guangrong wrote:
 Currently, kvm zaps the large spte if write-protected is needed, the later
 read can fault on that spte. Actually, we can make the large spte readonly
 instead of making them un-present, the page fault caused by read access can
 be avoid
 
 The idea is from Avi:
 | As I mentioned before, write-protecting a large spte is a good idea,
 | since it moves some work from protect-time to fault-time, so it reduces
 | jitter.  This removes the need for the return value.
 
 Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
Reviewed-by: Gleb Natapov g...@redhat.com

 ---
 Changelog:
 v3:
 - address Gleb's comments, we make the function return true if flush is
   needed instead of returning it via pointer to a variable
 - improve the changelog
 
  arch/x86/kvm/mmu.c |   23 +++
  1 files changed, 7 insertions(+), 16 deletions(-)
 
 diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
 index 42ba85c..ff2fc80 100644
 --- a/arch/x86/kvm/mmu.c
 +++ b/arch/x86/kvm/mmu.c
 @@ -1106,8 +1106,7 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 
 *sptep)
 
  /*
   * Write-protect on the specified @sptep, @pt_protect indicates whether
 - * spte writ-protection is caused by protecting shadow page table.
 - * @flush indicates whether tlb need be flushed.
 + * spte write-protection is caused by protecting shadow page table.
   *
   * Note: write protection is difference between drity logging and spte
   * protection:
 @@ -1116,10 +1115,9 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 
 *sptep)
   * - for spte protection, the spte can be writable only after unsync-ing
   *   shadow page.
   *
 - * Return true if the spte is dropped.
 + * Return true if tlb need be flushed.
   */
 -static bool
 -spte_write_protect(struct kvm *kvm, u64 *sptep, bool *flush, bool pt_protect)
 +static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool pt_protect)
  {
   u64 spte = *sptep;
 
 @@ -1129,17 +1127,11 @@ spte_write_protect(struct kvm *kvm, u64 *sptep, bool 
 *flush, bool pt_protect)
 
   rmap_printk(rmap_write_protect: spte %p %llx\n, sptep, *sptep);
 
 - if (__drop_large_spte(kvm, sptep)) {
 - *flush |= true;
 - return true;
 - }
 -
   if (pt_protect)
   spte = ~SPTE_MMU_WRITEABLE;
   spte = spte  ~PT_WRITABLE_MASK;
 
 - *flush |= mmu_spte_update(sptep, spte);
 - return false;
 + return mmu_spte_update(sptep, spte);
  }
 
  static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
 @@ -1151,11 +1143,8 @@ static bool __rmap_write_protect(struct kvm *kvm, 
 unsigned long *rmapp,
 
   for (sptep = rmap_get_first(*rmapp, iter); sptep;) {
   BUG_ON(!(*sptep  PT_PRESENT_MASK));
 - if (spte_write_protect(kvm, sptep, flush, pt_protect)) {
 - sptep = rmap_get_first(*rmapp, iter);
 - continue;
 - }
 
 + flush |= spte_write_protect(kvm, sptep, pt_protect);
   sptep = rmap_get_next(iter);
   }
 
 @@ -2611,6 +2600,8 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, 
 int write,
   break;
   }
 
 + drop_large_spte(vcpu, iterator.sptep);
 +
   if (!is_shadow_present_pte(*iterator.sptep)) {
   u64 base_addr = iterator.addr;
 
 -- 
 1.7.7.6
 
 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO-PCI devices

2013-02-05 Thread Gleb Natapov

On Tue, Feb 05, 2013 at 09:05:19AM +, Pandarathil, Vijaymohan R wrote:

  -Original Message-
  From: Gleb Natapov [mailto:g...@redhat.com]
  Sent: Tuesday, February 05, 2013 12:05 AM
  To: Blue Swirl
  Cc: Pandarathil, Vijaymohan R; Alex Williamson; Bjorn Helgaas; Ortiz, Lance
  E; kvm@vger.kernel.org; qemu-de...@nongnu.org; linux-...@vger.kernel.org;
  linux-ker...@vger.kernel.org
  Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO-
  PCI devices

  On Sun, Feb 03, 2013 at 04:36:11PM +, Blue Swirl wrote:
   On Sun, Feb 3, 2013 at 2:10 PM, Pandarathil, Vijaymohan R
   vijaymohan.pandarat...@hp.com wrote:
- Create eventfd per vfio device assigned to a guest and
  register an
  event handler

- This fd is passed to the vfio_pci driver through the SET_IRQ
  ioctl

- When the device encounters an error, the eventfd is signalled
  and the qemu eventfd handler gets invoked.

- In the handler decide what action to take. Current action
  taken
  is to terminate the guest.

   Usually this is not OK, but I guess this is not guest triggerable.

  Still not OK. Why not stop a guest with appropriate stop reason?

 The thinking was that since this is a hardware error, we would want to stop 
 the guest at the earliest. The hw_error() routine which aborts the qemu 
 process was suggested by Alex and that seemed appropriate. Earlier I was 
 using qemu_system_shutdown_request().  Any suggestions ?

I am thinking vm_stop(). Stopping SMP guest (and UP too in fact)
involves sending IPIs to other cpus running guest's vcpus. Both exit()
and vm_stop() will do it, but former is implicitly in the kernel and
later is explicitly in QEMU.

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

RE: I may have found a bug(unlikely) with Qemu-kvm 1.1

2013-02-05 Thread Veruca Salt

 From: verucasal...@hotmail.co.uk
 To: arm...@redhat.com
 CC: kvm@vger.kernel.org
 Subject: RE: I may have found a bug(unlikely) with Qemu-kvm 1.1
 Date: Tue, 5 Feb 2013 09:19:13 +

 Thank you very much Markus.
 We were literally two minutes ahead, we've added a tty to -monitor so that a 
 function key exposes it; actually, this is better than the original qemu 
 monitor.
 So now it's (say) -monitor /dev/tty2 and a simple ctrl-alt-f2 exposes a 
 momitor screen.

 Sorry for the false alarm, and once again, thanks a million.

 Simon

  From: arm...@redhat.com
  To: verucasal...@hotmail.co.uk
  CC: kvm@vger.kernel.org
  Subject: Re: I may have found a bug(unlikely) with Qemu-kvm 1.1
  Date: Tue, 5 Feb 2013 09:48:59 +0100

  Veruca Salt verucasal...@hotmail.co.uk writes:

   We are upgrading, carefully, from 1.0 to 1.1.

  Keep going; we're about to release 1.4 ;)

   We welcome the improved AC97 support and the loss of the ehci warning
   message on startup.

   I am finding an issue getting through to the monitor however.
   Neither ctrl-alt-shift-2 nor ctrl-alt-2 exposes the monitor.

  I never use that one, but it should work.

   I have even tried using the -mon option code examples to force a
   monitor to stdio, but the monitor stubbornly refuses to appear.

   Any advice would be greatly appreciated.

  These all work for me:

  * Shorthand syntax

  -monitor stdio

  * Equivalent longhand

  -chardev stdio,id=mon0 -mon chardev=mon0,mode=readline,default=on

  * Same as config file snippet, for use with -readconfig

  [chardev mon0]
  backend = stdio

  [mon]
  chardev = mon0
  mode = readline
  default = on
  --
  To unsubscribe from this list: send the line unsubscribe kvm in
  the body of a message to majord...@vger.kernel.org
  More majordomo info at http://vger.kernel.org/majordomo-info.html

Also, sorry for top post. I know how you guys hate that.

  --
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

RE: I may have found a bug(unlikely) with Qemu-kvm 1.1

2013-02-05 Thread Veruca Salt


Thank you very much Markus.
We were literally two minutes ahead, we've added a tty to -monitor so that a 
function key exposes it; actually, this is better than the original qemu 
monitor.
So now it's (say) -monitor /dev/tty2 and a simple ctrl-alt-f2 exposes a momitor 
screen.

Sorry for the false alarm, and once again, thanks a million.

Simon


 From: arm...@redhat.com
 To: verucasal...@hotmail.co.uk
 CC: kvm@vger.kernel.org
 Subject: Re: I may have found a bug(unlikely) with Qemu-kvm 1.1
 Date: Tue, 5 Feb 2013 09:48:59 +0100

 Veruca Salt verucasal...@hotmail.co.uk writes:

  We are upgrading, carefully, from 1.0 to 1.1.

 Keep going; we're about to release 1.4 ;)

  We welcome the improved AC97 support and the loss of the ehci warning
  message on startup.
 
  I am finding an issue getting through to the monitor however.
  Neither ctrl-alt-shift-2 nor ctrl-alt-2 exposes the monitor.

 I never use that one, but it should work.

  I have even tried using the -mon option code examples to force a
  monitor to stdio, but the monitor stubbornly refuses to appear.
 
  Any advice would be greatly appreciated.

 These all work for me:

 * Shorthand syntax

 -monitor stdio

 * Equivalent longhand

 -chardev stdio,id=mon0 -mon chardev=mon0,mode=readline,default=on

 * Same as config file snippet, for use with -readconfig

 [chardev mon0]
 backend = stdio

 [mon]
 chardev = mon0
 mode = readline
 default = on
 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at http://vger.kernel.org/majordomo-info.html
  --
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

RE: [PATCH 2/2] x86, apicv: Add Posted Interrupt supporting

2013-02-05 Thread Zhang, Yang Z

Gleb Natapov wrote on 2013-02-05:
 On Tue, Feb 05, 2013 at 05:57:14AM +, Zhang, Yang Z wrote:
 Marcelo Tosatti wrote on 2013-02-05:
 On Mon, Feb 04, 2013 at 05:59:52PM -0200, Marcelo Tosatti wrote:
 On Mon, Feb 04, 2013 at 07:13:01PM +0200, Gleb Natapov wrote:
 On Mon, Feb 04, 2013 at 12:43:45PM -0200, Marcelo Tosatti wrote:
 Any example how software relies on such
 two-interrupts-queued-in-IRR/ISR behaviour?
 Don't know about guests, but KVM relies on it to detect interrupt
 coalescing. So if interrupt is set in IRR but not in PIR interrupt will
 not be reported as coalesced, but it will be coalesced during PIR-IRR
 merge.
 
 Yes, so:
 
 1. IRR=1, ISR=0, PIR=0. Event: set_irq, coalesced=no.
 2. IRR=0, ISR=1, PIR=0. Event: IRR-ISR transfer.
 3. vcpu outside of guest mode.
 4. IRR=1, ISR=1, PIR=0. Event: set_irq, coalesced=no.
 5. vcpu enters guest mode.
 6. IRR=1, ISR=1, PIR=1. Event: set_irq, coalesced=no.
 7. HW transfers PIR into IRR.
 
 set_irq return value at 7 is incorrect, interrupt event was _not_
 queued.
 Not sure I understand the flow of events in your description correctly. 
 As I
 understand it at 4 set_irq() will return incorrect result. Basically
 when PIR is set to 1 while IRR has 1 for the vector the value of
 set_irq() will be incorrect.
 
 At 4 it has not been coalesced: it has been queued to IRR.
 At 6 it has been coalesced: PIR bit merged into IRR bit.
 
 Frankly I do not see how it can be fixed
 without any race with present HW PIR design.
 
 At kvm_accept_apic_interrupt, check IRR before setting PIR bit, if IRR
 already set, don't set PIR.
 
 Or:
 
 apic_accept_interrupt() {
 
 1. Read ORIG_PIR=PIR, ORIG_IRR=IRR.
 Never set IRR when HWAPIC enabled, even if outside of guest mode.
 2. Set PIR and let HW or SW VM-entry transfer it to IRR.
 3. set_irq return value: (ORIG_PIR or ORIG_IRR set).
 }
 
 Two or more concurrent set_irq can race with each other, though. Can
 either document the race or add a lock.
 According the SDM, software should not touch the IRR when target vcpu is
 running. Instead, use locked way to access PIR. So your solution may wrong.
 Then your apicv patches are broken, because they do exactly that.
Which code is broken?
 
 The only problem is the step 6, but at that point, there already an interrupt
 pending in IRR. This means the interrupt will be handled not lost. And even 
 in real
 hardware, this case do exist. So I think it should not be a problem.
 
 This is not the problem we are trying to fix. Sometimes we need to make
 sure that each interrupt device generates result in an interrupt handler
 invocation in a guest. If interrupt is coalesced (meaning it will not
 correspond to separate invocation of a guest interrupt handler) it needs
 to be re-injected. With PIR detection of such condition is broken.


Best regards,
Yang

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/2] x86, apicv: Add Posted Interrupt supporting

2013-02-05 Thread Gleb Natapov

On Tue, Feb 05, 2013 at 10:35:55AM +, Zhang, Yang Z wrote:
 Gleb Natapov wrote on 2013-02-05:
  On Tue, Feb 05, 2013 at 05:57:14AM +, Zhang, Yang Z wrote:
  Marcelo Tosatti wrote on 2013-02-05:
  On Mon, Feb 04, 2013 at 05:59:52PM -0200, Marcelo Tosatti wrote:
  On Mon, Feb 04, 2013 at 07:13:01PM +0200, Gleb Natapov wrote:
  On Mon, Feb 04, 2013 at 12:43:45PM -0200, Marcelo Tosatti wrote:
  Any example how software relies on such
  two-interrupts-queued-in-IRR/ISR behaviour?
  Don't know about guests, but KVM relies on it to detect interrupt
  coalescing. So if interrupt is set in IRR but not in PIR interrupt 
  will
  not be reported as coalesced, but it will be coalesced during PIR-IRR
  merge.
  
  Yes, so:
  
  1. IRR=1, ISR=0, PIR=0. Event: set_irq, coalesced=no.
  2. IRR=0, ISR=1, PIR=0. Event: IRR-ISR transfer.
  3. vcpu outside of guest mode.
  4. IRR=1, ISR=1, PIR=0. Event: set_irq, coalesced=no.
  5. vcpu enters guest mode.
  6. IRR=1, ISR=1, PIR=1. Event: set_irq, coalesced=no.
  7. HW transfers PIR into IRR.
  
  set_irq return value at 7 is incorrect, interrupt event was _not_
  queued.
  Not sure I understand the flow of events in your description correctly. 
  As I
  understand it at 4 set_irq() will return incorrect result. Basically
  when PIR is set to 1 while IRR has 1 for the vector the value of
  set_irq() will be incorrect.
  
  At 4 it has not been coalesced: it has been queued to IRR.
  At 6 it has been coalesced: PIR bit merged into IRR bit.
  
  Frankly I do not see how it can be fixed
  without any race with present HW PIR design.
  
  At kvm_accept_apic_interrupt, check IRR before setting PIR bit, if IRR
  already set, don't set PIR.
  
  Or:
  
  apic_accept_interrupt() {
  
  1. Read ORIG_PIR=PIR, ORIG_IRR=IRR.
  Never set IRR when HWAPIC enabled, even if outside of guest mode.
  2. Set PIR and let HW or SW VM-entry transfer it to IRR.
  3. set_irq return value: (ORIG_PIR or ORIG_IRR set).
  }
  
  Two or more concurrent set_irq can race with each other, though. Can
  either document the race or add a lock.
  According the SDM, software should not touch the IRR when target vcpu is
  running. Instead, use locked way to access PIR. So your solution may wrong.
  Then your apicv patches are broken, because they do exactly that.
 Which code is broken?
  
The one that updates IRR directly on the apic page.

  The only problem is the step 6, but at that point, there already an 
  interrupt
  pending in IRR. This means the interrupt will be handled not lost. And even 
  in real
  hardware, this case do exist. So I think it should not be a problem.
  
  This is not the problem we are trying to fix. Sometimes we need to make
  sure that each interrupt device generates result in an interrupt handler
  invocation in a guest. If interrupt is coalesced (meaning it will not
  correspond to separate invocation of a guest interrupt handler) it needs
  to be re-injected. With PIR detection of such condition is broken.
 
 
 Best regards,
 Yang

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

RE: [PATCH 2/2] x86, apicv: Add Posted Interrupt supporting

2013-02-05 Thread Zhang, Yang Z

Gleb Natapov wrote on 2013-02-05:
 On Tue, Feb 05, 2013 at 10:35:55AM +, Zhang, Yang Z wrote:
 Gleb Natapov wrote on 2013-02-05:
 On Tue, Feb 05, 2013 at 05:57:14AM +, Zhang, Yang Z wrote:
 Marcelo Tosatti wrote on 2013-02-05:
 On Mon, Feb 04, 2013 at 05:59:52PM -0200, Marcelo Tosatti wrote:
 On Mon, Feb 04, 2013 at 07:13:01PM +0200, Gleb Natapov wrote:
 On Mon, Feb 04, 2013 at 12:43:45PM -0200, Marcelo Tosatti wrote:
 Any example how software relies on such
 two-interrupts-queued-in-IRR/ISR behaviour?
 Don't know about guests, but KVM relies on it to detect interrupt
 coalescing. So if interrupt is set in IRR but not in PIR interrupt 
 will
 not be reported as coalesced, but it will be coalesced during PIR-IRR
 merge.
 
 Yes, so:
 
 1. IRR=1, ISR=0, PIR=0. Event: set_irq, coalesced=no.
 2. IRR=0, ISR=1, PIR=0. Event: IRR-ISR transfer.
 3. vcpu outside of guest mode.
 4. IRR=1, ISR=1, PIR=0. Event: set_irq, coalesced=no.
 5. vcpu enters guest mode.
 6. IRR=1, ISR=1, PIR=1. Event: set_irq, coalesced=no.
 7. HW transfers PIR into IRR.
 
 set_irq return value at 7 is incorrect, interrupt event was _not_
 queued.
 Not sure I understand the flow of events in your description
 correctly. As I understand it at 4 set_irq() will return incorrect
 result. Basically when PIR is set to 1 while IRR has 1 for the
 vector the value of set_irq() will be incorrect.
 
 At 4 it has not been coalesced: it has been queued to IRR.
 At 6 it has been coalesced: PIR bit merged into IRR bit.
 
 Frankly I do not see how it can be fixed
 without any race with present HW PIR design.
 
 At kvm_accept_apic_interrupt, check IRR before setting PIR bit, if IRR
 already set, don't set PIR.
 
 Or:
 
 apic_accept_interrupt() {
 
 1. Read ORIG_PIR=PIR, ORIG_IRR=IRR.
 Never set IRR when HWAPIC enabled, even if outside of guest mode.
 2. Set PIR and let HW or SW VM-entry transfer it to IRR.
 3. set_irq return value: (ORIG_PIR or ORIG_IRR set).
 }
 
 Two or more concurrent set_irq can race with each other, though. Can
 either document the race or add a lock.
 According the SDM, software should not touch the IRR when target vcpu is
 running. Instead, use locked way to access PIR. So your solution may wrong.
 Then your apicv patches are broken, because they do exactly that.
 Which code is broken?
 
 The one that updates IRR directly on the apic page.
No, all the updates are ensuring the target vcpu is not running. So it's safe 
to touch IRR.

 The only problem is the step 6, but at that point, there already an 
 interrupt
 pending in IRR. This means the interrupt will be handled not lost. And
 even in real hardware, this case do exist. So I think it should not be
 a problem.
 
 This is not the problem we are trying to fix. Sometimes we need to make
 sure that each interrupt device generates result in an interrupt handler
 invocation in a guest. If interrupt is coalesced (meaning it will not
 correspond to separate invocation of a guest interrupt handler) it needs
 to be re-injected. With PIR detection of such condition is broken.
 
 
 Best regards,
 Yang
 
 --
   Gleb.


Best regards,
Yang


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

RE: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO-PCI devices

2013-02-05 Thread Pandarathil, Vijaymohan R

 -Original Message-
 From: Gleb Natapov [mailto:g...@redhat.com]
 Sent: Tuesday, February 05, 2013 1:21 AM
 To: Pandarathil, Vijaymohan R
 Cc: Blue Swirl; Alex Williamson; Bjorn Helgaas; Ortiz, Lance E;
 kvm@vger.kernel.org; qemu-de...@nongnu.org; linux-...@vger.kernel.org;
 linux-ker...@vger.kernel.org
 Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO-
 PCI devices

 On Tue, Feb 05, 2013 at 09:05:19AM +, Pandarathil, Vijaymohan R wrote:

   -Original Message-
   From: Gleb Natapov [mailto:g...@redhat.com]
   Sent: Tuesday, February 05, 2013 12:05 AM
   To: Blue Swirl
   Cc: Pandarathil, Vijaymohan R; Alex Williamson; Bjorn Helgaas; Ortiz,
 Lance
   E; kvm@vger.kernel.org; qemu-de...@nongnu.org; linux-
 p...@vger.kernel.org;
   linux-ker...@vger.kernel.org
   Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for
 VFIO-
   PCI devices

   On Sun, Feb 03, 2013 at 04:36:11PM +, Blue Swirl wrote:
On Sun, Feb 3, 2013 at 2:10 PM, Pandarathil, Vijaymohan R
vijaymohan.pandarat...@hp.com wrote:
 - Create eventfd per vfio device assigned to a guest and
   register an
   event handler

 - This fd is passed to the vfio_pci driver through the
 SET_IRQ
   ioctl

 - When the device encounters an error, the eventfd is
 signalled
   and the qemu eventfd handler gets invoked.

 - In the handler decide what action to take. Current action
   taken
   is to terminate the guest.

Usually this is not OK, but I guess this is not guest triggerable.

   Still not OK. Why not stop a guest with appropriate stop reason?

  The thinking was that since this is a hardware error, we would want to
 stop the guest at the earliest. The hw_error() routine which aborts the
 qemu process was suggested by Alex and that seemed appropriate. Earlier I
 was using qemu_system_shutdown_request().  Any suggestions ?

 I am thinking vm_stop(). Stopping SMP guest (and UP too in fact)
 involves sending IPIs to other cpus running guest's vcpus. Both exit()
 and vm_stop() will do it, but former is implicitly in the kernel and
 later is explicitly in QEMU.

I had used vm_stop(RUN_STATE_SHUTDOWN) earlier in my code. But while testing, 
guest ended up in a hang rather than exiting. There seems to some cleanup work 
which is being done as part of vm_stop. In our case, we wanted the guest to 
exit immediately. So use of hw_error() seemed appropriate.

Thoughts ?

Vijay

 --
   Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/2] x86, apicv: Add Posted Interrupt supporting

2013-02-05 Thread Gleb Natapov

On Tue, Feb 05, 2013 at 10:58:28AM +, Zhang, Yang Z wrote:
 Gleb Natapov wrote on 2013-02-05:
  On Tue, Feb 05, 2013 at 10:35:55AM +, Zhang, Yang Z wrote:
  Gleb Natapov wrote on 2013-02-05:
  On Tue, Feb 05, 2013 at 05:57:14AM +, Zhang, Yang Z wrote:
  Marcelo Tosatti wrote on 2013-02-05:
  On Mon, Feb 04, 2013 at 05:59:52PM -0200, Marcelo Tosatti wrote:
  On Mon, Feb 04, 2013 at 07:13:01PM +0200, Gleb Natapov wrote:
  On Mon, Feb 04, 2013 at 12:43:45PM -0200, Marcelo Tosatti wrote:
  Any example how software relies on such
  two-interrupts-queued-in-IRR/ISR behaviour?
  Don't know about guests, but KVM relies on it to detect interrupt
  coalescing. So if interrupt is set in IRR but not in PIR interrupt 
  will
  not be reported as coalesced, but it will be coalesced during 
  PIR-IRR
  merge.
  
  Yes, so:
  
  1. IRR=1, ISR=0, PIR=0. Event: set_irq, coalesced=no.
  2. IRR=0, ISR=1, PIR=0. Event: IRR-ISR transfer.
  3. vcpu outside of guest mode.
  4. IRR=1, ISR=1, PIR=0. Event: set_irq, coalesced=no.
  5. vcpu enters guest mode.
  6. IRR=1, ISR=1, PIR=1. Event: set_irq, coalesced=no.
  7. HW transfers PIR into IRR.
  
  set_irq return value at 7 is incorrect, interrupt event was _not_
  queued.
  Not sure I understand the flow of events in your description
  correctly. As I understand it at 4 set_irq() will return incorrect
  result. Basically when PIR is set to 1 while IRR has 1 for the
  vector the value of set_irq() will be incorrect.
  
  At 4 it has not been coalesced: it has been queued to IRR.
  At 6 it has been coalesced: PIR bit merged into IRR bit.
  
  Frankly I do not see how it can be fixed
  without any race with present HW PIR design.
  
  At kvm_accept_apic_interrupt, check IRR before setting PIR bit, if IRR
  already set, don't set PIR.
  
  Or:
  
  apic_accept_interrupt() {
  
  1. Read ORIG_PIR=PIR, ORIG_IRR=IRR.
  Never set IRR when HWAPIC enabled, even if outside of guest mode.
  2. Set PIR and let HW or SW VM-entry transfer it to IRR.
  3. set_irq return value: (ORIG_PIR or ORIG_IRR set).
  }
  
  Two or more concurrent set_irq can race with each other, though. Can
  either document the race or add a lock.
  According the SDM, software should not touch the IRR when target vcpu is
  running. Instead, use locked way to access PIR. So your solution may 
  wrong.
  Then your apicv patches are broken, because they do exactly that.
  Which code is broken?
  
  The one that updates IRR directly on the apic page.
 No, all the updates are ensuring the target vcpu is not running. So it's safe 
 to touch IRR.
 
Not at all. Read the code.

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO-PCI devices

2013-02-05 Thread Gleb Natapov

On Tue, Feb 05, 2013 at 10:59:41AM +, Pandarathil, Vijaymohan R wrote:

  -Original Message-
  From: Gleb Natapov [mailto:g...@redhat.com]
  Sent: Tuesday, February 05, 2013 1:21 AM
  To: Pandarathil, Vijaymohan R
  Cc: Blue Swirl; Alex Williamson; Bjorn Helgaas; Ortiz, Lance E;
  kvm@vger.kernel.org; qemu-de...@nongnu.org; linux-...@vger.kernel.org;
  linux-ker...@vger.kernel.org
  Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO-
  PCI devices

  On Tue, Feb 05, 2013 at 09:05:19AM +, Pandarathil, Vijaymohan R wrote:

-Original Message-
From: Gleb Natapov [mailto:g...@redhat.com]
Sent: Tuesday, February 05, 2013 12:05 AM
To: Blue Swirl
Cc: Pandarathil, Vijaymohan R; Alex Williamson; Bjorn Helgaas; Ortiz,
  Lance
E; kvm@vger.kernel.org; qemu-de...@nongnu.org; linux-
  p...@vger.kernel.org;
linux-ker...@vger.kernel.org
Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for
  VFIO-
PCI devices

On Sun, Feb 03, 2013 at 04:36:11PM +, Blue Swirl wrote:
 On Sun, Feb 3, 2013 at 2:10 PM, Pandarathil, Vijaymohan R
 vijaymohan.pandarat...@hp.com wrote:
  - Create eventfd per vfio device assigned to a guest and
register an
event handler

  - This fd is passed to the vfio_pci driver through the
  SET_IRQ
ioctl

  - When the device encounters an error, the eventfd is
  signalled
and the qemu eventfd handler gets invoked.

  - In the handler decide what action to take. Current action
taken
is to terminate the guest.

 Usually this is not OK, but I guess this is not guest triggerable.

Still not OK. Why not stop a guest with appropriate stop reason?

   The thinking was that since this is a hardware error, we would want to
  stop the guest at the earliest. The hw_error() routine which aborts the
  qemu process was suggested by Alex and that seemed appropriate. Earlier I
  was using qemu_system_shutdown_request().  Any suggestions ?

  I am thinking vm_stop(). Stopping SMP guest (and UP too in fact)
  involves sending IPIs to other cpus running guest's vcpus. Both exit()
  and vm_stop() will do it, but former is implicitly in the kernel and
  later is explicitly in QEMU.

 I had used vm_stop(RUN_STATE_SHUTDOWN) earlier in my code. But while testing, 
 guest ended up in a hang rather than exiting. There seems to some cleanup 
 work which is being done as part of vm_stop. In our case, we wanted the guest 
 to exit immediately. So use of hw_error() seemed appropriate.

What makes you think it hang? It stopped, precisely what it should do if
you call vm_stop(). Now it is possible for vm user to investigate what
happened and even salvage some data from guest memory.

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 0/3] KVM: MMU: simple cleanups

2013-02-05 Thread Gleb Natapov

On Tue, Feb 05, 2013 at 03:26:21PM +0800, Xiao Guangrong wrote:
 There are the simple cleanups for MMU, no function / logic changed.
 
 Marcelo, Gleb, please apply them after applying
 [PATCH v3] KVM: MMU: lazily drop large spte
 
 Changelog:
 no change, just split them from the previous patchset for good review.
 
 Thanks!
Reviewed-by: Gleb Natapov g...@redhat.com

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

RE: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO-PCI devices

2013-02-05 Thread Pandarathil, Vijaymohan R

 -Original Message-
 From: linux-pci-ow...@vger.kernel.org [mailto:linux-pci-
 ow...@vger.kernel.org] On Behalf Of Gleb Natapov
 Sent: Tuesday, February 05, 2013 3:37 AM
 To: Pandarathil, Vijaymohan R
 Cc: Blue Swirl; Alex Williamson; Bjorn Helgaas; Ortiz, Lance E;
 kvm@vger.kernel.org; qemu-de...@nongnu.org; linux-...@vger.kernel.org;
 linux-ker...@vger.kernel.org
 Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO-
 PCI devices

 On Tue, Feb 05, 2013 at 10:59:41AM +, Pandarathil, Vijaymohan R wrote:

   -Original Message-
   From: Gleb Natapov [mailto:g...@redhat.com]
   Sent: Tuesday, February 05, 2013 1:21 AM
   To: Pandarathil, Vijaymohan R
   Cc: Blue Swirl; Alex Williamson; Bjorn Helgaas; Ortiz, Lance E;
   kvm@vger.kernel.org; qemu-de...@nongnu.org; linux-...@vger.kernel.org;
   linux-ker...@vger.kernel.org
   Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for
 VFIO-
   PCI devices

   On Tue, Feb 05, 2013 at 09:05:19AM +, Pandarathil, Vijaymohan R
 wrote:

 -Original Message-
 From: Gleb Natapov [mailto:g...@redhat.com]
 Sent: Tuesday, February 05, 2013 12:05 AM
 To: Blue Swirl
 Cc: Pandarathil, Vijaymohan R; Alex Williamson; Bjorn Helgaas;
 Ortiz,
   Lance
 E; kvm@vger.kernel.org; qemu-de...@nongnu.org; linux-
   p...@vger.kernel.org;
 linux-ker...@vger.kernel.org
 Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER
 for
   VFIO-
 PCI devices

 On Sun, Feb 03, 2013 at 04:36:11PM +, Blue Swirl wrote:
  On Sun, Feb 3, 2013 at 2:10 PM, Pandarathil, Vijaymohan R
  vijaymohan.pandarat...@hp.com wrote:
   - Create eventfd per vfio device assigned to a guest
 and
 register an
 event handler

   - This fd is passed to the vfio_pci driver through the
   SET_IRQ
 ioctl

   - When the device encounters an error, the eventfd is
   signalled
 and the qemu eventfd handler gets invoked.

   - In the handler decide what action to take. Current
 action
 taken
 is to terminate the guest.

  Usually this is not OK, but I guess this is not guest
 triggerable.

 Still not OK. Why not stop a guest with appropriate stop reason?

The thinking was that since this is a hardware error, we would want
 to
   stop the guest at the earliest. The hw_error() routine which aborts the
   qemu process was suggested by Alex and that seemed appropriate. Earlier
 I
   was using qemu_system_shutdown_request().  Any suggestions ?

   I am thinking vm_stop(). Stopping SMP guest (and UP too in fact)
   involves sending IPIs to other cpus running guest's vcpus. Both exit()
   and vm_stop() will do it, but former is implicitly in the kernel and
   later is explicitly in QEMU.

  I had used vm_stop(RUN_STATE_SHUTDOWN) earlier in my code. But while
 testing, guest ended up in a hang rather than exiting. There seems to some
 cleanup work which is being done as part of vm_stop. In our case, we wanted
 the guest to exit immediately. So use of hw_error() seemed appropriate.

 What makes you think it hang? It stopped, precisely what it should do if
 you call vm_stop(). Now it is possible for vm user to investigate what
 happened and even salvage some data from guest memory.

That was ignorance on my part on the expected behavior of vm_stop(). 
So what you are suggesting is to stop the guest displaying an appropriate 
error/next-steps message and have the users do any 
data-collection/investigation 
and then manually kill the guest, if they so desire. Right ?

Sounds reasonable. As long as the guest is not touching the device, it should 
be okay.
Alex, Any comments ?

Thanks

Vijay

 --
   Gleb.
 --
 To unsubscribe from this list: send the line unsubscribe linux-pci in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Investigating abnormal stealtimes

2013-02-05 Thread Stefan Hajnoczi

On Tue, Feb 5, 2013 at 1:26 AM, Marcelo Tosatti mtosa...@redhat.com wrote:
 - 'Steal time' is the amount of time taken while vcpu is able to run
 but not runnable. Maybe 'vmexit latency' is a better name.

You are right, 'vmexit latency' is a better name.

 - Perhaps it would be good to subtract the time the thread was
 involuntarily scheduled out due 'timeslice' expiration. Otherwise,
 running a CPU intensive task returns false positives (that is, long
 delays to due to reschedule due to 'timeslice' exhausted by guest CPU
 activity, not due to KVM or QEMU issues such as voluntarily schedule in
 pthread_mutex_lock).

 Alternatively you can raise the priority of the vcpu threads (to get rid
 of the false positives).

I think this depends on the use-case.  If the aim is to find out why
the guest has poor response times then timeslice expiration is
interesting.  If the aim is to optimize QEMU or kvm.ko then timeslice
expiration is a nuisance :).

Your idea to raise the vcpu thread priority sounds good to me.

 - Idea: Would be handy to extract trace events in the offending
 'latency above threshold' vmexit/vmentry region.
 Say that you enable other trace events (unrelated to kvm) which can
 help identify the culprit. Instead of scanning the file manually
 searching for 100466.1062486786 save one vmexit/vmentry cycle,
 along with other trace events in that period, in a separate file.

Good idea.

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO-PCI devices

2013-02-05 Thread Gleb Natapov

On Tue, Feb 05, 2013 at 12:05:11PM +, Pandarathil, Vijaymohan R wrote:

  -Original Message-
  From: linux-pci-ow...@vger.kernel.org [mailto:linux-pci-
  ow...@vger.kernel.org] On Behalf Of Gleb Natapov
  Sent: Tuesday, February 05, 2013 3:37 AM
  To: Pandarathil, Vijaymohan R
  Cc: Blue Swirl; Alex Williamson; Bjorn Helgaas; Ortiz, Lance E;
  kvm@vger.kernel.org; qemu-de...@nongnu.org; linux-...@vger.kernel.org;
  linux-ker...@vger.kernel.org
  Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO-
  PCI devices

  On Tue, Feb 05, 2013 at 10:59:41AM +, Pandarathil, Vijaymohan R wrote:

-Original Message-
From: Gleb Natapov [mailto:g...@redhat.com]
Sent: Tuesday, February 05, 2013 1:21 AM
To: Pandarathil, Vijaymohan R
Cc: Blue Swirl; Alex Williamson; Bjorn Helgaas; Ortiz, Lance E;
kvm@vger.kernel.org; qemu-de...@nongnu.org; linux-...@vger.kernel.org;
linux-ker...@vger.kernel.org
Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for
  VFIO-
PCI devices

On Tue, Feb 05, 2013 at 09:05:19AM +, Pandarathil, Vijaymohan R
  wrote:

  -Original Message-
  From: Gleb Natapov [mailto:g...@redhat.com]
  Sent: Tuesday, February 05, 2013 12:05 AM
  To: Blue Swirl
  Cc: Pandarathil, Vijaymohan R; Alex Williamson; Bjorn Helgaas;
  Ortiz,
Lance
  E; kvm@vger.kernel.org; qemu-de...@nongnu.org; linux-
p...@vger.kernel.org;
  linux-ker...@vger.kernel.org
  Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER
  for
VFIO-
  PCI devices

  On Sun, Feb 03, 2013 at 04:36:11PM +, Blue Swirl wrote:
   On Sun, Feb 3, 2013 at 2:10 PM, Pandarathil, Vijaymohan R
   vijaymohan.pandarat...@hp.com wrote:
- Create eventfd per vfio device assigned to a guest
  and
  register an
  event handler

- This fd is passed to the vfio_pci driver through the
SET_IRQ
  ioctl

- When the device encounters an error, the eventfd is
signalled
  and the qemu eventfd handler gets invoked.

- In the handler decide what action to take. Current
  action
  taken
  is to terminate the guest.

   Usually this is not OK, but I guess this is not guest
  triggerable.

  Still not OK. Why not stop a guest with appropriate stop reason?

 The thinking was that since this is a hardware error, we would want
  to
stop the guest at the earliest. The hw_error() routine which aborts the
qemu process was suggested by Alex and that seemed appropriate. Earlier
  I
was using qemu_system_shutdown_request().  Any suggestions ?

I am thinking vm_stop(). Stopping SMP guest (and UP too in fact)
involves sending IPIs to other cpus running guest's vcpus. Both exit()
and vm_stop() will do it, but former is implicitly in the kernel and
later is explicitly in QEMU.

   I had used vm_stop(RUN_STATE_SHUTDOWN) earlier in my code. But while
  testing, guest ended up in a hang rather than exiting. There seems to some
  cleanup work which is being done as part of vm_stop. In our case, we wanted
  the guest to exit immediately. So use of hw_error() seemed appropriate.

  What makes you think it hang? It stopped, precisely what it should do if
  you call vm_stop(). Now it is possible for vm user to investigate what
  happened and even salvage some data from guest memory.

 That was ignorance on my part on the expected behavior of vm_stop(). 
 So what you are suggesting is to stop the guest displaying an appropriate 
 error/next-steps message and have the users do any 
 data-collection/investigation 
 and then manually kill the guest, if they so desire. Right ?

Yes.

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 0/5] KVM: MMU: unify and cleanup the code of walking pte list

2013-02-05 Thread Gleb Natapov

On Tue, Feb 05, 2013 at 04:52:35PM +0800, Xiao Guangrong wrote:
 Current code has two ways to walk pte_list, the one is pte_list_walk and
 the another way is rmap_get_first and rmap_get_next, they have the same logic.
 This patchset tries to unify the code and also make the code more tidy.
 
 Patch 1: KVM: MMU: introduce mmu_spte_establish, which tries to eliminates
 the different between walking parent pte list and rmap, prepare for the
 later patch.
 
 Patch 2: KVM: MMU: clarify the logic in kvm_set_pte_rmapp, which prepares for
 the next patch, no logic changed.
 
 Patch 3: KVM: MMU: unify the code of walking pte list, unify the walking code.
 
 Patch 4: KVM: MMU: fix spte assertion, fix a minor bug and remove the 
 duplicate
 code.
 
 Patch 5: KVM: MMU: fast drop all spte on the pte_list, optimize for dropping
 all sptes on rmap and remove all the goto restart pattern introduced by
 the Patch 3.
 
 Marcelo, Gleb, please apply them after applying the patchset of
 [PATCH v3 0/3] KVM: MMU: simple cleanups.
 
 Changelog:
 v3:
 - address Gleb's comments, remove the remained goto restart in
   kvm_set_pte_rmapp
 - improve the changelog
 
Reviewed-by: Gleb Natapov g...@redhat.com

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: How to limit upload bandwidth for a guest server?

2013-02-05 Thread Stefan Hajnoczi

On Sun, Feb 03, 2013 at 07:59:07PM -0600, Neil Aggarwal wrote:
 I have a CentOS server using KVM to host guest servers.
 I am trying to limit the bandwidth usable by a guest server.
 
 I tried to use tc, but that is only limiting the download bandwidth
 to a server.  It does not seem to filter packets uploaded by the
 server.
 
 Is there a tool to limit upload traffic for a guest server?

Consider using management tools like libvirt that handle tc and friends
for you.  The domain XML is interfacebandwidthinbound and
outbound.

Back to the question, are you looking for ingress qdiscs?

http://www.lartc.org/howto/lartc.adv-qdisc.ingress.html

This is a standard tc question, not related to virtualization.  You may
get better help from Linux networking mailing lists or IRC channels.

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: KVH call agenda for 2013-02-05

2013-02-05 Thread Anthony Liguori

Juan Quintela quint...@redhat.com writes:

 Hi

 Please send in any agenda topics you are interested in.

FYI, I have a conflict for today so I won't be able to attend.

Regards,

Anthony Liguori


 Later, Juan.
 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

RE: [PATCH 2/2] x86, apicv: Add Posted Interrupt supporting

2013-02-05 Thread Zhang, Yang Z

Gleb Natapov wrote on 2013-02-05:
 On Tue, Feb 05, 2013 at 10:58:28AM +, Zhang, Yang Z wrote:
 Gleb Natapov wrote on 2013-02-05:
 On Tue, Feb 05, 2013 at 10:35:55AM +, Zhang, Yang Z wrote:
 Gleb Natapov wrote on 2013-02-05:
 On Tue, Feb 05, 2013 at 05:57:14AM +, Zhang, Yang Z wrote:
 Marcelo Tosatti wrote on 2013-02-05:
 On Mon, Feb 04, 2013 at 05:59:52PM -0200, Marcelo Tosatti wrote:
 On Mon, Feb 04, 2013 at 07:13:01PM +0200, Gleb Natapov wrote:
 On Mon, Feb 04, 2013 at 12:43:45PM -0200, Marcelo Tosatti wrote:
 Any example how software relies on such
 two-interrupts-queued-in-IRR/ISR behaviour?
 Don't know about guests, but KVM relies on it to detect
 interrupt coalescing. So if interrupt is set in IRR but not in
 PIR interrupt will not be reported as coalesced, but it will
 be coalesced during PIR-IRR merge.
 
 Yes, so:
 
 1. IRR=1, ISR=0, PIR=0. Event: set_irq, coalesced=no.
 2. IRR=0, ISR=1, PIR=0. Event: IRR-ISR transfer.
 3. vcpu outside of guest mode.
 4. IRR=1, ISR=1, PIR=0. Event: set_irq, coalesced=no.
 5. vcpu enters guest mode.
 6. IRR=1, ISR=1, PIR=1. Event: set_irq, coalesced=no.
 7. HW transfers PIR into IRR.
 
 set_irq return value at 7 is incorrect, interrupt event was _not_
 queued.
 Not sure I understand the flow of events in your description
 correctly. As I understand it at 4 set_irq() will return incorrect
 result. Basically when PIR is set to 1 while IRR has 1 for the
 vector the value of set_irq() will be incorrect.
 
 At 4 it has not been coalesced: it has been queued to IRR.
 At 6 it has been coalesced: PIR bit merged into IRR bit.
 
 Frankly I do not see how it can be fixed
 without any race with present HW PIR design.
 
 At kvm_accept_apic_interrupt, check IRR before setting PIR bit, if IRR
 already set, don't set PIR.
 
 Or:
 
 apic_accept_interrupt() {
 
 1. Read ORIG_PIR=PIR, ORIG_IRR=IRR.
 Never set IRR when HWAPIC enabled, even if outside of guest mode.
 2. Set PIR and let HW or SW VM-entry transfer it to IRR.
 3. set_irq return value: (ORIG_PIR or ORIG_IRR set).
 }
 
 Two or more concurrent set_irq can race with each other, though. Can
 either document the race or add a lock.
 According the SDM, software should not touch the IRR when target vcpu
 is
 running. Instead, use locked way to access PIR. So your solution may
 wrong. Then your apicv patches are broken, because they do exactly
 that.
 Which code is broken?
 
 The one that updates IRR directly on the apic page.
 No, all the updates are ensuring the target vcpu is not running. So
 it's safe to touch IRR.
 
 Not at all. Read the code.
Sorry. I still cannot figure out which code is wrong. All the places call 
sync_pir_to_irr() are on target vcpu.
Can you point out the code? Thanks.

Best regards,
Yang


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/2] x86, apicv: Add Posted Interrupt supporting

2013-02-05 Thread Gleb Natapov

On Tue, Feb 05, 2013 at 01:26:42PM +, Zhang, Yang Z wrote:
 Gleb Natapov wrote on 2013-02-05:
  On Tue, Feb 05, 2013 at 10:58:28AM +, Zhang, Yang Z wrote:
  Gleb Natapov wrote on 2013-02-05:
  On Tue, Feb 05, 2013 at 10:35:55AM +, Zhang, Yang Z wrote:
  Gleb Natapov wrote on 2013-02-05:
  On Tue, Feb 05, 2013 at 05:57:14AM +, Zhang, Yang Z wrote:
  Marcelo Tosatti wrote on 2013-02-05:
  On Mon, Feb 04, 2013 at 05:59:52PM -0200, Marcelo Tosatti wrote:
  On Mon, Feb 04, 2013 at 07:13:01PM +0200, Gleb Natapov wrote:
  On Mon, Feb 04, 2013 at 12:43:45PM -0200, Marcelo Tosatti wrote:
  Any example how software relies on such
  two-interrupts-queued-in-IRR/ISR behaviour?
  Don't know about guests, but KVM relies on it to detect
  interrupt coalescing. So if interrupt is set in IRR but not in
  PIR interrupt will not be reported as coalesced, but it will
  be coalesced during PIR-IRR merge.
  
  Yes, so:
  
  1. IRR=1, ISR=0, PIR=0. Event: set_irq, coalesced=no.
  2. IRR=0, ISR=1, PIR=0. Event: IRR-ISR transfer.
  3. vcpu outside of guest mode.
  4. IRR=1, ISR=1, PIR=0. Event: set_irq, coalesced=no.
  5. vcpu enters guest mode.
  6. IRR=1, ISR=1, PIR=1. Event: set_irq, coalesced=no.
  7. HW transfers PIR into IRR.
  
  set_irq return value at 7 is incorrect, interrupt event was _not_
  queued.
  Not sure I understand the flow of events in your description
  correctly. As I understand it at 4 set_irq() will return incorrect
  result. Basically when PIR is set to 1 while IRR has 1 for the
  vector the value of set_irq() will be incorrect.
  
  At 4 it has not been coalesced: it has been queued to IRR.
  At 6 it has been coalesced: PIR bit merged into IRR bit.
  
  Frankly I do not see how it can be fixed
  without any race with present HW PIR design.
  
  At kvm_accept_apic_interrupt, check IRR before setting PIR bit, if 
  IRR
  already set, don't set PIR.
  
  Or:
  
  apic_accept_interrupt() {
  
  1. Read ORIG_PIR=PIR, ORIG_IRR=IRR.
  Never set IRR when HWAPIC enabled, even if outside of guest mode.
  2. Set PIR and let HW or SW VM-entry transfer it to IRR.
  3. set_irq return value: (ORIG_PIR or ORIG_IRR set).
  }
  
  Two or more concurrent set_irq can race with each other, though. Can
  either document the race or add a lock.
  According the SDM, software should not touch the IRR when target vcpu
  is
  running. Instead, use locked way to access PIR. So your solution may
  wrong. Then your apicv patches are broken, because they do exactly
  that.
  Which code is broken?
  
  The one that updates IRR directly on the apic page.
  No, all the updates are ensuring the target vcpu is not running. So
  it's safe to touch IRR.
  
  Not at all. Read the code.
 Sorry. I still cannot figure out which code is wrong. All the places call 
 sync_pir_to_irr() are on target vcpu.
 Can you point out the code? Thanks.
 
I am taking about vapic patches which are already in, not pir patches.

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: DMAR faults from unrelated device when vfio is used

2013-02-05 Thread David Gstir

Am Montag, den 04.02.2013, 08:49 -0700 schrieb Alex Williamson:

 Can you clarify what you mean by assign?  Are you actually assigning the
 root ports to the qemu guest (1c.0  1c.6)?  vfio will require they be
 owned by vfio-pci to make use of 3:00.0, but assigning them to the guest
 is not recommended.  Can you provided your qemu command line?  

I did hand all of them to the guest OS. Removing 1c.0  1c.6 from the qemu 
command line seems to have done the trick. Thanks!

Here's my working qemu command line:
qemu-kvm -no-reboot -enable-kvm -cpu host -smp 4 -m 6G \
  -drive 
file=/home/test/qemu/images/win7_base_updated.qcow2,if=virtio,cache=none,media=disk,format=qcow2,index=0
 \
  -full-screen -no-quit -no-frame -display sdl -vnc :1 -k de -usbdevice tablet \
  -vga std -global VGA.vgamem_mb=256 \
  -netdev tap,id=guest0,ifname=tap0,script=no,downscript=no \
  -net nic,netdev=guest0,model=virtio,macaddr=00:16:35:BE:EF:12  \
  -rtc base=localtime \
  -device vfio-pci,host=00:1b.0,id=audio \
  -device vfio-pci,host=00:1a.0,id=ehci1 \
  -device vfio-pci,host=00:1d.0,id=ehci2 \
  -device vfio-pci,host=03:00.0,id=xhci1 \
  -monitor tcp::,server,nowait


 We need
 to re-visit how to handle pcieport devices with vfio-pci, perhaps
 white-listing it as a vfio compatible driver, but this still should
 not interfere with devices external to the group.
 
 The DMAR fault address looks pretty bogus unless you happen to have
 100GB+ of ram in the system.

Nope, definitely not. :)

 vfio makes use of the IOMMU API for programming DMA translations, so an
 reserved fields would have to be programmed by intel-iommu itself.  We
 could of course be passing some kind of bogus data that intel-iommu
 isn't catching.  If you're assigning the root ports to the guest, I'd
 start with that, don't do it.  Attach them to vfio, but don't give them
 to the guest.  Maybe that'll give us a hint.  I also notice that your
 USB 3 controller is dead:
 
 03:00.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host Controller 
 (rev ff) (prog-if ff)
   !!! Unknown header type 7f
 
 We only see unknown header type 7f when the read from the device returns
 -1.  This might have something to do with the root port above it (1c.6)
 being in state D3.  Windows likes to put unused devices in D3, which
 leads me to suspect you are giving it to the guest.  

There error does no longer occur. lspci now shows this:

-- snip --
03:00.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host Controller (rev 
04) (prog-if 30 [XHCI])
Subsystem: Intel Corporation Device 2008
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- 
Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- 
MAbort+ SERR- PERR- INTx-
Interrupt: pin A routed to IRQ 18
Region 0: Memory at fe50 (64-bit, non-prefetchable) [disabled] 
[size=8K]
Capabilities: [50] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA 
PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D3 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [70] MSI: Enable- Count=1/8 Maskable- 64bit+
Address:   Data: 
Capabilities: [90] MSI-X: Enable- Count=8 Masked-
Vector table: BAR=0 offset=1000
PBA: BAR=0 offset=1080
Capabilities: [a0] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s 
unlimited, L1 unlimited
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- 
Unsupported-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 128 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ 
TransPend-
LnkCap: Port #0, Speed 5GT/s, Width x1, ASPM L0s L1, Latency L0 
4us, L1 unlimited
ClockPM+ Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ 
DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Not Supported, TimeoutDis+
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-, 
Selectable De-emphasis: -6dB
 Transmit Margin: Normal Operating Range, 
EnterModifiedCompliance- ComplianceSOS-
 Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -3.5dB, 
EqualizationComplete-, EqualizationPhase1-
 EqualizationPhase2-, EqualizationPhase3-, 
LinkEqualizationRequest-
Capabilities: [100

Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO-PCI devices

2013-02-05 Thread Alex Williamson

On Tue, 2013-02-05 at 12:05 +, Pandarathil, Vijaymohan R wrote:

  -Original Message-
  From: linux-pci-ow...@vger.kernel.org [mailto:linux-pci-
  ow...@vger.kernel.org] On Behalf Of Gleb Natapov
  Sent: Tuesday, February 05, 2013 3:37 AM
  To: Pandarathil, Vijaymohan R
  Cc: Blue Swirl; Alex Williamson; Bjorn Helgaas; Ortiz, Lance E;
  kvm@vger.kernel.org; qemu-de...@nongnu.org; linux-...@vger.kernel.org;
  linux-ker...@vger.kernel.org
  Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO-
  PCI devices

  On Tue, Feb 05, 2013 at 10:59:41AM +, Pandarathil, Vijaymohan R wrote:

-Original Message-
From: Gleb Natapov [mailto:g...@redhat.com]
Sent: Tuesday, February 05, 2013 1:21 AM
To: Pandarathil, Vijaymohan R
Cc: Blue Swirl; Alex Williamson; Bjorn Helgaas; Ortiz, Lance E;
kvm@vger.kernel.org; qemu-de...@nongnu.org; linux-...@vger.kernel.org;
linux-ker...@vger.kernel.org
Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for
  VFIO-
PCI devices

On Tue, Feb 05, 2013 at 09:05:19AM +, Pandarathil, Vijaymohan R
  wrote:

  -Original Message-
  From: Gleb Natapov [mailto:g...@redhat.com]
  Sent: Tuesday, February 05, 2013 12:05 AM
  To: Blue Swirl
  Cc: Pandarathil, Vijaymohan R; Alex Williamson; Bjorn Helgaas;
  Ortiz,
Lance
  E; kvm@vger.kernel.org; qemu-de...@nongnu.org; linux-
p...@vger.kernel.org;
  linux-ker...@vger.kernel.org
  Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER
  for
VFIO-
  PCI devices

  On Sun, Feb 03, 2013 at 04:36:11PM +, Blue Swirl wrote:
   On Sun, Feb 3, 2013 at 2:10 PM, Pandarathil, Vijaymohan R
   vijaymohan.pandarat...@hp.com wrote:
- Create eventfd per vfio device assigned to a guest
  and
  register an
  event handler

- This fd is passed to the vfio_pci driver through the
SET_IRQ
  ioctl

- When the device encounters an error, the eventfd is
signalled
  and the qemu eventfd handler gets invoked.

- In the handler decide what action to take. Current
  action
  taken
  is to terminate the guest.

   Usually this is not OK, but I guess this is not guest
  triggerable.

  Still not OK. Why not stop a guest with appropriate stop reason?

 The thinking was that since this is a hardware error, we would want
  to
stop the guest at the earliest. The hw_error() routine which aborts the
qemu process was suggested by Alex and that seemed appropriate. Earlier
  I
was using qemu_system_shutdown_request().  Any suggestions ?

I am thinking vm_stop(). Stopping SMP guest (and UP too in fact)
involves sending IPIs to other cpus running guest's vcpus. Both exit()
and vm_stop() will do it, but former is implicitly in the kernel and
later is explicitly in QEMU.

   I had used vm_stop(RUN_STATE_SHUTDOWN) earlier in my code. But while
  testing, guest ended up in a hang rather than exiting. There seems to some
  cleanup work which is being done as part of vm_stop. In our case, we wanted
  the guest to exit immediately. So use of hw_error() seemed appropriate.

  What makes you think it hang? It stopped, precisely what it should do if
  you call vm_stop(). Now it is possible for vm user to investigate what
  happened and even salvage some data from guest memory.

 That was ignorance on my part on the expected behavior of vm_stop(). 
 So what you are suggesting is to stop the guest displaying an appropriate 
 error/next-steps message and have the users do any 
 data-collection/investigation 
 and then manually kill the guest, if they so desire. Right ?

 Sounds reasonable. As long as the guest is not touching the device, it should 
 be okay.
 Alex, Any comments ?

What's the libvirt behavior when a guest goes to vm_stop?  My only
concern would be whether the user is going to be confused by a state
where the vm is still up, but not running.  I imagine they'll have to
manually stop it and restart it to continue.  Thanks,

Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

RE: [PATCH 2/2] x86, apicv: Add Posted Interrupt supporting

2013-02-05 Thread Zhang, Yang Z

Gleb Natapov wrote on 2013-02-05:
 On Tue, Feb 05, 2013 at 01:26:42PM +, Zhang, Yang Z wrote:
 Gleb Natapov wrote on 2013-02-05:
 On Tue, Feb 05, 2013 at 10:58:28AM +, Zhang, Yang Z wrote:
 Gleb Natapov wrote on 2013-02-05:
 On Tue, Feb 05, 2013 at 10:35:55AM +, Zhang, Yang Z wrote:
 Gleb Natapov wrote on 2013-02-05:
 On Tue, Feb 05, 2013 at 05:57:14AM +, Zhang, Yang Z wrote:
 Marcelo Tosatti wrote on 2013-02-05:
 On Mon, Feb 04, 2013 at 05:59:52PM -0200, Marcelo Tosatti wrote:
 On Mon, Feb 04, 2013 at 07:13:01PM +0200, Gleb Natapov wrote:
 On Mon, Feb 04, 2013 at 12:43:45PM -0200, Marcelo Tosatti wrote:
 Any example how software relies on such
 two-interrupts-queued-in-IRR/ISR behaviour?
 Don't know about guests, but KVM relies on it to detect
 interrupt coalescing. So if interrupt is set in IRR but not in
 PIR interrupt will not be reported as coalesced, but it will
 be coalesced during PIR-IRR merge.
 
 Yes, so:
 
 1. IRR=1, ISR=0, PIR=0. Event: set_irq, coalesced=no.
 2. IRR=0, ISR=1, PIR=0. Event: IRR-ISR transfer.
 3. vcpu outside of guest mode.
 4. IRR=1, ISR=1, PIR=0. Event: set_irq, coalesced=no.
 5. vcpu enters guest mode.
 6. IRR=1, ISR=1, PIR=1. Event: set_irq, coalesced=no.
 7. HW transfers PIR into IRR.
 
 set_irq return value at 7 is incorrect, interrupt event was _not_
 queued.
 Not sure I understand the flow of events in your description
 correctly. As I understand it at 4 set_irq() will return incorrect
 result. Basically when PIR is set to 1 while IRR has 1 for the
 vector the value of set_irq() will be incorrect.
 
 At 4 it has not been coalesced: it has been queued to IRR.
 At 6 it has been coalesced: PIR bit merged into IRR bit.
 
 Frankly I do not see how it can be fixed
 without any race with present HW PIR design.
 
 At kvm_accept_apic_interrupt, check IRR before setting PIR bit,
 if IRR already set, don't set PIR.
 
 Or:
 
 apic_accept_interrupt() {
 
 1. Read ORIG_PIR=PIR, ORIG_IRR=IRR.
 Never set IRR when HWAPIC enabled, even if outside of guest mode.
 2. Set PIR and let HW or SW VM-entry transfer it to IRR.
 3. set_irq return value: (ORIG_PIR or ORIG_IRR set).
 }
 
 Two or more concurrent set_irq can race with each other, though. Can
 either document the race or add a lock.
 According the SDM, software should not touch the IRR when target
 vcpu
 is
 running. Instead, use locked way to access PIR. So your solution may
 wrong. Then your apicv patches are broken, because they do exactly
 that.
 Which code is broken?
 
 The one that updates IRR directly on the apic page.
 No, all the updates are ensuring the target vcpu is not running. So
 it's safe to touch IRR.
 
 Not at all. Read the code.
 Sorry. I still cannot figure out which code is wrong. All the places
 call sync_pir_to_irr() are on target vcpu. Can you point out the code?
 Thanks.
 
 I am taking about vapic patches which are already in, not pir patches.
Yes, but the issue will be fixed with pir patches. With posted interrupt, it 
will touch PIR instead IRR and access PIR is allowed by HW.

Best regards,
Yang


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO-PCI devices

2013-02-05 Thread Gleb Natapov

On Tue, Feb 05, 2013 at 06:37:35AM -0700, Alex Williamson wrote:
 On Tue, 2013-02-05 at 12:05 +, Pandarathil, Vijaymohan R wrote:

   -Original Message-
   From: linux-pci-ow...@vger.kernel.org [mailto:linux-pci-
   ow...@vger.kernel.org] On Behalf Of Gleb Natapov
   Sent: Tuesday, February 05, 2013 3:37 AM
   To: Pandarathil, Vijaymohan R
   Cc: Blue Swirl; Alex Williamson; Bjorn Helgaas; Ortiz, Lance E;
   kvm@vger.kernel.org; qemu-de...@nongnu.org; linux-...@vger.kernel.org;
   linux-ker...@vger.kernel.org
   Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for 
   VFIO-
   PCI devices

   On Tue, Feb 05, 2013 at 10:59:41AM +, Pandarathil, Vijaymohan R wrote:

 -Original Message-
 From: Gleb Natapov [mailto:g...@redhat.com]
 Sent: Tuesday, February 05, 2013 1:21 AM
 To: Pandarathil, Vijaymohan R
 Cc: Blue Swirl; Alex Williamson; Bjorn Helgaas; Ortiz, Lance E;
 kvm@vger.kernel.org; qemu-de...@nongnu.org; linux-...@vger.kernel.org;
 linux-ker...@vger.kernel.org
 Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for
   VFIO-
 PCI devices

 On Tue, Feb 05, 2013 at 09:05:19AM +, Pandarathil, Vijaymohan R
   wrote:

   -Original Message-
   From: Gleb Natapov [mailto:g...@redhat.com]
   Sent: Tuesday, February 05, 2013 12:05 AM
   To: Blue Swirl
   Cc: Pandarathil, Vijaymohan R; Alex Williamson; Bjorn Helgaas;
   Ortiz,
 Lance
   E; kvm@vger.kernel.org; qemu-de...@nongnu.org; linux-
 p...@vger.kernel.org;
   linux-ker...@vger.kernel.org
   Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER
   for
 VFIO-
   PCI devices

   On Sun, Feb 03, 2013 at 04:36:11PM +, Blue Swirl wrote:
On Sun, Feb 3, 2013 at 2:10 PM, Pandarathil, Vijaymohan R
vijaymohan.pandarat...@hp.com wrote:
 - Create eventfd per vfio device assigned to a guest
   and
   register an
   event handler

 - This fd is passed to the vfio_pci driver through the
 SET_IRQ
   ioctl

 - When the device encounters an error, the eventfd is
 signalled
   and the qemu eventfd handler gets invoked.

 - In the handler decide what action to take. Current
   action
   taken
   is to terminate the guest.

Usually this is not OK, but I guess this is not guest
   triggerable.

   Still not OK. Why not stop a guest with appropriate stop reason?

  The thinking was that since this is a hardware error, we would want
   to
 stop the guest at the earliest. The hw_error() routine which aborts 
 the
 qemu process was suggested by Alex and that seemed appropriate. 
 Earlier
   I
 was using qemu_system_shutdown_request().  Any suggestions ?

 I am thinking vm_stop(). Stopping SMP guest (and UP too in fact)
 involves sending IPIs to other cpus running guest's vcpus. Both exit()
 and vm_stop() will do it, but former is implicitly in the kernel and
 later is explicitly in QEMU.

I had used vm_stop(RUN_STATE_SHUTDOWN) earlier in my code. But while
   testing, guest ended up in a hang rather than exiting. There seems to some
   cleanup work which is being done as part of vm_stop. In our case, we 
   wanted
   the guest to exit immediately. So use of hw_error() seemed appropriate.

   What makes you think it hang? It stopped, precisely what it should do if
   you call vm_stop(). Now it is possible for vm user to investigate what
   happened and even salvage some data from guest memory.

  That was ignorance on my part on the expected behavior of vm_stop(). 
  So what you are suggesting is to stop the guest displaying an appropriate 
  error/next-steps message and have the users do any 
  data-collection/investigation 
  and then manually kill the guest, if they so desire. Right ?

  Sounds reasonable. As long as the guest is not touching the device, it 
  should be okay.
  Alex, Any comments ?

 What's the libvirt behavior when a guest goes to vm_stop?  My only
 concern would be whether the user is going to be confused by a state
 where the vm is still up, but not running.  I imagine they'll have to
 manually stop it and restart it to continue.  Thanks,

vm_stop() is already the behaviour on KVM internal errors, so management
stack knows how to handle those. Of course vfio should use meaningful
stop reason and not just reuse existing one.

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/2] x86, apicv: Add Posted Interrupt supporting

2013-02-05 Thread Gleb Natapov

On Tue, Feb 05, 2013 at 01:40:44PM +, Zhang, Yang Z wrote:
 Gleb Natapov wrote on 2013-02-05:
  On Tue, Feb 05, 2013 at 01:26:42PM +, Zhang, Yang Z wrote:
  Gleb Natapov wrote on 2013-02-05:
  On Tue, Feb 05, 2013 at 10:58:28AM +, Zhang, Yang Z wrote:
  Gleb Natapov wrote on 2013-02-05:
  On Tue, Feb 05, 2013 at 10:35:55AM +, Zhang, Yang Z wrote:
  Gleb Natapov wrote on 2013-02-05:
  On Tue, Feb 05, 2013 at 05:57:14AM +, Zhang, Yang Z wrote:
  Marcelo Tosatti wrote on 2013-02-05:
  On Mon, Feb 04, 2013 at 05:59:52PM -0200, Marcelo Tosatti wrote:
  On Mon, Feb 04, 2013 at 07:13:01PM +0200, Gleb Natapov wrote:
  On Mon, Feb 04, 2013 at 12:43:45PM -0200, Marcelo Tosatti wrote:
  Any example how software relies on such
  two-interrupts-queued-in-IRR/ISR behaviour?
  Don't know about guests, but KVM relies on it to detect
  interrupt coalescing. So if interrupt is set in IRR but not in
  PIR interrupt will not be reported as coalesced, but it will
  be coalesced during PIR-IRR merge.
  
  Yes, so:
  
  1. IRR=1, ISR=0, PIR=0. Event: set_irq, coalesced=no.
  2. IRR=0, ISR=1, PIR=0. Event: IRR-ISR transfer.
  3. vcpu outside of guest mode.
  4. IRR=1, ISR=1, PIR=0. Event: set_irq, coalesced=no.
  5. vcpu enters guest mode.
  6. IRR=1, ISR=1, PIR=1. Event: set_irq, coalesced=no.
  7. HW transfers PIR into IRR.
  
  set_irq return value at 7 is incorrect, interrupt event was _not_
  queued.
  Not sure I understand the flow of events in your description
  correctly. As I understand it at 4 set_irq() will return incorrect
  result. Basically when PIR is set to 1 while IRR has 1 for the
  vector the value of set_irq() will be incorrect.
  
  At 4 it has not been coalesced: it has been queued to IRR.
  At 6 it has been coalesced: PIR bit merged into IRR bit.
  
  Frankly I do not see how it can be fixed
  without any race with present HW PIR design.
  
  At kvm_accept_apic_interrupt, check IRR before setting PIR bit,
  if IRR already set, don't set PIR.
  
  Or:
  
  apic_accept_interrupt() {
  
  1. Read ORIG_PIR=PIR, ORIG_IRR=IRR.
  Never set IRR when HWAPIC enabled, even if outside of guest mode.
  2. Set PIR and let HW or SW VM-entry transfer it to IRR.
  3. set_irq return value: (ORIG_PIR or ORIG_IRR set).
  }
  
  Two or more concurrent set_irq can race with each other, though. Can
  either document the race or add a lock.
  According the SDM, software should not touch the IRR when target
  vcpu
  is
  running. Instead, use locked way to access PIR. So your solution may
  wrong. Then your apicv patches are broken, because they do exactly
  that.
  Which code is broken?
  
  The one that updates IRR directly on the apic page.
  No, all the updates are ensuring the target vcpu is not running. So
  it's safe to touch IRR.
  
  Not at all. Read the code.
  Sorry. I still cannot figure out which code is wrong. All the places
  call sync_pir_to_irr() are on target vcpu. Can you point out the code?
  Thanks.
  
  I am taking about vapic patches which are already in, not pir patches.
 Yes, but the issue will be fixed with pir patches. With posted interrupt, it 
 will touch PIR instead IRR and access PIR is allowed by HW.
 
So the current code is broken?

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO-PCI devices

2013-02-05 Thread Alex Williamson

On Tue, 2013-02-05 at 15:42 +0200, Gleb Natapov wrote:
 On Tue, Feb 05, 2013 at 06:37:35AM -0700, Alex Williamson wrote:
  On Tue, 2013-02-05 at 12:05 +, Pandarathil, Vijaymohan R wrote:

-Original Message-
From: linux-pci-ow...@vger.kernel.org [mailto:linux-pci-
ow...@vger.kernel.org] On Behalf Of Gleb Natapov
Sent: Tuesday, February 05, 2013 3:37 AM
To: Pandarathil, Vijaymohan R
Cc: Blue Swirl; Alex Williamson; Bjorn Helgaas; Ortiz, Lance E;
kvm@vger.kernel.org; qemu-de...@nongnu.org; linux-...@vger.kernel.org;
linux-ker...@vger.kernel.org
Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for 
VFIO-
PCI devices

On Tue, Feb 05, 2013 at 10:59:41AM +, Pandarathil, Vijaymohan R 
wrote:

  -Original Message-
  From: Gleb Natapov [mailto:g...@redhat.com]
  Sent: Tuesday, February 05, 2013 1:21 AM
  To: Pandarathil, Vijaymohan R
  Cc: Blue Swirl; Alex Williamson; Bjorn Helgaas; Ortiz, Lance E;
  kvm@vger.kernel.org; qemu-de...@nongnu.org; 
  linux-...@vger.kernel.org;
  linux-ker...@vger.kernel.org
  Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER 
  for
VFIO-
  PCI devices

  On Tue, Feb 05, 2013 at 09:05:19AM +, Pandarathil, Vijaymohan R
wrote:

-Original Message-
From: Gleb Natapov [mailto:g...@redhat.com]
Sent: Tuesday, February 05, 2013 12:05 AM
To: Blue Swirl
Cc: Pandarathil, Vijaymohan R; Alex Williamson; Bjorn Helgaas;
Ortiz,
  Lance
E; kvm@vger.kernel.org; qemu-de...@nongnu.org; linux-
  p...@vger.kernel.org;
linux-ker...@vger.kernel.org
Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support 
AER
for
  VFIO-
PCI devices

On Sun, Feb 03, 2013 at 04:36:11PM +, Blue Swirl wrote:
 On Sun, Feb 3, 2013 at 2:10 PM, Pandarathil, Vijaymohan R
 vijaymohan.pandarat...@hp.com wrote:
  - Create eventfd per vfio device assigned to a guest
and
register an
event handler

  - This fd is passed to the vfio_pci driver through 
  the
  SET_IRQ
ioctl

  - When the device encounters an error, the eventfd 
  is
  signalled
and the qemu eventfd handler gets invoked.

  - In the handler decide what action to take. Current
action
taken
is to terminate the guest.

 Usually this is not OK, but I guess this is not guest
triggerable.

Still not OK. Why not stop a guest with appropriate stop reason?

   The thinking was that since this is a hardware error, we would 
   want
to
  stop the guest at the earliest. The hw_error() routine which aborts 
  the
  qemu process was suggested by Alex and that seemed appropriate. 
  Earlier
I
  was using qemu_system_shutdown_request().  Any suggestions ?

  I am thinking vm_stop(). Stopping SMP guest (and UP too in fact)
  involves sending IPIs to other cpus running guest's vcpus. Both 
  exit()
  and vm_stop() will do it, but former is implicitly in the kernel and
  later is explicitly in QEMU.

 I had used vm_stop(RUN_STATE_SHUTDOWN) earlier in my code. But while
testing, guest ended up in a hang rather than exiting. There seems to 
some
cleanup work which is being done as part of vm_stop. In our case, we 
wanted
the guest to exit immediately. So use of hw_error() seemed appropriate.

What makes you think it hang? It stopped, precisely what it should do if
you call vm_stop(). Now it is possible for vm user to investigate what
happened and even salvage some data from guest memory.

   That was ignorance on my part on the expected behavior of vm_stop(). 
   So what you are suggesting is to stop the guest displaying an appropriate 
   error/next-steps message and have the users do any 
   data-collection/investigation 
   and then manually kill the guest, if they so desire. Right ?

   Sounds reasonable. As long as the guest is not touching the device, it 
   should be okay.
   Alex, Any comments ?

  What's the libvirt behavior when a guest goes to vm_stop?  My only
  concern would be whether the user is going to be confused by a state
  where the vm is still up, but not running.  I imagine they'll have to
  manually stop it and restart it to continue.  Thanks,

 vm_stop() is already the behaviour on KVM internal errors, so management
 stack knows how to handle those. Of course vfio should use meaningful
 stop reason and not just reuse existing one.

Ok, sounds like a good approach to me.  Thanks,

Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a

Re: KVH call agenda for 2013-02-05

2013-02-05 Thread Juan Quintela

Anthony Liguori anth...@codemonkey.ws wrote:
 Juan Quintela quint...@redhat.com writes:

 Hi

 Please send in any agenda topics you are interested in.

 FYI, I have a conflict for today so I won't be able to attend.

As this was the only answer to the agenda, today call gets cancelled.

Happy hacking, Juan.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: DMAR faults from unrelated device when vfio is used

2013-02-05 Thread Alex Williamson

On Tue, 2013-02-05 at 14:31 +0100, David Gstir wrote:
 Am Montag, den 04.02.2013, 08:49 -0700 schrieb Alex Williamson:
 
  Can you clarify what you mean by assign?  Are you actually assigning the
  root ports to the qemu guest (1c.0  1c.6)?  vfio will require they be
  owned by vfio-pci to make use of 3:00.0, but assigning them to the guest
  is not recommended.  Can you provided your qemu command line?  
 
 I did hand all of them to the guest OS. Removing 1c.0  1c.6 from the qemu 
 command line seems to have done the trick. Thanks!

Great, though I'm still not sure how we were generating those DMAR
faults.

 Here's my working qemu command line:
 qemu-kvm -no-reboot -enable-kvm -cpu host -smp 4 -m 6G \
   -drive 
 file=/home/test/qemu/images/win7_base_updated.qcow2,if=virtio,cache=none,media=disk,format=qcow2,index=0
  \
   -full-screen -no-quit -no-frame -display sdl -vnc :1 -k de -usbdevice 
 tablet \
   -vga std -global VGA.vgamem_mb=256 \
   -netdev tap,id=guest0,ifname=tap0,script=no,downscript=no \
   -net nic,netdev=guest0,model=virtio,macaddr=00:16:35:BE:EF:12  \
   -rtc base=localtime \
   -device vfio-pci,host=00:1b.0,id=audio \
   -device vfio-pci,host=00:1a.0,id=ehci1 \
   -device vfio-pci,host=00:1d.0,id=ehci2 \
   -device vfio-pci,host=03:00.0,id=xhci1 \
   -monitor tcp::,server,nowait
 
 
  We need
  to re-visit how to handle pcieport devices with vfio-pci, perhaps
  white-listing it as a vfio compatible driver, but this still should
  not interfere with devices external to the group.
  
  The DMAR fault address looks pretty bogus unless you happen to have
  100GB+ of ram in the system.
 
 Nope, definitely not. :)
 
  vfio makes use of the IOMMU API for programming DMA translations, so an
  reserved fields would have to be programmed by intel-iommu itself.  We
  could of course be passing some kind of bogus data that intel-iommu
  isn't catching.  If you're assigning the root ports to the guest, I'd
  start with that, don't do it.  Attach them to vfio, but don't give them
  to the guest.  Maybe that'll give us a hint.  I also notice that your
  USB 3 controller is dead:
  
  03:00.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host Controller 
  (rev ff) (prog-if ff)
  !!! Unknown header type 7f
  
  We only see unknown header type 7f when the read from the device returns
  -1.  This might have something to do with the root port above it (1c.6)
  being in state D3.  Windows likes to put unused devices in D3, which
  leads me to suspect you are giving it to the guest.  
 
 There error does no longer occur. lspci now shows this:
 
 -- snip --
 03:00.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host Controller 
 (rev 04) (prog-if 30 [XHCI])
   Subsystem: Intel Corporation Device 2008
   Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- 
 Stepping- SERR+ FastB2B- DisINTx+
   Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- 
 MAbort+ SERR- PERR- INTx-
   Interrupt: pin A routed to IRQ 18
   Region 0: Memory at fe50 (64-bit, non-prefetchable) [disabled] 
 [size=8K]
   Capabilities: [50] Power Management version 3
   Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA 
 PME(D0+,D1-,D2-,D3hot+,D3cold+)
   Status: D3 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
   Capabilities: [70] MSI: Enable- Count=1/8 Maskable- 64bit+
   Address:   Data: 
   Capabilities: [90] MSI-X: Enable- Count=8 Masked-
   Vector table: BAR=0 offset=1000
   PBA: BAR=0 offset=1080
   Capabilities: [a0] Express (v2) Endpoint, MSI 00
   DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s 
 unlimited, L1 unlimited
   ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
   DevCtl: Report errors: Correctable- Non-Fatal- Fatal- 
 Unsupported-
   RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
   MaxPayload 128 bytes, MaxReadReq 128 bytes
   DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ 
 TransPend-
   LnkCap: Port #0, Speed 5GT/s, Width x1, ASPM L0s L1, Latency L0 
 4us, L1 unlimited
   ClockPM+ Surprise- LLActRep- BwNot-
   LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
   ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
   LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ 
 DLActive- BWMgmt- ABWMgmt-
   DevCap2: Completion Timeout: Not Supported, TimeoutDis+
   DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
   LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-, 
 Selectable De-emphasis: -6dB
Transmit Margin: Normal Operating Range, 
 EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
   LnkSta2: Current De-emphasis Level:

Q: Why not use struct mm_struct to manage guest physical addresses in new port?

2013-02-05 Thread David Daney


Hi,

I am starting to working on a port of KVM to an architecture that has a 
dual TLB.  The Guest Virtual Addresses (GVA) are translated to Guest 
Physical Addresses (GPA) by the first TLB, then a second TLB translates 
the GPA to a Root Physical Address (RPA).  For the sake of this 
question, we will ignore the GVA-GPA TLB and consider only the GPA-RPA 
TLB.


I seems that most existing ports have a bunch of custom code that 
manages the GPA-RPA TLB and page tables.


Here is what I would like to try:  Create a mm for the GPA-RPA mappings 
each vma would have a fault handler that calls gfn_to_pfn() to look up 
the proper page.  In kvm_arch_vcpu_ioctl_run() we would call switch_mm() 
to this new 'gva_mm'.  Upon exiting guest mode we would switch back to 
the original mm of the controlling process.


For me the benefit of this approach is that all the code that manages 
the TLB is already implemented and works well for struct mm_struct.  The 
only thing I need to do is write a vma fault handler.  That is a lot 
easier and less error prone than maintaining a parallel TLB management 
framework and making sure it interacts properly with the existing TLB 
code for 'normal' processes.



Q1: Am I crazy for wanting to try this?

Q2: Have others tried this and rejected it?  What were the reasons?


Thanks in advance,
David Daney
Cavium, Inc.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: DMAR faults from unrelated device when vfio is used

2013-02-05 Thread Alex Williamson

On Tue, 2013-02-05 at 08:37 -0700, Alex Williamson wrote:
 On Tue, 2013-02-05 at 14:31 +0100, David Gstir wrote:
  Am Montag, den 04.02.2013, 08:49 -0700 schrieb Alex Williamson:
  
   Can you clarify what you mean by assign?  Are you actually assigning the
   root ports to the qemu guest (1c.0  1c.6)?  vfio will require they be
   owned by vfio-pci to make use of 3:00.0, but assigning them to the guest
   is not recommended.  Can you provided your qemu command line?  
  
  I did hand all of them to the guest OS. Removing 1c.0  1c.6 from the qemu 
  command line seems to have done the trick. Thanks!
 
 Great, though I'm still not sure how we were generating those DMAR
 faults.
 
  Here's my working qemu command line:
  qemu-kvm -no-reboot -enable-kvm -cpu host -smp 4 -m 6G \
-drive 
  file=/home/test/qemu/images/win7_base_updated.qcow2,if=virtio,cache=none,media=disk,format=qcow2,index=0
   \
-full-screen -no-quit -no-frame -display sdl -vnc :1 -k de -usbdevice 
  tablet \
-vga std -global VGA.vgamem_mb=256 \
-netdev tap,id=guest0,ifname=tap0,script=no,downscript=no \
-net nic,netdev=guest0,model=virtio,macaddr=00:16:35:BE:EF:12  \
-rtc base=localtime \
-device vfio-pci,host=00:1b.0,id=audio \
-device vfio-pci,host=00:1a.0,id=ehci1 \
-device vfio-pci,host=00:1d.0,id=ehci2 \
-device vfio-pci,host=03:00.0,id=xhci1 \
-monitor tcp::,server,nowait
  
  
   We need
   to re-visit how to handle pcieport devices with vfio-pci, perhaps
   white-listing it as a vfio compatible driver, but this still should
   not interfere with devices external to the group.
   
   The DMAR fault address looks pretty bogus unless you happen to have
   100GB+ of ram in the system.
  
  Nope, definitely not. :)
  
   vfio makes use of the IOMMU API for programming DMA translations, so an
   reserved fields would have to be programmed by intel-iommu itself.  We
   could of course be passing some kind of bogus data that intel-iommu
   isn't catching.  If you're assigning the root ports to the guest, I'd
   start with that, don't do it.  Attach them to vfio, but don't give them
   to the guest.  Maybe that'll give us a hint.  I also notice that your
   USB 3 controller is dead:
   
   03:00.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host Controller 
   (rev ff) (prog-if ff)
 !!! Unknown header type 7f
   
   We only see unknown header type 7f when the read from the device returns
   -1.  This might have something to do with the root port above it (1c.6)
   being in state D3.  Windows likes to put unused devices in D3, which
   leads me to suspect you are giving it to the guest.  
  
  There error does no longer occur. lspci now shows this:
  
  -- snip --
  03:00.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host Controller 
  (rev 04) (prog-if 30 [XHCI])
  Subsystem: Intel Corporation Device 2008
  Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- 
  Stepping- SERR+ FastB2B- DisINTx+
  Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- 
  MAbort+ SERR- PERR- INTx-
  Interrupt: pin A routed to IRQ 18
  Region 0: Memory at fe50 (64-bit, non-prefetchable) [disabled] 
  [size=8K]
  Capabilities: [50] Power Management version 3
  Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA 
  PME(D0+,D1-,D2-,D3hot+,D3cold+)
  Status: D3 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
  Capabilities: [70] MSI: Enable- Count=1/8 Maskable- 64bit+
  Address:   Data: 
  Capabilities: [90] MSI-X: Enable- Count=8 Masked-
  Vector table: BAR=0 offset=1000
  PBA: BAR=0 offset=1080
  Capabilities: [a0] Express (v2) Endpoint, MSI 00
  DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s 
  unlimited, L1 unlimited
  ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
  DevCtl: Report errors: Correctable- Non-Fatal- Fatal- 
  Unsupported-
  RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
  MaxPayload 128 bytes, MaxReadReq 128 bytes
  DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ 
  TransPend-
  LnkCap: Port #0, Speed 5GT/s, Width x1, ASPM L0s L1, Latency L0 
  4us, L1 unlimited
  ClockPM+ Surprise- LLActRep- BwNot-
  LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
  ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
  LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ 
  DLActive- BWMgmt- ABWMgmt-
  DevCap2: Completion Timeout: Not Supported, TimeoutDis+
  DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
  LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-, 
  Selectable De-emphasis: -6dB
   Transmit Margin: Normal Operating Range, 
  EnterModifiedCompliance- ComplianceSOS-

Re: DMAR faults from unrelated device when vfio is used

2013-02-05 Thread Richard Weinberger

Am Tue, 05 Feb 2013 13:36:53 -0700
schrieb Alex Williamson alex.william...@redhat.com:
  Ugh, the infamous and useless error 10.  It could be anything.
  I've got a system with onboard usb3, let me see what windows does
  with it here first.  Thanks,
 
 Well, I've got an Etron USB3 HBA and (un)fortunately it works just
 fine with a Win7 guest.  There's really nothing special about USB
 controllers from a PCI device assignment perspective.  Have you tried
 the latest upstream qemu bits?  Thanks,

We tried also qemu v1.4.0-rc0 (git as of today) without success.
As next step we'll test Linux as guest, maybe it is more chatty than
Windows regarding the issue. :-)

Thanks,
//richard
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 2/4] Expand the steal time msr to also contain the consigned time.

2013-02-05 Thread Michael Wolf

Expand the steal time msr to also contain the consigned time.

Signed-off-by: Michael Wolf m...@linux.vnet.ibm.com
---
 arch/x86/include/asm/paravirt.h   |4 ++--
 arch/x86/include/asm/paravirt_types.h |2 +-
 arch/x86/kernel/kvm.c |7 ++-
 kernel/sched/core.c   |   10 +-
 kernel/sched/cputime.c|2 +-
 5 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 5edd174..9b753ea 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -196,9 +196,9 @@ struct static_key;
 extern struct static_key paravirt_steal_enabled;
 extern struct static_key paravirt_steal_rq_enabled;
 
-static inline u64 paravirt_steal_clock(int cpu)
+static inline void paravirt_steal_clock(int cpu, u64 *steal)
 {
-   return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu);
+   PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal);
 }
 
 static inline unsigned long long paravirt_read_pmc(int counter)
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 142236e..5d4fc8b 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -95,7 +95,7 @@ struct pv_lazy_ops {
 
 struct pv_time_ops {
unsigned long long (*sched_clock)(void);
-   unsigned long long (*steal_clock)(int cpu);
+   void (*steal_clock)(int cpu, unsigned long long *steal);
unsigned long (*get_tsc_khz)(void);
 };
 
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index fe75a28..89e5468 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -386,9 +386,8 @@ static struct notifier_block kvm_pv_reboot_nb = {
.notifier_call = kvm_pv_reboot_notify,
 };
 
-static u64 kvm_steal_clock(int cpu)
+static void kvm_steal_clock(int cpu, u64 *steal)
 {
-   u64 steal;
struct kvm_steal_time *src;
int version;
 
@@ -396,11 +395,9 @@ static u64 kvm_steal_clock(int cpu)
do {
version = src-version;
rmb();
-   steal = src-steal;
+   *steal = src-steal;
rmb();
} while ((version  1) || (version != src-version));
-
-   return steal;
 }
 
 void kvm_disable_steal_time(void)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 26058d0..efc2652 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -757,6 +757,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
  */
 #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || 
defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
s64 steal = 0, irq_delta = 0;
+   u64 consigned = 0;
 #endif
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
irq_delta = irq_time_read(cpu_of(rq)) - rq-prev_irq_time;
@@ -785,8 +786,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
if (static_key_false((paravirt_steal_rq_enabled))) {
u64 st;
+   u64 cs;
 
-   steal = paravirt_steal_clock(cpu_of(rq));
+   paravirt_steal_clock(cpu_of(rq), steal, consigned);
+   /*
+* since we are not assigning the steal time to cpustats
+* here, just combine the steal and consigned times to
+* do the rest of the calculations.
+*/
+   steal += consigned;
steal -= rq-prev_steal_time_rq;
 
if (unlikely(steal  delta))
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 825a956..0b4f1ec 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -275,7 +275,7 @@ static __always_inline bool steal_account_process_tick(void)
if (static_key_false(paravirt_steal_enabled)) {
u64 steal, st = 0;
 
-   steal = paravirt_steal_clock(smp_processor_id());
+   paravirt_steal_clock(smp_processor_id(), steal);
steal -= this_rq()-prev_steal_time;
 
st = steal_ticks(steal);

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 3/4] Add the code to send the consigned time from the host to the guest

2013-02-05 Thread Michael Wolf

Change the paravirt calls that retrieve the steal-time information
from the host.  Add to it getting the consigned value as well as
the steal time.

Signed-off-by: Michael Wolf m...@linux.vnet.ibm.com
---
 arch/x86/include/asm/kvm_host.h  |1 +
 arch/x86/include/asm/paravirt.h  |4 ++--
 arch/x86/include/uapi/asm/kvm_para.h |3 ++-
 arch/x86/kernel/kvm.c|3 ++-
 arch/x86/kernel/paravirt.c   |4 ++--
 arch/x86/kvm/x86.c   |2 ++
 include/linux/kernel_stat.h  |1 +
 kernel/sched/cputime.c   |   21 +++--
 kernel/sched/sched.h |2 ++
 9 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index dc87b65..fe5a37b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -428,6 +428,7 @@ struct kvm_vcpu_arch {
u64 msr_val;
u64 last_steal;
u64 accum_steal;
+   u64 accum_consigned;
struct gfn_to_hva_cache stime;
struct kvm_steal_time steal;
} st;
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 9b753ea..77f05e7 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -196,9 +196,9 @@ struct static_key;
 extern struct static_key paravirt_steal_enabled;
 extern struct static_key paravirt_steal_rq_enabled;
 
-static inline void paravirt_steal_clock(int cpu, u64 *steal)
+static inline void paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned)
 {
-   PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal);
+   PVOP_VCALL3(pv_time_ops.steal_clock, cpu, steal, consigned);
 }
 
 static inline unsigned long long paravirt_read_pmc(int counter)
diff --git a/arch/x86/include/uapi/asm/kvm_para.h 
b/arch/x86/include/uapi/asm/kvm_para.h
index 06fdbd9..55d617f 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -42,9 +42,10 @@
 
 struct kvm_steal_time {
__u64 steal;
+   __u64 consigned;
__u32 version;
__u32 flags;
-   __u32 pad[12];
+   __u32 pad[10];
 };
 
 #define KVM_STEAL_ALIGNMENT_BITS 5
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 89e5468..fb52f8a 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -386,7 +386,7 @@ static struct notifier_block kvm_pv_reboot_nb = {
.notifier_call = kvm_pv_reboot_notify,
 };
 
-static void kvm_steal_clock(int cpu, u64 *steal)
+static void kvm_steal_clock(int cpu, u64 *steal, u64 *consigned)
 {
struct kvm_steal_time *src;
int version;
@@ -396,6 +396,7 @@ static void kvm_steal_clock(int cpu, u64 *steal)
version = src-version;
rmb();
*steal = src-steal;
+   *consigned = src-consigned;
rmb();
} while ((version  1) || (version != src-version));
 }
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 17fff18..3797683 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -207,9 +207,9 @@ static void native_flush_tlb_single(unsigned long addr)
 struct static_key paravirt_steal_enabled;
 struct static_key paravirt_steal_rq_enabled;
 
-static u64 native_steal_clock(int cpu)
+static void native_steal_clock(int cpu, u64 *steal, u64 *consigned)
 {
-   return 0;
+   *steal = *consigned = 0;
 }
 
 /* These are in entry.S */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c243b81..51b63d1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1867,8 +1867,10 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
return;
 
vcpu-arch.st.steal.steal += vcpu-arch.st.accum_steal;
+   vcpu-arch.st.steal.consigned += vcpu-arch.st.accum_consigned;
vcpu-arch.st.steal.version += 2;
vcpu-arch.st.accum_steal = 0;
+   vcpu-arch.st.accum_consigned = 0;
 
kvm_write_guest_cached(vcpu-kvm, vcpu-arch.st.stime,
vcpu-arch.st.steal, sizeof(struct kvm_steal_time));
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index e352052..f58ed0f 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -126,6 +126,7 @@ extern unsigned long long task_delta_exec(struct 
task_struct *);
 extern void account_user_time(struct task_struct *, cputime_t, cputime_t);
 extern void account_system_time(struct task_struct *, int, cputime_t, 
cputime_t);
 extern void account_steal_time(cputime_t);
+extern void account_consigned_time(cputime_t);
 extern void account_idle_time(cputime_t);
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 0b4f1ec..2a2d4be 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -244,6 +244,18 @@ void account_system_time(struct task_struct *p, int 
hardirq_offset,
 }
 
 /*

[PATCH 4/4] Add a timer to allow the separation of consigned from steal time.

2013-02-05 Thread Michael Wolf

Add a helper routine to scheduler/core.c to allow the kvm module
to retrieve the cpu hardlimit settings.  The values will be used
to set up a timer that is used to separate the consigned from the
steal time.

Signed-off-by: Michael Wolf m...@linux.vnet.ibm.com
---
 arch/x86/include/asm/kvm_host.h |9 ++
 arch/x86/kvm/x86.c  |   62 ++-
 kernel/sched/core.c |   20 +
 3 files changed, 90 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index fe5a37b..9518613 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -355,6 +355,15 @@ struct kvm_vcpu_arch {
bool tpr_access_reporting;
 
/*
+* timer used to determine if the time should be counted as
+* steal time or consigned time.
+*/
+   struct hrtimer steal_timer;
+   u64 current_consigned;
+   s64 consigned_quota;
+   s64 consigned_period;
+
+   /*
 * Paging state of the vcpu
 *
 * If the vcpu runs in guest mode with two level paging this still saves
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 51b63d1..79d144d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1848,13 +1848,32 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu)
 static void accumulate_steal_time(struct kvm_vcpu *vcpu)
 {
u64 delta;
+   u64 steal_delta;
+   u64 consigned_delta;
 
if (!(vcpu-arch.st.msr_val  KVM_MSR_ENABLED))
return;
 
delta = current-sched_info.run_delay - vcpu-arch.st.last_steal;
vcpu-arch.st.last_steal = current-sched_info.run_delay;
-   vcpu-arch.st.accum_steal = delta;
+
+   /* split the delta into steal and consigned */
+   if (vcpu-arch.current_consigned  vcpu-arch.consigned_quota) {
+   vcpu-arch.current_consigned += delta;
+   if (vcpu-arch.current_consigned  vcpu-arch.consigned_quota) {
+   steal_delta = vcpu-arch.current_consigned
+   -  vcpu-arch.consigned_quota;
+   consigned_delta = delta - steal_delta;
+   } else {
+   consigned_delta = delta;
+   steal_delta = 0;
+   }
+   } else {
+   consigned_delta = 0;
+   steal_delta = delta;
+   }
+   vcpu-arch.st.accum_steal = steal_delta;
+   vcpu-arch.st.accum_consigned = consigned_delta;
 }
 
 static void record_steal_time(struct kvm_vcpu *vcpu)
@@ -2629,8 +2648,35 @@ static bool need_emulate_wbinvd(struct kvm_vcpu *vcpu)
!(vcpu-kvm-arch.iommu_flags  KVM_IOMMU_CACHE_COHERENCY);
 }
 
+extern int sched_use_hard_capping(int cpuid, int num_vcpus, s64 *quota,
+   s64 *period);
+enum hrtimer_restart steal_timer_fn(struct hrtimer *data)
+{
+   struct kvm_vcpu *vcpu;
+   struct kvm *kvm;
+   int num_vcpus;
+   ktime_t now;
+
+   vcpu = container_of(data, struct kvm_vcpu, arch.steal_timer);
+   kvm = vcpu-kvm;
+   num_vcpus = atomic_read(kvm-online_vcpus);
+   sched_use_hard_capping(vcpu-cpu, num_vcpus,
+   vcpu-arch.consigned_quota,
+   vcpu-arch.consigned_period);
+   vcpu-arch.current_consigned = 0;
+   now = ktime_get();
+   hrtimer_forward(vcpu-arch.steal_timer, now,
+   ktime_set(0, vcpu-arch.consigned_period));
+
+   return HRTIMER_RESTART;
+}
+
 void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
+   struct kvm *kvm;
+   int num_vcpus;
+   ktime_t ktime;
+
/* Address WBINVD may be executed by guest */
if (need_emulate_wbinvd(vcpu)) {
if (kvm_x86_ops-has_wbinvd_exit())
@@ -2670,6 +2716,18 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
kvm_migrate_timers(vcpu);
vcpu-cpu = cpu;
}
+   /* Initialize and start a timer to capture steal and consigned time */
+   kvm = vcpu-kvm;
+   num_vcpus = atomic_read(kvm-online_vcpus);
+   num_vcpus = (num_vcpus == 0) ? 1 : num_vcpus;
+   sched_use_hard_capping(vcpu-cpu, num_vcpus,
+   vcpu-arch.consigned_quota,
+   vcpu-arch.consigned_period);
+   hrtimer_init(vcpu-arch.steal_timer, CLOCK_MONOTONIC,
+   HRTIMER_MODE_REL);
+   vcpu-arch.steal_timer.function = steal_timer_fn;
+   ktime = ktime_set(0, vcpu-arch.consigned_period);
+   hrtimer_start(vcpu-arch.steal_timer, ktime, HRTIMER_MODE_REL);
 
accumulate_steal_time(vcpu);
kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
@@ -2680,6 +2738,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
kvm_x86_ops-vcpu_put(vcpu);
kvm_put_guest_fpu(vcpu);
vcpu-arch.last_host_tsc

[PATCH 0/4] Alter steal-time reporting in the guest

2013-02-05 Thread Michael Wolf

In the case of where you have a system that is running in a
capped or overcommitted environment the user may see steal time
being reported in accounting tools such as top or vmstat.  This can
cause confusion for the end user.  To ease the confusion this patch set
adds the idea of consigned (expected steal) time.  The host will separate
the consigned time from the steal time.  Tthe steal time will only be altered
if hard limits (cfs bandwidth control) is used.  The period and the quota used
to separate the consigned time (expected steal) from the steal time are taken
from the cfs bandwidth control settings. Any other steal time accruing during
that period will show as the traditional steal time.

Changes from V2:
* Dropped the ioctl that allowed qemu to send the entitlement value to
  the guest.
* Added code to get the entitlement period and quota from cfs bandwidth.

Changes from V1:
* Removed the steal time allowed percentage from the guest
* Moved the separation of consigned (expected steal) and steal time to the
  host.
* No longer include a sysctl interface.

---

Michael Wolf (4):
  Alter the amount of steal time reported by the guest.
  Expand the steal time msr to also contain the consigned time.
  Add the code to send the consigned time from the host to the guest
  Add a timer to allow the separation of consigned from steal time.


 arch/x86/include/asm/kvm_host.h   |   10 +
 arch/x86/include/asm/paravirt.h   |4 +-
 arch/x86/include/asm/paravirt_types.h |2 +
 arch/x86/include/uapi/asm/kvm_para.h  |3 +-
 arch/x86/kernel/kvm.c |8 ++--
 arch/x86/kernel/paravirt.c|4 +-
 arch/x86/kvm/x86.c|   64 -
 fs/proc/stat.c|9 -
 include/linux/kernel_stat.h   |2 +
 kernel/sched/core.c   |   30 +++
 kernel/sched/cputime.c|   21 ++-
 kernel/sched/sched.h  |2 +
 12 files changed, 142 insertions(+), 17 deletions(-)

-- 
Signature

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 1/4] Alter the amount of steal time reported by the guest.

2013-02-05 Thread Michael Wolf

Modify the amount of stealtime that the kernel reports via the /proc interface.
Steal time will now be broken down into steal_time and consigned_time.
Consigned_time will represent the amount of time that is expected to be lost
due to overcommitment of the physical cpu or by using cpu hard capping.

Signed-off-by: Michael Wolf m...@linux.vnet.ibm.com
---
 fs/proc/stat.c  |9 +++--
 include/linux/kernel_stat.h |1 +
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index e296572..cb7fe80 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -82,7 +82,7 @@ static int show_stat(struct seq_file *p, void *v)
int i, j;
unsigned long jif;
u64 user, nice, system, idle, iowait, irq, softirq, steal;
-   u64 guest, guest_nice;
+   u64 guest, guest_nice, consign;
u64 sum = 0;
u64 sum_softirq = 0;
unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
@@ -90,10 +90,11 @@ static int show_stat(struct seq_file *p, void *v)
 
user = nice = system = idle = iowait =
irq = softirq = steal = 0;
-   guest = guest_nice = 0;
+   guest = guest_nice = consign = 0;
getboottime(boottime);
jif = boottime.tv_sec;
 
+
for_each_possible_cpu(i) {
user += kcpustat_cpu(i).cpustat[CPUTIME_USER];
nice += kcpustat_cpu(i).cpustat[CPUTIME_NICE];
@@ -105,6 +106,7 @@ static int show_stat(struct seq_file *p, void *v)
steal += kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
guest += kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
guest_nice += kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE];
+   consign += kcpustat_cpu(i).cpustat[CPUTIME_CONSIGN];
sum += kstat_cpu_irqs_sum(i);
sum += arch_irq_stat_cpu(i);
 
@@ -128,6 +130,7 @@ static int show_stat(struct seq_file *p, void *v)
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(steal));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice));
+   seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(consign));
seq_putc(p, '\n');
 
for_each_online_cpu(i) {
@@ -142,6 +145,7 @@ static int show_stat(struct seq_file *p, void *v)
steal = kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
guest = kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
guest_nice = kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE];
+   consign = kcpustat_cpu(i).cpustat[CPUTIME_CONSIGN];
seq_printf(p, cpu%d, i);
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(user));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(nice));
@@ -153,6 +157,7 @@ static int show_stat(struct seq_file *p, void *v)
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(steal));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice));
+   seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(consign));
seq_putc(p, '\n');
}
seq_printf(p, intr %llu, (unsigned long long)sum);
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 66b7078..e352052 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -28,6 +28,7 @@ enum cpu_usage_stat {
CPUTIME_STEAL,
CPUTIME_GUEST,
CPUTIME_GUEST_NICE,
+   CPUTIME_CONSIGN,
NR_STATS,
 };
 

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v2] KVM: VMX: disable SMEP feature when guest is in non-paging mode

2013-02-05 Thread Marcelo Tosatti

On Mon, Feb 04, 2013 at 11:50:43AM +0800, Dongxiao Xu wrote:
 Changes from v1 to v2:
  - Modify commit message and comments according to Gleb's suggestions.
 
 SMEP is disabled if CPU is in non-paging mode in hardware.
 However KVM always uses paging mode to emulate guest non-paging
 mode with TDP. To emulate this behavior, SMEP needs to be manually
 disabled when guest switches to non-paging mode.
 
 We met an issue that, SMP Linux guest with recent kernel (enable
 SMEP support, for example, 3.5.3) would crash with triple fault if
 setting unrestricted_guest=0. This is because KVM uses an identity
 mapping page table to emulate the non-paging mode, where the page
 table is set with USER flag. If SMEP is still enabled in this case,
 guest will meet unhandlable page fault and then crash.
 
 Signed-off-by: Dongxiao Xu dongxiao...@intel.com
 Signed-off-by: Xiantao Zhang xiantao.zh...@intel.com
 ---
  arch/x86/kvm/vmx.c |8 
  1 files changed, 8 insertions(+), 0 deletions(-)

Applied, thanks.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] KVM: VMX: add missing exit names to VMX_EXIT_REASONS array

2013-02-05 Thread Marcelo Tosatti


Applied, thanks.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] KVM: Remove duplicate text in api.txt

2013-02-05 Thread Marcelo Tosatti

On Thu, Jan 31, 2013 at 12:06:08PM -0800, Geoff Levand wrote:
 Signed-off-by: Geoff Levand ge...@infradead.org
 ---
 
 Saw this in v3.8-rc5, please apply.
 
  Documentation/virtual/kvm/api.txt |   13 -
  1 file changed, 13 deletions(-)

Applied, thanks.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] tcm_vhost: Multi-queue support

2013-02-05 Thread Asias He

This adds virtio-scsi multi-queue support to tcm_vhost.

Guest side virtio-scsi multi-queue support can be found here:

   https://lkml.org/lkml/2012/12/18/166

Some initial perf numbers:
1 queue,  4 targets, 1 lun per target
4K request size, 50% randread + 50% randwrite: 127K/127k IOPS

4 queues, 4 targets, 1 lun per target
4K request size, 50% randread + 50% randwrite: 181K/181k IOPS

Signed-off-by: Asias He as...@redhat.com
---
 drivers/vhost/tcm_vhost.c | 46 +-
 drivers/vhost/tcm_vhost.h |  2 ++
 2 files changed, 31 insertions(+), 17 deletions(-)

diff --git a/drivers/vhost/tcm_vhost.c b/drivers/vhost/tcm_vhost.c
index 81ecda5..9951297 100644
--- a/drivers/vhost/tcm_vhost.c
+++ b/drivers/vhost/tcm_vhost.c
@@ -48,6 +48,7 @@
 #include linux/virtio_net.h /* TODO vhost.h currently depends on this */
 #include linux/virtio_scsi.h
 #include linux/llist.h
+#include linux/bitmap.h
 
 #include vhost.c
 #include vhost.h
@@ -59,7 +60,8 @@ enum {
VHOST_SCSI_VQ_IO = 2,
 };
 
-#define VHOST_SCSI_MAX_TARGET 256
+#define VHOST_SCSI_MAX_TARGET  256
+#define VHOST_SCSI_MAX_VQ  128
 
 struct vhost_scsi {
/* Protected by vhost_scsi-dev.mutex */
@@ -68,7 +70,7 @@ struct vhost_scsi {
bool vs_endpoint;
 
struct vhost_dev dev;
-   struct vhost_virtqueue vqs[3];
+   struct vhost_virtqueue vqs[VHOST_SCSI_MAX_VQ];
 
struct vhost_work vs_completion_work; /* cmd completion work item */
struct llist_head vs_completion_list; /* cmd completion queue */
@@ -366,12 +368,14 @@ static void vhost_scsi_complete_cmd_work(struct 
vhost_work *work)
 {
struct vhost_scsi *vs = container_of(work, struct vhost_scsi,
vs_completion_work);
+   DECLARE_BITMAP(signal, VHOST_SCSI_MAX_VQ);
struct virtio_scsi_cmd_resp v_rsp;
struct tcm_vhost_cmd *tv_cmd;
struct llist_node *llnode;
struct se_cmd *se_cmd;
-   int ret;
+   int ret, vq;
 
+   bitmap_zero(signal, VHOST_SCSI_MAX_VQ);
llnode = llist_del_all(vs-vs_completion_list);
while (llnode) {
tv_cmd = llist_entry(llnode, struct tcm_vhost_cmd,
@@ -390,15 +394,20 @@ static void vhost_scsi_complete_cmd_work(struct 
vhost_work *work)
memcpy(v_rsp.sense, tv_cmd-tvc_sense_buf,
   v_rsp.sense_len);
ret = copy_to_user(tv_cmd-tvc_resp, v_rsp, sizeof(v_rsp));
-   if (likely(ret == 0))
-   vhost_add_used(vs-vqs[2], tv_cmd-tvc_vq_desc, 0);
-   else
+   if (likely(ret == 0)) {
+   vhost_add_used(tv_cmd-tvc_vq, tv_cmd-tvc_vq_desc, 0);
+   vq = tv_cmd-tvc_vq - vs-vqs;
+   __set_bit(vq, signal);
+   } else
pr_err(Faulted on virtio_scsi_cmd_resp\n);
 
vhost_scsi_free_cmd(tv_cmd);
}
 
-   vhost_signal(vs-dev, vs-vqs[2]);
+   vq = -1;
+   while ((vq = find_next_bit(signal, VHOST_SCSI_MAX_VQ, vq + 1))
+VHOST_SCSI_MAX_VQ)
+   vhost_signal(vs-dev, vs-vqs[vq]);
 }
 
 static struct tcm_vhost_cmd *vhost_scsi_allocate_cmd(
@@ -561,9 +570,9 @@ static void tcm_vhost_submission_work(struct work_struct 
*work)
}
 }
 
-static void vhost_scsi_handle_vq(struct vhost_scsi *vs)
+static void vhost_scsi_handle_vq(struct vhost_scsi *vs,
+   struct vhost_virtqueue *vq)
 {
-   struct vhost_virtqueue *vq = vs-vqs[2];
struct virtio_scsi_cmd_req v_req;
struct tcm_vhost_tpg *tv_tpg;
struct tcm_vhost_cmd *tv_cmd;
@@ -656,7 +665,7 @@ static void vhost_scsi_handle_vq(struct vhost_scsi *vs)
ret = __copy_to_user(resp, rsp, sizeof(rsp));
if (!ret)
vhost_add_used_and_signal(vs-dev,
-   vs-vqs[2], head, 0);
+ vq, head, 0);
else
pr_err(Faulted on virtio_scsi_cmd_resp\n);
 
@@ -678,6 +687,7 @@ static void vhost_scsi_handle_vq(struct vhost_scsi *vs)
: %d\n, tv_cmd, exp_data_len, data_direction);
 
tv_cmd-tvc_vhost = vs;
+   tv_cmd-tvc_vq = vq;
 
if (unlikely(vq-iov[out].iov_len !=
sizeof(struct virtio_scsi_cmd_resp))) {
@@ -758,7 +768,7 @@ static void vhost_scsi_handle_kick(struct vhost_work *work)
poll.work);
struct vhost_scsi *vs = container_of(vq-dev, struct vhost_scsi, dev);
 
-   vhost_scsi_handle_vq(vs);
+   vhost_scsi_handle_vq(vs, vq);
 }
 
 /*
@@ -879,7 +889,7 @@ err:
 static int vhost_scsi_open(struct inode *inode, struct file *f)
 {
struct vhost_scsi *s;
-   int r;
+   int r, i;
 
s =

Re: [PATCH] kvm tools: Fix powerpc build after kvm__dump_mem() change

2013-02-05 Thread Asias He

On Mon, Feb 4, 2013 at 9:37 PM, Will Deacon will.dea...@arm.com wrote:
 On Mon, Feb 04, 2013 at 12:45:53PM +, Pekka Enberg wrote:
 On Mon, Feb 4, 2013 at 2:17 PM, Michael Ellerman mich...@ellerman.id.au 
 wrote:
  Commit 21692d1 (Beautify debug output) broke the powerpc build because
  it changed the signature for kvm__dump_mem() but didn't update all callers.
 
  Signed-off-by: Michael Ellerman mich...@ellerman.id.au

 Applied, thanks Michael!

 D'oh! I was about to post a patch to fix this (arm is also broken)... I've
 dropped the powerpc hunk from my patch (see below).

Thanks, Michael and Will!

 Cheers,

 Will

 ---8

 From 9cdc8ecfdd2a1ec3075c0a129e934433f94c29dc Mon Sep 17 00:00:00 2001
 From: Will Deacon will.dea...@arm.com
 Date: Sun, 3 Feb 2013 15:37:48 +
 Subject: [PATCH] kvm tools: arm: fix fallout from debug_fd refactoring

 Commit 21692d19f744 (kvm tools: Beautify debug output) changed the
 kvm__dump_mem prototype but only fixed up calls from x86.

 This patch fixes arm to pass the debug_fd as required.

 Signed-off-by: Will Deacon will.dea...@arm.com
 ---
  tools/kvm/arm/aarch32/kvm-cpu.c | 11 +--
  tools/kvm/arm/aarch64/kvm-cpu.c | 11 +--
  2 files changed, 10 insertions(+), 12 deletions(-)

 diff --git a/tools/kvm/arm/aarch32/kvm-cpu.c b/tools/kvm/arm/aarch32/kvm-cpu.c
 index a528789..6a012db 100644
 --- a/tools/kvm/arm/aarch32/kvm-cpu.c
 +++ b/tools/kvm/arm/aarch32/kvm-cpu.c
 @@ -54,25 +54,24 @@ void kvm_cpu__show_code(struct kvm_cpu *vcpu)
  {
 struct kvm_one_reg reg;
 u32 data;
 +   int debug_fd = kvm_cpu__get_debug_fd();

 reg.addr = (u64)(unsigned long)data;

 -   printf(*pc:\n);
 +   dprintf(debug_fd, \n*pc:\n);
 reg.id = ARM_CORE_REG(usr_regs.ARM_pc);
 if (ioctl(vcpu-vcpu_fd, KVM_GET_ONE_REG, reg)  0)
 die(KVM_GET_ONE_REG failed (show_code @ PC));

 -   kvm__dump_mem(vcpu-kvm, data, 32);
 -   printf(\n);
 +   kvm__dump_mem(vcpu-kvm, data, 32, debug_fd);

 -   printf(*lr (svc):\n);
 +   dprintf(debug_fd, \n*lr (svc):\n);
 reg.id = ARM_CORE_REG(svc_regs[1]);
 if (ioctl(vcpu-vcpu_fd, KVM_GET_ONE_REG, reg)  0)
 die(KVM_GET_ONE_REG failed (show_code @ LR_svc));
 data = ~0x1;

 -   kvm__dump_mem(vcpu-kvm, data, 32);
 -   printf(\n);
 +   kvm__dump_mem(vcpu-kvm, data, 32, debug_fd);
  }

  void kvm_cpu__show_registers(struct kvm_cpu *vcpu)
 diff --git a/tools/kvm/arm/aarch64/kvm-cpu.c b/tools/kvm/arm/aarch64/kvm-cpu.c
 index 2eb06ea..7cdcb70 100644
 --- a/tools/kvm/arm/aarch64/kvm-cpu.c
 +++ b/tools/kvm/arm/aarch64/kvm-cpu.c
 @@ -109,24 +109,23 @@ void kvm_cpu__show_code(struct kvm_cpu *vcpu)
  {
 struct kvm_one_reg reg;
 unsigned long data;
 +   int debug_fd = kvm_cpu__get_debug_fd();

 reg.addr = (u64)data;

 -   printf(*pc:\n);
 +   dprintf(debug_fd, \n*pc:\n);
 reg.id = ARM64_CORE_REG(regs.pc);
 if (ioctl(vcpu-vcpu_fd, KVM_GET_ONE_REG, reg)  0)
 die(KVM_GET_ONE_REG failed (show_code @ PC));

 -   kvm__dump_mem(vcpu-kvm, data, 32);
 -   printf(\n);
 +   kvm__dump_mem(vcpu-kvm, data, 32, debug_fd);

 -   printf(*lr:\n);
 +   dprintf(debug_fd, \n*lr:\n);
 reg.id = ARM64_CORE_REG(regs.regs[30]);
 if (ioctl(vcpu-vcpu_fd, KVM_GET_ONE_REG, reg)  0)
 die(KVM_GET_ONE_REG failed (show_code @ LR));

 -   kvm__dump_mem(vcpu-kvm, data, 32);
 -   printf(\n);
 +   kvm__dump_mem(vcpu-kvm, data, 32, debug_fd);
  }

  void kvm_cpu__show_registers(struct kvm_cpu *vcpu)
 --
 1.8.0




-- 
Asias He
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] tcm_vhost: Multi-queue support

2013-02-05 Thread Nicholas A. Bellinger

On Wed, 2013-02-06 at 13:20 +0800, Asias He wrote:
 This adds virtio-scsi multi-queue support to tcm_vhost.
 
 Guest side virtio-scsi multi-queue support can be found here:
 
https://lkml.org/lkml/2012/12/18/166
 
 Some initial perf numbers:
 1 queue,  4 targets, 1 lun per target
 4K request size, 50% randread + 50% randwrite: 127K/127k IOPS
 
 4 queues, 4 targets, 1 lun per target
 4K request size, 50% randread + 50% randwrite: 181K/181k IOPS
 

Nice single LUN small block random I/O improvement here with 4x vqueues.

Curious to see how virtio-scsi small block performance looks with
SCSI-core to multi-LUN tcm_vhost endpoints as well..  8-)

Btw, this does not apply atop current target-pending.git/for-next with
your other pending vhost patch series, and AFAICT this patch is supposed
to apply on top of your last PATCH-v3, no..?

--nab

 Signed-off-by: Asias He as...@redhat.com
 ---
  drivers/vhost/tcm_vhost.c | 46 +-
  drivers/vhost/tcm_vhost.h |  2 ++
  2 files changed, 31 insertions(+), 17 deletions(-)
 
 diff --git a/drivers/vhost/tcm_vhost.c b/drivers/vhost/tcm_vhost.c
 index 81ecda5..9951297 100644
 --- a/drivers/vhost/tcm_vhost.c
 +++ b/drivers/vhost/tcm_vhost.c
 @@ -48,6 +48,7 @@
  #include linux/virtio_net.h /* TODO vhost.h currently depends on this */
  #include linux/virtio_scsi.h
  #include linux/llist.h
 +#include linux/bitmap.h
  
  #include vhost.c
  #include vhost.h
 @@ -59,7 +60,8 @@ enum {
   VHOST_SCSI_VQ_IO = 2,
  };
  
 -#define VHOST_SCSI_MAX_TARGET 256
 +#define VHOST_SCSI_MAX_TARGET256
 +#define VHOST_SCSI_MAX_VQ128
  
  struct vhost_scsi {
   /* Protected by vhost_scsi-dev.mutex */
 @@ -68,7 +70,7 @@ struct vhost_scsi {
   bool vs_endpoint;
  
   struct vhost_dev dev;
 - struct vhost_virtqueue vqs[3];
 + struct vhost_virtqueue vqs[VHOST_SCSI_MAX_VQ];
  
   struct vhost_work vs_completion_work; /* cmd completion work item */
   struct llist_head vs_completion_list; /* cmd completion queue */
 @@ -366,12 +368,14 @@ static void vhost_scsi_complete_cmd_work(struct 
 vhost_work *work)
  {
   struct vhost_scsi *vs = container_of(work, struct vhost_scsi,
   vs_completion_work);
 + DECLARE_BITMAP(signal, VHOST_SCSI_MAX_VQ);
   struct virtio_scsi_cmd_resp v_rsp;
   struct tcm_vhost_cmd *tv_cmd;
   struct llist_node *llnode;
   struct se_cmd *se_cmd;
 - int ret;
 + int ret, vq;
  
 + bitmap_zero(signal, VHOST_SCSI_MAX_VQ);
   llnode = llist_del_all(vs-vs_completion_list);
   while (llnode) {
   tv_cmd = llist_entry(llnode, struct tcm_vhost_cmd,
 @@ -390,15 +394,20 @@ static void vhost_scsi_complete_cmd_work(struct 
 vhost_work *work)
   memcpy(v_rsp.sense, tv_cmd-tvc_sense_buf,
  v_rsp.sense_len);
   ret = copy_to_user(tv_cmd-tvc_resp, v_rsp, sizeof(v_rsp));
 - if (likely(ret == 0))
 - vhost_add_used(vs-vqs[2], tv_cmd-tvc_vq_desc, 0);
 - else
 + if (likely(ret == 0)) {
 + vhost_add_used(tv_cmd-tvc_vq, tv_cmd-tvc_vq_desc, 0);
 + vq = tv_cmd-tvc_vq - vs-vqs;
 + __set_bit(vq, signal);
 + } else
   pr_err(Faulted on virtio_scsi_cmd_resp\n);
  
   vhost_scsi_free_cmd(tv_cmd);
   }
  
 - vhost_signal(vs-dev, vs-vqs[2]);
 + vq = -1;
 + while ((vq = find_next_bit(signal, VHOST_SCSI_MAX_VQ, vq + 1))
 +  VHOST_SCSI_MAX_VQ)
 + vhost_signal(vs-dev, vs-vqs[vq]);
  }
  
  static struct tcm_vhost_cmd *vhost_scsi_allocate_cmd(
 @@ -561,9 +570,9 @@ static void tcm_vhost_submission_work(struct work_struct 
 *work)
   }
  }
  
 -static void vhost_scsi_handle_vq(struct vhost_scsi *vs)
 +static void vhost_scsi_handle_vq(struct vhost_scsi *vs,
 + struct vhost_virtqueue *vq)
  {
 - struct vhost_virtqueue *vq = vs-vqs[2];
   struct virtio_scsi_cmd_req v_req;
   struct tcm_vhost_tpg *tv_tpg;
   struct tcm_vhost_cmd *tv_cmd;
 @@ -656,7 +665,7 @@ static void vhost_scsi_handle_vq(struct vhost_scsi *vs)
   ret = __copy_to_user(resp, rsp, sizeof(rsp));
   if (!ret)
   vhost_add_used_and_signal(vs-dev,
 - vs-vqs[2], head, 0);
 +   vq, head, 0);
   else
   pr_err(Faulted on virtio_scsi_cmd_resp\n);
  
 @@ -678,6 +687,7 @@ static void vhost_scsi_handle_vq(struct vhost_scsi *vs)
   : %d\n, tv_cmd, exp_data_len, data_direction);
  
   tv_cmd-tvc_vhost = vs;
 + tv_cmd-tvc_vq = vq;
  
   if (unlikely(vq-iov[out].iov_len !=
   sizeof(struct virtio_scsi_cmd_resp))) {
 @@ -758,7 +768,7 @@ static

Re: [PATCH] tcm_vhost: Multi-queue support

2013-02-05 Thread Asias He

On 02/06/2013 02:45 PM, Nicholas A. Bellinger wrote:
 On Wed, 2013-02-06 at 13:20 +0800, Asias He wrote:
 This adds virtio-scsi multi-queue support to tcm_vhost.

 Guest side virtio-scsi multi-queue support can be found here:

https://lkml.org/lkml/2012/12/18/166

 Some initial perf numbers:
 1 queue,  4 targets, 1 lun per target
 4K request size, 50% randread + 50% randwrite: 127K/127k IOPS

 4 queues, 4 targets, 1 lun per target
 4K request size, 50% randread + 50% randwrite: 181K/181k IOPS

 
 Nice single LUN small block random I/O improvement here with 4x vqueues.
 
 Curious to see how virtio-scsi small block performance looks with
 SCSI-core to multi-LUN tcm_vhost endpoints as well..  8-)

Do you mean something like this?

1 queue,  2 targets, 2 lun per target
4 queue,  2 targets, 2 lun per target

 Btw, this does not apply atop current target-pending.git/for-next with
 your other pending vhost patch series, and AFAICT this patch is supposed
 to apply on top of your last PATCH-v3, no..?

Ah, this applies on top of mst's 'tcm_vhost: fix pr_err on early kick
patch.' plus my last v3 of 'tcm_vhost: Multi-target support'.

 --nab
 
 Signed-off-by: Asias He as...@redhat.com
 ---
  drivers/vhost/tcm_vhost.c | 46 
 +-
  drivers/vhost/tcm_vhost.h |  2 ++
  2 files changed, 31 insertions(+), 17 deletions(-)

 diff --git a/drivers/vhost/tcm_vhost.c b/drivers/vhost/tcm_vhost.c
 index 81ecda5..9951297 100644
 --- a/drivers/vhost/tcm_vhost.c
 +++ b/drivers/vhost/tcm_vhost.c
 @@ -48,6 +48,7 @@
  #include linux/virtio_net.h /* TODO vhost.h currently depends on this */
  #include linux/virtio_scsi.h
  #include linux/llist.h
 +#include linux/bitmap.h
  
  #include vhost.c
  #include vhost.h
 @@ -59,7 +60,8 @@ enum {
  VHOST_SCSI_VQ_IO = 2,
  };
  
 -#define VHOST_SCSI_MAX_TARGET 256
 +#define VHOST_SCSI_MAX_TARGET   256
 +#define VHOST_SCSI_MAX_VQ   128
  
  struct vhost_scsi {
  /* Protected by vhost_scsi-dev.mutex */
 @@ -68,7 +70,7 @@ struct vhost_scsi {
  bool vs_endpoint;
  
  struct vhost_dev dev;
 -struct vhost_virtqueue vqs[3];
 +struct vhost_virtqueue vqs[VHOST_SCSI_MAX_VQ];
  
  struct vhost_work vs_completion_work; /* cmd completion work item */
  struct llist_head vs_completion_list; /* cmd completion queue */
 @@ -366,12 +368,14 @@ static void vhost_scsi_complete_cmd_work(struct 
 vhost_work *work)
  {
  struct vhost_scsi *vs = container_of(work, struct vhost_scsi,
  vs_completion_work);
 +DECLARE_BITMAP(signal, VHOST_SCSI_MAX_VQ);
  struct virtio_scsi_cmd_resp v_rsp;
  struct tcm_vhost_cmd *tv_cmd;
  struct llist_node *llnode;
  struct se_cmd *se_cmd;
 -int ret;
 +int ret, vq;
  
 +bitmap_zero(signal, VHOST_SCSI_MAX_VQ);
  llnode = llist_del_all(vs-vs_completion_list);
  while (llnode) {
  tv_cmd = llist_entry(llnode, struct tcm_vhost_cmd,
 @@ -390,15 +394,20 @@ static void vhost_scsi_complete_cmd_work(struct 
 vhost_work *work)
  memcpy(v_rsp.sense, tv_cmd-tvc_sense_buf,
 v_rsp.sense_len);
  ret = copy_to_user(tv_cmd-tvc_resp, v_rsp, sizeof(v_rsp));
 -if (likely(ret == 0))
 -vhost_add_used(vs-vqs[2], tv_cmd-tvc_vq_desc, 0);
 -else
 +if (likely(ret == 0)) {
 +vhost_add_used(tv_cmd-tvc_vq, tv_cmd-tvc_vq_desc, 0);
 +vq = tv_cmd-tvc_vq - vs-vqs;
 +__set_bit(vq, signal);
 +} else
  pr_err(Faulted on virtio_scsi_cmd_resp\n);
  
  vhost_scsi_free_cmd(tv_cmd);
  }
  
 -vhost_signal(vs-dev, vs-vqs[2]);
 +vq = -1;
 +while ((vq = find_next_bit(signal, VHOST_SCSI_MAX_VQ, vq + 1))
 + VHOST_SCSI_MAX_VQ)
 +vhost_signal(vs-dev, vs-vqs[vq]);
  }
  
  static struct tcm_vhost_cmd *vhost_scsi_allocate_cmd(
 @@ -561,9 +570,9 @@ static void tcm_vhost_submission_work(struct work_struct 
 *work)
  }
  }
  
 -static void vhost_scsi_handle_vq(struct vhost_scsi *vs)
 +static void vhost_scsi_handle_vq(struct vhost_scsi *vs,
 +struct vhost_virtqueue *vq)
  {
 -struct vhost_virtqueue *vq = vs-vqs[2];
  struct virtio_scsi_cmd_req v_req;
  struct tcm_vhost_tpg *tv_tpg;
  struct tcm_vhost_cmd *tv_cmd;
 @@ -656,7 +665,7 @@ static void vhost_scsi_handle_vq(struct vhost_scsi *vs)
  ret = __copy_to_user(resp, rsp, sizeof(rsp));
  if (!ret)
  vhost_add_used_and_signal(vs-dev,
 -vs-vqs[2], head, 0);
 +  vq, head, 0);
  else
  pr_err(Faulted on virtio_scsi_cmd_resp\n);
  
 @@ -678,6 +687,7 @@ static void vhost_scsi_handle_vq(struct vhost_scsi *vs)
  : %d\n, tv_cmd, exp_data_len,

57 matches

Mail list logo