Re: [PATCH 02/10] nEPT: MMU context for nested EPT

2011-11-14 Thread Avi Kivity
On 11/13/2011 08:26 PM, Orit Wasserman wrote:
  
   int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu *vcpu, u64 addr, u64 sptes[4]);
   void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask);
   int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool direct);
  diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
  index 507e2b8..70d4cfd 100644
  --- a/arch/x86/kvm/paging_tmpl.h
  +++ b/arch/x86/kvm/paging_tmpl.h
  @@ -39,6 +39,21 @@
 #define CMPXCHG cmpxchg64
 #define PT_MAX_FULL_LEVELS 2
 #endif
  +#elif PTTYPE == EPT
  +  #define pt_element_t u64
  +  #define FNAME(name) EPT_##name
  +  #define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
  +  #define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
  +  #define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
  +  #define PT_INDEX(addr, level) PT64_INDEX(addr, level)
  +  #define PT_LEVEL_BITS PT64_LEVEL_BITS
  +  #ifdef CONFIG_X86_64
  +  #define PT_MAX_FULL_LEVELS 4
  +  #define CMPXCHG cmpxchg
  +  #else
  +  #define CMPXCHG cmpxchg64
  +  #define PT_MAX_FULL_LEVELS 2
  +  #endif
  
  The various masks should be defined here, to avoid lots of #ifdefs later.
  

 That's what I did first, but then I was afraid that the MASK would be changed
 for mmu.c too, so I decided on ifdefs.
 The more I think about it, the more I think we need a wrapper function for
 mask checking (at least for this file).
 What do you think?

Either should work, as long as the main logic is clean.
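
For illustration, a minimal user-space sketch of the wrapper idea: one
FNAME()-style presence predicate per page table type, so the shared walker
logic never needs an #ifdef. All names and mask values below are illustrative
stand-ins, not the kernel's actual definitions.

#include <stdint.h>
#include <stdio.h>

typedef uint64_t pt_element_t;

#define PT_PRESENT_MASK     (1ULL << 0)   /* IA-32: P bit */
#define EPT_READABLE_MASK   (1ULL << 0)   /* EPT: R bit */
#define EPT_WRITABLE_MASK   (1ULL << 1)   /* EPT: W bit */
#define EPT_EXECUTABLE_MASK (1ULL << 2)   /* EPT: X bit */

/* What paging_tmpl.h would generate for PTTYPE == 64: */
static int pt64_is_present_gpte(pt_element_t gpte)
{
	return !!(gpte & PT_PRESENT_MASK);
}

/* What it would generate for PTTYPE == EPT: an entry is usable if any
 * of R/W/X is set, so "present" has different semantics. */
static int ept_is_present_gpte(pt_element_t gpte)
{
	return !!(gpte & (EPT_READABLE_MASK | EPT_WRITABLE_MASK |
			  EPT_EXECUTABLE_MASK));
}

int main(void)
{
	/* An execute-only EPT entry: not "present" by IA-32 rules. */
	pt_element_t gpte = EPT_EXECUTABLE_MASK;

	printf("as IA-32 pte: present=%d\n", pt64_is_present_gpte(gpte));
	printf("as EPT pte:   present=%d\n", ept_is_present_gpte(gpte));
	return 0;
}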

 for (;;) {
 gfn_t real_gfn;
  @@ -186,9 +215,14 @@ retry_walk:
 pte_gpa   = gfn_to_gpa(table_gfn) + offset;
 walker->table_gfn[walker->level - 1] = table_gfn;
 walker->pte_gpa[walker->level - 1] = pte_gpa;
  -
  +#if PTTYPE == EPT
  +  real_gfn = mmu->translate_gpa(vcpu, gfn_to_gpa(table_gfn),
  +EPT_WRITABLE_MASK);
  +#else
 real_gfn = mmu->translate_gpa(vcpu, gfn_to_gpa(table_gfn),
   PFERR_USER_MASK|PFERR_WRITE_MASK);
  +#endif
  +
  
  Unneeded, I think.

 Is it because translate_nested_gpa always sets USER_MASK?

Yes... maybe that function needs to do something like

   access |= mmu->default_access;
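
As a user-space sketch of that idea: a per-context default access mask OR'd
in by the translation helper, so EPT callers need not pass PFERR_USER_MASK
explicitly. The default_access field and the mask values here are
hypothetical, not the current kernel's.

#include <stdint.h>
#include <stdio.h>

#define PFERR_WRITE_MASK (1U << 1)
#define PFERR_USER_MASK  (1U << 2)

struct kvm_mmu {
	uint32_t default_access;  /* e.g. PFERR_USER_MASK for x86 walks */
};

/* The helper ORs in the context's default, removing per-call #ifdefs. */
static uint32_t translate_access(const struct kvm_mmu *mmu, uint32_t access)
{
	return access | mmu->default_access;
}

int main(void)
{
	struct kvm_mmu x86_mmu = { .default_access = PFERR_USER_MASK };
	struct kvm_mmu ept_mmu = { .default_access = 0 };

	/* Same call site works for both contexts. */
	printf("x86 walk access: %#x\n",
	       translate_access(&x86_mmu, PFERR_WRITE_MASK));
	printf("EPT walk access: %#x\n",
	       translate_access(&ept_mmu, PFERR_WRITE_MASK));
	return 0;
}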



-- 
error compiling committee.c: too many arguments to function



[PATCH 0/4] KVM: Dirty logging optimization using rmap

2011-11-14 Thread Takuya Yoshikawa
This is a revised version of my previous work.  I hope that
the patches are more self explanatory than before.

Thanks,
Takuya


[PATCH 1/4] KVM: MMU: Clean up BUG_ON() conditions in rmap_write_protect()

2011-11-14 Thread Takuya Yoshikawa
Remove redundant checks and use is_large_pte() macro.

Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
---
 arch/x86/kvm/mmu.c |4 +---
 1 files changed, 1 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index a9b3a32..973f254 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1027,7 +1027,6 @@ static int rmap_write_protect(struct kvm *kvm, u64 gfn)
 
spte = rmap_next(kvm, rmapp, NULL);
while (spte) {
-   BUG_ON(!spte);
 BUG_ON(!(*spte & PT_PRESENT_MASK));
 rmap_printk("rmap_write_protect: spte %p %llx\n", spte, *spte);
if (is_writable_pte(*spte)) {
@@ -1043,9 +1042,8 @@ static int rmap_write_protect(struct kvm *kvm, u64 gfn)
rmapp = gfn_to_rmap(kvm, gfn, i);
spte = rmap_next(kvm, rmapp, NULL);
while (spte) {
-   BUG_ON(!spte);
 BUG_ON(!(*spte & PT_PRESENT_MASK));
-   BUG_ON((*spte & (PT_PAGE_SIZE_MASK|PT_PRESENT_MASK)) != (PT_PAGE_SIZE_MASK|PT_PRESENT_MASK));
+   BUG_ON(!is_large_pte(*spte));
 pgprintk("rmap_write_protect(large): spte %p %llx %lld\n", spte, *spte, gfn);
if (is_writable_pte(*spte)) {
drop_spte(kvm, spte);
-- 
1.7.5.4
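
For context, is_large_pte() in mmu.c at this point is a one-line mask test,
roughly as follows (paraphrased, so treat it as a sketch rather than a
verbatim quote):

static int is_large_pte(u64 pte)
{
	return pte & PT_PAGE_SIZE_MASK;
}

The PT_PRESENT_MASK half of the old combined BUG_ON is already covered by the
preceding BUG_ON(!(*spte & PT_PRESENT_MASK)), so only the page-size half needs
the macro.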



[PATCH 2/4] KVM: MMU: Split gfn_to_rmap() into two functions

2011-11-14 Thread Takuya Yoshikawa
rmap_write_protect() calls gfn_to_rmap() for each level with gfn fixed.
This results in calling gfn_to_memslot() repeatedly with that gfn.

This patch introduces __gfn_to_rmap() which takes the slot as an
argument to avoid this.

This is also needed for the following dirty logging optimization.

Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
---
 arch/x86/kvm/mmu.c |   26 +-
 1 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 973f254..fa71085 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -958,23 +958,29 @@ static void pte_list_walk(unsigned long *pte_list, pte_list_walk_fn fn)
}
 }
 
-/*
- * Take gfn and return the reverse mapping to it.
- */
-static unsigned long *gfn_to_rmap(struct kvm *kvm, gfn_t gfn, int level)
+static unsigned long *__gfn_to_rmap(struct kvm *kvm, gfn_t gfn, int level,
+   struct kvm_memory_slot *slot)
 {
-   struct kvm_memory_slot *slot;
struct kvm_lpage_info *linfo;
 
-   slot = gfn_to_memslot(kvm, gfn);
if (likely(level == PT_PAGE_TABLE_LEVEL))
 return slot->rmap[gfn - slot->base_gfn];
 
linfo = lpage_info_slot(gfn, slot, level);
-
 return linfo->rmap_pde;
 }
 
+/*
+ * Take gfn and return the reverse mapping to it.
+ */
+static unsigned long *gfn_to_rmap(struct kvm *kvm, gfn_t gfn, int level)
+{
+   struct kvm_memory_slot *slot;
+
+   slot = gfn_to_memslot(kvm, gfn);
+   return __gfn_to_rmap(kvm, gfn, level, slot);
+}
+
 static bool rmap_can_add(struct kvm_vcpu *vcpu)
 {
struct kvm_mmu_memory_cache *cache;
@@ -1019,12 +1025,14 @@ static void drop_spte(struct kvm *kvm, u64 *sptep)
 
 static int rmap_write_protect(struct kvm *kvm, u64 gfn)
 {
+   struct kvm_memory_slot *slot;
unsigned long *rmapp;
u64 *spte;
int i, write_protected = 0;
 
-   rmapp = gfn_to_rmap(kvm, gfn, PT_PAGE_TABLE_LEVEL);
+   slot = gfn_to_memslot(kvm, gfn);
 
+   rmapp = __gfn_to_rmap(kvm, gfn, PT_PAGE_TABLE_LEVEL, slot);
spte = rmap_next(kvm, rmapp, NULL);
while (spte) {
 BUG_ON(!(*spte & PT_PRESENT_MASK));
@@ -1039,7 +1047,7 @@ static int rmap_write_protect(struct kvm *kvm, u64 gfn)
/* check for huge page mappings */
for (i = PT_DIRECTORY_LEVEL;
  i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
-   rmapp = gfn_to_rmap(kvm, gfn, i);
+   rmapp = __gfn_to_rmap(kvm, gfn, i, slot);
spte = rmap_next(kvm, rmapp, NULL);
while (spte) {
 BUG_ON(!(*spte & PT_PRESENT_MASK));
-- 
1.7.5.4



[PATCH 4/4] KVM: Optimize dirty logging by rmap_write_protect()

2011-11-14 Thread Takuya Yoshikawa
Currently, write protecting a slot needs to walk all the shadow pages
and check the ones which have a pte mapping a page in it.

The walk is overly heavy when dirty pages in that slot are not so many
and checking the shadow pages would result in unwanted cache pollution.

To mitigate this problem, we use rmap_write_protect() and check only
the sptes which can be reached from gfns marked in the dirty bitmap
when the number of dirty pages is less than that of shadow pages.

This criterion is reasonable in its meaning and worked well in our test:
write protection became several times faster than before when the ratio
of dirty pages was low, and was no worse even when the ratio was near
the criterion.

Note that the locking for this write protection becomes fine grained.
The reason why this is safe is described in the comments.

Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
---
 arch/x86/include/asm/kvm_host.h |2 +
 arch/x86/kvm/mmu.c  |   14 +++---
 arch/x86/kvm/x86.c  |   58 ++-
 3 files changed, 63 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6d83264..69b6525 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -648,6 +648,8 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
 
 int kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
 void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot);
+int kvm_mmu_rmap_write_protect(struct kvm *kvm, u64 gfn,
+  struct kvm_memory_slot *slot);
 void kvm_mmu_zap_all(struct kvm *kvm);
 unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm);
 void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int kvm_nr_mmu_pages);
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index fa71085..aecdea2 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1023,15 +1023,13 @@ static void drop_spte(struct kvm *kvm, u64 *sptep)
rmap_remove(kvm, sptep);
 }
 
-static int rmap_write_protect(struct kvm *kvm, u64 gfn)
+int kvm_mmu_rmap_write_protect(struct kvm *kvm, u64 gfn,
+  struct kvm_memory_slot *slot)
 {
-   struct kvm_memory_slot *slot;
unsigned long *rmapp;
u64 *spte;
int i, write_protected = 0;
 
-   slot = gfn_to_memslot(kvm, gfn);
-
rmapp = __gfn_to_rmap(kvm, gfn, PT_PAGE_TABLE_LEVEL, slot);
spte = rmap_next(kvm, rmapp, NULL);
while (spte) {
@@ -1066,6 +1064,14 @@ static int rmap_write_protect(struct kvm *kvm, u64 gfn)
return write_protected;
 }
 
+static int rmap_write_protect(struct kvm *kvm, u64 gfn)
+{
+   struct kvm_memory_slot *slot;
+
+   slot = gfn_to_memslot(kvm, gfn);
+   return kvm_mmu_rmap_write_protect(kvm, gfn, slot);
+}
+
 static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp,
   unsigned long data)
 {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5fa6801..1985ea1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3461,6 +3461,50 @@ static int kvm_vm_ioctl_reinject(struct kvm *kvm,
return 0;
 }
 
+/**
+ * write_protect_slot - write protect a slot for dirty logging
+ * @kvm: the kvm instance
+ * @memslot: the slot we protect
+ * @dirty_bitmap: the bitmap indicating which pages are dirty
+ * @nr_dirty_pages: the number of dirty pages
+ *
+ * We have two ways to find all sptes to protect:
+ * 1. Use kvm_mmu_slot_remove_write_access() which walks all shadow pages and
+ *checks ones that have a spte mapping a page in the slot.
+ * 2. Use kvm_mmu_rmap_write_protect() for each gfn found in the bitmap.
+ *
+ * Generally speaking, if there are not so many dirty pages compared to the
+ * number of shadow pages, we should use the latter.
+ *
+ * Note that letting others write into a page marked dirty in the old bitmap
+ * by using the remaining tlb entry is not a problem.  That page will become
+ * write protected again when we flush the tlb and then be reported dirty to
+ * the user space by copying the old bitmap.
+ */
+static void write_protect_slot(struct kvm *kvm,
+  struct kvm_memory_slot *memslot,
+  unsigned long *dirty_bitmap,
+  unsigned long nr_dirty_pages)
+{
+   /* Not many dirty pages compared to # of shadow pages. */
+   if (nr_dirty_pages < kvm->arch.n_used_mmu_pages) {
+   unsigned long gfn_offset;
+
+   for_each_set_bit(gfn_offset, dirty_bitmap, memslot->npages) {
+   unsigned long gfn = memslot->base_gfn + gfn_offset;
+
+   spin_lock(&kvm->mmu_lock);
+   kvm_mmu_rmap_write_protect(kvm, gfn, memslot);
+   spin_unlock(&kvm->mmu_lock);
+   }
+   kvm_flush_remote_tlbs(kvm);
+   } else {
+   spin_lock(&kvm->mmu_lock);
+   kvm_mmu_slot_remove_write_access(kvm, memslot->id);
+   spin_unlock(&kvm->mmu_lock);
+   }
+}
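
A self-contained user-space sketch of the sparse path above: visit only the
set bits of the dirty bitmap instead of touching every shadow page.
find_next_bit() is reimplemented naively here; the kernel's
for_each_set_bit() behaves the same way.

#include <stdio.h>

#define BITS_PER_LONG (8 * sizeof(unsigned long))

static unsigned long find_next_bit(const unsigned long *bitmap,
				   unsigned long size, unsigned long start)
{
	for (unsigned long i = start; i < size; i++)
		if (bitmap[i / BITS_PER_LONG] & (1UL << (i % BITS_PER_LONG)))
			return i;
	return size;
}

#define for_each_set_bit(bit, addr, size) \
	for ((bit) = find_next_bit((addr), (size), 0); \
	     (bit) < (size); \
	     (bit) = find_next_bit((addr), (size), (bit) + 1))

int main(void)
{
	unsigned long dirty_bitmap[2] = { 0, 0 };
	unsigned long base_gfn = 0x1000, gfn_offset;

	/* Mark a few pages dirty, as mark_page_dirty_in_slot() would. */
	dirty_bitmap[0] |= 1UL << 3;
	dirty_bitmap[0] |= 1UL << 17;
	dirty_bitmap[1] |= 1UL << 2;

	/* Visit only dirty gfns; each would get rmap_write_protect(). */
	for_each_set_bit(gfn_offset, dirty_bitmap, 2 * BITS_PER_LONG)
		printf("write protect gfn %#lx\n", base_gfn + gfn_offset);

	return 0;
}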

[PATCH 3/4] KVM: Count the number of dirty pages for dirty logging

2011-11-14 Thread Takuya Yoshikawa
Needed for the next patch which uses this number to decide how to write
protect a slot.

Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
---
 arch/x86/kvm/x86.c   |9 +++--
 include/linux/kvm_host.h |1 +
 virt/kvm/kvm_main.c  |4 +++-
 3 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9eff4af..5fa6801 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3467,10 +3467,9 @@ static int kvm_vm_ioctl_reinject(struct kvm *kvm,
 int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
  struct kvm_dirty_log *log)
 {
-   int r, i;
+   int r;
struct kvm_memory_slot *memslot;
unsigned long n;
-   unsigned long is_dirty = 0;
 
	mutex_lock(&kvm->slots_lock);
 
@@ -3485,11 +3484,8 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
 
n = kvm_dirty_bitmap_bytes(memslot);
 
-   for (i = 0; !is_dirty && i < n/sizeof(long); i++)
-   is_dirty = memslot->dirty_bitmap[i];
-
/* If nothing is dirty, don't bother messing with page tables. */
-   if (is_dirty) {
+   if (memslot->nr_dirty_pages) {
struct kvm_memslots *slots, *old_slots;
unsigned long *dirty_bitmap;
 
@@ -3504,6 +3500,7 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
goto out;
 memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
 slots->memslots[log->slot].dirty_bitmap = dirty_bitmap;
+   slots->memslots[log->slot].nr_dirty_pages = 0;
 slots->generation++;
 
 old_slots = kvm->memslots;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c6a2ec9..7c654aa 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -181,6 +181,7 @@ struct kvm_memory_slot {
unsigned long *rmap;
unsigned long *dirty_bitmap;
unsigned long *dirty_bitmap_head;
+   unsigned long nr_dirty_pages;
struct kvm_lpage_info *lpage_info[KVM_NR_PAGE_SIZES - 1];
unsigned long userspace_addr;
int user_alloc;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4c5b9a2..af5c988 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -625,6 +625,7 @@ static int kvm_create_dirty_bitmap(struct kvm_memory_slot *memslot)
return -ENOMEM;
 
 memslot->dirty_bitmap_head = memslot->dirty_bitmap;
+   memslot->nr_dirty_pages = 0;
return 0;
 }
 #endif /* !CONFIG_S390 */
@@ -1491,7 +1492,8 @@ void mark_page_dirty_in_slot(struct kvm *kvm, struct kvm_memory_slot *memslot,
 if (memslot && memslot->dirty_bitmap) {
 unsigned long rel_gfn = gfn - memslot->base_gfn;
 
-   __set_bit_le(rel_gfn, memslot->dirty_bitmap);
+   if (!__test_and_set_bit_le(rel_gfn, memslot->dirty_bitmap))
+   memslot->nr_dirty_pages++;
}
 }
 
-- 
1.7.5.4
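
A user-space sketch of why counting with __test_and_set_bit_le() is exact:
nr_dirty_pages only increments when a bit goes from clear to set, so
re-dirtying the same page does not inflate the count. The bit helper below is
a simplified, non-atomic stand-in for the kernel's.

#include <stdio.h>

#define BITS_PER_LONG (8 * sizeof(unsigned long))

struct memslot {
	unsigned long dirty_bitmap[4];
	unsigned long nr_dirty_pages;
};

/* Set bit nr and return its previous value. */
static int test_and_set_bit(unsigned long nr, unsigned long *addr)
{
	unsigned long mask = 1UL << (nr % BITS_PER_LONG);
	unsigned long *p = addr + nr / BITS_PER_LONG;
	int old = (*p & mask) != 0;

	*p |= mask;
	return old;
}

static void mark_page_dirty(struct memslot *slot, unsigned long rel_gfn)
{
	if (!test_and_set_bit(rel_gfn, slot->dirty_bitmap))
		slot->nr_dirty_pages++;
}

int main(void)
{
	struct memslot slot = { { 0 }, 0 };

	mark_page_dirty(&slot, 5);
	mark_page_dirty(&slot, 5);	/* bit already set: no increment */
	mark_page_dirty(&slot, 42);

	/* Prints 2, not 3: duplicates do not inflate the count. */
	printf("nr_dirty_pages = %lu\n", slot.nr_dirty_pages);
	return 0;
}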



Re: [Qemu-devel] qemu and qemu.git - Migration + disk stress introduces qcow2 corruptions

2011-11-14 Thread Kevin Wolf
Am 12.11.2011 11:25, schrieb Avi Kivity:
 On 11/11/2011 12:15 PM, Kevin Wolf wrote:
 Am 10.11.2011 22:30, schrieb Anthony Liguori:
 Live migration with qcow2 or any other image format is just not going to work
 right now even with proper clustered storage.  I think doing a block level
 flush cache interface and letting block devices decide how to do it is the
 best approach.

 I would really prefer reusing the existing open/close code. It means
 less (duplicated) code, is existing code that is well tested and doesn't
 make migration much of a special case.

 If you want to avoid reopening the file on the OS level, we can reopen
 only the topmost layer (i.e. the format, but not the protocol) for now
 and in 1.1 we can use bdrv_reopen().
 
 Intuitively I dislike _reopen style interfaces.  If the second open
 yields different results from the first, does it invalidate any
 computations in between?

Not sure what results and what computation you mean, but let me clarify
a bit about bdrv_reopen:

The main purpose of bdrv_reopen() is to change flags, for example toggle
O_SYNC during runtime in order to allow the guest to toggle WCE. This
doesn't necessarily mean a close()/open() sequence if there are other
means to change the flags, like fcntl() (or even using other protocols
than files).

The idea here was to extend this to invalidate all caches if some
specific flag is set. As you don't change any other flag, this will
usually not be a reopen on a lower level.

If we need to use open() though, and it fails (this is really the only
different result that comes to mind) then bdrv_reopen() would fail and
the old fd would stay in use. Migration would have to fail, but I don't
think this case is ever needed for reopening after migration.
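
As a plain POSIX sketch of that failure contract (this is libc, not QEMU's
actual block-layer API): acquire the new descriptor first and swap only on
success, so a failed reopen leaves the old descriptor, and the running guest,
untouched.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Try to reopen *fdp's file with new_flags; on failure the old fd
 * stays valid and in use. */
static int reopen_fd(int *fdp, const char *path, int new_flags)
{
	int new_fd = open(path, new_flags);

	if (new_fd < 0)
		return -1;	/* caller keeps using the old fd */

	close(*fdp);
	*fdp = new_fd;
	return 0;
}

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "/etc/hosts";
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return 1;
	if (reopen_fd(&fd, path, O_RDONLY) < 0)
		fprintf(stderr, "reopen failed, old fd still valid\n");
	close(fd);
	return 0;
}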

 What's wrong with just delaying the open?

Nothing, except that with today's code it's harder to do.

Kevin


Re: [PATCH 3/4] KVM: Count the number of dirty pages for dirty logging

2011-11-14 Thread Avi Kivity
On 11/14/2011 11:23 AM, Takuya Yoshikawa wrote:
 Needed for the next patch which uses this number to decide how to write
 protect a slot.

   /* If nothing is dirty, don't bother messing with page tables. */
 - if (is_dirty) {
 + if (memslot->nr_dirty_pages) {
   struct kvm_memslots *slots, *old_slots;
   unsigned long *dirty_bitmap;
  
 @@ -3504,6 +3500,7 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
   goto out;
   memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
   slots->memslots[log->slot].dirty_bitmap = dirty_bitmap;
  + slots->memslots[log->slot].nr_dirty_pages = 0;
   slots->generation++;
  
  #endif /* !CONFIG_S390 */
 @@ -1491,7 +1492,8 @@ void mark_page_dirty_in_slot(struct kvm *kvm, struct kvm_memory_slot *memslot,
   if (memslot && memslot->dirty_bitmap) {
   unsigned long rel_gfn = gfn - memslot->base_gfn;
  
  - __set_bit_le(rel_gfn, memslot->dirty_bitmap);
  + if (!__test_and_set_bit_le(rel_gfn, memslot->dirty_bitmap))
  + memslot->nr_dirty_pages++;
   }
  }
  

The two assignments to ->nr_dirty_pages can race, no?  Nothing protects it.

btw mark_page_dirty() itself seems to assume mmu_lock protection that
doesn't exist.  Marcelo?

-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] qemu and qemu.git - Migration + disk stress introduces qcow2 corruptions

2011-11-14 Thread Michael S. Tsirkin
On Mon, Nov 14, 2011 at 10:58:16AM +0100, Kevin Wolf wrote:
 Am 12.11.2011 11:25, schrieb Avi Kivity:
  On 11/11/2011 12:15 PM, Kevin Wolf wrote:
  Am 10.11.2011 22:30, schrieb Anthony Liguori:
  Live migration with qcow2 or any other image format is just not going to work
  right now even with proper clustered storage.  I think doing a block level
  flush cache interface and letting block devices decide how to do it is the
  best approach.
 
  I would really prefer reusing the existing open/close code. It means
  less (duplicated) code, is existing code that is well tested and doesn't
  make migration much of a special case.
 
  If you want to avoid reopening the file on the OS level, we can reopen
  only the topmost layer (i.e. the format, but not the protocol) for now
  and in 1.1 we can use bdrv_reopen().
  
  Intuitively I dislike _reopen style interfaces.  If the second open
  yields different results from the first, does it invalidate any
  computations in between?
 
 Not sure what results and what computation you mean, but let me clarify
 a bit about bdrv_reopen:
 
 The main purpose of bdrv_reopen() is to change flags, for example toggle
 O_SYNC during runtime in order to allow the guest to toggle WCE. This
 doesn't necessarily mean a close()/open() sequence if there are other
 means to change the flags, like fcntl() (or even using other protocols
 than files).
 
 The idea here was to extend this to invalidate all caches if some
 specific flag is set. As you don't change any other flag, this will
 usually not be a reopen on a lower level.
 
 If we need to use open() though, and it fails (this is really the only
 different result that comes to mind) then bdrv_reopen() would fail and
 the old fd would stay in use. Migration would have to fail, but I don't
 think this case is ever needed for reopening after migration.
 
  What's wrong with just delaying the open?
 
 Nothing, except that with today's code it's harder to do.
 
 Kevin

It seems cleaner, though, doesn't it?

-- 
MST


Re: [RFC] kvm tools: Implement multiple VQ for virtio-net

2011-11-14 Thread Sasha Levin
On Mon, 2011-11-14 at 10:04 +0800, Asias He wrote:
 Hi, Sasha
 
 On 11/13/2011 11:00 PM, Sasha Levin wrote:
  On Sun, 2011-11-13 at 12:24 +0200, Michael S. Tsirkin wrote:
  On Sat, Nov 12, 2011 at 12:12:01AM +0200, Sasha Levin wrote:
  This is a patch based on Krishna Kumar's patch series which implements
  multiple VQ support for virtio-net.
 
  The patch was tested with ver3 of the patch.
 
   Cc: Krishna Kumar <krkum...@in.ibm.com>
   Cc: Michael S. Tsirkin <m...@redhat.com>
   Cc: Rusty Russell <ru...@rustcorp.com.au>
   Cc: virtualizat...@lists.linux-foundation.org
   Cc: net...@vger.kernel.org
   Signed-off-by: Sasha Levin <levinsasha...@gmail.com>
 
  Any performance numbers?
 
  I tried finding a box with more than two cores so I could test it on
  something like that as well.
 
  From what I see this patch causes a performance regression on my 2 core
  box.
 
  I'll send an updated KVM tools patch in a bit as well.
 
  Before:
 
  # netperf -H 192.168.33.4,ipv4 -t TCP_RR
  MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
  to 192.168.33.4 (192.168.33.4) port 0 AF_INET : first burst 0
  Local /Remote
  Socket Size   Request  Resp.   Elapsed  Trans.
  Send   Recv   Size SizeTime Rate
  bytes  Bytes  bytesbytes   secs.per sec
 
   16384  87380  1    1    10.00   11160.63
  16384  87380
 
  # netperf -H 192.168.33.4,ipv4 -t UDP_RR
  MIGRATED UDP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
  to 192.168.33.4 (192.168.33.4) port 0 AF_INET : first burst 0
  Local /Remote
  Socket Size   Request  Resp.   Elapsed  Trans.
  Send   Recv   Size SizeTime Rate
  bytes  Bytes  bytesbytes   secs.per sec
 
   122880 122880 1    1    10.00   12072.64
  229376 229376
 
  # netperf -H 192.168.33.4,ipv4 -t TCP_STREAM
  MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
  192.168.33.4 (192.168.33.4) port 0 AF_INET
  Recv   SendSend
  Socket Socket  Message  Elapsed
  Size   SizeSize Time Throughput
  bytes  bytes   bytessecs.10^6bits/sec
 
 87380  16384  16384    10.00    4654.50
 
  netperf -H 192.168.33.4,ipv4 -t TCP_STREAM -- -m 128
  MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
  192.168.33.4 (192.168.33.4) port 0 AF_INET
  Recv   SendSend
  Socket Socket  Message  Elapsed
  Size   SizeSize Time Throughput
  bytes  bytes   bytessecs.10^6bits/sec
 
 87380  16384    128    10.00     635.45
 
  # netperf -H 192.168.33.4,ipv4 -t UDP_STREAM
  MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
  192.168.33.4 (192.168.33.4) port 0 AF_INET
  Socket  Message  Elapsed  Messages
  SizeSize Time Okay Errors   Throughput
  bytes   bytessecs#  #   10^6bits/sec
 
   122880   65507   10.00  113894      0    5968.54
  229376   10.00   89373   4683.54
 
  # netperf -H 192.168.33.4,ipv4 -t UDP_STREAM -- -m 128
  MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
  192.168.33.4 (192.168.33.4) port 0 AF_INET
  Socket  Message  Elapsed  Messages
  SizeSize Time Okay Errors   Throughput
  bytes   bytessecs#  #   10^6bits/sec
 
  122880 128   10.00  550634  0  56.38
  229376   10.00  398786 40.84
 
 
  After:
 
  # netperf -H 192.168.33.4,ipv4 -t TCP_RR
  MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
  to 192.168.33.4 (192.168.33.4) port 0 AF_INET : first burst 0
  Local /Remote
  Socket Size   Request  Resp.   Elapsed  Trans.
  Send   Recv   Size SizeTime Rate
  bytes  Bytes  bytesbytes   secs.per sec
 
   16384  87380  1    1    10.00    8952.47
  16384  87380
 
  # netperf -H 192.168.33.4,ipv4 -t UDP_RR
  MIGRATED UDP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
  to 192.168.33.4 (192.168.33.4) port 0 AF_INET : first burst 0
  Local /Remote
  Socket Size   Request  Resp.   Elapsed  Trans.
  Send   Recv   Size SizeTime Rate
  bytes  Bytes  bytesbytes   secs.per sec
 
   122880 122880 1    1    10.00    9534.52
  229376 229376
 
  # netperf -H 192.168.33.4,ipv4 -t TCP_STREAM
  MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
  192.168.33.4 (192.168.33.4) port 0 AF_INET
  Recv   SendSend
  Socket Socket  Message  Elapsed
  Size   SizeSize Time Throughput
  bytes  bytes   bytessecs.10^6bits/sec
 
 87380  16384  16384    10.13    2278.23
 
  # netperf -H 192.168.33.4,ipv4 -t TCP_STREAM -- -m 128
  MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
  192.168.33.4 (192.168.33.4) port 0 AF_INET
  Recv   SendSend
  Socket Socket  Message  Elapsed
  Size   SizeSize Time Throughput
  bytes  bytes   bytessecs.10^6bits/sec
 
 87380  16384    128    10.00     623.27
 
  # netperf -H 192.168.33.4,ipv4 -t UDP_STREAM
  MIGRATED UDP STREAM TEST 

Re: [Qemu-devel] qemu and qemu.git - Migration + disk stress introduces qcow2 corruptions

2011-11-14 Thread Daniel P. Berrange
On Sat, Nov 12, 2011 at 12:25:34PM +0200, Avi Kivity wrote:
 On 11/11/2011 12:15 PM, Kevin Wolf wrote:
  Am 10.11.2011 22:30, schrieb Anthony Liguori:
    Live migration with qcow2 or any other image format is just not going to work
    right now even with proper clustered storage.  I think doing a block level
    flush cache interface and letting block devices decide how to do it is the
    best approach.
 
  I would really prefer reusing the existing open/close code. It means
  less (duplicated) code, is existing code that is well tested and doesn't
  make migration much of a special case.
 
  If you want to avoid reopening the file on the OS level, we can reopen
  only the topmost layer (i.e. the format, but not the protocol) for now
  and in 1.1 we can use bdrv_reopen().
 
 
 Intuitively I dislike _reopen style interfaces.  If the second open
 yields different results from the first, does it invalidate any
 computations in between?
 
 What's wrong with just delaying the open?

If you delay the 'open' until the mgmt app issues 'cont', then you lose
the ability to rollback to the source host upon open failure for most
deployed versions of libvirt. We only fairly recently switched to a five
stage migration handshake to cope with rollback when 'cont' fails.

Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|


Re: [PATCH 4/4] KVM: Optimize dirty logging by rmap_write_protect()

2011-11-14 Thread Avi Kivity
On 11/14/2011 11:24 AM, Takuya Yoshikawa wrote:
 Currently, write protecting a slot needs to walk all the shadow pages
 and check the ones which have a pte mapping a page in it.

 The walk is overly heavy when dirty pages in that slot are not so many
 and checking the shadow pages would result in unwanted cache pollution.

 To mitigate this problem, we use rmap_write_protect() and check only
 the sptes which can be reached from gfns marked in the dirty bitmap
 when the number of dirty pages is less than that of shadow pages.

 This criterion is reasonable in its meaning and worked well in our test:
 write protection became several times faster than before when the ratio
 of dirty pages was low, and was no worse even when the ratio was near
 the criterion.

 Note that the locking for this write protection becomes fine grained.
 The reason why this is safe is described in the comments.

  
 +/**
 + * write_protect_slot - write protect a slot for dirty logging
 + * @kvm: the kvm instance
 + * @memslot: the slot we protect
 + * @dirty_bitmap: the bitmap indicating which pages are dirty
 + * @nr_dirty_pages: the number of dirty pages
 + *
 + * We have two ways to find all sptes to protect:
 + * 1. Use kvm_mmu_slot_remove_write_access() which walks all shadow pages and
 + *checks ones that have a spte mapping a page in the slot.
 + * 2. Use kvm_mmu_rmap_write_protect() for each gfn found in the bitmap.
 + *
 + * Generally speaking, if there are not so many dirty pages compared to the
 + * number of shadow pages, we should use the latter.
 + *
 + * Note that letting others write into a page marked dirty in the old bitmap
 + * by using the remaining tlb entry is not a problem.  That page will become
 + * write protected again when we flush the tlb and then be reported dirty to
 + * the user space by copying the old bitmap.
 + */
 +static void write_protect_slot(struct kvm *kvm,
 +struct kvm_memory_slot *memslot,
 +unsigned long *dirty_bitmap,
 +unsigned long nr_dirty_pages)
 +{
 + /* Not many dirty pages compared to # of shadow pages. */
 + if (nr_dirty_pages < kvm->arch.n_used_mmu_pages) {

Seems a reasonable heuristic.  In particular, this is always true for
vga, yes?  That will get the code exercised.

  + unsigned long gfn_offset;
  +
  + for_each_set_bit(gfn_offset, dirty_bitmap, memslot->npages) {
  + unsigned long gfn = memslot->base_gfn + gfn_offset;
  +
  + spin_lock(&kvm->mmu_lock);
  + kvm_mmu_rmap_write_protect(kvm, gfn, memslot);
  + spin_unlock(&kvm->mmu_lock);
  + }
  + kvm_flush_remote_tlbs(kvm);
  + } else {
  + spin_lock(&kvm->mmu_lock);
  + kvm_mmu_slot_remove_write_access(kvm, memslot->id);
  + spin_unlock(&kvm->mmu_lock);
  + }
 +}
 +


-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] qemu and qemu.git - Migration + disk stress introduces qcow2 corruptions

2011-11-14 Thread Michael S. Tsirkin
On Mon, Nov 14, 2011 at 10:16:10AM +, Daniel P. Berrange wrote:
 On Sat, Nov 12, 2011 at 12:25:34PM +0200, Avi Kivity wrote:
  On 11/11/2011 12:15 PM, Kevin Wolf wrote:
   Am 10.11.2011 22:30, schrieb Anthony Liguori:
     Live migration with qcow2 or any other image format is just not going to work
     right now even with proper clustered storage.  I think doing a block level
     flush cache interface and letting block devices decide how to do it is the
     best approach.
  
   I would really prefer reusing the existing open/close code. It means
   less (duplicated) code, is existing code that is well tested and doesn't
   make migration much of a special case.
  
   If you want to avoid reopening the file on the OS level, we can reopen
   only the topmost layer (i.e. the format, but not the protocol) for now
   and in 1.1 we can use bdrv_reopen().
  
  
  Intuitively I dislike _reopen style interfaces.  If the second open
  yields different results from the first, does it invalidate any
  computations in between?
  
  What's wrong with just delaying the open?
 
  If you delay the 'open' until the mgmt app issues 'cont', then you lose
 the ability to rollback to the source host upon open failure for most
 deployed versions of libvirt. We only fairly recently switched to a five
 stage migration handshake to cope with rollback when 'cont' fails.
 
 Daniel

I guess reopen can fail as well, so this seems to me to be an important
fix but not a blocker.

 -- 
 |: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
 |: http://libvirt.org  -o- http://virt-manager.org :|
 |: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
 |: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|


Re: [PATCH 3/4] KVM: Count the number of dirty pages for dirty logging

2011-11-14 Thread Avi Kivity
On 11/14/2011 12:07 PM, Avi Kivity wrote:
 On 11/14/2011 11:23 AM, Takuya Yoshikawa wrote:
  Needed for the next patch which uses this number to decide how to write
  protect a slot.
 
  /* If nothing is dirty, don't bother messing with page tables. */
  -   if (is_dirty) {
  +   if (memslot->nr_dirty_pages) {
  struct kvm_memslots *slots, *old_slots;
  unsigned long *dirty_bitmap;
   
  @@ -3504,6 +3500,7 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
  goto out;
   memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
   slots->memslots[log->slot].dirty_bitmap = dirty_bitmap;
  +   slots->memslots[log->slot].nr_dirty_pages = 0;
   slots->generation++;
   
   #endif /* !CONFIG_S390 */
  @@ -1491,7 +1492,8 @@ void mark_page_dirty_in_slot(struct kvm *kvm, struct kvm_memory_slot *memslot,
   if (memslot && memslot->dirty_bitmap) {
   unsigned long rel_gfn = gfn - memslot->base_gfn;
    
  -   __set_bit_le(rel_gfn, memslot->dirty_bitmap);
  +   if (!__test_and_set_bit_le(rel_gfn, memslot->dirty_bitmap))
  +   memslot->nr_dirty_pages++;
  }
   }
   

 The two assignments to ->nr_dirty_pages can race, no?  Nothing protects it.


er, it can't, it's before the rcu_assign_pointer().
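
In other words, the dirty_bitmap swap and the nr_dirty_pages reset are
applied to a private copy of the memslots and only published afterwards, so
readers never see them racing with mark_page_dirty_in_slot(). Roughly, the
flow in kvm_vm_ioctl_get_dirty_log() (a sketch assembled from the patch
context, not verbatim kernel code):

	slots = kzalloc(sizeof(struct kvm_memslots), GFP_KERNEL);
	memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
	slots->memslots[log->slot].dirty_bitmap = dirty_bitmap;
	slots->memslots[log->slot].nr_dirty_pages = 0;	/* private copy */
	slots->generation++;

	old_slots = kvm->memslots;
	rcu_assign_pointer(kvm->memslots, slots);	/* publish atomically */
	synchronize_srcu_expedited(&kvm->srcu);		/* wait out old readers */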

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 0/4] KVM: Dirty logging optimization using rmap

2011-11-14 Thread Avi Kivity
On 11/14/2011 11:20 AM, Takuya Yoshikawa wrote:
 This is a revised version of my previous work.  I hope that
 the patches are more self explanatory than before.


It looks good.  I'll let Marcelo (or anyone else?) review it as well
before applying.

Do you have performance measurements?

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 4/4] KVM: Optimize dirty logging by rmap_write_protect()

2011-11-14 Thread Takuya Yoshikawa

(2011/11/14 19:22), Avi Kivity wrote:


+ *
+ * Generally speaking, if there are not so many dirty pages compared to the
+ * number of shadow pages, we should use the latter.
+ *
+ * Note that letting others write into a page marked dirty in the old bitmap
+ * by using the remaining tlb entry is not a problem.  That page will become
+ * write protected again when we flush the tlb and then be reported dirty to
+ * the user space by copying the old bitmap.
+ */
+static void write_protect_slot(struct kvm *kvm,
+  struct kvm_memory_slot *memslot,
+  unsigned long *dirty_bitmap,
+  unsigned long nr_dirty_pages)
+{
+   /* Not many dirty pages compared to # of shadow pages. */
+   if (nr_dirty_pages < kvm->arch.n_used_mmu_pages) {


Seems a reasonable heuristic.  In particular, this is always true for
vga, yes?  That will get the code exercised.


Almost always yes, once the guest warms up and shadow pages are allocated.

Takuya




+   unsigned long gfn_offset;
+
+   for_each_set_bit(gfn_offset, dirty_bitmap, memslot->npages) {
+   unsigned long gfn = memslot->base_gfn + gfn_offset;
+
+   spin_lock(&kvm->mmu_lock);
+   kvm_mmu_rmap_write_protect(kvm, gfn, memslot);
+   spin_unlock(&kvm->mmu_lock);
+   }
+   kvm_flush_remote_tlbs(kvm);
+   } else {
+   spin_lock(&kvm->mmu_lock);
+   kvm_mmu_slot_remove_write_access(kvm, memslot->id);
+   spin_unlock(&kvm->mmu_lock);
+   }
+}
+







Re: [PATCH 0/4] KVM: Dirty logging optimization using rmap

2011-11-14 Thread Takuya Yoshikawa

(2011/11/14 19:25), Avi Kivity wrote:

On 11/14/2011 11:20 AM, Takuya Yoshikawa wrote:

This is a revised version of my previous work.  I hope that
the patches are more self explanatory than before.



It looks good.  I'll let Marcelo (or anyone else?) review it as well
before applying.

Do you have performance measurements?



For VGA, 30-40us became 3-5us when the display was quiet, with a
sufficiently warmed-up guest.

Near the criterion, the numbers did not differ much from the
original version.

For live migration, I forgot the numbers but the result was good.
But my test cases were not enough to cover every pattern, so I changed
the criterion to be a bit conservative.

More tests may be able to find a better criterion.
I am not in a hurry about this, so it is OK to add some tests
before merging this.

But what I did not like was holding the spin lock for more than 100us
or so with the original version.  With this version, at least, the
problem should be alleviated somewhat.

Takuya


One note:

kvm-unit-tests' dirty logging test was broken on 32-bit boxes: a compile
error.  I changed an idt to boot_idt and used it.

I do not know kvm-unit-tests well, so I want somebody to fix that officially.


Re: [Qemu-devel] qemu and qemu.git - Migration + disk stress introduces qcow2 corruptions

2011-11-14 Thread Daniel P. Berrange
On Mon, Nov 14, 2011 at 12:24:22PM +0200, Michael S. Tsirkin wrote:
 On Mon, Nov 14, 2011 at 10:16:10AM +, Daniel P. Berrange wrote:
  On Sat, Nov 12, 2011 at 12:25:34PM +0200, Avi Kivity wrote:
   On 11/11/2011 12:15 PM, Kevin Wolf wrote:
Am 10.11.2011 22:30, schrieb Anthony Liguori:
  Live migration with qcow2 or any other image format is just not going to work
  right now even with proper clustered storage.  I think doing a block level
  flush cache interface and letting block devices decide how to do it is the
  best approach.
   
I would really prefer reusing the existing open/close code. It means
less (duplicated) code, is existing code that is well tested and doesn't
make migration much of a special case.
   
If you want to avoid reopening the file on the OS level, we can reopen
only the topmost layer (i.e. the format, but not the protocol) for now
and in 1.1 we can use bdrv_reopen().
   
   
   Intuitively I dislike _reopen style interfaces.  If the second open
   yields different results from the first, does it invalidate any
   computations in between?
   
   What's wrong with just delaying the open?
  
   If you delay the 'open' until the mgmt app issues 'cont', then you lose
  the ability to rollback to the source host upon open failure for most
  deployed versions of libvirt. We only fairly recently switched to a five
  stage migration handshake to cope with rollback when 'cont' fails.
  
  Daniel
 
 I guess reopen can fail as well, so this seems to me to be an important
 fix but not a blocker.

If the initial open succeeds, then it is far more likely that a later
re-open will succeed too, because you have already eliminated the possibility
of configuration mistakes, and will have caught most storage runtime errors
too. So there is a very significant difference in reliability between doing
an 'open at startup + reopen at cont' vs just 'open at cont'

Based on the bug reports I see, we want to be very good at detecting and
gracefully handling open errors because they are pretty frequent.

Regards,
Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|


Re: [Qemu-devel] qemu and qemu.git - Migration + disk stress introduces qcow2 corruptions

2011-11-14 Thread Kevin Wolf
Am 14.11.2011 12:08, schrieb Daniel P. Berrange:
 On Mon, Nov 14, 2011 at 12:24:22PM +0200, Michael S. Tsirkin wrote:
 On Mon, Nov 14, 2011 at 10:16:10AM +, Daniel P. Berrange wrote:
 On Sat, Nov 12, 2011 at 12:25:34PM +0200, Avi Kivity wrote:
 On 11/11/2011 12:15 PM, Kevin Wolf wrote:
 Am 10.11.2011 22:30, schrieb Anthony Liguori:
  Live migration with qcow2 or any other image format is just not going to work
  right now even with proper clustered storage.  I think doing a block level
  flush cache interface and letting block devices decide how to do it is the
  best approach.

 I would really prefer reusing the existing open/close code. It means
 less (duplicated) code, is existing code that is well tested and doesn't
 make migration much of a special case.

 If you want to avoid reopening the file on the OS level, we can reopen
 only the topmost layer (i.e. the format, but not the protocol) for now
 and in 1.1 we can use bdrv_reopen().


 Intuitively I dislike _reopen style interfaces.  If the second open
 yields different results from the first, does it invalidate any
 computations in between?

 What's wrong with just delaying the open?

 If you delay the 'open' until the mgmt app issues 'cont', then you lose
 the ability to rollback to the source host upon open failure for most
 deployed versions of libvirt. We only fairly recently switched to a five
 stage migration handshake to cope with rollback when 'cont' fails.

 Daniel

 I guess reopen can fail as well, so this seems to me to be an important
 fix but not a blocker.
 
 If the initial open succeeds, then it is far more likely that a later
 re-open will succeed too, because you have already eliminated the possibility
 of configuration mistakes, and will have caught most storage runtime errors
 too. So there is a very significant difference in reliability between doing
 an 'open at startup + reopen at cont' vs just 'open at cont'
 
 Based on the bug reports I see, we want to be very good at detecting and
 gracefully handling open errors because they are pretty frequent.

Do you have some more details on the kind of errors? Missing files,
permissions, something like this? Or rather something related to the
actual content of an image file?

I'm asking because for avoiding the former, things like access() could
be enough, whereas for the latter we'd have to do a full open.

Kevin


Re: [Qemu-devel] qemu and qemu.git - Migration + disk stress introduces qcow2 corruptions

2011-11-14 Thread Daniel P. Berrange
On Mon, Nov 14, 2011 at 12:21:53PM +0100, Kevin Wolf wrote:
 Am 14.11.2011 12:08, schrieb Daniel P. Berrange:
  On Mon, Nov 14, 2011 at 12:24:22PM +0200, Michael S. Tsirkin wrote:
  On Mon, Nov 14, 2011 at 10:16:10AM +, Daniel P. Berrange wrote:
  On Sat, Nov 12, 2011 at 12:25:34PM +0200, Avi Kivity wrote:
  On 11/11/2011 12:15 PM, Kevin Wolf wrote:
  Am 10.11.2011 22:30, schrieb Anthony Liguori:
  Live migration with qcow2 or any other image format is just not going to work
  right now even with proper clustered storage.  I think doing a block level
  flush cache interface and letting block devices decide how to do it is the
  best approach.
 
  I would really prefer reusing the existing open/close code. It means
  less (duplicated) code, is existing code that is well tested and doesn't
  make migration much of a special case.
 
  If you want to avoid reopening the file on the OS level, we can reopen
  only the topmost layer (i.e. the format, but not the protocol) for now
  and in 1.1 we can use bdrv_reopen().
 
 
  Intuitively I dislike _reopen style interfaces.  If the second open
  yields different results from the first, does it invalidate any
  computations in between?
 
  What's wrong with just delaying the open?
 
 If you delay the 'open' until the mgmt app issues 'cont', then you lose
  the ability to rollback to the source host upon open failure for most
  deployed versions of libvirt. We only fairly recently switched to a five
  stage migration handshake to cope with rollback when 'cont' fails.
 
  Daniel
 
  I guess reopen can fail as well, so this seems to me to be an important
  fix but not a blocker.
  
  If the initial open succeeds, then it is far more likely that a later
  re-open will succeed too, because you have already eliminated the possibility
  of configuration mistakes, and will have caught most storage runtime errors
  too. So there is a very significant difference in reliability between doing
  an 'open at startup + reopen at cont' vs just 'open at cont'
  
  Based on the bug reports I see, we want to be very good at detecting and
  gracefully handling open errors because they are pretty frequent.
 
 Do you have some more details on the kind of errors? Missing files,
 permissions, something like this? Or rather something related to the
 actual content of an image file?

Missing files due to wrong/missing NFS mounts, or incorrect SAN / iSCSI
setup. Access permissions due to incorrect user / group setup, or read
only mounts, or SELinux denials. Actual I/O errors are less common and
are not so likely to cause QEMU to fail to start anyway, since QEMU is
likely to just report them to the guest OS instead.


Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|


Re: [Qemu-devel] qemu and qemu.git - Migration + disk stress introduces qcow2 corruptions

2011-11-14 Thread Michael S. Tsirkin
On Mon, Nov 14, 2011 at 11:08:02AM +, Daniel P. Berrange wrote:
 On Mon, Nov 14, 2011 at 12:24:22PM +0200, Michael S. Tsirkin wrote:
  On Mon, Nov 14, 2011 at 10:16:10AM +, Daniel P. Berrange wrote:
   On Sat, Nov 12, 2011 at 12:25:34PM +0200, Avi Kivity wrote:
On 11/11/2011 12:15 PM, Kevin Wolf wrote:
 Am 10.11.2011 22:30, schrieb Anthony Liguori:
  Live migration with qcow2 or any other image format is just not going to work
  right now even with proper clustered storage.  I think doing a block level
  flush cache interface and letting block devices decide how to do it is the
  best approach.

 I would really prefer reusing the existing open/close code. It means
 less (duplicated) code, is existing code that is well tested and doesn't
 make migration much of a special case.

 If you want to avoid reopening the file on the OS level, we can reopen
 only the topmost layer (i.e. the format, but not the protocol) for now
 and in 1.1 we can use bdrv_reopen().


Intuitively I dislike _reopen style interfaces.  If the second open
yields different results from the first, does it invalidate any
computations in between?

What's wrong with just delaying the open?
   
    If you delay the 'open' until the mgmt app issues 'cont', then you lose
   the ability to rollback to the source host upon open failure for most
   deployed versions of libvirt. We only fairly recently switched to a five
   stage migration handshake to cope with rollback when 'cont' fails.
   
   Daniel
  
  I guess reopen can fail as well, so this seems to me to be an important
  fix but not a blocker.
 
 If the initial open succeeds, then it is far more likely that a later
 re-open will succeed too, because you have already eliminated the possibility
 of configuration mistakes, and will have caught most storage runtime errors
 too. So there is a very significant difference in reliability between doing
 an 'open at startup + reopen at cont' vs just 'open at cont'
 
 Based on the bug reports I see, we want to be very good at detecting and
 gracefully handling open errors because they are pretty frequent.
 
 Regards,
 Daniel

IIUC, the 'cont' that we were discussing is the startup of the VM
at destination after migration completes. A failure results in
migration failure, which libvirt has been able to handle since forever.
In case of the 'cont' command on source upon migration failure,
qemu was running there previously so it's likely configuration is OK.

Am I confused? If no, libvirt seems unaffected.


 -- 
 |: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
 |: http://libvirt.org  -o- http://virt-manager.org :|
 |: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
 |: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|


Re: [Qemu-devel] qemu and qemu.git - Migration + disk stress introduces qcow2 corruptions

2011-11-14 Thread Michael S. Tsirkin
On Mon, Nov 14, 2011 at 11:29:18AM +, Daniel P. Berrange wrote:
 On Mon, Nov 14, 2011 at 12:21:53PM +0100, Kevin Wolf wrote:
  Am 14.11.2011 12:08, schrieb Daniel P. Berrange:
   On Mon, Nov 14, 2011 at 12:24:22PM +0200, Michael S. Tsirkin wrote:
   On Mon, Nov 14, 2011 at 10:16:10AM +, Daniel P. Berrange wrote:
   On Sat, Nov 12, 2011 at 12:25:34PM +0200, Avi Kivity wrote:
   On 11/11/2011 12:15 PM, Kevin Wolf wrote:
   Am 10.11.2011 22:30, schrieb Anthony Liguori:
   Live migration with qcow2 or any other image format is just not going to work
   right now even with proper clustered storage.  I think doing a block level
   flush cache interface and letting block devices decide how to do it is the
   best approach.
  
   I would really prefer reusing the existing open/close code. It means
   less (duplicated) code, is existing code that is well tested and doesn't
   make migration much of a special case.
  
   If you want to avoid reopening the file on the OS level, we can reopen
   only the topmost layer (i.e. the format, but not the protocol) for now
   and in 1.1 we can use bdrv_reopen().
  
  
   Intuitively I dislike _reopen style interfaces.  If the second open
   yields different results from the first, does it invalidate any
   computations in between?
  
   What's wrong with just delaying the open?
  
    If you delay the 'open' until the mgmt app issues 'cont', then you lose
   the ability to rollback to the source host upon open failure for most
   deployed versions of libvirt. We only fairly recently switched to a five
   stage migration handshake to cope with rollback when 'cont' fails.
  
   Daniel
  
   I guess reopen can fail as well, so this seems to me to be an important
   fix but not a blocker.
   
   If the initial open succeeds, then it is far more likely that a later
   re-open will succeed too, because you have already eliminated the
   possibility of configuration mistakes, and will have caught most storage
   runtime errors too. So there is a very significant difference in
   reliability between doing an 'open at startup + reopen at cont' vs just
   'open at cont'
   
   Based on the bug reports I see, we want to be very good at detecting and
   gracefully handling open errors because they are pretty frequent.
  
  Do you have some more details on the kind of errors? Missing files,
  permissions, something like this? Or rather something related to the
  actual content of an image file?
 
  Missing files due to wrong/missing NFS mounts, or incorrect SAN / iSCSI
  setup. Access permissions due to incorrect user / group setup, or read
  only mounts, or SELinux denials. Actual I/O errors are less common and
  are not so likely to cause QEMU to fail to start anyway, since QEMU is
  likely to just report them to the guest OS instead.
 
 
 Daniel

Do you run qemu with -S, then give a 'cont' command to start it?

 -- 
 |: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
 |: http://libvirt.org  -o- http://virt-manager.org :|
 |: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
 |: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|


Re: [Qemu-devel] qemu and qemu.git - Migration + disk stress introduces qcow2 corruptions

2011-11-14 Thread Gleb Natapov
On Mon, Nov 14, 2011 at 11:29:18AM +, Daniel P. Berrange wrote:
  Do you have some more details on the kind of errors? Missing files,
  permissions, something like this? Or rather something related to the
  actual content of an image file?
 
 Missing files due to wrong/missing NFS mounts, or incorrect SAN / iSCSI
 setup. Access permissions due to incorrect user / group setup, or read
 only mounts, or SELinux denials. Actual I/O errors are less common and
 are not so likely to cause QEMU to fail to start anyway, since QEMU is
 likely to just report them to the guest OS instead.
 
If started correctly QEMU should not report them to the guest OS, but
pause instead.
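
Presumably "started correctly" here means the drive is configured with a
stop-on-error policy, along these lines (an illustrative command; werror and
rerror select the action taken on write/read errors):

	qemu-system-x86_64 -drive file=guest.qcow2,if=virtio,werror=stop,rerror=stop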

--
Gleb.


Re: [Qemu-devel] qemu and qemu.git - Migration + disk stress introduces qcow2 corruptions

2011-11-14 Thread Daniel P. Berrange
On Mon, Nov 14, 2011 at 01:34:15PM +0200, Michael S. Tsirkin wrote:
 On Mon, Nov 14, 2011 at 11:29:18AM +, Daniel P. Berrange wrote:
  On Mon, Nov 14, 2011 at 12:21:53PM +0100, Kevin Wolf wrote:
   Am 14.11.2011 12:08, schrieb Daniel P. Berrange:
On Mon, Nov 14, 2011 at 12:24:22PM +0200, Michael S. Tsirkin wrote:
On Mon, Nov 14, 2011 at 10:16:10AM +, Daniel P. Berrange wrote:
On Sat, Nov 12, 2011 at 12:25:34PM +0200, Avi Kivity wrote:
On 11/11/2011 12:15 PM, Kevin Wolf wrote:
Am 10.11.2011 22:30, schrieb Anthony Liguori:
Live migration with qcow2 or any other image format is just not going to work
right now even with proper clustered storage.  I think doing a block level
flush cache interface and letting block devices decide how to do it is the
best approach.
   
I would really prefer reusing the existing open/close code. It means
less (duplicated) code, is existing code that is well tested and doesn't
make migration much of a special case.
   
If you want to avoid reopening the file on the OS level, we can reopen
only the topmost layer (i.e. the format, but not the protocol) for now
and in 1.1 we can use bdrv_reopen().
   
   
Intuitively I dislike _reopen style interfaces.  If the second open
yields different results from the first, does it invalidate any
computations in between?
   
What's wrong with just delaying the open?
   
If you delay the 'open' until the mgmt app issues 'cont', then you lose
the ability to rollback to the source host upon open failure for most
deployed versions of libvirt. We only fairly recently switched to a five
stage migration handshake to cope with rollback when 'cont' fails.
   
Daniel
   
I guess reopen can fail as well, so this seems to me to be an important
fix but not a blocker.

If the initial open succeeds, then it is far more likely that a later
re-open will succeed too, because you have already eliminated the
possibility of configuration mistakes, and will have caught most storage
runtime errors too. So there is a very significant difference in
reliability between doing an 'open at startup + reopen at cont' vs just
'open at cont'

Based on the bug reports I see, we want to be very good at detecting and
gracefully handling open errors because they are pretty frequent.
   
   Do you have some more details on the kind of errors? Missing files,
   permissions, something like this? Or rather something related to the
   actual content of an image file?
  
  Missing files due to wrong/missing NFS mounts, or incorrect SAN / iSCSI
  setup. Access permissions due to incorrect user / group setup, or read
  only mounts, or SELinux denials. Actual I/O errors are less common and
  are not so likely to cause QEMU to fail to start anyway, since QEMU is
  likely to just report them to the guest OS instead.
 
 Do you run qemu with -S, then give a 'cont' command to start it?

Yes

Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|


Re: KVM VMExit

2011-11-14 Thread Xin Tong
x86 has an Undefined Instruction; its opcode is 0F 0B and it generates
an invalid opcode exception. This instruction is provided for software
testing to explicitly generate an invalid opcode exception. The opcode
for this instruction is reserved for this purpose.
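
For illustration, a minimal C snippet that triggers it (a sketch assuming
GCC-style inline assembly; the process takes a SIGILL unless a hypervisor
intercepts the #UD first):

/* ud2.c: executing UD2 (opcode 0F 0B) raises #UD (invalid opcode). */
int main(void)
{
        __asm__ volatile("ud2");
        return 0;               /* never reached */
}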

Thanks


Xin


On Sun, Nov 13, 2011 at 8:50 PM, 王永博 wangyongb...@gmail.com wrote:


 2011/11/14 Xin Tong xerox.time.t...@gmail.com

 I would like to know how I can make a UD2 (Undefined Instruction)
 cause a vmexit in KVM ?

 Thanks


 Xin

   what is UD2 ?



Re: [Qemu-devel] qemu and qemu.git - Migration + disk stress introduces qcow2 corruptions

2011-11-14 Thread Michael S. Tsirkin
On Mon, Nov 14, 2011 at 11:37:27AM +, Daniel P. Berrange wrote:
 On Mon, Nov 14, 2011 at 01:34:15PM +0200, Michael S. Tsirkin wrote:
  On Mon, Nov 14, 2011 at 11:29:18AM +, Daniel P. Berrange wrote:
   On Mon, Nov 14, 2011 at 12:21:53PM +0100, Kevin Wolf wrote:
On 14.11.2011 12:08, Daniel P. Berrange wrote:
 On Mon, Nov 14, 2011 at 12:24:22PM +0200, Michael S. Tsirkin wrote:
 On Mon, Nov 14, 2011 at 10:16:10AM +, Daniel P. Berrange wrote:
 On Sat, Nov 12, 2011 at 12:25:34PM +0200, Avi Kivity wrote:
 On 11/11/2011 12:15 PM, Kevin Wolf wrote:
 On 10.11.2011 22:30, Anthony Liguori wrote:
 Live migration with qcow2 or any other image format is just not 
 going to work 
 right now even with proper clustered storage.  I think doing a 
 block level flush 
 cache interface and letting block devices decide how to do it is 
 the best approach.

 I would really prefer reusing the existing open/close code. It 
 means
 less (duplicated) code, is existing code that is well tested and 
 doesn't
 make migration much of a special case.

 If you want to avoid reopening the file on the OS level, we can 
 reopen
 only the topmost layer (i.e. the format, but not the protocol) 
 for now
 and in 1.1 we can use bdrv_reopen().


 Intuitively I dislike _reopen style interfaces.  If the second open
 yields different results from the first, does it invalidate any
 computations in between?

 What's wrong with just delaying the open?

 If you delay the 'open' until the mgmt app issues 'cont', then you 
 lose
 the ability to roll back to the source host upon open failure for 
 most
 deployed versions of libvirt. We only fairly recently switched to a 
 five
 stage migration handshake to cope with rollback when 'cont' fails.

 Daniel

 I guess reopen can fail as well, so this seems to me to be an 
 important
 fix but not a blocker.
 
 If the initial open succeeds, then it is far more likely that a 
 later
 re-open will succeed too, because you have already eliminated the 
 possibility
 of configuration mistakes, and will have caught most storage runtime 
 errors
 too. So there is a very significant difference in reliability between 
 doing
 an 'open at startup + reopen at cont' vs just 'open at cont'
 
 Based on the bug reports I see, we want to be very good at detecting 
 and
 gracefully handling open errors because they are pretty frequent.

Do you have some more details on the kind of errors? Missing files,
permissions, something like this? Or rather something related to the
actual content of an image file?
   
   Missing files due to wrong/missing NFS mounts, or incorrect SAN / iSCSI
   setup. Access permissions due to incorrect user / group setup, or read
   only mounts, or SELinux denials. Actual I/O errors are less common and
 are not so likely to cause QEMU to fail to start anyway, since QEMU is
   likely to just report them to the guest OS instead.
  
  Do you run qemu with -S, then give a 'cont' command to start it?
 
 Yes
 
 Daniel

Probably in an attempt to improve reliability :)

So this is in fact unrelated to migration.  So we can either ignore this
bug (assuming no distros ship cutting edge qemu with an old libvirt), or
special-case -S and do an open/close cycle on startup.


 -- 
 |: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
 |: http://libvirt.org  -o- http://virt-manager.org :|
 |: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
 |: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|


Re: [Qemu-devel] qemu and qemu.git - Migration + disk stress introduces qcow2 corruptions

2011-11-14 Thread Michael S. Tsirkin
On Mon, Nov 14, 2011 at 11:37:27AM +, Daniel P. Berrange wrote:
 On Mon, Nov 14, 2011 at 01:34:15PM +0200, Michael S. Tsirkin wrote:
  On Mon, Nov 14, 2011 at 11:29:18AM +, Daniel P. Berrange wrote:
   On Mon, Nov 14, 2011 at 12:21:53PM +0100, Kevin Wolf wrote:
On 14.11.2011 12:08, Daniel P. Berrange wrote:
 On Mon, Nov 14, 2011 at 12:24:22PM +0200, Michael S. Tsirkin wrote:
 On Mon, Nov 14, 2011 at 10:16:10AM +, Daniel P. Berrange wrote:
 On Sat, Nov 12, 2011 at 12:25:34PM +0200, Avi Kivity wrote:
 On 11/11/2011 12:15 PM, Kevin Wolf wrote:
 On 10.11.2011 22:30, Anthony Liguori wrote:
 Live migration with qcow2 or any other image format is just not 
 going to work 
 right now even with proper clustered storage.  I think doing a 
 block level flush 
 cache interface and letting block devices decide how to do it is 
 the best approach.

 I would really prefer reusing the existing open/close code. It 
 means
 less (duplicated) code, is existing code that is well tested and 
 doesn't
 make migration much of a special case.

 If you want to avoid reopening the file on the OS level, we can 
 reopen
 only the topmost layer (i.e. the format, but not the protocol) 
 for now
 and in 1.1 we can use bdrv_reopen().


 Intuitively I dislike _reopen style interfaces.  If the second open
 yields different results from the first, does it invalidate any
 computations in between?

 What's wrong with just delaying the open?

 If you delay the 'open' until the mgmt app issues 'cont', then you 
 loose
 the ability to rollback to the source host upon open failure for 
 most
 deployed versions of libvirt. We only fairly recently switched to a 
 five
 stage migration handshake to cope with rollback when 'cont' fails.

 Daniel

 I guess reopen can fail as well, so this seems to me to be an 
 important
 fix but not a blocker.
 
 If the initial open succeeds, then it is far more likely that a 
 later
 re-open will succeed too, because you have already eliminated the 
 possibility
 of configuration mistakes, and will have caught most storage runtime 
 errors
 too. So there is a very significant difference in reliability between 
 doing
 an 'open at startup + reopen at cont' vs just 'open at cont'
 
 Based on the bug reports I see, we want to be very good at detecting 
 and
 gracefully handling open errors because they are pretty frequent.

Do you have some more details on the kind of errors? Missing files,
permissions, something like this? Or rather something related to the
actual content of an image file?
   
   Missing files due to wrong/missing NFS mounts, or incorrect SAN / iSCSI
   setup. Access permissions due to incorrect user / group setup, or read
   only mounts, or SELinux denials. Actual I/O errors are less common and
 are not so likely to cause QEMU to fail to start anyway, since QEMU is
   likely to just report them to the guest OS instead.
  
  Do you run qemu with -S, then give a 'cont' command to start it?
 
 Yes
 
 Daniel

OK, so let's go back one step now - how is this related to
'rollback to source host'?

-- 
MST


Re: [Qemu-devel] qemu and qemu.git - Migration + disk stress introduces qcow2 corruptions

2011-11-14 Thread Daniel P. Berrange
On Mon, Nov 14, 2011 at 01:51:40PM +0200, Michael S. Tsirkin wrote:
 On Mon, Nov 14, 2011 at 11:37:27AM +, Daniel P. Berrange wrote:
  On Mon, Nov 14, 2011 at 01:34:15PM +0200, Michael S. Tsirkin wrote:
   On Mon, Nov 14, 2011 at 11:29:18AM +, Daniel P. Berrange wrote:
On Mon, Nov 14, 2011 at 12:21:53PM +0100, Kevin Wolf wrote:
 On 14.11.2011 12:08, Daniel P. Berrange wrote:
  On Mon, Nov 14, 2011 at 12:24:22PM +0200, Michael S. Tsirkin wrote:
  On Mon, Nov 14, 2011 at 10:16:10AM +, Daniel P. Berrange wrote:
  On Sat, Nov 12, 2011 at 12:25:34PM +0200, Avi Kivity wrote:
  On 11/11/2011 12:15 PM, Kevin Wolf wrote:
  On 10.11.2011 22:30, Anthony Liguori wrote:
  Live migration with qcow2 or any other image format is just 
  not going to work 
  right now even with proper clustered storage.  I think doing a 
  block level flush 
  cache interface and letting block devices decide how to do it 
  is the best approach.
 
  I would really prefer reusing the existing open/close code. It 
  means
  less (duplicated) code, is existing code that is well tested 
  and doesn't
  make migration much of a special case.
 
  If you want to avoid reopening the file on the OS level, we can 
  reopen
  only the topmost layer (i.e. the format, but not the protocol) 
  for now
  and in 1.1 we can use bdrv_reopen().
 
 
  Intuitively I dislike _reopen style interfaces.  If the second 
  open
  yields different results from the first, does it invalidate any
  computations in between?
 
  What's wrong with just delaying the open?
 
  If you delay the 'open' until the mgmt app issues 'cont', then 
  you lose
  the ability to roll back to the source host upon open failure for 
  most
  deployed versions of libvirt. We only fairly recently switched to 
  a five
  stage migration handshake to cope with rollback when 'cont' fails.
 
  Daniel
 
  I guess reopen can fail as well, so this seems to me to be an 
  important
  fix but not a blocker.
  
  If the initial open succeeds, then it is far more likely that a 
  later
  re-open will succeed too, because you have already eliminated the 
  possibility
  of configuration mistakes, and will have caught most storage 
  runtime errors
  too. So there is a very significant difference in reliability 
  between doing
  an 'open at startup + reopen at cont' vs just 'open at cont'
  
  Based on the bug reports I see, we want to be very good at 
  detecting and
  gracefully handling open errors because they are pretty frequent.
 
 Do you have some more details on the kind of errors? Missing files,
 permissions, something like this? Or rather something related to the
 actual content of an image file?

Missing files due to wrong/missing NFS mounts, or incorrect SAN / iSCSI
setup. Access permissions due to incorrect user / group setup, or read
only mounts, or SELinux denials. Actual I/O errors are less common and
 are not so likely to cause QEMU to fail to start anyway, since QEMU is
likely to just report them to the guest OS instead.
   
   Do you run qemu with -S, then give a 'cont' command to start it?
 
 Probably in an attempt to improve reliability :)

Not really. We can't simply let QEMU start its own CPUs, because there are
various tasks that need performing after the migration transfer finishes,
but before the CPUs are allowed to be started, e.g.:

 - Finish 802.1Qb{g,h} (VEPA) network port profile association on target
 - Release leases for any resources associated with the source QEMU
   via a configured lock manager (eg sanlock)
 - Acquire leases for any resources associated with the target QEMU
   via a configured lock manager (eg sanlock)

Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|


Re: [Qemu-devel] qemu and qemu.git - Migration + disk stress introduces qcow2 corruptions

2011-11-14 Thread Daniel P. Berrange
On Mon, Nov 14, 2011 at 01:56:36PM +0200, Michael S. Tsirkin wrote:
 On Mon, Nov 14, 2011 at 11:37:27AM +, Daniel P. Berrange wrote:
  On Mon, Nov 14, 2011 at 01:34:15PM +0200, Michael S. Tsirkin wrote:
   On Mon, Nov 14, 2011 at 11:29:18AM +, Daniel P. Berrange wrote:
On Mon, Nov 14, 2011 at 12:21:53PM +0100, Kevin Wolf wrote:
 On 14.11.2011 12:08, Daniel P. Berrange wrote:
  On Mon, Nov 14, 2011 at 12:24:22PM +0200, Michael S. Tsirkin wrote:
  On Mon, Nov 14, 2011 at 10:16:10AM +, Daniel P. Berrange wrote:
  On Sat, Nov 12, 2011 at 12:25:34PM +0200, Avi Kivity wrote:
  On 11/11/2011 12:15 PM, Kevin Wolf wrote:
  On 10.11.2011 22:30, Anthony Liguori wrote:
  Live migration with qcow2 or any other image format is just 
  not going to work 
  right now even with proper clustered storage.  I think doing a 
  block level flush 
  cache interface and letting block devices decide how to do it 
  is the best approach.
 
  I would really prefer reusing the existing open/close code. It 
  means
  less (duplicated) code, is existing code that is well tested 
  and doesn't
  make migration much of a special case.
 
  If you want to avoid reopening the file on the OS level, we can 
  reopen
  only the topmost layer (i.e. the format, but not the protocol) 
  for now
  and in 1.1 we can use bdrv_reopen().
 
 
  Intuitively I dislike _reopen style interfaces.  If the second 
  open
  yields different results from the first, does it invalidate any
  computations in between?
 
  What's wrong with just delaying the open?
 
  If you delay the 'open' until the mgmt app issues 'cont', then 
  you loose
  the ability to rollback to the source host upon open failure for 
  most
  deployed versions of libvirt. We only fairly recently switched to 
  a five
  stage migration handshake to cope with rollback when 'cont' fails.
 
  Daniel
 
  I guess reopen can fail as well, so this seems to me to be an 
  important
  fix but not a blocker.
  
  If the initial open succeeds, then it is far more likely that a 
  later
  re-open will succeed too, because you have already eliminated the 
  possibility
  of configuration mistakes, and will have caught most storage 
  runtime errors
  too. So there is a very significant difference in reliability 
  between doing
  an 'open at startup + reopen at cont' vs just 'open at cont'
  
  Based on the bug reports I see, we want to be very good at 
  detecting and
  gracefully handling open errors because they are pretty frequent.
 
 Do you have some more details on the kind of errors? Missing files,
 permissions, something like this? Or rather something related to the
 actual content of an image file?

Missing files due to wrong/missing NFS mounts, or incorrect SAN / iSCSI
setup. Access permissions due to incorrect user / group setup, or read
only mounts, or SELinux denials. Actual I/O errors are less common and
 are not so likely to cause QEMU to fail to start anyway, since QEMU is
likely to just report them to the guest OS instead.
   
   Do you run qemu with -S, then give a 'cont' command to start it?
  
  Yes
 
 OK, so let's go back one step now - how is this related to
 'rollback to source host'?

In the old libvirt migration protocol, by the time we run 'cont' on the
destination, the source QEMU has already been killed off, so there's
nothing to resume on failure.

Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|


Performance counter - vPMU

2011-11-14 Thread Balbir Singh
Hi,

I saw a presentation on virtualizing performance counters at
http://www.linux-kvm.org/wiki/images/6/6d/Kvm-forum-2011-performance-monitoring.pdf.
Has the code been merged? Can I get something to play with/provide
feedback?

Balbir Singh


Re: [Qemu-devel] qemu and qemu.git - Migration + disk stress introduces qcow2 corruptions

2011-11-14 Thread Michael S. Tsirkin
On Mon, Nov 14, 2011 at 11:58:14AM +, Daniel P. Berrange wrote:
 On Mon, Nov 14, 2011 at 01:56:36PM +0200, Michael S. Tsirkin wrote:
  On Mon, Nov 14, 2011 at 11:37:27AM +, Daniel P. Berrange wrote:
   On Mon, Nov 14, 2011 at 01:34:15PM +0200, Michael S. Tsirkin wrote:
On Mon, Nov 14, 2011 at 11:29:18AM +, Daniel P. Berrange wrote:
 On Mon, Nov 14, 2011 at 12:21:53PM +0100, Kevin Wolf wrote:
  On 14.11.2011 12:08, Daniel P. Berrange wrote:
   On Mon, Nov 14, 2011 at 12:24:22PM +0200, Michael S. Tsirkin 
   wrote:
   On Mon, Nov 14, 2011 at 10:16:10AM +, Daniel P. Berrange 
   wrote:
   On Sat, Nov 12, 2011 at 12:25:34PM +0200, Avi Kivity wrote:
   On 11/11/2011 12:15 PM, Kevin Wolf wrote:
   On 10.11.2011 22:30, Anthony Liguori wrote:
   Live migration with qcow2 or any other image format is just 
   not going to work 
   right now even with proper clustered storage.  I think doing 
   a block level flush 
   cache interface and letting block devices decide how to do 
   it is the best approach.
  
   I would really prefer reusing the existing open/close code. 
   It means
   less (duplicated) code, is existing code that is well tested 
   and doesn't
   make migration much of a special case.
  
   If you want to avoid reopening the file on the OS level, we 
   can reopen
   only the topmost layer (i.e. the format, but not the 
   protocol) for now
   and in 1.1 we can use bdrv_reopen().
  
  
   Intuitively I dislike _reopen style interfaces.  If the second 
   open
   yields different results from the first, does it invalidate any
   computations in between?
  
   What's wrong with just delaying the open?
  
   If you delay the 'open' until the mgmt app issues 'cont', then 
   you lose
   the ability to roll back to the source host upon open failure 
   for most
   deployed versions of libvirt. We only fairly recently switched 
   to a five
   stage migration handshake to cope with rollback when 'cont' 
   fails.
  
   Daniel
  
   I guess reopen can fail as well, so this seems to me to be an 
   important
   fix but not a blocker.
   
   If the initial open succeeds, then it is far more likely that 
   a later
   re-open will succeed too, because you have already eliminated the 
   possibility
   of configuration mistakes, and will have caught most storage 
   runtime errors
   too. So there is a very significant difference in reliability 
   between doing
   an 'open at startup + reopen at cont' vs just 'open at cont'
   
   Based on the bug reports I see, we want to be very good at 
   detecting and
   gracefully handling open errors because they are pretty frequent.
  
  Do you have some more details on the kind of errors? Missing files,
  permissions, something like this? Or rather something related to the
  actual content of an image file?
 
 Missing files due to wrong/missing NFS mounts, or incorrect SAN / 
 iSCSI
 setup. Access permissions due to incorrect user / group setup, or read
 only mounts, or SELinux denials. Actual I/O errors are less common and
 are not so likely to cause QEMU to fail to start anyway, since QEMU is
 likely to just report them to the guest OS instead.

Do you run qemu with -S, then give a 'cont' command to start it?
   
   Yes
  
  OK, so let's go back one step now - how is this related to
  'rollback to source host'?
 
 In the old libvirt migration protocol, by the time we run 'cont' on the
 destination, the source QEMU has already been killed off, so there's
 nothing to resume on failure.
 
 Daniel

I see. So again, there are two solutions:
1. ignore old libvirt, as it can't restart the source reliably anyway
2. open files when migration is completed (after startup, but before cont)



 -- 
 |: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
 |: http://libvirt.org  -o- http://virt-manager.org :|
 |: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
 |: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|


Re: Performance counter - vPMU

2011-11-14 Thread Gleb Natapov
On Mon, Nov 14, 2011 at 05:35:59PM +0530, Balbir Singh wrote:
 Hi,
 
 I saw a presentation on virtualizing performance counters at
 http://www.linux-kvm.org/wiki/images/6/6d/Kvm-forum-2011-performance-monitoring.pdf.
 Has the code been merged? Can I get something to play with/provide
 feedback?
 
Not yet merged. You can take it from here https://lkml.org/lkml/2011/11/10/215

--
Gleb.


Re: [RFC] kvm tools: Implement multiple VQ for virtio-net

2011-11-14 Thread Pekka Enberg
On Mon, Nov 14, 2011 at 4:04 AM, Asias He asias.he...@gmail.com wrote:
 Why are both the bandwidth and latency performance dropping so dramatically
 with multiple VQs?

What's the expected benefit from multiple VQs, i.e. why are you doing the
patches, Sasha?


Re: [PATCH 0/4] KVM: Dirty logging optimization using rmap

2011-11-14 Thread Avi Kivity
On 11/14/2011 12:56 PM, Takuya Yoshikawa wrote:
 (2011/11/14 19:25), Avi Kivity wrote:
 On 11/14/2011 11:20 AM, Takuya Yoshikawa wrote:
 This is a revised version of my previous work.  I hope that
 the patches are more self explanatory than before.


 It looks good.  I'll let Marcelo (or anyone else?) review it as well
 before applying.

 Do you have performance measurements?


 For VGA, 30-40us became 3-5us when the display was quiet, with a
 sufficiently warmed-up guest.


That's a nice improvement.

 Near the criterion, the numbers did not differ much from the
 original version.

 For live migration, I forgot the numbers, but the result was good.
 But my test case did not cover every pattern, so I changed
 the criterion to be a bit conservative.

 More tests may be able to find a better criterion.
 I am not in a hurry about this, so it is OK to add some tests
 before merging this.

I think we can merge it as is; it's clear we get an improvement.


 But what I did not like was holding the spin lock for more than 100us or so
 with the original version.  With this version, at least, the problem
 should be cured somewhat.

There was a patchset from Peter Zijlstra that converted mmu notifiers to
be preemptible; with that, we can convert the mmu spinlock to a mutex.
I'll see what happened to it.

There is a third method of doing write protection, and that is by
write-protecting at the higher levels of the paging hierarchy.  The
advantage there is that write protection is O(1) no matter how large the
guest is, or the number of dirty pages.

To write protect all guest memory, we just write protect the 512 PTEs at
the very top, and leave the rest alone.  When the guest writes to a
page, we allow writes for the top-level PTE that faulted, and
write-protect all the PTEs that it points to.

We can combine it with your method by having a small bitmap (say, just
64 bits) per shadow page.  Each bit represents 8 PTEs (total 512 PTEs)
and is set if any of those PTEs are writeable.
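
A rough standalone model of that combined scheme (a sketch only; the struct
layout and helper names below are invented for illustration and are not
mmu.c's):

#include <stdint.h>

#define PTES_PER_TABLE  512
#define PT_WRITABLE     (1ull << 1)

struct shadow_page {
        uint64_t pte[PTES_PER_TABLE];
        struct shadow_page *child[PTES_PER_TABLE];
        uint64_t wp_bitmap;     /* bit i: a pte in [8*i, 8*i+8) is writable */
};

/* O(1) in guest size: only the 512 top-level ptes are touched. */
static void wp_all(struct shadow_page *root)
{
        for (int i = 0; i < PTES_PER_TABLE; i++)
                root->pte[i] &= ~PT_WRITABLE;
        root->wp_bitmap = 0;
}

/* Write fault: re-allow the faulting top-level pte and push the
 * protection one level down, into the table it points to. */
static void wp_push_down(struct shadow_page *sp, int index)
{
        struct shadow_page *child = sp->child[index];

        for (int i = 0; i < PTES_PER_TABLE; i++)
                child->pte[i] &= ~PT_WRITABLE;
        child->wp_bitmap = 0;

        sp->pte[index] |= PT_WRITABLE;
        sp->wp_bitmap |= 1ull << (index / 8);   /* writable ptes below */
}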


 Takuya


 One note:

 kvm-unit-tests' dirty logging test was broken on 32-bit boxes: a compile
 error.
 I changed idt to boot_idt and used it.

 I do not know kvm-unit-tests well, so I want somebody to fix that
 officially.

I'll look into it.

-- 
error compiling committee.c: too many arguments to function



Re: KVM VMExit

2011-11-14 Thread Nadav Har'El
On Mon, Nov 14, 2011, Xin Tong wrote about Re: KVM VMExit:
 x86 has an Undefined Instruction; its opcode is 0F 0B and it generates
 an invalid opcode exception. This instruction is provided for software
 testing to explicitly generate an invalid opcode exception. The opcode
 for this instruction is reserved for this purpose.

Seeing that your recent questions on this list aren't really about KVM
development, but rather about VMX (apparently) basics, I suggest you
also grab yourself a copy of the Intel Software Manual - see volume 3
of:
http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html
which answers all of your recent questions.

About this question - as far as I know there is no way to specifically
cause an exit only on the UD2 instruction. What you can do, however, is to
cause an exit on any #UD exception, including the one generated by UD2.
The VMCS has an exception bitmap which defines which exceptions cause
an exit, and you can turn on the #UD bit to ask for this exit.
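
In KVM's vmx.c terms that amounts to roughly the following (a sketch;
UD_VECTOR is vector 6 per the SDM, and vmcs_read32/vmcs_write32 follow
KVM's accessor naming):

/* Sketch: ask VMX for an exit on every #UD, including UD2. */
static void enable_ud_intercept(void)
{
        u32 eb = vmcs_read32(EXCEPTION_BITMAP);

        eb |= 1u << UD_VECTOR;          /* vector 6: invalid opcode */
        vmcs_write32(EXCEPTION_BITMAP, eb);
}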


-- 
Nadav Har'El|Monday, Nov 14 2011, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |Willpower: The ability to eat only one
http://nadav.harel.org.il   |salted peanut.


Re: [RFC] kvm tools: Implement multiple VQ for virtio-net

2011-11-14 Thread Michael S. Tsirkin
On Mon, Nov 14, 2011 at 02:25:17PM +0200, Pekka Enberg wrote:
 On Mon, Nov 14, 2011 at 4:04 AM, Asias He asias.he...@gmail.com wrote:
  Why are both the bandwidth and latency performance dropping so dramatically
  with multiple VQs?
 
 What's the expected benefit from multiple VQs

Heh, the original patchset didn't mention this :) It really should.
They are supposed to speed up networking for high smp guests.

 i.e. why are you doing the patches, Sasha?

-- 
MST


Re: [PATCH 3/6] KVM: PPC: booke: Check for MSR[WE] in prepare_to_enter

2011-11-14 Thread Alexander Graf

On 11/09/2011 01:23 AM, Scott Wood wrote:

This prevents us from inappropriately blocking in a KVM_SET_REGS
ioctl -- the MSR[WE] will take effect when the guest is next entered.

It also causes SRR1[WE] to be set when we enter the guest's interrupt
handler, which is what e500 hardware is documented to do.

Signed-off-by: Scott Woodscottw...@freescale.com
---
  arch/powerpc/kvm/booke.c |   28 ++--
  1 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index feaefc4..557f028 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -124,12 +124,6 @@ void kvmppc_set_msr(struct kvm_vcpu *vcpu, u32 new_msr)
	vcpu->arch.shared->msr = new_msr;

kvmppc_mmu_msr_notify(vcpu, old_msr);
-
-   if (vcpu->arch.shared->msr & MSR_WE) {
-   kvm_vcpu_block(vcpu);
-   kvmppc_set_exit_type(vcpu, EMULATED_MTMSRWE_EXITS);
-   };
-
kvmppc_vcpu_sync_spe(vcpu);
  }

@@ -288,15 +282,12 @@ static int kvmppc_booke_irqprio_deliver(struct kvm_vcpu *vcpu,
return allowed;
  }

-/* Check pending exceptions and deliver one, if possible. */
-void kvmppc_core_prepare_to_enter(struct kvm_vcpu *vcpu)
+static void kvmppc_core_check_exceptions(struct kvm_vcpu *vcpu)
  {
	unsigned long *pending = &vcpu->arch.pending_exceptions;
	unsigned long old_pending = vcpu->arch.pending_exceptions;
unsigned int priority;

-   WARN_ON_ONCE(!irqs_disabled());
-
priority = __ffs(*pending);
	while (priority <= BOOKE_IRQPRIO_MAX) {
if (kvmppc_booke_irqprio_deliver(vcpu, priority))
@@ -314,6 +305,23 @@ void kvmppc_core_prepare_to_enter(struct kvm_vcpu *vcpu)
	vcpu->arch.shared->int_pending = 0;
  }

+/* Check pending exceptions and deliver one, if possible. */
+void kvmppc_core_prepare_to_enter(struct kvm_vcpu *vcpu)
+{
+   WARN_ON_ONCE(!irqs_disabled());
+
+   kvmppc_core_check_exceptions(vcpu);
+
+   if (vcpu->arch.shared->msr & MSR_WE) {
+   local_irq_enable();
+   kvm_vcpu_block(vcpu);
+   local_irq_disable();


Hrm. This specific irq enable/disable part isn't pretty but I can't 
think of a cleaner way either. Unless you move it out of the prepare 
function, since I don't see a way it could race with an interrupt.



+
+   kvmppc_set_exit_type(vcpu, EMULATED_MTMSRWE_EXITS);
+   kvmppc_core_check_exceptions(vcpu);


Shouldn't

if (msr & MSR_WE) {
  ...
}

core_check_exceptions(vcpu);


work just as well?



Re: [Qemu-devel] qemu and qemu.git - Migration + disk stress introduces qcow2 corruptions

2011-11-14 Thread Anthony Liguori

On 11/14/2011 04:16 AM, Daniel P. Berrange wrote:

On Sat, Nov 12, 2011 at 12:25:34PM +0200, Avi Kivity wrote:

On 11/11/2011 12:15 PM, Kevin Wolf wrote:

On 10.11.2011 22:30, Anthony Liguori wrote:

Live migration with qcow2 or any other image format is just not going to work
right now even with proper clustered storage.  I think doing a block level flush
cache interface and letting block devices decide how to do it is the best 
approach.


I would really prefer reusing the existing open/close code. It means
less (duplicated) code, is existing code that is well tested and doesn't
make migration much of a special case.

If you want to avoid reopening the file on the OS level, we can reopen
only the topmost layer (i.e. the format, but not the protocol) for now
and in 1.1 we can use bdrv_reopen().



Intuitively I dislike _reopen style interfaces.  If the second open
yields different results from the first, does it invalidate any
computations in between?

What's wrong with just delaying the open?


If you delay the 'open' until the mgmt app issues 'cont', then you lose
the ability to roll back to the source host upon open failure for most
deployed versions of libvirt. We only fairly recently switched to a five
stage migration handshake to cope with rollback when 'cont' fails.


Delayed open isn't a panacea.  With the series I sent, we should be able to 
migrate with a qcow2 file on coherent shared storage.


There are two other cases that we care about: migration with nfs cache!=none and 
direct attached storage with cache!=none


Whether the open is deferred matters less with NFS than if the open happens 
after the close on the source.  To fix NFS cache!=none, we would have to do a 
bdrv_close() before sending the last byte of migration data and make sure that 
we bdrv_open() after receiving the last byte of migration data.


The problem with this IMHO is it creates a large window where no one has the file 
open and you're critically vulnerable to losing your VM.


I'm much more in favor of a smarter caching policy.  If we can fcntl() our way 
to O_DIRECT on NFS, that would be fairly interesting.  I'm not sure if this is 
supported today but it's something we could look into adding in the kernel. 
That way we could force NFS to O_DIRECT during migration which would solve this 
problem robustly.
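
For reference, the fcntl() dance would look roughly like this (a sketch;
F_SETFL does accept O_DIRECT on Linux for some filesystems, and whether
NFS honours it is exactly the open question above):

#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>

/* Toggle O_DIRECT on an already-open fd, e.g. around migration. */
static int set_odirect(int fd, int on)
{
        int flags = fcntl(fd, F_GETFL);

        if (flags < 0)
                return -1;
        flags = on ? (flags | O_DIRECT) : (flags & ~O_DIRECT);
        return fcntl(fd, F_SETFL, flags);
}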


Deferred open doesn't help with direct attached storage.  There simply is no 
guarantee that there isn't data in the page cache.


Again, I think defaulting DAS to cache=none|directsync is what makes the most 
sense here.


We can even add a migration blocker for DAS with cache=on.  If we can do dynamic 
toggling of the cache setting, then that's pretty friendly at the end of the day.


Regards,

Anthony Liguori



Daniel




Re: [PATCH RFC] ndo: ndo_queue_xmit/ndo_flush_xmit (was Re: [RFC] [ver3 PATCH 0/6] Implement multiqueue virtio-net)

2011-11-14 Thread Michael S. Tsirkin
On Sun, Nov 13, 2011 at 07:48:28PM +0200, Michael S. Tsirkin wrote:
 @@ -863,6 +872,9 @@ struct net_device_ops {
   int (*ndo_stop)(struct net_device *dev);
   netdev_tx_t (*ndo_start_xmit) (struct sk_buff *skb,
  struct net_device *dev);
 + netdev_tx_t (*ndo_queue_xmit)(struct sk_buff *skb,
 +   struct net_device *dev);
 + void(*ndo_flush_xmit)(struct net_device *dev);
   u16 (*ndo_select_queue)(struct net_device *dev,
   struct sk_buff *skb);
   void(*ndo_change_rx_flags)(struct net_device *dev,
 @@ -2099,6 +2111,10 @@ extern int dev_set_mac_address(struct net_device *,

An alternative I considered was to add a boolean flag to ndo_start_xmit
'bool queue' or something like this, plus ndo_flush_xmit.
This would lead to cleaner code, I think, but would require all drivers
to be changed, so for a proof of concept I went with the one
that is less work.
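
Sketched out, that alternative would be something like (illustration only,
not a real kernel interface):

        /* start_xmit grows a 'queue' flag: true means the driver may
         * batch this skb; a separate flush hook pushes out anything
         * batched so far. */
        netdev_tx_t     (*ndo_start_xmit)(struct sk_buff *skb,
                                          struct net_device *dev,
                                          bool queue);
        void            (*ndo_flush_xmit)(struct net_device *dev);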

Let me know what looks more palatable ... 

-- 
MST


Re: [PATCH 3/6] KVM: PPC: booke: Check for MSR[WE] in prepare_to_enter

2011-11-14 Thread Scott Wood
On 11/14/2011 07:13 AM, Alexander Graf wrote:
 On 11/09/2011 01:23 AM, Scott Wood wrote:
 +/* Check pending exceptions and deliver one, if possible. */
 +void kvmppc_core_prepare_to_enter(struct kvm_vcpu *vcpu)
 +{
 +WARN_ON_ONCE(!irqs_disabled());
 +
 +kvmppc_core_check_exceptions(vcpu);
 +
 +if (vcpu->arch.shared->msr & MSR_WE) {
 +local_irq_enable();
 +kvm_vcpu_block(vcpu);
 +local_irq_disable();
 
 Hrm. This specific irq enable/disable part isn't pretty but I can't
 think of a cleaner way either. Unless you move it out of the prepare
 function, since I don't see a way it could race with an interrupt.

kvmppc_core_check_exceptions can clear MSR_WE, so we need to check after
that.  We can't enable interrupts after kvmppc_core_check_exceptions (or
rather, if we do, we need to check again once interrupts are
re-disabled, as in the MSR_WE case) because otherwise we could have an
exception delivered afterward and receive the resched IPI at just the
wrong time to take any action on it (just like the signal check).

 +
 +kvmppc_set_exit_type(vcpu, EMULATED_MTMSRWE_EXITS);
 +kvmppc_core_check_exceptions(vcpu);
 
 Shouldn't
 
 if (msr & MSR_WE) {
   ...
 }
 
 core_check_exceptions(vcpu);
 
 
 work just as well?

That would have us pointlessly checking the exceptions twice in the
non-WE case -- unless you mean only check after, which as described
above means you'll fail to wake up.

-Scott



KVM call agenda for November 15th

2011-11-14 Thread Juan Quintela

Hi

Please send in any agenda items you are interested in covering.

Proposal:
- Migration debacle.

Thanks, Juan.


Re: [Qemu-devel] KVM call agenda for November 15th

2011-11-14 Thread Anthony Liguori

On 11/14/2011 11:44 AM, Juan Quintela wrote:


Hi

Please send in any agenda items you are interested in covering.

Proposal:
- Migration debacle.


Ack.  You read my mind :-)

Regards,

Anthony Liguori



Thanks, Juan.





Re: [PATCH 3/6] KVM: PPC: booke: Check for MSR[WE] in prepare_to_enter

2011-11-14 Thread Alexander Graf

On 14.11.2011, at 18:22, Scott Wood scottw...@freescale.com wrote:

 On 11/14/2011 07:13 AM, Alexander Graf wrote:
 On 11/09/2011 01:23 AM, Scott Wood wrote:
 +/* Check pending exceptions and deliver one, if possible. */
 +void kvmppc_core_prepare_to_enter(struct kvm_vcpu *vcpu)
 +{
 +WARN_ON_ONCE(!irqs_disabled());
 +
 +kvmppc_core_check_exceptions(vcpu);
 +
 +if (vcpu->arch.shared->msr & MSR_WE) {
 +local_irq_enable();
 +kvm_vcpu_block(vcpu);
 +local_irq_disable();
 
 Hrm. This specific irq enable/disable part isn't pretty but I can't
 think of a cleaner way either. Unless you move it out of the prepare
 function, since I don't see a way it could race with an interrupt.
 
 kvmppc_core_check_exceptions can clear MSR_WE, so we need to check after
 that.  We can't enable interrupts after kvmppc_core_check_exceptions (or
 rather, if we do, we need to check again once interrupts are
 re-disabled, as in the MSR_WE case) because otherwise we could have an
 exception delivered afterward and receive the resched IPI at just the
 wrong time to take any action on it (just like the signal check).

Well, but if you enable interrupts here you're basically rendering code that 
runs earlier in disabled-interrupt mode racy.

How does this align with the signal handling? We could receive a signal here 
and not handle it, just like the race you fixed earlier, no?

Alex

 
 +
 +kvmppc_set_exit_type(vcpu, EMULATED_MTMSRWE_EXITS);
 +kvmppc_core_check_exceptions(vcpu);
 
 Shouldn't
 
 if (msr & MSR_WE) {
  ...
 }
 
 core_check_exceptions(vcpu);
 
 
 work just as well?
 
 That would have us pointlessly checking the exceptions twice in the
 non-WE case -- unless you mean only check after, which as described
 above means you'll fail to wake up.
 
 -Scott
 


[PATCHv2 RFC] virtio-pci: flexible configuration layout

2011-11-14 Thread Michael S. Tsirkin
Add a flexible mechanism to specify the virtio configuration layout, using
a pci vendor-specific capability.  A separate capability is used for each
of the common, device-specific and data-path accesses.

Warning: compile-tested only.
This patch also needs to be split up; the pci_iomap changes
also need arch updates for non-x86.
There might also be more spec changes.

Posting here for early feedback, and to allow Sasha to
proceed with his kvm tool work.

Changes from v1:
Updated to match v3 of the spec, see:
Subject: [PATCHv3 RFC] virtio-spec: flexible configuration layout
Message-ID: 2010122436.ga13...@redhat.com
In-Reply-To: 2009195901.ga28...@redhat.com

In particular:
- split ISR out
- reorg capability
- add offset alignment

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 drivers/virtio/virtio_pci.c |  184 +++---
 include/asm-generic/io.h|4 +
 include/asm-generic/iomap.h |   11 +++
 include/linux/pci_regs.h|4 +
 include/linux/virtio_pci.h  |   41 ++
 lib/iomap.c |   40 -
 6 files changed, 265 insertions(+), 19 deletions(-)

diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
index ecb9254..bdd3876 100644
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -37,8 +37,12 @@ struct virtio_pci_device
struct virtio_device vdev;
struct pci_dev *pci_dev;
 
-   /* the IO mapping for the PCI config space */
+   /* the IO address for the common PCI config space */
void __iomem *ioaddr;
+   /* the IO address for device specific config */
+   void __iomem *ioaddr_device;
+   /* the IO address to use for notifications operations */
+   void __iomem *ioaddr_notify;
 
/* a list of queues so we can dispatch IRQs */
spinlock_t lock;
@@ -57,8 +61,158 @@ struct virtio_pci_device
unsigned msix_used_vectors;
/* Whether we have vector per vq */
bool per_vq_vectors;
+
+   /* Various IO mappings: used for resource tracking only. */
+
+   /* Legacy BAR0: typically PIO. */
+   void __iomem *legacy_map;
+
+   /* Mappings specified by device capabilities: typically in MMIO */
+   void __iomem *isr_map;
+   void __iomem *notify_map;
+   void __iomem *common_map;
+   void __iomem *device_map;
 };
 
+/*
+ * With PIO, device-specific config moves as MSI-X is enabled/disabled.
+ * Use this accessor to keep pointer to that config in sync.
+ */
+static void virtio_pci_set_msix_enabled(struct virtio_pci_device *vp_dev, int enabled)
+{
+   vp_dev->msix_enabled = enabled;
+   if (vp_dev->device_map)
+   vp_dev->ioaddr_device = vp_dev->device_map;
+   else
+   vp_dev->ioaddr_device = vp_dev->legacy_map +
+   VIRTIO_PCI_CONFIG(vp_dev);
+}
+
+static void __iomem *virtio_pci_map_cfg(struct virtio_pci_device *vp_dev,
+   u8 cap_id, u32 align)
+{
+   u32 size;
+   u32 offset;
+   u8 bir;
+   u8 cap_len;
+   int pos;
+   struct pci_dev *dev = vp_dev->pci_dev;
+   void __iomem *p;
+
+   for (pos = pci_find_capability(dev, PCI_CAP_ID_VNDR);
+        pos > 0;
+        pos = pci_find_next_capability(dev, pos, PCI_CAP_ID_VNDR)) {
+   u8 id;
+   pci_read_config_byte(dev, pos + PCI_VNDR_CAP_LEN, &cap_len);
+   if (cap_len < VIRTIO_PCI_CAP_ID + 1)
+   continue;
+   pci_read_config_byte(dev, pos + VIRTIO_PCI_CAP_ID, &id);
+   id >>= VIRTIO_PCI_CAP_ID_SHIFT;
+   id &= VIRTIO_PCI_CAP_ID_MASK;
+   if (id == cap_id)
+   break;
+   }
+
+   if (pos <= 0)
+   return NULL;
+
+   if (cap_len < VIRTIO_PCI_CAP_CFG_BIR + 1)
+   goto err;
+   pci_read_config_byte(dev, pos + VIRTIO_PCI_CAP_CFG_BIR, &bir);
+   if (cap_len < VIRTIO_PCI_CAP_CFG_OFF + 4)
+   goto err;
+   pci_read_config_dword(dev, pos + VIRTIO_PCI_CAP_CFG_OFF, &offset);
+   if (cap_len < VIRTIO_PCI_CAP_CFG_SIZE + 4)
+   goto err;
+   pci_read_config_dword(dev, pos + VIRTIO_PCI_CAP_CFG_SIZE, &size);
+   bir >>= VIRTIO_PCI_CAP_CFG_BIR_SHIFT;
+   bir &= VIRTIO_PCI_CAP_CFG_BIR_MASK;
+   size >>= VIRTIO_PCI_CAP_CFG_SIZE_SHIFT;
+   size &= VIRTIO_PCI_CAP_CFG_SIZE_MASK;
+   offset >>= VIRTIO_PCI_CAP_CFG_OFF_SHIFT;
+   offset &= VIRTIO_PCI_CAP_CFG_OFF_MASK;
+   /* Align offset appropriately */
+   offset &= ~(align - 1);
+
+   /* It's possible for a device to supply a huge config space,
+    * but we'll never need to map more than a page ATM. */
+   p = pci_iomap_range(dev, bir, offset, size, PAGE_SIZE);
+   if (!p)
+   dev_err(&vp_dev->vdev.dev, "Unable to map virtio pci memory");
+   return p;
+err:
+   dev_err(&vp_dev->vdev.dev, "Unable to parse virtio pci capability");
+   return 

kvm pio

2011-11-14 Thread Xin Tong
I am investigating how PIO is emulated in KVM and QEMU. When a PIO is
encountered, it seems to me that its pio data is copied to
vcpu->arch.pio_data and a fixed offset is assigned to
vcpu->run->io.data_offset.

static int emulator_pio_out_emulated(int size, unsigned short port,
{...
memcpy(vcpu->arch.pio_data, val, size * count);
...

   vcpu->run->io.data_offset = KVM_PIO_PAGE_OFFSET * PAGE_SIZE;
}

Later, in QEMU, it retrieves data from (uint8_t *)run +
run->io.data_offset. How can we be sure that the memory the pio data
is copied to (vcpu->arch.pio_data) is where (uint8_t *)run +
run->io.data_offset is pointing to?
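
(As far as I can tell, the link is the vcpu fd mmap: kvm_run sits at page 0
and the kernel's vcpu fault handler backs page KVM_PIO_PAGE_OFFSET of the
same mapping with vcpu->arch.pio_data, so both names refer to the same
page. Roughly, on the userspace side:)

#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

static void run_once(int kvm_fd, int vcpu_fd)
{
        long sz = ioctl(kvm_fd, KVM_GET_VCPU_MMAP_SIZE, 0);
        struct kvm_run *run = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, vcpu_fd, 0);

        ioctl(vcpu_fd, KVM_RUN, 0);
        if (run->exit_reason == KVM_EXIT_IO) {
                /* the page the kernel filled via vcpu->arch.pio_data */
                uint8_t *data = (uint8_t *)run + run->io.data_offset;
                (void)data;
        }
}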


Also, it seems that there is something called fast pio, in which kvm
does not return to qemu. In what case does that happen?

Thanks


Xin


Re: [PATCH 3/6] KVM: PPC: booke: Check for MSR[WE] in prepare_to_enter

2011-11-14 Thread Scott Wood
On 11/14/2011 12:01 PM, Alexander Graf wrote:
 
 On 14.11.2011, at 18:22, Scott Wood scottw...@freescale.com wrote:
 
 kvmppc_core_check_exceptions can clear MSR_WE, so we need to check after
 that.  We can't enable interrupts after kvmppc_core_check_exceptions (or
 rather, if we do, we need to check again once interrupts are
 re-disabled, as in the MSR_WE case) because otherwise we could have an
 exception delivered afterward and receive the resched IPI at just the
 wrong time to take any action on it (just like the signal check).
 
 Well, but if you enable interrupts here you're basically rendering code that 
 runs earlier in disabled-interrupt mode racy.

That's why once we wake, we disable interrupts and check again.

 How does this align with the signal handling? We could receive a signal here 
 and not handle it the same as the race you fixed earlier, no?

Signals are checked by the caller after calling prepare_to_enter.

Moving the check into prepare_to_enter wouldn't buy us much since the
method of returning is different, and so there'd still need to be code
in the caller to deal with it.

-Scott



Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-14 Thread Alex Williamson
On Fri, 2011-11-11 at 18:14 -0600, Scott Wood wrote:
 On 11/03/2011 03:12 PM, Alex Williamson wrote:
  +Many modern systems now provide DMA and interrupt remapping facilities
  +to help ensure I/O devices behave within the boundaries they've been
  +allotted.  This includes x86 hardware with AMD-Vi and Intel VT-d as
  +well as POWER systems with Partitionable Endpoints (PEs) and even
  +embedded powerpc systems (technology name unknown).  
 
 Maybe replace (technology name unknown) with (such as Freescale chips
 with PAMU) or similar?
 
 Or just leave out the parenthetical.

I was hoping that comment would lead to an answer.  Thanks for the
info ;)

  +As documented in linux/vfio.h, several ioctls are provided on the
  +group chardev:
  +
   +#define VFIO_GROUP_GET_FLAGS        _IOR(';', 100, __u64)
   + #define VFIO_GROUP_FLAGS_VIABLE    (1 << 0)
   + #define VFIO_GROUP_FLAGS_MM_LOCKED (1 << 1)
   +#define VFIO_GROUP_MERGE            _IOW(';', 101, int)
   +#define VFIO_GROUP_UNMERGE          _IOW(';', 102, int)
   +#define VFIO_GROUP_GET_IOMMU_FD     _IO(';', 103)
   +#define VFIO_GROUP_GET_DEVICE_FD    _IOW(';', 104, char *)
 
 This suggests the argument to VFIO_GROUP_GET_DEVICE_FD is a pointer to a
 pointer to char rather than a pointer to an array of char (just as e.g.
 VFIO_GROUP_MERGE takes a pointer to an int, not just an int).

I believe I was following the UI_SET_PHYS ioctl as an example, which is
defined as a char *.  I'll change to char and verify.

  +The IOMMU file descriptor provides this set of ioctls:
  +
   +#define VFIO_IOMMU_GET_FLAGS     _IOR(';', 105, __u64)
   + #define VFIO_IOMMU_FLAGS_MAP_ANY (1 << 0)
   +#define VFIO_IOMMU_MAP_DMA       _IOWR(';', 106, struct vfio_dma_map)
   +#define VFIO_IOMMU_UNMAP_DMA     _IOWR(';', 107, struct vfio_dma_map)
 
 What is the implication if VFIO_IOMMU_FLAGS_MAP_ANY is clear?  Is such
 an implementation supposed to add a new flag that describes its
 restrictions?

If MAP_ANY is clear then I would expect a new flag is set defining a new
mapping paradigm, probably with an ioctl to describe the
restrictions/parameters.  MAP_ANY effectively means there are no
restrictions.

 Can we get a way to turn DMA access off and on, short of unmapping
 everything, and then mapping it again?

iommu_ops doesn't support such an interface, so no, not currently.

  +The GET_FLAGS ioctl returns basic information about the IOMMU domain.
  +We currently only support IOMMU domains that are able to map any
  +virtual address to any IOVA.  This is indicated by the MAP_ANY flag.
  +
  +The (UN)MAP_DMA commands make use of struct vfio_dma_map for mapping
  +and unmapping IOVAs to process virtual addresses:
  +
  +struct vfio_dma_map {
  +__u64   len;/* length of structure */
  +__u64   vaddr;  /* process virtual addr */
  +__u64   dmaaddr;/* desired and/or returned dma address */
  +__u64   size;   /* size in bytes */
  +__u64   flags;
   +#define VFIO_DMA_MAP_FLAG_WRITE (1 << 0) /* req writeable DMA mem */
  +};
 
 What are the semantics of desired and/or returned dma address?

I believe the original intention was that a user could leave dmaaddr
clear and let the iommu layer provide an iova address.  The iommu api
has since evolved and that mapping scheme really isn't present anymore.
We'll currently fail if we can't map the requested address.  I'll update
the docs to make that be the definition.

 Are we always supposed to provide a desired address, but it may be
 different on return?  Or are there cases where we want to say give me
 whatever you want or give me this or fail?

Exactly, that's what it used to be, but we don't really implement that
any more.

 How much of this needs to be filled out for unmap?

dmaaddr & size, will update docs.

 Note that the length of structure approach means that ioctl numbers
 will change whenever this grows -- perhaps we should avoid encoding the
 struct size into these ioctls?

How so?  What's described here is effectively the base size.  If we
later add feature foo requiring additional fields, we set a flag, change
the size, and tack those fields onto the end.  The kernel side should
balk if the size doesn't match what it expects from the flags it
understands (which I think I probably need to be more strict about).
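
A sketch of that kernel-side check (the FOO flag and extended struct here
are hypothetical, purely to illustrate the scheme):

/* Hypothetical extension: FLAG_FOO means one extra field follows
 * the base layout. */
#define VFIO_DMA_MAP_FLAG_FOO   (1 << 1)

struct vfio_dma_map_foo {
        struct vfio_dma_map base;
        __u64 foo_arg;
};

static int vfio_check_map_len(struct vfio_dma_map *map)
{
        __u64 expect = sizeof(struct vfio_dma_map);

        if (map->flags & VFIO_DMA_MAP_FLAG_FOO)
                expect = sizeof(struct vfio_dma_map_foo);

        return map->len == expect ? 0 : -EINVAL;
}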

  +struct vfio_region_info {
  +__u32   len;/* length of structure */
  +__u32   index;  /* region number */
  +__u64   size;   /* size in bytes of region */
  +__u64   offset; /* start offset of region */
  +__u64   flags;
   +#define VFIO_REGION_INFO_FLAG_MMAP       (1 << 0)
   +#define VFIO_REGION_INFO_FLAG_RO         (1 << 1)
   +#define VFIO_REGION_INFO_FLAG_PHYS_VALID (1 << 2)
  +__u64   phys;   /* physical address of region */
  +};

In light of the above, this struct should not include 

Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-14 Thread Alex Williamson
On Mon, 2011-11-14 at 13:54 -0700, Alex Williamson wrote:
 On Fri, 2011-11-11 at 18:14 -0600, Scott Wood wrote:
  On 11/03/2011 03:12 PM, Alex Williamson wrote: 
   + for (i = 0; i < npage; i++, iova += PAGE_SIZE, vaddr += PAGE_SIZE) {
   + unsigned long pfn = 0;
   +
   + ret = vaddr_get_pfn(vaddr, rdwr, &pfn);
   + if (ret) {
   + __vfio_dma_unmap(iommu, start, i, rdwr);
   + return ret;
   + }
   +
   + /* Only add actual locked pages to accounting */
   + if (!is_invalid_reserved_pfn(pfn))
   + locked++;
   +
   + ret = iommu_map(iommu->domain, iova,
   + (phys_addr_t)pfn << PAGE_SHIFT, 0, prot);
   + if (ret) {
   + /* Back out mappings on error */
   + put_pfn(pfn, rdwr);
   + __vfio_dma_unmap(iommu, start, i, rdwr);
   + return ret;
   + }
   + }
  
  There's no way to hand this stuff to the IOMMU driver in chunks larger
  than a page?  That's going to be a problem for our IOMMU, which wants to
  deal with large windows.
 
 There is, this is just a simple implementation that maps individual
 pages.  We just need to determine physically contiguous chunks and
 mlock them instead of using get_user_pages.  The current implementation
 is much like how KVM maps iommu pages, but there shouldn't be a user API
 change to try to use larger chunks.  We want this for IOMMU large page
 support too.

Also, at one point intel-iommu didn't allow sub-ranges to be unmapped;
an unmap of a single page would unmap the entire original mapping that
contained that page.  That made it easier to map each page individually
for the flexibility it provided on unmap.  I need to see if we still
have that restriction.  Thanks,

Alex




Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-14 Thread Scott Wood
On 11/14/2011 02:54 PM, Alex Williamson wrote:
 On Fri, 2011-11-11 at 18:14 -0600, Scott Wood wrote:
 What are the semantics of desired and/or returned dma address?
 
 I believe the original intention was that a user could leave dmaaddr
 clear and let the iommu layer provide an iova address.  The iommu api
 has since evolved and that mapping scheme really isn't present anymore.
 We'll currently fail if we can't map the requested address.  I'll update
 the docs to make that be the definition.

OK... if there is any desire in the future to have the kernel pick an
address (which could be useful for IOMMUs that don't set
VFIO_IOMMU_FLAGS_MAP_ANY), there should be an explicit flag for this,
since zero could be a valid address to request (doesn't mean clear).
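
Something like the following would do (a hypothetical flag, shown only to
make the point that zero must stay a literal address):

/* Hypothetical: ask the kernel to choose the iova and return it in
 * dmaaddr, instead of overloading dmaaddr == 0 to mean 'pick one'. */
#define VFIO_DMA_MAP_FLAG_ALLOC_IOVA    (1 << 1)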

 Note that the length of structure approach means that ioctl numbers
 will change whenever this grows -- perhaps we should avoid encoding the
 struct size into these ioctls?
 
 How so?  What's described here is effectively the base size.  If we
 later add feature foo requiring additional fields, we set a flag, change
 the size, and tack those fields onto the end.  The kernel side should
 balk if the size doesn't match what it expects from the flags it
 understands (which I think I probably need to be more strict about).

The size of the struct is encoded into the ioctl number via the _IOWR()
macro.  If we want the struct to be growable in the future, we should
leave that out and just use _IO().  Otherwise if the size of the struct
changes, the ioctl number changes.  This is annoying for old userspace
plus new kernel (have to add compat entries to the switch), and broken
for old kernel plus new userspace.
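
Concretely (struct layouts abbreviated): _IOWR() packs sizeof(type) into
bits 16-29 of the ioctl number, so growing the struct silently renames the
ioctl, while _IO() leaves the size out:

#include <linux/ioctl.h>
#include <linux/types.h>

struct vfio_dma_map_v1 { __u64 len, vaddr, dmaaddr, size, flags; };
struct vfio_dma_map_v2 { __u64 len, vaddr, dmaaddr, size, flags, extra; };

#define MAP_DMA_V1 _IOWR(';', 106, struct vfio_dma_map_v1)
#define MAP_DMA_V2 _IOWR(';', 106, struct vfio_dma_map_v2)  /* != MAP_DMA_V1 */
#define MAP_DMA    _IO(';', 106)  /* number stable across struct growth */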

 Can we limit the IOMMU_API dependency to the IOMMU parts of VFIO?  It
 would still be useful for devices which don't do DMA, or where we accept
 the lack of protection/translation (e.g. we have a customer that wants
 to do KVM device assignment on one of our lower-end chips that lacks an
 IOMMU).
 
 Ugh.  I'm not really onboard with it given that we're trying to sell
 vfio as a secure user space driver interface with iommu-based
 protection.

That's its main use case, but it doesn't make much sense to duplicate
the non-iommu-related bits for other use cases.

This applies at runtime too: some devices don't do DMA at all (and thus
may not be part of an IOMMU group, even if there is an IOMMU present for
other devices -- could be considered a standalone group of one device,
with a null IOMMU backend).  Support for such devices can wait, but it's
good to keep the possibility in mind.

-Scott



Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-14 Thread Alexander Graf


On 14.11.2011 at 23:26, Scott Wood scottw...@freescale.com wrote:

 On 11/14/2011 02:54 PM, Alex Williamson wrote:
 On Fri, 2011-11-11 at 18:14 -0600, Scott Wood wrote:
 What are the semantics of desired and/or returned dma address?
 
 I believe the original intention was that a user could leave dmaaddr
 clear and let the iommu layer provide an iova address.  The iommu api
 has since evolved and that mapping scheme really isn't present anymore.
 We'll currently fail if we can't map the requested address.  I'll update
 the docs to make that be the definition.
 
 OK... if there is any desire in the future to have the kernel pick an
 address (which could be useful for IOMMUs that don't set
 VFIO_IOMMU_FLAGS_MAP_ANY), there should be an explicit flag for this,
 since zero could be a valid address to request (doesn't mean clear).
 
 Note that the length of structure approach means that ioctl numbers
 will change whenever this grows -- perhaps we should avoid encoding the
 struct size into these ioctls?
 
 How so?  What's described here is effectively the base size.  If we
 later add feature foo requiring additional fields, we set a flag, change
 the size, and tack those fields onto the end.  The kernel side should
 balk if the size doesn't match what it expects from the flags it
 understands (which I think I probably need to be more strict about).
 
 The size of the struct is encoded into the ioctl number via the _IOWR()
 macro.  If we want the struct to be growable in the future, we should
 leave that out and just use _IO().  Otherwise if the size of the struct
 changes, the ioctl number changes.  This is annoying for old userspace
 plus new kernel (have to add compat entries to the switch), and broken
 for old kernel plus new userspace.

Avi wanted to write up a patch for this to allow ioctls with arbitrary size, 
for exactly this purpose.

 
 Can we limit the IOMMU_API dependency to the IOMMU parts of VFIO?  It
 would still be useful for devices which don't do DMA, or where we accept
 the lack of protection/translation (e.g. we have a customer that wants
 to do KVM device assignment on one of our lower-end chips that lacks an
 IOMMU).
 
 Ugh.  I'm not really onboard with it given that we're trying to sell
 vfio as a secure user space driver interface with iommu-based
 protection.
 
 That's its main use case, but it doesn't make much sense to duplicate
 the non-iommu-related bits for other use cases.
 
 This applies at runtime too, some devices don't do DMA at all (and thus
 may not be part of an IOMMU group, even if there is an IOMMU present for
 other devices -- could be considered a standalone group of one device,
 with a null IOMMU backend).  Support for such devices can wait, but it's
 good to keep the possibility in mind.

I agree. Potentially backing a device with a nop iommu also makes testing 
easier.
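
A nop backend could be nearly empty -- a standalone group of one
non-DMA device with no translation (all names invented for the sketch):

static int nop_iommu_attach_dev(struct vfio_iommu *iommu,
				struct device *dev)
{
	return 0;			/* nothing to program */
}

static int nop_iommu_map(struct vfio_iommu *iommu, dma_addr_t iova,
			 unsigned long vaddr, size_t size, int prot)
{
	return -ENODEV;			/* the device does no DMA */
}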

Alex

 


RE: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-14 Thread Alex Williamson
On Fri, 2011-11-11 at 16:22 -0600, Christian Benvenuti (benve) wrote:
  -Original Message-
  From: Alex Williamson [mailto:alex.william...@redhat.com]
  Sent: Friday, November 11, 2011 10:04 AM
  To: Christian Benvenuti (benve)
  Cc: chr...@sous-sol.org; a...@au1.ibm.com; p...@au1.ibm.com;
  d...@au1.ibm.com; joerg.roe...@amd.com; ag...@suse.de; Aaron Fabbri
  (aafabbri); b08...@freescale.com; b07...@freescale.com; a...@redhat.com;
  konrad.w...@oracle.com; kvm@vger.kernel.org; qemu-de...@nongnu.org;
  io...@lists.linux-foundation.org; linux-...@vger.kernel.org
  Subject: RE: [RFC PATCH] vfio: VFIO Driver core framework
  
  On Wed, 2011-11-09 at 18:57 -0600, Christian Benvenuti (benve) wrote:
    Here are a few minor comments on vfio_iommu.c ...
  
  Sorry, I've been poking sticks at trying to figure out a clean way to
  solve the force vfio driver attach problem.
 
 Attach or detach?

Attach.  For the case when a new device appears that belongs to a group
that is already in use.  I'll probably add a claim() operation to the
vfio_device_ops that tells the driver to grab it.  I was hoping for pci
this would just add it to the dynamic ids, but that hits device lock
problems.
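
The claim() operation might take a shape like this (purely a sketch,
not something in the posted RFC):

struct vfio_device_ops {
	int	(*claim)(struct device *dev);	/* force-bind new group member */
	int	(*get)(void *device_data);
	void	(*put)(void *device_data);
	/* read/write/ioctl/mmap as in the RFC ... */
};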

diff --git a/drivers/vfio/vfio_iommu.c b/drivers/vfio/vfio_iommu.c
new file mode 100644
index 000..029dae3
--- /dev/null
+++ b/drivers/vfio/vfio_iommu.c
  <snip>
+
+#include "vfio_private.h"
  
   Doesn't the 'dma_'  prefix belong to the generic DMA code?
  
  Sure, we could make these more vfio-centric.
 
 Like vfio_dma_map_page?

Something like that, though _page doesn't seem appropriate as it tracks
a region.

  
+struct dma_map_page {
+   struct list_headlist;
+   dma_addr_t  daddr;
+   unsigned long   vaddr;
+   int npage;
+   int rdwr;
+};
+
+/*
+ * This code handles mapping and unmapping of user data buffers
+ * into DMA'ble space using the IOMMU
+ */
+
+#define NPAGE_TO_SIZE(npage)   ((size_t)(npage) << PAGE_SHIFT)
+
+struct vwork {
+   struct mm_struct*mm;
+   int npage;
+   struct work_struct  work;
+};
+
+/* delayed decrement for locked_vm */
+static void vfio_lock_acct_bg(struct work_struct *work)
+{
+   struct vwork *vwork = container_of(work, struct vwork, work);
+   struct mm_struct *mm;
+
+   mm = vwork->mm;
+   down_write(&mm->mmap_sem);
+   mm->locked_vm += vwork->npage;
+   up_write(&mm->mmap_sem);
+   mmput(mm);  /* unref mm */
+   kfree(vwork);
+}
+
+static void vfio_lock_acct(int npage)
+{
+   struct vwork *vwork;
+   struct mm_struct *mm;
+
+   if (!current->mm) {
+   /* process exited */
+   return;
+   }
+   if (down_write_trylock(&current->mm->mmap_sem)) {
+   current->mm->locked_vm += npage;
+   up_write(&current->mm->mmap_sem);
+   return;
+   }
+   /*
+* Couldn't get mmap_sem lock, so must setup to decrement
 ^
  
   Increment?
  
  Yep

Actually, side note, this is increment/decrement depending on the sign
of the parameter.  So update may be more appropriate.  I think Tom
originally used increment in one place and decrement in another to show
it's dual use.

  <snip>
+int vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t
start,
+   size_t size, struct dma_map_page *mlp)
+{
+   struct dma_map_page *split;
+   int npage_lo, npage_hi;
+
+   /* Existing dma region is completely covered, unmap all */
  
   This works. However, given how vfio_dma_map_dm implements the merging
   logic, I think it is impossible to have
  
    (start < mlp->daddr &&
 start + size > mlp->daddr + NPAGE_TO_SIZE(mlp->npage))
  
  It's quite possible.  This allows userspace to create a sparse mapping,
  then blow it all away with a single unmap from 0 to ~0.
 
 I would prefer the user to use exact ranges in the unmap operations
 because it would make it easier to detect bugs/leaks in the map/unmap
 logic used by the callers.
 My assumptions are that:
 
 - the user always keeps track of the mappings

My qemu code plays a little on the loose side here, acting as a
passthrough for the internal memory client.  But even there, worst case
would probably be trying to unmap a non-existent entry, not unmapping a
sparse range.
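
For concreteness, the sparse-map-then-blanket-unmap pattern under
discussion might look like this (ioctl and field names are assumptions
in the RFC's style; fd is an open vfio iommu fd):

struct vfio_dma_map map = {
	.vaddr   = (__u64)(unsigned long)buf,
	.dmaaddr = 0x100000,
	.size    = 4096,
};
ioctl(fd, VFIO_IOMMU_MAP_DMA, &map);

map.vaddr  += 4096;
map.dmaaddr = 0x300000;		/* disjoint: leaves a hole in iova space */
ioctl(fd, VFIO_IOMMU_MAP_DMA, &map);

struct vfio_dma_unmap unmap = { .dmaaddr = 0, .size = (__u64)-1 };
ioctl(fd, VFIO_IOMMU_UNMAP_DMA, &unmap);	/* covers both, and the hole */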

 - the user either unmaps one specific mapping or 'all of them'.
   The 'all of them' case would also take care of those cases where
   the user does _not_ keep track of mappings and simply uses
   the unmap from 0 to ~0 each time.
 
 Because of this you could still provide an exact map/unmap logic
  and allow such unmap from 0 to ~0 by making the latter a special
  case.
  However, if we want to allow any arbitrary/inexact unmap request, then OK.

I can't think of any good reasons we shouldn't be more strict.  I think
it was primarily just convenient to hit all the corner cases since we
merge all the requests together for tracking and need to be able to
split them back apart.  It does feel a little awkward to have a 0/~0
special case though, but I don't think it's worth adding another ioctl
to handle it.

[RFC 2/5] virtio-pci: Fix compilation issue

2011-11-14 Thread Sasha Levin
Signed-off-by: Sasha Levin levinsasha...@gmail.com
---
 drivers/virtio/virtio_pci.c |6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
index 7625434..d242fcc 100644
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -129,10 +129,10 @@ static void __iomem *virtio_pci_map_cfg(struct 
virtio_pci_device *vp_dev, u8 cap
 bir &= VIRTIO_PCI_CAP_CFG_BIR_MASK;
 size >>= VIRTIO_PCI_CAP_CFG_SIZE_SHIFT;
 size &= VIRTIO_PCI_CAP_CFG_SIZE_MASK;
-off >>= VIRTIO_PCI_CAP_CFG_OFF_SHIFT;
-off &= VIRTIO_PCI_CAP_CFG_OFF_MASK;
+offset >>= VIRTIO_PCI_CAP_CFG_OFF_SHIFT;
+offset &= VIRTIO_PCI_CAP_CFG_OFF_MASK;
/* Align offset appropriately */
-   off &= ~(align - 1);
+   offset &= ~(align - 1);
 
/* It's possible for a device to supply a huge config space,
 * but we'll never need to map more than a page ATM. */
-- 
1.7.8.rc1



[RFC 4/5] kvm tools: Free up the MSI-X PBA BAR

2011-11-14 Thread Sasha Levin
Free up the BAR to make space for the new virtio BARs. It isn't required
to have the PBA and the table in separate BARs, and uniting them will
just give us extra BARs to play with.
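
For background (per the PCI spec, and assuming PCI_IO_SIZE is 0x100 as
in kvm tools): the low 3 bits of the MSI-X table/PBA offset registers
hold the BAR index (BIR), the remaining bits the byte offset, so the
"1 | PCI_IO_SIZE" written below decodes as BAR 1, offset 0x100:

u32 pba_offset = 1 | PCI_IO_SIZE;	/* what this patch writes */
u8  bir = pba_offset & 0x7;		/* = 1: PBA lives in BAR 1 */
u32 off = pba_offset & ~0x7;		/* = 0x100: just past the table */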

Signed-off-by: Sasha Levin levinsasha...@gmail.com
---
 tools/kvm/include/kvm/virtio-pci.h |1 -
 tools/kvm/virtio/pci.c |   35 ++-
 2 files changed, 14 insertions(+), 22 deletions(-)

diff --git a/tools/kvm/include/kvm/virtio-pci.h 
b/tools/kvm/include/kvm/virtio-pci.h
index 2bbb271..73f7486 100644
--- a/tools/kvm/include/kvm/virtio-pci.h
+++ b/tools/kvm/include/kvm/virtio-pci.h
@@ -30,7 +30,6 @@ struct virtio_pci {
u32 vq_vector[VIRTIO_PCI_MAX_VQ];
u32 gsis[VIRTIO_PCI_MAX_VQ];
u32 msix_io_block;
-   u32 msix_pba_block;
u64 msix_pba;
struct msix_table   msix_table[VIRTIO_PCI_MAX_VQ + 
VIRTIO_PCI_MAX_CONFIG];
 
diff --git a/tools/kvm/virtio/pci.c b/tools/kvm/virtio/pci.c
index 1660f06..da38ba5 100644
--- a/tools/kvm/virtio/pci.c
+++ b/tools/kvm/virtio/pci.c
@@ -214,23 +214,21 @@ static struct ioport_operations virtio_pci__io_ops = {
 static void callback_mmio_table(u64 addr, u8 *data, u32 len, u8 is_write, void 
*ptr)
 {
struct virtio_pci *vpci = ptr;
-   void *table = vpci->msix_table;
+   void *table;
+   u32 offset;
 
-   if (is_write)
-   memcpy(table + addr - vpci->msix_io_block, data, len);
-   else
-   memcpy(data, table + addr - vpci->msix_io_block, len);
-}
-
-static void callback_mmio_pba(u64 addr, u8 *data, u32 len, u8 is_write, void 
*ptr)
-{
-   struct virtio_pci *vpci = ptr;
-   void *pba = &vpci->msix_pba;
+   if (addr > vpci->msix_io_block + PCI_IO_SIZE) {
+   table   = &vpci->msix_pba;
+   offset  = vpci->msix_io_block + PCI_IO_SIZE;
+   } else {
+   table   = vpci->msix_table;
+   offset  = vpci->msix_io_block;
+   }
 
if (is_write)
-   memcpy(pba + addr - vpci->msix_pba_block, data, len);
+   memcpy(table + addr - offset, data, len);
else
-   memcpy(data, pba + addr - vpci->msix_pba_block, len);
+   memcpy(data, table + addr - offset, len);
 }
 
 int virtio_pci__signal_vq(struct kvm *kvm, struct virtio_trans *vtrans, u32 vq)
@@ -283,12 +281,10 @@ int virtio_pci__init(struct kvm *kvm, struct virtio_trans 
*vtrans, void *dev,
u8 pin, line, ndev;
 
vpci->dev = dev;
-   vpci->msix_io_block = pci_get_io_space_block(PCI_IO_SIZE);
-   vpci->msix_pba_block = pci_get_io_space_block(PCI_IO_SIZE);
+   vpci->msix_io_block = pci_get_io_space_block(PCI_IO_SIZE * 2);
 
vpci->base_addr = ioport__register(IOPORT_EMPTY, virtio_pci__io_ops, 
IOPORT_SIZE, vtrans);
kvm__register_mmio(kvm, vpci->msix_io_block, 0x100, 
callback_mmio_table, vpci);
-   kvm__register_mmio(kvm, vpci->msix_pba_block, 0x100, callback_mmio_pba, 
vpci);
 
vpci->pci_hdr = (struct pci_device_header) {
.vendor_id  = PCI_VENDOR_ID_REDHAT_QUMRANET,
@@ -299,10 +295,7 @@ int virtio_pci__init(struct kvm *kvm, struct virtio_trans 
*vtrans, void *dev,
.subsys_vendor_id   = 
PCI_SUBSYSTEM_VENDOR_ID_REDHAT_QUMRANET,
.subsys_id  = subsys_id,
.bar[0] = vpci->base_addr | 
PCI_BASE_ADDRESS_SPACE_IO,
-   .bar[1] = vpci->msix_io_block | 
PCI_BASE_ADDRESS_SPACE_MEMORY
-   | PCI_BASE_ADDRESS_MEM_TYPE_64,
-   .bar[3] = vpci->msix_pba_block | 
PCI_BASE_ADDRESS_SPACE_MEMORY
-   | PCI_BASE_ADDRESS_MEM_TYPE_64,
+   .bar[1] = vpci->msix_io_block | 
PCI_BASE_ADDRESS_SPACE_MEMORY,
.status = PCI_STATUS_CAP_LIST,
.capabilities   = (void *)&vpci->pci_hdr.msix - (void 
*)&vpci->pci_hdr,
};
@@ -327,7 +320,7 @@ int virtio_pci__init(struct kvm *kvm, struct virtio_trans 
*vtrans, void *dev,
 * we're not in short of BARs
 */
vpci->pci_hdr.msix.table_offset = 1; /* Use BAR 1 */
-   vpci->pci_hdr.msix.pba_offset = 3; /* Use BAR 3 */
+   vpci->pci_hdr.msix.pba_offset = 1 | PCI_IO_SIZE; /* Use BAR 1 with 
offset */
vpci->config_vector = 0;
 
if (irq__register_device(subsys_id, ndev, pin, line)  0)
-- 
1.7.8.rc1



[RFC 5/5] kvm tools: Support new virtio-pci configuration layout

2011-11-14 Thread Sasha Levin
Signed-off-by: Sasha Levin levinsasha...@gmail.com
---
 tools/kvm/include/kvm/pci.h|   13 ++-
 tools/kvm/include/kvm/virtio-pci.h |4 +
 tools/kvm/virtio/pci.c |  248 ++--
 3 files changed, 254 insertions(+), 11 deletions(-)

diff --git a/tools/kvm/include/kvm/pci.h b/tools/kvm/include/kvm/pci.h
index f71af0b..4f5a09f 100644
--- a/tools/kvm/include/kvm/pci.h
+++ b/tools/kvm/include/kvm/pci.h
@@ -39,6 +39,16 @@ struct msix_cap {
u32 pba_offset;
 };
 
+struct virtio_cap {
+   u8 cap;
+   u8 next;
+   u8 cap_len;
+   u8 structure_id;
+   u8 bir;
+   u32 size:24;
+   u32 offset;
+};
+
 struct pci_device_header {
u16 vendor_id;
u16 device_id;
@@ -63,7 +73,8 @@ struct pci_device_header {
u8  min_gnt;
u8  max_lat;
struct msix_cap msix;
-   u8  empty[136]; /* Rest of PCI config space */
+   struct virtio_cap virtio[4];
+   u8  empty[90]; /* Rest of PCI config space */
u32 bar_size[6];
 };
 
diff --git a/tools/kvm/include/kvm/virtio-pci.h 
b/tools/kvm/include/kvm/virtio-pci.h
index 73f7486..fc3c03f 100644
--- a/tools/kvm/include/kvm/virtio-pci.h
+++ b/tools/kvm/include/kvm/virtio-pci.h
@@ -19,6 +19,7 @@ struct virtio_pci_ioevent_param {
 struct virtio_pci {
struct pci_device_header pci_hdr;
void*dev;
+   struct kvm  *kvm;
 
u16 base_addr;
u8  status;
@@ -36,6 +37,9 @@ struct virtio_pci {
/* virtio queue */
u16 queue_selector;
struct virtio_pci_ioevent_param ioeventfds[VIRTIO_PCI_MAX_VQ];
+
+   u32 virtio_mmio;
+   u16 virtio_pio;
 };
 
 int virtio_pci__init(struct kvm *kvm, struct virtio_trans *vtrans, void *dev,
diff --git a/tools/kvm/virtio/pci.c b/tools/kvm/virtio/pci.c
index da38ba5..8606d87 100644
--- a/tools/kvm/virtio/pci.c
+++ b/tools/kvm/virtio/pci.c
@@ -40,7 +40,7 @@ static int virtio_pci__init_ioeventfd(struct kvm *kvm, struct 
virtio_trans *vtra
};
 
ioevent = (struct ioevent) {
-   .io_addr= vpci->base_addr + VIRTIO_PCI_QUEUE_NOTIFY,
+   .io_addr= vpci->virtio_pio,
.io_len = sizeof(u16),
.fn = virtio_pci__ioevent_callback,
.fn_ptr = vpci-ioeventfds[vq],
@@ -59,7 +59,7 @@ static inline bool virtio_pci__msix_enabled(struct virtio_pci 
*vpci)
return vpci->pci_hdr.msix.ctrl & PCI_MSIX_FLAGS_ENABLE;
 }
 
-static bool virtio_pci__specific_io_in(struct kvm *kvm, struct virtio_trans 
*vtrans, u16 port,
+static bool virtio_pci_legacy__specific_io_in(struct kvm *kvm, struct 
virtio_trans *vtrans, u16 port,
void *data, int size, int offset)
 {
u32 config_offset;
@@ -89,7 +89,8 @@ static bool virtio_pci__specific_io_in(struct kvm *kvm, 
struct virtio_trans *vtr
return false;
 }
 
-static bool virtio_pci__io_in(struct ioport *ioport, struct kvm *kvm, u16 
port, void *data, int size)
+static bool virtio_pci_legacy__io_in(struct ioport *ioport, struct kvm *kvm,
+   u16 port, void *data, int size)
 {
unsigned long offset;
bool ret = true;
@@ -124,15 +125,15 @@ static bool virtio_pci__io_in(struct ioport *ioport, 
struct kvm *kvm, u16 port,
vpci->isr = VIRTIO_IRQ_LOW;
break;
default:
-   ret = virtio_pci__specific_io_in(kvm, vtrans, port, data, size, 
offset);
+   ret = virtio_pci_legacy__specific_io_in(kvm, vtrans, port, 
data, size, offset);
break;
};
 
return ret;
 }
 
-static bool virtio_pci__specific_io_out(struct kvm *kvm, struct virtio_trans 
*vtrans, u16 port,
-   void *data, int size, int offset)
+static bool virtio_pci_legacy__specific_io_out(struct kvm *kvm, struct 
virtio_trans *vtrans,
+   u16 port, void *data, int size, int 
offset)
 {
struct virtio_pci *vpci = vtrans->virtio;
u32 config_offset, gsi, vec;
@@ -166,7 +167,8 @@ static bool virtio_pci__specific_io_out(struct kvm *kvm, 
struct virtio_trans *vt
return false;
 }
 
-static bool virtio_pci__io_out(struct ioport *ioport, struct kvm *kvm, u16 
port, void *data, int size)
+static bool virtio_pci_legacy__io_out(struct ioport *ioport,
+   struct kvm *kvm, u16 port, void *data, int size)
 {
unsigned long offset;
bool ret = true;
@@ -199,7 +201,76 @@ static bool virtio_pci__io_out(struct ioport *ioport, 
struct kvm *kvm, u16 port,
vpci->status = ioport__read8(data);
break;
default:
-   ret = 

[RFC 3/5] iomap: Don't ignore offset

2011-11-14 Thread Sasha Levin
Offset was ignored for start calculations, making all mappings start at
offset 0 always.

Signed-off-by: Sasha Levin levinsasha...@gmail.com
---
 lib/iomap.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/lib/iomap.c b/lib/iomap.c
index f28720e..93ae915 100644
--- a/lib/iomap.c
+++ b/lib/iomap.c
@@ -272,6 +272,7 @@ void __iomem *pci_iomap_range(struct pci_dev *dev, int bar,
if (len <= offset || !start)
return NULL;
 len -= offset;
+   start += offset;
 if (len < minlen)
return NULL;
if (maxlen && len > maxlen)
-- 
1.7.8.rc1



[RFC 1/5] virtio-pci: flexible configuration layout

2011-11-14 Thread Sasha Levin
From: Michael S. Tsirkin m...@redhat.com

Add a flexible mechanism to specify virtio configuration layout, using
pci vendor-specific capability.  A separate capability is used for each
of common, device specific and data-path accesses.

Warning: compiled only.
This patch also needs to be split up, pci_iomap changes
also need arch updates for non-x86.
There might also be more spec changes.

Posting here for early feedback, and to allow Sasha to
proceed with his kvm tool work.

Changes from v1:
Updated to match v3 of the spec, see:
Subject: [PATCHv3 RFC] virtio-spec: flexible configuration layout
Message-ID: 2010122436.ga13...@redhat.com
In-Reply-To: 2009195901.ga28...@redhat.com

In particular:
- split ISR out
- reorg capability
- add offset alignment

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 drivers/virtio/virtio_pci.c |  184 +++---
 include/asm-generic/io.h|4 +
 include/asm-generic/iomap.h |   11 +++
 include/linux/pci_regs.h|4 +
 include/linux/virtio_pci.h  |   41 ++
 lib/iomap.c |   40 -
 6 files changed, 265 insertions(+), 19 deletions(-)

diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
index 3d1bf41..7625434 100644
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -37,8 +37,12 @@ struct virtio_pci_device
struct virtio_device vdev;
struct pci_dev *pci_dev;
 
-   /* the IO mapping for the PCI config space */
+   /* the IO address for the common PCI config space */
void __iomem *ioaddr;
+   /* the IO address for device specific config */
+   void __iomem *ioaddr_device;
+   /* the IO address to use for notifications operations */
+   void __iomem *ioaddr_notify;
 
/* a list of queues so we can dispatch IRQs */
spinlock_t lock;
@@ -57,8 +61,158 @@ struct virtio_pci_device
unsigned msix_used_vectors;
/* Whether we have vector per vq */
bool per_vq_vectors;
+
+   /* Various IO mappings: used for resource tracking only. */
+
+   /* Legacy BAR0: typically PIO. */
+   void __iomem *legacy_map;
+
+   /* Mappings specified by device capabilities: typically in MMIO */
+   void __iomem *isr_map;
+   void __iomem *notify_map;
+   void __iomem *common_map;
+   void __iomem *device_map;
 };
 
+/*
+ * With PIO, device-specific config moves as MSI-X is enabled/disabled.
+ * Use this accessor to keep pointer to that config in sync.
+ */
+static void virtio_pci_set_msix_enabled(struct virtio_pci_device *vp_dev, int 
enabled)
+{
+   vp_dev->msix_enabled = enabled;
+   if (vp_dev->device_map)
+   vp_dev->ioaddr_device = vp_dev->device_map;
+   else
+   vp_dev->ioaddr_device = vp_dev->legacy_map +
+   VIRTIO_PCI_CONFIG(vp_dev);
+}
+
+static void __iomem *virtio_pci_map_cfg(struct virtio_pci_device *vp_dev, u8 
cap_id,
+   u32 align)
+{
+u32 size;
+u32 offset;
+u8 bir;
+u8 cap_len;
+   int pos;
+   struct pci_dev *dev = vp_dev->pci_dev;
+   void __iomem *p;
+
+   for (pos = pci_find_capability(dev, PCI_CAP_ID_VNDR);
+pos > 0;
+pos = pci_find_next_capability(dev, pos, PCI_CAP_ID_VNDR)) {
+   u8 id;
+   pci_read_config_byte(dev, pos + PCI_VNDR_CAP_LEN, &cap_len);
+   if (cap_len < VIRTIO_PCI_CAP_ID + 1)
+   continue;
+   pci_read_config_byte(dev, pos + VIRTIO_PCI_CAP_ID, &id);
+   id >>= VIRTIO_PCI_CAP_ID_SHIFT;
+   id &= VIRTIO_PCI_CAP_ID_MASK;
+   if (id == cap_id)
+   break;
+   }
+
+   if (pos <= 0)
+   return NULL;
 
+   if (cap_len < VIRTIO_PCI_CAP_CFG_BIR + 1)
+   goto err;
+pci_read_config_byte(dev, pos + VIRTIO_PCI_CAP_CFG_BIR, &bir);
+   if (cap_len < VIRTIO_PCI_CAP_CFG_OFF + 4)
+   goto err;
+pci_read_config_dword(dev, pos + VIRTIO_PCI_CAP_CFG_OFF, &offset);
+   if (cap_len < VIRTIO_PCI_CAP_CFG_SIZE + 4)
+   goto err;
+pci_read_config_dword(dev, pos + VIRTIO_PCI_CAP_CFG_SIZE, &size);
+bir >>= VIRTIO_PCI_CAP_CFG_BIR_SHIFT;
+bir &= VIRTIO_PCI_CAP_CFG_BIR_MASK;
+size >>= VIRTIO_PCI_CAP_CFG_SIZE_SHIFT;
+size &= VIRTIO_PCI_CAP_CFG_SIZE_MASK;
+off >>= VIRTIO_PCI_CAP_CFG_OFF_SHIFT;
+off &= VIRTIO_PCI_CAP_CFG_OFF_MASK;
+   /* Align offset appropriately */
+   off &= ~(align - 1);
+
+   /* It's possible for a device to supply a huge config space,
+* but we'll never need to map more than a page ATM. */
+   p = pci_iomap_range(dev, bir, offset, size, PAGE_SIZE);
+   if (!p)
+   dev_err(&vp_dev->vdev.dev, "Unable to map virtio pci memory");
+   return p;
+err:
+   dev_err(&vp_dev->vdev.dev, "Unable to parse 

[RFC 0/5] virtio-pci,kvm tools: Support new virtio-pci spec in kvm tools

2011-11-14 Thread Sasha Levin
This patch series contains two fixes for the RFC patch sent by Michael. These
patches are pretty basic and should probably be merged into the next version
of that patch.

The other two patches add support for the new virtio-pci spec in KVM tools, allowing
for easy testing of any future virtio-pci kernel side patches. The usermode code
isn't complete, but it's working properly and should be enough for functionality
testing.

Michael S. Tsirkin (1):
  virtio-pci: flexible configuration layout

Sasha Levin (4):
  virtio-pci: Fix compilation issue
  iomap: Don't ignore offset
  kvm tools: Free up the MSI-X PBA BAR
  kvm tools: Support new virtio-pci configuration layout

 drivers/virtio/virtio_pci.c|  184 ++--
 include/asm-generic/io.h   |4 +
 include/asm-generic/iomap.h|   11 ++
 include/linux/pci_regs.h   |4 +
 include/linux/virtio_pci.h |   41 ++
 lib/iomap.c|   41 +-
 tools/kvm/include/kvm/pci.h|   13 ++-
 tools/kvm/include/kvm/virtio-pci.h |5 +-
 tools/kvm/virtio/pci.c |  275 
 9 files changed, 530 insertions(+), 48 deletions(-)

-- 
1.7.8.rc1



Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-14 Thread David Gibson
On Fri, Nov 11, 2011 at 03:10:56PM -0700, Alex Williamson wrote:
 Thanks Konrad!  Comments inline.
 On Fri, 2011-11-11 at 12:51 -0500, Konrad Rzeszutek Wilk wrote:
  On Thu, Nov 03, 2011 at 02:12:24PM -0600, Alex Williamson wrote:
[snip]
   +The GET_NUM_REGIONS ioctl tells us how many regions the device supports:
   +
   +#define VFIO_DEVICE_GET_NUM_REGIONS _IOR(';', 109, int)
  
  Don't want __u32?
 
 It could be, not sure if it buys us anything, maybe even restricts us.
 We likely don't need 2^32 regions (famous last words?), so we could
 later define <0 to something?

As a rule, it's best to use explicit fixed width types for all ioctl()
arguments, to avoid compat hell for 32-bit userland on 64-bit kernel
setups.
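
A sketch of the hazard (types invented):

struct bad_args {
	unsigned long vaddr;	/* 4 bytes for 32-bit userland, 8 in kernel */
	int           count;	/* field offsets shift too: compat shim needed */
};

struct good_args {
	__u64 vaddr;		/* identical size and alignment everywhere */
	__u32 count;
	__u32 pad;		/* keep total size a multiple of 8 */
};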

[snip]
   +Again, zero count entries are allowed (vfio-pci uses a static interrupt
   +type to index mapping).
  
  I am not really sure what that means.
 
 This is so PCI can expose:
 
 enum {
 VFIO_PCI_INTX_IRQ_INDEX,
 VFIO_PCI_MSI_IRQ_INDEX,
 VFIO_PCI_MSIX_IRQ_INDEX,
 VFIO_PCI_NUM_IRQS
 };
 
 So like regions it always exposes 3 IRQ indexes where count=0 if the
 device doesn't actually support that type of interrupt.  I just want to
 spell out that bus drivers have this kind of flexibility.

I knew what you were aiming for, so I could see what you meant here,
but I don't think the doco is very clearly expressed at all.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-14 Thread David Gibson
On Mon, Nov 14, 2011 at 03:59:00PM -0700, Alex Williamson wrote:
 On Fri, 2011-11-11 at 16:22 -0600, Christian Benvenuti (benve) wrote:
[snip]

  - the user either unmaps one specific mapping or 'all of them'.
The 'all of them' case would also take care of those cases where
the user does _not_ keep track of mappings and simply uses
the unmap from 0 to ~0 each time.
  
  Because of this you could still provide an exact map/unmap logic
  and allow such unmap from 0 to ~0 by making the latter a special
  case.
  However, if we want to allow any arbitrary/inexact unmap request, then OK.
 
 I can't think of any good reasons we shouldn't be more strict.  I think
 it was primarily just convenient to hit all the corner cases since we
 merge all the requests together for tracking and need to be able to
 split them back apart.  It does feel a little awkward to have a 0/~0
 special case though, but I don't think it's worth adding another ioctl
 to handle it.

Being strict, or at least enforcing strictness, requires that the
infrastructure track all the maps, so that the unmaps can be
matching.  This is not a natural thing with the data structures you
want for all IOMMUs.  For example on POWER, the IOMMU (aka TCE table)
is a simple 1-level pagetable.  One pointer with a couple of
permission bits per IOMMU page.  Handling oddly overlapping operations
on that data structure is natural, enforcing strict matching of maps
and unmaps is not and would require extra information to be stored by
vfio.  On POWER, the IOMMU operations often *are* a hot path, so
manipulating those structures would have a real cost, too.
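
A very rough illustration of such a table (all names invented): an
unmap is just clearing a slice of entries, so overlapping or partial
ranges cost nothing, while strict map/unmap matching would need
bookkeeping the structure itself doesn't have.

#define TCE_READ	0x1UL
#define TCE_WRITE	0x2UL

static unsigned long tce_table[65536];	/* real page number | perm bits */

static void tce_unmap_range(unsigned long idx, unsigned long npages)
{
	while (npages--)
		tce_table[idx++] = 0;	/* any overlap is harmless */
}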

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-14 Thread Benjamin Herrenschmidt
On Tue, 2011-11-15 at 11:05 +1100, David Gibson wrote:
 Being strict, or at least enforcing strictness, requires that the
 infrastructure track all the maps, so that the unmaps can be
 matching.  This is not a natural thing with the data structures you
 want for all IOMMUs.  For example on POWER, the IOMMU (aka TCE table)
 is a simple 1-level pagetable.  One pointer with a couple of
 permission bits per IOMMU page.  Handling oddly overlapping operations
 on that data structure is natural, enforcing strict matching of maps
 and unmaps is not and would require extra information to be stored by
 vfio.  On POWER, the IOMMU operations often *are* a hot path, so
 manipulating those structures would have a real cost, too. 

In fact they are a very hot path even. There's no way we can afford the
cost of tracking per page mapping/unmapping (other than bumping the page
count on a page that's currently mapped or via some debug-only feature).

Cheers,
Ben.




Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-14 Thread Alex Williamson
On Mon, 2011-11-14 at 13:54 -0700, Alex Williamson wrote:
 On Fri, 2011-11-11 at 18:14 -0600, Scott Wood wrote:
  On 11/03/2011 03:12 PM, Alex Williamson wrote:
   + int (*get)(void *);
   + void(*put)(void *);
   + ssize_t (*read)(void *, char __user *,
   + size_t, loff_t *);
   + ssize_t (*write)(void *, const char __user *,
   +  size_t, loff_t *);
   + long(*ioctl)(void *, unsigned int, unsigned long);
   + int (*mmap)(void *, struct vm_area_struct *);
   +};
  
  When defining an API, please do not omit parameter names.
 
 ok
 
  Should specify what the driver is supposed to do with get/put -- I guess
  not try to unbind when the count is nonzero?  Races could still lead the
  unbinder to be blocked, but I guess it lets the driver know when it's
  likely to succeed.
 
 Right, for the pci bus driver, it's mainly for reference counting,
 including the module_get to prevent vfio-pci from being unloaded.  On
 the first get for a device, we also do a pci_enable() and pci_disable()
 on last put.  I'll try to clarify in the docs.

Looking at these again, I should just rename them to open/release.  That
matches the points when they're called.  I suspect I started with just
reference counting and it grew to more of a full blown open/release.
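
That rename might end up looking like (illustrative only):

struct vfio_device_ops {
	int	(*open)(void *device_data);	/* was get(): first open does pci_enable() */
	void	(*release)(void *device_data);	/* was put(): last release does pci_disable() */
	/* read/write/ioctl/mmap unchanged */
};
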
Thanks,

Alex



Re: Performance counter - vPMU

2011-11-14 Thread Balbir Singh
On Mon, Nov 14, 2011 at 5:48 PM, Gleb Natapov g...@redhat.com wrote:

 Not yet merged. You can take it from here https://lkml.org/lkml/2011/11/10/215

Thank you very much, Gleb!

Balbir Singh.


Re: [RFC] kvm tools: Implement multiple VQ for virtio-net

2011-11-14 Thread Krishna Kumar2
Sasha Levin levinsasha...@gmail.com wrote on 11/14/2011 03:45:40 PM:

  Why are both the bandwidth and latency performance dropping so
  dramatically with multiple VQ?

 It looks like there's no hash sync between host and guest, which makes
 the RX VQ change for every packet. This is my guess.

Yes, I confirmed this happens for macvtap. I am
using ixgbe - it calls skb_record_rx_queue when
a skb is allocated, but sets rxhash when a packet
arrives. Macvtap is relying on record_rx_queue
first ahead of rxhash (as part of my patch making
macvtap multiqueue), hence different skbs result
in macvtap selecting different vq's.

Reordering macvtap to use rxhash first results in
all packets going to the same VQ. The code snippet
is:

{
...
if (!numvtaps)
goto out;

rxq = skb_get_rxhash(skb);
if (rxq) {
tap = rcu_dereference(vlan->taps[rxq % numvtaps]);
if (tap)
goto out;
}

if (likely(skb_rx_queue_recorded(skb))) {
rxq = skb_get_rx_queue(skb);

while (unlikely(rxq >= numvtaps))
rxq -= numvtaps;
tap = rcu_dereference(vlan->taps[rxq]);
if (tap)
goto out;
}
}

I will submit a patch for macvtap separately. I am working
towards the other issue pointed out - different vhost
threads handling rx/tx of a single flow.

thanks,

- KK



Re: [PATCH 3/6] KVM: PPC: booke: Check for MSR[WE] in prepare_to_enter

2011-11-14 Thread Alexander Graf

On 14.11.2011, at 18:22, Scott Wood scottw...@freescale.com wrote:

 On 11/14/2011 07:13 AM, Alexander Graf wrote:
 On 11/09/2011 01:23 AM, Scott Wood wrote:
 +/* Check pending exceptions and deliver one, if possible. */
 +void kvmppc_core_prepare_to_enter(struct kvm_vcpu *vcpu)
 +{
 +WARN_ON_ONCE(!irqs_disabled());
 +
 +kvmppc_core_check_exceptions(vcpu);
 +
+if (vcpu->arch.shared->msr & MSR_WE) {
 +local_irq_enable();
 +kvm_vcpu_block(vcpu);
 +local_irq_disable();
 
 Hrm. This specific irq enable/disable part isn't pretty but I can't
 think of a cleaner way either. Unless you move it out of the prepare
 function, since I don't see a way it could race with an interrupt.
 
 kvmppc_core_check_exceptions can clear MSR_WE, so we need to check after
 that.  We can't enable interrupts after kvmppc_core_check_exceptions (or
 rather, if we do, we need to check again once interrupts are
 re-disabled, as in the MSR_WE case) because otherwise we could have an
 exception delivered afterward and receive the resched IPI at just the
 wrong time to take any action on it (just like the signal check).

Well, but if you enable interrupts here you're basically rendering code that 
runs earlier in disabled-interrupt mode racy.

How does this align with the signal handling? We could receive a signal here 
and not handle it the same as the race you fixed earlier, no?

Alex

 
 +
 +kvmppc_set_exit_type(vcpu, EMULATED_MTMSRWE_EXITS);
 +kvmppc_core_check_exceptions(vcpu);
 
 Shouldn't
 
  if (msr & MSR_WE) {
  ...
 }
 
 core_check_exceptions(vcpu);
 
 
 work just as well?
 
 That would have us pointlessly checking the exceptions twice in the
 non-WE case -- unless you mean only check after, which as described
 above means you'll fail to wake up.
 
 -Scott
 