Re: [Qemu-devel] Re: QEMU-KVM and video performance

2010-04-22 Thread Gerhard Wiesinger

On Wed, 21 Apr 2010, Avi Kivity wrote:


On 04/21/2010 09:50 PM, Gerhard Wiesinger wrote:

I don't think changing the VGA window is a problem because
500.000-1Mio changes/s are possible.


1MB/s, 500k-1M changes/s... Coincidence?  Is it taking a page fault
or trap on every write?



To clarify:
Memory performance writing to segment A000 is about 1MB/s.


That indicates a fault on every write (assuming 8-16 bit writes).  If you're 
using 256 color VGA and not switching banks, this indicates a bug.




Yes, 256 color VGA and no bank switches involved.

Calling the INT 10h set/get window function with different windows (e.g. 
toggling between window page 0 and 1) runs at about 500.000 to 1Mio function 
calls per second.


That's surprisingly fast. I'd expect 100-200k/sec.



Sorry, I mixed up the numbers:
1.) QEMU-KVM: ~111k
2.) QEMU only: 500k-1Mio


Please run kvm_stat and report output for both tests to confirm.



See below. 2nd column is per second statistic when running the test.



To get really good VGA performance, both parameters should be:
about 50MB/s for writes to segment A000, and
~500.000 bank switches per second.


First should be doable easily, second is borderline.


I think this is very easy to distinguish:
1.) VGA segment A000 is legacy and should be handled through QEMU and not 
through KVM (because QEMU is much faster there). Also, 16 color modes should be 
fast enough there.
2.) All other flat PCI memory accesses should be handled through KVM (there 
is a specialized driver loaded for that PCI device in the non-legacy OS).


Is that easily possible?


No.  Code can run in either qemu or kvm, not both.  You can switch between 
them based on access statistics (early versions of qemu-kvm did that, without 
the statistics part), but this isn't trivial.


Hmmm. OK, two different opinions about the memory write performance:
easily possible or not possible?

Thanks for your help so far.

Ciao,
Gerhard

--
http://www.wiesinger.com/

-
kvm_stat
Please mount debugfs ('mount -t debugfs debugfs /sys/kernel/debug')
and ensure the kvm modules are loaded

lsmod|grep -i kvm
kvm_amd                38276  0
kvm                   162288  1 kvm_amd

mount|grep -i debug

=
mount -t debugfs debugfs /sys/kernel/debug

int10perf: INT10h Performance tests:
kvm statistics
 efer_reload                0        0
 exits               37648629   456206
 fpu_reload           8512535   455983
 halt_exits              2084        0
 halt_wakeup             2047        0
 host_state_reload    8513213   456011
 hypercalls                 0        0
 insn_emulation      29182065        0
 insn_emulation_fail        0        0
 invlpg                     0        0
 io_exits             8386082   455975
 irq_exits              51713      214
 irq_injections         21797       36
 irq_window                 0        0
 largepages                 0        0
 mmio_exits            242781        0
 mmu_cache_miss           150        0
 mmu_flooded                0        0
 mmu_pde_zapped             0        0
 mmu_pte_updated            0        0
 mmu_pte_write           8192        0
 mmu_recycled               0        0
 mmu_shadow_zapped        151        0
 mmu_unsync                 0        0
 mmu_unsync_global          0        0
 nmi_injections             0        0
 nmi_window                 0        0
 pf_fixed               16935        0
 pf_guest                   0        0
 remote_tlb_flush           2        0
 request_irq                0        0
 request_nmi                0        0
 signal_exits               1        0
 tlb_flush               2251        0

Running VGA memory tests in the same VGA page in video mode VESA 101h:
kvm statistics

 efer_reload                0        0
 exits               18470836   554582
 fpu_reload           2147833     3469
 halt_exits              2083        0
 halt_wakeup             2047        0
 host_state_reload    2148186     3470
 hypercalls                 0        0
 insn_emulation       7688203   554244
 insn_emulation_fail        0        0
 invlpg                     0        0
 io_exits            10701583       18
 irq_exits              50781      321
 irq_injections         25251       18
 irq_window                 0        0
 largepages                 0        0
 mmio_exits            162847     3241
 mmu_cache_miss           154        0
 mmu_flooded                0        0
 mmu_pde_zapped             0        0
 mmu_pte_updated            0        0
 mmu_pte_write           8192        0
 mmu_recycled               0        0
 mmu_shadow_zapped        155        0
 mmu_unsync                 0        0
 mmu_unsync_global          0        0
 nmi_injections             0        0
 nmi_window                 0        0
 pf_fixed               16936        0
 pf_guest                   0        0
 remote_tlb_flush           5        0
 request_irq                0        0
 request_nmi                0

Re: [PATCH RFC v2 6/6] KVM: introduce a new API for getting dirty bitmaps

2010-04-22 Thread Fernando Luis Vázquez Cao
On 04/22/2010 11:45 AM, Fernando Luis Vázquez Cao wrote:
 On 04/21/2010 06:41 PM, Alexander Graf wrote:
 On 21.04.2010, at 10:29, Fernando Luis Vázquez Cao wrote:

 On 04/20/2010 08:03 PM, Takuya Yoshikawa wrote:
 @@ -318,7 +318,7 @@ struct kvm_dirty_log {
__u32 padding1;
union {
void __user *dirty_bitmap; /* one bit per page */
 -  __u64 padding2;
 +  __u64 addr;

 This can break on x86_32 and x86_64-compat. addr is a long, not a __u64.

 So the high 32 bits are zero. Where's the problem?
 
 If we are careful enough to cast the addr appropriately, we should be fine,
 even if we keep the padding field in the union. I am not saying that it
 breaks 32-bit architectures, but that it can potentially be problematic.
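 For illustration, the usual kernel idiom for carrying a user-space pointer
 in a __u64 field, so that it stays correct on 32-bit and compat kernels
 (the variable name here is mine, not from the patch):
 
 	void __user *dirty_bitmap = (void __user *)(unsigned long)log.addr;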
 
 +	case KVM_SWITCH_DIRTY_LOG: {
 +		struct kvm_dirty_log log;
 +
 +		r = -EFAULT;
 +		if (copy_from_user(&log, argp, sizeof log))
 +			goto out;
 +		r = kvm_vm_ioctl_switch_dirty_log(kvm, &log);
 +		if (r)
 +			goto out;
 +		r = -EFAULT;
 +		if (copy_to_user(argp, &log, sizeof log))
 +			goto out;
 +		r = 0;
 +		break;
 +	}

 In x86_64-compat mode we are handling 32bit user-space addresses
 so we need the compat counterpart of KVM_SWITCH_DIRTY_LOG too.

 The compat code just forwards everything to the generic ioctls.
 
 The compat code uses struct compat_kvm_dirty_log instead of
 struct kvm_dirty_log to communicate with user space, so
 the necessary conversions need to be done before invoking
 the generic ioctl (see KVM_GET_DIRTY_LOG in kvm_vm_compat_ioctl).
 
 By the way, we probably should move the definition of struct
 compat_kvm_dirty_log to a header file.

It seems that it was you and Arnd who added the kvm_vm compat ioctl :-).
Are you considering a different approach to tackle the issues that we
have with a big-endian userspace?

Thanks,

Fernando


Re: [Qemu-devel] Re: QEMU-KVM and video performance

2010-04-22 Thread Gerhard Wiesinger

On Wed, 21 Apr 2010, Jamie Lokier wrote:


Gerhard Wiesinger wrote:

Hmmm. I'm very new to QEMU and KVM, but at least accessing the virtual HW
of QEMU even from KVM must be possible (e.g. memory and port accesses are
done on nearly every virtual device), and that ends up in C code in
the QEMU hw/*.c directory. Therefore the VGA memory area should also be
accessible from KVM, but with the specialized and fast memory
access of QEMU.  Am I missing something?


What you're missing is that when KVM calls out to QEMU to handle
hw/*.c traps, that call is very slow.  It's because the hardware-VM
support is a bit slow when the trap happens, and then the call
from KVM in the kernel up to QEMU is a bit slow again.  Then all the
way back.  It adds up to a lot, for every I/O operation.


Isn't that then a general problem of KVM virtualization (or hardware 
virtualization) in general? Is this CPU dependent (AMD vs. Intel)?



When QEMU does the same thing, it's fast because it's inside the same
process; it's just a function call.


Yes, that's clear to me.


That's why the most often called devices are emulated separately in
KVM's kernel code, things like the interrupt controller, timer chip
etc.  It's also why individual instructions that need help are
emulated in KVM's kernel code, instead of passing control up to QEMU
just for one instruction.



BTW: Still not clear why performance is low with KVM since there are
no window changes in the testcase involved which could cause a (slow) page
fault.


It sounds like a bug.  Avi gave suggestions about what to look for.
If it fixes my OS install speeds too, I'll be very happy :-)



See other post for details.


In 256-colour mode, KVM should be writing to the VGA memory at high
speed a lot like normal RAM, not trapping at the hardware-VM level,
and not calling up to the code in hw/*.c for every byte.




Yes, same picture for me: 256 color mode should be only a memory write (16 
color mode is more difficult, as the pixel/byte mapping is not the same).

But it looks like this isn't the case in this test scenario.


You might double-check if your guest is using VGA Mode X.  (See Wikipedia.)



Code:
inregs.x.ax = 0x4F02;
inregs.x.bx = 0xC000 | 0x101; // bh=bit 15=0 (clear), bit14=0 (windowed)
int86x(INT_SCREEN, &inregs, &outregs, &outsregs);   /* Call INT 10h */

I can post the whole code/exes if you want (I already planned to post my 
whole tools, but I have to do some cleanups before I publish the 
whole package).



That was a way to accelerate VGA on real PCs, but it will be slow in
KVM for the same reasons as 16-colour mode.


Which way do you mean?

Thanks.

Ciao,
Gerhard

--
http://www.wiesinger.com/


[PATCH 0/10] KVM MMU: allow more shadow pages become asynchronous

2010-04-22 Thread Xiao Guangrong
In the current code, a shadow page can become unsync only if it is the
only shadow page for a gfn; this rule is too strict. In fact, we can
let every last-level mapping page (i.e., the pte page) become unsync
and sync them at invlpg or TLB-flush time.
With this approach, a gfn may have many shadow pages; for performance
reasons, when one sp needs to be synced, we just write-protect sp->gfn
and sync this sp, but we keep the other shadow pages unsync, i.e., an
sp is synced only at its own invlpg or TLB-flush time.

Patches 1 ~ 6 are bugfixes/cleanups/optimizations for the current code;
patches 7 ~ 10 implement this idea.



[PATCH 1/10] KVM MMU: fix for calculating gpa in invlpg code

2010-04-22 Thread Xiao Guangrong
If the guest is 32-bit, we should use 'quadrant' to adjust the gpa
offset.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/paging_tmpl.h |7 ++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index d0cc07e..46d80d6 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -478,9 +478,14 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
 	    ((level == PT_DIRECTORY_LEVEL && is_large_pte(*sptep))) ||
 	    ((level == PT_PDPE_LEVEL && is_large_pte(*sptep)))) {
 			struct kvm_mmu_page *sp = page_header(__pa(sptep));
+			int offset = 0;
+
+			if (PTTYPE == 32)
+				offset = sp->role.quadrant << PT64_LEVEL_BITS;
 
 			pte_gpa = (sp->gfn << PAGE_SHIFT);
-			pte_gpa += (sptep - sp->spt) * sizeof(pt_element_t);
+			pte_gpa += (sptep - sp->spt + offset) *
+					sizeof(pt_element_t);
 
 			if (is_shadow_present_pte(*sptep)) {
 				rmap_remove(vcpu->kvm, sptep);
-- 
1.6.1.2
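
For readers new to the shadow MMU, here is the arithmetic behind the fix as
a hedged, illustrative sketch (the reasoning is mine, not part of the patch):
a 32-bit non-PAE guest packs 1024 4-byte ptes into one page-table page, but a
shadow page covers only 512 entries, so sp->role.quadrant (0 or 1) selects
which half of the guest page this sp shadows.  PT64_LEVEL_BITS is 9, hence:

	/* offset is 0 or 512 guest pte slots */
	int offset = sp->role.quadrant << PT64_LEVEL_BITS;
	/* gpa of the guest pte that backs this spte */
	pte_gpa = (sp->gfn << PAGE_SHIFT)
		  + (sptep - sp->spt + offset) * sizeof(pt_element_t);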




[PATCH 2/10] KVM MMU: convert mmu tracepoints

2010-04-22 Thread Xiao Guangrong
Convert mmu tracepoints by using DECLARE_EVENT_CLASS

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmutrace.h |   69 +-
 1 files changed, 26 insertions(+), 43 deletions(-)

diff --git a/arch/x86/kvm/mmutrace.h b/arch/x86/kvm/mmutrace.h
index bc4f7f0..d860a03 100644
--- a/arch/x86/kvm/mmutrace.h
+++ b/arch/x86/kvm/mmutrace.h
@@ -92,15 +92,15 @@ TRACE_EVENT(
	TP_printk("pte %llx level %u", __entry->pte, __entry->level)
 );
 
-/* We set a pte accessed bit */
-TRACE_EVENT(
-   kvm_mmu_set_accessed_bit,
+DECLARE_EVENT_CLASS(kvm_mmu_set_bit_class,
+
TP_PROTO(unsigned long table_gfn, unsigned index, unsigned size),
+
TP_ARGS(table_gfn, index, size),
 
TP_STRUCT__entry(
__field(__u64, gpa)
-   ),
+   ),
 
TP_fast_assign(
		__entry->gpa = ((u64)table_gfn << PAGE_SHIFT)
@@ -110,22 +110,20 @@ TRACE_EVENT(
	TP_printk("gpa %llx", __entry->gpa)
 );
 
-/* We set a pte dirty bit */
-TRACE_EVENT(
-   kvm_mmu_set_dirty_bit,
+/* We set a pte accessed bit */
+DEFINE_EVENT(kvm_mmu_set_bit_class, kvm_mmu_set_accessed_bit,
+
TP_PROTO(unsigned long table_gfn, unsigned index, unsigned size),
-   TP_ARGS(table_gfn, index, size),
 
-   TP_STRUCT__entry(
-   __field(__u64, gpa)
-   ),
+   TP_ARGS(table_gfn, index, size)
+);
 
-   TP_fast_assign(
-		__entry->gpa = ((u64)table_gfn << PAGE_SHIFT)
-   + index * size;
-   ),
+/* We set a pte dirty bit */
+DEFINE_EVENT(kvm_mmu_set_bit_class, kvm_mmu_set_dirty_bit,
 
-	TP_printk("gpa %llx", __entry->gpa)
+   TP_PROTO(unsigned long table_gfn, unsigned index, unsigned size),
+
+   TP_ARGS(table_gfn, index, size)
 );
 
 TRACE_EVENT(
@@ -164,54 +162,39 @@ TRACE_EVENT(
		  __entry->created ? "new" : "existing")
 );
 
-TRACE_EVENT(
-   kvm_mmu_sync_page,
+DECLARE_EVENT_CLASS(kvm_mmu_page_class,
+
TP_PROTO(struct kvm_mmu_page *sp),
TP_ARGS(sp),
 
TP_STRUCT__entry(
KVM_MMU_PAGE_FIELDS
-   ),
+   ),
 
TP_fast_assign(
KVM_MMU_PAGE_ASSIGN(sp)
-   ),
+   ),
 
	TP_printk("%s", KVM_MMU_PAGE_PRINTK())
 );
 
-TRACE_EVENT(
-   kvm_mmu_unsync_page,
+DEFINE_EVENT(kvm_mmu_page_class, kvm_mmu_sync_page,
TP_PROTO(struct kvm_mmu_page *sp),
-   TP_ARGS(sp),
-
-   TP_STRUCT__entry(
-   KVM_MMU_PAGE_FIELDS
-   ),
-
-   TP_fast_assign(
-   KVM_MMU_PAGE_ASSIGN(sp)
-   ),
 
-	TP_printk("%s", KVM_MMU_PAGE_PRINTK())
+   TP_ARGS(sp)
 );
 
-TRACE_EVENT(
-   kvm_mmu_zap_page,
+DEFINE_EVENT(kvm_mmu_page_class, kvm_mmu_unsync_page,
TP_PROTO(struct kvm_mmu_page *sp),
-   TP_ARGS(sp),
 
-   TP_STRUCT__entry(
-   KVM_MMU_PAGE_FIELDS
-   ),
+   TP_ARGS(sp)
+);
 
-   TP_fast_assign(
-   KVM_MMU_PAGE_ASSIGN(sp)
-   ),
+DEFINE_EVENT(kvm_mmu_page_class, kvm_mmu_zap_page,
+   TP_PROTO(struct kvm_mmu_page *sp),
 
-	TP_printk("%s", KVM_MMU_PAGE_PRINTK())
+   TP_ARGS(sp)
 );
-
 #endif /* _TRACE_KVMMMU_H */
 
 #undef TRACE_INCLUDE_PATH
-- 
1.6.1.2




[PATCH 3/10] KVM MMU: move unsync/sync tracepoints to proper place

2010-04-22 Thread Xiao Guangrong
Move the unsync/sync tracepoints to their proper place; this lets us
observe the live time of unsync pages.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ddfa865..abf8bd4 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1189,6 +1189,7 @@ static struct kvm_mmu_page *kvm_mmu_lookup_page(struct kvm *kvm, gfn_t gfn)
 static void kvm_unlink_unsync_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
 	WARN_ON(!sp->unsync);
+	trace_kvm_mmu_sync_page(sp);
 	sp->unsync = 0;
 	--kvm->stat.mmu_unsync;
 }
@@ -1202,7 +1203,6 @@ static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 		return 1;
 	}
 
-	trace_kvm_mmu_sync_page(sp);
 	if (rmap_write_protect(vcpu->kvm, sp->gfn))
 		kvm_flush_remote_tlbs(vcpu->kvm);
 	kvm_unlink_unsync_page(vcpu->kvm, sp);
@@ -1730,7 +1730,6 @@ static int kvm_unsync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 	struct kvm_mmu_page *s;
 	struct hlist_node *node, *n;
 
-	trace_kvm_mmu_unsync_page(sp);
 	index = kvm_page_table_hashfn(sp->gfn);
 	bucket = &vcpu->kvm->arch.mmu_page_hash[index];
 	/* don't unsync if pagetable is shadowed with multiple roles */
@@ -1740,6 +1739,7 @@ static int kvm_unsync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 		if (s->role.word != sp->role.word)
 			return 1;
 	}
+	trace_kvm_mmu_unsync_page(sp);
 	++vcpu->kvm->stat.mmu_unsync;
 	sp->unsync = 1;
 
-- 
1.6.1.2




[PATCH 4/10] KVM MMU: Move invlpg code out of paging_tmpl.h

2010-04-22 Thread Xiao Guangrong
Using '!sp->role.cr4_pae' replaces 'PTTYPE == 32', and using
'pte_size = sp->role.cr4_pae ? 8 : 4' replaces sizeof(pt_element_t).

Then there is no need to compile this code twice.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c |   60 ++-
 arch/x86/kvm/paging_tmpl.h |   56 -
 2 files changed, 58 insertions(+), 58 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index abf8bd4..fac7c09 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2256,6 +2256,62 @@ static bool is_rsvd_bits_set(struct kvm_vcpu *vcpu, u64 gpte, int level)
 	return (gpte & vcpu->arch.mmu.rsvd_bits_mask[bit7][level-1]) != 0;
 }
 
+static void paging_invlpg(struct kvm_vcpu *vcpu, gva_t gva)
+{
+	struct kvm_shadow_walk_iterator iterator;
+	gpa_t pte_gpa = -1;
+	int level;
+	u64 *sptep;
+	int need_flush = 0;
+	unsigned pte_size = 0;
+
+	spin_lock(&vcpu->kvm->mmu_lock);
+
+	for_each_shadow_entry(vcpu, gva, iterator) {
+		level = iterator.level;
+		sptep = iterator.sptep;
+
+		if (level == PT_PAGE_TABLE_LEVEL ||
+		    ((level == PT_DIRECTORY_LEVEL && is_large_pte(*sptep))) ||
+		    ((level == PT_PDPE_LEVEL && is_large_pte(*sptep)))) {
+			struct kvm_mmu_page *sp = page_header(__pa(sptep));
+			int offset = 0;
+
+			if (!sp->role.cr4_pae)
+				offset = sp->role.quadrant << PT64_LEVEL_BITS;
+			pte_size = sp->role.cr4_pae ? 8 : 4;
+			pte_gpa = (sp->gfn << PAGE_SHIFT);
+			pte_gpa += (sptep - sp->spt + offset) * pte_size;
+
+			if (is_shadow_present_pte(*sptep)) {
+				rmap_remove(vcpu->kvm, sptep);
+				if (is_large_pte(*sptep))
+					--vcpu->kvm->stat.lpages;
+				need_flush = 1;
+			}
+			__set_spte(sptep, shadow_trap_nonpresent_pte);
+			break;
+		}
+
+		if (!is_shadow_present_pte(*sptep))
+			break;
+	}
+
+	if (need_flush)
+		kvm_flush_remote_tlbs(vcpu->kvm);
+
+	atomic_inc(&vcpu->kvm->arch.invlpg_counter);
+
+	spin_unlock(&vcpu->kvm->mmu_lock);
+
+	if (pte_gpa == -1)
+		return;
+
+	if (mmu_topup_memory_caches(vcpu))
+		return;
+	kvm_mmu_pte_write(vcpu, pte_gpa, NULL, pte_size, 0);
+}
+
 #define PTTYPE 64
 #include "paging_tmpl.h"
 #undef PTTYPE
@@ -2335,7 +2391,7 @@ static int paging64_init_context_common(struct kvm_vcpu *vcpu, int level)
 	context->gva_to_gpa = paging64_gva_to_gpa;
 	context->prefetch_page = paging64_prefetch_page;
 	context->sync_page = paging64_sync_page;
-	context->invlpg = paging64_invlpg;
+	context->invlpg = paging_invlpg;
 	context->free = paging_free;
 	context->root_level = level;
 	context->shadow_root_level = level;
@@ -2360,7 +2416,7 @@ static int paging32_init_context(struct kvm_vcpu *vcpu)
 	context->free = paging_free;
 	context->prefetch_page = paging32_prefetch_page;
 	context->sync_page = paging32_sync_page;
-	context->invlpg = paging32_invlpg;
+	context->invlpg = paging_invlpg;
 	context->root_level = PT32_ROOT_LEVEL;
 	context->shadow_root_level = PT32E_ROOT_LEVEL;
 	context->root_hpa = INVALID_PAGE;
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 46d80d6..d0df9cd 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -460,62 +460,6 @@ out_unlock:
 	return 0;
 }
 
-static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
-{
-	struct kvm_shadow_walk_iterator iterator;
-	gpa_t pte_gpa = -1;
-	int level;
-	u64 *sptep;
-	int need_flush = 0;
-
-	spin_lock(&vcpu->kvm->mmu_lock);
-
-	for_each_shadow_entry(vcpu, gva, iterator) {
-		level = iterator.level;
-		sptep = iterator.sptep;
-
-		if (level == PT_PAGE_TABLE_LEVEL ||
-		    ((level == PT_DIRECTORY_LEVEL && is_large_pte(*sptep))) ||
-		    ((level == PT_PDPE_LEVEL && is_large_pte(*sptep)))) {
-			struct kvm_mmu_page *sp = page_header(__pa(sptep));
-			int offset = 0;
-
-			if (PTTYPE == 32)
-				offset = sp->role.quadrant << PT64_LEVEL_BITS;
-
-			pte_gpa = (sp->gfn << PAGE_SHIFT);
-			pte_gpa += (sptep - sp->spt + offset) *
-					sizeof(pt_element_t);
-
-			if (is_shadow_present_pte(*sptep)) {
-				rmap_remove(vcpu->kvm, sptep);
-  

[PATCH 5/10] KVM MMU: cleanup invlpg code

2010-04-22 Thread Xiao Guangrong
Use is_last_spte() to clean up the invlpg code.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c |4 +---
 1 files changed, 1 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index fac7c09..fd027a6 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2271,9 +2271,7 @@ static void paging_invlpg(struct kvm_vcpu *vcpu, gva_t gva)
 		level = iterator.level;
 		sptep = iterator.sptep;
 
-		if (level == PT_PAGE_TABLE_LEVEL ||
-		    ((level == PT_DIRECTORY_LEVEL && is_large_pte(*sptep))) ||
-		    ((level == PT_PDPE_LEVEL && is_large_pte(*sptep)))) {
+		if (is_last_spte(*sptep, level)) {
 			struct kvm_mmu_page *sp = page_header(__pa(sptep));
 			int offset = 0;
 
-- 
1.6.1.2




[PATCH 6/10] KVM MMU: don't write-protect if have new mapping to unsync page

2010-04-22 Thread Xiao Guangrong
If there is a new mapping to an unsync page (i.e., a new parent is added),
just update the page from sp->gfn but do not write-protect gfn; and if we
need to create a new shadow page for sp->gfn, we should sync it first.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c |   27 +++
 1 files changed, 19 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index fd027a6..8607a64 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1196,16 +1196,20 @@ static void kvm_unlink_unsync_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 
 static int kvm_mmu_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp);
 
-static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
+static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
+			 bool clear_unsync)
 {
 	if (sp->role.cr4_pae != !!is_pae(vcpu)) {
 		kvm_mmu_zap_page(vcpu->kvm, sp);
 		return 1;
 	}
 
-	if (rmap_write_protect(vcpu->kvm, sp->gfn))
-		kvm_flush_remote_tlbs(vcpu->kvm);
-	kvm_unlink_unsync_page(vcpu->kvm, sp);
+	if (clear_unsync) {
+		if (rmap_write_protect(vcpu->kvm, sp->gfn))
+			kvm_flush_remote_tlbs(vcpu->kvm);
+		kvm_unlink_unsync_page(vcpu->kvm, sp);
+	}
+
 	if (vcpu->arch.mmu.sync_page(vcpu, sp)) {
 		kvm_mmu_zap_page(vcpu->kvm, sp);
 		return 1;
@@ -1293,7 +1297,7 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu,
 		kvm_flush_remote_tlbs(vcpu->kvm);
 
 	for_each_sp(pages, sp, parents, i) {
-		kvm_sync_page(vcpu, sp);
+		kvm_sync_page(vcpu, sp, true);
 		mmu_pages_clear_parents(&parents);
 	}
 	cond_resched_lock(&vcpu->kvm->mmu_lock);
@@ -1313,7 +1317,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 	unsigned index;
 	unsigned quadrant;
 	struct hlist_head *bucket;
-	struct kvm_mmu_page *sp;
+	struct kvm_mmu_page *sp, *unsync_sp = NULL;
 	struct hlist_node *node, *tmp;
 
 	role = vcpu->arch.mmu.base_role;
@@ -1332,12 +1336,16 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 	hlist_for_each_entry_safe(sp, node, tmp, bucket, hash_link)
 		if (sp->gfn == gfn) {
 			if (sp->unsync)
-				if (kvm_sync_page(vcpu, sp))
-					continue;
+				unsync_sp = sp;
 
 			if (sp->role.word != role.word)
 				continue;
 
+			if (unsync_sp && kvm_sync_page(vcpu, unsync_sp, false)) {
+				unsync_sp = NULL;
+				continue;
+			}
+
 			mmu_page_add_parent_pte(vcpu, sp, parent_pte);
 			if (sp->unsync_children) {
 				set_bit(KVM_REQ_MMU_SYNC, &vcpu->requests);
@@ -1346,6 +1354,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 			trace_kvm_mmu_get_page(sp, false);
 			return sp;
 		}
+	if (unsync_sp)
+		kvm_sync_page(vcpu, unsync_sp, true);
+
 	++vcpu->kvm->stat.mmu_cache_miss;
 	sp = kvm_mmu_alloc_page(vcpu, parent_pte);
 	if (!sp)
-- 
1.6.1.2




[PATCH 7/10] KVM MMU: allow more page become unsync at gfn mapping time

2010-04-22 Thread Xiao Guangrong
In the current code, a shadow page can become unsync only if it is the
only shadow page for a gfn; this rule is too strict. In fact, we can
let every last-level mapping page (i.e., the pte page) become unsync,
and sync them at invlpg or TLB-flush time.

This patch allows more pages to become unsync at gfn mapping time.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c |   81 +++
 1 files changed, 37 insertions(+), 44 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 8607a64..13378e7 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1166,26 +1166,6 @@ static int mmu_unsync_walk(struct kvm_mmu_page *sp,
return __mmu_unsync_walk(sp, pvec);
 }
 
-static struct kvm_mmu_page *kvm_mmu_lookup_page(struct kvm *kvm, gfn_t gfn)
-{
-	unsigned index;
-	struct hlist_head *bucket;
-	struct kvm_mmu_page *sp;
-	struct hlist_node *node;
-
-	pgprintk("%s: looking for gfn %lx\n", __func__, gfn);
-	index = kvm_page_table_hashfn(gfn);
-	bucket = &kvm->arch.mmu_page_hash[index];
-	hlist_for_each_entry(sp, node, bucket, hash_link)
-		if (sp->gfn == gfn && !sp->role.direct
-		    && !sp->role.invalid) {
-			pgprintk("%s: found role %x\n",
-				 __func__, sp->role.word);
-			return sp;
-		}
-	return NULL;
-}
-
 static void kvm_unlink_unsync_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
 	WARN_ON(!sp->unsync);
@@ -1734,47 +1714,60 @@ u8 kvm_get_guest_memory_type(struct kvm_vcpu *vcpu, gfn_t gfn)
 }
 EXPORT_SYMBOL_GPL(kvm_get_guest_memory_type);
 
-static int kvm_unsync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
+static void __kvm_unsync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
+{
+   trace_kvm_mmu_unsync_page(sp);
+	++vcpu->kvm->stat.mmu_unsync;
+	sp->unsync = 1;
+
+   kvm_mmu_mark_parents_unsync(sp);
+   mmu_convert_notrap(sp);
+}
+
+static void kvm_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn)
 {
-	unsigned index;
 	struct hlist_head *bucket;
 	struct kvm_mmu_page *s;
 	struct hlist_node *node, *n;
+	unsigned index;
 
-	index = kvm_page_table_hashfn(sp->gfn);
+	index = kvm_page_table_hashfn(gfn);
 	bucket = &vcpu->kvm->arch.mmu_page_hash[index];
-	/* don't unsync if pagetable is shadowed with multiple roles */
+
 	hlist_for_each_entry_safe(s, node, n, bucket, hash_link) {
-		if (s->gfn != sp->gfn || s->role.direct)
+		if (s->gfn != gfn || s->role.direct || s->unsync)
 			continue;
-		if (s->role.word != sp->role.word)
-			return 1;
+		WARN_ON(s->role.level != PT_PAGE_TABLE_LEVEL);
+		__kvm_unsync_page(vcpu, s);
 	}
-	trace_kvm_mmu_unsync_page(sp);
-	++vcpu->kvm->stat.mmu_unsync;
-	sp->unsync = 1;
-
-	kvm_mmu_mark_parents_unsync(sp);
-
-	mmu_convert_notrap(sp);
-	return 0;
 }
 
 static int mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
 				  bool can_unsync)
 {
-	struct kvm_mmu_page *shadow;
+	unsigned index;
+	struct hlist_head *bucket;
+	struct kvm_mmu_page *s;
+	struct hlist_node *node, *n;
+	bool need_unsync = false;
+
+	index = kvm_page_table_hashfn(gfn);
+	bucket = &vcpu->kvm->arch.mmu_page_hash[index];
+	hlist_for_each_entry_safe(s, node, n, bucket, hash_link) {
+		if (s->gfn != gfn || s->role.direct)
+			continue;
 
-	shadow = kvm_mmu_lookup_page(vcpu->kvm, gfn);
-	if (shadow) {
-		if (shadow->role.level != PT_PAGE_TABLE_LEVEL)
+		if (s->role.level != PT_PAGE_TABLE_LEVEL)
 			return 1;
-		if (shadow->unsync)
-			return 0;
-		if (can_unsync && oos_shadow)
-			return kvm_unsync_page(vcpu, shadow);
-		return 1;
+
+		if (!need_unsync && !s->unsync) {
+			if (!can_unsync || !oos_shadow)
+				return 1;
+			need_unsync = true;
+		}
 	}
+	if (need_unsync)
+		kvm_unsync_pages(vcpu, gfn);
 	return 0;
 }
 
-- 
1.6.1.2




[PATCH 8/10] KVM MMU: allow more page become unsync at getting sp time

2010-04-22 Thread Xiao Guangrong
Allow more pages to become unsync at kvm_mmu_get_page() time: if we need to
create a new shadow page for a gfn but the new page is not allowed to be
unsync (level > PT_PAGE_TABLE_LEVEL), we should sync all of that gfn's
unsync pages first.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c |   22 --
 1 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 13378e7..e0bb4d8 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1199,6 +1199,23 @@ static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	return 0;
 }
 
+static void kvm_sync_pages(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+	struct hlist_head *bucket;
+	struct kvm_mmu_page *s;
+	struct hlist_node *node, *n;
+	unsigned index;
+
+	index = kvm_page_table_hashfn(gfn);
+	bucket = &vcpu->kvm->arch.mmu_page_hash[index];
+	hlist_for_each_entry_safe(s, node, n, bucket, hash_link) {
+		if (s->gfn != gfn || !s->unsync)
+			continue;
+		WARN_ON(s->role.level != PT_PAGE_TABLE_LEVEL);
+		kvm_sync_page(vcpu, s, true);
+	}
+}
+
 struct mmu_page_path {
 	struct kvm_mmu_page *parent[PT64_ROOT_LEVEL-1];
 	unsigned int idx[PT64_ROOT_LEVEL-1];
@@ -1334,8 +1351,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 			trace_kvm_mmu_get_page(sp, false);
 			return sp;
 		}
-	if (unsync_sp)
-		kvm_sync_page(vcpu, unsync_sp, true);
+
+	if (!direct && level > PT_PAGE_TABLE_LEVEL && unsync_sp)
+		kvm_sync_pages(vcpu, gfn);
 
 	++vcpu->kvm->stat.mmu_cache_miss;
 	sp = kvm_mmu_alloc_page(vcpu, parent_pte);
-- 
1.6.1.2




[PATCH 9/10] KVM MMU: separate invlpg code from kvm_mmu_pte_write()

2010-04-22 Thread Xiao Guangrong
Let invlpg not depend on the kvm_mmu_pte_write path; a later patch will
need this.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c |   40 
 1 files changed, 24 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index e0bb4d8..f092e71 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2278,14 +2278,21 @@ static bool is_rsvd_bits_set(struct kvm_vcpu *vcpu, u64 gpte, int level)
 	return (gpte & vcpu->arch.mmu.rsvd_bits_mask[bit7][level-1]) != 0;
 }
 
+static void mmu_guess_page_from_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
+					  u64 gpte);
+static void mmu_pte_write_new_pte(struct kvm_vcpu *vcpu,
+				  struct kvm_mmu_page *sp,
+				  u64 *spte,
+				  const void *new);
+
 static void paging_invlpg(struct kvm_vcpu *vcpu, gva_t gva)
 {
+	struct kvm_mmu_page *sp = NULL;
 	struct kvm_shadow_walk_iterator iterator;
-	gpa_t pte_gpa = -1;
-	int level;
-	u64 *sptep;
-	int need_flush = 0;
+	gfn_t gfn = -1;
+	u64 *sptep = NULL, gentry;
 	unsigned pte_size = 0;
+	int invlpg_counter, level, offset = 0, need_flush = 0;
 
 	spin_lock(&vcpu->kvm->mmu_lock);
 
@@ -2294,14 +2301,14 @@ static void paging_invlpg(struct kvm_vcpu *vcpu, gva_t gva)
 		sptep = iterator.sptep;
 
 		if (is_last_spte(*sptep, level)) {
-			struct kvm_mmu_page *sp = page_header(__pa(sptep));
-			int offset = 0;
+
+			sp = page_header(__pa(sptep));
 
 			if (!sp->role.cr4_pae)
 				offset = sp->role.quadrant << PT64_LEVEL_BITS;
 			pte_size = sp->role.cr4_pae ? 8 : 4;
-			pte_gpa = (sp->gfn << PAGE_SHIFT);
-			pte_gpa += (sptep - sp->spt + offset) * pte_size;
+			gfn = (sp->gfn << PAGE_SHIFT);
+			offset = (sptep - sp->spt + offset) * pte_size;
 
 			if (is_shadow_present_pte(*sptep)) {
 				rmap_remove(vcpu->kvm, sptep);
@@ -2320,16 +2327,22 @@ static void paging_invlpg(struct kvm_vcpu *vcpu, gva_t gva)
 	if (need_flush)
 		kvm_flush_remote_tlbs(vcpu->kvm);
 
-	atomic_inc(&vcpu->kvm->arch.invlpg_counter);
+	invlpg_counter = atomic_add_return(1, &vcpu->kvm->arch.invlpg_counter);
 
 	spin_unlock(&vcpu->kvm->mmu_lock);
 
-	if (pte_gpa == -1)
+	if (gfn == -1)
 		return;
 
 	if (mmu_topup_memory_caches(vcpu))
 		return;
-	kvm_mmu_pte_write(vcpu, pte_gpa, NULL, pte_size, 0);
+
+	kvm_read_guest_page(vcpu->kvm, gfn, &gentry, offset, pte_size);
+	mmu_guess_page_from_pte_write(vcpu, gfn_to_gpa(gfn) + offset, gentry);
+	spin_lock(&vcpu->kvm->mmu_lock);
+	if (atomic_read(&vcpu->kvm->arch.invlpg_counter) == invlpg_counter)
+		mmu_pte_write_new_pte(vcpu, sp, sptep, &gentry);
+	spin_unlock(&vcpu->kvm->mmu_lock);
 }
 
 #define PTTYPE 64
@@ -2675,12 +2688,9 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 	int flooded = 0;
 	int npte;
 	int r;
-	int invlpg_counter;
 
 	pgprintk("%s: gpa %llx bytes %d\n", __func__, gpa, bytes);
 
-	invlpg_counter = atomic_read(&vcpu->kvm->arch.invlpg_counter);
-
 	/*
 	 * Assume that the pte write on a page table of the same type
 	 * as the current vcpu paging mode.  This is nearly always true
@@ -2713,8 +2723,6 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 
 	mmu_guess_page_from_pte_write(vcpu, gpa, gentry);
 	spin_lock(&vcpu->kvm->mmu_lock);
-	if (atomic_read(&vcpu->kvm->arch.invlpg_counter) != invlpg_counter)
-		gentry = 0;
 	kvm_mmu_access_page(vcpu, gfn);
 	kvm_mmu_free_some_pages(vcpu);
 	++vcpu->kvm->stat.mmu_pte_write;
-- 
1.6.1.2




[PATCH 10/10] KVM MMU: optimize sync/update unsync-page

2010-04-22 Thread Xiao Guangrong
invlpg only needs to update unsync pages; sp->unsync and sp->unsync_children
help us find them.

Now a gfn may have many shadow pages: when one sp needs to be synced, we
write-protect sp->gfn and sync that sp, but we keep the other shadow pages
unsync.

So, when a page fault happens on the gfn, it should not touch unsync pages;
unsync pages are only updated at invlpg/TLB-flush time.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c |   10 ++
 1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f092e71..5bdcc17 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2299,10 +2299,11 @@ static void paging_invlpg(struct kvm_vcpu *vcpu, gva_t gva)
 	for_each_shadow_entry(vcpu, gva, iterator) {
 		level = iterator.level;
 		sptep = iterator.sptep;
+		sp = page_header(__pa(sptep));
 
 		if (is_last_spte(*sptep, level)) {
-
-			sp = page_header(__pa(sptep));
+			if (!sp->unsync)
+				break;
 
 			if (!sp->role.cr4_pae)
 				offset = sp->role.quadrant << PT64_LEVEL_BITS;
@@ -2320,7 +2321,7 @@ static void paging_invlpg(struct kvm_vcpu *vcpu, gva_t gva)
 			break;
 		}
 
-		if (!is_shadow_present_pte(*sptep))
+		if (!is_shadow_present_pte(*sptep) || !sp->unsync_children)
 			break;
 	}
 
@@ -2744,7 +2745,8 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 
 restart:
 	hlist_for_each_entry_safe(sp, node, n, bucket, hash_link) {
-		if (sp->gfn != gfn || sp->role.direct || sp->role.invalid)
+		if (sp->gfn != gfn || sp->role.direct || sp->role.invalid ||
+		    sp->unsync)
 			continue;
 		pte_size = sp->role.cr4_pae ? 8 : 4;
 		misaligned = (offset ^ (offset + bytes - 1)) & ~(pte_size - 1);
-- 
1.6.1.2




Re: KVM Page Fault Question

2010-04-22 Thread Avi Kivity

On 04/22/2010 08:26 AM, Marek Olszewski wrote:
Under VMX without EPT, I am not seeing any VM exits due to task 
switches.  Is there a way to enable these?  I'm looking to intercept 
the guest whenever it does an iret.


See EXIT_REASON_TASK_SWITCH.  However, that won't fire on any iret, only 
irets that generate task switches.  You can ask for exits on irets by 
setting CPU_BASED_VIRTUAL_NMI_PENDING and GUEST_INTR_STATE_NMI, and 
looking for EXIT_REASON_NMI_WINDOW.
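
For reference, a minimal sketch of the trick described here; the field and
bit names follow arch/x86/kvm/vmx.h of that era, but the exact placement of
these calls is illustrative, not a tested recipe:

	/* Ask for an exit when NMI blocking ends, i.e. on the next iret. */
	vmcs_set_bits(CPU_BASED_VM_EXEC_CONTROL, CPU_BASED_VIRTUAL_NMI_PENDING);
	vmcs_set_bits(GUEST_INTERRUPTIBILITY_INFO, GUEST_INTR_STATE_NMI);

	/* ...then, in the exit handler: */
	if (exit_reason == EXIT_REASON_NMI_WINDOW) {
		/* the guest has just executed an iret */
	}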


--
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] Re: QEMU-KVM and video performance

2010-04-22 Thread Avi Kivity

On 04/22/2010 08:37 AM, Gerhard Wiesinger wrote:

On Wed, 21 Apr 2010, Avi Kivity wrote:


On 04/21/2010 09:14 PM, Gerhard Wiesinger wrote:


Can you explain which code files/functions of KVM are involved in 
handling the VGA memory window and page switching through the port write 
to the VGA window register (or is that part handled through QEMU)? A 
little architecture explanation would be nice.


qemu hw/vga.c and hw/cirrus_vga.c.  Boring functions like 
vbe_ioport_write_data() and vga_ioport_write().




Yes, I was already in that code part, and those are very simple 
functions as already explained, so they are very fast in QEMU alone. 
But I meant: what is the calling path from the KVM guest OS to 
hw/vga.c for memory and I/O accesses, and which parts are done 
directly in hardware (to understand the speed gap and maybe to find a 
solution)?


The speed gap is mostly due to hardware constraints (it takes ~2000 
cycles for an exit from guest mode, plus we need to switch a few msrs to 
get to userspace).


See vmx_vcpu_run(), the vmresume instruction is where an exit starts.





BTW: in which KVM code parts is it decided whether direct mapping or 
emulated device code is used?




Same place.  Look for calls to cpu_register_physical_memory().  If 
the last argument was obtained by a call to cpu_register_io_memory(), 
then all writes trap.  Otherwise, it was obtained by qemu_ram_alloc() 
and writes will not trap (except the first write to a page in a 30ms 
window, used to note that the page is dirty and needs redrawing).


Ok, that finally ends in:
cpu_register_physical_memory_offset()
...
// 0.12.3
if (kvm_enabled())
kvm_set_phys_mem(start_addr, size, phys_offset);
// KVM
cpu_notify_set_memory(start_addr, size, phys_offset);
...

I/O is always done through:
cpu_register_io_memory = cpu_register_io_memory_fixed
cpu_register_io_memory_fixed()
...
No call to KVM?


kvm_set_phys_mem() is a call to kvm.


...

Where is the trap from KVM to QEMU?


See kvm_cpu_exec().
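
To make the two registration paths concrete, here is a hedged sketch
against the qemu 0.12-era API (the handler-table names are borrowed from
hw/vga.c; treat the exact signatures as assumptions):

	/* Direct-mapped RAM: guest writes run at memory speed, no exits. */
	ram_addr_t offset = qemu_ram_alloc(vga_ram_size);
	cpu_register_physical_memory(0xa0000, 0x10000, offset);

	/* Trapping MMIO: every guest access exits and calls the handlers. */
	int io = cpu_register_io_memory(vga_mem_read, vga_mem_write, s);
	cpu_register_physical_memory(0xa0000, 0x10000, io);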

--
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] Re: QEMU-KVM and video performance

2010-04-22 Thread Avi Kivity

On 04/22/2010 09:04 AM, Gerhard Wiesinger wrote:

On Wed, 21 Apr 2010, Avi Kivity wrote:


On 04/21/2010 09:50 PM, Gerhard Wiesinger wrote:

I don't think changing the VGA window is a problem because
500.000-1Mio changes/s are possible.


1MB/s, 500k-1M changes/s... Coincidence?  Is it taking a page fault
or trap on every write?



To clarify:
Memory performance writing to segment A000 is about 1MB/s.


That indicates a fault on every write (assuming 8-16 bit writes).  If 
you're using 256 color VGA and not switching banks, this indicates a 
bug.




Yes, 256 color VGA and no bank switches involved.

Calling the INT 10h set/get window function with different windows (e.g. 
toggling between window page 0 and 1) runs at about 500.000 to 1Mio 
function calls per second.


That's surprisingly fast. I'd expect 100-200k/sec.



Sorry, I mixed up the numbers:
1.) QEMU-KVM: ~111k
2.) QEMU only: 500k-1Mio


Please run kvm_stat and report output for both tests to confirm.



See below. 2nd column is per second statistic when running the test.


 efer_reload                0        0
 exits               18470836   554582
 fpu_reload           2147833     3469
 halt_exits              2083        0
 halt_wakeup             2047        0
 host_state_reload    2148186     3470
 hypercalls                 0        0
 insn_emulation       7688203   554244

This indicates that kvm is emulating instead of direct mapping.  That's 
probably a bug.  If you fix it, performance will increase dramatically.






To get really good VGA performance, both parameters should be:
about 50MB/s for writes to segment A000, and
~500.000 bank switches per second.


First should be doable easily, second is borderline.


I think this is very easy to distinguish:
1.) VGA segment A000 is legacy and should be handled through QEMU 
and not through KVM (because QEMU is much faster there). Also, 16 color 
modes should be fast enough there.
2.) All other flat PCI memory accesses should be handled through KVM 
(there is a specialized driver loaded for that PCI device in the 
non-legacy OS).


Is that easily possible?


No.  Code can run in either qemu or kvm, not both.  You can switch 
between them based on access statistics (early versions of qemu-kvm 
did that, without the statistics part), but this isn't trivial.


Hmmm. OK, two different opinions about the memory write performance:
easily possible or not possible?


Switching between tcg and kvm is hard, but not needed.  For 256 color 
modes, direct map is possible and should yield good performance.  Bank 
switching can be improved perhaps 3x, but will never be fast.


--
error compiling committee.c: too many arguments to function



Re: [PATCH] KVM: Document mmu

2010-04-22 Thread Karl Vogel

Avi Kivity wrote:

+Translation
+===
+
+The primary job of the mmu is to program the processor's mmu to translate
+addresses for the guest.  Different translations are required at different
+times:
+
+- when guest paging is disabled, we translate guest physical addresses to
+  host physical addresses (gpa->hpa)
+- when guest paging is enabled, we translate guest virtual addresses, to
+  guest virtual addresses, to host physical addresses (gva->gpa->hpa)


I think you meant 'to guest physical addresses' here.



Re: [PATCH] KVM: Document mmu

2010-04-22 Thread Avi Kivity

On 04/22/2010 10:13 AM, Karl Vogel wrote:

Avi Kivity wrote:

+Translation
+===
+
+The primary job of the mmu is to program the processor's mmu to translate
+addresses for the guest.  Different translations are required at different
+times:
+
+- when guest paging is disabled, we translate guest physical addresses to
+  host physical addresses (gpa->hpa)
+- when guest paging is enabled, we translate guest virtual addresses, to
+  guest virtual addresses, to host physical addresses (gva->gpa->hpa)


I think you meant 'to guest physical addresses' here.



Fixed.

--
error compiling committee.c: too many arguments to function



Re: Timedrift in KVM guests after livemigration.

2010-04-22 Thread Thomas Treutner
On Sunday 18 April 2010 11:33:44 Espen Berg wrote:
 All guest are Debian lenny with latest upstream kernel, hvm/kvm.

 We are using kvm-clock as guest source clock.

 cat /sys/devices/system/clocksource/clocksource0/current_clocksource
 kvm-clock

I had to deactivate C1E (AMD CPUs) and use the acpi clocksource (for both 
servers and VMs, IIRC). If you can, you should give it a try. After that, 
live migration worked fairly stably.
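
For anyone who wants to try this: the guest clocksource can be inspected and
switched through the standard sysfs interface (the acpi_pm name may differ
between kernels):

cat /sys/devices/system/clocksource/clocksource0/available_clocksource
echo acpi_pm > /sys/devices/system/clocksource/clocksource0/current_clocksource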


regards, 
thomas


Re:[RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.

2010-04-22 Thread xiaohui . xin
From: Xin Xiaohui xiaohui@intel.com

Add a device to utilize the vhost-net backend driver for
copy-less data transfer between guest FE and host NIC.
It pins guest user-space buffers into host memory and
provides proto_ops (sendmsg/recvmsg) to vhost-net.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzha...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---

Michael,
Thanks. I have updated the patch with your suggestion.
It looks much cleaner now. Please have a review.

Thanks
Xiaohui

 drivers/vhost/Kconfig |   10 +
 drivers/vhost/Makefile|2 +
 drivers/vhost/mpassthru.c | 1239 +
 include/linux/mpassthru.h |   29 +
 4 files changed, 1280 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vhost/mpassthru.c
 create mode 100644 include/linux/mpassthru.h

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index 9f409f4..91806b1 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -9,3 +9,13 @@ config VHOST_NET
  To compile this driver as a module, choose M here: the module will
  be called vhost_net.
 
+config MEDIATE_PASSTHRU
+	tristate "mediate passthru network driver (EXPERIMENTAL)"
+	depends on VHOST_NET
+	---help---
+	  zerocopy network I/O support; we call it mediate passthru to
+	  distinguish it from hardware passthru.
+
+ To compile this driver as a module, choose M here: the module will
+ be called mpassthru.
+
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index 72dd020..c18b9fc 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -1,2 +1,4 @@
 obj-$(CONFIG_VHOST_NET) += vhost_net.o
 vhost_net-y := vhost.o net.o
+
+obj-$(CONFIG_MEDIATE_PASSTHRU) += mpassthru.o
diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
new file mode 100644
index 000..cc99b14
--- /dev/null
+++ b/drivers/vhost/mpassthru.c
@@ -0,0 +1,1239 @@
+/*
+ *  MPASSTHRU - Mediate passthrough device.
+ *  Copyright (C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ *  GNU General Public License for more details.
+ *
+ */
+
+#define DRV_NAME        "mpassthru"
+#define DRV_DESCRIPTION "Mediate passthru device driver"
+#define DRV_COPYRIGHT   "(C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G"
+
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <linux/kernel.h>
+#include <linux/major.h>
+#include <linux/slab.h>
+#include <linux/smp_lock.h>
+#include <linux/poll.h>
+#include <linux/fcntl.h>
+#include <linux/init.h>
+#include <linux/aio.h>
+
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/miscdevice.h>
+#include <linux/ethtool.h>
+#include <linux/rtnetlink.h>
+#include <linux/if.h>
+#include <linux/if_arp.h>
+#include <linux/if_ether.h>
+#include <linux/crc32.h>
+#include <linux/nsproxy.h>
+#include <linux/uaccess.h>
+#include <linux/virtio_net.h>
+#include <linux/mpassthru.h>
+#include <net/net_namespace.h>
+#include <net/netns/generic.h>
+#include <net/rtnetlink.h>
+#include <net/sock.h>
+
+#include <asm/system.h>
+
+/* Uncomment to enable debugging */
+/* #define MPASSTHRU_DEBUG 1 */
+
+#ifdef MPASSTHRU_DEBUG
+static int debug;
+
+#define DBG  if (mp->debug) printk
+#define DBG1 if (debug == 2) printk
+#else
+#define DBG(a...)
+#define DBG1(a...)
+#endif
+
+#define COPY_THRESHOLD (L1_CACHE_BYTES * 4)
+#define COPY_HDR_LEN   (L1_CACHE_BYTES < 64 ? 64 : L1_CACHE_BYTES)
+
+struct frag {
+   u16 offset;
+   u16 size;
+};
+
+struct page_ctor {
+   struct list_headreadq;
+   int w_len;
+   int r_len;
+   spinlock_t  read_lock;
+   struct kmem_cache   *cache;
+   /* record the locked pages */
+   int lock_pages;
+   struct rlimit   o_rlim;
+   struct net_device   *dev;
+   struct mpassthru_port   port;
+};
+
+struct page_info {
+   struct list_headlist;
+   int header;
+   /* indicate the actual length of bytes
+* send/recv in the user space buffers
+*/
+   int total;
+   int offset;
+   struct page *pages[MAX_SKB_FRAGS+1];
+   struct skb_frag_struct  frag[MAX_SKB_FRAGS+1];
+   struct sk_buff  *skb;
+   struct page_ctor*ctor;
+
+   /* The pointer relayed to skb, to indicate
+* it's a user space allocated skb or kernel
+*/
+   struct 

RE: Re:[RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.

2010-04-22 Thread Xin, Xiaohui
Michael,
Sorry, it's based on the suggestion to hook an iocb completion callback
to handle the iocb list in vhost-net.

Thanks
Xiaohui

-Original Message-
From: Xin, Xiaohui
Sent: Thursday, April 22, 2010 4:24 PM
To: m...@redhat.com
Cc: a...@arndb.de; net...@vger.kernel.org; kvm@vger.kernel.org; 
linux-ker...@vger.kernel.org; mi...@elte.hu; da...@davemloft.net; 
jd...@linux.intel.com; Xin, Xiaohui
Subject: Re:[RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.


Re:[RFC][PATCH v3 2/3] Provides multiple submits and asynchronous notifications.

2010-04-22 Thread xiaohui . xin
From: Xin Xiaohui xiaohui@intel.com

The vhost-net backend now only supports synchronous send/recv
operations. The patch provides multiple submits and asynchronous
notifications. This is needed for zero-copy case.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
---

Michael,

Can't vhost supply a kiocb completion callback that will handle the list?

Yes, thanks. And with it I also removed the vq->receiver finally.

Thanks
Xiaohui

 drivers/vhost/net.c   |  227 +++--
 drivers/vhost/vhost.c |  115 ++---
 drivers/vhost/vhost.h |   14 +++
 3 files changed, 301 insertions(+), 55 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 22d5fef..4a70f66 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -17,11 +17,13 @@
 #include <linux/workqueue.h>
 #include <linux/rcupdate.h>
 #include <linux/file.h>
+#include <linux/aio.h>
 
 #include <linux/net.h>
 #include <linux/if_packet.h>
 #include <linux/if_arp.h>
 #include <linux/if_tun.h>
+#include <linux/mpassthru.h>
 
 #include <net/sock.h>
 
@@ -47,6 +49,7 @@ struct vhost_net {
struct vhost_dev dev;
struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
struct vhost_poll poll[VHOST_NET_VQ_MAX];
+   struct kmem_cache   *cache;
/* Tells us whether we are polling a socket for TX.
 * We only do this when socket buffer fills up.
 * Protected by tx vq lock. */
@@ -91,11 +94,132 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
net-tx_poll_state = VHOST_NET_POLL_STARTED;
 }
 
+struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
+{
+	struct kiocb *iocb = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vq->notify_lock, flags);
+	if (!list_empty(&vq->notifier)) {
+		iocb = list_first_entry(&vq->notifier,
+				struct kiocb, ki_list);
+		list_del(&iocb->ki_list);
+	}
+	spin_unlock_irqrestore(&vq->notify_lock, flags);
+	return iocb;
+}
+
+static void handle_iocb(struct kiocb *iocb)
+{
+	struct vhost_virtqueue *vq = iocb->private;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vq->notify_lock, flags);
+	list_add_tail(&iocb->ki_list, &vq->notifier);
+	spin_unlock_irqrestore(&vq->notify_lock, flags);
+}
+
+
+static void handle_async_rx_events_notify(struct vhost_net *net,
+					  struct vhost_virtqueue *vq,
+					  struct socket *sock)
+{
+	struct kiocb *iocb = NULL;
+	struct vhost_log *vq_log = NULL;
+	int rx_total_len = 0;
+	unsigned int head, log, in, out;
+	int size;
+
+	if (vq->link_state != VHOST_VQ_LINK_ASYNC)
+		return;
+
+	if (sock->sk->sk_data_ready)
+		sock->sk->sk_data_ready(sock->sk, 0);
+
+	vq_log = unlikely(vhost_has_feature(
+			&net->dev, VHOST_F_LOG_ALL)) ? vq->log : NULL;
+	while ((iocb = notify_dequeue(vq)) != NULL) {
+		vhost_add_used_and_signal(&net->dev, vq,
+				iocb->ki_pos, iocb->ki_nbytes);
+		log = (int)(iocb->ki_user_data >> 32);
+		size = iocb->ki_nbytes;
+		head = iocb->ki_pos;
+		rx_total_len += iocb->ki_nbytes;
+
+		if (iocb->ki_dtor)
+			iocb->ki_dtor(iocb);
+		kmem_cache_free(net->cache, iocb);
+
+		/* when log is enabled, recomputing the log info is needed,
+		 * since these buffers are in async queue, and may not get
+		 * the log info before.
+		 */
+		if (unlikely(vq_log)) {
+			if (!log)
+				__vhost_get_vq_desc(&net->dev, vq, vq->iov,
+						ARRAY_SIZE(vq->iov),
+						&out, &in, vq_log,
+						&log, head);
+			vhost_log_write(vq, vq_log, log, size);
+		}
+		if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
+			vhost_poll_queue(&vq->poll);
+			break;
+		}
+	}
+}
+
+static void handle_async_tx_events_notify(struct vhost_net *net,
+					  struct vhost_virtqueue *vq)
+{
+	struct kiocb *iocb = NULL;
+	int tx_total_len = 0;
+
+	if (vq->link_state != VHOST_VQ_LINK_ASYNC)
+		return;
+
+	while ((iocb = notify_dequeue(vq)) != NULL) {
+		vhost_add_used_and_signal(&net->dev, vq,
+				iocb->ki_pos, 0);
+		tx_total_len += iocb->ki_nbytes;
+
+		if (iocb->ki_dtor)
+			iocb->ki_dtor(iocb);
+
+		kmem_cache_free(net->cache, iocb);
+		if (unlikely(tx_total_len >= VHOST_NET_WEIGHT)) {
+

Re: [Qemu-devel] [PATCH][STABLE] block: Free iovec arrays allocated by multiwrite_merge()

2010-04-22 Thread Kevin Wolf
Am 21.04.2010 21:35, schrieb Stefan Hajnoczi:
 A new iovec array is allocated when creating a merged write request.
 This patch ensures that the iovec array is deleted in addition to its
 qiov owner.
 
 Reported-by: Leszek Urbanski tyg...@moo.pl
 Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com

Acked-by: Kevin Wolf kw...@redhat.com

Picked it up for my block branch, too, but I think this should be
committed immediately by Aurelien.

Kevin


Re: [PATCH RFC v2 3/6] KVM: introduce a wrapper function to copy dirty bitmaps to user space

2010-04-22 Thread Takuya Yoshikawa

(2010/04/21 20:12), Avi Kivity wrote:

On 04/20/2010 01:59 PM, Takuya Yoshikawa wrote:

We will replace copy_to_user() with copy_in_user() when we move
the dirty bitmaps to user space.

But sadly, we have copy_in_user() only for 64 bits architectures.
So this function should work as a wrapper to hide ifdefs from outside.
Once we get copy_in_user() for 32 bits architectures, we can remove
this wrapper and use copy_in_user() directly.


I prefer a generic copy_in_user() instead of having multiple paths in kvm.



I do too :).

I checked ppc64's copy_in_user().
It looks like it just uses __copy_tofrom_user(), which is also implemented
for ppc32, IIUC.

So yes, we may not need to implement it by ourselves!

Just ask them to make it useful for 32bit too.



And about x86_32 copy_in_user().

They are using copy_user_generic() which is implemented only for 64bit.

So I'll show them the current copy_from_user() and copy_to_user() version and
see their response.

If rejected, though I hope not, please give me another option, OK?
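
For reference, a minimal sketch of what such a wrapper could look like while copy_in_user() exists only on 64 bit (the helper name and the bounce-buffer fallback are illustrative assumptions, not the posted patch):

static unsigned long kvm_copy_in_user(void __user *to,
				      const void __user *from,
				      unsigned long len)
{
#ifdef CONFIG_64BIT
	/* 64-bit architectures already provide copy_in_user() */
	return copy_in_user(to, from, len);
#else
	/* fallback: bounce through a kernel buffer */
	unsigned long r = len;
	void *buf = kmalloc(len, GFP_KERNEL);

	if (!buf)
		return len;	/* report nothing copied */
	if (copy_from_user(buf, from, len))
		goto out;
	r = copy_to_user(to, buf, len);
out:
	kfree(buf);
	return r;	/* bytes not copied, like copy_in_user() */
#endif
}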



Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1

2010-04-22 Thread Dor Laor

On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:

Hi all,

We have been implementing the prototype of Kemari for KVM, and we're sending
this message to share what we have now and TODO lists.  Hopefully, we would like
to get early feedback to keep us in the right direction.  Although advanced
approaches in the TODO lists are fascinating, we would like to run this project
step by step while absorbing comments from the community.  The current code is
based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.

For those who are new to Kemari for KVM, please take a look at the
following RFC which we posted last year.

http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html

The transmission/transaction protocol, and most of the control logic is
implemented in QEMU.  However, we needed a hack in KVM to prevent rip from
proceeding before synchronizing VMs.  It may also need some plumbing in the
kernel side to guarantee replayability of certain events and instructions,
integrate the RAS capabilities of newer x86 hardware with the HA stack, as well
as for optimization purposes, for example.


[ snap]



The rest of this message describes TODO lists grouped by each topic.

=== event tapping ===

Event tapping is the core component of Kemari, and it decides on which event the
primary should synchronize with the secondary.  The basic assumption here is
that outgoing I/O operations are idempotent, which is usually true for disk I/O
and reliable network protocols such as TCP.


IMO any type of network event should be stalled too. What if the VM runs a 
non-TCP protocol and the packet that the master node sent reached some 
remote client, and before the sync to the slave the master failed?


[snap]



=== clock ===

Since synchronizing the virtual machines every time the TSC is accessed would be
prohibitive, the transmission of the TSC will be done lazily, which means
delaying it until a non-TSC synchronization point arrives.


Why do you specifically care about the tsc sync? When you sync all the 
IO model on snapshot it also synchronizes the tsc.


In general, can you please explain the 'algorithm' for continuous 
snapshots (is that what you'd like to do?):

A trivial one would be to:
 - do X online snapshots/sec
 - Stall all IO (disk/block) from the guest to the outside world
   until the previous snapshot reaches the slave.
 - Snapshots are made of
   - diff of dirty pages from last snapshot
   - Qemu device model (+kvm's) diff from last.
You can do 'light' snapshots in between to send dirty pages to reduce 
snapshot time.


I wrote the above to serve as a reference for your comments so it will map 
into my mind. Thanks, dor
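
As a reader's aid, the trivial algorithm above in a toy C model (every name below is an illustrative assumption, not Kemari code):

#include <stdio.h>

static void stall_guest_output(void)  { /* block disk/net to the outside world */ }
static void resume_guest_output(void) { /* release the stalled I/O */ }
static void send_snapshot_and_wait_ack(void)
{
	/* snapshot = diff of dirty pages + qemu/kvm device model diff;
	 * block until the slave acknowledges the previous snapshot */
}

int main(void)
{
	int event;

	/* one iteration per synchronization point (outgoing I/O event) */
	for (event = 0; event < 50; event++) {
		stall_guest_output();
		send_snapshot_and_wait_ack();
		resume_guest_output();
	}
	printf("sent %d snapshots\n", 50);
	return 0;
}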




TODO:
  - Synchronization of clock sources (need to intercept TSC reads, etc).

=== usability ===

These are items that define how users interact with Kemari.

TODO:
  - Kemarid daemon that takes care of the cluster management/monitoring
side of things.
  - Some device emulators might need minor modifications to work well
with Kemari.  Use white(black)-listing to take the burden of
choosing the right device model off the users.

=== optimizations ===

Although the big picture can be realized by completing the TODO list above, we
need some optimizations/enhancements to make Kemari useful in the real world, and
these are the items that need to be done for that.

TODO:
  - SMP (for the sake of performance might need to implement a
synchronization protocol that can maintain two or more
synchronization points active at any given moment)
  - VGA (leverage VNC's sub-tiling mechanism to identify fb pages that
are really dirty).


Any comments/suggestions would be greatly appreciated.

Thanks,

Yoshi

--

Kemari starts synchronizing VMs when QEMU handles I/O requests.
Without this patch VCPU state has already proceeded before
synchronization, and after failover to the VM on the receiver, it
hangs because of this.

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
---
  arch/x86/include/asm/kvm_host.h |1 +
  arch/x86/kvm/svm.c  |   11 ---
  arch/x86/kvm/vmx.c  |   11 ---
  arch/x86/kvm/x86.c  |4 
  4 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 26c629a..7b8f514 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -227,6 +227,7 @@ struct kvm_pio_request {
int in;
int port;
int size;
+   bool lazy_skip;
  };

  /*
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index d04c7ad..e373245 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1495,7 +1495,7 @@ static int io_interception(struct vcpu_svm *svm)
  {
	struct kvm_vcpu *vcpu = &svm->vcpu;
	u32 io_info = svm->vmcb->control.exit_info_1; /* address size bug? */
-	int size, in, string;
+	int size, in, string, ret;
	unsigned port;
 
	++svm->vcpu.stat.io_exits;
@@ 

RE: [RFC][PATCH v2 0/3] Provide a zero-copy method on KVM virtio-net.

2010-04-22 Thread Xin, Xiaohui
Michael,

Yes, I think this packet split mode probably maps well to mergeable buffer
support. Note that
1. Not all devices support large packets in this way, others might map
   to indirect buffers better

Are the indirect buffers meant to deal with skb->frag_list?

   So we have to figure out how migration is going to work
Yes, different guest virtio-net drivers may contain different features.
Does qemu migration work with different features supported by the virtio-net
driver now?

2. It's up to guest driver whether to enable features such as
   mergeable buffers and indirect buffers
   So we have to figure out how to notify guest which mode
   is optimal for a given device
Yes. When a device is bound, the mp device may query the capabilities from
the driver. Actually, there is a structure in the mp device now that can do
this; we can add some fields to support more.

3. We don't want to depend on jumbo frames for decent performance
   So we probably should support GSO/GRO
GSO is for the tx side, right? I think the driver can handle it itself.
For GRO, I'm not sure whether it's easy or not. Basically, the mp device we
now support is doing what a raw socket is doing. The packets are not going
to the host stack.
-- 
MST


Re: [PATCH RFC v2 5/6] KVM: moving dirty bitmaps to user space

2010-04-22 Thread Takuya Yoshikawa

(2010/04/21 20:26), Avi Kivity wrote:


r = 0;
@@ -1858,7 +1866,7 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
 if (memslot->is_dirty) {
 kvm_flush_remote_tlbs(kvm);
 n = kvm_dirty_bitmap_bytes(memslot);
- memset(memslot->dirty_bitmap, 0, n);
+ clear_user(memslot->dirty_bitmap, n);
 memslot->is_dirty = false;


Does this need an error check?


Yes, sorry I forgot.





@@ -468,8 +480,12 @@ void kvm_free_physmem(struct kvm *kvm)
int i;
struct kvm_memslots *slots = kvm->memslots;

- for (i = 0; i < slots->nmemslots; ++i)
+ for (i = 0; i < slots->nmemslots; ++i) {
+ /* We don't munmap dirty bitmaps by ourselves. */


Why not? If we allocated them, we have to free them.


In the case of VM destruction, when the qemu process exits, it conflicts
with the usual (process) destructor's job.

I struggled with this problem before sending the first version.

So I have to differentiate between slot release and qemu process exit.






+ slots->memslots[i].dirty_bitmap = NULL;
+ slots->memslots[i].dirty_bitmap_old = NULL;
kvm_free_physmem_slot(&slots->memslots[i], NULL);
+ }


+/*
+ * Please use generic *_user bitops once they become available.
+ * Be careful setting the bit won't be done atomically.
+ */


Please introduce the user bitops as part of this patchset.



OK, I will do that in the next version. In this RFC, I would be happy if I could
find out whether the overall design is right or not.
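
As a reference until generic user-space bitops exist, a helper along these lines is presumably what the comment above asks for (this sketch is an assumption, not the patch's code; the read-modify-write below is not atomic, which is exactly the caveat in the quoted comment):

/* sketch: non-atomic set_bit on a user-space bitmap */
static int set_bit_user_non_atomic(int nr, void __user *addr)
{
	u8 __user *p;
	u8 val;

	p = (u8 __user *)addr + nr / 8;
	if (get_user(val, p))
		return -EFAULT;
	val |= 1 << (nr % 8);
	if (put_user(val, p))
		return -EFAULT;
	return 0;
}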



Re: [RFC][PATCH v2 0/3] Provide a zero-copy method on KVM virtio-net.

2010-04-22 Thread Michael S. Tsirkin
On Thu, Apr 22, 2010 at 04:57:56PM +0800, Xin, Xiaohui wrote:
 Michael,
 
 Yes, I think this packet split mode probably maps well to mergeable buffer
 support. Note that
 1. Not all devices support large packets in this way, others might map
to indirect buffers better
 
 Are the indirect buffers meant to deal with skb->frag_list?

We currently use skb->frags.

So we have to figure out how migration is going to work
 Yes, different guest virtio-net drivers may contain different features.
 Does qemu migration work with different features supported by the virtio-net
 driver now?

For now, you must have identical feature-sets for migration to work.
And as long as we manage the buffers in software, we can always make
the features match.

 2. It's up to guest driver whether to enable features such as
mergeable buffers and indirect buffers
So we have to figure out how to notify guest which mode
is optimal for a given device
 Yes. When a device is bound, the mp device may query the capabilities from
 the driver. Actually, there is a structure in the mp device now that can do
 this; we can add some fields to support more.
 
 3. We don't want to depend on jumbo frames for decent performance
So we probably should support GSO/GRO
 GSO is for the tx side, right? I think the driver can handle it itself.
 For GRO, I'm not sure whether it's easy or not. Basically, the mp device we
 now support is doing what a raw socket is doing. The packets are not going
 to the host stack.

See commit bfd5f4a3d605e0f6054df0b59fe0907ff7e696d3
(it doesn't currently work with vhost net, but that's
 a separate story).

 -- 
 MST


Re: [PATCH RFC v2 6/6] KVM: introduce a new API for getting dirty bitmaps

2010-04-22 Thread Takuya Yoshikawa

Thanks, now I know the basic rules about kvm/api.

(2010/04/21 20:46), Avi Kivity wrote:




+Type: vm ioctl
+Parameters: struct kvm_dirty_log (in/out)
+Returns: 0 on success, -1 on error
+
+/* for KVM_SWITCH_DIRTY_LOG */
+struct kvm_dirty_log {
+ __u32 slot;
+ __u32 padding;


Please put a flags field here (and verify it is zero in the ioctl
implementation). This allows us to extend it later.


+ union {
+ void __user *dirty_bitmap; /* one bit per page */
+ __u64 addr;
+ };


Just make dirty_bitmap a __u64. With the union there is the risk that
someone forgets to zero the structure so we get some random bits in the
pointer, and also issues with big endian and s390 compat.

Reserve some extra space here for future expansion.

Hm. I see that this is the existing kvm_dirty_log structure. So we can't
play with it, ignore my comments about it.


So, introducing a new structure to export (and get) two bitmap addresses
with a flag is the way to go?

1. Qemu calls the ioctl to get the two addresses.
2. Qemu calls the ioctl to make KVM switch the dirty bitmaps.
   -- in this case, qemu has to change the address it got in step 1.
   ...
3. Qemu calls the ioctl (we can reuse 1. with a different command flag) to free the
bitmaps.


In my plan, we don't let qemu malloc() dirty bitmaps: at least, I want to make 
that another patch set, because this patch set already has some dependencies 
on other issues.

But, yes, I can make the struct with future expansion in mind if needed.
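
For illustration, an extensible argument struct along the lines suggested above might look like this (the struct name and layout are assumptions for discussion, not a posted patch):

/* hypothetical extensible replacement for struct kvm_dirty_log */
struct kvm_dirty_log_v2 {
	__u32 slot;
	__u32 flags;		/* must be zero for now, checked in the ioctl */
	__u64 dirty_bitmap;	/* user-space address as a plain __u64, no union */
	__u64 reserved[6];	/* room for future expansion */
};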




- r = kvm_copy_dirty_bitmap(log->dirty_bitmap, dirty_bitmap_old, n);
+ if (need_copy) {
+ r = kvm_copy_dirty_bitmap(log->dirty_bitmap,
+ dirty_bitmap_old, n);
+ } else {
+ log->addr = (unsigned long)dirty_bitmap_old;
+ r = 0;
+ }


Instead of passing need_copy everywhere, you might check a flag in
log->flags. You'll need a flag to know whether to munmap() or not, no?


To judge whether to munmap() or not, putting it in the memory slot's flags may be
more useful.

But as you suggest, we won't use kvm_dirty_log so parameters will change.



[PATCH 2/3] KVM MMU: fix sp-unsync type error in trace event definition.

2010-04-22 Thread Gui Jianfeng
sp->unsync is bool now, so update the trace event declaration.

Signed-off-by: Gui Jianfeng guijianf...@cn.fujitsu.com
---
 arch/x86/kvm/mmutrace.h |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/mmutrace.h b/arch/x86/kvm/mmutrace.h
index 3851f1f..9966e80 100644
--- a/arch/x86/kvm/mmutrace.h
+++ b/arch/x86/kvm/mmutrace.h
@@ -11,7 +11,7 @@
__field(__u64, gfn) \
__field(__u32, role) \
__field(__u32, root_count) \
-   __field(__u32, unsync)
+   __field(bool, unsync)
 
 #define KVM_MMU_PAGE_ASSIGN(sp) \
	__entry->gfn = sp->gfn;  \
-- 
1.6.5.2




[PATCH 3/3] KVM MMU: Take sp level into account when calculating quadran

2010-04-22 Thread Gui Jianfeng
Take sp level into account when calculating quadrant, because only when
level == PT_PAGE_TABLE_LEVEL, quadrant is needed.

Signed-off-by: Gui Jianfeng guijianf...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 640b82d..2a35a65 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1324,7 +1324,8 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 	if (role.direct)
 		role.cr4_pae = 0;
 	role.access = access;
-	if (vcpu->arch.mmu.root_level <= PT32_ROOT_LEVEL) {
+	if (vcpu->arch.mmu.root_level <= PT32_ROOT_LEVEL &&
+	    level == PT_PAGE_TABLE_LEVEL) {
 		quadrant = gaddr >> (PAGE_SHIFT + (PT64_PT_BITS * level));
 		quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
 		role.quadrant = quadrant;
-- 
1.6.5.2




Re: [RFC][PATCH v3 2/3] Provides multiple submits and asynchronous notifications.

2010-04-22 Thread Michael S. Tsirkin
On Thu, Apr 22, 2010 at 04:37:16PM +0800, xiaohui@intel.com wrote:
 From: Xin Xiaohui xiaohui@intel.com
 
 The vhost-net backend now only supports synchronous send/recv
 operations. The patch provides multiple submits and asynchronous
 notifications. This is needed for zero-copy case.
 
 Signed-off-by: Xin Xiaohui xiaohui@intel.com
 ---
 
 Michael,
 
 Can't vhost supply a kiocb completion callback that will handle the list?
 
 Yes, thanks. And with it I also remove the vq-receiver finally.
 
 Thanks
 Xiaohui

Nice progress. I commented on some minor issues below.
Thanks!

  drivers/vhost/net.c   |  227 +++--
  drivers/vhost/vhost.c |  115 ++---
  drivers/vhost/vhost.h |   14 +++
  3 files changed, 301 insertions(+), 55 deletions(-)
 
 diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
 index 22d5fef..4a70f66 100644
 --- a/drivers/vhost/net.c
 +++ b/drivers/vhost/net.c
 @@ -17,11 +17,13 @@
  #include <linux/workqueue.h>
  #include <linux/rcupdate.h>
  #include <linux/file.h>
 +#include <linux/aio.h>
  
  #include <linux/net.h>
  #include <linux/if_packet.h>
  #include <linux/if_arp.h>
  #include <linux/if_tun.h>
 +#include <linux/mpassthru.h>
  #include <net/sock.h>
  
 @@ -47,6 +49,7 @@ struct vhost_net {
 	struct vhost_dev dev;
 	struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
 	struct vhost_poll poll[VHOST_NET_VQ_MAX];
 +	struct kmem_cache	*cache;
 	/* Tells us whether we are polling a socket for TX.
 	 * We only do this when socket buffer fills up.
 	 * Protected by tx vq lock. */
 @@ -91,11 +94,132 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
 	net->tx_poll_state = VHOST_NET_POLL_STARTED;
  }
  
 +struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
 +{
 +	struct kiocb *iocb = NULL;
 +	unsigned long flags;
 +
 +	spin_lock_irqsave(&vq->notify_lock, flags);
 +	if (!list_empty(&vq->notifier)) {
 +		iocb = list_first_entry(&vq->notifier,
 +				struct kiocb, ki_list);
 +		list_del(&iocb->ki_list);
 +	}
 +	spin_unlock_irqrestore(&vq->notify_lock, flags);
 +	return iocb;
 +}
 +
 +static void handle_iocb(struct kiocb *iocb)
 +{
 +	struct vhost_virtqueue *vq = iocb->private;
 +	unsigned long flags;
 +
 +        spin_lock_irqsave(&vq->notify_lock, flags);
 +        list_add_tail(&iocb->ki_list, &vq->notifier);
 +        spin_unlock_irqrestore(&vq->notify_lock, flags);
 +}
 +

checkpatch.pl does not complain about the above?

 +static void handle_async_rx_events_notify(struct vhost_net *net,
 +		struct vhost_virtqueue *vq,
 +		struct socket *sock)

continuation lines should start to the right of (.

 +{
 +	struct kiocb *iocb = NULL;
 +	struct vhost_log *vq_log = NULL;
 +	int rx_total_len = 0;
 +	unsigned int head, log, in, out;
 +	int size;
 +
 +	if (vq->link_state != VHOST_VQ_LINK_ASYNC)
 +		return;
 +
 +	if (sock->sk->sk_data_ready)
 +		sock->sk->sk_data_ready(sock->sk, 0);
 +
 +	vq_log = unlikely(vhost_has_feature(
 +			&net->dev, VHOST_F_LOG_ALL)) ? vq->log : NULL;

split the above line at ?, continuation being to the left of ( looks
ugly.

 +	while ((iocb = notify_dequeue(vq)) != NULL) {
 +		vhost_add_used_and_signal(&net->dev, vq,
 +				iocb->ki_pos, iocb->ki_nbytes);
 +		log = (int)(iocb->ki_user_data >> 32);

how about we always do the recompute step, and not encode
the log bit in ki_user_data?

 +		size = iocb->ki_nbytes;
 +		head = iocb->ki_pos;
 +		rx_total_len += iocb->ki_nbytes;
 +
 +		if (iocb->ki_dtor)
 +			iocb->ki_dtor(iocb);
 +		kmem_cache_free(net->cache, iocb);
 +
 +		/* when log is enabled, recomputing the log info is needed,
 +		 * since these buffers are in async queue, and may not get
 +		 * the log info before.
 +		 */
 +		if (unlikely(vq_log)) {
 +			if (!log)
 +				__vhost_get_vq_desc(&net->dev, vq, vq->iov,
 +						    ARRAY_SIZE(vq->iov),
 +						    &out, &in, vq_log,
 +						    &log, head);
 +			vhost_log_write(vq, vq_log, log, size);
 +		}
 +		if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
 +			vhost_poll_queue(&vq->poll);
 +			break;
 +		}
 +	}
 +}
 +
 +static void handle_async_tx_events_notify(struct vhost_net *net,
 +		struct vhost_virtqueue *vq)
 +{
 +	struct kiocb *iocb = NULL;
 +	int tx_total_len = 0;
 +
 +	if (vq->link_state != VHOST_VQ_LINK_ASYNC)
 +		return;
 +
 +	while 

32-bit color graphic on KVM virtual machines

2010-04-22 Thread shacky
Hi.
Is it possible to have 32-bit color graphics on KVM virtual machines?
I installed a Windows virtual machine, but it allows me to configure
only a 24-bit color display, and it does not have any display driver
installed.

Is there a way to solve this problem?

Thank you very much!
Bye.


RE: [PATCH V5 1/3] perf kvm: Enhance perf to collect KVM guest os statistics from host side

2010-04-22 Thread Liu Yu-B13201
 

 -Original Message-
 From: kvm-ow...@vger.kernel.org 
 [mailto:kvm-ow...@vger.kernel.org] On Behalf Of Zhang, Yanmin
 Sent: Monday, April 19, 2010 1:33 PM
 To: Avi Kivity
 Cc: Ingo Molnar; Peter Zijlstra; Avi Kivity; Sheng Yang; 
 linux-ker...@vger.kernel.org; kvm@vger.kernel.org; Marcelo 
 Tosatti; oerg Roedel; Jes Sorensen; Gleb Natapov; Zachary 
 Amsden; zhiteng.hu...@intel.com; tim.c.c...@intel.com; 
 Arnaldo Carvalho de Melo
 Subject: [PATCH V5 1/3] perf  kvm: Enhance perf to collect 
 KVM guest os statistics from host side
 
Below patch introduces perf_guest_info_callbacks and related 
register/unregister functions, and adds more PERF_RECORD_MISC_XXX bits 
to mark guest kernel and guest user space.
 
 Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com
 
 ---

 diff -Nraup --exclude-from=exclude.diff linux-2.6_tip0417/include/linux/perf_event.h linux-2.6_tip0417_perfkvm/include/linux/perf_event.h
 --- linux-2.6_tip0417/include/linux/perf_event.h	2010-04-19 09:51:59.544791000 +0800
 +++ linux-2.6_tip0417_perfkvm/include/linux/perf_event.h	2010-04-19 09:53:59.691378953 +0800

 @@ -932,6 +940,12 @@ static inline void perf_event_mmap(struc
   __perf_event_mmap(vma);
  }
  
 +extern struct perf_guest_info_callbacks *perf_guest_cbs;
 +extern int perf_register_guest_info_callbacks(
 + struct perf_guest_info_callbacks *);
 +extern int perf_unregister_guest_info_callbacks(
 + struct perf_guest_info_callbacks *);
 +
  extern void perf_event_comm(struct task_struct *tsk);
  extern void perf_event_fork(struct task_struct *tsk);
  
 @@ -1001,6 +1015,11 @@ perf_sw_event(u32 event_id, u64 nr, int 
  static inline void
  perf_bp_event(struct perf_event *event, void *data)  
   { }
  
 +static inline int perf_register_guest_info_callbacks
 +(struct perf_guest_info_callbacks *) {return 0; }
 +static inline int perf_unregister_guest_info_callbacks
 +(struct perf_guest_info_callbacks *) {return 0; }
 +
  static inline void perf_event_mmap(struct vm_area_struct 
 *vma) { }
  static inline void perf_event_comm(struct task_struct *tsk)  
   { }
  static inline void perf_event_fork(struct task_struct *tsk)  
   { }


Hi,

I met this error when building the kernel. Anything wrong?

  CC  init/main.o
In file included from include/linux/ftrace_event.h:8,
 from include/trace/syscall.h:6,
 from include/linux/syscalls.h:75,
 from init/main.c:16:
include/linux/perf_event.h: In function 'perf_register_guest_info_callbacks':
include/linux/perf_event.h:1019: error: parameter name omitted
include/linux/perf_event.h: In function 'perf_unregister_guest_info_callbacks':
include/linux/perf_event.h:1021: error: parameter name omitted
make[1]: *** [init/main.o] Error 1
make: *** [init] Error 2
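
The error comes from the stub definitions in the quoted hunk: they are function definitions, and C function definitions (unlike declarations) must name their parameters. A minimal sketch of the likely fix (the parameter name itself is arbitrary):

static inline int perf_register_guest_info_callbacks
(struct perf_guest_info_callbacks *callbacks)		{ return 0; }
static inline int perf_unregister_guest_info_callbacks
(struct perf_guest_info_callbacks *callbacks)		{ return 0; }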



[PATCH 1/8] KVM: SVM: Fix nested nmi handling

2010-04-22 Thread Joerg Roedel
The patch introducing nested NMI handling had a bug. The
check does not belong in enable_nmi_window but must be in
nmi_allowed. This patch fixes this.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
---
 arch/x86/kvm/svm.c |   16 +---
 1 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index ab78eb8..ec20584 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -2771,8 +2771,12 @@ static int svm_nmi_allowed(struct kvm_vcpu *vcpu)
 {
struct vcpu_svm *svm = to_svm(vcpu);
	struct vmcb *vmcb = svm->vmcb;
-	return !(vmcb->control.int_state & SVM_INTERRUPT_SHADOW_MASK) &&
-		!(svm->vcpu.arch.hflags & HF_NMI_MASK);
+	int ret;
+	ret = !(vmcb->control.int_state & SVM_INTERRUPT_SHADOW_MASK) &&
+	      !(svm->vcpu.arch.hflags & HF_NMI_MASK);
+	ret = ret && gif_set(svm) && nested_svm_nmi(svm);
+
+	return ret;
 }
 
 static bool svm_get_nmi_mask(struct kvm_vcpu *vcpu)
@@ -2841,11 +2845,9 @@ static void enable_nmi_window(struct kvm_vcpu *vcpu)
 * Something prevents NMI from been injected. Single step over possible
 * problem (IRET or exception injection or interrupt shadow)
 */
-	if (gif_set(svm) && nested_svm_nmi(svm)) {
-		svm->nmi_singlestep = true;
-		svm->vmcb->save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF);
-		update_db_intercept(vcpu);
-	}
+	svm->nmi_singlestep = true;
+	svm->vmcb->save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF);
+	update_db_intercept(vcpu);
 }
 
 static int svm_set_tss_addr(struct kvm *kvm, unsigned int addr)
-- 
1.7.0.4




[PATCH 4/8] KVM: SVM: Propagate nested entry failure into guest hypervisor

2010-04-22 Thread Joerg Roedel
This patch implements propagation of a failed guest vmrun
back into the guest instead of killing the whole guest.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
---
 arch/x86/kvm/svm.c |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 5ad9d80..b10d163 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1715,6 +1715,10 @@ static int nested_svm_intercept(struct vcpu_svm *svm)
vmexit = NESTED_EXIT_DONE;
break;
}
+   case SVM_EXIT_ERR: {
+   vmexit = NESTED_EXIT_DONE;
+   break;
+   }
default: {
		u64 exit_bits = 1ULL << (exit_code - SVM_EXIT_INTR);
		if (svm->nested.intercept & exit_bits)
-- 
1.7.0.4




[PATCH 3/8] KVM: SVM: Sync cr0 and cr3 to kvm state before nested handling

2010-04-22 Thread Joerg Roedel
This patch syncs cr0 and cr3 from the vmcb to the kvm state
before nested intercept handling is done. This allows us to
simplify the vmexit path.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
---
 arch/x86/kvm/svm.c |   15 ++-
 1 files changed, 6 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index c480d7f..5ad9d80 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1799,10 +1799,7 @@ static int nested_svm_vmexit(struct vcpu_svm *svm)
	nested_vmcb->save.gdtr   = vmcb->save.gdtr;
	nested_vmcb->save.idtr   = vmcb->save.idtr;
	nested_vmcb->save.cr0    = kvm_read_cr0(&svm->vcpu);
-	if (npt_enabled)
-		nested_vmcb->save.cr3    = vmcb->save.cr3;
-	else
-		nested_vmcb->save.cr3    = svm->vcpu.arch.cr3;
+	nested_vmcb->save.cr3    = svm->vcpu.arch.cr3;
	nested_vmcb->save.cr2    = vmcb->save.cr2;
	nested_vmcb->save.cr4    = svm->vcpu.arch.cr4;
	nested_vmcb->save.rflags = vmcb->save.rflags;
@@ -2641,6 +2638,11 @@ static int handle_exit(struct kvm_vcpu *vcpu)
 
trace_kvm_exit(exit_code, vcpu);
 
+	if (!(svm->vmcb->control.intercept_cr_write & INTERCEPT_CR0_MASK))
+		vcpu->arch.cr0 = svm->vmcb->save.cr0;
+	if (npt_enabled)
+		vcpu->arch.cr3 = svm->vmcb->save.cr3;
+
	if (unlikely(svm->nested.exit_required)) {
		nested_svm_vmexit(svm);
		svm->nested.exit_required = false;
@@ -2668,11 +2670,6 @@ static int handle_exit(struct kvm_vcpu *vcpu)
 
	svm_complete_interrupts(svm);
 
-	if (!(svm->vmcb->control.intercept_cr_write & INTERCEPT_CR0_MASK))
-		vcpu->arch.cr0 = svm->vmcb->save.cr0;
-	if (npt_enabled)
-		vcpu->arch.cr3 = svm->vmcb->save.cr3;
-
	if (svm->vmcb->control.exit_code == SVM_EXIT_ERR) {
		kvm_run->exit_reason = KVM_EXIT_FAIL_ENTRY;
		kvm_run->fail_entry.hardware_entry_failure_reason
-- 
1.7.0.4




[PATCH 7/8] KVM: x86: Allow marking an exception as reinjected

2010-04-22 Thread Joerg Roedel
This patch adds logic to kvm/x86 which allows marking an
injected exception as reinjected. This allows removing an
ugly hack from svm_complete_interrupts that prevented
exceptions from being reinjected at all in the nested case.
The hack was necessary because a reinjected exception into
the nested guest could cause a nested vmexit emulation. But
reinjected exceptions must not intercept. The downside of
the hack is that an exception that is injected could get
lost.
This patch fixes the problem and puts the code for it into
generic x86 files because Nested-VMX will likely have the
same problem and could reuse the code.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
---
 arch/x86/include/asm/kvm_host.h |6 +-
 arch/x86/kvm/svm.c  |   12 ++--
 arch/x86/kvm/vmx.c  |3 ++-
 arch/x86/kvm/x86.c  |   23 +++
 4 files changed, 32 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 357573a..3f0007b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -312,6 +312,7 @@ struct kvm_vcpu_arch {
struct kvm_queued_exception {
bool pending;
bool has_error_code;
+   bool reinject;
u8 nr;
u32 error_code;
} exception;
@@ -514,7 +515,8 @@ struct kvm_x86_ops {
void (*set_irq)(struct kvm_vcpu *vcpu);
void (*set_nmi)(struct kvm_vcpu *vcpu);
void (*queue_exception)(struct kvm_vcpu *vcpu, unsigned nr,
-   bool has_error_code, u32 error_code);
+   bool has_error_code, u32 error_code,
+   bool reinject);
int (*interrupt_allowed)(struct kvm_vcpu *vcpu);
int (*nmi_allowed)(struct kvm_vcpu *vcpu);
bool (*get_nmi_mask)(struct kvm_vcpu *vcpu);
@@ -617,6 +619,8 @@ void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags);
 
 void kvm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr);
 void kvm_queue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code);
+void kvm_requeue_exception(struct kvm_vcpu *vcpu, unsigned nr);
+void kvm_requeue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code);
 void kvm_inject_page_fault(struct kvm_vcpu *vcpu, unsigned long cr2,
   u32 error_code);
 bool kvm_require_cpl(struct kvm_vcpu *vcpu, int required_cpl);
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 65fc114..30e49fe 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -338,7 +338,8 @@ static void skip_emulated_instruction(struct kvm_vcpu *vcpu)
 }
 
 static void svm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr,
-   bool has_error_code, u32 error_code)
+   bool has_error_code, u32 error_code,
+   bool reinject)
 {
struct vcpu_svm *svm = to_svm(vcpu);
 
@@ -346,7 +347,8 @@ static void svm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr,
 * If we are within a nested VM we'd better #VMEXIT and let the guest
 * handle the exception
 */
-   if (nested_svm_check_exception(svm, nr, has_error_code, error_code))
+	if (!reinject &&
+	    nested_svm_check_exception(svm, nr, has_error_code, error_code))
return;
 
	if (nr == BP_VECTOR && !svm_has(SVM_FEATURE_NRIP)) {
@@ -2918,8 +2920,6 @@ static void svm_complete_interrupts(struct vcpu_svm *svm)
		svm->vcpu.arch.nmi_injected = true;
break;
case SVM_EXITINTINFO_TYPE_EXEPT:
-   if (is_nested(svm))
-   break;
/*
 * In case of software exceptions, do not reinject the vector,
 * but re-execute the instruction instead. Rewind RIP first
@@ -2935,10 +2935,10 @@ static void svm_complete_interrupts(struct vcpu_svm *svm)
	}
	if (exitintinfo & SVM_EXITINTINFO_VALID_ERR) {
		u32 err = svm->vmcb->control.exit_int_info_err;
-		kvm_queue_exception_e(&svm->vcpu, vector, err);
+		kvm_requeue_exception_e(&svm->vcpu, vector, err);
 
	} else
-		kvm_queue_exception(&svm->vcpu, vector);
+		kvm_requeue_exception(&svm->vcpu, vector);
break;
case SVM_EXITINTINFO_TYPE_INTR:
		kvm_queue_interrupt(&svm->vcpu, vector, false);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index dc023bc..285fe1a 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -919,7 +919,8 @@ static void skip_emulated_instruction(struct kvm_vcpu *vcpu)
 }
 
 static void vmx_queue_exception(struct kvm_vcpu *vcpu, unsigned nr,
-   bool has_error_code, u32 error_code)
+   bool has_error_code, u32 

[PATCH 8/8] KVM: SVM: Handle MCE intercepts always on host level

2010-04-22 Thread Joerg Roedel
This patch prevents MCE intercepts from being propagated
into the L1 guest if they happened in an L2 guest.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
---
 arch/x86/kvm/svm.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 30e49fe..889f660 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1651,6 +1651,7 @@ static int nested_svm_exit_special(struct vcpu_svm *svm)
switch (exit_code) {
case SVM_EXIT_INTR:
case SVM_EXIT_NMI:
+   case SVM_EXIT_EXCP_BASE + MC_VECTOR:
return NESTED_EXIT_HOST;
case SVM_EXIT_NPF:
/* For now we are always handling NPFs when using them */
-- 
1.7.0.4




[PATCH 6/8] KVM: SVM: Report emulated SVM features to userspace

2010-04-22 Thread Joerg Roedel
This patch implements the reporting of the emulated SVM
features to userspace instead of the real hardware
capabilities. Every real hardware capability needs emulation
in nested svm so the old behavior was broken.

Cc: sta...@kernel.org
Signed-off-by: Joerg Roedel joerg.roe...@amd.com
---
 arch/x86/kvm/svm.c |   10 ++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 0fa2035..65fc114 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3154,6 +3154,16 @@ static void svm_cpuid_update(struct kvm_vcpu *vcpu)
 
 static void svm_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry)
 {
+	switch (func) {
+	case 0x8000000A:
+		entry->eax = 1; /* SVM revision 1 */
+		entry->ebx = 8; /* Lets support 8 ASIDs in case we add proper
+				   ASID emulation to nested SVM */
+		entry->ecx = 0; /* Reserved */
+		entry->edx = 0; /* Do not support any additional features */
+
+		break;
+	}
 }
 
 static const struct trace_print_flags svm_exit_reasons_str[] = {
-- 
1.7.0.4




[PATCH 5/8] KVM: X86: Add callback to let modules decide over some supported cpuid bits

2010-04-22 Thread Joerg Roedel
This patch adds the set_supported_cpuid callback to
kvm_x86_ops. It will be used in do_cpuid_ent to delegate the
decision about some supported cpuid bits to the
architecture modules.

Cc: sta...@kernel.org
Signed-off-by: Joerg Roedel joerg.roe...@amd.com
---
 arch/x86/include/asm/kvm_host.h |2 ++
 arch/x86/kvm/svm.c  |6 ++
 arch/x86/kvm/vmx.c  |6 ++
 arch/x86/kvm/x86.c  |3 +++
 4 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d47d087..357573a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -528,6 +528,8 @@ struct kvm_x86_ops {
int (*get_lpage_level)(void);
bool (*rdtscp_supported)(void);
 
+   void (*set_supported_cpuid)(u32 func, struct kvm_cpuid_entry2 *entry);
+
const struct trace_print_flags *exit_reasons_str;
 };
 
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index b10d163..0fa2035 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3152,6 +3152,10 @@ static void svm_cpuid_update(struct kvm_vcpu *vcpu)
 {
 }
 
+static void svm_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry)
+{
+}
+
 static const struct trace_print_flags svm_exit_reasons_str[] = {
{ SVM_EXIT_READ_CR0,read_cr0 },
{ SVM_EXIT_READ_CR3,read_cr3 },
@@ -3297,6 +3301,8 @@ static struct kvm_x86_ops svm_x86_ops = {
.cpuid_update = svm_cpuid_update,
 
.rdtscp_supported = svm_rdtscp_supported,
+
+   .set_supported_cpuid = svm_set_supported_cpuid,
 };
 
 static int __init svm_init(void)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index d0a10b5..dc023bc 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4111,6 +4111,10 @@ static void vmx_cpuid_update(struct kvm_vcpu *vcpu)
}
 }
 
+static void vmx_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry)
+{
+}
+
 static struct kvm_x86_ops vmx_x86_ops = {
.cpu_has_kvm_support = cpu_has_kvm_support,
.disabled_by_bios = vmx_disabled_by_bios,
@@ -4183,6 +4187,8 @@ static struct kvm_x86_ops vmx_x86_ops = {
.cpuid_update = vmx_cpuid_update,
 
.rdtscp_supported = vmx_rdtscp_supported,
+
+   .set_supported_cpuid = vmx_set_supported_cpuid,
 };
 
 static int __init vmx_init(void)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cf37ac6..c2c3c29 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1960,6 +1960,9 @@ static void do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
		entry->ecx = kvm_supported_word6_x86_features;
		break;
	}
+
+	kvm_x86_ops->set_supported_cpuid(function, entry);
+
	put_cpu();
 }
 
-- 
1.7.0.4




[PATCH 0/8] More fixes for nested svm

2010-04-22 Thread Joerg Roedel
Hi Avi, Marcelo,

here is another set of nested svm fixes. They fix the NMI code to make the UP
Hyper-V root domain boot. The patches also add better handling for nested entry
failures and MCE intercepts.
Also in this patchset are the fixes for the supported cpuid reporting for svm
features. These patches were taken from the nested-npt patchset and slightly
modified. These patches are also marked for -stable backporting.
Probably the most important fix is about exception reinjection. This didn't
work reliably before and is fixed with the patch in this series now. This fix
also touches common x86 code, but that should be ok because it could be reused
by nested-vmx later.
Please review and give comments (or apply ;-).

Thanks,

Joerg

Diffstat:

 arch/x86/include/asm/kvm_host.h |8 -
 arch/x86/kvm/svm.c  |   72 +--
 arch/x86/kvm/vmx.c  |9 -
 arch/x86/kvm/x86.c  |   26 --
 4 files changed, 83 insertions(+), 32 deletions(-)

Shortlog:

Joerg Roedel (8):
  KVM: SVM: Fix nested nmi handling
  KVM: SVM: Make sure rip is synced to vmcb before nested vmexit
  KVM: SVM: Sync cr0 and cr3 to kvm state before nested handling
  KVM: SVM: Propagate nested entry failure into guest hypervisor
  KVM: X86: Add callback to let modules decide over some supported cpuid 
bits
  KVM: SVM: Report emulated SVM features to userspace
  KVM: x86: Allow marking an exception as reinjected
  KVM: SVM: Handle MCE intercepts always on host level




[PATCH 2/8] KVM: SVM: Make sure rip is synced to vmcb before nested vmexit

2010-04-22 Thread Joerg Roedel
This patch fixes a bug where a nested guest always went over
the same instruction because the rip was not advanced on a
nested vmexit.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
---
 arch/x86/kvm/svm.c |8 
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index ec20584..c480d7f 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -2960,6 +2960,10 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu)
u16 gs_selector;
u16 ldt_selector;
 
+	svm->vmcb->save.rax = vcpu->arch.regs[VCPU_REGS_RAX];
+	svm->vmcb->save.rsp = vcpu->arch.regs[VCPU_REGS_RSP];
+	svm->vmcb->save.rip = vcpu->arch.regs[VCPU_REGS_RIP];
+
	/*
	 * A vmexit emulation is required before the vcpu can be executed
	 * again.
@@ -2967,10 +2971,6 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu)
	if (unlikely(svm->nested.exit_required))
		return;
 
-	svm->vmcb->save.rax = vcpu->arch.regs[VCPU_REGS_RAX];
-	svm->vmcb->save.rsp = vcpu->arch.regs[VCPU_REGS_RSP];
-	svm->vmcb->save.rip = vcpu->arch.regs[VCPU_REGS_RIP];
-
pre_svm_run(svm);
 
sync_lapic_to_cr8(vcpu);
-- 
1.7.0.4




Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1

2010-04-22 Thread Yoshiaki Tamura

Dor Laor wrote:

On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:

Hi all,

We have been implementing the prototype of Kemari for KVM, and we're
sending
this message to share what we have now and TODO lists. Hopefully, we
would like
to get early feedback to keep us in the right direction. Although
advanced
approaches in the TODO lists are fascinating, we would like to run
this project
step by step while absorbing comments from the community. The current
code is
based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.

For those who are new to Kemari for KVM, please take a look at the
following RFC which we posted last year.

http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html

The transmission/transaction protocol, and most of the control logic is
implemented in QEMU. However, we needed a hack in KVM to prevent rip from
proceeding before synchronizing VMs. It may also need some plumbing in
the
kernel side to guarantee replayability of certain events and
instructions,
integrate the RAS capabilities of newer x86 hardware with the HA
stack, as well
as for optimization purposes, for example.


[ snap]



The rest of this message describes TODO lists grouped by each topic.

=== event tapping ===

Event tapping is the core component of Kemari, and it decides on which
event the
primary should synchronize with the secondary. The basic assumption
here is
that outgoing I/O operations are idempotent, which is usually true for
disk I/O
and reliable network protocols such as TCP.


IMO any type of network event should be stalled too. What if the VM runs a 
non-TCP protocol and the packet that the master node sent reached some 
remote client, and before the sync to the slave the master failed?


In the current implementation, it is actually stalling any type of network traffic 
that goes through virtio-net.


However, if the application was using unreliable protocols, it should have its 
own recovery mechanism, or it should be completely stateless.



[snap]



=== clock ===

Since synchronizing the virtual machines every time the TSC is accessed
would be prohibitive, the transmission of the TSC will be done lazily, which
means delaying it until a non-TSC synchronization point arrives.


Why do you specifically care about the tsc sync? When you sync all the
IO model on snapshot it also synchronizes the tsc.

In general, can you please explain the 'algorithm' for continuous
snapshots (is that what you'd like to do?):


Yes, of course.
Sorry for being less informative.


A trivial one would be to:
- do X online snapshots/sec


I currently don't have good numbers that I can share right now.
Snapshots/sec depends on what kind of workload is running; if the guest is 
almost idle, there will be no snapshots in 5 sec.  On the other hand, if the 
guest is running I/O intensive workloads (netperf, iozone for example), there 
will be about 50 snapshots/sec.



- Stall all IO (disk/block) from the guest to the outside world
until the previous snapshot reaches the slave.


Yes, it does.


- Snapshots are made of


Full device model + diff of dirty pages from the last snapshot.


- diff of dirty pages from last snapshot


This also depends on the workload.
In case of I/O intensive workloads, dirty pages are usually less than 100.


- Qemu device model (+kvm's) diff from last.


We're currently sending a full copy because we're completely reusing this part of 
the existing live migration framework.


Last time we measured, it was about 13KB.
But it varies by which QEMU version is used.


You can do 'light' snapshots in between to send dirty pages to reduce
snapshot time.


I agree.  That's one of the advanced topics we would like to try too.


I wrote the above to serve a reference for your comments so it will map
into my mind. Thanks, dor


Thank you for the guidance.
I hope this answers your question.

At the same time, I would also be happy if we could discuss how to implement 
it too.  In fact, we needed a hack to prevent rip from proceeding in KVM, which 
turned out not to be the best workaround.


Thanks,

Yoshi





TODO:
- Synchronization of clock sources (need to intercept TSC reads, etc).

=== usability ===

These are items that define how users interact with Kemari.

TODO:
- Kemarid daemon that takes care of the cluster management/monitoring
side of things.
- Some device emulators might need minor modifications to work well
with Kemari. Use white(black)-listing to take the burden of
choosing the right device model off the users.

=== optimizations ===

Although the big picture can be realized by completing the TODO list above,
we need some optimizations/enhancements to make Kemari useful in the real
world, and these are the items that need to be done for that.

TODO:
- SMP (for the sake of performance might need to implement a
synchronization protocol that can maintain two or more
synchronization points active at any given moment)
- VGA (leverage VNC's sub-tiling mechanism to identify fb pages that
are really dirty).

[PATCH] qemu-kvm: Ask kernel about supported svm features

2010-04-22 Thread Joerg Roedel
This patch adds code to ask the kernel about the svm
features it supports for its guests and propagates them to
the guest. The new capability is necessary because the old
behavior of the kernel was to just return the host svm
features but every svm-feature needs emulation in the nested
svm kernel code. The new capability indicates that the
kernel is aware of that when returning svm cpuid
information.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
---
 qemu-kvm-x86.c |   14 --
 1 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c
index 748ff69..6eccd69 100644
--- a/qemu-kvm-x86.c
+++ b/qemu-kvm-x86.c
@@ -1327,8 +1327,18 @@ int kvm_arch_init_vcpu(CPUState *cenv)
 qemu_kvm_cpuid_on_env(copy);
 limit = copy.regs[R_EAX];
 
-for (i = 0x8000; i = limit; ++i)
-   do_cpuid_ent(cpuid_ent[cpuid_nent++], i, 0, copy);
+for (i = 0x8000; i = limit; ++i) {
+do_cpuid_ent(cpuid_ent[cpuid_nent], i, 0, copy);
+switch (i) {
+case 0x800a:
+cpuid_ent[cpuid_nent].eax = kvm_arch_get_supported_cpuid(cenv, 
0x800a, R_EAX);
+cpuid_ent[cpuid_nent].ebx = kvm_arch_get_supported_cpuid(cenv, 
0x800a, R_EBX);
+cpuid_ent[cpuid_nent].ebx = kvm_arch_get_supported_cpuid(cenv, 
0x800a, R_EBX);
+cpuid_ent[cpuid_nent].edx = kvm_arch_get_supported_cpuid(cenv, 
0x800a, R_EDX);
+break;
+}
+cpuid_nent += 1;
+}
 
 kvm_setup_cpuid2(cenv, cpuid_nent, cpuid_ent);
 
-- 
1.7.0.4




Re: [PATCH 0/8] More fixes for nested svm

2010-04-22 Thread Avi Kivity

On 04/22/2010 01:33 PM, Joerg Roedel wrote:

Hi Avi, Marcelo,

here is another set of nested svm fixes. They fix the NMI code to make the UP
Hyper-V root domain boot. The patches also add better handling for nested entry
failures and MCE intercepts.
Also in this patchset are the fixes for the supported cpuid reporting for svm
features. These patches were taken from the nested-npt patchset and slightly
modified. These patches are also marked for -stable backporting.
Probably the most important fix is about exception reinjection. This didn't
work reliably before and is fixed with the patch in this series now. This fix
also touches common x86 code, but that should be ok because it could be reused
by nested-vmx later.
Please review and give comments (or apply ;-).
   


Looks good.  I'll wait a day or two for more reviews (esp. Alex).

--
error compiling committee.c: too many arguments to function



Re: [PATCH] qemu-kvm: Ask kernel about supported svm features

2010-04-22 Thread Avi Kivity

On 04/22/2010 01:57 PM, Joerg Roedel wrote:

This patch adds code to ask the kernel about the svm
features it supports for its guests and propagates them to
the guest. The new capability is necessary because the old
behavior of the kernel was to just return the host svm
features but every svm-feature needs emulation in the nested
svm kernel code. The new capability indicates that the
kernel is aware of that when returning svm cpuid
information.
   


No new capability here.


Signed-off-by: Joerg Roedeljoerg.roe...@amd.com
---
  qemu-kvm-x86.c |   14 --
  1 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c
index 748ff69..6eccd69 100644
--- a/qemu-kvm-x86.c
+++ b/qemu-kvm-x86.c
@@ -1327,8 +1327,18 @@ int kvm_arch_init_vcpu(CPUState *cenv)
      qemu_kvm_cpuid_on_env(&copy);
      limit = copy.regs[R_EAX];
 
 -    for (i = 0x80000000; i <= limit; ++i)
 -        do_cpuid_ent(&cpuid_ent[cpuid_nent++], i, 0, &copy);
 +    for (i = 0x80000000; i <= limit; ++i) {
 +        do_cpuid_ent(&cpuid_ent[cpuid_nent], i, 0, &copy);
 +        switch (i) {
 +        case 0x8000000a:
 +            cpuid_ent[cpuid_nent].eax = kvm_arch_get_supported_cpuid(cenv, 0x8000000a, R_EAX);
 +            cpuid_ent[cpuid_nent].ebx = kvm_arch_get_supported_cpuid(cenv, 0x8000000a, R_EBX);
 +            cpuid_ent[cpuid_nent].ebx = kvm_arch_get_supported_cpuid(cenv, 0x8000000a, R_EBX);
 +            cpuid_ent[cpuid_nent].edx = kvm_arch_get_supported_cpuid(cenv, 0x8000000a, R_EDX);
 +            break;
 +        }
 +        cpuid_nent += 1;
 +    }
   


I don't understand why this is different compared to all other cpuid bits.

--
error compiling committee.c: too many arguments to function



Re: [PATCH] kvm vmx: free vpid when fail to create vcpu

2010-04-22 Thread Avi Kivity

On 04/17/2010 11:41 AM, Lai Jiangshan wrote:

Fix a bug in the exception path: free the allocated vpid when vcpu
creation fails.
   


Applied, thanks.

--
error compiling committee.c: too many arguments to function



Re: [PATCH] qemu-kvm: Ask kernel about supported svm features

2010-04-22 Thread Joerg Roedel
On Thu, Apr 22, 2010 at 02:07:14PM +0300, Avi Kivity wrote:
 On 04/22/2010 01:57 PM, Joerg Roedel wrote:
 This patch adds code to ask the kernel about the svm
 features it supports for its guests and propagates them to
 the guest. The new capability is necessary because the old
 behavior of the kernel was to just return the host svm
 features but every svm-feature needs emulation in the nested
 svm kernel code. The new capability indicates that the
 kernel is aware of that when returning svm cpuid
 information.
 
 No new capability here.

copypaste error, sorry.

 
 Signed-off-by: Joerg Roedeljoerg.roe...@amd.com
 ---
   qemu-kvm-x86.c |   14 --
   1 files changed, 12 insertions(+), 2 deletions(-)
 
 diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c
 index 748ff69..6eccd69 100644
 --- a/qemu-kvm-x86.c
 +++ b/qemu-kvm-x86.c
 @@ -1327,8 +1327,18 @@ int kvm_arch_init_vcpu(CPUState *cenv)
       qemu_kvm_cpuid_on_env(&copy);
       limit = copy.regs[R_EAX];
  
  -    for (i = 0x80000000; i <= limit; ++i)
  -        do_cpuid_ent(&cpuid_ent[cpuid_nent++], i, 0, &copy);
  +    for (i = 0x80000000; i <= limit; ++i) {
  +        do_cpuid_ent(&cpuid_ent[cpuid_nent], i, 0, &copy);
  +        switch (i) {
  +        case 0x8000000a:
  +            cpuid_ent[cpuid_nent].eax = kvm_arch_get_supported_cpuid(cenv, 0x8000000a, R_EAX);
  +            cpuid_ent[cpuid_nent].ebx = kvm_arch_get_supported_cpuid(cenv, 0x8000000a, R_EBX);
  +            cpuid_ent[cpuid_nent].ebx = kvm_arch_get_supported_cpuid(cenv, 0x8000000a, R_EBX);
  +            cpuid_ent[cpuid_nent].edx = kvm_arch_get_supported_cpuid(cenv, 0x8000000a, R_EDX);
  +            break;
  +        }
  +        cpuid_nent += 1;
  +    }
 
 I don't understand why this is different compared to all other cpuid bits.

Because for the SVM features we report to the guest, we need to ask the
kernel which of them are supported. We can't just take the host cpuid,
because most of the additional svm features need special emulation in
the kernel. Or do you think this would better be handled in
target-i386/cpuid.c?

Joerg




Re: [PATCH] qemu-kvm: Ask kernel about supported svm features

2010-04-22 Thread Avi Kivity

On 04/22/2010 03:02 PM, Joerg Roedel wrote:



Signed-off-by: Joerg Roedeljoerg.roe...@amd.com
---
  qemu-kvm-x86.c |   14 --
  1 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c
index 748ff69..6eccd69 100644
--- a/qemu-kvm-x86.c
+++ b/qemu-kvm-x86.c
@@ -1327,8 +1327,18 @@ int kvm_arch_init_vcpu(CPUState *cenv)
      qemu_kvm_cpuid_on_env(&copy);
      limit = copy.regs[R_EAX];

 -    for (i = 0x80000000; i <= limit; ++i)
 -        do_cpuid_ent(&cpuid_ent[cpuid_nent++], i, 0, &copy);
 +    for (i = 0x80000000; i <= limit; ++i) {
 +        do_cpuid_ent(&cpuid_ent[cpuid_nent], i, 0, &copy);
 +        switch (i) {
 +        case 0x8000000a:
 +            cpuid_ent[cpuid_nent].eax = kvm_arch_get_supported_cpuid(cenv, 0x8000000a, R_EAX);
 +            cpuid_ent[cpuid_nent].ebx = kvm_arch_get_supported_cpuid(cenv, 0x8000000a, R_EBX);
 +            cpuid_ent[cpuid_nent].ebx = kvm_arch_get_supported_cpuid(cenv, 0x8000000a, R_EBX);
 +            cpuid_ent[cpuid_nent].edx = kvm_arch_get_supported_cpuid(cenv, 0x8000000a, R_EDX);
 +            break;
 +        }
 +        cpuid_nent += 1;
 +    }
   

I don't understand why this is different compared to all other cpuid bits.
 

Because for the SVM features we report to the guest we need to ask the
kernel which of them are supported.


That's true for all cpuid features.


We can't just take the host-cpuid
because most of the additional svm features need special emulation in
the kernel. Or do you think this should better be handled in
target-i386/cpuid.c?
   


Yes.  -cpu host should take KVM_GET_SUPPORTED_CPUID output and loop it 
back to the vcpu configuration, others just take the qemu configuration, 
mask it with supported bits, and pass it back (see 
check_features_against_host()).


(need feature names for the bits, too, so you can enable or disable them 
from the command line)


--
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1

2010-04-22 Thread Dor Laor

On 04/22/2010 01:35 PM, Yoshiaki Tamura wrote:

Dor Laor wrote:

On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:

Hi all,

We have been implementing the prototype of Kemari for KVM, and we're
sending
this message to share what we have now and TODO lists. Hopefully, we
would like
to get early feedback to keep us in the right direction. Although
advanced
approaches in the TODO lists are fascinating, we would like to run
this project
step by step while absorbing comments from the community. The current
code is
based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.

For those who are new to Kemari for KVM, please take a look at the
following RFC which we posted last year.

http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html

The transmission/transaction protocol, and most of the control logic is
implemented in QEMU. However, we needed a hack in KVM to prevent rip
from
proceeding before synchronizing VMs. It may also need some plumbing in
the
kernel side to guarantee replayability of certain events and
instructions,
integrate the RAS capabilities of newer x86 hardware with the HA
stack, as well
as for optimization purposes, for example.


[ snap]



The rest of this message describes TODO lists grouped by each topic.

=== event tapping ===

Event tapping is the core component of Kemari, and it decides on which
event the
primary should synchronize with the secondary. The basic assumption
here is
that outgoing I/O operations are idempotent, which is usually true for
disk I/O
and reliable network protocols such as TCP.


IMO any type of network event should be stalled too. What if the VM runs
a non-TCP protocol, and a packet that the master node sent reached some
remote client, and the master failed before the sync to the slave?


In current implementation, it is actually stalling any type of network
that goes through virtio-net.

However, if the application was using unreliable protocols, it should
have its own recovery mechanism, or it should be completely stateless.


Why do you treat TCP differently? You can damage the entire VM this way 
- think of a DHCP request that was dropped at the moment you switched 
between the master and the slave.






[snap]



=== clock ===

Since synchronizing the virtual machines every time the TSC is
accessed would be
prohibitive, the transmission of the TSC will be done lazily, which
means
delaying it until a non-TSC synchronization point arrives.


Why do you specifically care about the tsc sync? When you sync all the
IO model on snapshot it also synchronizes the tsc.


So, do you agree that an extra clock synchronization is not needed since 
it is done anyway as part of the live migration state sync?




In general, can you please explain the 'algorithm' for continuous
snapshots (is that what you like to do?):


Yes, of course.
Sorry for being less informative.


A trivial one would be to:
- do X online snapshots/sec


I currently don't have good numbers that I can share right now.
Snapshots/sec depends on what kind of workload is running, and if the
guest was almost idle, there will be no snapshots in 5sec. On the other
hand, if the guest was running I/O intensive workloads (netperf, iozone
for example), there will be about 50 snapshots/sec.


- Stall all IO (disk/block) from the guest to the outside world
until the previous snapshot reaches the slave.


Yes, it does.


- Snapshots are made of


Full device model + diff of dirty pages from the last snapshot.


- diff of dirty pages from last snapshot


This also depends on the workload.
In case of I/O intensive workloads, dirty pages are usually less than 100.


The hardest would be memory-intensive loads.
So 100 snapshots/sec means a latency of 10 msec, right?
(Not that that's not OK; with faster hardware and InfiniBand you'll be able 
to get much more.)





- Qemu device model (+kvm's) diff from last.


We're currently sending a full copy because we're completely reusing this
part of the existing live migration framework.

Last time we measured, it was about 13KB.
But it varies by which QEMU version is used.


You can do 'light' snapshots in between to send dirty pages to reduce
snapshot time.


I agree. That's one of the advanced topics we would like to try too.


I wrote the above to serve as a reference for your comments so it will map
into my mind. Thanks, dor


Thank you for the guidance.
I hope this answers your question.

At the same time, I would also be happy if we could discuss how to
implement it too. In fact, we needed a hack to prevent rip from proceeding
in KVM, which turned out not to be the best workaround.


There are brute force solutions like
- stop the guest until you send all of the snapshot to the remote (like
  standard live migration)
- Stop + fork + cont the father

Or mark the recent dirty pages that were not sent to the remote as write 
protected and copy them if touched.
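
A host-side sketch of that last approach, using mprotect() and a fault
handler; guest_ram and pending_copy are illustrative names, and a real
qemu/kvm implementation would hook KVM's dirty-page tracking rather than
take a SIGSEGV in userspace:

#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096ul

static uint8_t *guest_ram;    /* illustrative: base of the guest RAM mapping */
static uint8_t *pending_copy; /* illustrative: buffer for not-yet-sent pages */

/* Write-protect a dirty page whose contents still have to reach the slave. */
static void protect_unsent_page(unsigned long pfn)
{
    mprotect(guest_ram + pfn * PAGE_SIZE, PAGE_SIZE, PROT_READ);
}

/* On the first write, copy the old contents aside for the in-flight
 * snapshot, then re-enable writes so the guest continues without waiting. */
static void wp_fault(int sig, siginfo_t *si, void *ctx)
{
    unsigned long pfn = ((uint8_t *)si->si_addr - guest_ram) / PAGE_SIZE;

    memcpy(pending_copy + pfn * PAGE_SIZE,
           guest_ram + pfn * PAGE_SIZE, PAGE_SIZE);
    mprotect(guest_ram + pfn * PAGE_SIZE, PAGE_SIZE, PROT_READ | PROT_WRITE);
}

static void install_wp_handler(void)
{
    struct sigaction sa = { .sa_sigaction = wp_fault, .sa_flags = SA_SIGINFO };
    sigaction(SIGSEGV, &sa, NULL);
}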





Thanks,

Yoshi





TODO:
- Synchronization of clock sources (need to intercept TSC reads, etc).

=== usability ===


Re: [PATCH] qemu-kvm: Ask kernel about supported svm features

2010-04-22 Thread Joerg Roedel
On Thu, Apr 22, 2010 at 03:13:14PM +0300, Avi Kivity wrote:
 On 04/22/2010 03:02 PM, Joerg Roedel wrote:

 We can't just take the host-cpuid
 because most of the additional svm features need special emulation in
 the kernel. Or do you think this should better be handled in
 target-i386/cpuid.c?


 Yes.  -cpu host should take KVM_GET_SUPPORTED_CPUID output and loop it  
 back to the vcpu configuration, others just take the qemu configuration,  
 mask it with supported bits, and pass it back (see  
 check_features_against_host()).

Hmm, the plan was to enable with -enable-nesting all kernel-supported
svm features for the guest (and add switches later to remove them
individually).
If we activate nested svm with -cpu host in the future that's fine too
(closed-source hypervisors need that anyway). But we should also define
a cpu model with which we can migrate nested hypervisors between machines
where the cpu is not completely identical.

 (need feature names for the bits, too, so you can enable or disable them  
 from the command line)

Yeah, I know. I omitted that for the first bring-up. It was planned for
a later patch.

Joerg

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[qemu-kvm tests 0/3] qemu-kvm tests cleanup

2010-04-22 Thread Naphtali Sprei
Cleanup, mostly x86 oriented.
Patches against 'next' branch.

Naphtali Sprei (3):
  qemu-kvm tests cleanup
  qemu-kvm tests cleanup: adapt stringio test to kernel-mode run
  qemu-kvm tests cleanup: Added printing for passing tests Also typo
fix

 kvm/user/README|   23 ++
 kvm/user/balloon_ctl.c |   92 --
 kvm/user/bootstrap.lds |   15 -
 kvm/user/config-x86-common.mak |   26 +--
 kvm/user/config-x86_64.mak |5 +-
 kvm/user/main.c|  611 
 kvm/user/test/x86/README   |   15 +
 kvm/user/test/x86/bootstrap.S  |  137 -
 kvm/user/test/x86/exit.c   |7 -
 kvm/user/test/x86/memtest1.S   |   44 ---
 kvm/user/test/x86/realmode.c   |  106 +++-
 kvm/user/test/x86/runtime.h|6 -
 kvm/user/test/x86/simple.S |   13 -
 kvm/user/test/x86/stringio.S   |   13 +-
 kvm/user/test/x86/test32.S |8 -
 kvm/user/testdev.txt   |   14 +
 16 files changed, 170 insertions(+), 965 deletions(-)
 create mode 100644 kvm/user/README
 delete mode 100755 kvm/user/balloon_ctl.c
 delete mode 100644 kvm/user/bootstrap.lds
 delete mode 100644 kvm/user/main.c
 create mode 100644 kvm/user/test/x86/README
 delete mode 100644 kvm/user/test/x86/bootstrap.S
 delete mode 100644 kvm/user/test/x86/exit.c
 delete mode 100644 kvm/user/test/x86/memtest1.S
 delete mode 100644 kvm/user/test/x86/runtime.h
 delete mode 100644 kvm/user/test/x86/simple.S
 delete mode 100644 kvm/user/test/x86/test32.S
 create mode 100644 kvm/user/testdev.txt

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[qemu-kvm tests 1/3] qemu-kvm tests cleanup

2010-04-22 Thread Naphtali Sprei
Mainly removed unused/unnecessary files and references to them

Signed-off-by: Naphtali Sprei nsp...@redhat.com
---
 kvm/user/README|   23 ++
 kvm/user/balloon_ctl.c |   92 --
 kvm/user/bootstrap.lds |   15 -
 kvm/user/config-x86-common.mak |   24 +--
 kvm/user/config-x86_64.mak |5 +-
 kvm/user/main.c|  611 
 kvm/user/test/x86/README   |   15 +
 kvm/user/test/x86/bootstrap.S  |  137 -
 kvm/user/test/x86/exit.c   |7 -
 kvm/user/test/x86/memtest1.S   |   44 ---
 kvm/user/test/x86/runtime.h|6 -
 kvm/user/test/x86/simple.S |   13 -
 kvm/user/test/x86/test32.S |8 -
 kvm/user/testdev.txt   |   14 +
 14 files changed, 60 insertions(+), 954 deletions(-)
 create mode 100644 kvm/user/README
 delete mode 100755 kvm/user/balloon_ctl.c
 delete mode 100644 kvm/user/bootstrap.lds
 delete mode 100644 kvm/user/main.c
 create mode 100644 kvm/user/test/x86/README
 delete mode 100644 kvm/user/test/x86/bootstrap.S
 delete mode 100644 kvm/user/test/x86/exit.c
 delete mode 100644 kvm/user/test/x86/memtest1.S
 delete mode 100644 kvm/user/test/x86/runtime.h
 delete mode 100644 kvm/user/test/x86/simple.S
 delete mode 100644 kvm/user/test/x86/test32.S
 create mode 100644 kvm/user/testdev.txt

diff --git a/kvm/user/README b/kvm/user/README
new file mode 100644
index 000..6a83831
--- /dev/null
+++ b/kvm/user/README
@@ -0,0 +1,23 @@
+This directory contains sources for a kvm test suite.
+
+Tests for the x86 architecture are run as kernel images for a qemu that
+supports the multiboot format.
+Tests use an infrastructure called from the BIOS code. The infrastructure
+initializes the system/CPUs, switches to long mode, and calls the 'main'
+function of the individual test.
+Tests use qemu's virtual test device, named testdev, for services like
+printing, exiting, and querying memory size.
+See file testdev.txt for more details.
+
+To create the tests' images just type 'make' in this directory.
+Tests' images are created in ./test/ARCH/*.flat
+
+An example of a test invocation:
+qemu-system-x86_64 -device testdev,chardev=testlog -chardev 
file,id=testlog,path=msr.out -kernel ./test/x86/msr.flat
+This invocation runs the msr test case. The test output is in file msr.out.
+
+
+
+Directory structure:
+.:  Makefile and config files for the tests
+./test/lib: general services for the tests
+./test/lib/ARCH: architecture dependent services for the tests
+./test/ARCH: the sources of the tests and the created objects/images
+
diff --git a/kvm/user/balloon_ctl.c b/kvm/user/balloon_ctl.c
deleted file mode 100755
index e65b08d..000
--- a/kvm/user/balloon_ctl.c
+++ /dev/null
@@ -1,92 +0,0 @@
-/*
- * This binary provides access to the guest's balloon driver
- * module.
- *
- * Copyright (C) 2007 Qumranet
- *
- * Author:
- *
- *  Dor Laor dor.l...@qumranet.com
- *
- * This work is licensed under the GNU LGPL license, version 2.
- */
-
-#include <unistd.h>
-#include <fcntl.h>
-#include <stdio.h>
-#include <stdlib.h>
-#include <sys/mman.h>
-#include <string.h>
-#include <errno.h>
-#include <sys/ioctl.h>
-
-#define __user
-#include <linux/kvm.h>
-
-#define PAGE_SIZE 4096ul
-
-
-static int balloon_op(int *fd, int bytes)
-{
-	struct kvm_balloon_op bop;
-	int r;
-
-	bop.npages = bytes/PAGE_SIZE;
-	r = ioctl(*fd, KVM_BALLOON_OP, &bop);
-	if (r == -1)
-		return -errno;
-	printf("Ballon handled %d pages successfully\n", bop.npages);
-
-	return 0;
-}
-
-static int balloon_init(int *fd)
-{
-	*fd = open("/dev/kvm_balloon", O_RDWR);
-	if (*fd == -1) {
-		perror("open /dev/kvm_balloon");
-		return -1;
-	}
-
-	return 0;
-}
-
-int main(int argc, char *argv[])
-{
-   int fd;
-   int r;
-   int bytes;
-
-	if (argc != 3) {
-		perror("Please provide op=[i|d], bytes\n");
-		return 1;
-	}
-	bytes = atoi(argv[2]);
-
-	switch (*argv[1]) {
-	case 'i':
-		break;
-	case 'd':
-		bytes = -bytes;
-		break;
-	default:
-		perror("Wrong op param\n");
-		return 1;
-	}
-
-	if (balloon_init(&fd)) {
-		perror("balloon_init failed\n");
-		return 1;
-	}
-
-	if ((r = balloon_op(&fd, bytes))) {
-		perror("balloon_op failed\n");
-		goto out;
-	}
-
-out:
-   close(fd);
-
-   return r;
-}
-
diff --git a/kvm/user/bootstrap.lds b/kvm/user/bootstrap.lds
deleted file mode 100644
index fd0a4f8..000
--- a/kvm/user/bootstrap.lds
+++ /dev/null
@@ -1,15 +0,0 @@
-OUTPUT_FORMAT(binary)
-
-SECTIONS
-{
-. = 0;
-stext = .;
-.text : { *(.init) *(.text) }
-. = ALIGN(4K);
-.data : { *(.data) }
-. = ALIGN(16);
-.bss : { *(.bss) }
-. = ALIGN(4K);
-edata = .;
-}
-
diff --git a/kvm/user/config-x86-common.mak 

[qemu-kvm tests 2/3] qemu-kvm tests cleanup: adapt stringio test to kernel-mode run

2010-04-22 Thread Naphtali Sprei
Also use testdev for output, and call exit to quit.
Currently, the test reboots endlessly because of a triple fault.
It needs to be run with -no-reboot until the issue is fixed (in kvm??);
see the example invocation after the diffstat below.

Signed-off-by: Naphtali Sprei nsp...@redhat.com
---
 kvm/user/config-x86-common.mak |2 +-
 kvm/user/test/x86/stringio.S   |   13 +
 2 files changed, 10 insertions(+), 5 deletions(-)
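
Following the invocation pattern from the README added in patch 1/3, an
example run might look like this (the output path is arbitrary):

qemu-system-x86_64 -no-reboot -device testdev,chardev=testlog \
    -chardev file,id=testlog,path=stringio.out -kernel ./test/x86/stringio.flat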

diff --git a/kvm/user/config-x86-common.mak b/kvm/user/config-x86-common.mak
index 960741e..d312681 100644
--- a/kvm/user/config-x86-common.mak
+++ b/kvm/user/config-x86-common.mak
@@ -54,7 +54,7 @@ $(TEST_DIR)/realmode.flat: $(TEST_DIR)/realmode.o
 
 $(TEST_DIR)/realmode.o: bits = 32
 
-$(TEST_DIR)/stringio.flat: $(TEST_DIR)/stringio.o
+$(TEST_DIR)/stringio.flat: $(cstart.o) $(TEST_DIR)/stringio.o
 
 $(TEST_DIR)/msr.flat: $(cstart.o) $(TEST_DIR)/msr.o
 
diff --git a/kvm/user/test/x86/stringio.S b/kvm/user/test/x86/stringio.S
index 31ddc47..461621c 100644
--- a/kvm/user/test/x86/stringio.S
+++ b/kvm/user/test/x86/stringio.S
@@ -8,24 +8,29 @@
 1: 
 .endm  
 
+TESTDEV_PORT = 0xf1
+
	str forward, "forward"
	str backward, "backward"

 .text
 
-
+.global main
+main:
cld
movl forward, %ecx
lea  4+forward, %rsi
-   movw $1, %dx
+   movw $TESTDEV_PORT, %dx
rep outsb
 
std
movl backward, %ecx
lea 4+backward-1(%rcx), %rsi
-   movw $2, %dx
+   movw $TESTDEV_PORT, %dx
rep outsb

-   hlt
+   mov $0, %rsi
+   call exit
+
 
 
-- 
1.6.3.3

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[qemu-kvm tests 3/3] qemu-kvm tests cleanup: Added printing for passing tests Also typo fix

2010-04-22 Thread Naphtali Sprei

Signed-off-by: Naphtali Sprei nsp...@redhat.com
---
 kvm/user/test/x86/realmode.c |  106 +++--
 1 files changed, 100 insertions(+), 6 deletions(-)

diff --git a/kvm/user/test/x86/realmode.c b/kvm/user/test/x86/realmode.c
index bfc2942..bc4ed97 100644
--- a/kvm/user/test/x86/realmode.c
+++ b/kvm/user/test/x86/realmode.c
@@ -160,6 +160,8 @@ void test_xchg(void)
 
 	if (!regs_equal(&inregs, &outregs, 0))
 		print_serial("xchg test 1: FAIL\n");
+	else
+		print_serial("xchg test 1: PASS\n");
 
 	exec_in_big_real_mode(&inregs, &outregs,
 			      insn_xchg_test2,
@@ -169,6 +171,8 @@ void test_xchg(void)
 	    outregs.eax != inregs.ebx ||
 	    outregs.ebx != inregs.eax)
 		print_serial("xchg test 2: FAIL\n");
+	else
+		print_serial("xchg test 2: PASS\n");
 
 	exec_in_big_real_mode(&inregs, &outregs,
 			      insn_xchg_test3,
@@ -178,6 +182,8 @@ void test_xchg(void)
 	    outregs.eax != inregs.ecx ||
 	    outregs.ecx != inregs.eax)
 		print_serial("xchg test 3: FAIL\n");
+	else
+		print_serial("xchg test 3: PASS\n");
 
 	exec_in_big_real_mode(&inregs, &outregs,
 			      insn_xchg_test4,
@@ -187,6 +193,8 @@ void test_xchg(void)
 	    outregs.eax != inregs.edx ||
 	    outregs.edx != inregs.eax)
 		print_serial("xchg test 4: FAIL\n");
+	else
+		print_serial("xchg test 4: PASS\n");
 
 	exec_in_big_real_mode(&inregs, &outregs,
 			      insn_xchg_test5,
@@ -196,6 +204,8 @@ void test_xchg(void)
 	    outregs.eax != inregs.esi ||
 	    outregs.esi != inregs.eax)
 		print_serial("xchg test 5: FAIL\n");
+	else
+		print_serial("xchg test 5: PASS\n");
 
 	exec_in_big_real_mode(&inregs, &outregs,
 			      insn_xchg_test6,
@@ -205,6 +215,8 @@ void test_xchg(void)
 	    outregs.eax != inregs.edi ||
 	    outregs.edi != inregs.eax)
 		print_serial("xchg test 6: FAIL\n");
+	else
+		print_serial("xchg test 6: PASS\n");
 
 	exec_in_big_real_mode(&inregs, &outregs,
 			      insn_xchg_test7,
@@ -214,6 +226,8 @@ void test_xchg(void)
 	    outregs.eax != inregs.ebp ||
 	    outregs.ebp != inregs.eax)
 		print_serial("xchg test 7: FAIL\n");
+	else
+		print_serial("xchg test 7: PASS\n");
 
 	exec_in_big_real_mode(&inregs, &outregs,
 			      insn_xchg_test8,
@@ -223,6 +237,8 @@ void test_xchg(void)
 	    outregs.eax != inregs.esp ||
 	    outregs.esp != inregs.eax)
 		print_serial("xchg test 8: FAIL\n");
+	else
+		print_serial("xchg test 8: PASS\n");
 }
 
 void test_shld(void)
@@ -234,9 +250,9 @@ void test_shld(void)
 			      insn_shld_test,
 			      insn_shld_test_end - insn_shld_test);
 	if (outregs.eax != 0xbeef)
-		print_serial("shld: failure\n");
+		print_serial("shld: FAIL\n");
 	else
-		print_serial("shld: success\n");
+		print_serial("shld: PASS\n");
 }
 
 void test_mov_imm(void)
@@ -253,6 +269,8 @@ void test_mov_imm(void)
 			      insn_mov_r16_imm_1_end - insn_mov_r16_imm_1);
 	if (!regs_equal(&inregs, &outregs, R_AX) || outregs.eax != 1234)
 		print_serial("mov test 1: FAIL\n");
+	else
+		print_serial("mov test 1: PASS\n");
 
 	/* test mov $imm, %eax */
 	exec_in_big_real_mode(&inregs, &outregs,
@@ -260,6 +278,8 @@ void test_mov_imm(void)
 			      insn_mov_r32_imm_1_end - insn_mov_r32_imm_1);
 	if (!regs_equal(&inregs, &outregs, R_AX) || outregs.eax != 1234567890)
 		print_serial("mov test 2: FAIL\n");
+	else
+		print_serial("mov test 2: PASS\n");
 
 	/* test mov $imm, %al/%ah */
 	exec_in_big_real_mode(&inregs, &outregs,
@@ -267,16 +287,24 @@ void test_mov_imm(void)
 			      insn_mov_r8_imm_1_end - insn_mov_r8_imm_1);
 	if (!regs_equal(&inregs, &outregs, R_AX) || outregs.eax != 0x1200)
 		print_serial("mov test 3: FAIL\n");
+	else
+		print_serial("mov test 3: PASS\n");
+
 	exec_in_big_real_mode(&inregs, &outregs,
 			      insn_mov_r8_imm_2,
 			      insn_mov_r8_imm_2_end - insn_mov_r8_imm_2);
 	if (!regs_equal(&inregs, &outregs, R_AX) || outregs.eax != 0x34)
 		print_serial("mov test 4: FAIL\n");
+	else
+		print_serial("mov test 4: PASS\n");
+
 	exec_in_big_real_mode(&inregs, &outregs,
 			      insn_mov_r8_imm_3,
 			      insn_mov_r8_imm_3_end - insn_mov_r8_imm_3);
 	if (!regs_equal(&inregs, &outregs, 

Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1

2010-04-22 Thread Yoshiaki Tamura
2010/4/22 Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp:
 (2010/04/22 19:35), Yoshiaki Tamura wrote:


 A trivial one would we to :
 - do X online snapshots/sec

 I currently don't have good numbers that I can share right now.
 Snapshots/sec depends on what kind of workload is running, and if the
 guest was almost idle, there will be no snapshots in 5sec. On the other
 hand, if the guest was running I/O intensive workloads (netperf, iozone
 for example), there will be about 50 snapshots/sec.


 50 is too small: this depends on the synchronization speed and does not
 show how many snapshots we need, right?

No it doesn't.
It's example data which I measured before.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] Add a global synchronization point for pvclock

2010-04-22 Thread Glauber Costa
On Tue, Apr 20, 2010 at 12:42:17PM -0700, Jeremy Fitzhardinge wrote:
 On 04/20/2010 11:54 AM, Avi Kivity wrote:
  On 04/20/2010 09:23 PM, Jeremy Fitzhardinge wrote:
  On 04/20/2010 02:31 AM, Avi Kivity wrote:

  btw, do you want this code in pvclock.c, or shall we keep it kvmclock
  specific?
   
  I think its a pvclock-level fix.  I'd been hoping to avoid having
  something like this, but I think its ultimately necessary.
 
 
  Did you observe drift on Xen, or is this ultimately pointing at the
  future?
 
 People are reporting weirdnesses that clocksource=jiffies apparently
 resolves.  Xen and KVM are faced with the same hardware constraints, and
 it wouldn't surprise me if there were small measurable
 non-monotonicities in the PV clock under Xen.  May as well be safe.
 
 Of course, it kills any possibility of being able to usefully export
 this interface down to usermode.
 
 My main concern about this kind of simple fix is that if there's a long
 term systematic drift between different CPU's tscs, then this will
 somewhat mask the problem while giving really awful time measurement on
 the slow CPU(s).  In that case it really needs to adjust the scaling
 factor to correct for the drift (*not* update the offset).  But if we're
 definitely only talking about fixed, relatively small time offsets then
 it is fine.
Can you by any chance run ingo's time warp test on those machines?

You need to define TOD to 1, and leave out the TSC test.

For me, warps exist on every machine out there but the Nehalems, so far.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] qemu-kvm: Ask kernel about supported svm features

2010-04-22 Thread Avi Kivity

On 04/22/2010 03:28 PM, Joerg Roedel wrote:



Yes.  -cpu host should take KVM_GET_SUPPORTED_CPUID output and loop it
back to the vcpu configuration, others just take the qemu configuration,
mask it with supported bits, and pass it back (see
check_features_against_host()).
 

Hmm, the plan was to enable with -enable-nesting all kernel-supported
svm features for the guest (and add switches later to remove them
individually).
If we activate nested svm with -cpu host in the future that's fine too
(closed-source hypervisors need that anyway). But we should also define
a cpu model with which we can migrate nested hypervisors between machines
where the cpu is not completely identical.
   


You can use -cpu host, or -cpu kvm64,+svm,+npt,+other_feature... just as 
with all other features.



--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1

2010-04-22 Thread Jamie Lokier
Yoshiaki Tamura wrote:
 Dor Laor wrote:
 On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
 Event tapping is the core component of Kemari, and it decides on which
 event the
 primary should synchronize with the secondary. The basic assumption
 here is
 that outgoing I/O operations are idempotent, which is usually true for
 disk I/O
 and reliable network protocols such as TCP.
 
 IMO any type of network event should be stalled too. What if the VM runs
 a non-TCP protocol, and a packet that the master node sent reached some
 remote client, and the master failed before the sync to the slave?
 
 In current implementation, it is actually stalling any type of network 
 that goes through virtio-net.
 
 However, if the application was using unreliable protocols, it should have 
 its own recovery mechanism, or it should be completely stateless.

Even with unreliable protocols, if slave takeover causes the receiver
to have received a packet that the sender _does not think it has ever
sent_, expect some protocols to break.

If the slave replaying master's behaviour since the last sync means it
will definitely get into the same state of having sent the packet,
that works out.

But you still have to be careful that the other end's responses to
that packet are not seen by the slave too early during that replay.
Otherwise, for example, the slave may observe a TCP ACK to a packet
that it hasn't yet sent, which is an error.

About IP idempotency:

In general, IP packets are allowed to be lost or duplicated in the
network.  All IP protocols should be prepared for that; it is a basic
property.

However there is one respect in which they're not idempotent:

The TTL field should be decreased if packets are delayed.  Packets
should not appear to live in the network for longer than TTL seconds.
If they do, some protocols (like TCP) can react to the delayed ones
differently, such as sending a RST packet and breaking a connection.

It is acceptable to reduce TTL faster than the minimum.  After all, it
is reduced by 1 on every forwarding hop, in addition to time delays.

 I currently don't have good numbers that I can share right now.
 Snapshots/sec depends on what kind of workload is running, and if the 
 guest was almost idle, there will be no snapshots in 5sec.  On the other 
 hand, if the guest was running I/O intensive workloads (netperf, iozone 
 for example), there will be about 50 snapshots/sec.

That is a really satisfying number, thank you :-)

Without this work I wouldn't have imagined that synchronised machines
could work with such a low transaction rate.

-- Jamie
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [UNTESTED] KVM: do not call kvm_set_irq from irq disabled section

2010-04-22 Thread Marcelo Tosatti
On Wed, Apr 21, 2010 at 09:38:39PM +0300, Gleb Natapov wrote:
 On Wed, Apr 21, 2010 at 03:29:11PM -0300, Marcelo Tosatti wrote:
  On Wed, Apr 21, 2010 at 08:58:48PM +0300, Gleb Natapov wrote:
   On Wed, Apr 21, 2010 at 02:37:34PM -0300, Marcelo Tosatti wrote:
On Wed, Apr 21, 2010 at 08:12:27PM +0300, Gleb Natapov wrote:
 On Wed, Apr 21, 2010 at 12:58:41PM -0300, Marcelo Tosatti wrote:
   Or could we make kvm_set_irq() atomic? Though the code path is a 
   little long 
   for spinlock.
  
  Yes, given the sleep-inside-RCU-protected section bug from
  kvm_notify_acked_irq, either that or convert IRQ locking to SRCU.
  
  But as you said, the code paths are long and potentially slow, so
  probably SRCU is a better alternative.
  
  Gleb?
 kvm_set_irq() was converted to rcu from mutex to make msix interrupt
 injection scalable.

We meant ioapic lock. See the last report from Ralf on this thread. 
   Can we solve the problem by calling ack notifier outside rcu read
   section in kvm_notify_acked_irq()?
  
  The unregister path does
  
  - remove_from_list(entry)
  - synchronize_rcu
  - kfree(entry)
  
  So if kvm_notify_acked_irq sleeps, synchronize_rcu can succeed, and the
  notifier entry can be freed.
 What I mean is kvm_notify_acked_irq() will iterate over all ack entries
 in rcu read protected section, but instead of calling callback it will
 collect them into the array and call them outside rcu read section. At
 this point it doesn't matter if entry is unregistered since the copy is
 used to actually call the notifier.

Here it is, but no, this trick can't be done safely for ack notifiers
because they are unregistered at runtime by device assignment.

How does the freeing path knows it can proceed to free its entry if
there is no reliable way to know if there is a reference to itself?
(think priv gets freed between rcu_read_unlock and ->irq_acked with
the patch below).

I can convert to SRCU if you have no objections.

diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index 0150aff..900ac05 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -241,8 +241,8 @@ int pit_has_pending_timer(struct kvm_vcpu *vcpu)
 
 static void kvm_pit_ack_irq(struct kvm_irq_ack_notifier *kian)
 {
-	struct kvm_kpit_state *ps = container_of(kian, struct kvm_kpit_state,
-						 irq_ack_notifier);
+	struct kvm_kpit_state *ps = kian->priv;
+
 	raw_spin_lock(&ps->inject_lock);
 	if (atomic_dec_return(&ps->pit_timer.pending) < 0)
 		atomic_inc(&ps->pit_timer.pending);
@@ -636,6 +636,7 @@ struct kvm_pit *kvm_create_pit(struct kvm *kvm, u32 flags)
 			 CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
 	pit_state->irq_ack_notifier.gsi = 0;
 	pit_state->irq_ack_notifier.irq_acked = kvm_pit_ack_irq;
+	pit_state->irq_ack_notifier.priv = &pit->pit_state;
 	kvm_register_irq_ack_notifier(kvm, &pit_state->irq_ack_notifier);
 	pit_state->pit_timer.reinject = true;
 	mutex_unlock(&pit->pit_state.lock);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ce027d5..6c3fb06 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -404,6 +404,7 @@ struct kvm_irq_ack_notifier {
 	struct hlist_node link;
 	unsigned gsi;
 	void (*irq_acked)(struct kvm_irq_ack_notifier *kian);
+	void *priv;
 };
 
 #define KVM_ASSIGNED_MSIX_PENDING	0x1
diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
index 4d10b1e..779b749 100644
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -121,8 +121,7 @@ static void kvm_assigned_dev_ack_irq(struct kvm_irq_ack_notifier *kian)
 	if (kian->gsi == -1)
 		return;
 
-	dev = container_of(kian, struct kvm_assigned_dev_kernel,
-			   ack_notifier);
+	dev = kian->priv;
 
 	kvm_set_irq(dev->kvm, dev->irq_source_id, dev->guest_irq, 0);
 
@@ -563,6 +562,7 @@ static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
 	match->irq_source_id = -1;
 	match->kvm = kvm;
 	match->ack_notifier.irq_acked = kvm_assigned_dev_ack_irq;
+	match->ack_notifier.priv = match;
 	INIT_WORK(&match->interrupt_work,
 		  kvm_assigned_dev_interrupt_work_handler);
 
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index a0e8880..ebccea8 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -175,11 +175,14 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level)
 	return ret;
 }
 
+#define MAX_ACK_NOTIFIER_PER_GSI 4
+
 void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
 {
 	struct kvm_irq_ack_notifier *kian;
+	struct kvm_irq_ack_notifier acks[MAX_ACK_NOTIFIER_PER_GSI];
 	struct hlist_node *n;
-	int gsi;
+	int gsi, i, acks_nr = 0;
 
 	trace_kvm_ack_irq(irqchip, pin);
 
@@ -188,9 +191,15 @@ void 
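
(The quoted hunk breaks off here. Judging from Gleb's description above, the
collect-then-call body would read roughly as the following reconstruction,
not the literal posted code; note it is exactly the call through a copied
entry that the priv field is meant to make safe, since container_of() on a
copy would point into the wrong structure:)

	rcu_read_lock();
	gsi = rcu_dereference(kvm->irq_routing)->chip[irqchip][pin];
	if (gsi != -1)
		hlist_for_each_entry_rcu(kian, n, &kvm->irq_ack_notifier_list, link)
			if (kian->gsi == gsi && acks_nr < MAX_ACK_NOTIFIER_PER_GSI)
				/* copy while still inside the RCU read section */
				acks[acks_nr++] = *kian;
	rcu_read_unlock();

	/* call the copies with no RCU read lock held */
	for (i = 0; i < acks_nr; i++)
		acks[i].irq_acked(&acks[i]);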

[PATCH] Enable pvclock flags in vcpu_time_info structure

2010-04-22 Thread Glauber Costa
This patch removes one padding byte and transform it into a flags
field. New versions of guests using pvclock will query these flags
upon each read.

Flags, however, will only be interpreted when the guest decides to.
It uses the pvclock_valid_flags function to signal that a specific
set of flags should be taken into consideration. Which flags are valid
is usually established via HV negotiation.

Signed-off-by: Glauber Costa glom...@redhat.com
---
 arch/x86/include/asm/pvclock-abi.h |3 ++-
 arch/x86/include/asm/pvclock.h |1 +
 arch/x86/kernel/pvclock.c  |9 +
 3 files changed, 12 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/pvclock-abi.h 
b/arch/x86/include/asm/pvclock-abi.h
index 6d93508..ec5c41a 100644
--- a/arch/x86/include/asm/pvclock-abi.h
+++ b/arch/x86/include/asm/pvclock-abi.h
@@ -29,7 +29,8 @@ struct pvclock_vcpu_time_info {
u64   system_time;
u32   tsc_to_system_mul;
 	s8    tsc_shift;
-	u8    pad[3];
+	u8    flags;
+	u8    pad[2];
 } __attribute__((__packed__)); /* 32 bytes */
 
 struct pvclock_wall_clock {
diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
index 53235fd..c50823f 100644
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -6,6 +6,7 @@
 
 /* some helper functions for xen and kvm pv clock sources */
 cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src);
+void pvclock_valid_flags(u8 flags);
 unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src);
 void pvclock_read_wallclock(struct pvclock_wall_clock *wall,
struct pvclock_vcpu_time_info *vcpu,
diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 03801f2..8f4af7b 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -31,8 +31,16 @@ struct pvclock_shadow_time {
u32 tsc_to_nsec_mul;
int tsc_shift;
u32 version;
+   u8  flags;
 };
 
+static u8 valid_flags = 0;
+
+void pvclock_valid_flags(u8 flags)
+{
+   valid_flags = flags;
+}
+
 /*
  * Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction,
  * yielding a 64-bit result.
@@ -91,6 +99,7 @@ static unsigned pvclock_get_time_values(struct pvclock_shadow_time *dst,
 		dst->system_timestamp  = src->system_time;
 		dst->tsc_to_nsec_mul   = src->tsc_to_system_mul;
 		dst->tsc_shift         = src->tsc_shift;
+		dst->flags             = src->flags;
 		rmb();		/* test version after fetching data */
 	} while ((src->version & 1) || (dst->version != src->version));
 
-- 
1.6.2.2
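
For illustration, the guest-side opt-in would presumably look something like
the sketch below; PVCLOCK_TSC_STABLE_BIT is a made-up flag name and value,
since the actual flag assignments are left to the HV negotiation mentioned
in the changelog:

/* Hypothetical flag value; real ones come out of host/guest negotiation. */
#define PVCLOCK_TSC_STABLE_BIT (1 << 0)

static void kvmclock_enable_flags(void)
{
	/* Declare which flag bits this guest is prepared to interpret;
	 * anything outside valid_flags is ignored on every clock read. */
	pvclock_valid_flags(PVCLOCK_TSC_STABLE_BIT);
}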

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4] Add mergeable RX bufs support to vhost

2010-04-22 Thread David Stevens
Michael S. Tsirkin m...@redhat.com wrote on 04/22/2010 05:02:25 AM:

 On Mon, Apr 19, 2010 at 03:12:19PM -0700, David L Stevens wrote:
  This patch adds the mergeable RX buffers feature to vhost.
  
  Signed-off-by: David L Stevens dlstev...@us.ibm.com
 
 Looks pretty clean to me.
 Could you send a checkpatch-clean version please?

The original passes checkpatch already, but I guess I must
still be getting whitespace mangling if it didn't for you. (sigh)
Here it is as an attachment:



 We should also check performance implications.
 Do you have any data?

I'm getting on the order of 10-20% improvement in
throughput over stock vhost guest to host, but I do see
a lot of variability in the results, even with no KVM
and just over loopback. I don't know where that's coming
from, but I'll do some runs and post.

Thanks for all the reviews!

+-DLS



MRXBv4.patch
Description: Binary data


Re: [PATCH v4] Add mergeable RX bufs support to vhost

2010-04-22 Thread Michael S. Tsirkin
On Mon, Apr 19, 2010 at 03:12:19PM -0700, David L Stevens wrote:
 This patch adds the mergeable RX buffers feature to vhost.
 
 Signed-off-by: David L Stevens dlstev...@us.ibm.com

BTW, which userspace should I use for testing this?
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4] Add mergeable RX bufs support to vhost

2010-04-22 Thread Michael S. Tsirkin
On Thu, Apr 22, 2010 at 11:47:15AM -0600, David Stevens wrote:
 Michael S. Tsirkin m...@redhat.com wrote on 04/22/2010 05:02:25 AM:
 
  On Mon, Apr 19, 2010 at 03:12:19PM -0700, David L Stevens wrote:
   This patch adds the mergeable RX buffers feature to vhost.
   
   Signed-off-by: David L Stevens dlstev...@us.ibm.com
  
  Looks pretty clean to me.
  Could you send a checkpatch-clean version please?
 
 The original passes checkpatch already, but I guess I must
 still be getting whitespace mangling if it didn't for you. (sigh)
 Here it is as an attachment:

No good either. I thought you managed to run sendmail somewhere?
Failing that, put it on dropbox and let me know.
See http://kbase.redhat.com/faq/docs/DOC-2113

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4] Add mergeable RX bufs support to vhost

2010-04-22 Thread David Stevens
Michael S. Tsirkin m...@redhat.com wrote on 04/22/2010 10:43:49 AM:

 On Mon, Apr 19, 2010 at 03:12:19PM -0700, David L Stevens wrote:
  This patch adds the mergeable RX buffers feature to vhost.
  
  Signed-off-by: David L Stevens dlstev...@us.ibm.com
 
 BTW, which userspace should I use for testing this?

I use the qemu-kvm patch to your git tree that I posted
a while back (also attached). And it needs your TUNSETVNETHDRSIZE
ioctl in the kernel as well. Also, I added the TUN ioctl
defines to /usr/include for this to pick up (it compiles, but
won't do the ioctl's without that); will get back
to that patch per your comments next.

[also a correction: I said 10-20% guest to host, I meant host-to-guest
(i.e., the receiver side). g2h appears improved too, but not as
much.]

+-DLS


qemu-MRXB-2.patch
Description: Binary data


Re: [PATCH v4] Add mergeable RX bufs support to vhost

2010-04-22 Thread Michael S. Tsirkin
On Thu, Apr 22, 2010 at 11:59:55AM -0600, David Stevens wrote:
 Michael S. Tsirkin m...@redhat.com wrote on 04/22/2010 10:43:49 AM:
 
  On Mon, Apr 19, 2010 at 03:12:19PM -0700, David L Stevens wrote:
   This patch adds the mergeable RX buffers feature to vhost.
   
   Signed-off-by: David L Stevens dlstev...@us.ibm.com
  
  BTW, which userspace should I use for testing this?
 
 I use the qemu-kvm patch to your git tree that I posted
 a while back (also attached).

Doesn't apply, probably also corrupted.
Could you put it up on dropbox please?

 And it needs your TUNSETVNETHDRSIZE
 ioctl in the kernel as well. Also, I added the TUN ioctl
 defines to /usr/include for this to pick up (it compiles, but
 won't do the ioctl's without that); will get back
 to that patch per your comments next.
 
 [also correction: I said 10-20% guest to host, I meant host-to-guest
 (ie, the receiver side). h2g appears improved too, but not as
 much)]
 
 +-DLS


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 01/20] Modify DIRTY_FLAG value and introduce DIRTY_IDX to use as indexes of bit-based phys_ram_dirty.

2010-04-22 Thread Anthony Liguori

Hi,

On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:

Replaces byte-based phys_ram_dirty bitmap with four (MASTER, VGA,
CODE, MIGRATION) bit-based phys_ram_dirty bitmap.  On allocation, it
sets all bits in the bitmap.  It uses ffs() to convert DIRTY_FLAG to
DIRTY_IDX.

Modifies wrapper functions for byte-based phys_ram_dirty bitmap to
bit-based phys_ram_dirty bitmap.  MASTER works as a buffer, and upon
get_dirty() or get_dirty_flags(), it calls
cpu_physical_memory_sync_master() to update VGA and MIGRATION.
   


Why use an additional bitmap for MASTER instead of just updating the 
VGA, CODE, and MIGRATION bitmaps together?


Regards,

Anthony Liguori


Replaces direct phys_ram_dirty access with wrapper functions to
prevent direct access to the phys_ram_dirty bitmap.

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
Signed-off-by: OHMURA Kei ohmura@lab.ntt.co.jp
---
  cpu-all.h |  130 +
  exec.c|   60 ++--
  2 files changed, 152 insertions(+), 38 deletions(-)

diff --git a/cpu-all.h b/cpu-all.h
index 51effc0..3f8762d 100644
--- a/cpu-all.h
+++ b/cpu-all.h
@@ -37,6 +37,9 @@

 #include "softfloat.h"
 
+/* to use ffs in flag_to_idx() */
+#include <strings.h>
+
 #if defined(HOST_WORDS_BIGENDIAN) != defined(TARGET_WORDS_BIGENDIAN)
 #define BSWAP_NEEDED
 #endif
@@ -846,7 +849,6 @@ int cpu_str_to_log_mask(const char *str);
 /* memory API */
 
 extern int phys_ram_fd;
-extern uint8_t *phys_ram_dirty;
 extern ram_addr_t ram_size;
 extern ram_addr_t last_ram_offset;
 extern uint8_t *bios_mem;
@@ -869,28 +871,140 @@ extern uint8_t *bios_mem;
 /* Set if TLB entry is an IO callback.  */
 #define TLB_MMIO	(1 << 5)
 
+/* Use DIRTY_IDX as indexes of bit-based phys_ram_dirty. */
+#define MASTER_DIRTY_IDX    0
+#define VGA_DIRTY_IDX       1
+#define CODE_DIRTY_IDX      2
+#define MIGRATION_DIRTY_IDX 3
+#define NUM_DIRTY_IDX       4
+
+#define MASTER_DIRTY_FLAG    (1 << MASTER_DIRTY_IDX)
+#define VGA_DIRTY_FLAG       (1 << VGA_DIRTY_IDX)
+#define CODE_DIRTY_FLAG      (1 << CODE_DIRTY_IDX)
+#define MIGRATION_DIRTY_FLAG (1 << MIGRATION_DIRTY_IDX)
+
+extern unsigned long *phys_ram_dirty[NUM_DIRTY_IDX];
+
+static inline int dirty_flag_to_idx(int flag)
+{
+    return ffs(flag) - 1;
+}
+
+static inline int dirty_idx_to_flag(int idx)
+{
+    return 1 << idx;
+}
+
 int cpu_memory_rw_debug(CPUState *env, target_ulong addr,
                         uint8_t *buf, int len, int is_write);
 
-#define VGA_DIRTY_FLAG       0x01
-#define CODE_DIRTY_FLAG      0x02
-#define MIGRATION_DIRTY_FLAG 0x08
-
 /* read dirty bit (return 0 or 1) */
 static inline int cpu_physical_memory_is_dirty(ram_addr_t addr)
 {
-    return phys_ram_dirty[addr >> TARGET_PAGE_BITS] == 0xff;
+    unsigned long mask;
+    ram_addr_t index = (addr >> TARGET_PAGE_BITS) / HOST_LONG_BITS;
+    int offset = (addr >> TARGET_PAGE_BITS) & (HOST_LONG_BITS - 1);
+
+    mask = 1UL << offset;
+    return (phys_ram_dirty[MASTER_DIRTY_IDX][index] & mask) == mask;
+}
+
+static inline void cpu_physical_memory_sync_master(ram_addr_t index)
+{
+    if (phys_ram_dirty[MASTER_DIRTY_IDX][index]) {
+        phys_ram_dirty[VGA_DIRTY_IDX][index]
+            |= phys_ram_dirty[MASTER_DIRTY_IDX][index];
+        phys_ram_dirty[MIGRATION_DIRTY_IDX][index]
+            |= phys_ram_dirty[MASTER_DIRTY_IDX][index];
+        phys_ram_dirty[MASTER_DIRTY_IDX][index] = 0UL;
+    }
+}
+
+static inline int cpu_physical_memory_get_dirty_flags(ram_addr_t addr)
+{
+    unsigned long mask;
+    ram_addr_t index = (addr >> TARGET_PAGE_BITS) / HOST_LONG_BITS;
+    int offset = (addr >> TARGET_PAGE_BITS) & (HOST_LONG_BITS - 1);
+    int ret = 0, i;
+
+    mask = 1UL << offset;
+    cpu_physical_memory_sync_master(index);
+
+    for (i = VGA_DIRTY_IDX; i <= MIGRATION_DIRTY_IDX; i++) {
+        if (phys_ram_dirty[i][index] & mask) {
+            ret |= dirty_idx_to_flag(i);
+        }
+    }
+
+    return ret;
+}
+
+static inline int cpu_physical_memory_get_dirty_idx(ram_addr_t addr,
+                                                    int dirty_idx)
+{
+    unsigned long mask;
+    ram_addr_t index = (addr >> TARGET_PAGE_BITS) / HOST_LONG_BITS;
+    int offset = (addr >> TARGET_PAGE_BITS) & (HOST_LONG_BITS - 1);
+
+    mask = 1UL << offset;
+    cpu_physical_memory_sync_master(index);
+    return (phys_ram_dirty[dirty_idx][index] & mask) == mask;
 }
 
 static inline int cpu_physical_memory_get_dirty(ram_addr_t addr,
                                                 int dirty_flags)
 {
-    return phys_ram_dirty[addr >> TARGET_PAGE_BITS] & dirty_flags;
+    return cpu_physical_memory_get_dirty_idx(addr,
+                                             dirty_flag_to_idx(dirty_flags));
 }
 
 static inline void cpu_physical_memory_set_dirty(ram_addr_t addr)
 {
-    phys_ram_dirty[addr >> TARGET_PAGE_BITS] = 0xff;
+    unsigned long mask;
+    ram_addr_t index = (addr >> TARGET_PAGE_BITS) / HOST_LONG_BITS;
+    int offset = (addr >> TARGET_PAGE_BITS) & 

Re: [RFC PATCH 07/20] Introduce qemu_put_vector() and qemu_put_vector_prepare() to use put_vector() in QEMUFile.

2010-04-22 Thread Anthony Liguori

On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:

For foolproofing purposes, qemu_put_vector_prepare should be called
before qemu_put_vector.  Then, if any qemu_put_* function except this one
is called after qemu_put_vector_prepare, the program will abort().

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
   


I don't get it.  What's this protecting against?

Regards,

Anthony Liguori


---
  hw/hw.h  |2 ++
  savevm.c |   53 +
  2 files changed, 55 insertions(+), 0 deletions(-)

diff --git a/hw/hw.h b/hw/hw.h
index 921cf90..10e6dda 100644
--- a/hw/hw.h
+++ b/hw/hw.h
@@ -77,6 +77,8 @@ void qemu_fflush(QEMUFile *f);
  int qemu_fclose(QEMUFile *f);
  void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, int size);
  void qemu_put_byte(QEMUFile *f, int v);
+void qemu_put_vector(QEMUFile *f, QEMUIOVector *qiov);
+void qemu_put_vector_prepare(QEMUFile *f);
  void *qemu_realloc_buffer(QEMUFile *f, int size);
  void qemu_clear_buffer(QEMUFile *f);

diff --git a/savevm.c b/savevm.c
index 944e788..22d928c 100644
--- a/savevm.c
+++ b/savevm.c
@@ -180,6 +180,7 @@ struct QEMUFile {
  uint8_t *buf;

  int has_error;
+int prepares_vector;
  };

  typedef struct QEMUFileStdio
@@ -557,6 +558,58 @@ void qemu_put_byte(QEMUFile *f, int v)
  qemu_fflush(f);
  }

+void qemu_put_vector(QEMUFile *f, QEMUIOVector *v)
+{
+    struct iovec *iov;
+    int cnt;
+    size_t bufsize;
+    uint8_t *buf;
+
+    if (qemu_file_get_rate_limit(f) != 0) {
+        fprintf(stderr,
+                "Attempted to write vector while bandwidth limit is not zero.\n");
+        abort();
+    }
+
+    /* checks prepares vector.
+     * For foolproofing purposes, qemu_put_vector_prepare should be called
+     * before qemu_put_vector.  Then, if any qemu_put_* function except this
+     * one is called after qemu_put_vector_prepare, the program will abort().
+     */
+    if (!f->prepares_vector) {
+        fprintf(stderr,
+                "You should prepare with qemu_put_vector_prepare.\n");
+        abort();
+    } else if (f->prepares_vector && f->buf_index != 0) {
+        fprintf(stderr, "Wrote data after qemu_put_vector_prepare.\n");
+        abort();
+    }
+    f->prepares_vector = 0;
+
+    if (f->put_vector) {
+        qemu_iovec_to_vector(v, &iov, &cnt);
+        f->put_vector(f->opaque, iov, 0, cnt);
+    } else {
+        qemu_iovec_to_size(v, &bufsize);
+        buf = qemu_malloc(bufsize + 1 /* for '\0' */);
+        qemu_iovec_to_buffer(v, buf);
+        qemu_put_buffer(f, buf, bufsize);
+        qemu_free(buf);
+    }
+
+}
+
+void qemu_put_vector_prepare(QEMUFile *f)
+{
+    if (f->prepares_vector) {
+        /* prepare vector */
+        fprintf(stderr, "Attempted to prepare vector twice\n");
+        abort();
+    }
+    f->prepares_vector = 1;
+    qemu_fflush(f);
+}
+
  int qemu_get_buffer(QEMUFile *f, uint8_t *buf, int size1)
  {
  int size, l;
   


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 08/20] Introduce RAMSaveIO and use cpu_physical_memory_get_dirty_range() to check multiple dirty pages.

2010-04-22 Thread Anthony Liguori

On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:

Introduce RAMSaveIO to use writev for saving ram blocks, and modifies
ram_save_block() and ram_save_remaining() to use
cpu_physical_memory_get_dirty_range() to check multiple dirty and
non-dirty pages at once.

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
Signed-off-by: OHMURA Kei ohmura@lab.ntt.co.jp
   


Perf data?

Regards,

Anthony Liguori


---
  vl.c |  221 ++---
  1 files changed, 197 insertions(+), 24 deletions(-)

diff --git a/vl.c b/vl.c
index 729c955..9c3dc4c 100644
--- a/vl.c
+++ b/vl.c
@@ -2774,12 +2774,167 @@ static int is_dup_page(uint8_t *page, uint8_t ch)
  return 1;
  }

-static int ram_save_block(QEMUFile *f)
+typedef struct RAMSaveIO RAMSaveIO;
+
+struct RAMSaveIO {
+QEMUFile *f;
+QEMUIOVector *qiov;
+
+uint8_t *ram_store;
+size_t nalloc, nused;
+uint8_t io_mode;
+
+void (*put_buffer)(RAMSaveIO *s, uint8_t *buf, size_t len);
+void (*put_byte)(RAMSaveIO *s, int v);
+void (*put_be64)(RAMSaveIO *s, uint64_t v);
+
+};
+
+static inline void ram_saveio_flush(RAMSaveIO *s, int prepare)
+{
+    qemu_put_vector(s->f, s->qiov);
+    if (prepare)
+        qemu_put_vector_prepare(s->f);
+
+    /* reset stored data */
+    qemu_iovec_reset(s->qiov);
+    s->nused = 0;
+}
+
+static inline void ram_saveio_put_buffer(RAMSaveIO *s, uint8_t *buf, size_t len)
+{
+    s->put_buffer(s, buf, len);
+}
+
+static inline void ram_saveio_put_byte(RAMSaveIO *s, int v)
+{
+    s->put_byte(s, v);
+}
+
+static inline void ram_saveio_put_be64(RAMSaveIO *s, uint64_t v)
+{
+    s->put_be64(s, v);
+}
+
+static inline void ram_saveio_set_error(RAMSaveIO *s)
+{
+    qemu_file_set_error(s->f);
+}
+
+static void ram_saveio_put_buffer_vector(RAMSaveIO *s, uint8_t *buf, size_t len)
+{
+    qemu_iovec_add(s->qiov, buf, len);
+}
+
+static void ram_saveio_put_buffer_direct(RAMSaveIO *s, uint8_t *buf, size_t len)
+{
+    qemu_put_buffer(s->f, buf, len);
+}
+
+static void ram_saveio_put_byte_vector(RAMSaveIO *s, int v)
+{
+    uint8_t *to_save;
+
+    if (s->nalloc - s->nused < sizeof(int))
+        ram_saveio_flush(s, 1);
+
+    to_save = &s->ram_store[s->nused];
+    to_save[0] = v & 0xff;
+    s->nused++;
+
+    qemu_iovec_add(s->qiov, to_save, 1);
+}
+
+static void ram_saveio_put_byte_direct(RAMSaveIO *s, int v)
+{
+    qemu_put_byte(s->f, v);
+}
+
+static void ram_saveio_put_be64_vector(RAMSaveIO *s, uint64_t v)
+{
+    uint8_t *to_save;
+
+    if (s->nalloc - s->nused < sizeof(uint64_t))
+        ram_saveio_flush(s, 1);
+
+    to_save = &s->ram_store[s->nused];
+    to_save[0] = (v >> 56) & 0xff;
+    to_save[1] = (v >> 48) & 0xff;
+    to_save[2] = (v >> 40) & 0xff;
+    to_save[3] = (v >> 32) & 0xff;
+    to_save[4] = (v >> 24) & 0xff;
+    to_save[5] = (v >> 16) & 0xff;
+    to_save[6] = (v >>  8) & 0xff;
+    to_save[7] = (v >>  0) & 0xff;
+    s->nused += sizeof(uint64_t);
+
+    qemu_iovec_add(s->qiov, to_save, sizeof(uint64_t));
+}
+
+static void ram_saveio_put_be64_direct(RAMSaveIO *s, uint64_t v)
+{
+
+    qemu_put_be64(s->f, v);
+}
+
+static RAMSaveIO *ram_saveio_new(QEMUFile *f, size_t max_store)
+{
+    RAMSaveIO *s;
+
+    s = qemu_mallocz(sizeof(*s));
+
+    if (qemu_file_get_rate_limit(f) == 0) { /* non buffer mode */
+        /* When a QEMUFile doesn't have a rate limit,
+         * qemu_file_get_rate_limit will return 0.
+         * However, we believe that all kinds of QEMUFile
+         * except non-block mode have a rate limit function.
+         */
+        s->io_mode = 1;
+        s->ram_store = qemu_mallocz(max_store);
+        s->nalloc = max_store;
+        s->nused = 0;
+
+        s->qiov = qemu_mallocz(sizeof(*s->qiov));
+        qemu_iovec_init(s->qiov, max_store);
+
+        s->put_buffer = ram_saveio_put_buffer_vector;
+        s->put_byte = ram_saveio_put_byte_vector;
+        s->put_be64 = ram_saveio_put_be64_vector;
+
+        qemu_put_vector_prepare(f);
+    } else {
+        s->io_mode = 0;
+        s->put_buffer = ram_saveio_put_buffer_direct;
+        s->put_byte = ram_saveio_put_byte_direct;
+        s->put_be64 = ram_saveio_put_be64_direct;
+    }
+
+    s->f = f;
+
+    return s;
+}
+
+static void ram_saveio_destroy(RAMSaveIO *s)
+{
+    if (s->qiov != NULL) { /* means using put_vector */
+        ram_saveio_flush(s, 0);
+        qemu_iovec_destroy(s->qiov);
+        qemu_free(s->qiov);
+        qemu_free(s->ram_store);
+    }
+    qemu_free(s);
+}
+
+/*
+ * RAMSaveIO will manage I/O.
+ */
+static int ram_save_block(RAMSaveIO *s)
 {
     static ram_addr_t current_addr = 0;
     ram_addr_t saved_addr = current_addr;
     ram_addr_t addr = 0;
-    int found = 0;
+    ram_addr_t dirty_rams[HOST_LONG_BITS];
+    int i, found = 0;
 
     while (addr < last_ram_offset) {
         if (kvm_enabled() && current_addr == 0) {
             int r;
             r = kvm_update_dirty_pages_log();
             if (r) {
                 fprintf(stderr, "%s: update dirty 

Re: [PATCH 6/10] KVM MMU: don't write-protect if have new mapping to unsync page

2010-04-22 Thread Marcelo Tosatti
On Thu, Apr 22, 2010 at 02:13:04PM +0800, Xiao Guangrong wrote:
If we have a new mapping to the unsync page (i.e., adding a new parent), just
update the page from sp->gfn but do not write-protect the gfn, and if we need
to create a new shadow page from sp->gfn, we should sync it
 
 Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
 ---
  arch/x86/kvm/mmu.c |   27 +++
  1 files changed, 19 insertions(+), 8 deletions(-)
 
 diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
 index fd027a6..8607a64 100644
 --- a/arch/x86/kvm/mmu.c
 +++ b/arch/x86/kvm/mmu.c
 @@ -1196,16 +1196,20 @@ static void kvm_unlink_unsync_page(struct kvm *kvm, struct kvm_mmu_page *sp)
  
  static int kvm_mmu_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp);
  
 -static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 +static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 +			 bool clear_unsync)
  {
  	if (sp->role.cr4_pae != !!is_pae(vcpu)) {
  		kvm_mmu_zap_page(vcpu->kvm, sp);
  		return 1;
  	}
  
 -	if (rmap_write_protect(vcpu->kvm, sp->gfn))
 -		kvm_flush_remote_tlbs(vcpu->kvm);
 -	kvm_unlink_unsync_page(vcpu->kvm, sp);
 +	if (clear_unsync) {
 +		if (rmap_write_protect(vcpu->kvm, sp->gfn))
 +			kvm_flush_remote_tlbs(vcpu->kvm);
 +		kvm_unlink_unsync_page(vcpu->kvm, sp);
 +	}
 +
  	if (vcpu->arch.mmu.sync_page(vcpu, sp)) {
  		kvm_mmu_zap_page(vcpu->kvm, sp);
  		return 1;
 @@ -1293,7 +1297,7 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu,
  		kvm_flush_remote_tlbs(vcpu->kvm);
  
  	for_each_sp(pages, sp, parents, i) {
 -		kvm_sync_page(vcpu, sp);
 +		kvm_sync_page(vcpu, sp, true);
  		mmu_pages_clear_parents(&parents);
  	}
  	cond_resched_lock(&vcpu->kvm->mmu_lock);
 @@ -1313,7 +1317,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
  	unsigned index;
  	unsigned quadrant;
  	struct hlist_head *bucket;
 -	struct kvm_mmu_page *sp;
 +	struct kvm_mmu_page *sp, *unsync_sp = NULL;
  	struct hlist_node *node, *tmp;
  
  	role = vcpu->arch.mmu.base_role;
 @@ -1332,12 +1336,16 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
  	hlist_for_each_entry_safe(sp, node, tmp, bucket, hash_link)
  		if (sp->gfn == gfn) {
  			if (sp->unsync)
 -				if (kvm_sync_page(vcpu, sp))
 -					continue;
 +				unsync_sp = sp;

Xiao,

I don't see a reason why you can't create a new mapping to an unsync
page. The code already creates shadow pte entries using unsync
pagetables.

So all you need would be to kvm_sync_pages before write protecting.

Also make sure kvm_sync_pages is in place here before enabling multiple
unsync shadows, in the patch series.

  
  			if (sp->role.word != role.word)
  				continue;
  
 +			if (unsync_sp && kvm_sync_page(vcpu, unsync_sp, false)) {
 +				unsync_sp = NULL;
 +				continue;
 +			}
 +
  			mmu_page_add_parent_pte(vcpu, sp, parent_pte);
  			if (sp->unsync_children) {
  				set_bit(KVM_REQ_MMU_SYNC, &vcpu->requests);
 @@ -1346,6 +1354,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
  			trace_kvm_mmu_get_page(sp, false);
  			return sp;
  		}
 +	if (unsync_sp)
 +		kvm_sync_page(vcpu, unsync_sp, true);
 +
  	++vcpu->kvm->stat.mmu_cache_miss;
  	sp = kvm_mmu_alloc_page(vcpu, parent_pte);
  	if (!sp)
 -- 
 1.6.1.2
 
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 10/20] Introduce skip_header parameter to qemu_loadvm_state() so that it can be called iteratively without reading the header.

2010-04-22 Thread Anthony Liguori

On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
   


I think the more appropriate thing to do is have 
qemu_savevm_state_complete() not write QEMU_VM_EOF when doing a 
continuous live migration.


You would then want qemu_loadvm_state() to detect real EOF and treat 
that the same as QEMU_VM_EOF (provided it occurred at a section boundary).


Of course, this should be a flag to qemu_loadvm_state() as it's not 
always the right behavior.


Regards,

Anthony Liguori
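
A sketch of the reading side Anthony describes; the tolerate_eof flag and the
EOF detection are illustrative, not the actual qemu API (today's loop tests
only for an explicit QEMU_VM_EOF byte):

    /* Sketch: accept a clean end-of-stream at a section boundary as if an
     * explicit QEMU_VM_EOF byte had been read, when tolerate_eof is set. */
    for (;;) {
        int section_type = qemu_get_byte(f);

        if (qemu_file_has_error(f)) {
            if (tolerate_eof)
                break;              /* stream ended between sections */
            return -EIO;
        }
        if (section_type == QEMU_VM_EOF)
            break;
        /* ... load one section as today ... */
    }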


---
  migration-exec.c |2 +-
  migration-fd.c   |2 +-
  migration-tcp.c  |2 +-
  migration-unix.c |2 +-
  savevm.c |   25 ++---
  sysemu.h |2 +-
  6 files changed, 19 insertions(+), 16 deletions(-)

diff --git a/migration-exec.c b/migration-exec.c
index 3edc026..5839a6d 100644
--- a/migration-exec.c
+++ b/migration-exec.c
@@ -113,7 +113,7 @@ static void exec_accept_incoming_migration(void *opaque)
  QEMUFile *f = opaque;
  int ret;

-ret = qemu_loadvm_state(f);
+ret = qemu_loadvm_state(f, 0);
     if (ret < 0) {
         fprintf(stderr, "load of migration failed\n");
  goto err;
diff --git a/migration-fd.c b/migration-fd.c
index 0cc74ad..0e97ed0 100644
--- a/migration-fd.c
+++ b/migration-fd.c
@@ -106,7 +106,7 @@ static void fd_accept_incoming_migration(void *opaque)
  QEMUFile *f = opaque;
  int ret;

-ret = qemu_loadvm_state(f);
+ret = qemu_loadvm_state(f, 0);
     if (ret < 0) {
         fprintf(stderr, "load of migration failed\n");
  goto err;
diff --git a/migration-tcp.c b/migration-tcp.c
index 56e1a3b..94a1a03 100644
--- a/migration-tcp.c
+++ b/migration-tcp.c
@@ -182,7 +182,7 @@ static void tcp_accept_incoming_migration(void *opaque)
  goto out;
  }

-ret = qemu_loadvm_state(f);
+ret = qemu_loadvm_state(f, 0);
     if (ret < 0) {
         fprintf(stderr, "load of migration failed\n");
  goto out_fopen;
diff --git a/migration-unix.c b/migration-unix.c
index b7aab38..dd99a73 100644
--- a/migration-unix.c
+++ b/migration-unix.c
@@ -168,7 +168,7 @@ static void unix_accept_incoming_migration(void *opaque)
  goto out;
  }

-ret = qemu_loadvm_state(f);
+ret = qemu_loadvm_state(f, 0);
     if (ret < 0) {
         fprintf(stderr, "load of migration failed\n");
  goto out_fopen;
diff --git a/savevm.c b/savevm.c
index 22d928c..a401b27 100644
--- a/savevm.c
+++ b/savevm.c
@@ -1554,7 +1554,7 @@ typedef struct LoadStateEntry {
  int version_id;
  } LoadStateEntry;

-int qemu_loadvm_state(QEMUFile *f)
+int qemu_loadvm_state(QEMUFile *f, int skip_header)
  {
  QLIST_HEAD(, LoadStateEntry) loadvm_handlers =
  QLIST_HEAD_INITIALIZER(loadvm_handlers);
@@ -1563,17 +1563,20 @@ int qemu_loadvm_state(QEMUFile *f)
  unsigned int v;
  int ret;

-    v = qemu_get_be32(f);
-    if (v != QEMU_VM_FILE_MAGIC)
-        return -EINVAL;
+    if (!skip_header) {
+        v = qemu_get_be32(f);
+        if (v != QEMU_VM_FILE_MAGIC)
+            return -EINVAL;
+
+        v = qemu_get_be32(f);
+        if (v == QEMU_VM_FILE_VERSION_COMPAT) {
+            fprintf(stderr, "SaveVM v3 format is obsolete and don't work anymore\n");
+            return -ENOTSUP;
+        }
+        if (v != QEMU_VM_FILE_VERSION)
+            return -ENOTSUP;
 
-    v = qemu_get_be32(f);
-    if (v == QEMU_VM_FILE_VERSION_COMPAT) {
-        fprintf(stderr, "SaveVM v2 format is obsolete and don't work anymore\n");
-        return -ENOTSUP;
     }
-    if (v != QEMU_VM_FILE_VERSION)
-        return -ENOTSUP;

  while ((section_type = qemu_get_byte(f)) != QEMU_VM_EOF) {
  uint32_t instance_id, version_id, section_id;
@@ -1898,7 +1901,7 @@ int load_vmstate(Monitor *mon, const char *name)
         monitor_printf(mon, "Could not open VM state file\n");
         return -EINVAL;
     }
-    ret = qemu_loadvm_state(f);
+    ret = qemu_loadvm_state(f, 0);
     qemu_fclose(f);
     if (ret < 0) {
         monitor_printf(mon, "Error %d while loading VM state\n", ret);
diff --git a/sysemu.h b/sysemu.h
index 647a468..6c1441f 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -68,7 +68,7 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int 
blk_enable,
  int qemu_savevm_state_iterate(Monitor *mon, QEMUFile *f);
  int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f);
  void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f);
-int qemu_loadvm_state(QEMUFile *f);
+int qemu_loadvm_state(QEMUFile *f, int skip_header);

  void qemu_errors_to_file(FILE *fp);
  void qemu_errors_to_mon(Monitor *mon);
   


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 14/20] Upgrade QEMU_FILE_VERSION from 3 to 4, and introduce qemu_savevm_state_all().

2010-04-22 Thread Anthony Liguori

On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:

Make a 32bit entry after QEMU_VM_FILE_VERSION to recognize whether the
transferred data is QEMU_VM_FT_MODE or QEMU_VM_LIVE_MIGRATION_MODE.
   


I'd rather you encapsulate the current protocol as opposed to extending 
it with a new version.


You could also do this by just having a new -incoming option, right?  
All you really need is to indicate that this is for high frequency 
checkpointing, right?


Regards,

Anthony Liguori


Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
---
  savevm.c |   76 -
  sysemu.h |1 +
  2 files changed, 75 insertions(+), 2 deletions(-)

diff --git a/savevm.c b/savevm.c
index 292ae32..19b3efb 100644
--- a/savevm.c
+++ b/savevm.c
@@ -1402,8 +1402,10 @@ static void vmstate_save(QEMUFile *f, SaveStateEntry *se)
  }

  #define QEMU_VM_FILE_MAGIC   0x5145564d
-#define QEMU_VM_FILE_VERSION_COMPAT  0x0002
-#define QEMU_VM_FILE_VERSION 0x0003
+#define QEMU_VM_FILE_VERSION_COMPAT  0x0003
+#define QEMU_VM_FILE_VERSION 0x0004
+#define QEMU_VM_LIVE_MIGRATION_MODE  0x0005
+#define QEMU_VM_FT_MODE  0x0006

  #define QEMU_VM_EOF  0x00
  #define QEMU_VM_SECTION_START0x01
@@ -1425,6 +1427,12 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable,

  qemu_put_be32(f, QEMU_VM_FILE_MAGIC);
  qemu_put_be32(f, QEMU_VM_FILE_VERSION);
+
+if (ft_mode) {
+qemu_put_be32(f, QEMU_VM_FT_MODE);
+} else {
+qemu_put_be32(f, QEMU_VM_LIVE_MIGRATION_MODE);
+}

  QTAILQ_FOREACH(se, &savevm_handlers, entry) {
  int len;
@@ -1533,6 +1541,66 @@ int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f)
  return 0;
  }

+int qemu_savevm_state_all(Monitor *mon, QEMUFile *f)
+{
+SaveStateEntry *se;
+
+QTAILQ_FOREACH(se, &savevm_handlers, entry) {
+int len;
+
+if (se->save_live_state == NULL)
+continue;
+
+/* Section type */
+qemu_put_byte(f, QEMU_VM_SECTION_START);
+qemu_put_be32(f, se->section_id);
+
+/* ID string */
+len = strlen(se->idstr);
+qemu_put_byte(f, len);
+qemu_put_buffer(f, (uint8_t *)se->idstr, len);
+
+qemu_put_be32(f, se->instance_id);
+qemu_put_be32(f, se->version_id);
+if (ft_mode == FT_INIT) {
+/* This is workaround. */
+se->save_live_state(mon, f, QEMU_VM_SECTION_START, se->opaque);
+} else {
+se->save_live_state(mon, f, QEMU_VM_SECTION_PART, se->opaque);
+}
+}
+
+ft_mode = FT_TRANSACTION;
+QTAILQ_FOREACH(se, &savevm_handlers, entry) {
+int len;
+
+if (se->save_state == NULL && se->vmsd == NULL)
+continue;
+
+/* Section type */
+qemu_put_byte(f, QEMU_VM_SECTION_FULL);
+qemu_put_be32(f, se->section_id);
+
+/* ID string */
+len = strlen(se->idstr);
+qemu_put_byte(f, len);
+qemu_put_buffer(f, (uint8_t *)se->idstr, len);
+
+qemu_put_be32(f, se->instance_id);
+qemu_put_be32(f, se->version_id);
+
+vmstate_save(f, se);
+}
+
+qemu_put_byte(f, QEMU_VM_EOF);
+
+if (qemu_file_has_error(f))
+return -EIO;
+
+return 0;
+}
+
+
  void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f)
  {
  SaveStateEntry *se;
@@ -1617,6 +1685,10 @@ int qemu_loadvm_state(QEMUFile *f, int skip_header)
  if (v != QEMU_VM_FILE_VERSION)
  return -ENOTSUP;

+v = qemu_get_be32(f);
+if (v == QEMU_VM_FT_MODE) {
+ft_mode = FT_INIT;
+}
  }

  while ((section_type = qemu_get_byte(f)) != QEMU_VM_EOF) {
diff --git a/sysemu.h b/sysemu.h
index 6c1441f..df314bb 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -67,6 +67,7 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable,
  int shared);
  int qemu_savevm_state_iterate(Monitor *mon, QEMUFile *f);
  int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f);
+int qemu_savevm_state_all(Monitor *mon, QEMUFile *f);
  void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f);
  int qemu_loadvm_state(QEMUFile *f, int skip_header);

   




Re: [RFC PATCH 15/20] Introduce FT mode support to configure.

2010-04-22 Thread Anthony Liguori

On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:

Signed-off-by: Yoshiaki Tamuratamura.yoshi...@lab.ntt.co.jp
   


No need for this.

Regards,

Anthony Liguori


---
  configure |8 
  1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/configure b/configure
index 046c591..f0682d4 100755
--- a/configure
+++ b/configure
@@ -298,6 +298,7 @@ bsd_user=no
  guest_base=
  uname_release=
  io_thread=no
+ft_mode=no
  mixemu=no
  kvm_trace=no
  kvm_cap_pit=
@@ -671,6 +672,8 @@ for opt do
;;
--enable-io-thread) io_thread=yes
;;
+  --enable-ft-mode) ft_mode=yes
+  ;;
--disable-blobs) blobs=no
;;
--kerneldir=*) kerneldir=$optarg
@@ -840,6 +843,7 @@ echo "  --enable-vde             enable support for vde network"
  echo "  --disable-linux-aio      disable Linux AIO support"
  echo "  --enable-linux-aio       enable Linux AIO support"
  echo "  --enable-io-thread       enable IO thread"
+echo "  --enable-ft-mode         enable FT mode support"
  echo "  --disable-blobs          disable installing provided firmware blobs"
  echo "  --kerneldir=PATH         look for kernel includes in PATH"
  echo "  --with-kvm-trace         enable building the KVM module with the kvm trace option"
@@ -2117,6 +2121,7 @@ echo "GUEST_BASE        $guest_base"
  echo "PIE user targets  $user_pie"
  echo "vde support       $vde"
  echo "IO thread         $io_thread"
+echo "FT mode support   $ft_mode"
  echo "Linux AIO support $linux_aio"
  echo "Install blobs     $blobs"
  echo "KVM support       $kvm"
@@ -2318,6 +2323,9 @@ fi
  if test "$io_thread" = "yes" ; then
    echo "CONFIG_IOTHREAD=y" >> $config_host_mak
  fi
+if test "$ft_mode" = "yes" ; then
+  echo "CONFIG_FT_MODE=y" >> $config_host_mak
+fi
  if test "$linux_aio" = "yes" ; then
    echo "CONFIG_LINUX_AIO=y" >> $config_host_mak
  fi
   




Re: [RFC PATCH 19/20] Insert do_event_tap() to virtio-{blk, net}, comment out assert() on cpu_single_env temporally.

2010-04-22 Thread Anthony Liguori

On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:

do_event_tap() is inserted to functions which actually fire outputs.
By synchronizing VMs before outputs are fired, we can failover to the
receiver upon failure.  To save VM continuously, comment out assert()
on cpu_single_env temporally.

Signed-off-by: Yoshiaki Tamuratamura.yoshi...@lab.ntt.co.jp
---
  hw/virtio-blk.c |2 ++
  hw/virtio-net.c |2 ++
  qemu-kvm.c  |7 ++-
  3 files changed, 10 insertions(+), 1 deletions(-)
   


This would be better done in the generic layers (the block and net 
layers respectively).  Then it would work with virtio and emulated devices.
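
For illustration, the block-layer variant of that suggestion might look like
this (a sketch only, not code from the series; the real bdrv_aio_writev also
validates the request before issuing it):

BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, int64_t sector_num,
                                  QEMUIOVector *qiov, int nb_sectors,
                                  BlockDriverCompletionFunc *cb, void *opaque)
{
    /* tap before any write leaves the host, so IDE, SCSI and virtio
     * are all covered by the same checkpoint */
    do_event_tap();

    return bs->drv->bdrv_aio_writev(bs, sector_num, qiov, nb_sectors,
                                    cb, opaque);
}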


Regards,

Anthony Liguori


diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index b80402d..1dd1c31 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -327,6 +327,8 @@ static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
  .old_bs = NULL,
  };

+do_event_tap();
+
  while ((req = virtio_blk_get_request(s))) {
  virtio_blk_handle_request(req, &mrb);
  }
diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index 5c0093e..1a32bf3 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -667,6 +667,8 @@ static void virtio_net_handle_tx(VirtIODevice *vdev, VirtQueue *vq)
  {
  VirtIONet *n = to_virtio_net(vdev);

+do_event_tap();
+
  if (n->tx_timer_active) {
  virtio_queue_set_notification(vq, 1);
  qemu_del_timer(n->tx_timer);
diff --git a/qemu-kvm.c b/qemu-kvm.c
index 1414f49..769bc95 100644
--- a/qemu-kvm.c
+++ b/qemu-kvm.c
@@ -935,8 +935,12 @@ int kvm_run(CPUState *env)

  post_kvm_run(kvm, env);

+/* TODO: we need to prevent tapping events that derived from the
+ * same VMEXIT. This needs more info from the kernel. */
  #if defined(KVM_CAP_COALESCED_MMIO)
  if (kvm_state->coalesced_mmio) {
+/* prevent from tapping events while handling coalesced_mmio */
+event_tap_suspend();
  struct kvm_coalesced_mmio_ring *ring =
  (void *) run + kvm_state->coalesced_mmio * PAGE_SIZE;
  while (ring->first != ring->last) {
@@ -946,6 +950,7 @@ int kvm_run(CPUState *env)
  smp_wmb();
  ring->first = (ring->first + 1) % KVM_COALESCED_MMIO_MAX;
  }
+event_tap_resume();
  }
  #endif

@@ -1770,7 +1775,7 @@ static void resume_all_threads(void)
  {
  CPUState *penv = first_cpu;

-assert(!cpu_single_env);
+/* assert(!cpu_single_env); */

  while (penv) {
  penv->stop = 0;
   




Re: [UNTESTED] KVM: do not call kvm_set_irq from irq disabled section

2010-04-22 Thread Marcelo Tosatti
On Thu, Apr 22, 2010 at 09:11:30PM +0300, Gleb Natapov wrote:
 On Thu, Apr 22, 2010 at 01:40:38PM -0300, Marcelo Tosatti wrote:
  On Wed, Apr 21, 2010 at 09:38:39PM +0300, Gleb Natapov wrote:
   On Wed, Apr 21, 2010 at 03:29:11PM -0300, Marcelo Tosatti wrote:
On Wed, Apr 21, 2010 at 08:58:48PM +0300, Gleb Natapov wrote:
 On Wed, Apr 21, 2010 at 02:37:34PM -0300, Marcelo Tosatti wrote:
  On Wed, Apr 21, 2010 at 08:12:27PM +0300, Gleb Natapov wrote:
   On Wed, Apr 21, 2010 at 12:58:41PM -0300, Marcelo Tosatti wrote:
 Or could we make kvm_set_irq() atomic? Though the code path 
 is a little long 
 for spinlock.

Yes, given the sleep-inside-RCU-protected section bug from
kvm_notify_acked_irq, either that or convert IRQ locking to 
SRCU.

But as you said, the code paths are long and potentially slow, 
so
probably SRCU is a better alternative.

Gleb?
   kvm_set_irq() was converted to rcu from mutex to make msix 
   interrupt
   injection scalable.
  
  We meant ioapic lock. See the last report from Ralf on this thread. 
 Can we solve the problem by calling ack notifier outside rcu read
 section in kvm_notify_acked_irq()?

The unregister path does

- remove_from_list(entry)
- synchronize_rcu
- kfree(entry)

So if kvm_notify_acked_irq sleeps, synchronize_rcu can succeed, and the
notifier entry can be freed.
   What I mean is kvm_notify_acked_irq() will iterate over all ack entries
   in rcu read protected section, but instead of calling callback it will
   collect them into the array and call them outside rcu read section. At
   this point it doesn't matter if entry is unregistered since the copy is
   used to actually call the notifier.
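
In code, the idea reads roughly like this (sketch; list head and field names
approximate, gsi matching elided, and the fixed-size array is only for
illustration):

void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
{
	struct kvm_irq_ack_notifier *kian, *copy[16];
	struct hlist_node *n;
	int i, nr = 0;

	rcu_read_lock();
	hlist_for_each_entry_rcu(kian, n, &kvm->irq_ack_notifier_list, link)
		if (nr < 16)
			copy[nr++] = kian;	/* collect under rcu_read_lock */
	rcu_read_unlock();

	for (i = 0; i < nr; i++)		/* call outside the read section */
		copy[i]->irq_acked(copy[i]);
}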
  
  Here it is, but no, this trick can't be done safely for ack notifiers
  because they are unregistered at runtime by device assignment.
  
  How does the freeing path knows it can proceed to free its entry if
  there is no reliable way to know if there is a reference to itself?
  (think priv gets freed between rcu_read_unlock and ->irq_acked with
  the patch below).
  
 Yeah, I see :(
 
  I can convert to SRCU if you have no objections.
  
 AFAIR there was a disadvantage compared to RCU, but I don't remember what
 it was exactly. 

http://www.mentby.com/paul-e-mckenney/kvm-patch-v3-13-kvm-fix-race-in-irqrouting-logic.html

But for KVM IRQ paths it should not matter much since usage of grace
periods is rare (registration/unregistration is very rare compared to 
read side), and the IRQ paths are already large and slow, so the added
overhead should not be noticeable.
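
For reference, the SRCU read side would look almost the same as today (sketch;
the srcu_struct member is hypothetical for this tree):

int idx;

idx = srcu_read_lock(&kvm->irq_srcu);	/* readers may now sleep */
kvm_notify_acked_irq(kvm, irqchip, pin);
srcu_read_unlock(&kvm->irq_srcu, idx);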

 What about converting PIC/IOAPIC mutexes into spinlocks?

Works for me, but on large guests the spinning will be noticeable.
I believe.

Avi?



Re: [RFC PATCH 00/20] Kemari for KVM v0.1

2010-04-22 Thread Anthony Liguori

On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:

Hi all,

We have been implementing the prototype of Kemari for KVM, and we're sending
this message to share what we have now and TODO lists.  Hopefully, we would like
to get early feedback to keep us in the right direction.  Although advanced
approaches in the TODO lists are fascinating, we would like to run this project
step by step while absorbing comments from the community.  The current code is
based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.

For those who are new to Kemari for KVM, please take a look at the
following RFC which we posted last year.

http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html

The transmission/transaction protocol, and most of the control logic is
implemented in QEMU.  However, we needed a hack in KVM to prevent rip from
proceeding before synchronizing VMs.  It may also need some plumbing in the
kernel side to guarantee replayability of certain events and instructions,
integrate the RAS capabilities of newer x86 hardware with the HA stack, as well
as for optimization purposes, for example.

Before going into details, we would like to show how Kemari looks.  We prepared
a demonstration video at the following location.  For those who are not
interested in the code, please take a look.
The demonstration scenario is,

1. Play with a guest VM that has virtio-blk and virtio-net.
# The guest image should be a NFS/SAN.
2. Start Kemari to synchronize the VM by running the following command in QEMU.
Just add -k option to usual migrate command.
migrate -d -k tcp:192.168.0.20:
3. Check the status by calling info migrate.
4. Go back to the VM to play chess animation.
5. Kill the VM. (VNC client also disappears)
6. Press c to continue the VM on the other host.
7. Bring up the VNC client (Sorry, it pops outside of video capture.)
8. Confirm that the chess animation ends, browser works fine, then shutdown.

http://www.osrg.net/kemari/download/kemari-kvm-fc11.mov

The repository contains all patches we're sending with this message.  For those
who want to try, pull the following repository.  At running configure, please
put --enable-ft-mode.  Also you need to apply a patch attached at the end of
this message to your KVM.

git://kemari.git.sourceforge.net/gitroot/kemari/kemari

In addition to usual migrate environment and command, add -k to run.

The patch set consists of following components.

- bit-based dirty bitmap. (I have posted v4 for upstream QEMU on April 20)
- writev() support to QEMUFile and FdMigrationState.
- FT transaction sender/receiver
- event tap that triggers FT transaction.
- virtio-blk, virtio-net support for event tap.
   


This series looks quite nice!

I think it would make sense to separate out the things that are actually 
optimizations (like the dirty bitmap changes and the writev/readv 
changes) and to attempt to justify them with actual performance data.


I'd prefer not to modify the live migration protocol ABI and it doesn't 
seem to be necessary if we're willing to add options to the -incoming 
flag.  We also want to be a bit more generic with respect to IO.  
Otherwise, the series looks very close to being mergable.


Regards,

Anthony Liguori


Re: [RFC PATCH 05/20] Introduce put_vector() and get_vector to QEMUFile and qemu_fopen_ops().

2010-04-22 Thread Anthony Liguori

On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:

QEMUFile currently doesn't support writev().  For sending multiple
data, such as pages, using writev() should be more efficient.

Signed-off-by: Yoshiaki Tamuratamura.yoshi...@lab.ntt.co.jp
   


Is there performance data that backs this up?  Since QEMUFile uses a 
linear buffer for most operations that's limited to 16k, I suspect you 
wouldn't be able to observe a difference in practice.


Regards,

Anthony Liguori


---
  buffered_file.c |2 +-
  hw/hw.h |   16 
  savevm.c|   43 +--
  3 files changed, 42 insertions(+), 19 deletions(-)

diff --git a/buffered_file.c b/buffered_file.c
index 54dc6c2..187d1d4 100644
--- a/buffered_file.c
+++ b/buffered_file.c
@@ -256,7 +256,7 @@ QEMUFile *qemu_fopen_ops_buffered(void *opaque,
  s->wait_for_unfreeze = wait_for_unfreeze;
  s->close = close;

-s->file = qemu_fopen_ops(s, buffered_put_buffer, NULL,
+s->file = qemu_fopen_ops(s, buffered_put_buffer, NULL, NULL, NULL,
   buffered_close, buffered_rate_limit,
   buffered_set_rate_limit,
 buffered_get_rate_limit);
diff --git a/hw/hw.h b/hw/hw.h
index fc9ed29..921cf90 100644
--- a/hw/hw.h
+++ b/hw/hw.h
@@ -23,6 +23,13 @@
  typedef int (QEMUFilePutBufferFunc)(void *opaque, const uint8_t *buf,
  int64_t pos, int size);

+/* This function writes a chunk of vector to a file at the given position.
+ * The pos argument can be ignored if the file is only being used for
+ * streaming.
+ */
+typedef int (QEMUFilePutVectorFunc)(void *opaque, struct iovec *iov,
+int64_t pos, int iovcnt);
+
  /* Read a chunk of data from a file at the given position.  The pos argument
   * can be ignored if the file is only be used for streaming.  The number of
   * bytes actually read should be returned.
@@ -30,6 +37,13 @@ typedef int (QEMUFilePutBufferFunc)(void *opaque, const uint8_t *buf,
  typedef int (QEMUFileGetBufferFunc)(void *opaque, uint8_t *buf,
  int64_t pos, int size);

+/* Read a chunk of vector from a file at the given position.  The pos argument
+ * can be ignored if the file is only be used for streaming.  The number of
+ * bytes actually read should be returned.
+ */
+typedef int (QEMUFileGetVectorFunc)(void *opaque, struct iovec *iov,
+int64_t pos, int iovcnt);
+
  /* Close a file and return an error code */
  typedef int (QEMUFileCloseFunc)(void *opaque);

@@ -46,7 +60,9 @@ typedef size_t (QEMUFileSetRateLimit)(void *opaque, size_t new_rate);
  typedef size_t (QEMUFileGetRateLimit)(void *opaque);

  QEMUFile *qemu_fopen_ops(void *opaque, QEMUFilePutBufferFunc *put_buffer,
+ QEMUFilePutVectorFunc *put_vector,
   QEMUFileGetBufferFunc *get_buffer,
+ QEMUFileGetVectorFunc *get_vector,
   QEMUFileCloseFunc *close,
   QEMUFileRateLimit *rate_limit,
   QEMUFileSetRateLimit *set_rate_limit,
diff --git a/savevm.c b/savevm.c
index 490ab70..944e788 100644
--- a/savevm.c
+++ b/savevm.c
@@ -162,7 +162,9 @@ void qemu_announce_self(void)

  struct QEMUFile {
  QEMUFilePutBufferFunc *put_buffer;
+QEMUFilePutVectorFunc *put_vector;
  QEMUFileGetBufferFunc *get_buffer;
+QEMUFileGetVectorFunc *get_vector;
  QEMUFileCloseFunc *close;
  QEMUFileRateLimit *rate_limit;
  QEMUFileSetRateLimit *set_rate_limit;
@@ -263,11 +265,11 @@ QEMUFile *qemu_popen(FILE *stdio_file, const char *mode)
  s->stdio_file = stdio_file;

  if(mode[0] == 'r') {
-s->file = qemu_fopen_ops(s, NULL, stdio_get_buffer, stdio_pclose,
-NULL, NULL, NULL);
+s->file = qemu_fopen_ops(s, NULL, NULL, stdio_get_buffer,
+ NULL, stdio_pclose, NULL, NULL, NULL);
  } else {
-s->file = qemu_fopen_ops(s, stdio_put_buffer, NULL, stdio_pclose,
-NULL, NULL, NULL);
+s->file = qemu_fopen_ops(s, stdio_put_buffer, NULL, NULL, NULL,
+ stdio_pclose, NULL, NULL, NULL);
  }
  return s->file;
  }
@@ -312,11 +314,11 @@ QEMUFile *qemu_fdopen(int fd, const char *mode)
  goto fail;

  if(mode[0] == 'r') {
-s->file = qemu_fopen_ops(s, NULL, stdio_get_buffer, stdio_fclose,
-NULL, NULL, NULL);
+s->file = qemu_fopen_ops(s, NULL, NULL, stdio_get_buffer, NULL,
+ stdio_fclose, NULL, NULL, NULL);
  } else {
-s->file = qemu_fopen_ops(s, stdio_put_buffer, NULL, stdio_fclose,
-NULL, NULL, NULL);
+s->file = qemu_fopen_ops(s, stdio_put_buffer, NULL, NULL, NULL,
+ stdio_fclose, NULL, NULL, NULL);
  }
  return s->file;

Re: [UNTESTED] KVM: do not call kvm_set_irq from irq disabled section

2010-04-22 Thread Gleb Natapov
On Thu, Apr 22, 2010 at 04:40:30PM -0300, Marcelo Tosatti wrote:
 On Thu, Apr 22, 2010 at 09:11:30PM +0300, Gleb Natapov wrote:
  On Thu, Apr 22, 2010 at 01:40:38PM -0300, Marcelo Tosatti wrote:
   On Wed, Apr 21, 2010 at 09:38:39PM +0300, Gleb Natapov wrote:
On Wed, Apr 21, 2010 at 03:29:11PM -0300, Marcelo Tosatti wrote:
 On Wed, Apr 21, 2010 at 08:58:48PM +0300, Gleb Natapov wrote:
  On Wed, Apr 21, 2010 at 02:37:34PM -0300, Marcelo Tosatti wrote:
   On Wed, Apr 21, 2010 at 08:12:27PM +0300, Gleb Natapov wrote:
On Wed, Apr 21, 2010 at 12:58:41PM -0300, Marcelo Tosatti wrote:
  Or could we make kvm_set_irq() atomic? Though the code path 
  is a little long 
  for spinlock.
 
 Yes, given the sleep-inside-RCU-protected section bug from
 kvm_notify_acked_irq, either that or convert IRQ locking to 
 SRCU.
 
 But as you said, the code paths are long and potentially 
 slow, so
 probably SRCU is a better alternative.
 
 Gleb?
kvm_set_irq() was converted to rcu from mutex to make msix 
interrupt
injection scalable.
   
   We meant ioapic lock. See the last report from Ralf on this 
   thread. 
  Can we solve the problem by calling ack notifier outside rcu read
  section in kvm_notify_acked_irq()?
 
 The unregister path does
 
 - remove_from_list(entry)
 - synchronize_rcu
 - kfree(entry)
 
 So if kvm_notify_acked_irq sleeps, synchronize_rcu can succeed, and 
 the
 notifier entry can be freed.
What I mean is kvm_notify_acked_irq() will iterate over all ack entries
in rcu read protected section, but instead of calling callback it will
collect them into the array and call them outside rcu read section. At
this point it doesn't matter if entry is unregistered since the copy is
used to actually call the notifier.
   
   Here it is, but no, this trick can't be done safely for ack notifiers
   because they are unregistered at runtime by device assignment.
   
   How does the freeing path knows it can proceed to free its entry if
   there is no reliable way to know if there is a reference to itself?
    (think priv gets freed between rcu_read_unlock and ->irq_acked with
   the patch below).
   
  Yeah, I see :(
  
   I can convert to SRCU if you have no objections.
   
  AFAIR there was a disadvantage compared to RCU, but I don't remember what
  it was exactly. 
 
 http://www.mentby.com/paul-e-mckenney/kvm-patch-v3-13-kvm-fix-race-in-irqrouting-logic.html
 
 But for KVM IRQ paths it should not matter much since usage of grace
 periods is rare (registration/unregistration is very rare compared to 
 read side), and the IRQ paths are already large and slow, so the added
 overhead should not be noticeable.
 
msix path shouldn't be slow and this is the case we are trying to
optimize.

 SRCU's read-side primitives are also significantly slower than
 those of RCU.

Sounds not too optimistic.

  What about converting PIC/IOAPIC mutexes into spinlocks?
 
 Works for me, but on large guests the spinning will be noticeable.
 I believe.
For interrupts going through IOPIC, but we know this is not scalable
anyway.

 
 Avi?

--
Gleb.


Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1

2010-04-22 Thread Dor Laor

On 04/22/2010 04:16 PM, Yoshiaki Tamura wrote:

2010/4/22 Dor Laordl...@redhat.com:

On 04/22/2010 01:35 PM, Yoshiaki Tamura wrote:


Dor Laor wrote:


On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:


Hi all,

We have been implementing the prototype of Kemari for KVM, and we're
sending
this message to share what we have now and TODO lists. Hopefully, we
would like
to get early feedback to keep us in the right direction. Although
advanced
approaches in the TODO lists are fascinating, we would like to run
this project
step by step while absorbing comments from the community. The current
code is
based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.

For those who are new to Kemari for KVM, please take a look at the
following RFC which we posted last year.

http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html

The transmission/transaction protocol, and most of the control logic is
implemented in QEMU. However, we needed a hack in KVM to prevent rip
from
proceeding before synchronizing VMs. It may also need some plumbing in
the
kernel side to guarantee replayability of certain events and
instructions,
integrate the RAS capabilities of newer x86 hardware with the HA
stack, as well
as for optimization purposes, for example.


[ snap]



The rest of this message describes TODO lists grouped by each topic.

=== event tapping ===

Event tapping is the core component of Kemari, and it decides on which
event the
primary should synchronize with the secondary. The basic assumption
here is
that outgoing I/O operations are idempotent, which is usually true for
disk I/O
and reliable network protocols such as TCP.


IMO any type of network even should be stalled too. What if the VM runs
non tcp protocol and the packet that the master node sent reached some
remote client and before the sync to the slave the master failed?


In current implementation, it is actually stalling any type of network
that goes through virtio-net.

However, if the application was using unreliable protocols, it should
have its own recovering mechanism, or it should be completely stateless.


Why do you treat tcp differently? You can damage the entire VM this way -
think of dhcp request that was dropped on the moment you switched between
the master and the slave?


I'm not trying to say that we should treat tcp differently, but just
it's severe.
In case of dhcp request, the client would have a chance to retry after
failover, correct?


But until it timeouts it won't have networking.


BTW, in current implementation, it's synchronizing before dhcp ack is sent.
But in case of tcp, once you send ack to the client before sync, there
is no way to recover.


What if the guest is running a dhcp server? If we provide an IP to a 
client and then fail over to the secondary, it will run without knowing that the 
master allocated this IP.





[snap]



=== clock ===

Since synchronizing the virtual machines every time the TSC is
accessed would be
prohibitive, the transmission of the TSC will be done lazily, which
means
delaying it until there is a non-TSC synchronization point arrives.


Why do you specifically care about the tsc sync? When you sync all the
IO model on snapshot it also synchronizes the tsc.


So, do you agree that an extra clock synchronization is not needed since it
is done anyway as part of the live migration state sync?


I agree that its sent as part of the live migration.
What I wanted to say here is that this is not something for real time
applications.
I usually get questions like can this guarantee fault tolerance for
real time applications.


First the huge cost of snapshots won't match to any real time app.
Second, even if it wasn't the case, the tsc delta and kvmclock are 
synchronized as part of the VM state so there is no use of trapping it 
in the middle.





In general, can you please explain the 'algorithm' for continuous
snapshots (is that what you like to do?):


Yes, of course.
Sorry for being less informative.


A trivial one would be to:
- do X online snapshots/sec


I currently don't have good numbers that I can share right now.
Snapshots/sec depends on what kind of workload is running, and if the
guest was almost idle, there will be no snapshots in 5sec. On the other
hand, if the guest was running I/O intensive workloads (netperf, iozone
for example), there will be about 50 snapshots/sec.


- Stall all IO (disk/block) from the guest to the outside world
until the previous snapshot reaches the slave.


Yes, it does.


- Snapshots are made of


Full device model + diff of dirty pages from the last snapshot.


- diff of dirty pages from last snapshot


This also depends on the workload.
In case of I/O intensive workloads, dirty pages are usually less than 100.


The hardest would be memory intensive loads.
So 100 snap/sec means latency of 10msec right?
(not that it's not ok, with faster hw and IB you'll be able to get much
more)


Doesn't 100 snap/sec mean the interval of snap is 10msec?
IIUC, to get the latency, 

using ftrace with kvm

2010-04-22 Thread David S. Ahern
I have a VM that is spinning (both vcpus at 100%). As I recall, kvm_stat
has been deprecated in favor of ftrace. Is there a wiki page or document
that gives suggestions on this?

David


[PATCH RFC] KVM MMU: fix hashing for TDP and non-paging modes

2010-04-22 Thread Eric Northup
I've been reading the x86's mmu.c recently and had been wondering
about something.  Avi's recent mmu documentation (thanks!) seems to
have confirmed my understanding of how the shadow paging is supposed
to be working.  In TDP mode, when mmu_alloc_roots() calls
kvm_mmu_get_page(), why does it pass (vcpu->arch.cr3 >> PAGE_SHIFT) or
(vcpu->arch.mmu.pae_root[i]) as gfn?

It seems to me that in TDP mode, gfn should be either zero for the
root page table, or 0/1GB/2GB/3GB (for PAE page tables).

The existing behavior can lead to multiple, semantically-identical TDP
roots being created by mmu_alloc_roots, depending on the VCPU's CR3 at
the time that mmu_alloc_roots was called.  But the nested page tables
should be* independent of the VCPU state. That wastes some memory and
causes extra page faults while populating the extra copies of the page
tables.

*assuming that we aren't modeling per-VCPU state that might change the
physical address map as seen by that VCPU, such as setting the APIC
base to an address overlapping RAM.

All feedback would be welcome, since I'm new to this system!  A
strawman patch follows.

thanks,
-Eric

--

For TDP mode, avoid creating multiple page table roots for the single
guest-to-host physical address map by fixing the inputs used for the
shadow page table hash in mmu_alloc_roots().

Signed-off-by: Eric Northup digitale...@google.com
---
 arch/x86/kvm/mmu.c |   12 ++++++++----
 1 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ddfa865..9696d65 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2059,10 +2059,12 @@ static int mmu_alloc_roots(struct kvm_vcpu *vcpu)
hpa_t root = vcpu->arch.mmu.root_hpa;

ASSERT(!VALID_PAGE(root));
-   if (tdp_enabled)
-   direct = 1;
if (mmu_check_root(vcpu, root_gfn))
return 1;
+   if (tdp_enabled) {
+   direct = 1;
+   root_gfn = 0;
+   }
sp = kvm_mmu_get_page(vcpu, root_gfn, 0,
  PT64_ROOT_LEVEL, direct,
  ACC_ALL, NULL);
@@ -2072,8 +2074,6 @@ static int mmu_alloc_roots(struct kvm_vcpu *vcpu)
return 0;
}
direct = !is_paging(vcpu);
-   if (tdp_enabled)
-   direct = 1;
for (i = 0; i < 4; ++i) {
hpa_t root = vcpu->arch.mmu.pae_root[i];

@@ -2089,6 +2089,10 @@ static int mmu_alloc_roots(struct kvm_vcpu *vcpu)
root_gfn = 0;
if (mmu_check_root(vcpu, root_gfn))
return 1;
+   if (tdp_enabled) {
+   direct = 1;
+   root_gfn = i << 30;
+   }
sp = kvm_mmu_get_page(vcpu, root_gfn, i << 30,
  PT32_ROOT_LEVEL, direct,
  ACC_ALL, NULL);
--


Re: using ftrace with kvm

2010-04-22 Thread Manish Regmi
On Thu, Apr 22, 2010 at 3:53 PM, David S. Ahern daah...@cisco.com wrote:
I have a VM that is spinning (both vcpus at 100%). As I recall, kvm_stat
has been deprecated in favor of ftrace. Is there a wiki page or document
that gives suggestions on this?

David

Documentation/trace/* is the place to see.

But for me, function and function_graph give too much data even if I
limit it to one function, so I use tracepoints. I simply enable the
tracepoints I am interested in and read trace_pipe.

cat available_events
echo kvm:* > set_event
cat trace_pipe > ~/mydump.txt (or whatever you want to do)

---
regards
Manish Regmi

http://manish-cs.blogspot.com
http://ext2read.sf.net


Re: [PATCH RFC v2 6/6] KVM: introduce a new API for getting dirty bitmaps

2010-04-22 Thread Alexander Graf

On 22.04.2010, at 08:09, Fernando Luis Vázquez Cao wrote:

 On 04/22/2010 11:45 AM, Fernando Luis Vázquez Cao wrote:
 On 04/21/2010 06:41 PM, Alexander Graf wrote:
 On 21.04.2010, at 10:29, Fernando Luis Vázquez Cao wrote:
 
 On 04/20/2010 08:03 PM, Takuya Yoshikawa wrote:
 @@ -318,7 +318,7 @@ struct kvm_dirty_log {
   __u32 padding1;
   union {
   void __user *dirty_bitmap; /* one bit per page */
 - __u64 padding2;
 + __u64 addr;
 
 This can break on x86_32 and x86_64-compat. addr is a long not a __u64.
 
 So the high 32 bits are zero. Where's the problem?
 
 If we are careful enough to cast the addr appropriately we should be fine,
 even if we keep the padding field in the union. I am not saying that it
  breaks 32-bit architectures, but that it can potentially be problematic.
 
 + case KVM_SWITCH_DIRTY_LOG: {
 + struct kvm_dirty_log log;
 +
 + r = -EFAULT;
 + if (copy_from_user(&log, argp, sizeof log))
 + goto out;
 + r = kvm_vm_ioctl_switch_dirty_log(kvm, &log);
 + if (r)
 + goto out;
 + r = -EFAULT;
 + if (copy_to_user(argp, &log, sizeof log))
 + goto out;
 + r = 0;
 + break;
 + }
 
 In x86_64-compat mode we are handling 32bit user-space addresses
 so we need the compat counterpart of KVM_SWITCH_DIRTY_LOG too.
 
 The compat code just forwards everything to the generic ioctls.
 
 The compat code uses struct compat_kvm_dirty_log instead of
 struct kvm_dirty_log to communicate with user space so
 the necessary conversions needs to be done before invoking
 the generic ioctl (see KVM_GET_DIRTY_LOG in kvm_vm_compat_ioctl).
 
  By the way we probably should move the definition of struct
 compat_kvm_dirty_log to a header file.
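
For reference, the existing compat conversion for KVM_GET_DIRTY_LOG is roughly
the following, and KVM_SWITCH_DIRTY_LOG would need the same treatment (sketch
from memory, not quoted from the tree):

struct compat_kvm_dirty_log {
	__u32 slot;
	__u32 padding1;
	union {
		compat_uptr_t dirty_bitmap; /* one bit per page */
		__u64 padding2;
	};
};

/* in kvm_vm_compat_ioctl(), before calling the generic handler: */
struct compat_kvm_dirty_log compat_log;
struct kvm_dirty_log log;

if (copy_from_user(&compat_log, (void __user *)arg, sizeof(compat_log)))
	return -EFAULT;
log.slot = compat_log.slot;
log.padding1 = compat_log.padding1;
log.dirty_bitmap = compat_ptr(compat_log.dirty_bitmap);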
 
 It seems that it was you and Arnd who added the kvm_vm compat ioctl :-).
 Are you considering a different approach to tackle the issues that we
 have with a big-endian userspace?

IIRC the issue was a pointer inside of a nested structure, no?


Alex



Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1

2010-04-22 Thread Yoshiaki Tamura

Jamie Lokier wrote:

Yoshiaki Tamura wrote:

Dor Laor wrote:

On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:

Event tapping is the core component of Kemari, and it decides on which
event the
primary should synchronize with the secondary. The basic assumption
here is
that outgoing I/O operations are idempotent, which is usually true for
disk I/O
and reliable network protocols such as TCP.


IMO any type of network even should be stalled too. What if the VM runs
non tcp protocol and the packet that the master node sent reached some
remote client and before the sync to the slave the master failed?


In current implementation, it is actually stalling any type of network
that goes through virtio-net.

However, if the application was using unreliable protocols, it should have
its own recovering mechanism, or it should be completely stateless.


Even with unreliable protocols, if slave takeover causes the receiver
to have received a packet that the sender _does not think it has ever
sent_, expect some protocols to break.

If the slave replaying master's behaviour since the last sync means it
will definitely get into the same state of having sent the packet,
that works out.


That's something we're expecting now.


But you still have to be careful that the other end's responses to
that packet are not seen by the slave too early during that replay.
Otherwise, for example, the slave may observe a TCP ACK to a packet
that it hasn't yet sent, which is an error.


Even though the current implementation syncs just before network output, what you pointed 
out could happen.  In this case, would the connection be lost, or would the 
client/server recover from it?  If the latter, it would be fine; otherwise I wonder 
how people doing similar things are handling this situation.



About IP idempotency:

In general, IP packets are allowed to be lost or duplicated in the
network.  All IP protocols should be prepared for that; it is a basic
property.

However there is one respect in which they're not idempotent:

The TTL field should be decreased if packets are delayed.  Packets
should not appear to live in the network for longer than TTL seconds.
If they do, some protocols (like TCP) can react to the delayed ones
differently, such as sending a RST packet and breaking a connection.

It is acceptable to reduce TTL faster than the minimum.  After all, it
is reduced by 1 on every forwarding hop, in addition to time delays.


So the problem is, when the slave takes over, it sends a packet with the same TTL 
as one the client may already have received.



I currently don't have good numbers that I can share right now.
Snapshots/sec depends on what kind of workload is running, and if the
guest was almost idle, there will be no snapshots in 5sec.  On the other
hand, if the guest was running I/O intensive workloads (netperf, iozone
for example), there will be about 50 snapshots/sec.


That is a really satisfying number, thank you :-)

Without this work I wouldn't have imagined that synchronised machines
could work with such a low transaction rate.


Thank you for your comments.

Although I haven't prepared good data yet, I personally prefer to have 
discussion with actual implementation and experimental data.



Re: [RFC PATCH 00/20] Kemari for KVM v0.1

2010-04-22 Thread Yoshiaki Tamura

Anthony Liguori wrote:

On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:

Hi all,

We have been implementing the prototype of Kemari for KVM, and we're
sending
this message to share what we have now and TODO lists. Hopefully, we
would like
to get early feedback to keep us in the right direction. Although
advanced
approaches in the TODO lists are fascinating, we would like to run
this project
step by step while absorbing comments from the community. The current
code is
based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.

For those who are new to Kemari for KVM, please take a look at the
following RFC which we posted last year.

http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html

The transmission/transaction protocol, and most of the control logic is
implemented in QEMU. However, we needed a hack in KVM to prevent rip from
proceeding before synchronizing VMs. It may also need some plumbing in
the
kernel side to guarantee replayability of certain events and
instructions,
integrate the RAS capabilities of newer x86 hardware with the HA
stack, as well
as for optimization purposes, for example.

Before going into details, we would like to show how Kemari looks. We
prepared
a demonstration video at the following location. For those who are not
interested in the code, please take a look.
The demonstration scenario is,

1. Play with a guest VM that has virtio-blk and virtio-net.
# The guest image should be a NFS/SAN.
2. Start Kemari to synchronize the VM by running the following command
in QEMU.
Just add -k option to usual migrate command.
migrate -d -k tcp:192.168.0.20:
3. Check the status by calling info migrate.
4. Go back to the VM to play chess animation.
5. Kill the VM. (VNC client also disappears)
6. Press c to continue the VM on the other host.
7. Bring up the VNC client (Sorry, it pops outside of video capture.)
8. Confirm that the chess animation ends, browser works fine, then
shutdown.

http://www.osrg.net/kemari/download/kemari-kvm-fc11.mov

The repository contains all patches we're sending with this message.
For those
who want to try, pull the following repository. At running configure,
please
put --enable-ft-mode. Also you need to apply a patch attached at the
end of
this message to your KVM.

git://kemari.git.sourceforge.net/gitroot/kemari/kemari

In addition to usual migrate environment and command, add -k to run.

The patch set consists of following components.

- bit-based dirty bitmap. (I have posted v4 for upstream QEMU on April
20)
- writev() support to QEMUFile and FdMigrationState.
- FT transaction sender/receiver
- event tap that triggers FT transaction.
- virtio-blk, virtio-net support for event tap.


This series looks quite nice!


Thanks for your kind words!


I think it would make sense to separate out the things that are actually
optimizations (like the dirty bitmap changes and the writev/readv
changes) and to attempt to justify them with actual performance data.


I agree with the separation plan.

For dirty bitmap change, Avi and I discussed the patchset for upstream QEMU while 
you were offline (sorry if I was wrong).  Could you also take a look?


http://lists.gnu.org/archive/html/qemu-devel/2010-04/msg01396.html

Regarding writev, I agree that it should be backed with actual data; otherwise 
it should be removed.  We attempted to do everything that may reduce the overhead 
of the transaction.



I'd prefer not to modify the live migration protocol ABI and it doesn't
seem to be necessary if we're willing to add options to the -incoming
flag. We also want to be a bit more generic with respect to IO.


I totally agree with your approach not to change the protocol ABI.  Can we add 
an option to -incoming?  Like -incoming ft_mode, for example.
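
One possible shape for that option, sketched against
qemu_start_incoming_migration() (hypothetical parsing, not part of this
series):

int qemu_start_incoming_migration(const char *uri)
{
    const char *p;

    if (strstart(uri, "ft_mode:", &p)) {
        ft_mode = FT_INIT;	/* e.g. -incoming ft_mode:tcp:0:4444 */
        uri = p;
    }

    if (strstart(uri, "tcp:", &p))
        return tcp_start_incoming_migration(p);
    /* ... exec:, unix:, fd: handled as before ... */
    return -EPROTONOSUPPORT;
}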

Regarding the IO, let me reply to the next message.


Otherwise, the series looks very close to being mergable.


Thank you for your comment on each patch.

To be honest, I wasn't that confident because I'm a newbie to KVM/QEMU and 
struggled with how to implement things in an acceptable way.


Thanks,

Yoshi



Regards,

Anthony Liguori







PXE Boot Timeout Issue...

2010-04-22 Thread Stuart Sheldon

Hi all,

Just upgraded to 12.3 user space tools from 11.0, and now when I attempt
to netboot a guest, it appears that the pxe rom is timing out on dhcp
before the bridge has enough time to come up.

Is there a command line switch to set the dhcp timeout, or a build
option that can be changed to set the timeout to a longer value, or
disable it entirely?

Thanks in advance...

Stu

- --
Spider Pig, Spider Pig, Does whatever a Spider Pig does.
Can he swing, from a web, no he can't, he's a pig.
Look Oot! He is a Spider Pig
-- Matt Groening - Spider Pig Lyrics


Re: [PATCH 1/5] Add a global synchronization point for pvclock

2010-04-22 Thread Zachary Amsden

On 04/22/2010 03:11 AM, Glauber Costa wrote:

On Tue, Apr 20, 2010 at 12:42:17PM -0700, Jeremy Fitzhardinge wrote:
   

On 04/20/2010 11:54 AM, Avi Kivity wrote:
 

On 04/20/2010 09:23 PM, Jeremy Fitzhardinge wrote:
   

On 04/20/2010 02:31 AM, Avi Kivity wrote:

 

btw, do you want this code in pvclock.c, or shall we keep it kvmclock
specific?

   

I think its a pvclock-level fix.  I'd been hoping to avoid having
something like this, but I think its ultimately necessary.

 

Did you observe drift on Xen, or is this ultimately pointing at the
future?
   

People are reporting weirdnesses that clocksource=jiffies apparently
resolves.  Xen and KVM are faced with the same hardware constraints, and
it wouldn't surprise me if there were small measurable
non-monotonicities in the PV clock under Xen.  May as well be safe.

Of course, it kills any possibility of being able to usefully export
this interface down to usermode.

My main concern about this kind of simple fix is that if there's a long
term systematic drift between different CPU's tscs, then this will
somewhat mask the problem while giving really awful time measurement on
the slow CPU(s).  In that case it really needs to adjust the scaling
factor to correct for the drift (*not* update the offset).  But if we're
definitely only talking about fixed, relatively small time offsets then
it is fine.
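
For context, the fix under discussion amounts to a global monotonicity clamp
applied at read time; the logic is roughly the following (a sketch of the
approach, not the literal patch):

static atomic64_t last_value = ATOMIC64_INIT(0);

static u64 pvclock_clamp(u64 ret)
{
	u64 last = atomic64_read(&last_value);

	for (;;) {
		u64 old;

		if (ret <= last)
			return last;	/* another CPU already saw a later time */
		old = atomic64_cmpxchg(&last_value, last, ret);
		if (old == last)
			return ret;	/* we published the new maximum */
		last = old;		/* raced with another update; retry */
	}
}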
 

Can you by any chance run ingo's time warp test on those machines?

You need to define TOD to 1, and leave out the TSC test.

For me, warps exist on every machine out there, except the Nehalems, so far
   

Or apply this patch.
diff -rup a/time-warp-test.c b/time-warp-test.c
--- a/time-warp-test.c  2010-04-15 16:30:13.955981607 -1000
+++ b/time-warp-test.c  2010-04-15 16:35:37.777982377 -1000
@@ -91,7 +91,7 @@ static inline unsigned long long __rdtsc
 {
DECLARE_ARGS(val, low, high);
 
-   asm volatile("cpuid; rdtsc" : EAX_EDX_RET(val, low, high));
+   asm volatile("cpuid; rdtsc" : EAX_EDX_RET(val, low, high) :: "ebx", "ecx");
 
return EAX_EDX_VAL(val, low, high);
 }


[RFC PATCH] asm-generic: bitops: introduce le bit offset macro

2010-04-22 Thread Takuya Yoshikawa
Although we can use the *_le_bit() helpers to treat le-arranged bitmaps,
having the le bit offset calculation as a separate macro gives us more freedom.

For example, KVM has le-arranged dirty bitmaps for VGA, live migration,
etc. and we use them in user space too. To avoid bitmap copies between
kernel and user space, we want to update the bitmaps in user space directly.

To achieve this, the le bit offset combined with *_user() functions helps us a lot.

So let us factor out the le bit offset calculation by defining it as a new
macro: generic_le_bit_offset().
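
A worked example of what the macro exposes (assuming a 64-bit big-endian host,
where BITOP_LE_SWIZZLE is ((BITS_PER_LONG-1) & ~0x7) == 0x38; the *_user()
helper name is hypothetical):

void __user *user_bitmap = log->dirty_bitmap;	/* example source */
unsigned long nr = 0;				/* le bit number 0 */
unsigned long off = generic_le_bit_offset(nr);	/* 0 ^ 0x38 == 56 on BE64 */

set_bit_user(off, user_bitmap);			/* hypothetical helper */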

Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
---
 include/asm-generic/bitops/le.h |   18 +++---
 1 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/include/asm-generic/bitops/le.h b/include/asm-generic/bitops/le.h
index 80e3bf1..6af2e54 100644
--- a/include/asm-generic/bitops/le.h
+++ b/include/asm-generic/bitops/le.h
@@ -9,6 +9,8 @@
 
 #if defined(__LITTLE_ENDIAN)
 
+#define generic_le_bit_offset(nr)  (nr)
+
 #define generic_test_le_bit(nr, addr) test_bit(nr, addr)
 #define generic___set_le_bit(nr, addr) __set_bit(nr, addr)
 #define generic___clear_le_bit(nr, addr) __clear_bit(nr, addr)
@@ -25,22 +27,24 @@
 
 #elif defined(__BIG_ENDIAN)
 
+#define generic_le_bit_offset(nr)  ((nr) ^ BITOP_LE_SWIZZLE)
+
 #define generic_test_le_bit(nr, addr) \
-   test_bit((nr) ^ BITOP_LE_SWIZZLE, (addr))
+   test_bit(generic_le_bit_offset(nr), (addr))
 #define generic___set_le_bit(nr, addr) \
-   __set_bit((nr) ^ BITOP_LE_SWIZZLE, (addr))
+   __set_bit(generic_le_bit_offset(nr), (addr))
 #define generic___clear_le_bit(nr, addr) \
-   __clear_bit((nr) ^ BITOP_LE_SWIZZLE, (addr))
+   __clear_bit(generic_le_bit_offset(nr), (addr))
 
 #define generic_test_and_set_le_bit(nr, addr) \
-   test_and_set_bit((nr) ^ BITOP_LE_SWIZZLE, (addr))
+   test_and_set_bit(generic_le_bit_offset(nr), (addr))
 #define generic_test_and_clear_le_bit(nr, addr) \
-   test_and_clear_bit((nr) ^ BITOP_LE_SWIZZLE, (addr))
+   test_and_clear_bit(generic_le_bit_offset(nr), (addr))
 
 #define generic___test_and_set_le_bit(nr, addr) \
-   __test_and_set_bit((nr) ^ BITOP_LE_SWIZZLE, (addr))
+   __test_and_set_bit(generic_le_bit_offset(nr), (addr))
 #define generic___test_and_clear_le_bit(nr, addr) \
-   __test_and_clear_bit((nr) ^ BITOP_LE_SWIZZLE, (addr))
+   __test_and_clear_bit(generic_le_bit_offset(nr), (addr))
 
 extern unsigned long generic_find_next_zero_le_bit(const unsigned long *addr,
unsigned long size, unsigned long offset);
-- 
1.6.3.3



Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1

2010-04-22 Thread Yoshiaki Tamura

Anthony Liguori wrote:

On 04/22/2010 08:16 AM, Yoshiaki Tamura wrote:

2010/4/22 Dor Laordl...@redhat.com:

On 04/22/2010 01:35 PM, Yoshiaki Tamura wrote:

Dor Laor wrote:

On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:

Hi all,

We have been implementing the prototype of Kemari for KVM, and we're
sending
this message to share what we have now and TODO lists. Hopefully, we
would like
to get early feedback to keep us in the right direction. Although
advanced
approaches in the TODO lists are fascinating, we would like to run
this project
step by step while absorbing comments from the community. The current
code is
based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.

For those who are new to Kemari for KVM, please take a look at the
following RFC which we posted last year.

http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html

The transmission/transaction protocol, and most of the control
logic is
implemented in QEMU. However, we needed a hack in KVM to prevent rip
from
proceeding before synchronizing VMs. It may also need some
plumbing in
the
kernel side to guarantee replayability of certain events and
instructions,
integrate the RAS capabilities of newer x86 hardware with the HA
stack, as well
as for optimization purposes, for example.

[ snap]


The rest of this message describes TODO lists grouped by each topic.

=== event tapping ===

Event tapping is the core component of Kemari, and it decides on
which
event the
primary should synchronize with the secondary. The basic assumption
here is
that outgoing I/O operations are idempotent, which is usually true
for
disk I/O
and reliable network protocols such as TCP.

IMO any type of network even should be stalled too. What if the VM
runs
non tcp protocol and the packet that the master node sent reached some
remote client and before the sync to the slave the master failed?

In current implementation, it is actually stalling any type of network
that goes through virtio-net.

However, if the application was using unreliable protocols, it should
have its own recovering mechanism, or it should be completely
stateless.

Why do you treat tcp differently? You can damage the entire VM this
way -
think of dhcp request that was dropped on the moment you switched
between
the master and the slave?

I'm not trying to say that we should treat tcp differently, but just
it's severe.
In case of dhcp request, the client would have a chance to retry after
failover, correct?
BTW, in current implementation,


I'm slightly confused about the current implementation vs. my
recollection of the original paper with Xen. I had thought that all disk
and network I/O was buffered in such a way that at each checkpoint, the
I/O operations would be released in a burst. Otherwise, you would have
to synchronize after every I/O operation which is what it seems the
current implementation does.


Yes, you're almost right.
It's synchronizing before QEMU starts emulating I/O at each device model.
It was originally designed that way to avoid the complexity of introducing a buffering 
mechanism and the additional I/O latency buffering would add.



I'm not sure how that is accomplished
atomically though since you could have a completed I/O operation
duplicated on the slave node provided it didn't notify completion prior
to failure.


That's exactly the point I wanted to discuss.
Currently, we're calling vm_stop(0), qemu_aio_flush() and bdrv_flush_all() 
before qemu_save_state_all() in ft_tranx_ready(), to ensure outstanding I/O is 
complete.  I mimicked what existing live migration is doing.

It's not enough?
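
For reference, the ordering described above, written out (the function body is
reconstructed from this description and the state type is hypothetical):

static int ft_tranx_ready(FtTranxState *s)
{
    vm_stop(0);				/* pause the guest */
    qemu_aio_flush();			/* wait for outstanding AIO */
    bdrv_flush_all();			/* flush block drivers */

    if (qemu_savevm_state_all(s->mon, s->file) < 0) {
        return -1;			/* checkpoint failed */
    }

    vm_start();				/* run until the next tapped event */
    return 0;
}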


Is there another kemari component that somehow handles buffering I/O
that is not obvious from these patches?


No, I'm not hiding anything, and I would share any information regarding Kemari 
to develop it in this community :-)


Thanks,

Yoshi



Regards,

Anthony Liguori






Re: [RFC PATCH 01/20] Modify DIRTY_FLAG value and introduce DIRTY_IDX to use as indexes of bit-based phys_ram_dirty.

2010-04-22 Thread Yoshiaki Tamura

Anthony Liguori wrote:

Hi,

On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:

Replaces byte-based phys_ram_dirty bitmap with four (MASTER, VGA,
CODE, MIGRATION) bit-based phys_ram_dirty bitmap. On allocation, it
sets all bits in the bitmap. It uses ffs() to convert DIRTY_FLAG to
DIRTY_IDX.

Modifies wrapper functions for byte-based phys_ram_dirty bitmap to
bit-based phys_ram_dirty bitmap. MASTER works as a buffer, and upon
get_dirty() or get_dirty_flags(), it calls
cpu_physical_memory_sync_master() to update VGA and MIGRATION.


Why use an additional bitmap for MASTER instead of just updating the
VGA, CODE, and MIGRATION bitmaps together?


This way we don't have to think about whether we should update VGA or MIGRATION. 
IIRC, we had this discussion upstream before with Avi?


http://www.mail-archive.com/kvm@vger.kernel.org/msg30728.html

BTW, I also have the following TODO list regarding dirty bitmap.

1. Allocate vga and migration dirty bitmap dynamically.
2. Separate CODE and the other dirty bitmaps functions.



Regards,

Anthony Liguori


Replaces direct phys_ram_dirty access with wrapper functions to
prevent direct access to the phys_ram_dirty bitmap.

Signed-off-by: Yoshiaki Tamuratamura.yoshi...@lab.ntt.co.jp
Signed-off-by: OHMURA Keiohmura@lab.ntt.co.jp
---
cpu-all.h | 130
+
exec.c | 60 ++--
2 files changed, 152 insertions(+), 38 deletions(-)

diff --git a/cpu-all.h b/cpu-all.h
index 51effc0..3f8762d 100644
--- a/cpu-all.h
+++ b/cpu-all.h
@@ -37,6 +37,9 @@

#include "softfloat.h"

+/* to use ffs in flag_to_idx() */
+#include <strings.h>
+
#if defined(HOST_WORDS_BIGENDIAN) != defined(TARGET_WORDS_BIGENDIAN)
#define BSWAP_NEEDED
#endif
@@ -846,7 +849,6 @@ int cpu_str_to_log_mask(const char *str);
/* memory API */

extern int phys_ram_fd;
-extern uint8_t *phys_ram_dirty;
extern ram_addr_t ram_size;
extern ram_addr_t last_ram_offset;
extern uint8_t *bios_mem;
@@ -869,28 +871,140 @@ extern uint8_t *bios_mem;
/* Set if TLB entry is an IO callback. */
#define TLB_MMIO (1 << 5)

+/* Use DIRTY_IDX as indexes of bit-based phys_ram_dirty. */
+#define MASTER_DIRTY_IDX 0
+#define VGA_DIRTY_IDX 1
+#define CODE_DIRTY_IDX 2
+#define MIGRATION_DIRTY_IDX 3
+#define NUM_DIRTY_IDX 4
+
+#define MASTER_DIRTY_FLAG (1 << MASTER_DIRTY_IDX)
+#define VGA_DIRTY_FLAG (1 << VGA_DIRTY_IDX)
+#define CODE_DIRTY_FLAG (1 << CODE_DIRTY_IDX)
+#define MIGRATION_DIRTY_FLAG (1 << MIGRATION_DIRTY_IDX)
+
+extern unsigned long *phys_ram_dirty[NUM_DIRTY_IDX];
+
+static inline int dirty_flag_to_idx(int flag)
+{
+ return ffs(flag) - 1;
+}
+
+static inline int dirty_idx_to_flag(int idx)
+{
+ return 1 << idx;
+}
+
int cpu_memory_rw_debug(CPUState *env, target_ulong addr,
uint8_t *buf, int len, int is_write);

-#define VGA_DIRTY_FLAG 0x01
-#define CODE_DIRTY_FLAG 0x02
-#define MIGRATION_DIRTY_FLAG 0x08
-
/* read dirty bit (return 0 or 1) */
static inline int cpu_physical_memory_is_dirty(ram_addr_t addr)
{
- return phys_ram_dirty[addr >> TARGET_PAGE_BITS] == 0xff;
+ unsigned long mask;
+ ram_addr_t index = (addr >> TARGET_PAGE_BITS) / HOST_LONG_BITS;
+ int offset = (addr >> TARGET_PAGE_BITS) & (HOST_LONG_BITS - 1);
+
+ mask = 1UL << offset;
+ return (phys_ram_dirty[MASTER_DIRTY_IDX][index] & mask) == mask;
+}
+
+static inline void cpu_physical_memory_sync_master(ram_addr_t index)
+{
+ if (phys_ram_dirty[MASTER_DIRTY_IDX][index]) {
+ phys_ram_dirty[VGA_DIRTY_IDX][index]
+ |= phys_ram_dirty[MASTER_DIRTY_IDX][index];
+ phys_ram_dirty[MIGRATION_DIRTY_IDX][index]
+ |= phys_ram_dirty[MASTER_DIRTY_IDX][index];
+ phys_ram_dirty[MASTER_DIRTY_IDX][index] = 0UL;
+ }
+}
+
+static inline int cpu_physical_memory_get_dirty_flags(ram_addr_t addr)
+{
+ unsigned long mask;
+ ram_addr_t index = (addr >> TARGET_PAGE_BITS) / HOST_LONG_BITS;
+ int offset = (addr >> TARGET_PAGE_BITS) & (HOST_LONG_BITS - 1);
+ int ret = 0, i;
+
+ mask = 1UL << offset;
+ cpu_physical_memory_sync_master(index);
+
+ for (i = VGA_DIRTY_IDX; i <= MIGRATION_DIRTY_IDX; i++) {
+ if (phys_ram_dirty[i][index] & mask) {
+ ret |= dirty_idx_to_flag(i);
+ }
+ }
+
+ return ret;
+}
+
+static inline int cpu_physical_memory_get_dirty_idx(ram_addr_t addr,
+ int dirty_idx)
+{
+ unsigned long mask;
+ ram_addr_t index = (addr >> TARGET_PAGE_BITS) / HOST_LONG_BITS;
+ int offset = (addr >> TARGET_PAGE_BITS) & (HOST_LONG_BITS - 1);
+
+ mask = 1UL << offset;
+ cpu_physical_memory_sync_master(index);
+ return (phys_ram_dirty[dirty_idx][index] & mask) == mask;
}

static inline int cpu_physical_memory_get_dirty(ram_addr_t addr,
int dirty_flags)
{
- return phys_ram_dirty[addr >> TARGET_PAGE_BITS] & dirty_flags;
+ return cpu_physical_memory_get_dirty_idx(addr,
+ dirty_flag_to_idx(dirty_flags));
}

static inline void cpu_physical_memory_set_dirty(ram_addr_t addr)
{
- phys_ram_dirty[addr >> TARGET_PAGE_BITS] = 0xff;
+ unsigned long mask;
+ ram_addr_t index = (addr >> TARGET_PAGE_BITS) / HOST_LONG_BITS;
+ int offset = (addr >> TARGET_PAGE_BITS) & (HOST_LONG_BITS - 1);
+
+ mask = 1UL << offset;
+ 
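
(The message is truncated here in the archive. Judging from the helpers
above, the body presumably finishes by setting the bit in the master
bitmap and letting cpu_physical_memory_sync_master() propagate it later;
a guess from context, not the author's verbatim code:

    /* assumption: marking the master bitmap alone suffices; the VGA and
     * MIGRATION bitmaps pick the bit up on the next sync */
    phys_ram_dirty[MASTER_DIRTY_IDX][index] |= mask;
}

As a worked example of the bit arithmetic: with 4 KB pages and
HOST_LONG_BITS == 64, an addr of 0x5000 is page number 5, so index =
5 / 64 = 0, offset = 5 & 63 = 5, and that page's dirty state is bit 5 of
word 0 in each bitmap.)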

Re: [RFC PATCH 15/20] Introduce FT mode support to configure.

2010-04-22 Thread Yoshiaki Tamura

Anthony Liguori wrote:

On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:

Signed-off-by: Yoshiaki Tamura <tamura.yoshi...@lab.ntt.co.jp>


No need for this.


OK.



Regards,

Anthony Liguori


---
configure | 8 
1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/configure b/configure
index 046c591..f0682d4 100755
--- a/configure
+++ b/configure
@@ -298,6 +298,7 @@ bsd_user=no
guest_base=
uname_release=
io_thread=no
+ft_mode=no
mixemu=no
kvm_trace=no
kvm_cap_pit=
@@ -671,6 +672,8 @@ for opt do
;;
--enable-io-thread) io_thread=yes
;;
+ --enable-ft-mode) ft_mode=yes
+ ;;
--disable-blobs) blobs=no
;;
--kerneldir=*) kerneldir=$optarg
@@ -840,6 +843,7 @@ echo "  --enable-vde             enable support for vde network"
echo "  --disable-linux-aio      disable Linux AIO support"
echo "  --enable-linux-aio       enable Linux AIO support"
echo "  --enable-io-thread       enable IO thread"
+echo "  --enable-ft-mode         enable FT mode support"
echo "  --disable-blobs          disable installing provided firmware blobs"
echo "  --kerneldir=PATH         look for kernel includes in PATH"
echo "  --with-kvm-trace         enable building the KVM module with the kvm trace option"
@@ -2117,6 +2121,7 @@ echo "GUEST_BASE        $guest_base"
echo "PIE user targets  $user_pie"
echo "vde support       $vde"
echo "IO thread         $io_thread"
+echo "FT mode support   $ft_mode"
echo "Linux AIO support $linux_aio"
echo "Install blobs     $blobs"
echo "KVM support       $kvm"
@@ -2318,6 +2323,9 @@ fi
if test "$io_thread" = "yes" ; then
echo "CONFIG_IOTHREAD=y" >> $config_host_mak
fi
+if test "$ft_mode" = "yes" ; then
+ echo "CONFIG_FT_MODE=y" >> $config_host_mak
+fi
if test "$linux_aio" = "yes" ; then
echo "CONFIG_LINUX_AIO=y" >> $config_host_mak
fi









Re: [RFC PATCH 14/20] Upgrade QEMU_FILE_VERSION from 3 to 4, and introduce qemu_savevm_state_all().

2010-04-22 Thread Yoshiaki Tamura

Anthony Liguori wrote:

On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:

Make a 32bit entry after QEMU_VM_FILE_VERSION to recognize whether the
transferred data is QEMU_VM_FT_MODE or QEMU_VM_LIVE_MIGRATION_MODE.


I'd rather you encapsulate the current protocol as opposed to extending
it with a new version.

You could also do this by just having a new -incoming option, right? All
you really need is to indicate that this is for high frequency
checkpointing, right?


Exactly.
I would use the -incoming option.



Regards,

Anthony Liguori


Signed-off-by: Yoshiaki Tamura <tamura.yoshi...@lab.ntt.co.jp>
---
 savevm.c |   76 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 sysemu.h |    1 +
2 files changed, 75 insertions(+), 2 deletions(-)

diff --git a/savevm.c b/savevm.c
index 292ae32..19b3efb 100644
--- a/savevm.c
+++ b/savevm.c
@@ -1402,8 +1402,10 @@ static void vmstate_save(QEMUFile *f,
SaveStateEntry *se)
}

#define QEMU_VM_FILE_MAGIC 0x5145564d
-#define QEMU_VM_FILE_VERSION_COMPAT 0x0002
-#define QEMU_VM_FILE_VERSION 0x0003
+#define QEMU_VM_FILE_VERSION_COMPAT 0x0003
+#define QEMU_VM_FILE_VERSION 0x0004
+#define QEMU_VM_LIVE_MIGRATION_MODE 0x0005
+#define QEMU_VM_FT_MODE 0x0006

#define QEMU_VM_EOF 0x00
#define QEMU_VM_SECTION_START 0x01
@@ -1425,6 +1427,12 @@ int qemu_savevm_state_begin(Monitor *mon,
QEMUFile *f, int blk_enable,

qemu_put_be32(f, QEMU_VM_FILE_MAGIC);
qemu_put_be32(f, QEMU_VM_FILE_VERSION);
+
+ if (ft_mode) {
+ qemu_put_be32(f, QEMU_VM_FT_MODE);
+ } else {
+ qemu_put_be32(f, QEMU_VM_LIVE_MIGRATION_MODE);
+ }

QTAILQ_FOREACH(se, &savevm_handlers, entry) {
int len;
@@ -1533,6 +1541,66 @@ int qemu_savevm_state_complete(Monitor *mon,
QEMUFile *f)
return 0;
}

+int qemu_savevm_state_all(Monitor *mon, QEMUFile *f)
+{
+ SaveStateEntry *se;
+
+ QTAILQ_FOREACH(se, &savevm_handlers, entry) {
+ int len;
+
+ if (se->save_live_state == NULL)
+ continue;
+
+ /* Section type */
+ qemu_put_byte(f, QEMU_VM_SECTION_START);
+ qemu_put_be32(f, se->section_id);
+
+ /* ID string */
+ len = strlen(se->idstr);
+ qemu_put_byte(f, len);
+ qemu_put_buffer(f, (uint8_t *)se->idstr, len);
+
+ qemu_put_be32(f, se->instance_id);
+ qemu_put_be32(f, se->version_id);
+ if (ft_mode == FT_INIT) {
+ /* This is a workaround. */
+ se->save_live_state(mon, f, QEMU_VM_SECTION_START, se->opaque);
+ } else {
+ se->save_live_state(mon, f, QEMU_VM_SECTION_PART, se->opaque);
+ }
+ }
+
+ ft_mode = FT_TRANSACTION;
+ QTAILQ_FOREACH(se, &savevm_handlers, entry) {
+ int len;
+
+ if (se->save_state == NULL && se->vmsd == NULL)
+ continue;
+
+ /* Section type */
+ qemu_put_byte(f, QEMU_VM_SECTION_FULL);
+ qemu_put_be32(f, se->section_id);
+
+ /* ID string */
+ len = strlen(se->idstr);
+ qemu_put_byte(f, len);
+ qemu_put_buffer(f, (uint8_t *)se->idstr, len);
+
+ qemu_put_be32(f, se->instance_id);
+ qemu_put_be32(f, se->version_id);
+
+ vmstate_save(f, se);
+ }
+
+ qemu_put_byte(f, QEMU_VM_EOF);
+
+ if (qemu_file_has_error(f))
+ return -EIO;
+
+ return 0;
+}
+
+
void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f)
{
SaveStateEntry *se;
@@ -1617,6 +1685,10 @@ int qemu_loadvm_state(QEMUFile *f, int
skip_header)
if (v != QEMU_VM_FILE_VERSION)
return -ENOTSUP;

+ v = qemu_get_be32(f);
+ if (v == QEMU_VM_FT_MODE) {
+ ft_mode = FT_INIT;
+ }
}

while ((section_type = qemu_get_byte(f)) != QEMU_VM_EOF) {
diff --git a/sysemu.h b/sysemu.h
index 6c1441f..df314bb 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -67,6 +67,7 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile
*f, int blk_enable,
int shared);
int qemu_savevm_state_iterate(Monitor *mon, QEMUFile *f);
int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f);
+int qemu_savevm_state_all(Monitor *mon, QEMUFile *f);
void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f);
int qemu_loadvm_state(QEMUFile *f, int skip_header);









Re: [RFC PATCH 05/20] Introduce put_vector() and get_vector to QEMUFile and qemu_fopen_ops().

2010-04-22 Thread Yoshiaki Tamura

Anthony Liguori wrote:

On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:

QEMUFile currently doesn't support writev(). For sending multiple
data, such as pages, using writev() should be more efficient.

Signed-off-by: Yoshiaki Tamura <tamura.yoshi...@lab.ntt.co.jp>


Is there performance data that backs this up? Since QEMUFile uses a
linear buffer for most operations that's limited to 16k, I suspect you
wouldn't be able to observe a difference in practice.


I currently don't have data, but I'll prepare it.
There were two things I wanted to avoid.

1. Pages to be copied to QEMUFile buf through qemu_put_buffer.
2. Calling write() every time, even when we want to send multiple pages at once.

I think 2 may be negligible, but 1 seems to be problematic if we want to
make the latency as small as possible, no?
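
To make the copy-avoidance point concrete, here is a minimal standalone
sketch (plain POSIX, not QEMUFile code; the pages and their contents are
invented for the example) of how writev() hands several non-contiguous
pages to the kernel in one call, with no staging copy:

    #include <stdio.h>
    #include <string.h>
    #include <sys/uio.h>    /* writev() */
    #include <unistd.h>

    #define PAGE_SIZE 4096

    int main(void)
    {
        static char page_a[PAGE_SIZE], page_b[PAGE_SIZE];
        memset(page_a, 'a', sizeof(page_a));
        memset(page_b, 'b', sizeof(page_b));

        /* one iovec entry per page: the kernel gathers the pieces, so
         * no user-space copy into a linear buffer is needed */
        struct iovec iov[2] = {
            { .iov_base = page_a, .iov_len = sizeof(page_a) },
            { .iov_base = page_b, .iov_len = sizeof(page_b) },
        };

        if (writev(STDOUT_FILENO, iov, 2) < 0)
            perror("writev");
        return 0;
    }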




Regards,

Anthony Liguori


---
buffered_file.c | 2 +-
hw/hw.h | 16 
savevm.c | 43 +--
3 files changed, 42 insertions(+), 19 deletions(-)

diff --git a/buffered_file.c b/buffered_file.c
index 54dc6c2..187d1d4 100644
--- a/buffered_file.c
+++ b/buffered_file.c
@@ -256,7 +256,7 @@ QEMUFile *qemu_fopen_ops_buffered(void *opaque,
s->wait_for_unfreeze = wait_for_unfreeze;
s->close = close;

- s->file = qemu_fopen_ops(s, buffered_put_buffer, NULL,
+ s->file = qemu_fopen_ops(s, buffered_put_buffer, NULL, NULL, NULL,
buffered_close, buffered_rate_limit,
buffered_set_rate_limit,
buffered_get_rate_limit);
diff --git a/hw/hw.h b/hw/hw.h
index fc9ed29..921cf90 100644
--- a/hw/hw.h
+++ b/hw/hw.h
@@ -23,6 +23,13 @@
typedef int (QEMUFilePutBufferFunc)(void *opaque, const uint8_t *buf,
int64_t pos, int size);

+/* This function writes a chunk of vector to a file at the given position.
+ * The pos argument can be ignored if the file is only being used for
+ * streaming.
+ */
+typedef int (QEMUFilePutVectorFunc)(void *opaque, struct iovec *iov,
+ int64_t pos, int iovcnt);
+
/* Read a chunk of data from a file at the given position. The pos argument
 * can be ignored if the file is only being used for streaming. The number of
* bytes actually read should be returned.
@@ -30,6 +37,13 @@ typedef int (QEMUFilePutBufferFunc)(void *opaque,
const uint8_t *buf,
typedef int (QEMUFileGetBufferFunc)(void *opaque, uint8_t *buf,
int64_t pos, int size);

+/* Read a chunk of vector from a file at the given position. The pos argument
+ * can be ignored if the file is only being used for streaming. The number of
+ * bytes actually read should be returned.
+ */
+typedef int (QEMUFileGetVectorFunc)(void *opaque, struct iovec *iov,
+ int64_t pos, int iovcnt);
+
/* Close a file and return an error code */
typedef int (QEMUFileCloseFunc)(void *opaque);

@@ -46,7 +60,9 @@ typedef size_t (QEMUFileSetRateLimit)(void *opaque,
size_t new_rate);
typedef size_t (QEMUFileGetRateLimit)(void *opaque);

QEMUFile *qemu_fopen_ops(void *opaque, QEMUFilePutBufferFunc *put_buffer,
+ QEMUFilePutVectorFunc *put_vector,
QEMUFileGetBufferFunc *get_buffer,
+ QEMUFileGetVectorFunc *get_vector,
QEMUFileCloseFunc *close,
QEMUFileRateLimit *rate_limit,
QEMUFileSetRateLimit *set_rate_limit,
diff --git a/savevm.c b/savevm.c
index 490ab70..944e788 100644
--- a/savevm.c
+++ b/savevm.c
@@ -162,7 +162,9 @@ void qemu_announce_self(void)

struct QEMUFile {
QEMUFilePutBufferFunc *put_buffer;
+ QEMUFilePutVectorFunc *put_vector;
QEMUFileGetBufferFunc *get_buffer;
+ QEMUFileGetVectorFunc *get_vector;
QEMUFileCloseFunc *close;
QEMUFileRateLimit *rate_limit;
QEMUFileSetRateLimit *set_rate_limit;
@@ -263,11 +265,11 @@ QEMUFile *qemu_popen(FILE *stdio_file, const
char *mode)
s->stdio_file = stdio_file;

if(mode[0] == 'r') {
- s->file = qemu_fopen_ops(s, NULL, stdio_get_buffer, stdio_pclose,
- NULL, NULL, NULL);
+ s->file = qemu_fopen_ops(s, NULL, NULL, stdio_get_buffer,
+ NULL, stdio_pclose, NULL, NULL, NULL);
} else {
- s->file = qemu_fopen_ops(s, stdio_put_buffer, NULL, stdio_pclose,
- NULL, NULL, NULL);
+ s->file = qemu_fopen_ops(s, stdio_put_buffer, NULL, NULL, NULL,
+ stdio_pclose, NULL, NULL, NULL);
}
return s->file;
}
@@ -312,11 +314,11 @@ QEMUFile *qemu_fdopen(int fd, const char *mode)
goto fail;

if(mode[0] == 'r') {
- s->file = qemu_fopen_ops(s, NULL, stdio_get_buffer, stdio_fclose,
- NULL, NULL, NULL);
+ s->file = qemu_fopen_ops(s, NULL, NULL, stdio_get_buffer, NULL,
+ stdio_fclose, NULL, NULL, NULL);
} else {
- s->file = qemu_fopen_ops(s, stdio_put_buffer, NULL, stdio_fclose,
- NULL, NULL, NULL);
+ s->file = qemu_fopen_ops(s, stdio_put_buffer, NULL, NULL, NULL,
+ stdio_fclose, NULL, NULL, NULL);
}
return s->file;

@@ -330,8 +332,8 @@ QEMUFile *qemu_fopen_socket(int fd)
QEMUFileSocket *s = qemu_mallocz(sizeof(QEMUFileSocket));

s->fd = fd;
- s->file = qemu_fopen_ops(s, NULL, socket_get_buffer, socket_close,
- NULL, NULL, NULL);
+ s->file = qemu_fopen_ops(s, NULL, NULL, socket_get_buffer, NULL,
+ socket_close, NULL, NULL, NULL);
return s->file;
}

@@ -368,11 +370,11 @@ QEMUFile *qemu_fopen(const char *filename, const
char *mode)
goto fail;


Re: [PATCH 6/10] KVM MMU: don't write-protect if have new mapping to unsync page

2010-04-22 Thread Xiao Guangrong


Marcelo Tosatti wrote:

  role = vcpu->arch.mmu.base_role;
 @@ -1332,12 +1336,16 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
  hlist_for_each_entry_safe(sp, node, tmp, bucket, hash_link)
  if (sp->gfn == gfn) {
  if (sp->unsync)
 -if (kvm_sync_page(vcpu, sp))
 -continue;
 +unsync_sp = sp;
 

Hi Marcelo,

Thanks for your comments. Maybe the changelog is not clear; please allow
me to explain here.

Two cases can happen in the kvm_mmu_get_page() function (a toy sketch
follows below):

- In one case, the goal sp is already in the cache. If that sp is unsync,
  we only need to update it to make sure the mapping is valid, but we do
  not mark it sync and do not write-protect sp->gfn, since reusing it
  does not break the unsync rule (one shadow page for a gfn).

- In the other case, the goal sp does not exist yet, so we need to create
  a new sp for gfn, i.e. gfn may already have another shadow page. To keep
  the unsync rule, we should sync (mark sync and write-protect) gfn's
  existing unsync shadow page. After enabling multiple unsync shadows, we
  sync those shadow pages only when the new sp is not allowed to become
  unsync (the new rule being: allow all pte pages to become unsync).
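
A toy model of the two cases, with invented, simplified types (this is my
illustration, not KVM code; a real sp is keyed by gfn plus a role, and
syncing means writing back and write-protecting the page):

    #include <stdbool.h>
    #include <stdio.h>

    struct sp { unsigned long gfn; int role; bool unsync; };

    static struct sp cache[16];
    static int nr_sp;

    static struct sp *get_page(unsigned long gfn, int role, bool may_unsync)
    {
        struct sp *unsync_sp = NULL;
        int i;

        for (i = 0; i < nr_sp; i++) {
            if (cache[i].gfn != gfn)
                continue;
            if (cache[i].unsync)
                unsync_sp = &cache[i];
            if (cache[i].role == role)
                return &cache[i];       /* case 1: reuse, even if unsync */
        }
        /* case 2: a new sp is created for gfn; sync the old unsync page
         * only if the new sp is not itself allowed to become unsync */
        if (unsync_sp && !may_unsync) {
            unsync_sp->unsync = false;  /* stands in for kvm_sync_page() */
            printf("synced old sp for gfn %lu\n", unsync_sp->gfn);
        }
        cache[nr_sp] = (struct sp){ gfn, role, false };
        return &cache[nr_sp++];
    }

    int main(void)
    {
        get_page(5, 0, true);        /* create sp for gfn 5, role 0 */
        cache[0].unsync = true;      /* pretend the guest modified it */
        get_page(5, 0, true);        /* case 1: same role, reused as is */
        get_page(5, 1, false);       /* case 2: new role, old page synced */
        return 0;
    }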

 
 I don't see a reason why you can't create a new mapping to an unsync
 page. The code already creates shadow pte entries using unsync
 pagetables.

Do you mean case 2? In the original code, gfn's unsync page was synced
first, regardless of whether the unsync rule was actually broken:

|   hlist_for_each_entry_safe(sp, node, tmp, bucket, hash_link)
|   if (sp->gfn == gfn) {
|   if (sp->unsync)
|   if (kvm_sync_page(vcpu, sp))

My English is poor; sorry if I misunderstood your comment :-(

Xiao


Re: [PATCH 1/3] KVM MMU: make kvm_mmu_zap_page() return the number of zapped sp in total.

2010-04-22 Thread Xiao Guangrong


Gui Jianfeng wrote:
 Currently, in kvm_mmu_change_mmu_pages(kvm, page), used_pages-- is
 performed after calling kvm_mmu_zap_page() regardless of whether the page
 was actually reclaimed, yet a root sp won't be reclaimed by
 kvm_mmu_zap_page(). So making kvm_mmu_zap_page() return the total number
 of reclaimed sps makes more sense. A new flag is passed to
 kvm_mmu_zap_page() to indicate whether the top page was reclaimed.
 

This bug only hurts the kvm_mmu_change_mmu_pages() function. We had better
allow 'self_deleted' to be NULL, so that we can pass NULL at the other
call sites.

 @@ -1571,7 +1584,8 @@ restart:
   pgprintk("%s: gfn %lx role %x\n", __func__, gfn,
 sp->role.word);
   r = 1;
 - if (kvm_mmu_zap_page(kvm, sp))
 + ret = kvm_mmu_zap_page(kvm, sp, &self_deleted);
 + if (ret > 1 || (ret == 1 && self_deleted == 0))
   goto restart;

Maybe we can keep kvm_mmu_zap_page() returning the number of zapped
children, with 'self_deleted' indicating whether the page itself was
zapped. Then we don't need to modify those functions; we only fix
kvm_mmu_change_mmu_pages(): if 'self_deleted == 1', increment
'used_pages'.
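
To make the proposed accounting concrete, a standalone toy model with
invented names (not the actual KVM code):

    #include <stdbool.h>
    #include <stdio.h>

    /* stand-in for kvm_mmu_zap_page(): returns how many child pages were
     * freed and reports via *self_deleted whether the page itself was
     * freed (a root sp, for instance, would survive) */
    static int zap_page(bool is_root, int nr_children, bool *self_deleted)
    {
        *self_deleted = !is_root;
        return nr_children;
    }

    int main(void)
    {
        int used_pages = 10;
        bool self_deleted;

        /* caller logic along the lines discussed above: subtract the
         * zapped children, and count the page itself only if it was
         * really freed */
        used_pages -= zap_page(true /* root */, 3, &self_deleted);
        if (self_deleted)
            used_pages--;

        printf("used_pages = %d\n", used_pages);  /* 7: the root survived */
        return 0;
    }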

Xiao



Re: [RFC PATCH 07/20] Introduce qemu_put_vector() and qemu_put_vector_prepare() to use put_vector() in QEMUFile.

2010-04-22 Thread Yoshiaki Tamura

Anthony Liguori wrote:

On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:

As a foolproofing measure, qemu_put_vector_prepare() should be called
before qemu_put_vector(). Then, if any qemu_put_* function other than
qemu_put_vector() is called after qemu_put_vector_prepare(), the program
will abort().

Signed-off-by: Yoshiaki Tamura <tamura.yoshi...@lab.ntt.co.jp>


I don't get it. What's this protecting against?


This was introduced to prevent mixing the order of normal writes and vector
writes, and to flush the QEMUFile buffer before handling vectors.
While qemu_put_buffer() copies data into the QEMUFile buffer, qemu_put_vector()
bypasses that buffer.


It's just foolproofing against what we encountered at the beginning; if the
user of qemu_put_vector() is careful enough, we can remove
qemu_put_vector_prepare().  While writing this message, I started to think that
just calling qemu_fflush() in qemu_put_vector() would be enough...
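
Based on the patch below, the intended call order would look like this (a
minimal usage sketch, assuming an already-open QEMUFile *f and a populated
QEMUIOVector qiov):

    qemu_put_vector_prepare(f);   /* flushes f's buffer, arms the check */
    qemu_put_vector(f, &qiov);    /* bypasses the linear buffer */
    /* any qemu_put_byte()/qemu_put_buffer() between the two calls would
     * abort() -- exactly the misordering the check is meant to catch */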




Regards,

Anthony Liguori


---
hw/hw.h | 2 ++
savevm.c | 53 +
2 files changed, 55 insertions(+), 0 deletions(-)

diff --git a/hw/hw.h b/hw/hw.h
index 921cf90..10e6dda 100644
--- a/hw/hw.h
+++ b/hw/hw.h
@@ -77,6 +77,8 @@ void qemu_fflush(QEMUFile *f);
int qemu_fclose(QEMUFile *f);
void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, int size);
void qemu_put_byte(QEMUFile *f, int v);
+void qemu_put_vector(QEMUFile *f, QEMUIOVector *qiov);
+void qemu_put_vector_prepare(QEMUFile *f);
void *qemu_realloc_buffer(QEMUFile *f, int size);
void qemu_clear_buffer(QEMUFile *f);

diff --git a/savevm.c b/savevm.c
index 944e788..22d928c 100644
--- a/savevm.c
+++ b/savevm.c
@@ -180,6 +180,7 @@ struct QEMUFile {
uint8_t *buf;

int has_error;
+ int prepares_vector;
};

typedef struct QEMUFileStdio
@@ -557,6 +558,58 @@ void qemu_put_byte(QEMUFile *f, int v)
qemu_fflush(f);
}

+void qemu_put_vector(QEMUFile *f, QEMUIOVector *v)
+{
+ struct iovec *iov;
+ int cnt;
+ size_t bufsize;
+ uint8_t *buf;
+
+ if (qemu_file_get_rate_limit(f) != 0) {
+ fprintf(stderr,
+ "Attempted to write vector while bandwidth limit is not zero.\n");
+ abort();
+ }
+
+ /* checks the prepared vector.
+ * As a foolproofing measure, qemu_put_vector_prepare should be called
+ * before qemu_put_vector. Then, if any qemu_put_* function except this
+ * one is called after qemu_put_vector_prepare, the program will abort().
+ */
+ if (!f->prepares_vector) {
+ fprintf(stderr,
+ "You should prepare with qemu_put_vector_prepare.\n");
+ abort();
+ } else if (f->prepares_vector && f->buf_index != 0) {
+ fprintf(stderr, "Wrote data after qemu_put_vector_prepare.\n");
+ abort();
+ }
+ f->prepares_vector = 0;
+
+ if (f->put_vector) {
+ qemu_iovec_to_vector(v, &iov, &cnt);
+ f->put_vector(f->opaque, iov, 0, cnt);
+ } else {
+ qemu_iovec_to_size(v, &bufsize);
+ buf = qemu_malloc(bufsize + 1 /* for '\0' */);
+ qemu_iovec_to_buffer(v, buf);
+ qemu_put_buffer(f, buf, bufsize);
+ qemu_free(buf);
+ }
+ }
+
+}
+
+void qemu_put_vector_prepare(QEMUFile *f)
+{
+ if (f->prepares_vector) {
+ /* prepare vector */
+ fprintf(stderr, "Attempted to prepare vector twice\n");
+ abort();
+ }
+ f->prepares_vector = 1;
+ qemu_fflush(f);
+}
+
int qemu_get_buffer(QEMUFile *f, uint8_t *buf, int size1)
{
int size, l;











Re: [RFC PATCH 10/20] Introduce skip_header parameter to qemu_loadvm_state() so that it can be called iteratively without reading the header.

2010-04-22 Thread Yoshiaki Tamura

Anthony Liguori wrote:

On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:

Signed-off-by: Yoshiaki Tamura <tamura.yoshi...@lab.ntt.co.jp>


I think the more appropriate thing to do is have
qemu_savevm_state_complete() not write QEMU_VM_EOF when doing a
continuous live migration.

You would then want qemu_loadvm_state() to detect real EOF and treat
that the same as QEMU_VM_EOF (provided it occurred at a section boundary).

Of course, this should be a flag to qemu_loadvm_state() as it's not
always the right behavior.


Sorry, I couldn't get your intention; I would appreciate it if you could
explain its good points.

On the receiver side, we need to switch to ft_transaction mode. If
qemu_savevm_state_complete() didn't send QEMU_VM_EOF, qemu_loadvm_state()
would never get out of the loop, and therefore it couldn't switch to
ft_transaction mode.


Please let me know if I'm misunderstanding.
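
(To illustrate the flow under discussion -- my sketch of the intent, not
code from the series: the receiver first consumes the header, then keeps
re-entering the load loop for each checkpoint, which is why every
transaction needs its own terminating QEMU_VM_EOF:

    ret = qemu_loadvm_state(f, 0);   /* reads magic/version, sets ft_mode */
    while (ret == 0 && ft_mode) {
        /* each subsequent checkpoint arrives headerless and is
         * terminated by its own QEMU_VM_EOF */
        ret = qemu_loadvm_state(f, 1);
    }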



Regards,

Anthony Liguori


---
migration-exec.c | 2 +-
migration-fd.c | 2 +-
migration-tcp.c | 2 +-
migration-unix.c | 2 +-
savevm.c | 25 ++---
sysemu.h | 2 +-
6 files changed, 19 insertions(+), 16 deletions(-)

diff --git a/migration-exec.c b/migration-exec.c
index 3edc026..5839a6d 100644
--- a/migration-exec.c
+++ b/migration-exec.c
@@ -113,7 +113,7 @@ static void exec_accept_incoming_migration(void
*opaque)
QEMUFile *f = opaque;
int ret;

- ret = qemu_loadvm_state(f);
+ ret = qemu_loadvm_state(f, 0);
if (ret < 0) {
fprintf(stderr, "load of migration failed\n");
goto err;
diff --git a/migration-fd.c b/migration-fd.c
index 0cc74ad..0e97ed0 100644
--- a/migration-fd.c
+++ b/migration-fd.c
@@ -106,7 +106,7 @@ static void fd_accept_incoming_migration(void
*opaque)
QEMUFile *f = opaque;
int ret;

- ret = qemu_loadvm_state(f);
+ ret = qemu_loadvm_state(f, 0);
if (ret < 0) {
fprintf(stderr, "load of migration failed\n");
goto err;
diff --git a/migration-tcp.c b/migration-tcp.c
index 56e1a3b..94a1a03 100644
--- a/migration-tcp.c
+++ b/migration-tcp.c
@@ -182,7 +182,7 @@ static void tcp_accept_incoming_migration(void
*opaque)
goto out;
}

- ret = qemu_loadvm_state(f);
+ ret = qemu_loadvm_state(f, 0);
if (ret < 0) {
fprintf(stderr, "load of migration failed\n");
goto out_fopen;
diff --git a/migration-unix.c b/migration-unix.c
index b7aab38..dd99a73 100644
--- a/migration-unix.c
+++ b/migration-unix.c
@@ -168,7 +168,7 @@ static void unix_accept_incoming_migration(void
*opaque)
goto out;
}

- ret = qemu_loadvm_state(f);
+ ret = qemu_loadvm_state(f, 0);
if (ret < 0) {
fprintf(stderr, "load of migration failed\n");
goto out_fopen;
diff --git a/savevm.c b/savevm.c
index 22d928c..a401b27 100644
--- a/savevm.c
+++ b/savevm.c
@@ -1554,7 +1554,7 @@ typedef struct LoadStateEntry {
int version_id;
} LoadStateEntry;

-int qemu_loadvm_state(QEMUFile *f)
+int qemu_loadvm_state(QEMUFile *f, int skip_header)
{
QLIST_HEAD(, LoadStateEntry) loadvm_handlers =
QLIST_HEAD_INITIALIZER(loadvm_handlers);
@@ -1563,17 +1563,20 @@ int qemu_loadvm_state(QEMUFile *f)
unsigned int v;
int ret;

- v = qemu_get_be32(f);
- if (v != QEMU_VM_FILE_MAGIC)
- return -EINVAL;
+ if (!skip_header) {
+ v = qemu_get_be32(f);
+ if (v != QEMU_VM_FILE_MAGIC)
+ return -EINVAL;
+
+ v = qemu_get_be32(f);
+ if (v == QEMU_VM_FILE_VERSION_COMPAT) {
+ fprintf(stderr, "SaveVM v3 format is obsolete and don't work anymore\n");
+ return -ENOTSUP;
+ }
+ if (v != QEMU_VM_FILE_VERSION)
+ return -ENOTSUP;

- v = qemu_get_be32(f);
- if (v == QEMU_VM_FILE_VERSION_COMPAT) {
- fprintf(stderr, "SaveVM v2 format is obsolete and don't work anymore\n");
- return -ENOTSUP;
}
- if (v != QEMU_VM_FILE_VERSION)
- return -ENOTSUP;

while ((section_type = qemu_get_byte(f)) != QEMU_VM_EOF) {
uint32_t instance_id, version_id, section_id;
@@ -1898,7 +1901,7 @@ int load_vmstate(Monitor *mon, const char *name)
monitor_printf(mon, "Could not open VM state file\n");
return -EINVAL;
}
- ret = qemu_loadvm_state(f);
+ ret = qemu_loadvm_state(f, 0);
qemu_fclose(f);
if (ret < 0) {
monitor_printf(mon, "Error %d while loading VM state\n", ret);
diff --git a/sysemu.h b/sysemu.h
index 647a468..6c1441f 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -68,7 +68,7 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile
*f, int blk_enable,
int qemu_savevm_state_iterate(Monitor *mon, QEMUFile *f);
int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f);
void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f);
-int qemu_loadvm_state(QEMUFile *f);
+int qemu_loadvm_state(QEMUFile *f, int skip_header);

void qemu_errors_to_file(FILE *fp);
void qemu_errors_to_mon(Monitor *mon);










Re: [RFC PATCH 19/20] Insert do_event_tap() to virtio-{blk, net}, comment out assert() on cpu_single_env temporally.

2010-04-22 Thread Yoshiaki Tamura

Anthony Liguori wrote:

On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:

do_event_tap() is inserted into functions which actually fire outputs.
By synchronizing the VMs before outputs are fired, we can fail over to
the receiver upon failure. To save the VM continuously, the assert() on
cpu_single_env is commented out temporarily.

Signed-off-by: Yoshiaki Tamura <tamura.yoshi...@lab.ntt.co.jp>
---
hw/virtio-blk.c | 2 ++
hw/virtio-net.c | 2 ++
qemu-kvm.c | 7 ++-
3 files changed, 10 insertions(+), 1 deletions(-)


This would be better done in the generic layers (the block and net
layers respectively). Then it would work with virtio and emulated devices.


I agree that it's better if we can handle all emulated devices at once.
However, I have a question: if we put do_event_tap() in the generic layers,
the emulated device state would already have advanced, so wouldn't it be
impossible to reproduce that I/O after a failover? If I'm wrong, I would be
happy to move it, but if I'm right, there would be two approaches to
overcome this:

1. Sync the I/O requests to the receiver, and upon failover, release those
requests before running the guest VM.
2. Copy the emulated device state before starting emulation, and once
execution reaches the generic layer, start synchronizing using the copied
state.


We should also consider the guest's VCPU state.  I previously had a similar
discussion with Avi.  I would like to reconfirm his idea too.




Regards,

Anthony Liguori


diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index b80402d..1dd1c31 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -327,6 +327,8 @@ static void virtio_blk_handle_output(VirtIODevice
*vdev, VirtQueue *vq)
.old_bs = NULL,
};

+ do_event_tap();
+
while ((req = virtio_blk_get_request(s))) {
virtio_blk_handle_request(req,mrb);
}
diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index 5c0093e..1a32bf3 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -667,6 +667,8 @@ static void virtio_net_handle_tx(VirtIODevice
*vdev, VirtQueue *vq)
{
VirtIONet *n = to_virtio_net(vdev);

+ do_event_tap();
+
if (n->tx_timer_active) {
virtio_queue_set_notification(vq, 1);
qemu_del_timer(n->tx_timer);
diff --git a/qemu-kvm.c b/qemu-kvm.c
index 1414f49..769bc95 100644
--- a/qemu-kvm.c
+++ b/qemu-kvm.c
@@ -935,8 +935,12 @@ int kvm_run(CPUState *env)

post_kvm_run(kvm, env);

+ /* TODO: we need to prevent tapping events that derived from the
+ * same VMEXIT. This needs more info from the kernel. */
#if defined(KVM_CAP_COALESCED_MMIO)
if (kvm_state->coalesced_mmio) {
+ /* prevent tapping events while handling coalesced_mmio */
+ event_tap_suspend();
struct kvm_coalesced_mmio_ring *ring =
(void *) run + kvm_state->coalesced_mmio * PAGE_SIZE;
while (ring->first != ring->last) {
@@ -946,6 +950,7 @@ int kvm_run(CPUState *env)
smp_wmb();
ring->first = (ring->first + 1) % KVM_COALESCED_MMIO_MAX;
}
+ event_tap_resume();
}
#endif

@@ -1770,7 +1775,7 @@ static void resume_all_threads(void)
{
CPUState *penv = first_cpu;

- assert(!cpu_single_env);
+ /* assert(!cpu_single_env); */

while (penv) {
penv->stop = 0;







