Re: [PATCH 00/13] KVM: MMU: fast page fault

2012-04-15 Thread Avi Kivity
On 04/13/2012 05:25 PM, Takuya Yoshikawa wrote:
 I forgot to say one important thing -- I might have given you the wrong impression.

 I am perfectly fine with your lock-less work.  It is really nice!

 The reason I say much about O(1) is that O(1) and rmap based
 GET_DIRTY_LOG have fundamentally different characteristics.

 I am thinking really seriously how to make dirty page tracking work
 well with QEMU in the future.

 For example, I am thinking about multi-threaded and fine-grained
 GET_DIRTY_LOG.

 If we use rmap based GET_DIRTY_LOG, we can restrict write protection to
 only a selected area of one guest memory slot.

 So we may be able to make each thread process dirty pages independently
 from other threads by calling GET_DIRTY_LOG for its own area.

 But I know that O(1) has its own good point.
 So please wait a bit.  I will write up what I am thinking or send patches.

 Anyway, I am looking forward to your lock-less work!
 It will improve the current GET_DIRTY_LOG performance.



Just to throw another idea into the mix - we can have write-protect-less
dirty logging, too.  Instead of write protection, drop the dirty bit,
and check it again when reading the dirty log.  It might look like we're
accessing the spte twice here, but it's actually just once - when we
check it to report for GET_DIRTY_LOG call N, we also prepare it for call
N+1.
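The call-N/call-N+1 idea above can be sketched as a toy model; the bit position and names here are invented for illustration, not taken from the actual kvm code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the hardware dirty bit in an spte (position is made up). */
#define SPTE_DIRTY (1ull << 6)

/* GET_DIRTY_LOG call N: report whether the page was written, and clear
 * the bit so the very same access also re-arms tracking for call N+1 --
 * the spte is touched only once per cycle. */
static bool test_and_clear_dirty(uint64_t *spte)
{
	bool was_dirty = (*spte & SPTE_DIRTY) != 0;

	*spte &= ~SPTE_DIRTY;
	return was_dirty;
}
```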

This doesn't work for EPT, which lacks a dirty bit.  But we can emulate
it: take a free bit and call it spte.NOTDIRTY; when it is set, we also
clear spte.WRITE, and teach the mmu that if it sees spte.NOTDIRTY it
can just set spte.WRITE and clear spte.NOTDIRTY.  Now that looks exactly
like Xiao's lockless write enabling.
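As a toy sketch of that NOTDIRTY emulation (bit positions and function names are invented here, not the real spte layout):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SPTE_WRITE    (1ull << 1)
#define SPTE_NOTDIRTY (1ull << 62)  /* a software-available bit */

/* Dirty-log side: arm tracking by setting NOTDIRTY and dropping WRITE,
 * so the next guest write faults instead of going unnoticed. */
static void arm_dirty_tracking(uint64_t *spte)
{
	*spte |= SPTE_NOTDIRTY;
	*spte &= ~SPTE_WRITE;
}

/* Fault side: if NOTDIRTY is set, the write fault just means the page
 * became dirty -- restore WRITE and clear NOTDIRTY, nothing else to do.
 * This is the step that resembles the lockless write enabling. */
static bool fixup_write_fault(uint64_t *spte)
{
	if (!(*spte & SPTE_NOTDIRTY))
		return false;          /* a real protection fault */
	*spte &= ~SPTE_NOTDIRTY;
	*spte |= SPTE_WRITE;
	return true;                   /* page is now considered dirty */
}
```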

Another note: O(1) write protection is not mutually exclusive with rmap
based write protection.  In GET_DIRTY_LOG, you write protect everything,
and proceed to write enable on faults.  When you reach the page table
level, you perform the rmap check to see if you should write protect or
not.  With role.direct=1 the check is very cheap (and sometimes you can
drop the entire page table and replace it with a large spte).

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: DOS VM problem with QEMU-KVM and newer kernels

2012-04-15 Thread Avi Kivity
On 04/12/2012 09:32 PM, Gerhard Wiesinger wrote:
 Hello,

 I'm having problems with recents kernels and qemu-kvm with a DOS VM:
 TD286
 System: Bad selector: 0007
 System: Bad selector: 0D87
 System: Bad selector: 001F
 System: Bad selector: 0007
 GP at 0020 21D4 EC 0DC4
 Error 269 loading D:\BP\BIN\TD286.EXE into extended memory

 Another 286 DOS Extender application also raises a general protection
 fault:
 GP at 0020 18A1 CODE 357C

 Doesn't depend on the used DOS memory manager and is always
 reproducible.

 Depends only on kernel version and not qemu-kvm and seabios (tried to
 bisect it without success):
 # NOK: Linux 3.3.1-3.fc16.x86_64 #1 SMP Wed Apr 4 18:08:51 UTC 2012
 x86_64 x86_64 x86_64 GNU/Linux
 # NOK: Linux 3.2.10-3.fc16.x86_64 #1 SMP Thu Mar 15 19:39:46 UTC 2012
 x86_64 x86_64 x86_64 GNU/Linux
 # OK: Linux 3.1.9-1.fc16.x86_64 #1 SMP Fri Jan 13 16:37:42 UTC 2012
 x86_64 x86_64 x86_64 GNU/Linux
 # OK: Linux 2.6.41.9-1.fc15.x86_64 #1 SMP Fri Jan 13 16:46:51 UTC 2012
 x86_64 x86_64 x86_64 GNU/Linux

 CPU is an AMD one.

 Any ideas how to fix it again?
 Any switches which might help?



The trigger is probably

 commit f1c1da2bde712812a3e0f9a7a7ebe7a916a4b5f4
 Author: Jan Kiszka <jan.kis...@siemens.com>
 Date:   Tue Oct 18 18:23:11 2011 +0200

 KVM: SVM: Keep intercepting task switching with NPT enabled
 
 AMD processors apparently have a bug in the hardware task switching
 support when NPT is enabled. If the task switch triggers a NPF, we can
 get wrong EXITINTINFO along with that fault. On resume, spurious
 exceptions may then be injected into the guest.
 
 We were able to reproduce this bug when our guest triggered #SS
 and the
 handler were supposed to run over a separate task with not yet touched
 stack pages.
 
 Work around the issue by continuing to emulate task switches even in
 NPT mode.
 
 Signed-off-by: Jan Kiszka <jan.kis...@siemens.com>
 Signed-off-by: Marcelo Tosatti <mtosa...@redhat.com>

Although it's not the patch's direct fault - it simply exposed an
existing bug in kvm.

Things to try:
- revert the patch with a newer kernel
- try 3.4-rc2 which has some task switch fixes from Kevin; if you want a
Fedora kernel, use rawhide's [2]
- post traces [1]

Jan, Joerg, was an AMD erratum published for the bug?

[1] http://www.linux-kvm.org/page/Tracing
[2]
http://mirrors.kernel.org/fedora/development/rawhide/x86_64/os/Packages/k/kernel-3.4.0-0.rc2.git2.1.fc18.x86_64.rpm



qemu-kvm fails on pax kernel

2012-04-15 Thread Jens Kasten

Hi list,

I use the PaX patch on kernel 3.2.14.
My qemu-kvm guest runs on a Gentoo Hardened host.
When I try to start my kvm guest I get:


PAX: size overflow detected in function wrmsr_interception 
arch/x86/kvm/svm.c:3115

Pid: 3565, comm: kvm_webserver Not tainted 3.2.14-rsbac-2.57-sec #5
Call Trace:
 [8115a407] ? 0x8115a407
 [810344f4] ? 0x810344f4
 [81032fd0] ? 0x81032fd0
 [81016f3c] ? 0x81016f3c
 [8102f215] ? 0x8102f215
 [810019c2] ? 0x810019c2
 [8116965b] ? 0x8116965b
 [81169f3b] ? 0x81169f3b
 [8116a60b] ? 0x8116a60b
 [81542226] ? 0x81542226
 [8154224d] ? 0x8154224d

--
Best regards,

Jens Kasten


http://www.kasten-edv.de



Re: Linux Crash Caused By KVM?

2012-04-15 Thread Avi Kivity
On 04/11/2012 09:59 PM, Eric Northup wrote:
 On Wed, Apr 11, 2012 at 7:45 AM, Avi Kivity a...@redhat.com wrote:
  On 04/11/2012 05:11 AM, Peijie Yu wrote:
  For this problem, I found that the panic is caused by
   BUG_ON(in_nmi()), which means an NMI happened during another NMI context.
   But I checked the Intel technical manual, which says "While an NMI
   interrupt handler is executing, the processor disables additional
   calls to the NMI handler until the next IRET instruction is executed."
   So, how did this happen?
 
 
  The NMI path for kvm is different; the processor exits from the guest
  with NMIs blocked, then executes kvm code until it issues int $2 in
  vmx_complete_interrupts(). If an IRET is executed in this path, then
  NMIs will be unblocked and nested NMIs may occur.
 
  One way this can happen is if we access the vmap area and incur a fault,
  between the VMEXIT and invoking the NMI handler. Or perhaps the NMI
  handler itself generates a fault. Or we have a debug exception in that path.
 
  Is this reproducible?

 As an FYI, there have been BIOSes whose SMI handlers ran IRETs.  So
 the NMI blocking can go away surprisingly.

 See "29.8 NMI handling while in SMM" in the Intel SDM vol. 3.

Interesting, thanks.

From 29.8 it looks like you don't even need to issue IRET within SMM,
since SMM doesn't save/restore the NMI blocking flag.

However, this being a server, and the crash being in kvm code, I don't
think we can rule out that this is a kvm bug.



Re: qemu-kvm fails on pax kernel

2012-04-15 Thread Avi Kivity
On 04/15/2012 12:42 PM, Jens Kasten wrote:
 Hi list,

 I use the PAX patch in the kernel 3.2.14.
 My qemu-kvm guest running on a gentoo hardend as host.
 When I try to start my kvm guest I get:


 PAX: size overflow detected in function wrmsr_interception
 arch/x86/kvm/svm.c:3115
 Pid: 3565, comm: kvm_webserver Not tainted 3.2.14-rsbac-2.57-sec #5
 Call Trace:
  [8115a407] ? 0x8115a407
  [810344f4] ? 0x810344f4
  [81032fd0] ? 0x81032fd0
  [81016f3c] ? 0x81016f3c
  [8102f215] ? 0x8102f215
  [810019c2] ? 0x810019c2
  [8116965b] ? 0x8116965b
  [81169f3b] ? 0x81169f3b
  [8116a60b] ? 0x8116a60b
  [81542226] ? 0x81542226
  [8154224d] ? 0x8154224d


Out-of-tree patches are not supported.  But if you provide a decoded
trace (with symbols instead of numbers), maybe we can learn something.



Re: [PATCH v2 04/16] KVM: MMU: return bool in __rmap_write_protect

2012-04-15 Thread Avi Kivity
On 04/14/2012 05:00 AM, Takuya Yoshikawa wrote:
 On Fri, 13 Apr 2012 18:11:13 +0800
 Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com wrote:

  The return value of __rmap_write_protect is either 1 or 0; use
  true/false instead.

 ...

  @@ -1689,7 +1690,7 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu,
  
  kvm_mmu_pages_init(parent, parents, pages);
  while (mmu_unsync_walk(parent, pages)) {
  -   int protected = 0;
  +   bool protected = false;
  
  for_each_sp(pages, sp, parents, i)
  protected |= rmap_write_protect(vcpu->kvm, sp->gfn);

 Isn't this the reason we prefer int to bool?

 Not sure people like to use |= with boolean.


Why not?
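For what it's worth, `|=` on a C99 bool is well defined: the result of the bitwise OR is converted back to `_Bool` (0 or 1), so accumulating "did any iteration protect something?" works as expected. A minimal illustration (`needs_protect` is a stand-in for `rmap_write_protect`, not the real function):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-item check standing in for rmap_write_protect(). */
static bool needs_protect(int i)
{
	return i == 2;
}

/* Mirrors the mmu_sync_children loop: OR-accumulate a bool result. */
static bool any_protected(int n)
{
	bool protected = false;
	int i;

	for (i = 0; i < n; i++)
		protected |= needs_protect(i);
	return protected;
}
```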



Re: [PATCH v2 10/16] KVM: MMU: fast check whether page is writable

2012-04-15 Thread Avi Kivity
On 04/13/2012 01:14 PM, Xiao Guangrong wrote:
 Using bit 1 (PTE_LIST_WP_BIT) in rmap to store the write-protect status,
 to avoid unnecessary shadow page walking

 Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
 ---
  arch/x86/kvm/mmu.c |   40 ++--
  1 files changed, 34 insertions(+), 6 deletions(-)

 diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
 index 0c6e92d..8b71908 100644
 --- a/arch/x86/kvm/mmu.c
 +++ b/arch/x86/kvm/mmu.c
 @@ -796,7 +796,9 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t 
 large_gfn)
   return level - 1;
  }

 -#define PTE_LIST_DESC		(0x1ull)
 +#define PTE_LIST_DESC_BIT	0
 +#define PTE_LIST_WP_BIT	1
 +#define PTE_LIST_DESC		(1 << PTE_LIST_DESC_BIT)
  #define PTE_LIST_FLAG_MASK	(0x3ull)

  static void
 @@ -1067,6 +1069,12 @@ static bool rmap_can_add(struct kvm_vcpu *vcpu)
   return mmu_memory_cache_free_objects(cache);
  }

 +static void host_page_write_protect(u64 *spte, unsigned long *rmapp)
 +{
 + if (!(*spte & SPTE_HOST_WRITEABLE))
 + __test_and_set_bit(PTE_LIST_WP_BIT, rmapp);
 +}


Why is this needed, in addition to spte.SPTE_WRITE_PROTECT?



Re: [PATCH v3] kvm tools: Make raw block device work

2012-04-15 Thread Pekka Enberg
On Fri, 13 Apr 2012, Asias He wrote:
 From: Asias He asias.he...@gmail.com
 
 Previously, we used the mmaped host root partition as the guest's root
 filesystem. Since a virtio-9p based root filesystem is now supported,
 the mmaped host root partition approach is no longer used.
 
 It is useful to use a raw block device as the guest's disk backend for
 some users, e.g. to bypass the host's fs layer.
 
 This patch makes raw block devices work as disk images; users can do
 read/write on a raw block device by using DISK_IMAGE_REGULAR instead of
 DISK_IMAGE_MMAP for block devices.
 
 Changes in v3:
  - Add fclose.
 
 Changes in v2:
  - Check whether the block device is mounted before using it.
 
 Signed-off-by: Asias He asias.he...@gmail.com

Applied, thanks!


Re: [PATCH 1/2] kvm tools: Fix sdl hang

2012-04-15 Thread Pekka Enberg
On Fri, 13 Apr 2012, Asias He wrote:
 From: Asias He asias.he...@gmail.com
 
 Commit b4a932d175c6aa975c456e9b05339aa069c961cb sets sdl's .start
 op to sdl__stop, which means SDL never starts.
 
 Fix it up.
 
 Signed-off-by: Asias He asias.he...@gmail.com

Both patches applied! Thnx!


[PATCHv1 dont apply] RFC: kvm eoi PV using shared memory

2012-04-15 Thread Michael S. Tsirkin
I got lots of useful feedback from v0 so I thought
sending out a brain dump again would be a good idea.
This is mainly to show how I'm trying to address the
comments I got from the previous round. Flames/feedback
are welcome!

Changes from v0:
- Tweaked setup MSRs a bit
- Keep ISR bit set. Before reading ISR, test EOI in guest memory and clear
- Check priority for nested interrupts, we can
  enable optimization if new one is high priority
- Disable optimization for any interrupt handled by ioapic
  (This is because ioapic handles notifiers and pic and it generally gets messy.
   It's possible that we can optimize some ioapic-handled
   edge interrupts - but is it worth it?)
- A less intrusive change for guest apic (0 overhead without kvm)

---

I took a stab at implementing PV EOI using shared memory.
This should reduce the number of exits an interrupt
causes as much as by half.

A partially complete draft for both host and guest parts
is below.

The idea is simple: there's a bit, per APIC, in guest memory,
that tells the guest that it does not need EOI.
We set it before injecting an interrupt and clear
before injecting a nested one. Guest tests it using
a test and clear operation - this is necessary
so that host can detect interrupt nesting -
and if set, it can skip the EOI MSR.

There's a new MSR to set the address of said register
in guest memory. Otherwise not much changed:
- Guest EOI is not required
- Register is tested and ISR is automatically cleared before injection
 
qemu support is incomplete - mostly for feature negotiation.
Need to add some trace points to enable profiling.
No testing was done beyond compiling the kernel.

Signed-off-by: Michael S. Tsirkin m...@redhat.com

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index b97596e..c9c70ea 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -26,11 +26,13 @@
 #if __GNUC__ < 4 || (__GNUC__ == 4 && __GNUC_MINOR__ < 1)
 /* Technically wrong, but this avoids compilation errors on some gcc
versions. */
-#define BITOP_ADDR(x) "=m" (*(volatile long *) (x))
+#define BITOP_ADDR_CONSTRAINT "=m"
 #else
-#define BITOP_ADDR(x) "+m" (*(volatile long *) (x))
+#define BITOP_ADDR_CONSTRAINT "+m"
 #endif
 
+#define BITOP_ADDR(x) BITOP_ADDR_CONSTRAINT (*(volatile long *) (x))
+
 #define ADDR   BITOP_ADDR(addr)
 
 /*
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e216ba0..3d09ef1 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -481,6 +481,12 @@ struct kvm_vcpu_arch {
u64 length;
u64 status;
} osvw;
+
+   struct {
+   u64 msr_val;
+   struct gfn_to_hva_cache data;
+   bool pending;
+   } eoi;
 };
 
 struct kvm_lpage_info {
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 734c376..164376a 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -22,6 +22,7 @@
 #define KVM_FEATURE_CLOCKSOURCE2	3
 #define KVM_FEATURE_ASYNC_PF		4
 #define KVM_FEATURE_STEAL_TIME		5
+#define KVM_FEATURE_EOI		6
 
 /* The last 8 bits are used to indicate how to interpret the flags field
  * in pvclock structure. If no bits are set, all flags are ignored.
@@ -37,6 +38,8 @@
 #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
 #define MSR_KVM_ASYNC_PF_EN 0x4b564d02
 #define MSR_KVM_STEAL_TIME  0x4b564d03
+#define MSR_KVM_EOI_EN  0x4b564d04
+#define MSR_KVM_EOI_DISABLED 0x0L
 
 struct kvm_steal_time {
__u64 steal;
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index b8ba6e4..450aae4 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -39,6 +39,7 @@
 #include <asm/desc.h>
 #include <asm/tlbflush.h>
 #include <asm/idle.h>
+#include <asm/apic.h>
 
 static int kvmapf = 1;
 
@@ -290,6 +291,33 @@ static void kvm_register_steal_time(void)
cpu, __pa(st));
 }
 
+/* TODO: needs to be early? aligned? */
+static DEFINE_EARLY_PER_CPU(u8, apic_eoi, 0);
+
+/* Our own copy of __test_and_clear_bit to make sure
+ * it is done with a single instruction */
+static inline int kvm_test_and_clear_bit(int nr, volatile u8* addr)
+{
+   int oldbit;
+
+	asm volatile("btr %2,%1\n\t"
+		     "sbb %0,%0"
+		     : "=r" (oldbit),
+		       BITOP_ADDR_CONSTRAINT (*(volatile u8 *) (addr))
+		     : "Ir" (nr));
+   return oldbit;
+}
+
+static void (*kvm_guest_native_apic_write)(u32 reg, u32 val);
+static void kvm_guest_apic_write(u32 reg, u32 val)
+{
+	if (reg == APIC_EOI &&
+	    kvm_test_and_clear_bit(0, &__get_cpu_var(apic_eoi)))
+   return;
+
+   kvm_guest_native_apic_write(reg, val);
+}
+
 void __cpuinit kvm_guest_cpu_init(void)
 {
if (!kvm_para_available())
@@ -307,11 +335,18 @@ void __cpuinit kvm_guest_cpu_init(void)
   

[PATCH 08/20] target-ppc: Add hooks for handling tcg and kvm limitations

2012-04-15 Thread Andreas Färber
From: David Gibson da...@gibson.dropbear.id.au

On target-ppc, our table of CPU types and features encodes the features as
found on the hardware, regardless of whether these features are actually
usable under TCG or KVM.  We already have cases where the information from
the cpu table must be fixed up to account for limitations in the emulation
method we're using.  e.g. TCG does not support the DFP and VSX instructions
and KVM needs different numbering of the CPUs in order to tell it the
correct thread to core mappings.

This patch cleans up these hacks to handle emulation limitations by
consolidating them into a pair of functions specifically for the purpose.

Signed-off-by: David Gibson da...@gibson.dropbear.id.au
[AF: Style and typo fixes, rename new functions and drop ppc_def_t arg]
Signed-off-by: Andreas Färber afaer...@suse.de
---
 target-ppc/helper.c |9 ---
 target-ppc/kvm.c|   14 +++
 target-ppc/kvm_ppc.h|5 
 target-ppc/translate_init.c |   51 +-
 4 files changed, 54 insertions(+), 25 deletions(-)

diff --git a/target-ppc/helper.c b/target-ppc/helper.c
index f61b8b2..b34dcbe 100644
--- a/target-ppc/helper.c
+++ b/target-ppc/helper.c
@@ -3198,15 +3198,6 @@ CPUPPCState *cpu_ppc_init (const char *cpu_model)
 if (tcg_enabled()) {
 ppc_translate_init();
 }
-/* Adjust cpu index for SMT */
-#if !defined(CONFIG_USER_ONLY)
-if (kvm_enabled()) {
-int smt = kvmppc_smt_threads();
-
-env->cpu_index = (env->cpu_index / smp_threads) * smt
-+ (env->cpu_index % smp_threads);
-}
-#endif /* !CONFIG_USER_ONLY */
 env-cpu_model_str = cpu_model;
 cpu_ppc_register_internal(env, def);
 
diff --git a/target-ppc/kvm.c b/target-ppc/kvm.c
index d929213..c09cc39 100644
--- a/target-ppc/kvm.c
+++ b/target-ppc/kvm.c
@@ -27,6 +27,7 @@
 #include "kvm.h"
 #include "kvm_ppc.h"
 #include "cpu.h"
+#include "cpus.h"
 #include "device_tree.h"
 #include "hw/sysbus.h"
 #include "hw/spapr.h"
@@ -938,6 +939,19 @@ const ppc_def_t *kvmppc_host_cpu_def(void)
 return spec;
 }
 
+int kvmppc_fixup_cpu(CPUPPCState *env)
+{
+int smt;
+
+/* Adjust cpu index for SMT */
+smt = kvmppc_smt_threads();
+env->cpu_index = (env->cpu_index / smp_threads) * smt
++ (env->cpu_index % smp_threads);
+
+return 0;
+}
+
+
 bool kvm_arch_stop_on_emulation_error(CPUPPCState *env)
 {
 return true;
diff --git a/target-ppc/kvm_ppc.h b/target-ppc/kvm_ppc.h
index 8f1267c..34ecad3 100644
--- a/target-ppc/kvm_ppc.h
+++ b/target-ppc/kvm_ppc.h
@@ -29,6 +29,7 @@ void *kvmppc_create_spapr_tce(uint32_t liobn, uint32_t 
window_size, int *pfd);
 int kvmppc_remove_spapr_tce(void *table, int pfd, uint32_t window_size);
 #endif /* !CONFIG_USER_ONLY */
 const ppc_def_t *kvmppc_host_cpu_def(void);
+int kvmppc_fixup_cpu(CPUPPCState *env);
 
 #else
 
@@ -95,6 +96,10 @@ static inline const ppc_def_t *kvmppc_host_cpu_def(void)
 return NULL;
 }
 
+static inline int kvmppc_fixup_cpu(CPUPPCState *env)
+{
+return -1;
+}
 #endif
 
 #ifndef CONFIG_KVM
diff --git a/target-ppc/translate_init.c b/target-ppc/translate_init.c
index b1f8785..067e07e 100644
--- a/target-ppc/translate_init.c
+++ b/target-ppc/translate_init.c
@@ -9889,6 +9889,28 @@ static int gdb_set_spe_reg(CPUPPCState *env, uint8_t 
*mem_buf, int n)
 return 0;
 }
 
+static int ppc_fixup_cpu(CPUPPCState *env)
+{
+/* TCG doesn't (yet) emulate some groups of instructions that
+ * are implemented on some otherwise supported CPUs (e.g. VSX
+ * and decimal floating point instructions on POWER7).  We
+ * remove unsupported instruction groups from the cpu state's
+ * instruction masks and hope the guest can cope.  For at
+ * least the pseries machine, the unavailability of these
+ * instructions can be advertised to the guest via the device
+ * tree. */
+if ((env->insns_flags & ~PPC_TCG_INSNS)
+|| (env->insns_flags2 & ~PPC_TCG_INSNS2)) {
+fprintf(stderr, "Warning: Disabling some instructions which are not "
+"emulated by TCG (0x%" PRIx64 ", 0x%" PRIx64 ")\n",
+env->insns_flags & ~PPC_TCG_INSNS,
+env->insns_flags2 & ~PPC_TCG_INSNS2);
+}
+env->insns_flags &= PPC_TCG_INSNS;
+env->insns_flags2 &= PPC_TCG_INSNS2;
+return 0;
+}
+
 int cpu_ppc_register_internal (CPUPPCState *env, const ppc_def_t *def)
 {
 env->msr_mask = def->msr_mask;
@@ -9897,25 +9919,22 @@ int cpu_ppc_register_internal (CPUPPCState *env, const 
ppc_def_t *def)
 env->bus_model = def->bus_model;
 env->insns_flags = def->insns_flags;
 env->insns_flags2 = def->insns_flags2;
-if (!kvm_enabled()) {
-/* TCG doesn't (yet) emulate some groups of instructions that
- * are implemented on some otherwise supported CPUs (e.g. VSX
- * and decimal floating point instructions on POWER7).  We
- * remove unsupported instruction groups from the cpu state's
- * instruction masks 

Re: DOS VM problem with QEMU-KVM and newer kernels

2012-04-15 Thread Gerhard Wiesinger

On 15.04.2012 11:44, Avi Kivity wrote:

On 04/12/2012 09:32 PM, Gerhard Wiesinger wrote:

Hello,

I'm having problems with recents kernels and qemu-kvm with a DOS VM:
TD286
System: Bad selector: 0007
System: Bad selector: 0D87
System: Bad selector: 001F
System: Bad selector: 0007
GP at 0020 21D4 EC 0DC4
Error 269 loading D:\BP\BIN\TD286.EXE into extended memory

Another 286 DOS Extender application also raises a general protection
fault:
GP at 0020 18A1 CODE 357C

Doesn't depend on the used DOS memory manager and is always
reproducible.

Depends only on kernel version and not qemu-kvm and seabios (tried to
bisect it without success):
# NOK: Linux 3.3.1-3.fc16.x86_64 #1 SMP Wed Apr 4 18:08:51 UTC 2012
x86_64 x86_64 x86_64 GNU/Linux
# NOK: Linux 3.2.10-3.fc16.x86_64 #1 SMP Thu Mar 15 19:39:46 UTC 2012
x86_64 x86_64 x86_64 GNU/Linux
# OK: Linux 3.1.9-1.fc16.x86_64 #1 SMP Fri Jan 13 16:37:42 UTC 2012
x86_64 x86_64 x86_64 GNU/Linux
# OK: Linux 2.6.41.9-1.fc15.x86_64 #1 SMP Fri Jan 13 16:46:51 UTC 2012
x86_64 x86_64 x86_64 GNU/Linux

CPU is an AMD one.

Any ideas how to fix it again?
Any switches which might help?



The trigger is probably


commit f1c1da2bde712812a3e0f9a7a7ebe7a916a4b5f4
Author: Jan Kiszka <jan.kis...@siemens.com>
Date:   Tue Oct 18 18:23:11 2011 +0200

 KVM: SVM: Keep intercepting task switching with NPT enabled

 AMD processors apparently have a bug in the hardware task switching
 support when NPT is enabled. If the task switch triggers a NPF, we can
 get wrong EXITINTINFO along with that fault. On resume, spurious
 exceptions may then be injected into the guest.

 We were able to reproduce this bug when our guest triggered #SS
and the
 handler were supposed to run over a separate task with not yet touched
 stack pages.

 Work around the issue by continuing to emulate task switches even in
 NPT mode.

 Signed-off-by: Jan Kiszka <jan.kis...@siemens.com>
 Signed-off-by: Marcelo Tosatti <mtosa...@redhat.com>

Although it's not the patch's direct fault - it simply exposed an
existing bug in kvm.

Things to try:
- revert the patch with a newer kernel
- try 3.4-rc2 which has some task switch fixes from Kevin; if you want a
Fedora kernel, use rawhide's [2]
- post traces [1]

Jan, Joerg, was an AMD erratum published for the bug?

[1] http://www.linux-kvm.org/page/Tracing
[2]
http://mirrors.kernel.org/fedora/development/rawhide/x86_64/os/Packages/k/kernel-3.4.0-0.rc2.git2.1.fc18.x86_64.rpm



Hello Avi,

Tried newer kernel since this version is no longer available:
http://mirrors.kernel.org/fedora/development/rawhide/x86_64/os/Packages/k/kernel-3.4.0-0.rc2.git3.1.fc18.x86_64.rpm

But I wasn't successful. Still the same GP fault (but with 18A2 instead
of 18A1):

GP at 0020 18A2 CODE 357C

yum install asciidoc udis86 udis86-devel
git clone 
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/trace-cmd.git

cd trace-cmd
make
./trace-cmd record -b 2 -e kvm
./trace-cmd report

Very long output, what should I grep/trigger for?

Thnx so far.

BTW: Where can I find old kernels like (removed on upgrade :-( ):
kernel-2.6.41.9-1.fc15.x86_64.rpm
kernel-3.1.9-1.fc16.x86_64.rpm
kernel-3.2.10-3.fc16.x86_64.rpm
kernel-debug-2.6.41.9-1.fc15.x86_64

Ciao,
Gerhard



Re: Performance of 40-way guest running 2.6.32-220 (RHEL6.2) vs. 3.3.1 OS

2012-04-15 Thread Chegu Vinod
Rik van Riel riel at redhat.com writes:

 
 On 04/11/2012 01:21 PM, Chegu Vinod wrote:
 
  Hello,
 
  While running an AIM7 (workfile.high_systime) in a single 40-way (or a 
single
  60-way KVM guest) I noticed pretty bad performance when the guest was booted
  with 3.3.1 kernel when compared to the same guest booted with 2.6.32-220
  (RHEL6.2) kernel.
 
  For the 40-way Guest-RunA (2.6.32-220 kernel) performed nearly 9x better 
than
  the Guest-RunB (3.3.1 kernel). In the case of 60-way guest run the older 
guest
  kernel was nearly 12x better !
 
  Turned on function tracing and found that there appears to be more time 
being
  spent around the lock code in the 3.3.1 guest when compared to the
  2.6.32-220 guest.
 
 Looks like you may be running into the ticket spinlock
 code. During the early RHEL 6 days, Gleb came up with a
 patch to automatically disable ticket spinlocks when
 running inside a KVM guest.
 

Thanks for the pointer. 
Perhaps that is the issue.  
I did look up that old discussion thread.


 IIRC that patch got rejected upstream at the time,
 with upstream developers preferring to wait for a
 better solution.
 
 If such a better solution is not on its way upstream
 now (two years later), maybe we should just merge
 Gleb's patch upstream for the time being?



Also noticed a recent discussion thread (that originated from the Xen context)

http://article.gmane.org/gmane.linux.kernel.virtualization/15078

Not yet sure if this recent discussion is also in some way related to
the older one initiated by Gleb.

Thanks
Vinod





Re: [PATCH v2 10/16] KVM: MMU: fast check whether page is writable

2012-04-15 Thread Xiao Guangrong
On 04/15/2012 11:16 PM, Avi Kivity wrote:

 On 04/13/2012 01:14 PM, Xiao Guangrong wrote:
  Using bit 1 (PTE_LIST_WP_BIT) in rmap to store the write-protect status,
  to avoid unnecessary shadow page walking

 Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
 ---
  arch/x86/kvm/mmu.c |   40 ++--
  1 files changed, 34 insertions(+), 6 deletions(-)

 diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
 index 0c6e92d..8b71908 100644
 --- a/arch/x86/kvm/mmu.c
 +++ b/arch/x86/kvm/mmu.c
 @@ -796,7 +796,9 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t 
 large_gfn)
  return level - 1;
  }

  -#define PTE_LIST_DESC		(0x1ull)
  +#define PTE_LIST_DESC_BIT	0
  +#define PTE_LIST_WP_BIT	1
  +#define PTE_LIST_DESC		(1 << PTE_LIST_DESC_BIT)
  #define PTE_LIST_FLAG_MASK  (0x3ull)

  static void
 @@ -1067,6 +1069,12 @@ static bool rmap_can_add(struct kvm_vcpu *vcpu)
  return mmu_memory_cache_free_objects(cache);
  }

 +static void host_page_write_protect(u64 *spte, unsigned long *rmapp)
 +{
  +if (!(*spte & SPTE_HOST_WRITEABLE))
 +__test_and_set_bit(PTE_LIST_WP_BIT, rmapp);
 +}

 
 Why is this needed, in addition to spte.SPTE_WRITE_PROTECT?
 


It is used to avoid unnecessary overhead in the fast page fault path when
KSM is enabled. On the fast check path, we can see that the gfn is
write-protected by the host, so the fast page fault path is not taken.
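The fast check described here can be sketched as follows; the rmap layout is simplified to a single word for illustration, and only the macro name comes from the patch:

```c
#include <assert.h>
#include <stdbool.h>

/* Bit 1 of the rmap head word caches "this gfn was write-protected by
 * the host (e.g. by KSM)", letting the fast page fault path bail out
 * without walking shadow pages. */
#define PTE_LIST_WP_BIT 1

static bool gfn_host_write_protected(unsigned long rmapp)
{
	return (rmapp >> PTE_LIST_WP_BIT) & 1;
}
```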



Re: [PATCH v2 03/16] KVM: MMU: properly assert spte on rmap walking path

2012-04-15 Thread Xiao Guangrong
On 04/14/2012 10:15 AM, Takuya Yoshikawa wrote:

 On Fri, 13 Apr 2012 18:10:45 +0800
 Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com wrote:
 
  static u64 *rmap_get_next(struct rmap_iterator *iter)
  {
 +u64 *sptep = NULL;
 +
  	if (iter->desc) {
  		if (iter->pos < PTE_LIST_EXT - 1) {
  -			u64 *sptep;
  -
  			++iter->pos;
  			sptep = iter->desc->sptes[iter->pos];
  			if (sptep)
  -				return sptep;
  +				goto exit;
  		}
 
  		iter->desc = iter->desc->more;
  @@ -1028,11 +1036,14 @@ static u64 *rmap_get_next(struct rmap_iterator *iter)
  		if (iter->desc) {
  			iter->pos = 0;
  			/* desc->sptes[0] cannot be NULL */
  -			return iter->desc->sptes[iter->pos];
  +			sptep = iter->desc->sptes[iter->pos];
  +			goto exit;
  		}
  	}
 
  -	return NULL;
  +exit:
  +	WARN_ON(sptep && !is_shadow_present_pte(*sptep));
  +	return sptep;
   }
 
  This will probably again force a rmap_get_next function call even with
  EPT/NPT:
  the CPU cannot skip it by branch prediction.
 

No, EPT/NPT also needs it.



Re: [PATCH v2 05/16] KVM: MMU: abstract spte write-protect

2012-04-15 Thread Xiao Guangrong
On 04/14/2012 10:26 AM, Takuya Yoshikawa wrote:

 On Fri, 13 Apr 2012 18:11:45 +0800
 Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com wrote:
 
 +/* Return true if the spte is dropped. */
 
 Return value does not correspond with the function name so it is confusing.
 


That is why I put the comment there.

 People may think that true means write protection has been done.
 


Do you have a better name?



Re: [PATCH v2 10/16] KVM: MMU: fast check whether page is writable

2012-04-15 Thread Xiao Guangrong
On 04/14/2012 11:01 AM, Takuya Yoshikawa wrote:

 On Fri, 13 Apr 2012 18:14:26 +0800
 Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com wrote:
 
  Using bit 1 (PTE_LIST_WP_BIT) in rmap to store the write-protect status,
  to avoid unnecessary shadow page walking

 Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
 ---
  arch/x86/kvm/mmu.c |   40 ++--
  1 files changed, 34 insertions(+), 6 deletions(-)

 diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
 index 0c6e92d..8b71908 100644
 --- a/arch/x86/kvm/mmu.c
 +++ b/arch/x86/kvm/mmu.c
 @@ -796,7 +796,9 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn)
  return level - 1;
  }

 -#define PTE_LIST_DESC   (0x1ull)
 +#define PTE_LIST_DESC_BIT   0
 +#define PTE_LIST_WP_BIT 1
 
 _BIT ?


What is the problem?

 
 @@ -2291,9 +2310,15 @@ static int mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
  {
  struct kvm_mmu_page *s;
  struct hlist_node *node;
 +unsigned long *rmap;
  bool need_unsync = false;

 +rmap = gfn_to_rmap(vcpu->kvm, gfn, PT_PAGE_TABLE_LEVEL);
 
 Please use consistent variable names.
 
 In other parts of this patch, you are using rmapp for this.


Okay.



Re: [PATCH v2 07/16] KVM: MMU: introduce for_each_pte_list_spte

2012-04-15 Thread Xiao Guangrong
On 04/14/2012 10:44 AM, Takuya Yoshikawa wrote:

 On Fri, 13 Apr 2012 18:12:41 +0800
 Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com wrote:
 
 It is used to walk all the sptes of the specified pte_list; after
 this, the code of pte_list_walk can be removed

 And it can restart the walk automatically if the spte is zapped
 
 Well, I want to ask two questions:
 
   - why do you prefer pte_list_* naming to rmap_*?
 (not a big issue but just curious)


pte_list is a common infrastructure for both parent-list and rmap.

   - Are you sure the indirection introduced by this patch will
 not cause any regression?
 (not restricted to get_dirty)
 


I tested it with kernbench; no regression was found.

It is not a problem since the iter and spte should be in the cache.



Re: [PATCH v2] KVM: Avoid zapping unrelated shadows in __kvm_set_memory_region()

2012-04-15 Thread Xiao Guangrong
On 04/14/2012 09:12 AM, Takuya Yoshikawa wrote:

 Hi,
 
 On Wed, 11 Apr 2012 11:11:07 +0800
 Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com wrote:
 
  restart:
 -	list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link)
 -		if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
 -			goto restart;
 +	zapped = 0;
 +	list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link) {
 +		if ((slot >= 0) && !test_bit(slot, sp->slot_bitmap))
 +			continue;
 +
 +		zapped |= kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);


 You should goto restart here like the original code; also, the safe version of
 list_for_each is not needed.
 
 
 Thank you for looking into this part.
 
 I understand that we can eliminate _safe in the original implementation.
 
 
 Can you tell me the reason why we should goto restart immediately here?
 For performance, or a correctness issue?
 


kvm_mmu_prepare_zap_page may remove many sps from the kvm->arch.active_mmu_pages
list, which means the next node cached by list_for_each_entry_safe can become invalid.



Re: [PATCH v2 00/16] KVM: MMU: fast page fault

2012-04-15 Thread Xiao Guangrong
On 04/14/2012 11:37 AM, Takuya Yoshikawa wrote:

 On Fri, 13 Apr 2012 18:05:29 +0800
 Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com wrote:
 
 Thanks to Avi's and Marcelo's review, I have simplified the whole thing
 in this version:
 - it only fixes the page fault with PFEC.P = 1 && PFEC.W = 0, which means
   the unlocked set_spte path can be dropped.

 - it only fixes the page fault caused by dirty-log

 In this version, all the information we need is from spte, the
 SPTE_ALLOW_WRITE bit and SPTE_WRITE_PROTECT bit:
- SPTE_ALLOW_WRITE is set if the gpte is writable and the pfn pointed
  by the spte is writable on host.
- SPTE_WRITE_PROTECT is set if the spte is write-protected by shadow
  page table protection.

 All these bits can be protected by cmpxchg; now, everything is far simpler
 than before. :)
 
 Well, could you remove cleanup patches not needed for lock-less from
 this patch series?
 
 I want to see them separately.
 
 Or was everything needed for lock-less?
 


The cleanup patches do the preparation work for the fast page fault, so the later
patches are easy to implement; for example, the for_each_spte_rmap patches let the
"store more bits in rmap" patch make only small changes, since spte_list_walk is removed.

 Performance test:

 autotest migration:
 (Host: Intel(R) Xeon(R) CPU   X5690  @ 3.47GHz * 12 + 32G)
 
 Please explain what this test result means, not just numbers.
 
 There are many aspects:
   - how fast migration can converge/complete
   - how fast programs inside the guest can run during migration:
 -- throughput
 -- latency
   - ...
 


The result is rather straightforward; I think no explanation is needed.

 I think lock-less will reduce latency a lot, but I am not sure about
 convergence: why did it become faster?
 


Is it hard to understand? It is faster since it can run in parallel.

 - For ept:

 Before:
 smp2.Fedora.16.64.migrate
 Times   .unix  .with_autotest.dbench.unix total
  1   104   214 323
  2   68238 310
  3   68242 314

 After:
 smp2.Fedora.16.64.migrate
 Times   .unix  .with_autotest.dbench.unix total
  1   101   190 295
  2   67188 259
  3   66217 289

 
 As discussed in the v1 threads, the main goal of this lock-less work should be
 the elimination of mmu_lock contention
 
 So what we should measure is latency.
 


I think the test of migration time is enough to see the effect.
