Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)

2010-12-03 Thread Avi Kivity

On 12/02/2010 10:51 PM, Anthony Liguori wrote:

VCPU in HLT state only allows injection of certain events that
would be delivered on HLT. #PF is not one of them.


But you can't inject an exception into a guest while the VMCS is 
active, can you? 


No, but this is irrelevant.

So the guest takes an exit while in the hlt instruction but that's no 
different than if the guest has been interrupted because of hlt exiting.


hlt exiting doesn't leave the vcpu in the halted state (since hlt has not 
been executed).  So currently we never see a vcpu in the halted state.





You'd have to handle this situation on event injection, vmentry fails
otherwise. Or perhaps clear HLT state on vmexit and vmentry.


So this works today because on a hlt exit, emulate_halt() will clear 
the HLT state, which then puts the vcpu into a state where it 
can receive an exception injection?


The halt state is never entered.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)

2010-12-03 Thread Avi Kivity

On 12/02/2010 05:23 PM, Anthony Liguori wrote:

On 12/02/2010 08:39 AM, lidong chen wrote:
In certain use-cases, we want to allocate guests fixed time slices where
idle guest cycles leave the machine idling.

I could not understand why this is needed. Can you explain in more detail?


If you run 4 guests on a CPU, and they're all trying to consume 100% 
CPU, all things being equal, you'll get ~25% CPU for each guest.


However, if one guest is idle, you'll get something like 1% 32% 33% 
32%.  This characteristic is usually desirable because it increases 
aggregate throughput, but in some circumstances, determinism is more 
desirable than aggregate throughput.


This patch essentially makes guest execution non-work conserving by 
making it appear to the scheduler that each guest wants 100% CPU even 
though they may be idling.


That means that regardless of what each guest is doing, if you have 
four guests on one CPU, each will get ~25% CPU[1].




What if one of the guests crashes qemu or invokes a powerdown?  Suddenly 
the others get 33% each (with 1% going to my secret round-up account).  
Doesn't seem like a reliable way to limit cpu.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)

2010-12-03 Thread Avi Kivity

On 12/02/2010 09:14 PM, Chris Wright wrote:

Perhaps it should be a VM level option.  And then invert the notion.
Create one idle domain w/out hlt trap.  Give that VM a vcpu per pcpu
(pin in place probably).  And have that VM do nothing other than hlt.
Then it's always runnable according to scheduler, and can consume the
extra work that CFS wants to give away.


What's the difference between this and the Linux idle threads?

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH 2/6] KVM: SVM: Add manipulation functions for CRx intercepts

2010-12-03 Thread Roedel, Joerg
On Thu, Dec 02, 2010 at 11:43:50AM -0500, Marcelo Tosatti wrote:
 On Tue, Nov 30, 2010 at 06:03:57PM +0100, Joerg Roedel wrote:
  -   control->intercept_cr_read =    INTERCEPT_CR0_MASK |
  -   INTERCEPT_CR3_MASK |
  -   INTERCEPT_CR4_MASK;
  -
  -   control->intercept_cr_write =   INTERCEPT_CR0_MASK |
  -   INTERCEPT_CR3_MASK |
  -   INTERCEPT_CR4_MASK |
  -   INTERCEPT_CR8_MASK;
  +   set_cr_intercept(svm, INTERCEPT_CR0_READ);
  +   set_cr_intercept(svm, INTERCEPT_CR3_READ);
  +   set_cr_intercept(svm, INTERCEPT_CR4_READ);
  +   set_cr_intercept(svm, INTERCEPT_CR0_WRITE);
  +   set_cr_intercept(svm, INTERCEPT_CR3_WRITE);
  +   set_cr_intercept(svm, INTERCEPT_CR4_WRITE);
  +   set_cr_intercept(svm, INTERCEPT_CR8_WRITE);
 
 Should clear hflags before using is_guest_mode().

Right, thanks. Here is the updated patch.

From 78e9a425d1440fdf9232e38dc3093127b96f26bb Mon Sep 17 00:00:00 2001
From: Joerg Roedel joerg.roe...@amd.com
Date: Tue, 30 Nov 2010 15:30:21 +0100
Subject: [PATCH 2/6] KVM: SVM: Add manipulation functions for CRx intercepts

This patch wraps changes to the CRx intercepts of SVM into
separate functions to abstract nested-svm better and prepare
the implementation of the vmcb-clean-bits feature.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
---
 arch/x86/include/asm/svm.h |   15 +++--
 arch/x86/kvm/svm.c |  120 +++
 2 files changed, 73 insertions(+), 62 deletions(-)

diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 0e83105..39f9ddf 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -51,8 +51,7 @@ enum {
 
 
 struct __attribute__ ((__packed__)) vmcb_control_area {
-   u16 intercept_cr_read;
-   u16 intercept_cr_write;
+   u32 intercept_cr;
u16 intercept_dr_read;
u16 intercept_dr_write;
u32 intercept_exceptions;
@@ -204,10 +203,14 @@ struct __attribute__ ((__packed__)) vmcb {
#define SVM_SELECTOR_READ_MASK SVM_SELECTOR_WRITE_MASK
#define SVM_SELECTOR_CODE_MASK (1 << 3)
 
-#define INTERCEPT_CR0_MASK 1
-#define INTERCEPT_CR3_MASK (1 << 3)
-#define INTERCEPT_CR4_MASK (1 << 4)
-#define INTERCEPT_CR8_MASK (1 << 8)
+#define INTERCEPT_CR0_READ 0
+#define INTERCEPT_CR3_READ 3
+#define INTERCEPT_CR4_READ 4
+#define INTERCEPT_CR8_READ 8
+#define INTERCEPT_CR0_WRITE (16 + 0)
+#define INTERCEPT_CR3_WRITE (16 + 3)
+#define INTERCEPT_CR4_WRITE (16 + 4)
+#define INTERCEPT_CR8_WRITE (16 + 8)
 
#define INTERCEPT_DR0_MASK 1
#define INTERCEPT_DR1_MASK (1 << 1)
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 05fe851..57c597b 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -98,8 +98,7 @@ struct nested_state {
unsigned long vmexit_rax;
 
/* cache for intercepts of the guest */
-   u16 intercept_cr_read;
-   u16 intercept_cr_write;
+   u32 intercept_cr;
u16 intercept_dr_read;
u16 intercept_dr_write;
u32 intercept_exceptions;
@@ -204,14 +203,46 @@ static void recalc_intercepts(struct vcpu_svm *svm)
h = &svm->nested.hsave->control;
g = &svm->nested;
 
-   c->intercept_cr_read = h->intercept_cr_read | g->intercept_cr_read;
-   c->intercept_cr_write = h->intercept_cr_write | g->intercept_cr_write;
+   c->intercept_cr = h->intercept_cr | g->intercept_cr;
c->intercept_dr_read = h->intercept_dr_read | g->intercept_dr_read;
c->intercept_dr_write = h->intercept_dr_write | g->intercept_dr_write;
c->intercept_exceptions = h->intercept_exceptions | 
g->intercept_exceptions;
c->intercept = h->intercept | g->intercept;
 }
 
+static inline struct vmcb *get_host_vmcb(struct vcpu_svm *svm)
+{
+   if (is_guest_mode(&svm->vcpu))
+   return svm->nested.hsave;
+   else
+   return svm->vmcb;
+}
+
+static inline void set_cr_intercept(struct vcpu_svm *svm, int bit)
+{
+   struct vmcb *vmcb = get_host_vmcb(svm);
+
+   vmcb->control.intercept_cr |= (1U << bit);
+
+   recalc_intercepts(svm);
+}
+
+static inline void clr_cr_intercept(struct vcpu_svm *svm, int bit)
+{
+   struct vmcb *vmcb = get_host_vmcb(svm);
+
+   vmcb->control.intercept_cr &= ~(1U << bit);
+
+   recalc_intercepts(svm);
+}
+
+static inline bool is_cr_intercept(struct vcpu_svm *svm, int bit)
+{
+   struct vmcb *vmcb = get_host_vmcb(svm);
+
+   return vmcb->control.intercept_cr & (1U << bit);
+}
+
 static inline void enable_gif(struct vcpu_svm *svm)
 {
svm->vcpu.arch.hflags |= HF_GIF_MASK;
@@ -766,15 +797,15 @@ static void init_vmcb(struct vcpu_svm *svm)
struct vmcb_save_area *save = &svm->vmcb->save;
 
svm->vcpu.fpu_active = 1;
+   svm->vcpu.arch.hflags = 0;
 
-   control->intercept_cr_read =    INTERCEPT_CR0_MASK |
-   INTERCEPT_CR3_MASK |
-   

Re: Problems on qemu-kvm unittests

2010-12-03 Thread Avi Kivity

On 12/03/2010 12:59 AM, Lucas Meneghel Rodrigues wrote:

We are getting failures when executing apic.flat on our periodic upstream tests:

12/02 18:40:59 DEBUG|kvm_vm:0664| Running qemu command:
/usr/local/autotest/tests/kvm/qemu -name 'vm1' -monitor 
unix:'/tmp/monitor-humanmonitor1-20101202-184059-9EnX',server,nowait -serial 
unix:'/tmp/serial-20101202-184059-9EnX',server,nowait -m 512 -smp 2 -kernel 
'/usr/local/autotest/tests/kvm/unittests/apic.flat' -vnc :0 -chardev 
file,id=testlog,path=/tmp/testlog-20101202-184059-9EnX -device 
testdev,chardev=testlog  -S -cpu qemu64,+x2apic
12/02 18:40:59 DEBUG|kvm_subpro:0700| (qemu) Cannot load x86-64 image, give a 
32bit one.
12/02 18:40:59 DEBUG|kvm_subpro:0700| (qemu) (Process terminated with status 1)

Relevant git commits:

12/02 18:20:16 INFO | kvm_utils:0407| Commit hash for 
git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm.git is 
9ee00410d82a7c5cab5ae347d97fbf8a95c55506 (tag v2.6.32-56688-g9ee0041)

12/02 18:39:11 INFO | kvm_utils:0407| Commit hash for 
git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git is 
53b6d3d5c2522e881c8d194f122de3114f6f76eb (tag kvm-88-6325-g53b6d3d)

12/02 18:39:16 INFO | kvm_utils:0407| Commit hash for 
git://git.kernel.org/pub/scm/virt/kvm/kvm-unit-tests.git is 
f2d2b7c74355523b90019427224577b6f0ff1b8a (no tag found)

Please advise!



Caused by qemu tightening up -kernel.  Easily fixed by using objcopy, 
I'll post a patch soon.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



[PATCH] do not try to register kvmclock if the host does not support it.

2010-12-03 Thread Glauber Costa
With the new Async PF code, some of our initialization code was moved to
kvm.c. With that, code that used to be issued conditionally on successful
kvmclock registration (primary cpu registering) started being issued
unconditionally.

This patch proposes that we protect all registrations inside
kvm_register_clock, re-using the kvmclock-enabled parameter for that.
This fixes this specific bug, and should protect us from similar
changes in the future.

Signed-off-by: Glauber Costa glom...@redhat.com
---
 arch/x86/kernel/kvmclock.c |   12 ++--
 1 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index f98d3ea..fcdea34 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -130,6 +130,9 @@ int kvm_register_clock(char *txt)
int cpu = smp_processor_id();
int low, high, ret;
 
+   if (!kvmclock)
+   return 0;
+
low = (int)__pa(per_cpu(hv_clock, cpu)) | 1;
high = ((u64)__pa(per_cpu(hv_clock, cpu)) >> 32);
ret = native_write_msr_safe(msr_kvm_system_time, low, high);
@@ -182,14 +185,19 @@ void __init kvmclock_init(void)
if (kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
-   } else if (!(kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)))
+   } else if (!(kvmclock && 
kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE))) {
+   kvmclock = 0;
return;
+   }
 
printk(KERN_INFO "kvm-clock: Using msrs %x and %x",
msr_kvm_system_time, msr_kvm_wall_clock);
 
-   if (kvm_register_clock("boot clock"))
+   if (kvm_register_clock("boot clock")) {
+   kvmclock = 0;
return;
+   }
+
pv_time_ops.sched_clock = kvm_clock_read;
x86_platform.calibrate_tsc = kvm_get_tsc_khz;
x86_platform.get_wallclock = kvm_get_wallclock;
-- 
1.6.2.5



[PATCH 02/12] KVM: SVM: Add clean-bit for intercepts, tsc-offset and pause filter count

2010-12-03 Thread Joerg Roedel
This patch adds the clean-bit for the intercept vectors, the
TSC offset and the pause-filter count to the appropriate
places. The IO and MSR permission bitmaps are not subject to
this bit.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
---
 arch/x86/kvm/svm.c |7 +++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 3ee59b0..d8058c9 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -186,6 +186,8 @@ static int nested_svm_check_exception(struct vcpu_svm *svm, 
unsigned nr,
  bool has_error_code, u32 error_code);
 
 enum {
+   VMCB_INTERCEPTS, /* Intercept vectors, TSC offset,
+   pause filter count */
VMCB_DIRTY_MAX,
 };
 
@@ -217,6 +219,8 @@ static void recalc_intercepts(struct vcpu_svm *svm)
struct vmcb_control_area *c, *h;
struct nested_state *g;
 
+   mark_dirty(svm->vmcb, VMCB_INTERCEPTS);
+
if (!is_guest_mode(&svm->vcpu))
return;
 
@@ -854,6 +858,8 @@ static void svm_write_tsc_offset(struct kvm_vcpu *vcpu, u64 
offset)
}
 
svm->vmcb->control.tsc_offset = offset + g_tsc_offset;
+
+   mark_dirty(svm->vmcb, VMCB_INTERCEPTS);
 }
 
 static void svm_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
@@ -863,6 +869,7 @@ static void svm_adjust_tsc_offset(struct kvm_vcpu *vcpu, 
s64 adjustment)
svm->vmcb->control.tsc_offset += adjustment;
if (is_guest_mode(vcpu))
svm->nested.hsave->control.tsc_offset += adjustment;
+   mark_dirty(svm->vmcb, VMCB_INTERCEPTS);
 }
 
 static void init_vmcb(struct vcpu_svm *svm)
-- 
1.7.1




[PATCH 05/12] KVM: SVM: Add clean-bit for interrupt state

2010-12-03 Thread Joerg Roedel
This patch implements the clean-bit for all interrupt
related state in the vmcb. This corresponds to vmcb offset
0x60-0x67.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
---
 arch/x86/kvm/svm.c |8 +++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 8675048..b81d31a 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -190,10 +190,12 @@ enum {
pause filter count */
VMCB_PERM_MAP,   /* IOPM Base and MSRPM Base */
VMCB_ASID,   /* ASID */
+   VMCB_INTR,   /* int_ctl, int_vector */
VMCB_DIRTY_MAX,
 };
 
-#define VMCB_ALWAYS_DIRTY_MASK 0U
+/* TPR is always written before VMRUN */
+#define VMCB_ALWAYS_DIRTY_MASK (1U << VMCB_INTR)
 
 static inline void mark_all_dirty(struct vmcb *vmcb)
 {
@@ -2507,6 +2509,8 @@ static int clgi_interception(struct vcpu_svm *svm)
svm_clear_vintr(svm);
svm->vmcb->control.int_ctl &= ~V_IRQ_MASK;
 
+   mark_dirty(svm->vmcb, VMCB_INTR);
+
return 1;
 }
 
@@ -2877,6 +2881,7 @@ static int interrupt_window_interception(struct vcpu_svm 
*svm)
kvm_make_request(KVM_REQ_EVENT, &svm->vcpu);
svm_clear_vintr(svm);
svm->vmcb->control.int_ctl &= ~V_IRQ_MASK;
+   mark_dirty(svm->vmcb, VMCB_INTR);
/*
 * If the user space waits to inject interrupts, exit as soon as
 * possible
@@ -3168,6 +3173,7 @@ static inline void svm_inject_irq(struct vcpu_svm *svm, 
int irq)
control->int_ctl &= ~V_INTR_PRIO_MASK;
control->int_ctl |= V_IRQ_MASK |
((/*control->int_vector >> 4*/ 0xf) << V_INTR_PRIO_SHIFT);
+   mark_dirty(svm->vmcb, VMCB_INTR);
 }
 
 static void svm_set_irq(struct kvm_vcpu *vcpu)
-- 
1.7.1




[PATCH 04/12] KVM: SVM: Add clean-bit for the ASID

2010-12-03 Thread Joerg Roedel
This patch implements the clean-bit for the asid in the
vmcb.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
---
 arch/x86/kvm/svm.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 3ab42be..8675048 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -189,6 +189,7 @@ enum {
VMCB_INTERCEPTS, /* Intercept vectors, TSC offset,
pause filter count */
VMCB_PERM_MAP,   /* IOPM Base and MSRPM Base */
+   VMCB_ASID,   /* ASID */
VMCB_DIRTY_MAX,
 };
 
@@ -1488,6 +1489,8 @@ static void new_asid(struct vcpu_svm *svm, struct 
svm_cpu_data *sd)
 
svm->asid_generation = sd->asid_generation;
svm->vmcb->control.asid = sd->next_asid++;
+
+   mark_dirty(svm->vmcb, VMCB_ASID);
 }
 
 static void svm_set_dr7(struct kvm_vcpu *vcpu, unsigned long value)
-- 
1.7.1




[PATCH 03/12] KVM: SVM: Add clean-bit for IOPM_BASE and MSRPM_BASE

2010-12-03 Thread Joerg Roedel
This patch adds the clean bit for the physical addresses of
the MSRPM and the IOPM. It does not need to be set in the
code because the only place where these values are changed
is the nested-svm vmrun and vmexit path. These functions
already mark the complete VMCB as dirty.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
---
 arch/x86/kvm/svm.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index d8058c9..3ab42be 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -188,6 +188,7 @@ static int nested_svm_check_exception(struct vcpu_svm *svm, 
unsigned nr,
 enum {
VMCB_INTERCEPTS, /* Intercept vectors, TSC offset,
pause filter count */
+   VMCB_PERM_MAP,   /* IOPM Base and MSRPM Base */
VMCB_DIRTY_MAX,
 };
 
-- 
1.7.1




[PATCH 12/12] KVM: SVM: Add clean-bit for LBR state

2010-12-03 Thread Joerg Roedel
This patch implements the clean-bit for all LBR related
state. This includes the debugctl, br_from, br_to,
last_excp_from, and last_excp_to msrs.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
---
 arch/x86/kvm/svm.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 7643f83..899 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -197,6 +197,7 @@ enum {
VMCB_DT, /* GDT, IDT */
VMCB_SEG,/* CS, DS, SS, ES, CPL */
VMCB_CR2,/* CR2 only */
+   VMCB_LBR,/* DBGCTL, BR_FROM, BR_TO, LAST_EX_FROM, LAST_EX_TO */
VMCB_DIRTY_MAX,
 };
 
@@ -2846,6 +2847,7 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, unsigned 
ecx, u64 data)
return 1;
 
svm->vmcb->save.dbgctl = data;
+   mark_dirty(svm->vmcb, VMCB_LBR);
if (data & (1ULL << 0))
svm_enable_lbrv(svm);
else
-- 
1.7.1




[PATCH 11/12] KVM: SVM: Add clean-bit for CR2 register

2010-12-03 Thread Joerg Roedel
This patch implements the clean-bit for the cr2 register in
the vmcb.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
---
 arch/x86/kvm/svm.c |5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 4c366fe..7643f83 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -196,11 +196,12 @@ enum {
VMCB_DR, /* DR6, DR7 */
VMCB_DT, /* GDT, IDT */
VMCB_SEG,/* CS, DS, SS, ES, CPL */
+   VMCB_CR2,/* CR2 only */
VMCB_DIRTY_MAX,
 };
 
-/* TPR is always written before VMRUN */
-#define VMCB_ALWAYS_DIRTY_MASK (1U << VMCB_INTR)
+/* TPR and CR2 are always written before VMRUN */
+#define VMCB_ALWAYS_DIRTY_MASK ((1U << VMCB_INTR) | (1U << VMCB_CR2))
 
 static inline void mark_all_dirty(struct vmcb *vmcb)
 {
-- 
1.7.1




[PATCH 07/12] KVM: SVM: Add clean-bit for control registers

2010-12-03 Thread Joerg Roedel
This patch implements the CRx clean-bit for the vmcb. This
bit covers cr0, cr3, cr4, and efer.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
---
 arch/x86/kvm/svm.c |7 +++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 3b5d894..1b35969 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -192,6 +192,7 @@ enum {
VMCB_ASID,   /* ASID */
VMCB_INTR,   /* int_ctl, int_vector */
VMCB_NPT,/* npt_en, nCR3, gPAT */
+   VMCB_CR, /* CR0, CR3, CR4, EFER */
VMCB_DIRTY_MAX,
 };
 
@@ -441,6 +442,7 @@ static void svm_set_efer(struct kvm_vcpu *vcpu, u64 efer)
efer &= ~EFER_LME;
 
to_svm(vcpu)->vmcb->save.efer = efer | EFER_SVME;
+   mark_dirty(to_svm(vcpu)->vmcb, VMCB_CR);
 }
 
 static int is_external_interrupt(u32 info)
@@ -1338,6 +1340,7 @@ static void update_cr0_intercept(struct vcpu_svm *svm)
*hcr0 = (*hcr0 & ~SVM_CR0_SELECTIVE_MASK)
| (gcr0 & SVM_CR0_SELECTIVE_MASK);
 
+   mark_dirty(svm->vmcb, VMCB_CR);
 
if (gcr0 == *hcr0 && svm->vcpu.fpu_active) {
clr_cr_intercept(svm, INTERCEPT_CR0_READ);
@@ -1404,6 +1407,7 @@ static void svm_set_cr0(struct kvm_vcpu *vcpu, unsigned 
long cr0)
 */
cr0 &= ~(X86_CR0_CD | X86_CR0_NW);
svm->vmcb->save.cr0 = cr0;
+   mark_dirty(svm->vmcb, VMCB_CR);
update_cr0_intercept(svm);
 }
 
@@ -1420,6 +1424,7 @@ static void svm_set_cr4(struct kvm_vcpu *vcpu, unsigned 
long cr4)
cr4 |= X86_CR4_PAE;
cr4 |= host_cr4_mce;
to_svm(vcpu)->vmcb->save.cr4 = cr4;
+   mark_dirty(to_svm(vcpu)->vmcb, VMCB_CR);
 }
 
 static void svm_set_segment(struct kvm_vcpu *vcpu,
@@ -3546,6 +3551,7 @@ static void svm_set_cr3(struct kvm_vcpu *vcpu, unsigned 
long root)
struct vcpu_svm *svm = to_svm(vcpu);
 
svm->vmcb->save.cr3 = root;
+   mark_dirty(svm->vmcb, VMCB_CR);
force_new_asid(vcpu);
 }
 
@@ -3558,6 +3564,7 @@ static void set_tdp_cr3(struct kvm_vcpu *vcpu, unsigned 
long root)
 
/* Also sync guest cr3 here in case we live migrate */
svm->vmcb->save.cr3 = vcpu->arch.cr3;
+   mark_dirty(svm->vmcb, VMCB_CR);
 
force_new_asid(vcpu);
 }
-- 
1.7.1




[PATCH 10/12] KVM: SVM: Add clean-bit for Segments and CPL

2010-12-03 Thread Joerg Roedel
This patch implements the clean-bit defined for the cs, ds,
ss, and es segments and the current cpl saved in the vmcb.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
---
 arch/x86/kvm/svm.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 2236c9a..4c366fe 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -195,6 +195,7 @@ enum {
VMCB_CR, /* CR0, CR3, CR4, EFER */
VMCB_DR, /* DR6, DR7 */
VMCB_DT, /* GDT, IDT */
+   VMCB_SEG,/* CS, DS, SS, ES, CPL */
VMCB_DIRTY_MAX,
 };
 
@@ -1457,6 +1458,7 @@ static void svm_set_segment(struct kvm_vcpu *vcpu,
= (svm->vmcb->save.cs.attrib
>> SVM_SELECTOR_DPL_SHIFT) & 3;
 
+   mark_dirty(svm->vmcb, VMCB_SEG);
 }
 
 static void update_db_intercept(struct kvm_vcpu *vcpu)
-- 
1.7.1




[PATCH 06/12] KVM: SVM: Add clean-bit for NPT state

2010-12-03 Thread Joerg Roedel
This patch implements the clean-bit for all nested paging
related state in the vmcb.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
---
 arch/x86/kvm/svm.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index b81d31a..3b5d894 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -191,6 +191,7 @@ enum {
VMCB_PERM_MAP,   /* IOPM Base and MSRPM Base */
VMCB_ASID,   /* ASID */
VMCB_INTR,   /* int_ctl, int_vector */
+   VMCB_NPT,/* npt_en, nCR3, gPAT */
VMCB_DIRTY_MAX,
 };
 
@@ -1749,6 +1750,7 @@ static void nested_svm_set_tdp_cr3(struct kvm_vcpu *vcpu,
struct vcpu_svm *svm = to_svm(vcpu);
 
svm->vmcb->control.nested_cr3 = root;
+   mark_dirty(svm->vmcb, VMCB_NPT);
force_new_asid(vcpu);
 }
 
@@ -3552,6 +3554,7 @@ static void set_tdp_cr3(struct kvm_vcpu *vcpu, unsigned 
long root)
struct vcpu_svm *svm = to_svm(vcpu);
 
svm->vmcb->control.nested_cr3 = root;
+   mark_dirty(svm->vmcb, VMCB_NPT);
 
/* Also sync guest cr3 here in case we live migrate */
svm->vmcb->save.cr3 = vcpu->arch.cr3;
-- 
1.7.1




[PATCH 09/12] KVM: SVM: Add clean-bit for GDT and IDT

2010-12-03 Thread Joerg Roedel
This patch implements the clean-bit for the base and limit
of the gdt and idt in the vmcb.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
---
 arch/x86/kvm/svm.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 0517505..2236c9a 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -194,6 +194,7 @@ enum {
VMCB_NPT,/* npt_en, nCR3, gPAT */
VMCB_CR, /* CR0, CR3, CR4, EFER */
VMCB_DR, /* DR6, DR7 */
+   VMCB_DT, /* GDT, IDT */
VMCB_DIRTY_MAX,
 };
 
@@ -1304,6 +1305,7 @@ static void svm_set_idt(struct kvm_vcpu *vcpu, struct 
desc_ptr *dt)
 
svm->vmcb->save.idtr.limit = dt->size;
svm->vmcb->save.idtr.base = dt->address;
+   mark_dirty(svm->vmcb, VMCB_DT);
 }
 
 static void svm_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
@@ -1320,6 +1322,7 @@ static void svm_set_gdt(struct kvm_vcpu *vcpu, struct 
desc_ptr *dt)
 
svm->vmcb->save.gdtr.limit = dt->size;
svm->vmcb->save.gdtr.base = dt->address;
+   mark_dirty(svm->vmcb, VMCB_DT);
 }
 
 static void svm_decache_cr0_guest_bits(struct kvm_vcpu *vcpu)
-- 
1.7.1




[PATCH 08/12] KVM: SVM: Add clean-bit for DR6 and DR7

2010-12-03 Thread Joerg Roedel
This patch implements the clean-bit for the dr6 and dr7
debug registers in the vmcb.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
---
 arch/x86/kvm/svm.c |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 1b35969..0517505 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -193,6 +193,7 @@ enum {
VMCB_INTR,   /* int_ctl, int_vector */
VMCB_NPT,/* npt_en, nCR3, gPAT */
VMCB_CR, /* CR0, CR3, CR4, EFER */
+   VMCB_DR, /* DR6, DR7 */
VMCB_DIRTY_MAX,
 };
 
@@ -1484,6 +1485,8 @@ static void svm_guest_debug(struct kvm_vcpu *vcpu, struct 
kvm_guest_debug *dbg)
else
svm->vmcb->save.dr7 = vcpu->arch.dr7;
 
+   mark_dirty(svm->vmcb, VMCB_DR);
+
update_db_intercept(vcpu);
 }
 
@@ -1506,6 +1509,7 @@ static void svm_set_dr7(struct kvm_vcpu *vcpu, unsigned 
long value)
struct vcpu_svm *svm = to_svm(vcpu);
 
svm->vmcb->save.dr7 = value;
+   mark_dirty(svm->vmcb, VMCB_DR);
 }
 
 static int pf_interception(struct vcpu_svm *svm)
-- 
1.7.1




[PATCH 0/12] KVM: SVM: Add support for VMCB state caching

2010-12-03 Thread Joerg Roedel
Hi Avi, Hi Marcelo,

here is a patch-set which adds support for VMCB state caching to KVM.
This is a new CPU feature where software can mark certain parts of the
VMCB as unchanged since the last vmexit and the hardware can then avoid
reloading these parts from memory.

The feature is implemented in a downwards-compatible way in hardware: a
0-bit means the state has changed and needs to be reloaded. This makes it
possible to implement the bits without checking for the feature, as done
in this patch-set (another reason is that the check is as expensive as
clearing the bit). Processors which do not implement VMCB state
caching just ignore these bits.
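
For illustration, the contract described above amounts to something like the
following sketch (hypothetical field and bit names, not the patch code itself):
software clears a clean-bit whenever it touches the corresponding state, and
hardware treats any 0-bit as "reload from memory":

    /* 0 = changed, must reload; 1 = unchanged, reload may be skipped */
    #define VMCB_CLEAN_EXAMPLE_BIT  2  /* illustrative position only */

    static void update_example_state(struct vmcb *vmcb, u32 val)
    {
            vmcb->control.example_field = val;
            /* clearing unconditionally is as cheap as testing for the
             * feature first; CPUs without the feature ignore the bits */
            vmcb->control.clean &= ~(1U << VMCB_CLEAN_EXAMPLE_BIT);
    }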

These patches were tested with multiple guests (Windows, Linux, also in
parallel) and also with nested-svm.

The patches apply on-top of the intercept mask wrapping patch-set I sent
earlier this week. Your feedback is appreciated.

Regards,
Joerg

 arch/x86/include/asm/svm.h |6 +++-
 arch/x86/kvm/svm.c |   70 
 2 files changed, 75 insertions(+), 1 deletions(-)

Joerg Roedel (12):
  KVM: SVM: Add clean-bits infrastructure code
  KVM: SVM: Add clean-bit for intercepts, tsc-offset and pause filter count
  KVM: SVM: Add clean-bit for IOPM_BASE and MSRPM_BASE
  KVM: SVM: Add clean-bit for the ASID
  KVM: SVM: Add clean-bit for interrupt state
  KVM: SVM: Add clean-bit for NPT state
  KVM: SVM: Add clean-bit for control registers
  KVM: SVM: Add clean-bit for DR6 and DR7
  KVM: SVM: Add clean-bit for GDT and IDT
  KVM: SVM: Add clean-bit for Segments and CPL
  KVM: SVM: Add clean-bit for CR2 register
  KVM: SVM: Add clean-bit for LBR state




[PATCH 01/12] KVM: SVM: Add clean-bits infrastructure code

2010-12-03 Thread Joerg Roedel
This patch adds the infrastructure for the implementation of
the individual clean-bits.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
---
 arch/x86/include/asm/svm.h |6 +-
 arch/x86/kvm/svm.c |   31 +++
 2 files changed, 36 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 11dbca7..d786399 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -49,6 +49,9 @@ enum {
INTERCEPT_MWAIT_COND,
 };
 
+enum {
+   VMCB_CLEAN_MAX,
+};
 
 struct __attribute__ ((__packed__)) vmcb_control_area {
u32 intercept_cr;
@@ -79,7 +82,8 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
u32 event_inj_err;
u64 nested_cr3;
u64 lbr_ctl;
-   u64 reserved_5;
+   u32 clean;
+   u32 reserved_5;
u64 next_rip;
u8 reserved_6[816];
 };
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index c00ea90..3ee59b0 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -185,6 +185,28 @@ static int nested_svm_vmexit(struct vcpu_svm *svm);
 static int nested_svm_check_exception(struct vcpu_svm *svm, unsigned nr,
  bool has_error_code, u32 error_code);
 
+enum {
+   VMCB_DIRTY_MAX,
+};
+
+#define VMCB_ALWAYS_DIRTY_MASK 0U
+
+static inline void mark_all_dirty(struct vmcb *vmcb)
+{
+   vmcb->control.clean = 0;
+}
+
+static inline void mark_all_clean(struct vmcb *vmcb)
+{
+   vmcb->control.clean = ((1 << VMCB_DIRTY_MAX) - 1)
+   & ~VMCB_ALWAYS_DIRTY_MASK;
+}
+
+static inline void mark_dirty(struct vmcb *vmcb, int bit)
+{
+   vmcb->control.clean &= ~(1 << bit);
+}
+
 static inline struct vcpu_svm *to_svm(struct kvm_vcpu *vcpu)
 {
return container_of(vcpu, struct vcpu_svm, vcpu);
@@ -973,6 +995,8 @@ static void init_vmcb(struct vcpu_svm *svm)
set_intercept(svm, INTERCEPT_PAUSE);
}
 
+   mark_all_dirty(svm->vmcb);
+
enable_gif(svm);
 }
 
@@ -1089,6 +1113,7 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 
if (unlikely(cpu != vcpu->cpu)) {
svm->asid_generation = 0;
+   mark_all_dirty(svm->vmcb);
}
 
 #ifdef CONFIG_X86_64
@@ -2139,6 +2164,8 @@ static int nested_svm_vmexit(struct vcpu_svm *svm)
svm->vmcb->save.cpl = 0;
svm->vmcb->control.exit_int_info = 0;
 
+   mark_all_dirty(svm->vmcb);
+
nested_svm_unmap(page);
 
nested_svm_uninit_mmu_context(&svm->vcpu);
@@ -2350,6 +2377,8 @@ static bool nested_svm_vmrun(struct vcpu_svm *svm)
 
enable_gif(svm);
 
+   mark_all_dirty(svm->vmcb);
+
return true;
 }
 
@@ -3487,6 +3516,8 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu)
if (unlikely(svm->vmcb->control.exit_code ==
 SVM_EXIT_EXCP_BASE + MC_VECTOR))
svm_handle_mce(svm);
+
+   mark_all_clean(svm->vmcb);
 }
 
 #undef R
-- 
1.7.1




[PATCH 1/2] make kvmclock value idempotent for stopped machine

2010-12-03 Thread Glauber Costa
Although we never made such a commitment explicit (well, to the best
of my knowledge), some people expect that two savevm operations issued in
sequence on a stopped machine will yield the same results. This is not a
crazy requirement, since we don't expect a stopped machine to be updating
its state, for any device.

With kvmclock, this is not the case, since the .pre_save hook will issue an
ioctl to the host to acquire a timestamp, which is always changing.

This patch moves the value acquisition to vm_stop. This should mean our
get clock ioctl is issued more times, but this should be fine since vm_stop
is not a hot path.

When we do migrate, we'll transfer that value along.

Signed-off-by: Glauber Costa glom...@redhat.com
---
 cpus.c |4 
 qemu-kvm-x86.c |7 ++-
 qemu-kvm.h |2 ++
 3 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/cpus.c b/cpus.c
index a55c330..879a03a 100644
--- a/cpus.c
+++ b/cpus.c
@@ -112,6 +112,10 @@ static void do_vm_stop(int reason)
 pause_all_vcpus();
 vm_state_notify(0, reason);
 monitor_protocol_event(QEVENT_STOP, NULL);
+if (kvm_enabled()) {
+kvmclock_update_clock();
+}
+
 }
 }
 
diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c
index 20b7d6d..d099d3d 100644
--- a/qemu-kvm-x86.c
+++ b/qemu-kvm-x86.c
@@ -500,11 +500,9 @@ static int kvm_enable_tpr_access_reporting(CPUState *env)
 #ifdef KVM_CAP_ADJUST_CLOCK
 static struct kvm_clock_data kvmclock_data;
 
-static void kvmclock_pre_save(void *opaque)
+void kvmclock_update_clock(void)
 {
-struct kvm_clock_data *cl = opaque;
-
-kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, cl);
+kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, &kvmclock_data);
 }
 
 static int kvmclock_post_load(void *opaque, int version_id)
@@ -519,7 +517,6 @@ static const VMStateDescription vmstate_kvmclock= {
 .version_id = 1,
 .minimum_version_id = 1,
 .minimum_version_id_old = 1,
-.pre_save = kvmclock_pre_save,
 .post_load = kvmclock_post_load,
 .fields  = (VMStateField []) {
 VMSTATE_U64(clock, struct kvm_clock_data),
diff --git a/qemu-kvm.h b/qemu-kvm.h
index 0f3fb50..b0b7ab3 100644
--- a/qemu-kvm.h
+++ b/qemu-kvm.h
@@ -752,6 +752,8 @@ int handle_tpr_access(void *opaque, CPUState *env, uint64_t 
rip,
 #define qemu_kvm_cpu_stop(env) do {} while(0)
 #endif
 
+void kvmclock_update_clock(void);
+
 #ifdef CONFIG_KVM
 
 typedef struct KVMSlot {
-- 
1.7.2.3



[PATCH 0/2] Fix savevm oddness related to kvmclock

2010-12-03 Thread Glauber Costa
Some users told me that the savevm path is behaving oddly wrt kvmclock.
The first oddness is that a guarantee we never made (AFAIK) is being broken:
two consecutive savevm operations, with the machine stopped in between,
produce different results, due to the call to the KVM_GET_CLOCK ioctl.
I believe the assumption that if the vm does not run, its saveable
state won't change is fairly reasonable. Maybe we should formally
guarantee that?

The second patch deals with the fact that this happens even if
kvmclock is disabled in cpuid: its savevm section is registered
nevertheless. Here, I try to register it only if it's enabled at
machine start.

Thanks

Glauber Costa (2):
  make kvmclock value idempotent for stopped machine
  Do not register kvmclock savevm section if kvmclock is disabled.

 cpus.c|7 +++
 qemu-kvm-x86.c|   25 +++--
 qemu-kvm.h|4 
 target-i386/kvm.c |7 +++
 4 files changed, 33 insertions(+), 10 deletions(-)

-- 
1.7.2.3



[PATCH 2/2] Do not register kvmclock savevm section if kvmclock is disabled.

2010-12-03 Thread Glauber Costa
Usually nobody thinks about that scenario (me included, especially),
but kvmclock can actually be disabled in the host.

It happens in two scenarios:
 1. host too old.
 2. we passed -kvmclock to our -cpu parameter.

In both cases, we should not register the kvmclock savevm section. This patch
achieves that by registering this section only if kvmclock is actually
currently enabled in cpuid.

The only caveat is that we have to register the savevm section a little bit
later, since we won't know the final kvmclock state before cpuid gets parsed.

Signed-off-by: Glauber Costa glom...@redhat.com
---
 cpus.c|3 +++
 qemu-kvm-x86.c|   20 ++--
 qemu-kvm.h|2 ++
 target-i386/kvm.c |7 +++
 4 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/cpus.c b/cpus.c
index 879a03a..eef716c 100644
--- a/cpus.c
+++ b/cpus.c
@@ -97,6 +97,9 @@ void cpu_synchronize_all_post_init(void)
 for (cpu = first_cpu; cpu; cpu = cpu-next_cpu) {
 cpu_synchronize_post_init(cpu);
 }
+if (kvm_enabled()) {
+kvmclock_register_savevm();
+}
 }
 
 int cpu_is_stopped(CPUState *env)
diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c
index d099d3d..668c8cf 100644
--- a/qemu-kvm-x86.c
+++ b/qemu-kvm-x86.c
@@ -502,6 +502,9 @@ static struct kvm_clock_data kvmclock_data;
 
 void kvmclock_update_clock(void)
 {
+if (!kvmclock_enabled)
+return;
+
 kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, &kvmclock_data);
 }
 
@@ -525,6 +528,17 @@ static const VMStateDescription vmstate_kvmclock= {
 };
 #endif
 
+/* This has to happen after vcpu setup */
+void kvmclock_register_savevm(void)
+{
+#ifdef KVM_CAP_ADJUST_CLOCK
+if (kvmclock_enabled && kvm_check_extension(kvm_state, 
KVM_CAP_ADJUST_CLOCK)) {
+printf("registering kvmclock savevm section\n");
+vmstate_register(NULL, 0, &vmstate_kvmclock, &kvmclock_data);
+}
+#endif
+}
+
 int kvm_arch_qemu_create_context(void)
 {
 int r;
@@ -542,12 +556,6 @@ int kvm_arch_qemu_create_context(void)
 return -1;
 }
 
-#ifdef KVM_CAP_ADJUST_CLOCK
-if (kvm_check_extension(kvm_state, KVM_CAP_ADJUST_CLOCK)) {
-vmstate_register(NULL, 0, &vmstate_kvmclock, &kvmclock_data);
-}
-#endif
-
 r = kvm_set_boot_cpu_id(0);
 if (r  0  r != -ENOSYS) {
 return r;
diff --git a/qemu-kvm.h b/qemu-kvm.h
index b0b7ab3..f51a2d6 100644
--- a/qemu-kvm.h
+++ b/qemu-kvm.h
@@ -753,6 +753,8 @@ int handle_tpr_access(void *opaque, CPUState *env, uint64_t 
rip,
 #endif
 
 void kvmclock_update_clock(void);
+extern int kvmclock_enabled;
+void kvmclock_register_savevm(void);
 
 #ifdef CONFIG_KVM
 
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 95e5d02..5443765 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -293,6 +293,7 @@ void kvm_inject_x86_mce(CPUState *cenv, int bank, uint64_t 
status,
 }
 
 static int _kvm_arch_init_vcpu(CPUState *env);
+int kvmclock_enabled = 1;
 
 int kvm_arch_init_vcpu(CPUState *env)
 {
@@ -350,6 +351,12 @@ int kvm_arch_init_vcpu(CPUState *env)
 memset(c, 0, sizeof(*c));
c->function = KVM_CPUID_FEATURES;
c->eax = env->cpuid_kvm_features & get_para_features(env);
+
+if (!(c->eax & (1 << KVM_FEATURE_CLOCKSOURCE))) {
+/* In theory cpuid is per-cpu, and this is a global variable,
+ * but we don't expect kvmclock enabled in some cpus only */
+kvmclock_enabled = 0;
+}
 #endif
 
cpu_x86_cpuid(env, 0, 0, &limit, &unused, &unused, &unused);
-- 
1.7.2.3



Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)

2010-12-03 Thread Srivatsa Vaddagiri
On Fri, Dec 03, 2010 at 11:38:33AM +0200, Avi Kivity wrote:
 What if one of the guest crashes qemu or invokes a powerdown?
 Suddenly the others get 33% each (with 1% going to my secret
 round-up account).  Doesn't seem like a reliable way to limit cpu.

Some monitoring tool will need to catch that event and spawn a
dummy VM to consume 25% cpu, bringing back everyone's use to 25% as before.

That's admittedly not neat, but that's what we are thinking of atm in the absence
of a better solution to the problem (ex: kernel scheduler supporting hard-limits).

- vatsa



Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)

2010-12-03 Thread Srivatsa Vaddagiri
On Fri, Dec 03, 2010 at 11:40:27AM +0200, Avi Kivity wrote:
 On 12/02/2010 09:14 PM, Chris Wright wrote:
 Perhaps it should be a VM level option.  And then invert the notion.
 Create one idle domain w/out hlt trap.  Give that VM a vcpu per pcpu
 (pin in place probably).  And have that VM do nothing other than hlt.
 Then it's always runnable according to scheduler, and can consume the
 extra work that CFS wants to give away.
 
 What's the difference between this and the Linux idle threads?

If we have 3 VMs and want to give them 25% each of a CPU, then having just the
idle thread would end up giving them 33% each. One way of achieving a 25% rate
limit is to create a dummy or filler VM, and let it compete for the resource,
thus rate-limiting everyone to 25% in this case. Essentially we are tackling
the rate-limit problem by creating additional filler VMs/threads that compete
for the resource, thus keeping in check how much cpu is consumed by the real
VMs. Admittedly not as neat as having in-kernel support for rate limits.

- vatsa



Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)

2010-12-03 Thread Srivatsa Vaddagiri
On Thu, Dec 02, 2010 at 11:14:16AM -0800, Chris Wright wrote:
 Perhaps it should be a VM level option.  And then invert the notion.
 Create one idle domain w/out hlt trap.  Give that VM a vcpu per pcpu
 (pin in place probably).  And have that VM do nothing other than hlt.
 Then it's always runnable according to scheduler, and can consume the
 extra work that CFS wants to give away.

That's not sufficient. Let's say we have 3 guests A, B, C that need to be rate
limited to 25% on a single cpu system. We create this idle guest D that is a 100%
cpu hog per the above definition. Now when one of the guests, say A, is idle, what
ensures that the idle cycles of A are given only to D and not partly to B/C?

- vatsa


Re: [PATCH 01/12] KVM: SVM: Add clean-bits infrastructure code

2010-12-03 Thread Roedel, Joerg
On Fri, Dec 03, 2010 at 05:45:48AM -0500, Joerg Roedel wrote:
 +enum {
 + VMCB_CLEAN_MAX,
 +};

This is a left-over from an earlier version. I forgot to remove it. Here
is an updated patch. Sorry.


From 7e3f4f175561429d0054daac94763e67d12424ba Mon Sep 17 00:00:00 2001
From: Joerg Roedel joerg.roe...@amd.com
Date: Wed, 1 Dec 2010 12:01:08 +0100
Subject: [PATCH 01/12] KVM: SVM: Add clean-bits infrastructure code

This patch adds the infrastructure for the implementation of
the individual clean-bits.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
---
 arch/x86/include/asm/svm.h |3 ++-
 arch/x86/kvm/svm.c |   31 +++
 2 files changed, 33 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 11dbca7..235dd73 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -79,7 +79,8 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
u32 event_inj_err;
u64 nested_cr3;
u64 lbr_ctl;
-   u64 reserved_5;
+   u32 clean;
+   u32 reserved_5;
u64 next_rip;
u8 reserved_6[816];
 };
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index c00ea90..3ee59b0 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -185,6 +185,28 @@ static int nested_svm_vmexit(struct vcpu_svm *svm);
 static int nested_svm_check_exception(struct vcpu_svm *svm, unsigned nr,
  bool has_error_code, u32 error_code);
 
+enum {
+   VMCB_DIRTY_MAX,
+};
+
+#define VMCB_ALWAYS_DIRTY_MASK 0U
+
+static inline void mark_all_dirty(struct vmcb *vmcb)
+{
+   vmcb->control.clean = 0;
+}
+
+static inline void mark_all_clean(struct vmcb *vmcb)
+{
+   vmcb->control.clean = ((1 << VMCB_DIRTY_MAX) - 1)
+   & ~VMCB_ALWAYS_DIRTY_MASK;
+}
+
+static inline void mark_dirty(struct vmcb *vmcb, int bit)
+{
+   vmcb->control.clean &= ~(1 << bit);
+}
+
 static inline struct vcpu_svm *to_svm(struct kvm_vcpu *vcpu)
 {
return container_of(vcpu, struct vcpu_svm, vcpu);
@@ -973,6 +995,8 @@ static void init_vmcb(struct vcpu_svm *svm)
set_intercept(svm, INTERCEPT_PAUSE);
}
 
+   mark_all_dirty(svm->vmcb);
+
enable_gif(svm);
 }
 
@@ -1089,6 +1113,7 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 
if (unlikely(cpu != vcpu->cpu)) {
svm->asid_generation = 0;
+   mark_all_dirty(svm->vmcb);
}
 
 #ifdef CONFIG_X86_64
@@ -2139,6 +2164,8 @@ static int nested_svm_vmexit(struct vcpu_svm *svm)
svm->vmcb->save.cpl = 0;
svm->vmcb->control.exit_int_info = 0;
 
+   mark_all_dirty(svm->vmcb);
+
nested_svm_unmap(page);
 
nested_svm_uninit_mmu_context(&svm->vcpu);
@@ -2350,6 +2377,8 @@ static bool nested_svm_vmrun(struct vcpu_svm *svm)
 
enable_gif(svm);
 
+   mark_all_dirty(svm->vmcb);
+
return true;
 }
 
@@ -3487,6 +3516,8 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu)
if (unlikely(svm->vmcb->control.exit_code ==
 SVM_EXIT_EXCP_BASE + MC_VECTOR))
svm_handle_mce(svm);
+
+   mark_all_clean(svm->vmcb);
 }
 
 #undef R
-- 
1.7.1

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632



Re: [RFC PATCH 1/3] kvm: keep track of which task is running a KVM vcpu

2010-12-03 Thread Srivatsa Vaddagiri
On Thu, Dec 02, 2010 at 02:43:24PM -0500, Rik van Riel wrote:
  mutex_lock(&vcpu->mutex);
 + vcpu->task = current;

Shouldn't we grab a reference to the current task_struct before storing a
pointer to it?

- vatsa


Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)

2010-12-03 Thread Gleb Natapov
On Thu, Dec 02, 2010 at 02:51:51PM -0600, Anthony Liguori wrote:
 On 12/02/2010 02:12 PM, Marcelo Tosatti wrote:
   opt = CPU_BASED_TPR_SHADOW |
 CPU_BASED_USE_MSR_BITMAPS |
 CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
 -- 
 1.7.0.4
 Breaks async PF (see checks on guest state),
 Sorry, I don't follow what you mean here.  Can you elaborate?
 VCPU in HLT state only allows injection of certain events that
 would be delivered on HLT. #PF is not one of them.
 
 But you can't inject an exception into a guest while the VMCS is
 active, can you?  So the guest takes an exit while in the hlt
 instruction but that's no different than if the guest has been
 interrupted because of hlt exiting.
 
Async PF completion does not kick the vcpu out of guest mode. It wakes the
vcpu only if it is waiting on a waitqueue. This was done to avoid generating
unnecessary overhead.
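
(Illustratively, the wake-up described here amounts to something like the
following, using standard kernel waitqueue primitives; this is a sketch, not
the literal async-PF completion code:

    /* only wake a vcpu that is actually blocked on its waitqueue */
    if (waitqueue_active(&vcpu->wq))
            wake_up_interruptible(&vcpu->wq);

A vcpu that is running in guest mode is simply left undisturbed.)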

 You'd have to handle this situation on event injection, vmentry fails
 otherwise. Or perhaps clear HLT state on vmexit and vmentry.
 
 So this works today because on a hlt exit, emulate_halt() will clear
 the the HLT state which then puts the the vcpu into a state where it
 can receive an exception injection?
 
 Regards,
 
 Anthony Liguori
 
   timer reinjection
 probably.
 Timer reinjection will continue to work as expected.  If a guest is
 halting and an external interrupt is delivered (by a timer), the guest
 will still exit as expected.
 
 I can't think of anything that would be functionally correct and still
 depend on getting hlt exits because ultimately, a guest never
 actually has to do a hlt (and certainly there are guests that
 won't).
 LAPIC pending timer events will be reinjected on entry path, if
 accumulated. So they depend on any exit. If you disable HLT-exiting,
 delay will increase. OK, maybe that's irrelevant.
 
   It should be possible to achieve determinism with
 a scheduler policy?
 If the desire is the ultimate desire is to have the guests be
 scheduled in a non-work conserving fashion, I can't see a more
 direct approach than to simply not have the guests yield (which is
 ultimately what hlt trapping does).
 
 Anything the scheduler would do is after the fact and probably based
 on inference about why the yield.
 Another issue is you ignore the host's idea of the best way to sleep
 (ACPI, or whatever).
 
 And handling inactive HLT state (which was never enabled) can be painful.
 
 
 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
Gleb.


[KVM-Autotest][PATCH][virtio-console] Correcting timeout interruption of virtio_console test.

2010-12-03 Thread Jiří Župka
Catch the new exception from kvm_subprocess to avoid killing the tests.
---
 client/tests/kvm/tests/virtio_console.py |   10 +++---
 1 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/client/tests/kvm/tests/virtio_console.py 
b/client/tests/kvm/tests/virtio_console.py
index d8d2143..d083783 100644
--- a/client/tests/kvm/tests/virtio_console.py
+++ b/client/tests/kvm/tests/virtio_console.py
@@ -468,9 +468,13 @@ def run_virtio_console(test, params, env):
logging.debug("Executing '%s' on virtio_guest.py loop, vm: %s, " +
  "timeout: %s", command, vm[0].name, timeout)
 vm[1].sendline(command)
-(match, data) = vm[1].read_until_last_line_matches(["PASS:",
-"FAIL:[Failed to 
execute]"],
-timeout)
+try:
+(match, data) = vm[1].read_until_last_line_matches(["PASS:",
+"FAIL:"],
+   timeout)
+except (kvm_subprocess.ExpectTimeoutError):
+match = None
+data = None
 return (match, data)
 
 
-- 
1.7.3.2



Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-03 Thread Peter Zijlstra
On Thu, 2010-12-02 at 14:44 -0500, Rik van Riel wrote:
unsigned long clone_flags);
 +
 +#ifdef CONFIG_SCHED_HRTICK
 +extern u64 slice_remain(struct task_struct *);
 +extern void yield_to(struct task_struct *);
 +#else
 +static inline void yield_to(struct task_struct *p) yield()
 +#endif

What does SCHED_HRTICK have to do with any of this?

  #ifdef CONFIG_SMP
   extern void kick_process(struct task_struct *tsk);
  #else
 diff --git a/kernel/sched.c b/kernel/sched.c
 index f8e5a25..ef088cd 100644
 --- a/kernel/sched.c
 +++ b/kernel/sched.c
 @@ -1909,6 +1909,26 @@ static void dequeue_task(struct rq *rq, struct 
 task_struct *p, int sleep)
  p->se.on_rq = 0;
  }
  
 +/**
 + * requeue_task - requeue a task which priority got changed by yield_to

priority doesn't seem the right word, you're not actually changing
anything related to p->*prio

 + * @rq: the task's runqueue
 + * @p: the task in question
 + * Must be called with the runqueue lock held. Will cause the CPU to
 + * reschedule if p is now at the head of the runqueue.
 + */
 +void requeue_task(struct rq *rq, struct task_struct *p)
 +{
 + assert_spin_locked(&rq->lock);
 +
 + if (!p->se.on_rq || task_running(rq, p) || task_has_rt_policy(p))
 + return;
 +
 + dequeue_task(rq, p, 0);
 + enqueue_task(rq, p, 0);
 +
 + resched_task(p);

I guess that wants to be something like check_preempt_curr()

 +}
 +
  /*
   * __normal_prio - return the priority that is based on the static prio
   */
 @@ -6797,6 +6817,36 @@ SYSCALL_DEFINE3(sched_getaffinity, pid_t, pid, 
 unsigned int, len,
   return ret;
  }
  
 +#ifdef CONFIG_SCHED_HRTICK

Still wondering what all this has to do with SCHED_HRTICK..

 +/*
 + * Yield the CPU, giving the remainder of our time slice to task p.
 + * Typically used to hand CPU time to another thread inside the same
 + * process, eg. when p holds a resource other threads are waiting for.
 + * Giving priority to p may help get that resource released sooner.
 + */
 +void yield_to(struct task_struct *p)
 +{
 + unsigned long flags;
 + struct sched_entity *se = &p->se;
 + struct rq *rq;
 + struct cfs_rq *cfs_rq;
 + u64 remain = slice_remain(current);
 +
 + rq = task_rq_lock(p, &flags);
 + if (task_running(rq, p) || task_has_rt_policy(p))
 + goto out;

See, this all ain't nice, slice_remain() don't make no sense to be
called for !fair tasks.

Why not write:

  if (curr->sched_class == p->sched_class &&
  curr->sched_class->yield_to)
curr->sched_class->yield_to(curr, p);

or something, and then implement sched_class_fair::yield_to only,
leaving it a NOP for all other classes.


Also, I think you can side-step that whole curr vs p rq->lock thing
you're doing here, by holding p's rq->lock, you've disabled IRQs in
current's task context, since ->sum_exec_runtime and all are only
changed during scheduling and the scheduler_tick, disabling IRQs in its
task context pins them.

 + cfs_rq = cfs_rq_of(se);
 + se->vruntime -= remain;
 + if (se->vruntime < cfs_rq->min_vruntime)
 + se->vruntime = cfs_rq->min_vruntime;

Now here we have another problem, remain was measured in wall-time, and
then you go change a virtual time measure using that. These things are
related like:

 vt = t/weight

So you're missing a weight factor somewhere.
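
As a sketch, the missing scaling is the wall-time to virtual-time conversion
the fair class does elsewhere, along the lines of calc_delta_fair(); treat the
names as approximate, not a drop-in fix:

    /* scale a wall-clock delta by NICE_0_LOAD relative to the entity's
     * weight, so heavier tasks accrue vruntime more slowly */
    static u64 wall_to_vruntime(u64 delta_exec, struct sched_entity *se)
    {
            if (unlikely(se->load.weight != NICE_0_LOAD))
                    delta_exec = calc_delta_mine(delta_exec, NICE_0_LOAD,
                                                 &se->load);
            return delta_exec;
    }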

Also, that check against min_vruntime doesn't really make much sense.


 + requeue_task(rq, p);

Just makes me wonder why you added requeue task to begin with.. why not
simply dequeue at the top of this function, and enqueue at the tail,
like all the rest does: see rt_mutex_setprio(), set_user_nice(),
sched_move_task().

 + out:
 + task_rq_unlock(rq, flags);
 + yield();
 +}
 +EXPORT_SYMBOL(yield_to);

EXPORT_SYMBOL_GPL() pretty please, I really hate how kvm is a module and
needs to export hooks all over the core kernel :/

 +#endif
 +
  /**
   * sys_sched_yield - yield the current processor to other threads.
   *
 diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
 index 5119b08..2a0a595 100644
 --- a/kernel/sched_fair.c
 +++ b/kernel/sched_fair.c
 @@ -974,6 +974,25 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity 
 *curr, int queued)
   */
  
  #ifdef CONFIG_SCHED_HRTICK
 +u64 slice_remain(struct task_struct *p)
 +{
 + unsigned long flags;
 + struct sched_entity *se = &p->se;
 + struct cfs_rq *cfs_rq;
 + struct rq *rq;
 + u64 slice, ran;
 + s64 delta;
 +
 + rq = task_rq_lock(p, &flags);
 + cfs_rq = cfs_rq_of(se);
 + slice = sched_slice(cfs_rq, se);
 + ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
 + delta = slice - ran;
 + task_rq_unlock(rq, &flags);
 +
 + return max(delta, 0LL);
 +}


Right, so another approach might be to simply swap the vruntime between
curr and p.
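
For concreteness, the swap could look roughly like this (a sketch only,
assuming both entities sit on the same cfs_rq and are requeued afterwards):

    static void swap_vruntime(struct sched_entity *a, struct sched_entity *b)
    {
            u64 tmp = a->vruntime;

            a->vruntime = b->vruntime;
            b->vruntime = tmp;
            /* the caller must dequeue/enqueue both entities so their
             * rbtree positions reflect the swapped vruntimes */
    }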


Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-03 Thread Srivatsa Vaddagiri
On Fri, Dec 03, 2010 at 02:23:39PM +0100, Peter Zijlstra wrote:
 Right, so another approach might be to simply swap the vruntime between
 curr and p.

Can't that cause others to starve? For ex: consider a cpu p0 having these tasks:

p0 -> A0 B0 A1

A0/A1 have entered some sort of AB-BA spin-deadlock, as a result A0 wants to
direct yield to A1 which wants to direct yield to A0. If we keep swapping their
runtimes, can't it starve B0?

- vatsa




Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-03 Thread Srivatsa Vaddagiri
On Fri, Dec 03, 2010 at 06:54:16AM +0100, Mike Galbraith wrote:
  +void yield_to(struct task_struct *p)
  +{
  +   unsigned long flags;
  +   struct sched_entity *se = &p->se;
  +   struct rq *rq;
  +   struct cfs_rq *cfs_rq;
  +   u64 remain = slice_remain(current);
 
 That slice remaining only shows the distance to last preempt, however
 brief.  It shows nothing wrt tree position, the yielding task may well
 already be right of the task it wants to yield to, having been a buddy.

Good point.

  cfs_rq = cfs_rq_of(se);
  +   se->vruntime -= remain;
  +   if (se->vruntime < cfs_rq->min_vruntime)
  +   se->vruntime = cfs_rq->min_vruntime;
 
 (This is usually done using max_vruntime())
 
 If the recipient was already left of the fair stick (min_vruntime),
 clipping advances its vruntime, vaporizing entitlement from both donor
 and recipient.
 
 What if a task tries to yield to another not on the same cpu, and/or in
 the same task group?

In this case, target of yield_to is a vcpu belonging to the same VM and hence is
expected to be in the same task group, but I agree it's good to put a check.

  This would munge min_vruntime of other queues.  I
 think you'd have to restrict this to same cpu, same group.  If tasks can
 donate cross cfs_rq, (say) pinned task A cpu A running solo could donate
 vruntime to selected tasks pinned to cpu B, for as long as minuscule
 preemptions can resupply ammo.  Would suck to not be the favored child.

IOW starving non-favored children?

 Maybe you could exchange vruntimes cooperatively (iff same cfs_rq)
 between threads, but I don't think donations with clipping works.

Can't that lead to starvation again (as I pointed in a mail to Peterz):

p0 -> A0 B0 A1

A0/A1 enter a yield_to(other) deadlock, which means we keep swapping their
vruntimes, starving B0?

- vatsa


Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-03 Thread Peter Zijlstra
On Fri, 2010-12-03 at 19:00 +0530, Srivatsa Vaddagiri wrote:
 On Fri, Dec 03, 2010 at 02:23:39PM +0100, Peter Zijlstra wrote:
  Right, so another approach might be to simply swap the vruntime between
  curr and p.
 
 Can't that cause others to starve? For ex: consider a cpu p0 having these 
 tasks:
 
 p0 -> A0 B0 A1
 
 A0/A1 have entered some sort of AB-BA spin-deadlock, as a result A0 wants to
 direct yield to A1 which wants to direct yield to A0. If we keep swapping
 their runtimes, can't it starve B0?

No, because they do receive service (they spend some time spinning
before being interrupted), so the respective vruntimes will increase, at
some point they'll pass B0 and it'll get scheduled.




Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-03 Thread Srivatsa Vaddagiri
On Fri, Dec 03, 2010 at 03:03:30PM +0100, Peter Zijlstra wrote:
 No, because they do receive service (they spend some time spinning
 before being interrupted), so the respective vruntimes will increase, at
 some point they'll pass B0 and it'll get scheduled.

Is that sufficient to ensure that B0 receives its fair share (1/3 cpu in this
case)?

- vatsa


Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-03 Thread Srivatsa Vaddagiri
On Fri, Dec 03, 2010 at 07:36:07PM +0530, Srivatsa Vaddagiri wrote:
 On Fri, Dec 03, 2010 at 03:03:30PM +0100, Peter Zijlstra wrote:
  No, because they do receive service (they spend some time spinning
  before being interrupted), so the respective vruntimes will increase, at
  some point they'll pass B0 and it'll get scheduled.
 
 Is that sufficient to ensure that B0 receives its fair share (1/3 cpu in this
 case)?

Hmm, perhaps yes, although at the cost of tons of context switches, which it
would be nice to minimize?

- vatsa


Re: [RFC PATCH 1/3] kvm: keep track of which task is running a KVM vcpu

2010-12-03 Thread Rik van Riel

On 12/03/2010 07:17 AM, Srivatsa Vaddagiri wrote:

On Thu, Dec 02, 2010 at 02:43:24PM -0500, Rik van Riel wrote:

mutex_lock(&vcpu->mutex);
+   vcpu->task = current;


Shouldn't we grab a reference to the current task_struct before storing a pointer to
it?


That should not be needed, since current cannot go away without
setting vcpu->task back to NULL in vcpu_put.
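
Sketched against the RFC patch, the invariant being relied on is that both
assignments happen in the task's own context (vcpu->task is the field the
patch adds; surrounding details elided):

void vcpu_load(struct kvm_vcpu *vcpu)
{
	mutex_lock(&vcpu->mutex);
	vcpu->task = current;	/* set while current is pinned here */
	/* ... */
}

void vcpu_put(struct kvm_vcpu *vcpu)
{
	/* ... */
	vcpu->task = NULL;	/* cleared before current can exit */
	mutex_unlock(&vcpu->mutex);
}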



Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)

2010-12-03 Thread Anthony Liguori

On 12/02/2010 09:44 PM, Chris Wright wrote:

Yes.

There's definitely a use-case to have a hard cap.
 

OK, good, just wanted to be clear.  Because this started as a discussion
of hard caps, and it began to sound as if you were no longer advocating
for them.

   

But I think another common use-case is really just performance
isolation.  If over the course of a day, you go from 12CU, to 6CU,
to 4CU, that might not be that bad of a thing.
 

I guess it depends on your SLA.  We don't have to do anything to give
varying CU based on host load.  That's the one thing CFS will do for
us quite well ;)
   


I'm really anticipating things like the EC2 micro instance where the CPU 
allotment is variable.  Variable allotments are interesting from a 
density perspective but having interdependent performance is definitely 
a problem.


Another way to think about it: a customer reports a performance problem 
at 1PM.  With non-yielding guests, you can look at logs and see that the 
expected capacity was 2CU (it may have changed to 4CU at 3PM).  However, 
without something like non-yielding guests, the performance is almost 
entirely unpredictable and unless you have an exact timestamp from the 
customer along with a fine granularity performance log, there's no way 
to determine whether it's expected behavior.



If the environment is designed correctly, of N nodes, N-1 will
always be at capacity so it's really just a single node that is under
utilized.
 

Many clouds do a variation on Small, Medium, Large sizing.  So depending
on the scheduler (best fit, rr...) even the notion of "at capacity" may
change from node to node and during the time of day.
   


An ideal cloud will make sure that something like 4 Small == 2 Medium == 
1 Large instance and that the machine capacity is always a multiple of 
Large instance size.


With a division like this, you can always achieve maximum density 
provided that you can support live migration.


Regards,

Anthony Liguori



thanks,
-chris
   




[PATCH 1/3] KVM: SVM: Remove flush_guest_tlb function

2010-12-03 Thread Joerg Roedel
This function is unused and there is svm_flush_tlb which
does the same. So this function can be removed.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
---
 arch/x86/kvm/svm.c |5 -
 1 files changed, 0 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index c00ea90..772d48e 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -389,11 +389,6 @@ static inline void force_new_asid(struct kvm_vcpu *vcpu)
	to_svm(vcpu)->asid_generation--;
 }
 
-static inline void flush_guest_tlb(struct kvm_vcpu *vcpu)
-{
-   force_new_asid(vcpu);
-}
-
 static int get_npt_level(void)
 {
 #ifdef CONFIG_X86_64
-- 
1.7.1




[PATCH 0/3] KVM: SVM: Add support Flush-By-ASID feature

2010-12-03 Thread Joerg Roedel
Hi Avi, Hi Marcelo,

here is the patch-set to add support for the Flush-By-ASID feature to
KVM on AMD. Patches 1 and 2 clean up the code a little bit and patch 3
implements the feature itself.

Regards,

Joerg

 arch/x86/include/asm/svm.h |2 ++
 arch/x86/kvm/svm.c |   32 ++--
 2 files changed, 16 insertions(+), 18 deletions(-)

Joerg Roedel (3):
  KVM: SVM: Remove flush_guest_tlb function
  KVM: SVM: Use svm_flush_tlb instead of force_new_asid
  KVM: SVM: Implement Flush-By-Asid feature




[PATCH 3/3] KVM: SVM: Implement Flush-By-Asid feature

2010-12-03 Thread Joerg Roedel
This patch adds the new flush-by-asid of upcoming AMD
processors to the KVM-AMD module.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
---
 arch/x86/include/asm/svm.h |2 ++
 arch/x86/kvm/svm.c |   10 --
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 11dbca7..f20ff4f 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -87,6 +87,8 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
 
 #define TLB_CONTROL_DO_NOTHING 0
 #define TLB_CONTROL_FLUSH_ALL_ASID 1
+#define TLB_CONTROL_FLUSH_ASID 3
+#define TLB_CONTROL_FLUSH_ASID_LOCAL 7
 
 #define V_TPR_MASK 0x0f
 
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index b70a1e8..cdc8ab9 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3093,7 +3093,6 @@ static void pre_svm_run(struct vcpu_svm *svm)
 
struct svm_cpu_data *sd = per_cpu(svm_data, cpu);
 
-   svm->vmcb->control.tlb_ctl = TLB_CONTROL_DO_NOTHING;
	/* FIXME: handle wraparound of asid_generation */
	if (svm->asid_generation != sd->asid_generation)
new_asid(svm, sd);
@@ -3237,7 +3236,12 @@ static int svm_set_tss_addr(struct kvm *kvm, unsigned 
int addr)
 
 static void svm_flush_tlb(struct kvm_vcpu *vcpu)
 {
-   to_svm(vcpu)->asid_generation--;
+   struct vcpu_svm *svm = to_svm(vcpu);
+
+   if (static_cpu_has(X86_FEATURE_FLUSHBYASID))
+   svm->vmcb->control.tlb_ctl = TLB_CONTROL_FLUSH_ASID;
+   else
+   svm->asid_generation--;
 }
 
 static void svm_prepare_guest_switch(struct kvm_vcpu *vcpu)
@@ -3461,6 +3465,8 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu)
 
	svm->next_rip = 0;

+   svm->vmcb->control.tlb_ctl = TLB_CONTROL_DO_NOTHING;
+
	/* if exit due to PF check for async PF */
	if (svm->vmcb->control.exit_code == SVM_EXIT_EXCP_BASE + PF_VECTOR)
	svm->apf_reason = kvm_read_and_reset_pf_reason();
-- 
1.7.1




[PATCH 2/3] KVM: SVM: Use svm_flush_tlb instead of force_new_asid

2010-12-03 Thread Joerg Roedel
This patch replaces all calls to force_new_asid which are
intended to flush the guest-tlb by the more appropriate
function svm_flush_tlb. As a side-effect the force_new_asid
function is removed.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
---
 arch/x86/kvm/svm.c |   19 +++
 1 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 772d48e..b70a1e8 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -384,11 +384,6 @@ static inline void invlpga(unsigned long addr, u32 asid)
	asm volatile (__ex(SVM_INVLPGA) : : "a"(addr), "c"(asid));
 }
 
-static inline void force_new_asid(struct kvm_vcpu *vcpu)
-{
-   to_svm(vcpu)->asid_generation--;
-}
-
 static int get_npt_level(void)
 {
 #ifdef CONFIG_X86_64
@@ -958,7 +953,7 @@ static void init_vmcb(struct vcpu_svm *svm)
	save->cr3 = 0;
	save->cr4 = 0;
	}
-   force_new_asid(&svm->vcpu);
+   svm->asid_generation = 0;
 
	svm->nested.vmcb = 0;
	svm->vcpu.arch.hflags = 0;
@@ -1371,7 +1366,7 @@ static void svm_set_cr4(struct kvm_vcpu *vcpu, unsigned 
long cr4)
	unsigned long old_cr4 = to_svm(vcpu)->vmcb->save.cr4;
 
	if (npt_enabled && ((old_cr4 ^ cr4) & X86_CR4_PGE))
-   force_new_asid(vcpu);
+   svm_flush_tlb(vcpu);
 
	vcpu->arch.cr4 = cr4;
if (!npt_enabled)
@@ -1706,7 +1701,7 @@ static void nested_svm_set_tdp_cr3(struct kvm_vcpu *vcpu,
struct vcpu_svm *svm = to_svm(vcpu);
 
	svm->vmcb->control.nested_cr3 = root;
-   force_new_asid(vcpu);
+   svm_flush_tlb(vcpu);
 }
 
 static void nested_svm_inject_npf_exit(struct kvm_vcpu *vcpu)
@@ -2307,7 +2302,7 @@ static bool nested_svm_vmrun(struct vcpu_svm *svm)
	svm->nested.intercept_exceptions = 
	nested_vmcb->control.intercept_exceptions;
	svm->nested.intercept = nested_vmcb->control.intercept;
 
-   force_new_asid(&svm->vcpu);
+   svm_flush_tlb(&svm->vcpu);
	svm->vmcb->control.int_ctl = nested_vmcb->control.int_ctl | 
	V_INTR_MASKING_MASK;
	if (nested_vmcb->control.int_ctl & V_INTR_MASKING_MASK)
	svm->vcpu.arch.hflags |= HF_VINTR_MASK;
@@ -3242,7 +3237,7 @@ static int svm_set_tss_addr(struct kvm *kvm, unsigned int 
addr)
 
 static void svm_flush_tlb(struct kvm_vcpu *vcpu)
 {
-   force_new_asid(vcpu);
+   to_svm(vcpu)->asid_generation--;
 }
 
 static void svm_prepare_guest_switch(struct kvm_vcpu *vcpu)
@@ -3491,7 +3486,7 @@ static void svm_set_cr3(struct kvm_vcpu *vcpu, unsigned 
long root)
struct vcpu_svm *svm = to_svm(vcpu);
 
	svm->vmcb->save.cr3 = root;
-   force_new_asid(vcpu);
+   svm_flush_tlb(vcpu);
 }
 
 static void set_tdp_cr3(struct kvm_vcpu *vcpu, unsigned long root)
@@ -3503,7 +3498,7 @@ static void set_tdp_cr3(struct kvm_vcpu *vcpu, unsigned 
long root)
/* Also sync guest cr3 here in case we live migrate */
	svm->vmcb->save.cr3 = vcpu->arch.cr3;
 
-   force_new_asid(vcpu);
+   svm_flush_tlb(vcpu);
 }
 
 static int is_disabled(void)
-- 
1.7.1




Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-03 Thread Mike Galbraith
On Fri, 2010-12-03 at 19:16 +0530, Srivatsa Vaddagiri wrote:
 On Fri, Dec 03, 2010 at 06:54:16AM +0100, Mike Galbraith wrote:
   +void yield_to(struct task_struct *p)
   +{
   + unsigned long flags;
   + struct sched_entity *se = &p->se;
   + struct rq *rq;
   + struct cfs_rq *cfs_rq;
   + u64 remain = slice_remain(current);
  
  That slice remaining only shows the distance to last preempt, however
  brief.  It shows nothing wrt tree position, the yielding task may well
  already be right of the task it wants to yield to, having been a buddy.
 
 Good point.
 
   cfs_rq = cfs_rq_of(se);
    + se->vruntime -= remain;
    + if (se->vruntime < cfs_rq->min_vruntime)
    + se->vruntime = cfs_rq->min_vruntime;
  
  (This is usually done using max_vruntime())
  
  If the recipient was already left of the fair stick (min_vruntime),
  clipping advances its vruntime, vaporizing entitlement from both donor
  and recipient.
  
  What if a task tries to yield to another not on the same cpu, and/or in
  the same task group?
 
 In this case, target of yield_to is a vcpu belonging to the same VM and hence
 is expected to be in the same task group, but I agree it's good to put a check.
 
   This would munge min_vruntime of other queues.  I
  think you'd have to restrict this to same cpu, same group.  If tasks can
  donate cross cfs_rq, (say) pinned task A cpu A running solo could donate
  vruntime to selected tasks pinned to cpu B, for as long as minuscule
  preemptions can resupply ammo.  Would suck to not be the favored child.
 
 IOW starving non-favored children?

Yes, as in fairness ceases to exists.

  Maybe you could exchange vruntimes cooperatively (iff same cfs_rq)
  between threads, but I don't think donations with clipping works.
 
 Can't that lead to starvation again (as I pointed in a mail to Peterz):
 
 p0 -> A0 B0 A1
 
 A0/A1 enter a yield_to(other) deadlock, which means we keep swapping their
 vruntimes, starving B0?

I'll have to go back and re-read that.  Off the top of my head, I see no
way it could matter which container the numbers live in as long as they
keep advancing, and stay in the same runqueue.  (hm, task weights would
have to be the same too or scaled. dangerous business, tinkering with
vruntimes)

-Mike



Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-03 Thread Rik van Riel

On 12/03/2010 09:45 AM, Mike Galbraith wrote:


I'll have to go back and re-read that.  Off the top of my head, I see no
way it could matter which container the numbers live in as long as they
keep advancing, and stay in the same runqueue.  (hm, task weights would
have to be the same too or scaled. dangerous business, tinkering with
vruntimes)


They're not necessarily in the same runqueue, the
VCPU that is given time might be on another CPU
than the one that was spinning on a lock.



Re: [RFC PATCH 1/3] kvm: keep track of which task is running a KVM vcpu

2010-12-03 Thread Rik van Riel

On 12/02/2010 08:18 PM, Chris Wright wrote:

* Rik van Riel (r...@redhat.com) wrote:

Keep track of which task is running a KVM vcpu.  This helps us
figure out later what task to wake up if we want to boost a
vcpu that got preempted.

Unfortunately there are no guarantees that the same task
always keeps the same vcpu, so we can only track the task
across a single run of the vcpu.


So shouldn't it be confined to KVM_RUN?  The other vcpu_load callers aren't
always a vcpu in a useful runnable state.


Yeah, probably.  If you want I can move the setting of
vcpu->task to kvm_vcpu_ioctl.



Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-03 Thread Mike Galbraith
On Fri, 2010-12-03 at 09:48 -0500, Rik van Riel wrote:
 On 12/03/2010 09:45 AM, Mike Galbraith wrote:
 
  I'll have to go back and re-read that.  Off the top of my head, I see no
  way it could matter which container the numbers live in as long as they
  keep advancing, and stay in the same runqueue.  (hm, task weights would
  have to be the same too or scaled. dangerous business, tinkering with
  vruntimes)
 
 They're not necessarily in the same runqueue, the
 VCPU that is given time might be on another CPU
 than the one that was spinning on a lock.

I don't think pumping vruntime cross cfs_rq would be safe, for the
reason noted (et al).  No competition means vruntime is meaningless.
Donating just advances a clock that nobody's looking at.

-Mike



Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-03 Thread Rik van Riel

On 12/03/2010 10:09 AM, Mike Galbraith wrote:

On Fri, 2010-12-03 at 09:48 -0500, Rik van Riel wrote:

On 12/03/2010 09:45 AM, Mike Galbraith wrote:


I'll have to go back and re-read that.  Off the top of my head, I see no
way it could matter which container the numbers live in as long as they
keep advancing, and stay in the same runqueue.  (hm, task weights would
have to be the same too or scaled. dangerous business, tinkering with
vruntimes)


They're not necessarily in the same runqueue, the
VCPU that is given time might be on another CPU
than the one that was spinning on a lock.


I don't think pumping vruntime cross cfs_rq would be safe, for the
reason noted (et al).  No competition means vruntime is meaningless.
Donating just advances a clock that nobody's looking at.


Do you have suggestions on what I should do to make
this yield_to functionality work?

I'm willing to implement pretty much anything the
scheduler people will be happy with :)



Re: [stable] [PATCH 3/3][STABLE] KVM: add schedule check to napi_enable call

2010-12-03 Thread Peter Lieven

Am 04.06.2010 um 02:02 schrieb Bruce Rogers:

 On 6/3/2010 at 04:51 PM, Greg KH g...@kroah.com wrote: 
 On Thu, Jun 03, 2010 at 04:17:34PM -0600, Bruce Rogers wrote:
 On 6/3/2010 at 03:03 PM, Greg KH g...@kroah.com wrote: 
 On Thu, Jun 03, 2010 at 01:38:31PM -0600, Bruce Rogers wrote:
virtio_net: Add schedule check to napi_enable call
Under harsh testing conditions, including low memory, the guest would
stop receiving packets. With this patch applied we no longer see any
problems in the driver while performing these tests for extended periods
of time.
 
Make sure napi is scheduled subsequent to each napi_enable.
 
Signed-off-by: Bruce Rogers brog...@novell.com
Signed-off-by: Olaf Kirch o...@suse.de
 
 I need a git commit id for this one as well.
 
 
 This one is not upstream.
 
 Then I can't include it in the -stable tree, so why are you sending it
 to me?  :)
 
 thanks,
 
 greg k-h
 
 Good point!
 Sorry about the confusion.
 Bruce

Is there any reason why this patch is not yet in the stable kernel?
virtio_net is still crashing under heavy load in the current vanilla kernel.
This patch *might be* the solution.
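
For reference, a hedged sketch of the fix being discussed (the helper name is
assumed): after napi_enable(), schedule NAPI once so that packets which
arrived while NAPI was disabled get processed instead of being stranded with
no further interrupt:

static void virtnet_napi_enable(struct virtnet_info *vi)
{
	napi_enable(&vi->napi);

	/* if the device filled all buffers while NAPI was disabled,
	 * no further interrupt will arrive, so poll once ourselves */
	if (napi_schedule_prep(&vi->napi))
		__napi_schedule(&vi->napi);
}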


 


Re: [RFC PATCH 1/3] kvm: keep track of which task is running a KVM vcpu

2010-12-03 Thread Chris Wright
* Rik van Riel (r...@redhat.com) wrote:
 On 12/02/2010 08:18 PM, Chris Wright wrote:
 * Rik van Riel (r...@redhat.com) wrote:
 Keep track of which task is running a KVM vcpu.  This helps us
 figure out later what task to wake up if we want to boost a
 vcpu that got preempted.
 
 Unfortunately there are no guarantees that the same task
 always keeps the same vcpu, so we can only track the task
 across a single run of the vcpu.
 
 So shouldn't it be confined to KVM_RUN?  The other vcpu_load callers aren't
 always a vcpu in a useful runnable state.
 
 Yeah, probably.  If you want I can move the setting of
 vcpu->task to kvm_vcpu_ioctl.

Or maybe setting in sched_out and unsetting in sched_in.


Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-03 Thread Srivatsa Vaddagiri
On Fri, Dec 03, 2010 at 10:35:25AM -0500, Rik van Riel wrote:
 Do you have suggestions on what I should do to make
 this yield_to functionality work?

Keeping in mind the complications of yield_to, I had suggested we do something
along the lines suggested below:

http://marc.info/?l=kvm&m=129122645006996&w=2

Essentially yield to other tasks on your own runqueue and when you get to run
again, try reclaiming what you gave up earlier (with a cap on how much one can
feed back this relinquished time). It can be accomplished via a special form of 
yield(), available only to in-kernel users, kvm in this case.
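
A rough sketch of that idea (all names illustrative, not from an actual
patch): record how much was relinquished when yielding, and credit a capped
amount back when the task runs again:

	struct sched_entity *se = &current->se;
	u64 give, credit;

	/* special yield: deliberately fall behind, remember by how much */
	give = slice_remain(current);
	se->vruntime += give;
	se->relinquished += give;	/* hypothetical bookkeeping field */
	schedule();

	/* on the next pick: reclaim, with a cap to bound the feedback */
	credit = min(se->relinquished, (u64)RECLAIM_CAP);
	se->vruntime -= credit;
	se->relinquished -= credit;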

- vatsa 


Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)

2010-12-03 Thread Srivatsa Vaddagiri
On Fri, Dec 03, 2010 at 05:27:52PM +0530, Srivatsa Vaddagiri wrote:
 On Thu, Dec 02, 2010 at 11:14:16AM -0800, Chris Wright wrote:
  Perhaps it should be a VM level option.  And then invert the notion.
  Create one idle domain w/out hlt trap.  Give that VM a vcpu per pcpu
  (pin in place probably).  And have that VM do nothing other than hlt.
  Then it's always runnable according to scheduler, and can consume the
  extra work that CFS wants to give away.
 
 That's not sufficient. Let's say we have 3 guests A, B, C that need to be rate
 limited to 25% on a single cpu system. We create this idle guest D that is a
 100% cpu hog as per the above definition. Now when one of the guests is idle,
 what ensures that the idle cycles of A are given only to D and not partly to B/C?

To tackle this problem, I was thinking of having a fill-thread associated with
each vcpu (i.e. both belong to the same cgroup). The fill-thread consumes idle
cycles left by the vcpu, but otherwise doesn't compete with it for cycles.
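
A minimal sketch of such a fill-thread (purely illustrative): a SCHED_IDLE
kthread placed in the vcpu's cgroup, which approximately soaks up cycles the
vcpu leaves idle without competing against the vcpu itself:

static int vcpu_fill_thread(void *unused)
{
	struct sched_param param = { .sched_priority = 0 };

	sched_setscheduler(current, SCHED_IDLE, &param);
	while (!kthread_should_stop())
		cpu_relax();	/* consume only otherwise-idle cycles */
	return 0;
}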

- vatsa


[PATCH] KVM: SVM: Add xsetbv intercept

2010-12-03 Thread Joerg Roedel
This patch implements the xsetbv intercept to the AMD part
of KVM. This makes AVX usable in a safe way for the guest on
AVX capable AMD hardware.
The patch is tested by using AVX in the guest and host in
parallel and checking for data corruption. I also used the
KVM xsave unit-tests and they all pass.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
---
 arch/x86/include/asm/svm.h |2 ++
 arch/x86/kvm/svm.c |   16 
 2 files changed, 18 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 11dbca7..7f3a304 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -47,6 +47,7 @@ enum {
INTERCEPT_MONITOR,
INTERCEPT_MWAIT,
INTERCEPT_MWAIT_COND,
+   INTERCEPT_XSETBV,
 };
 
 
@@ -326,6 +327,7 @@ struct __attribute__ ((__packed__)) vmcb {
 #define SVM_EXIT_MONITOR   0x08a
 #define SVM_EXIT_MWAIT 0x08b
 #define SVM_EXIT_MWAIT_COND	0x08c
+#define SVM_EXIT_XSETBV	0x08d
 #define SVM_EXIT_NPF   0x400
 
 #define SVM_EXIT_ERR   -1
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index c00ea90..9cd0c14 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -904,6 +904,7 @@ static void init_vmcb(struct vcpu_svm *svm)
set_intercept(svm, INTERCEPT_WBINVD);
set_intercept(svm, INTERCEPT_MONITOR);
set_intercept(svm, INTERCEPT_MWAIT);
+   set_intercept(svm, INTERCEPT_XSETBV);
 
	control->iopm_base_pa = iopm_base;
	control->msrpm_base_pa = __pa(svm->msrpm);
@@ -2493,6 +2494,19 @@ static int skinit_interception(struct vcpu_svm *svm)
return 1;
 }
 
+static int xsetbv_interception(struct vcpu_svm *svm)
+{
+   u64 new_bv = kvm_read_edx_eax(&svm->vcpu);
+   u32 index = kvm_register_read(&svm->vcpu, VCPU_REGS_RCX);
+
+   if (kvm_set_xcr(&svm->vcpu, index, new_bv) == 0) {
+   svm->next_rip = kvm_rip_read(&svm->vcpu) + 3;
+   skip_emulated_instruction(&svm->vcpu);
+   }
+
+   return 1;
+}
+
 static int invalid_op_interception(struct vcpu_svm *svm)
 {
	kvm_queue_exception(&svm->vcpu, UD_VECTOR);
@@ -2916,6 +2930,7 @@ static int (*svm_exit_handlers[])(struct vcpu_svm *svm) = 
{
[SVM_EXIT_WBINVD]   = emulate_on_interception,
[SVM_EXIT_MONITOR]  = invalid_op_interception,
[SVM_EXIT_MWAIT]= invalid_op_interception,
+   [SVM_EXIT_XSETBV]   = xsetbv_interception,
[SVM_EXIT_NPF]  = pf_interception,
 };
 
@@ -3628,6 +3643,7 @@ static const struct trace_print_flags 
svm_exit_reasons_str[] = {
	{ SVM_EXIT_WBINVD,  "wbinvd" },
	{ SVM_EXIT_MONITOR, "monitor" },
	{ SVM_EXIT_MWAIT,   "mwait" },
+   { SVM_EXIT_XSETBV,  "xsetbv" },
	{ SVM_EXIT_NPF, "npf" },
{ -1, NULL }
 };
-- 
1.7.1




Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-03 Thread Rik van Riel

On 12/03/2010 11:20 AM, Srivatsa Vaddagiri wrote:

On Fri, Dec 03, 2010 at 10:35:25AM -0500, Rik van Riel wrote:

Do you have suggestions on what I should do to make
this yield_to functionality work?


Keeping in mind the complications of yield_to, I had suggested we do something
along the lines suggested below:

http://marc.info/?l=kvm&m=129122645006996&w=2

Essentially yield to other tasks on your own runqueue and when you get to run
again, try reclaiming what you gave up earlier (with a cap on how much one can
feed back this relinquished time). It can be accomplished via a special form of
yield(), available only to in-kernel users, kvm in this case.


I don't see how that is going to help get the lock
released, when the VCPU holding the lock is on another
CPU.



Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)

2010-12-03 Thread Chris Wright
* Srivatsa Vaddagiri (va...@linux.vnet.ibm.com) wrote:
 On Thu, Dec 02, 2010 at 11:14:16AM -0800, Chris Wright wrote:
  Perhaps it should be a VM level option.  And then invert the notion.
  Create one idle domain w/out hlt trap.  Give that VM a vcpu per pcpu
  (pin in place probably).  And have that VM do nothing other than hlt.
  Then it's always runnable according to scheduler, and can consume the
  extra work that CFS wants to give away.
 
 That's not sufficient. Let's say we have 3 guests A, B, C that need to be
 rate limited to 25% on a single cpu system. We create this idle guest
 D that is a 100% cpu hog as per the above definition. Now when one of the
 guests is idle, what ensures that the idle cycles of A are given only
 to D and not partly to B/C?

Yeah, I pictured priorities handling this.


Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)

2010-12-03 Thread Chris Wright
* Srivatsa Vaddagiri (va...@linux.vnet.ibm.com) wrote:
 On Fri, Dec 03, 2010 at 05:27:52PM +0530, Srivatsa Vaddagiri wrote:
  On Thu, Dec 02, 2010 at 11:14:16AM -0800, Chris Wright wrote:
   Perhaps it should be a VM level option.  And then invert the notion.
   Create one idle domain w/out hlt trap.  Give that VM a vcpu per pcpu
   (pin in place probably).  And have that VM do nothing other than hlt.
   Then it's always runnable according to scheduler, and can consume the
   extra work that CFS wants to give away.
  
  That's not sufficient. Let's say we have 3 guests A, B, C that need to be rate
  limited to 25% on a single cpu system. We create this idle guest D that is a
  100% cpu hog as per the above definition. Now when one of the guests is idle,
  what ensures that the idle cycles of A are given only to D and not partly to B/C?
 
 To tackle this problem, I was thinking of having a fill-thread associated with
 each vcpu (i.e. both belong to the same cgroup). The fill-thread consumes idle
 cycles left by the vcpu, but otherwise doesn't compete with it for cycles.

That's what Marcelo's suggestion does w/out a fill thread.


Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-03 Thread Srivatsa Vaddagiri
On Fri, Dec 03, 2010 at 12:09:01PM -0500, Rik van Riel wrote:
 I don't see how that is going to help get the lock
 released, when the VCPU holding the lock is on another
 CPU.

Even the directed yield() is not guaranteed to get the lock released, given it's
shooting in the dark?

Anyway, the intention of yield() proposed was not to get lock released
immediately (which will happen eventually), but rather to avoid inefficiency
associated with (long) spinning and at the same time make sure we are not
leaking our bandwidth to other guests because of a naive yield ..

- vatsa




Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-03 Thread Rik van Riel

On 12/03/2010 12:29 PM, Srivatsa Vaddagiri wrote:

On Fri, Dec 03, 2010 at 12:09:01PM -0500, Rik van Riel wrote:

I don't see how that is going to help get the lock
released, when the VCPU holding the lock is on another
CPU.


Even the directed yield() is not guaranteed to get the lock released, given it's
shooting in the dark?


True, that's a fair point.


Anyway, the intention of yield() proposed was not to get lock released
immediately (which will happen eventually), but rather to avoid inefficiency
associated with (long) spinning and at the same time make sure we are not
leaking our bandwidth to other guests because of a naive yield ..


A KVM guest can run on the host alongside short-lived
processes, though.  How can we ensure that a VCPU that
donates time gets it back again later, when the task
the time was donated to may no longer exist?



Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)

2010-12-03 Thread Srivatsa Vaddagiri
On Fri, Dec 03, 2010 at 09:29:06AM -0800, Chris Wright wrote:
 That's what Marcelo's suggestion does w/out a fill thread.

Are we willing to add that to KVM sources?

I was working under the constraints of not modifying the kernel (especially 
avoiding short-term hacks that become unnecessary in the longer run, in this
case when kernel-based hard limits go in).

- vatsa


Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)

2010-12-03 Thread Srivatsa Vaddagiri
On Fri, Dec 03, 2010 at 09:28:25AM -0800, Chris Wright wrote:
 * Srivatsa Vaddagiri (va...@linux.vnet.ibm.com) wrote:
  On Thu, Dec 02, 2010 at 11:14:16AM -0800, Chris Wright wrote:
   Perhaps it should be a VM level option.  And then invert the notion.
   Create one idle domain w/out hlt trap.  Give that VM a vcpu per pcpu
   (pin in place probably).  And have that VM do nothing other than hlt.
   Then it's always runnable according to scheduler, and can consume the
   extra work that CFS wants to give away.
  
   That's not sufficient. Let's say we have 3 guests A, B, C that need to be
   rate limited to 25% on a single cpu system. We create this idle guest
   D that is a 100% cpu hog as per the above definition. Now when one of the
   guests is idle, what ensures that the idle cycles of A are given only
   to D and not partly to B/C?
 
 Yeah, I pictured priorities handling this.

All guests are of equal priority in this case (that's how we are able to divide
time into 25% chunks), so unless we dynamically boost D's priority based on how
idle other VMs are, it's not going to be easy!

- vatsa


Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)

2010-12-03 Thread Chris Wright
* Srivatsa Vaddagiri (va...@linux.vnet.ibm.com) wrote:
 On Fri, Dec 03, 2010 at 09:28:25AM -0800, Chris Wright wrote:
  * Srivatsa Vaddagiri (va...@linux.vnet.ibm.com) wrote:
   On Thu, Dec 02, 2010 at 11:14:16AM -0800, Chris Wright wrote:
Perhaps it should be a VM level option.  And then invert the notion.
Create one idle domain w/out hlt trap.  Give that VM a vcpu per pcpu
(pin in place probably).  And have that VM do nothing other than hlt.
Then it's always runnable according to scheduler, and can consume the
extra work that CFS wants to give away.
   
    That's not sufficient. Let's say we have 3 guests A, B, C that need to be
    rate limited to 25% on a single cpu system. We create this idle guest
    D that is a 100% cpu hog as per the above definition. Now when one of the
    guests is idle, what ensures that the idle cycles of A are given only
    to D and not partly to B/C?
  
  Yeah, I pictured priorities handling this.
 
 All guests are of equal priority in this case (that's how we are able to divide
 time into 25% chunks), so unless we dynamically boost D's priority based on how
 idle other VMs are, it's not going to be easy!

Right, I think there has to be an external mgmt entity.  Because num
vcpus is not static.  So priorities have to be rebalanced at vcpu
create/destroy time.

thanks,
-chris


Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)

2010-12-03 Thread Srivatsa Vaddagiri
On Fri, Dec 03, 2010 at 09:38:05AM -0800, Chris Wright wrote:
  All guests are of equal priority in this case (that's how we are able to
  divide time into 25% chunks), so unless we dynamically boost D's priority
  based on how idle other VMs are, it's not going to be easy!
 
 Right, I think there has to be an external mgmt entity.  Because num
 vcpus is not static.  So priorities have to be rebalanced at vcpu
 create/destroy time.

and at idle/non-idle time as well, which makes the mgmt entity's job rather
harder? Anyway, if we are willing to take a patch to burn cycles upon halt (as
per Marcelo's patch), that'd be the best (short-term) solution ... otherwise,
something like a filler-thread per-vcpu is easier than dynamically changing
priorities ...

- vatsa


Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-03 Thread Srivatsa Vaddagiri
On Fri, Dec 03, 2010 at 12:33:29PM -0500, Rik van Riel wrote:
 Anyway, the intention of yield() proposed was not to get lock released
 immediately (which will happen eventually), but rather to avoid inefficiency
 associated with (long) spinning and at the same time make sure we are not
 leaking our bandwidth to other guests because of a naive yield ..
 
 A KVM guest can run on the host alongside short-lived
 processes, though.  How can we ensure that a VCPU that
 donates time gets it back again later, when the task
 the time was donated to may no longer exist?

I think that does not matter. What matters for fairness in this case is how much
cpu time the yielding thread gets over some (larger) time window. By ensuring that
relinquished time is fed back, we should maintain fairness for that particular
vcpu thread ... This also avoids nasty interactions associated with donation ...

- vatsa


Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)

2010-12-03 Thread Anthony Liguori

On 12/03/2010 11:38 AM, Chris Wright wrote:

* Srivatsa Vaddagiri (va...@linux.vnet.ibm.com) wrote:
   

On Fri, Dec 03, 2010 at 09:28:25AM -0800, Chris Wright wrote:
 

* Srivatsa Vaddagiri (va...@linux.vnet.ibm.com) wrote:
   

On Thu, Dec 02, 2010 at 11:14:16AM -0800, Chris Wright wrote:
 

Perhaps it should be a VM level option.  And then invert the notion.
Create one idle domain w/out hlt trap.  Give that VM a vcpu per pcpu
(pin in place probably).  And have that VM do nothing other than hlt.
Then it's always runnable according to scheduler, and can consume the
extra work that CFS wants to give away.
   

That's not sufficient. Let's say we have 3 guests A, B, C that need to be
rate limited to 25% on a single cpu system. We create this idle guest
D that is a 100% cpu hog as per the above definition. Now when one of the
guests is idle, what ensures that the idle cycles of A are given only
to D and not partly to B/C?
 

Yeah, I pictured priorities handling this.
   

All guests are of equal priority in this case (that's how we are able to divide
time into 25% chunks), so unless we dynamically boost D's priority based on how
idle other VMs are, it's not going to be easy!
 

Right, I think there has to be an external mgmt entity.  Because num
vcpus is not static.  So priorities have to be rebalanced at vcpu
create/destroy time.
   


We've actually done a fair amount of testing with using priorities like 
this.  The granularity is extremely poor because priorities don't map 
linearly to cpu time allotment.  The interaction with background tasks 
also gets extremely complicated.


Regards,

Anthony Liguori


thanks,
-chris




Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)

2010-12-03 Thread Srivatsa Vaddagiri
On Fri, Dec 03, 2010 at 09:29:06AM -0800, Chris Wright wrote:
 That's what Marcelo's suggestion does w/out a fill thread.

There's one complication though even with that. How do we compute the
real utilization of the VM (given that it will appear to be burning 100% cycles)?
We need to have the scheduler discount the cycles burnt post halt-exit, so more
stuff is needed than those simple 3-4 lines!

- vatsa


Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)

2010-12-03 Thread Chris Wright
* Srivatsa Vaddagiri (va...@linux.vnet.ibm.com) wrote:
 On Fri, Dec 03, 2010 at 09:29:06AM -0800, Chris Wright wrote:
  That's what Marcelo's suggestion does w/out a fill thread.
 
 There's one complication though even with that. How do we compute the
  real utilization of the VM (given that it will appear to be burning 100% cycles)?
  We need to have the scheduler discount the cycles burnt post halt-exit, so more
 stuff is needed than those simple 3-4 lines!

Heh, was just about to say the same thing ;)


Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)

2010-12-03 Thread Anthony Liguori

On 12/03/2010 11:58 AM, Chris Wright wrote:

* Srivatsa Vaddagiri (va...@linux.vnet.ibm.com) wrote:
   

On Fri, Dec 03, 2010 at 09:29:06AM -0800, Chris Wright wrote:
 

That's what Marcelo's suggestion does w/out a fill thread.
   

There's one complication though even with that. How do we compute the
real utilization of the VM (given that it will appear to be burning 100% cycles)?
We need to have the scheduler discount the cycles burnt post halt-exit, so more
stuff is needed than those simple 3-4 lines!
 

Heh, was just about to say the same thing ;)
   


My first reaction is that it's not terribly important to account the 
non-idle time in the guest because of the use-case for this model.


Eventually, it might be nice to have idle time accounting but I don't 
see it as a critical feature here.


Non-idle time simply isn't as meaningful here as it normally would be.  
If you have 10 VMs in a normal environment and saw that you had only 50% 
CPU utilization, you might be inclined to add more VMs.  But if you're 
offering deterministic execution, it doesn't matter if you only have 
50% utilization.  If you add another VM, the guests will get exactly 
the same impact as if they were using 100% utilization.


Regards,

Anthony Liguori






Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)

2010-12-03 Thread Srivatsa Vaddagiri
On Fri, Dec 03, 2010 at 12:07:15PM -0600, Anthony Liguori wrote:
 My first reaction is that it's not terribly important to account the
 non-idle time in the guest because of the use-case for this model.

Agreed ... but I was considering the larger user-base, who may be surprised to see
their VMs being reported as 100% hogs when they had left them idle.

- vatsa


Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)

2010-12-03 Thread Marcelo Tosatti
On Fri, Dec 03, 2010 at 09:58:54AM -0800, Chris Wright wrote:
 * Srivatsa Vaddagiri (va...@linux.vnet.ibm.com) wrote:
  On Fri, Dec 03, 2010 at 09:29:06AM -0800, Chris Wright wrote:
   That's what Marcelo's suggestion does w/out a fill thread.
  
  There's one complication though even with that. How do we compute the
   real utilization of the VM (given that it will appear to be burning 100%
   cycles)?
   We need to have the scheduler discount the cycles burnt post halt-exit, so more
  stuff is needed than those simple 3-4 lines!
 
 Heh, was just about to say the same thing ;)

Probably yes. The point is, you get the same effect as with the
non-trapping hlt but without the complications on low-level VMX/SVM
code.

Even better if you can do it with fill thread idea.



Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)

2010-12-03 Thread Chris Wright
* Anthony Liguori (anth...@codemonkey.ws) wrote:
 On 12/03/2010 11:58 AM, Chris Wright wrote:
 * Srivatsa Vaddagiri (va...@linux.vnet.ibm.com) wrote:
 On Fri, Dec 03, 2010 at 09:29:06AM -0800, Chris Wright wrote:
 That's what Marcelo's suggestion does w/out a fill thread.
 There's one complication though even with that. How do we compute the
 real utilization of the VM (given that it will appear to be burning 100%
 cycles)?
 We need to have the scheduler discount the cycles burnt post halt-exit, so more
 stuff is needed than those simple 3-4 lines!
 Heh, was just about to say the same thing ;)
 
 My first reaction is that it's not terribly important to account the
 non-idle time in the guest because of the use-case for this model.

Depends on the chargeback model.  This would put guest vcpu runtime vs
host running guest vcpu time really out of skew.  ('course w/out steal
and that time it's already out of skew).  But I think most models are
more uptime based rather than actual runtime now.

 Eventually, it might be nice to have idle time accounting but I
 don't see it as a critical feature here.
 
 Non-idle time simply isn't as meaningful here as it normally would
 be.  If you have 10 VMs in a normal environment and saw that you had
 only 50% CPU utilization, you might be inclined to add more VMs.

Who is you?  cloud user, or cloud service provider's scheduler?
On the user side, 50% cpu utilization wouldn't trigger me to add new
VMs.  On the host side, 50% cpu utilization would have to be measure
solely in terms of guest vcpu count.

 But if you're offering deterministic execution, it doesn't matter if
 you only have 50% utilization.  If you add another VM, the guests
 will get exactly the same impact as if they were using 100%
 utilization.

Sorry, didn't follow here?

thanks,
-chris


Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)

2010-12-03 Thread Marcelo Tosatti
On Fri, Dec 03, 2010 at 04:10:43PM -0200, Marcelo Tosatti wrote:
 On Fri, Dec 03, 2010 at 09:58:54AM -0800, Chris Wright wrote:
  * Srivatsa Vaddagiri (va...@linux.vnet.ibm.com) wrote:
   On Fri, Dec 03, 2010 at 09:29:06AM -0800, Chris Wright wrote:
That's what Marcelo's suggestion does w/out a fill thread.
   
   There's one complication though even with that. How do we compute the
   real utilization of the VM (given that it will appear to be burning 100%
   cycles)?
   We need to have the scheduler discount the cycles burnt post halt-exit, so
   more stuff is needed than those simple 3-4 lines!
  
  Heh, was just about to say the same thing ;)
 
 Probably yes. The point is, you get the same effect as with the
 non-trapping hlt but without the complications on low-level VMX/SVM
 code.
 
 Even better if you can do it with fill thread idea.

Well, no. Better to consume hlt time but yield if need_resched or in 
case of any event which breaks out of kvm_vcpu_block.
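
A hedged sketch of that, shaped like the kvm_vcpu_block() loop of this era:
burn the halt time instead of sleeping, but bail out for reschedules, pending
signals, or anything that would normally wake the vcpu:

static void kvm_vcpu_consume_hlt(struct kvm_vcpu *vcpu)
{
	while (!kvm_arch_vcpu_runnable(vcpu) && !signal_pending(current)) {
		if (need_resched())
			schedule();	/* give the cpu up when asked */
		cpu_relax();
	}
}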




Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-03 Thread Rik van Riel

On 12/02/2010 07:50 PM, Chris Wright wrote:


+void requeue_task(struct rq *rq, struct task_struct *p)
+{
+   assert_spin_locked(&rq->lock);
+
+   if (!p->se.on_rq || task_running(rq, p) || task_has_rt_policy(p))
+   return;


already checked task_running(rq, p) || task_has_rt_policy(p) w/ rq lock
held.


OK, I removed the duplicate checks.


+
+   dequeue_task(rq, p, 0);
+   enqueue_task(rq, p, 0);


seems like you could condense to save an update_rq_clock() call at least,
don't know if the info_queued, info_dequeued need to be updated


Or I can do the whole operation with the task not queued.
Not sure yet what approach I'll take...


+#ifdef CONFIG_SCHED_HRTICK
+/*
+ * Yield the CPU, giving the remainder of our time slice to task p.
+ * Typically used to hand CPU time to another thread inside the same
+ * process, eg. when p holds a resource other threads are waiting for.
+ * Giving priority to p may help get that resource released sooner.
+ */
+void yield_to(struct task_struct *p)
+{
+   unsigned long flags;
+   struct sched_entity *se = &p->se;
+   struct rq *rq;
+   struct cfs_rq *cfs_rq;
+   u64 remain = slice_remain(current);
+
+   rq = task_rq_lock(p, &flags);
+   if (task_running(rq, p) || task_has_rt_policy(p))
+   goto out;
+   cfs_rq = cfs_rq_of(se);
+   se->vruntime -= remain;
+   if (se->vruntime < cfs_rq->min_vruntime)
+   se->vruntime = cfs_rq->min_vruntime;


Should these details all be in sched_fair?  Seems like the wrong layer
here.  And would that condition go the other way?  If new vruntime is
smaller than min, then it becomes new cfs_rq->min_vruntime?


That would be nice.  Unfortunately, EXPORT_SYMBOL() does
not seem to work right from sched_fair.c, which is included
from sched.c instead of being built from the makefile!


diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 5119b08..2a0a595 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -974,6 +974,25 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity 
*curr, int queued)
   */

  #ifdef CONFIG_SCHED_HRTICK
+u64 slice_remain(struct task_struct *p)
+{
+   unsigned long flags;
+   struct sched_entity *se = &p->se;
+   struct cfs_rq *cfs_rq;
+   struct rq *rq;
+   u64 slice, ran;
+   s64 delta;
+
+   rq = task_rq_lock(p, &flags);
+   cfs_rq = cfs_rq_of(se);
+   slice = sched_slice(cfs_rq, se);
+   ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
+   delta = slice - ran;
+   task_rq_unlock(rq, &flags);
+
+   return max(delta, 0LL);


Can delta go negative?


Good question.  I don't know.



Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)

2010-12-03 Thread Anthony Liguori

On 12/03/2010 12:20 PM, Chris Wright wrote:

* Anthony Liguori (anth...@codemonkey.ws) wrote:
   

On 12/03/2010 11:58 AM, Chris Wright wrote:
 

* Srivatsa Vaddagiri (va...@linux.vnet.ibm.com) wrote:
   

On Fri, Dec 03, 2010 at 09:29:06AM -0800, Chris Wright wrote:
 

That's what Marcelo's suggestion does w/out a fill thread.
   

There's one complication though even with that. How do we compute the
real utilization of the VM (given that it will appear to be burning 100% cycles)?
We need to have the scheduler discount the cycles burnt post halt-exit, so more
stuff is needed than those simple 3-4 lines!
 

Heh, was just about to say the same thing ;)
   

My first reaction is that it's not terribly important to account the
non-idle time in the guest because of the use-case for this model.
 

Depends on the chargeback model.  This would put guest vcpu runtime vs
host running guest vcpu time really out of skew.  ('course w/out steal
and that time it's already out of skew).  But I think most models are
more uptime based rather than actual runtime now.
   


Right.  I'm not familiar with any models that are actually based on 
CPU-consumption accounting.  In general, the feedback I've 
received is that predictable accounting is pretty critical so I don't 
anticipate something as volatile as CPU-consumption ever being something 
that's explicitly charged for in a granular fashion.



Eventually, it might be nice to have idle time accounting but I
don't see it as a critical feature here.

Non-idle time simply isn't as meaningful here as it normally would
be.  If you have 10 VMs in a normal environment and saw that you had
only 50% CPU utilization, you might be inclined to add more VMs.
 

Who is you?  cloud user, or cloud service provider's scheduler?
On the user side, 50% cpu utilization wouldn't trigger me to add new
VMs.  On the host side, 50% cpu utilization would have to be measure
solely in terms of guest vcpu count.

   

But if you're offering deterministic execution, it doesn't matter if
you only have 50% utilization.  If you add another VM, the guests
will get exactly the same impact as if they were using 100%
utilization.
 

Sorry, didn't follow here?
   


The question is, why would something care about host CPU utilization?  
The answer I can think of is, something wants to measure host CPU 
utilization to identify an underutilized node.  Once the underutilized 
node is identified, more work can be given to it.


Adding more work to an underutilized node doesn't change the amount of 
work that can be done.  More concretely, one PCPU, four independent 
VCPUs.  They are consuming 25%, 25%, 25%, and 12% respectively.  My 
management software says, ah hah, I can stick a fifth VCPU on this box 
that's only using 5%.  The other VCPUs are unaffected.


However, in a no-yield-on-hlt model, if I have four VCPUs, they each get 
25%, 25%, 25%, 25% on the host.  Three of the VCPUs are running 100% in 
the guest and one is running 50%.


If I add a fifth VCPU, even if it's only using 5%, each VCPU drops to 
20%.  That means the three VCPUS that are consuming 100% now see a 25% 
drop in their performance even though you've added an idle guest.


Basically, the traditional view of density simply doesn't apply in this 
model.
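
To make the arithmetic concrete, a toy C sketch of the fair-share
split described above (nothing KVM-specific, illustrative only):

	#include <stdio.h>

	/* With yield_on_hlt=0 every VCPU looks 100% runnable to CFS,
	 * so n VCPUs pinned to one PCPU each get 1/n of it, idle or not. */
	int main(void)
	{
		int n;

		for (n = 4; n <= 5; n++)
			printf("%d VCPUs -> %.1f%% each\n", n, 100.0 / n);
		return 0;	/* prints 25.0% each, then 20.0% each */
	}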


Regards,

Anthony Liguori


thanks,
-chris
   




Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-03 Thread Chris Wright
* Rik van Riel (r...@redhat.com) wrote:
 On 12/02/2010 07:50 PM, Chris Wright wrote:
 +/*
 + * Yield the CPU, giving the remainder of our time slice to task p.
 + * Typically used to hand CPU time to another thread inside the same
 + * process, eg. when p holds a resource other threads are waiting for.
 + * Giving priority to p may help get that resource released sooner.
 + */
 +void yield_to(struct task_struct *p)
 +{
 +	unsigned long flags;
 +	struct sched_entity *se = &p->se;
 +	struct rq *rq;
 +	struct cfs_rq *cfs_rq;
 +	u64 remain = slice_remain(current);
 +
 +	rq = task_rq_lock(p, &flags);
 +	if (task_running(rq, p) || task_has_rt_policy(p))
 +		goto out;
 +	cfs_rq = cfs_rq_of(se);
 +	se->vruntime -= remain;
 +	if (se->vruntime < cfs_rq->min_vruntime)
 +		se->vruntime = cfs_rq->min_vruntime;
 
 Should these details all be in sched_fair?  Seems like the wrong layer
 here.  And would that condition go the other way?  If the new vruntime
 is smaller than min, then it becomes the new cfs_rq->min_vruntime?
 
 That would be nice.  Unfortunately, EXPORT_SYMBOL() does
 not seem to work right from sched_fair.c, which is included
 from sched.c instead of being built from the makefile!

add a ->yield_to() to properly isolate (only relevant then in
sched_fair)?
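
A minimal sketch of that suggestion, assuming a new sched_class hook;
the method name and guard conditions are illustrative, not a settled
interface:

	/* sched.c: generic, policy-agnostic entry point. */
	void yield_to(struct task_struct *p)
	{
		unsigned long flags;
		struct rq *rq = task_rq_lock(p, &flags);

		/* Only classes that implement the hook participate. */
		if (!task_running(rq, p) && p->sched_class->yield_to)
			p->sched_class->yield_to(rq, p);

		task_rq_unlock(rq, &flags);
	}
	EXPORT_SYMBOL_GPL(yield_to);

	/* sched_fair.c: the vruntime manipulation stays private to CFS. */
	static void yield_to_fair(struct rq *rq, struct task_struct *p)
	{
		/* donate slice_remain(current) to p's vruntime here */
	}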


[PATCH 0/5] Extra capabilities for device assignment

2010-12-03 Thread Alex Williamson
Now that we've got PCI capabilities cleaned up and device assignment
using them, we can add more capabilities to be guest visible.  This
adds minimal PCI Express, PCI-X, and Power Management, along with
direct passthrough Vital Product Data and Vendor Specific capabilities.
With this, devices like tg3, bnx2, vxge, and potentially quite a few
others that didn't work previously should be happier.  Thanks,

Alex

---

Alex Williamson (5):
  device-assignment: pass through and stub more PCI caps
  device-assignment: Error checking when adding capabilities
  pci: Error on PCI capability collisions
  pci: MSI-X capability is 12 bytes, not 16, MSI is 10 bytes
  device-assignment: Fix off-by-one in header check


 hw/device-assignment.c |  279 ++--
 hw/pci.c   |   14 ++
 hw/pci.h   |4 -
 3 files changed, 282 insertions(+), 15 deletions(-)


[PATCH 1/5] device-assignment: Fix off-by-one in header check

2010-12-03 Thread Alex Williamson
Include the first byte at 40h or else access might go to the
hardware instead of the emulated config space, resulting in
capability loops, since the ordering is different.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---

 hw/device-assignment.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 832c236..6d6e657 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -410,7 +410,7 @@ static void assigned_dev_pci_write_config(PCIDevice *d, uint32_t address,
           ((d->devfn >> 3) & 0x1F), (d->devfn & 0x7),
           (uint16_t) address, val, len);
 
-    if (address > PCI_CONFIG_HEADER_SIZE && d->config_map[address]) {
+    if (address >= PCI_CONFIG_HEADER_SIZE && d->config_map[address]) {
         return assigned_device_pci_cap_write_config(d, address, val, len);
     }
 
@@ -456,7 +456,7 @@ static uint32_t assigned_dev_pci_read_config(PCIDevice *d, uint32_t address,
     if (address < 0x4 || (pci_dev->need_emulate_cmd && address == 0x4) ||
        (address >= 0x10 && address <= 0x24) || address == 0x30 ||
         address == 0x34 || address == 0x3c || address == 0x3d ||
-        (address > PCI_CONFIG_HEADER_SIZE && d->config_map[address])) {
+        (address >= PCI_CONFIG_HEADER_SIZE && d->config_map[address])) {
         val = pci_default_read_config(d, address, len);
         DEBUG("(%x.%x): address=%04x val=0x%08x len=%d\n",
               (d->devfn >> 3) & 0x1F, (d->devfn & 0x7), address, val, len);



[PATCH 2/5] pci: MSI-X capability is 12 bytes, not 16, MSI is 10 bytes

2010-12-03 Thread Alex Williamson
Signed-off-by: Alex Williamson alex.william...@redhat.com
---

 hw/pci.h |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/pci.h b/hw/pci.h
index 34955d8..7c52637 100644
--- a/hw/pci.h
+++ b/hw/pci.h
@@ -124,8 +124,8 @@ enum {
 
 #define PCI_CAPABILITY_CONFIG_MAX_LENGTH 0x60
 #define PCI_CAPABILITY_CONFIG_DEFAULT_START_ADDR 0x40
-#define PCI_CAPABILITY_CONFIG_MSI_LENGTH 0x10
-#define PCI_CAPABILITY_CONFIG_MSIX_LENGTH 0x10
+#define PCI_CAPABILITY_CONFIG_MSI_LENGTH 0xa
+#define PCI_CAPABILITY_CONFIG_MSIX_LENGTH 0x0c
 
 typedef int (*msix_mask_notifier_func)(PCIDevice *, unsigned vector,
   int masked);
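
For reference, the byte counts behind the new values break down like
this (per the PCI spec layouts; a comment sketch, not part of the
patch):

	/* MSI, 32-bit address, no per-vector masking:
	 *   cap ID (1) + next (1) + flags (2) + address (4) + data (2)
	 *                                             = 10 bytes = 0x0a
	 * MSI-X:
	 *   cap ID (1) + next (1) + control (2) +
	 *   table offset/BIR (4) + PBA offset/BIR (4) = 12 bytes = 0x0c
	 */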



[PATCH 3/5] pci: Error on PCI capability collisions

2010-12-03 Thread Alex Williamson
Nothing good can happen when we overlap capabilities.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---

 hw/pci.c |   14 ++
 1 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/hw/pci.c b/hw/pci.c
index b08113d..288d6fd 100644
--- a/hw/pci.c
+++ b/hw/pci.c
@@ -1845,6 +1845,20 @@ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
         if (!offset) {
             return -ENOSPC;
         }
+    } else {
+        int i;
+
+        for (i = offset; i < offset + size; i++) {
+            if (pdev->config_map[i]) {
+                fprintf(stderr, "ERROR: %04x:%02x:%02x.%x "
+                        "Attempt to add PCI capability %x at offset "
+                        "%x overlaps existing capability %x at offset %x\n",
+                        pci_find_domain(pdev->bus), pci_bus_num(pdev->bus),
+                        PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn),
+                        cap_id, offset, pdev->config_map[i], i);
+                return -EFAULT;
+            }
+        }
     }
 
     config = pdev->config + offset;



[PATCH 4/5] device-assignment: Error checking when adding capabilities

2010-12-03 Thread Alex Williamson
Signed-off-by: Alex Williamson alex.william...@redhat.com
---

 hw/device-assignment.c |   14 +-
 1 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 6d6e657..838bf89 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -1288,7 +1288,7 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev)
 {
     AssignedDevice *dev = container_of(pci_dev, AssignedDevice, dev);
     PCIRegion *pci_region = dev->real_device.regions;
-    int pos;
+    int ret, pos;
 
     /* Clear initial capabilities pointer and status copied from hw */
     pci_set_byte(pci_dev->config + PCI_CAPABILITY_LIST, 0);
@@ -1302,8 +1302,10 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev)
      * MSI capability is the 1st capability in capability config */
     if ((pos = pci_find_cap_offset(pci_dev, PCI_CAP_ID_MSI))) {
         dev->cap.available |= ASSIGNED_DEVICE_CAP_MSI;
-        pci_add_capability(pci_dev, PCI_CAP_ID_MSI, pos,
-                           PCI_CAPABILITY_CONFIG_MSI_LENGTH);
+        if ((ret = pci_add_capability(pci_dev, PCI_CAP_ID_MSI, pos,
+                                      PCI_CAPABILITY_CONFIG_MSI_LENGTH)) < 0) {
+            return ret;
+        }
 
         /* Only 32-bit/no-mask currently supported */
         pci_set_word(pci_dev->config + pos + PCI_MSI_FLAGS,
@@ -1326,8 +1328,10 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev)
         uint32_t msix_table_entry;
 
         dev->cap.available |= ASSIGNED_DEVICE_CAP_MSIX;
-        pci_add_capability(pci_dev, PCI_CAP_ID_MSIX, pos,
-                           PCI_CAPABILITY_CONFIG_MSIX_LENGTH);
+        if ((ret = pci_add_capability(pci_dev, PCI_CAP_ID_MSIX, pos,
+                                      PCI_CAPABILITY_CONFIG_MSIX_LENGTH)) < 0) {
+            return ret;
+        }
 
         pci_set_word(pci_dev->config + pos + PCI_MSIX_FLAGS,
                      pci_get_word(pci_dev->config + pos + PCI_MSIX_FLAGS) &



[PATCH 5/5] device-assignment: pass through and stub more PCI caps

2010-12-03 Thread Alex Williamson
Some drivers depend on finding capabilities like power management,
PCI express/X, vital product data, or vendor specific fields.  Now
that we have better capability support, we can pass more of these
tables through to the guest.  Note that VPD and VNDR are direct
pass-through capabilities; the rest are mostly empty shells with a few
writable bits where necessary.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---

 hw/device-assignment.c |  263 +++-
 1 files changed, 256 insertions(+), 7 deletions(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 838bf89..2415e43 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -67,6 +67,9 @@ static void assigned_device_pci_cap_write_config(PCIDevice *pci_dev,
                                                  uint32_t address,
                                                  uint32_t val, int len);
 
+static uint32_t assigned_device_pci_cap_read_config(PCIDevice *pci_dev,
+                                                    uint32_t address, int len);
+
 static uint32_t assigned_dev_ioport_rw(AssignedDevRegion *dev_region,
                                        uint32_t addr, int len, uint32_t *val)
 {
@@ -370,11 +373,32 @@ static uint8_t assigned_dev_pci_read_byte(PCIDevice *d, int pos)
     return (uint8_t)assigned_dev_pci_read(d, pos, 1);
 }
 
-static uint8_t pci_find_cap_offset(PCIDevice *d, uint8_t cap)
+static void assigned_dev_pci_write(PCIDevice *d, int pos, uint32_t val, int len)
+{
+    AssignedDevice *pci_dev = container_of(d, AssignedDevice, dev);
+    ssize_t ret;
+    int fd = pci_dev->real_device.config_fd;
+
+again:
+    ret = pwrite(fd, &val, len, pos);
+    if (ret != len) {
+	if ((ret < 0) && (errno == EINTR || errno == EAGAIN))
+	    goto again;
+
+	fprintf(stderr, "%s: pwrite failed, ret = %zd errno = %d\n",
+		__func__, ret, errno);
+
+	exit(1);
+    }
+
+    return;
+}
+
+static uint8_t pci_find_cap_offset(PCIDevice *d, uint8_t cap, uint8_t start)
 {
     int id;
     int max_cap = 48;
-    int pos = PCI_CAPABILITY_LIST;
+    int pos = start ? start : PCI_CAPABILITY_LIST;
     int status;
 
     status = assigned_dev_pci_read_byte(d, PCI_STATUS);
@@ -453,10 +477,16 @@ static uint32_t assigned_dev_pci_read_config(PCIDevice *d, uint32_t address,
     ssize_t ret;
     AssignedDevice *pci_dev = container_of(d, AssignedDevice, dev);
 
+    if (address >= PCI_CONFIG_HEADER_SIZE && d->config_map[address]) {
+        val = assigned_device_pci_cap_read_config(d, address, len);
+        DEBUG("(%x.%x): address=%04x val=0x%08x len=%d\n",
+              (d->devfn >> 3) & 0x1F, (d->devfn & 0x7), address, val, len);
+        return val;
+    }
+
     if (address < 0x4 || (pci_dev->need_emulate_cmd && address == 0x4) ||
        (address >= 0x10 && address <= 0x24) || address == 0x30 ||
-        address == 0x34 || address == 0x3c || address == 0x3d ||
-        (address >= PCI_CONFIG_HEADER_SIZE && d->config_map[address])) {
+        address == 0x34 || address == 0x3c || address == 0x3d) {
         val = pci_default_read_config(d, address, len);
         DEBUG("(%x.%x): address=%04x val=0x%08x len=%d\n",
               (d->devfn >> 3) & 0x1F, (d->devfn & 0x7), address, val, len);
@@ -1251,7 +1281,70 @@ static void assigned_dev_update_msix(PCIDevice *pci_dev, unsigned int ctrl_pos)
 #endif
 #endif
 
-static void assigned_device_pci_cap_write_config(PCIDevice *pci_dev, uint32_t address,
+/* There can be multiple VNDR capabilities per device, we need to find the
+ * one that starts closest to the given address without going over. */
+static uint8_t find_vndr_start(PCIDevice *pci_dev, uint32_t address)
+{
+    uint8_t cap, pos;
+
+    for (cap = pos = 0;
+         (pos = pci_find_cap_offset(pci_dev, PCI_CAP_ID_VNDR, pos));
+         pos += PCI_CAP_LIST_NEXT) {
+        if (pos <= address) {
+            cap = MAX(pos, cap);
+        }
+    }
+    return cap;
+}
+
+/* Merge the bits set in mask from mval into val.  Both val and mval are
+ * at the same addr offset, pos is the starting offset of the mask. */
+static uint32_t merge_bits(uint32_t val, uint32_t mval, uint8_t addr,
+                           int len, uint8_t pos, uint32_t mask)
+{
+    if (!ranges_overlap(addr, len, pos, 4)) {
+        return val;
+    }
+
+    if (addr >= pos) {
+        mask >>= (addr - pos) * 8;
+    } else {
+        mask <<= (pos - addr) * 8;
+    }
+    mask &= 0xffffffffU >> (4 - len) * 8;
+
+    val &= ~mask;
+    val |= (mval & mask);
+
+    return val;
+}
+
+static uint32_t assigned_device_pci_cap_read_config(PCIDevice *pci_dev,
+                                                    uint32_t address, int len)
+{
+    uint8_t cap, cap_id = pci_dev->config_map[address];
+    uint32_t val;
+
+    switch (cap_id) {
+
+    case PCI_CAP_ID_VPD:
+        cap = pci_find_capability(pci_dev, cap_id);
+        val = assigned_dev_pci_read(pci_dev, address, len);
+        return merge_bits(val, 
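
To illustrate merge_bits() above with made-up values: an emulated mask
of 0x00ff0000 registered at offset 0x40, and a 2-byte read at address
0x42.  The values are hypothetical, only the arithmetic matters:

	uint32_t hw  = 0xabcd;	/* bytes read from the device      */
	uint32_t emu = 0x1234;	/* emulated copy of the same bytes */
	/* mask 0x00ff0000 >> (0x42 - 0x40) * 8 = 0x000000ff, then
	 * clipped to the 2-byte access width (0xffff): still 0xff.
	 * Result: (0xabcd & ~0xff) | (0x1234 & 0xff) = 0xab34, i.e.
	 * low byte from emulation, the rest straight from hardware. */
	uint32_t out = merge_bits(hw, emu, 0x42, 2, 0x40, 0x00ff0000);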

Re: [PATCH 2/5] pci: MSI-X capability is 12 bytes, not 16, MSI is 10 bytes

2010-12-03 Thread Chris Wright
* Alex Williamson (alex.william...@redhat.com) wrote:
 Signed-off-by: Alex Williamson alex.william...@redhat.com
 ---
 
  hw/pci.h |4 ++--
  1 files changed, 2 insertions(+), 2 deletions(-)
 
 diff --git a/hw/pci.h b/hw/pci.h
 index 34955d8..7c52637 100644
 --- a/hw/pci.h
 +++ b/hw/pci.h
 @@ -124,8 +124,8 @@ enum {
  
  #define PCI_CAPABILITY_CONFIG_MAX_LENGTH 0x60
  #define PCI_CAPABILITY_CONFIG_DEFAULT_START_ADDR 0x40
 -#define PCI_CAPABILITY_CONFIG_MSI_LENGTH 0x10
 -#define PCI_CAPABILITY_CONFIG_MSIX_LENGTH 0x10
 +#define PCI_CAPABILITY_CONFIG_MSI_LENGTH 0xa

This is variable length.

 +#define PCI_CAPABILITY_CONFIG_MSIX_LENGTH 0x0c
  
  typedef int (*msix_mask_notifier_func)(PCIDevice *, unsigned vector,
  int masked);
 


Re: [PATCH 2/5] pci: MSI-X capability is 12 bytes, not 16, MSI is 10 bytes

2010-12-03 Thread Alex Williamson
On Fri, 2010-12-03 at 11:37 -0800, Chris Wright wrote:
 * Alex Williamson (alex.william...@redhat.com) wrote:
  Signed-off-by: Alex Williamson alex.william...@redhat.com
  ---
  
   hw/pci.h |4 ++--
   1 files changed, 2 insertions(+), 2 deletions(-)
  
  diff --git a/hw/pci.h b/hw/pci.h
  index 34955d8..7c52637 100644
  --- a/hw/pci.h
  +++ b/hw/pci.h
  @@ -124,8 +124,8 @@ enum {
   
   #define PCI_CAPABILITY_CONFIG_MAX_LENGTH 0x60
   #define PCI_CAPABILITY_CONFIG_DEFAULT_START_ADDR 0x40
  -#define PCI_CAPABILITY_CONFIG_MSI_LENGTH 0x10
  -#define PCI_CAPABILITY_CONFIG_MSIX_LENGTH 0x10
  +#define PCI_CAPABILITY_CONFIG_MSI_LENGTH 0xa
 
 This is variable length.

Yes, but this particular #define is only used by device assignment,
which only uses the minimum sized table.  Maybe as a follow-up patch we
should just remove these from pci.h and let device-assignment keep them
private.  Thanks,

Alex





Re: [PATCH 2/5] pci: MSI-X capability is 12 bytes, not 16, MSI is 10 bytes

2010-12-03 Thread Chris Wright
* Alex Williamson (alex.william...@redhat.com) wrote:
 On Fri, 2010-12-03 at 11:37 -0800, Chris Wright wrote:
  * Alex Williamson (alex.william...@redhat.com) wrote:
   Signed-off-by: Alex Williamson alex.william...@redhat.com
   ---
   
hw/pci.h |4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
   
   diff --git a/hw/pci.h b/hw/pci.h
   index 34955d8..7c52637 100644
   --- a/hw/pci.h
   +++ b/hw/pci.h
   @@ -124,8 +124,8 @@ enum {

#define PCI_CAPABILITY_CONFIG_MAX_LENGTH 0x60
#define PCI_CAPABILITY_CONFIG_DEFAULT_START_ADDR 0x40
   -#define PCI_CAPABILITY_CONFIG_MSI_LENGTH 0x10
   -#define PCI_CAPABILITY_CONFIG_MSIX_LENGTH 0x10
   +#define PCI_CAPABILITY_CONFIG_MSI_LENGTH 0xa
  
  This is variable length.
 
 Yes, but this particular #define is only used by device assignment,
 which only uses the minimum sized table.  Maybe as a follow-up patch we
 should just remove these from pci.h and let device-assignment keep them
 private.  Thanks,

Just thinking of keeping the ability to use DAC address.  Esp for those
512 way guests ;)


Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-03 Thread Mike Galbraith
On Fri, 2010-12-03 at 10:35 -0500, Rik van Riel wrote:
 On 12/03/2010 10:09 AM, Mike Galbraith wrote:
  On Fri, 2010-12-03 at 09:48 -0500, Rik van Riel wrote:
  On 12/03/2010 09:45 AM, Mike Galbraith wrote:
 
  I'll have to go back and re-read that.  Off the top of my head, I see no
  way it could matter which container the numbers live in as long as they
  keep advancing, and stay in the same runqueue.  (hm, task weights would
  have to be the same too or scaled. dangerous business, tinkering with
  vruntimes)
 
  They're not necessarily in the same runqueue, the
  VCPU that is given time might be on another CPU
  than the one that was spinning on a lock.
 
  I don't think pumping vruntime cross cfs_rq would be safe, for the
  reason noted (et al).  No competition means vruntime is meaningless.
  Donating just advances a clock that nobody's looking at.
 
 Do you have suggestions on what I should do to make
 this yield_to functionality work?

Hm.

The problem with donating vruntime across queues is that there is no
global clock.  You have to be in the same frame of reference for
vruntime donation to make any sense.  Same with cross cpu yield_to() hw
wise though.  It makes no sense from another frame of reference.  Pull.

-Mike



Re: [PATCH 0/6] KVM: SVM: Wrap access to intercept masks into functions

2010-12-03 Thread Marcelo Tosatti
On Tue, Nov 30, 2010 at 06:03:55PM +0100, Joerg Roedel wrote:
 Hi Avi, Hi Marcelo,
 
 this patchset wraps the access to the intercept vectors in the VMCB into
 specific functions. There are two reasons for this:
 
   1) In the nested-svm code the effective intercept masks are
  calculated from the host and the guest intercept masks.
  Whenever KVM changes the host intercept mask while the VCPU
  is in guest-mode the effective intercept masks need to be
  re-calculated. This is nicely wrapped into these functions
  now and makes the code more robust.
 
   2) These changes make the implementation of the upcoming
  vmcb-clean-bits feature easier and also more robust (which
  was the main reason for writing this patchset).
 
 These patches were developed on-top of the patch-set I sent yesterday. I
 tested these patches with various guests (Windows-64, Linux 32,32e and
 64 as well as with nested-svm).
 
 Regards,
 
   Joerg
 
 Summary:
 
  arch/x86/include/asm/svm.h |   44 +++--
  arch/x86/kvm/svm.c |  391 
 
  2 files changed, 241 insertions(+), 194 deletions(-)
 
 Joerg Roedel (6):
   KVM: SVM: Add function to recalculate intercept masks
   KVM: SVM: Add manipulation functions for CRx intercepts
   KVM: SVM: Add manipulation functions for DRx intercepts
   KVM: SVM: Add manipulation functions for exception intercepts
   KVM: SVM: Add manipulation functions for misc intercepts
   KVM: SVM: Use get_host_vmcb function in svm_get_msr for TSC

Applied, thanks.



Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-03 Thread Peter Zijlstra
On Fri, 2010-12-03 at 16:09 +0100, Mike Galbraith wrote:
 On Fri, 2010-12-03 at 09:48 -0500, Rik van Riel wrote:
  On 12/03/2010 09:45 AM, Mike Galbraith wrote:
  
   I'll have to go back and re-read that.  Off the top of my head, I see no
   way it could matter which container the numbers live in as long as they
   keep advancing, and stay in the same runqueue.  (hm, task weights would
   have to be the same too or scaled. dangerous business, tinkering with
   vruntimes)
  
  They're not necessarily in the same runqueue, the
  VCPU that is given time might be on another CPU
  than the one that was spinning on a lock.
 
 I don't think pumping vruntime cross cfs_rq would be safe, for the
 reason noted (et al).  No competition means vruntime is meaningless.
 Donating just advances a clock that nobody's looking at.

Yeah, cross-cpu you have to model it like exchanging lag. That's a
slightly more complicated trick (esp. since we still don't have a proper
measure for lag) but it should be doable.


Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-03 Thread Peter Zijlstra
On Fri, 2010-12-03 at 19:40 +0530, Srivatsa Vaddagiri wrote:
 On Fri, Dec 03, 2010 at 07:36:07PM +0530, Srivatsa Vaddagiri wrote:
  On Fri, Dec 03, 2010 at 03:03:30PM +0100, Peter Zijlstra wrote:
   No, because they do receive service (they spend some time spinning
   before being interrupted), so the respective vruntimes will increase, at
   some point they'll pass B0 and it'll get scheduled.
  
  Is that sufficient to ensure that B0 receives its fair share (1/3 cpu
  in this case)?
 
 Hmm perhaps yes, althought at cost of tons of context switches, which would be
 nice to minimize on?

Don't care, as long as the guys calling yield_to() pay for that time its
their problem.


Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-03 Thread Peter Zijlstra
On Fri, 2010-12-03 at 13:27 -0500, Rik van Riel wrote:
  Should these details all be in sched_fair?  Seems like the wrong layer
  here.  And would that condition go the other way?  If new vruntime is
  smaller than min, then it becomes new cfs_rq-min_vruntime?
 
 That would be nice.  Unfortunately, EXPORT_SYMBOL() does
 not seem to work right from sched_fair.c, which is included
 from sched.c instead of being built from the makefile!

I'm not quite sure why that is, but I kinda like that, the policy
implementation should never export stuff.

Code outside the scheduler cannot ever know the policy of a task, hence
policy specific exports are bad.

A generic export with policy implementations (like the
sched_class::yield_to() proposal) is the proper way.


[PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v3)

2010-12-03 Thread Anthony Liguori
In certain use-cases, we want to allocate guests fixed time slices where idle
guest cycles leave the machine idling.  There are many approaches to achieve
this but the most direct is to simply avoid trapping the HLT instruction which
lets the guest directly execute the instruction putting the processor to sleep.

Introduce this as a module-level option for kvm-vmx.ko since if you do this
for one guest, you probably want to do it for all.  A similar option is possible
for AMD but I don't have easy access to AMD test hardware.

Signed-off-by: Anthony Liguori aligu...@us.ibm.com
---
v3 - v2
 - Clear HLT activity state on exception injection to fix issue with async PF

v1 - v2
 - Rename parameter to yield_on_hlt
 - Remove __read_mostly

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 42d9590..9642c22 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -297,6 +297,12 @@ enum vmcs_field {
 #define GUEST_INTR_STATE_SMI   0x0004
 #define GUEST_INTR_STATE_NMI   0x0008
 
+/* GUEST_ACTIVITY_STATE flags */
+#define GUEST_ACTIVITY_ACTIVE  0
+#define GUEST_ACTIVITY_HLT 1
+#define GUEST_ACTIVITY_SHUTDOWN2
+#define GUEST_ACTIVITY_WAIT_SIPI   3
+
 /*
  * Exit Qualifications for MOV for Control Register Access
  */
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index caa967e..e8e64cb 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -69,6 +69,9 @@ module_param(emulate_invalid_guest_state, bool, S_IRUGO);
 static int __read_mostly vmm_exclusive = 1;
 module_param(vmm_exclusive, bool, S_IRUGO);
 
+static int yield_on_hlt = 1;
+module_param(yield_on_hlt, bool, S_IRUGO);
+
 #define KVM_GUEST_CR0_MASK_UNRESTRICTED_GUEST  \
(X86_CR0_WP | X86_CR0_NE | X86_CR0_NW | X86_CR0_CD)
 #define KVM_GUEST_CR0_MASK \
@@ -1016,6 +1019,10 @@ static void vmx_queue_exception(struct kvm_vcpu *vcpu, 
unsigned nr,
struct vcpu_vmx *vmx = to_vmx(vcpu);
u32 intr_info = nr | INTR_INFO_VALID_MASK;
 
+	/* Cannot inject an exception if the guest activity state is HLT */
+	if (vmcs_read32(GUEST_ACTIVITY_STATE) == GUEST_ACTIVITY_HLT)
+		vmcs_write32(GUEST_ACTIVITY_STATE, GUEST_ACTIVITY_ACTIVE);
+
if (has_error_code) {
vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE, error_code);
intr_info |= INTR_INFO_DELIVER_CODE_MASK;
@@ -1419,7 +1426,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 				&_pin_based_exec_control) < 0)
 		return -EIO;
 
-	min = CPU_BASED_HLT_EXITING |
+	min =
 #ifdef CONFIG_X86_64
 	      CPU_BASED_CR8_LOAD_EXITING |
 	      CPU_BASED_CR8_STORE_EXITING |
@@ -1432,6 +1439,10 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 	      CPU_BASED_MWAIT_EXITING |
 	      CPU_BASED_MONITOR_EXITING |
 	      CPU_BASED_INVLPG_EXITING;
+
+	if (yield_on_hlt)
+		min |= CPU_BASED_HLT_EXITING;
+
 	opt = CPU_BASED_TPR_SHADOW |
 	      CPU_BASED_USE_MSR_BITMAPS |
 	      CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
-- 
1.7.0.4



Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)

2010-12-03 Thread Anthony Liguori

On 12/03/2010 03:36 AM, Avi Kivity wrote:

On 12/02/2010 10:51 PM, Anthony Liguori wrote:

VCPU in HLT state only allows injection of certain events that
would be delivered on HLT. #PF is not one of them.


But you can't inject an exception into a guest while the VMCS is 
active, can you? 


No, but this is irrelevant.

So the guest takes an exit while in the hlt instruction but that's no 
different than if the guest has been interrupted because of hlt exiting.


hlt exiting doesn't leave vcpu in the halted state (since hlt has not 
been executed).  So currently we never see a vcpu in halted state.


Right, you mean the guest activity state being halt.

My understanding is that it just needs to be cleared on exception 
injection.  We could clear it at every vmentry, but that would 
introduce a vmcs_read() to the fast path, which is undesirable.


Regards,

Anthony Liguori



Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)

2010-12-03 Thread Anthony Liguori

On 12/02/2010 02:40 PM, Marcelo Tosatti wrote:

On Thu, Dec 02, 2010 at 11:14:16AM -0800, Chris Wright wrote:
   

* Anthony Liguori (aligu...@us.ibm.com) wrote:
 

In certain use-cases, we want to allocate guests fixed time slices where idle
guest cycles leave the machine idling.  There are many approaches to achieve
this but the most direct is to simply avoid trapping the HLT instruction which
lets the guest directly execute the instruction putting the processor to sleep.
   

I like the idea, esp to keep from burning power.

 

Introduce this as a module-level option for kvm-vmx.ko since if you do this
for one guest, you probably want to do it for all.  A similar option is possible
for AMD but I don't have easy access to AMD test hardware.
   

Perhaps it should be a VM level option.  And then invert the notion.
Create one idle domain w/out hlt trap.  Give that VM a vcpu per pcpu
(pin in place probably).  And have that VM do nothing other than hlt.
Then it's always runnable according to scheduler, and can consume the
extra work that CFS wants to give away.

What do you think?

thanks,
-chris
 

Consuming the timeslice outside guest mode is less intrusive and easier
to replace. Something like this should work?

if (vcpu->arch.mp_state == KVM_MP_STATE_HALTED) {
	while (!need_resched())
		default_idle();
}
   


This looked nice, but in practice the implementation wasn't, unless I 
totally misunderstood what you were getting at.


default_idle() is not exported to modules and is not an interface meant 
to be called directly.  Plus, an idle loop like this delays the guest 
until the scheduler wants to run something else, but it doesn't account 
for another thread trying to inject an event into the halting thread.  
It's not immediately clear to me how to have what's effectively a wait 
queue that hlts instead of calling the scheduler.  You could mess around 
with various signalling mechanisms, but it gets ugly fast.


So I circled back to disabling hlt exiting this time taking care of 
updating GUEST_ACTIVITY_STATE when necessary.


Regards,

Anthony Liguori


But you agree this is no KVM business.



Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)

2010-12-03 Thread Anthony Liguori

On 12/02/2010 11:37 AM, Marcelo Tosatti wrote:

On Thu, Dec 02, 2010 at 07:59:17AM -0600, Anthony Liguori wrote:
   

In certain use-cases, we want to allocate guests fixed time slices where idle
guest cycles leave the machine idling.  There are many approaches to achieve
this but the most direct is to simply avoid trapping the HLT instruction which
lets the guest directly execute the instruction putting the processor to sleep.

Introduce this as a module-level option for kvm-vmx.ko since if you do this
for one guest, you probably want to do it for all.  A similar option is possible
for AMD but I don't have easy access to AMD test hardware.

Signed-off-by: Anthony Liguorialigu...@us.ibm.com
---
v1 -  v2
  - Rename parameter to yield_on_hlt
  - Remove __read_mostly

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index caa967e..d8310e4 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -69,6 +69,9 @@ module_param(emulate_invalid_guest_state, bool, S_IRUGO);
  static int __read_mostly vmm_exclusive = 1;
  module_param(vmm_exclusive, bool, S_IRUGO);

+static int yield_on_hlt = 1;
+module_param(yield_on_hlt, bool, S_IRUGO);
+
  #define KVM_GUEST_CR0_MASK_UNRESTRICTED_GUEST \
(X86_CR0_WP | X86_CR0_NE | X86_CR0_NW | X86_CR0_CD)
  #define KVM_GUEST_CR0_MASK\
@@ -1419,7 +1422,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 				&_pin_based_exec_control) < 0)
 		return -EIO;
 
-	min = CPU_BASED_HLT_EXITING |
+	min =
 #ifdef CONFIG_X86_64
 	      CPU_BASED_CR8_LOAD_EXITING |
 	      CPU_BASED_CR8_STORE_EXITING |
@@ -1432,6 +1435,10 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 	      CPU_BASED_MWAIT_EXITING |
 	      CPU_BASED_MONITOR_EXITING |
 	      CPU_BASED_INVLPG_EXITING;
+
+	if (yield_on_hlt)
+		min |= CPU_BASED_HLT_EXITING;
+
 	opt = CPU_BASED_TPR_SHADOW |
 	      CPU_BASED_USE_MSR_BITMAPS |
 	      CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
--
1.7.0.4
 

Breaks async PF (see checks on guest state), timer reinjection
probably.


In v3, I set the activity state to ACTIVE if the state is currently HLT 
when injecting an exception into a guest.


The effect is that after the exception is handled, if iret is executed, 
the hlt instruction will be restarted.  That seems like the correct 
semantics to me.


Regards,

Anthony Liguori


  It should be possible to achieve determinism with
a scheduler policy?





Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)

2010-12-03 Thread Anthony Liguori

On 12/03/2010 03:38 AM, Avi Kivity wrote:

On 12/02/2010 05:23 PM, Anthony Liguori wrote:

On 12/02/2010 08:39 AM, lidong chen wrote:
In certain use-cases, we want to allocate guests fixed time slices 
where idle

guest cycles leave the machine idling.

i could not understand why need this? can you tell more detailedly?


If you run 4 guests on a CPU, and they're all trying to consume 100% 
CPU, all things being equal, you'll get ~25% CPU for each guest.


However, if one guest is idle, you'll get something like 1% 32% 33% 
32%.  This characteristic is usually desirable because it increases 
aggregate throughput, but in some circumstances determinism is more 
desirable than aggregate throughput.


This patch essentially makes guest execution non-work conserving by 
making it appear to the scheduler that each guest wants 100% CPU even 
though they may be idling.


That means that regardless of what each guest is doing, if you have 
four guests on one CPU, each will get ~25% CPU[1].




What if one of the guest crashes qemu or invokes a powerdown?  
Suddenly the others get 33% each (with 1% going to my secret round-up 
account).  Doesn't seem like a reliable way to limit cpu.


A guest shutting down is a macro event.  Macro events are easy to track 
and are logged by even the most naive management tools.  Macro events 
affecting performance are a workable problem.  I agree, it would be 
ideal to make them not impact performance but perfection is the enemy of 
good.


The problem with the status quo is that there is no performance 
stability in a consolidation environment.


Regards,

Anthony Liguori




Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)

2010-12-03 Thread Anthony Liguori

On 12/02/2010 02:12 PM, Marcelo Tosatti wrote:

  It should be possible to achieve determinism with
a scheduler policy?
   

If the ultimate desire is to have the guests be scheduled in a
non-work-conserving fashion, I can't see a more direct approach
than to simply not have the guests yield (which is ultimately what
hlt trapping does).

Anything the scheduler would do is after the fact and probably based
on inference about why the guest yielded.
 

Another issue is you ignore the hosts idea of the best way to sleep
(ACPI, or whatever).
   


Non-work conserving schedulers kill polar bears.  There's simply no way 
around it.


The best strategy for power savings is to complete your work as quickly 
as you can and then spend as much time in the deepest sleep mode you 
can.  If you're using a non-work-conserving scheduler, you're going to 
take more time to complete a workload, spending needless cycles in 
shallow sleep states.


But that's the price we pay for determinism.  Maybe we can plant some 
trees at the next KVM Forum to offset CPU limits? :-)



And handling inactive HLT state (which was never enabled) can be painful.
   


Sorry, I'm not sure what you mean by this.

Regards,

Anthony Liguori


Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v3)

2010-12-03 Thread Joerg Roedel
On Fri, Dec 03, 2010 at 04:39:22PM -0600, Anthony Liguori wrote:
 + if (yield_on_hlt)
 + min |= CPU_BASED_HLT_EXITING;

This approach won't work out on AMD because in HLT the CPU may enter
C1e. In C1e the local apic timer interrupt is not delivered anymore, and
when this is the current timer in use the cpu may miss timer ticks or
never come out of HLT again. The guest has no chance to work around
this as the Linux idle routine does.
If you really want active idling of a guest, it should idle in the
hypervisor where it can work around such problems.

Joerg



[PATCH 0/5] KVMgenirq: Enable adaptive IRQ sharing for passed-through devices

2010-12-03 Thread Jan Kiszka
Besides 3 cleanup patches, this series consists of two major changes.
The first introduces an interrupt sharing notifier to the genirq
subsystem. It fires when an interrupt line is about to be used by more
than one driver, or when the last but one user called free_irq.

The second major change makes use of this interface in KVM's PCI pass-
through subsystem. KVM has to keep the interrupt source disabled while
calling into the guest to handle the event. This can be done at device
or line level. The former is required to share the interrupt line, the
latter is an order of magnitude faster (see patch 3 for details).

Besides KVM's pass-through support, further users of the IRQ notifier
could be VFIO (not yet mainline) and uio_pci_generic, which have to
resolve the same conflict.

Jan Kiszka (5):
  genirq: Pass descriptor to __free_irq
  genirq: Introduce interrupt sharing notifier
  KVM: Split up MSI-X assigned device IRQ handler
  KVM: Clean up unneeded void pointer casts
  KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices

 Documentation/kvm/api.txt |   25 +++
 arch/x86/kvm/x86.c|1 +
 include/linux/interrupt.h |   21 +++
 include/linux/irqdesc.h   |9 +
 include/linux/kvm.h   |6 +
 include/linux/kvm_host.h  |4 +
 kernel/irq/irqdesc.c  |6 +
 kernel/irq/manage.c   |  174 ++--
 virt/kvm/assigned-dev.c   |  420 +++--
 9 files changed, 606 insertions(+), 60 deletions(-)



[PATCH 1/5] genirq: Pass descriptor to __free_irq

2010-12-03 Thread Jan Kiszka
From: Jan Kiszka jan.kis...@siemens.com

One caller of __free_irq already has to resolve and check the irq
descriptor. So this small cleanup consistently pushes irq_to_desc and
the NULL check to the call site to avoid redundant work.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 kernel/irq/manage.c |   16 +---
 1 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 5f92acc..6341765 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -879,17 +879,14 @@ EXPORT_SYMBOL_GPL(setup_irq);
  * Internal function to unregister an irqaction - used to free
  * regular and special interrupts that are part of the architecture.
  */
-static struct irqaction *__free_irq(unsigned int irq, void *dev_id)
+static struct irqaction *__free_irq(struct irq_desc *desc, void *dev_id)
 {
-	struct irq_desc *desc = irq_to_desc(irq);
 	struct irqaction *action, **action_ptr;
+	unsigned int irq = desc->irq_data.irq;
 	unsigned long flags;
 
 	WARN(in_interrupt(), "Trying to free IRQ %d from IRQ context!\n", irq);
 
-	if (!desc)
-		return NULL;
-
 	raw_spin_lock_irqsave(&desc->lock, flags);
 
 	/*
@@ -977,7 +974,12 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id)
  */
 void remove_irq(unsigned int irq, struct irqaction *act)
 {
-	__free_irq(irq, act->dev_id);
+	struct irq_desc *desc = irq_to_desc(irq);
+
+	if (!desc)
+		return;
+
+	__free_irq(desc, act->dev_id);
 }
 EXPORT_SYMBOL_GPL(remove_irq);
 
@@ -1003,7 +1005,7 @@ void free_irq(unsigned int irq, void *dev_id)
 	return;
 
 	chip_bus_lock(desc);
-	kfree(__free_irq(irq, dev_id));
+	kfree(__free_irq(desc, dev_id));
 	chip_bus_sync_unlock(desc);
 }
 EXPORT_SYMBOL(free_irq);
-- 
1.7.1



[PATCH 5/5] KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices

2010-12-03 Thread Jan Kiszka
From: Jan Kiszka jan.kis...@siemens.com

PCI 2.3 allows IRQ sources to be generically disabled at the device
level. This enables us to share the IRQs of such devices on the host
side when passing them to a guest.

However, IRQ disabling via the PCI config space is more costly than
masking the line via disable_irq. Therefore we register an IRQ sharing
notifier and switch between line and device level disabling on demand.

This feature is optional, user space has to request it explicitly as it
also has to inform us about its view of PCI_COMMAND_INTX_DISABLE. That
way, we can avoid unmasking the interrupt and signaling it if the guest
masked it via the PCI config space.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 Documentation/kvm/api.txt |   25 +++
 arch/x86/kvm/x86.c|1 +
 include/linux/kvm.h   |6 +
 include/linux/kvm_host.h  |4 +
 virt/kvm/assigned-dev.c   |  382 +
 5 files changed, 385 insertions(+), 33 deletions(-)

diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt
index e1a9297..5e164db 100644
--- a/Documentation/kvm/api.txt
+++ b/Documentation/kvm/api.txt
@@ -1112,6 +1112,14 @@ following flags are specified:
 
 /* Depends on KVM_CAP_IOMMU */
 #define KVM_DEV_ASSIGN_ENABLE_IOMMU	(1 << 0)
+/* The following two depend on KVM_CAP_PCI_2_3 */
+#define KVM_DEV_ASSIGN_PCI_2_3		(1 << 1)
+#define KVM_DEV_ASSIGN_MASK_INTX	(1 << 2)
+
+If KVM_DEV_ASSIGN_PCI_2_3 is set, the kernel will manage legacy INTx interrupts
+via the PCI-2.3-compliant device-level mask, but only if IRQ sharing with other
+assigned or host devices requires it. KVM_DEV_ASSIGN_MASK_INTX specifies the
+guest's view on the INTx mask, see KVM_ASSIGN_SET_INTX_MASK for details.
 
 4.48 KVM_DEASSIGN_PCI_DEVICE
 
@@ -1263,6 +1271,23 @@ struct kvm_assigned_msix_entry {
__u16 padding[3];
 };
 
+4.54 KVM_ASSIGN_SET_INTX_MASK
+
+Capability: KVM_CAP_PCI_2_3
+Architectures: x86
+Type: vm ioctl
+Parameters: struct kvm_assigned_pci_dev (in)
+Returns: 0 on success, -1 on error
+
+Informs the kernel about the guest's view on the INTx mask. As long as the
+guest masks the legacy INTx, the kernel will refrain from unmasking it at
+hardware level and will not assert the guest's IRQ line. User space is still
+responsible for applying this state to the assigned device's real config space.
+
+See KVM_ASSIGN_DEV_IRQ for the data structure. The target device is specified
+by assigned_dev_id. In the flags field, only KVM_DEV_ASSIGN_MASK_INTX is
+evaluated.
+
 5. The kvm_run structure
 
 Application code obtains a pointer to the kvm_run structure by
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 410d2d1..c77c129 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1969,6 +1969,7 @@ int kvm_dev_ioctl_check_extension(long ext)
case KVM_CAP_X86_ROBUST_SINGLESTEP:
case KVM_CAP_XSAVE:
case KVM_CAP_ASYNC_PF:
+   case KVM_CAP_PCI_2_3:
r = 1;
break;
case KVM_CAP_COALESCED_MMIO:
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index ea2dc1a..3cadb42 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -541,6 +541,7 @@ struct kvm_ppc_pvinfo {
 #define KVM_CAP_PPC_GET_PVINFO 57
 #define KVM_CAP_PPC_IRQ_LEVEL 58
 #define KVM_CAP_ASYNC_PF 59
+#define KVM_CAP_PCI_2_3 60
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -677,6 +678,9 @@ struct kvm_clock_data {
 #define KVM_SET_PIT2  _IOW(KVMIO,  0xa0, struct kvm_pit_state2)
 /* Available with KVM_CAP_PPC_GET_PVINFO */
 #define KVM_PPC_GET_PVINFO   _IOW(KVMIO,  0xa1, struct kvm_ppc_pvinfo)
+/* Available with KVM_CAP_PCI_2_3 */
+#define KVM_ASSIGN_SET_INTX_MASK  _IOW(KVMIO,  0xa2, \
+  struct kvm_assigned_pci_dev)
 
 /*
  * ioctls for vcpu fds
@@ -742,6 +746,8 @@ struct kvm_clock_data {
 #define KVM_SET_XCRS _IOW(KVMIO,  0xa7, struct kvm_xcrs)
 
 #define KVM_DEV_ASSIGN_ENABLE_IOMMU	(1 << 0)
+#define KVM_DEV_ASSIGN_PCI_2_3		(1 << 1)
+#define KVM_DEV_ASSIGN_MASK_INTX	(1 << 2)
 
 struct kvm_assigned_pci_dev {
__u32 assigned_dev_id;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index da0794f..6849568 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -18,6 +18,7 @@
 #include <linux/msi.h>
 #include <linux/slab.h>
 #include <linux/rcupdate.h>
+#include <linux/notifier.h>
 #include <asm/signal.h>
 
 #include <linux/kvm.h>
@@ -485,6 +486,7 @@ struct kvm_assigned_dev_kernel {
unsigned int entries_nr;
int host_irq;
bool host_irq_disabled;
+   bool pci_2_3_masking;
struct msix_entry *host_msix_entries;
int guest_irq;
struct msix_entry *guest_msix_entries;
@@ -494,7 +496,9 @@ struct kvm_assigned_dev_kernel {
struct pci_dev *dev;
struct kvm *kvm;
spinlock_t intx_lock;
+   spinlock_t intx_mask_lock;
char irq_name[32];
+   struct notifier_block sh_notifier;
 };
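
The device-level masking this series toggles is just the PCI 2.3 INTx
disable bit in the command register.  A minimal sketch of the idea,
using the standard config-space accessors (the kernel's pci_intx()
helper does essentially this):

	/* Sketch only: mask/unmask legacy INTx at the device, per PCI 2.3. */
	static void intx_mask(struct pci_dev *pdev, bool mask)
	{
		u16 cmd;

		pci_read_config_word(pdev, PCI_COMMAND, &cmd);
		if (mask)
			cmd |= PCI_COMMAND_INTX_DISABLE;
		else
			cmd &= ~PCI_COMMAND_INTX_DISABLE;
		pci_write_config_word(pdev, PCI_COMMAND, cmd);
	}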
 
 

Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v3)

2010-12-03 Thread Anthony Liguori

On 12/03/2010 05:32 PM, Joerg Roedel wrote:

On Fri, Dec 03, 2010 at 04:39:22PM -0600, Anthony Liguori wrote:
   

+   if (yield_on_hlt)
+   min |= CPU_BASED_HLT_EXITING;
 

This approach won't work out on AMD because in HLT the CPU may enter
C1e. In C1e the local apic timer interrupt is not delivered anymore, and
when this is the current timer in use the cpu may miss timer ticks or
never come out of HLT again. The guest has no chance to work around
this as the Linux idle routine does.
   


And this doesn't break old software on bare metal?

Regards,

Anthony Liguori


If you really want active idling of a guest, it should idle in the
hypervisor where it can work around such problems.

Joerg



[PATCH 4/5] KVM: Clean up unneeded void pointer casts

2010-12-03 Thread Jan Kiszka
From: Jan Kiszka jan.kis...@siemens.com

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 virt/kvm/assigned-dev.c |   12 ++--
 1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
index 3f8a745..c6114d3 100644
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -149,7 +149,7 @@ static void deassign_host_irq(struct kvm *kvm,
 
 		for (i = 0; i < assigned_dev->entries_nr; i++)
 			free_irq(assigned_dev->host_msix_entries[i].vector,
-				 (void *)assigned_dev);
+				 assigned_dev);
 
 		assigned_dev->entries_nr = 0;
 		kfree(assigned_dev->host_msix_entries);
@@ -159,7 +159,7 @@ static void deassign_host_irq(struct kvm *kvm,
 		/* Deal with MSI and INTx */
 		disable_irq(assigned_dev->host_irq);
 
-		free_irq(assigned_dev->host_irq, (void *)assigned_dev);
+		free_irq(assigned_dev->host_irq, assigned_dev);
 
 		if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSI)
 			pci_disable_msi(assigned_dev->dev);
@@ -238,7 +238,7 @@ static int assigned_device_enable_host_intx(struct kvm *kvm,
 	 * are going to be long delays in accepting, acking, etc.
 	 */
 	if (request_threaded_irq(dev->host_irq, NULL, kvm_assigned_dev_thread,
-				 IRQF_ONESHOT, dev->irq_name, (void *)dev))
+				 IRQF_ONESHOT, dev->irq_name, dev))
 		return -EIO;
 	return 0;
 }
@@ -257,7 +257,7 @@ static int assigned_device_enable_host_msi(struct kvm *kvm,
 
 	dev->host_irq = dev->dev->irq;
 	if (request_threaded_irq(dev->host_irq, NULL, kvm_assigned_dev_thread,
-				 0, dev->irq_name, (void *)dev)) {
+				 0, dev->irq_name, dev)) {
 		pci_disable_msi(dev->dev);
 		return -EIO;
 	}
@@ -284,7 +284,7 @@ static int assigned_device_enable_host_msix(struct kvm *kvm,
 	for (i = 0; i < dev->entries_nr; i++) {
 		r = request_threaded_irq(dev->host_msix_entries[i].vector,
 					 NULL, kvm_assigned_dev_thread_msix,
-					 0, dev->irq_name, (void *)dev);
+					 0, dev->irq_name, dev);
 		if (r)
 			goto err;
 	}
@@ -292,7 +292,7 @@ static int assigned_device_enable_host_msix(struct kvm *kvm,
 	return 0;
 err:
 	for (i -= 1; i >= 0; i--)
-		free_irq(dev->host_msix_entries[i].vector, (void *)dev);
+		free_irq(dev->host_msix_entries[i].vector, dev);
 	pci_disable_msix(dev->dev);
 	return r;
 }
-- 
1.7.1



[PATCH 3/5] KVM: Split up MSI-X assigned device IRQ handler

2010-12-03 Thread Jan Kiszka
From: Jan Kiszka jan.kis...@siemens.com

The threaded IRQ handler for MSI-X has almost nothing in common with the
INTx/MSI handler. Move its code into a dedicated handler.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 virt/kvm/assigned-dev.c |   32 +++-
 1 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
index ae72ae6..3f8a745 100644
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -58,8 +58,6 @@ static int find_index_from_host_irq(struct kvm_assigned_dev_kernel
 static irqreturn_t kvm_assigned_dev_thread(int irq, void *dev_id)
 {
 	struct kvm_assigned_dev_kernel *assigned_dev = dev_id;
-	u32 vector;
-	int index;
 
 	if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_INTX) {
 		spin_lock(&assigned_dev->intx_lock);
@@ -68,20 +66,28 @@ static irqreturn_t kvm_assigned_dev_thread(int irq, void *dev_id)
 		spin_unlock(&assigned_dev->intx_lock);
 	}
 
-	if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX) {
-		index = find_index_from_host_irq(assigned_dev, irq);
-		if (index >= 0) {
-			vector = assigned_dev->
-				guest_msix_entries[index].vector;
-			kvm_set_irq(assigned_dev->kvm,
-				    assigned_dev->irq_source_id, vector, 1);
-		}
-	} else
+	kvm_set_irq(assigned_dev->kvm, assigned_dev->irq_source_id,
+		    assigned_dev->guest_irq, 1);
+
+	return IRQ_HANDLED;
+}
+
+#ifdef __KVM_HAVE_MSIX
+static irqreturn_t kvm_assigned_dev_thread_msix(int irq, void *dev_id)
+{
+	struct kvm_assigned_dev_kernel *assigned_dev = dev_id;
+	int index = find_index_from_host_irq(assigned_dev, irq);
+	u32 vector;
+
+	if (index >= 0) {
+		vector = assigned_dev->guest_msix_entries[index].vector;
 		kvm_set_irq(assigned_dev->kvm, assigned_dev->irq_source_id,
-			    assigned_dev->guest_irq, 1);
+			    vector, 1);
+	}
 
 	return IRQ_HANDLED;
 }
+#endif
 
 /* Ack the irq line for an assigned device */
 static void kvm_assigned_dev_ack_irq(struct kvm_irq_ack_notifier *kian)
@@ -277,7 +283,7 @@ static int assigned_device_enable_host_msix(struct kvm *kvm,
 
 	for (i = 0; i < dev->entries_nr; i++) {
 		r = request_threaded_irq(dev->host_msix_entries[i].vector,
-					 NULL, kvm_assigned_dev_thread,
+					 NULL, kvm_assigned_dev_thread_msix,
 					 0, dev->irq_name, (void *)dev);
 		if (r)
 			goto err;
-- 
1.7.1


