Re: [PATCH] kvm: x86: Save/restore error_code

2010-12-10 Thread Jason Wang
Juan Quintela writes:
  Jason Wang jasow...@redhat.com wrote:
   Juan Quintela writes:
 Jason Wang jasow...@redhat.com wrote:
  The saving and restoring of error_code seem to have been lost; also
  convert error_code to uint32_t.
 
  Signed-off-by: Jason Wang jasow...@redhat.com
  ---
   target-i386/cpu.h |4 ++--
   target-i386/machine.c |2 ++
   2 files changed, 4 insertions(+), 2 deletions(-)
 
 It should be a new subsection.  The test is if has_error_code != 0
 according to gleb.
 
 Later, Juan.
 
  
   Thanks for the reminder. Maybe we can just use VMSTATE_UINT32_TEST(),
   which is simpler than a subsection, to do the check?
  
  we need the subsection, that way we don't need to bump the section
  version.
  
  Later, Juan.
  

I have tried the subsection approach, but found an issue:

When we use subsections with a structure that has a nested
VMStateDescription at the end of its fields, the subsections cannot be loaded
correctly, because qemu always tries to match the subsection against the
nested one.

So when we use subsections for vmstate_cpu, loading the subsection always
fails: qemu tries to match the cpu's subsection against vmstate_ymmh_reg,
the nested VMStateDescription at the end.

Maybe we need to modify vmstate_subsection_load() to handle this case.

Any thoughts on this?
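
To illustrate the failure mode, the problematic layout looks roughly like
this (a minimal sketch, not the actual machine.c definitions; the
vmstate_error_code/error_code_needed names are hypothetical):

    static const VMStateDescription vmstate_cpu = {
        .name = "cpu",
        .fields = (VMStateField[]) {
            /* ... */
            /* the last field references a nested VMStateDescription: */
            VMSTATE_YMMH_REGS_VARS(ymmh_regs, CPUState, CPU_NB_REGS, 12),
            VMSTATE_END_OF_LIST()
        },
        .subsections = (VMStateSubsection[]) {
            { .vmsd = &vmstate_error_code, .needed = error_code_needed },
            { /* end of list */ }
        }
    };

On load, vmstate_subsection_load() ends up comparing the incoming subsection
against the trailing nested VMStateDescription (vmstate_ymmh_reg) instead of
the entries in .subsections, so the lookup fails.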

  
  diff --git a/target-i386/cpu.h b/target-i386/cpu.h
  index 06e40f3..c990db9 100644
  --- a/target-i386/cpu.h
  +++ b/target-i386/cpu.h
  @@ -688,7 +688,7 @@ typedef struct CPUX86State {
   uint64_t pat;
   
   /* exception/interrupt handling */
  -int error_code;
  +uint32_t error_code;
   int exception_is_int;
   target_ulong exception_next_eip;
   target_ulong dr[8]; /* debug registers */
  @@ -933,7 +933,7 @@ uint64_t cpu_get_tsc(CPUX86State *env);
   #define cpu_list_id x86_cpu_list
   #define cpudef_setup x86_cpudef_setup
   
  -#define CPU_SAVE_VERSION 12
  +#define CPU_SAVE_VERSION 13
   
   /* MMU modes definitions */
   #define MMU_MODE0_SUFFIX _kernel
  diff --git a/target-i386/machine.c b/target-i386/machine.c
  index d78eceb..0e467da 100644
  --- a/target-i386/machine.c
  +++ b/target-i386/machine.c
  @@ -491,6 +491,8 @@ static const VMStateDescription vmstate_cpu = {
   VMSTATE_UINT64_V(xcr0, CPUState, 12),
   VMSTATE_UINT64_V(xstate_bv, CPUState, 12),
   VMSTATE_YMMH_REGS_VARS(ymmh_regs, CPUState, CPU_NB_REGS, 12),
  +
  +VMSTATE_UINT32_V(error_code, CPUState, 13),
   VMSTATE_END_OF_LIST()
   /* The above list is not sorted /wrt version numbers, watch 
   out! */
   },


[PATCH 1/3] Clean up cpu_inject_x86_mce().

2010-12-10 Thread Jin Dongming
Hi, all

I am sorry for the late reply.
I have modified the patches according to the comments from Huang-san and
Marcelo-san, and I am resending them.

Thanks.

Best Regards,
Jin Dongming
---
Clean up cpu_inject_x86_mce() for a later patch.

Signed-off-by: Jin Dongming jin.dongm...@np.css.fujitsu.com
---
 target-i386/helper.c |   27 +--
 1 files changed, 17 insertions(+), 10 deletions(-)

diff --git a/target-i386/helper.c b/target-i386/helper.c
index d765cc6..9c07c38 100644
--- a/target-i386/helper.c
+++ b/target-i386/helper.c
@@ -1023,21 +1023,12 @@ static void breakpoint_handler(CPUState *env)
 /* This should come from sysemu.h - if we could include it here... */
 void qemu_system_reset_request(void);
 
-void cpu_inject_x86_mce(CPUState *cenv, int bank, uint64_t status,
+static void qemu_inject_x86_mce(CPUState *cenv, int bank, uint64_t status,
 uint64_t mcg_status, uint64_t addr, uint64_t misc)
 {
 uint64_t mcg_cap = cenv->mcg_cap;
-unsigned bank_num = mcg_cap & 0xff;
 uint64_t *banks = cenv->mce_banks;
 
-if (bank >= bank_num || !(status & MCI_STATUS_VAL))
-return;
-
-if (kvm_enabled()) {
-kvm_inject_x86_mce(cenv, bank, status, mcg_status, addr, misc, 0);
-return;
-}
-
 /*
  * if MSR_MCG_CTL is not all 1s, the uncorrected error
  * reporting is disabled
@@ -1078,6 +1069,22 @@ void cpu_inject_x86_mce(CPUState *cenv, int bank, 
uint64_t status,
 } else
 banks[1] |= MCI_STATUS_OVER;
 }
+
+void cpu_inject_x86_mce(CPUState *cenv, int bank, uint64_t status,
+uint64_t mcg_status, uint64_t addr, uint64_t misc)
+{
+unsigned bank_num = cenv->mcg_cap & 0xff;
+
+if (bank >= bank_num || !(status & MCI_STATUS_VAL)) {
+return;
+}
+
+if (kvm_enabled()) {
+kvm_inject_x86_mce(cenv, bank, status, mcg_status, addr, misc, 0);
+} else {
+qemu_inject_x86_mce(cenv, bank, status, mcg_status, addr, misc);
+}
+}
 #endif /* !CONFIG_USER_ONLY */
 
 static void mce_init(CPUX86State *cenv)
-- 
1.7.1.1




[PATCH 2/3] Add broadcast option for mce command.

2010-12-10 Thread Jin Dongming
When the following test case is injected with the mce command, the user may
not get the expected result:

<DATA>
           command  cpu  bank  status  mcg_status  addr    misc
    (qemu) mce      1    1     0xbd00  0x05        0x1234  0x8c

<Expected Result>
    panic type: "Fatal Machine check"

That is because each mce command can only inject the given cpu and cannot
inject an mce interrupt into the other cpus. So the user will get the
following result instead:

    panic type: "Fatal machine check on current CPU"

The broadcast option is used for injecting dummy data into the other cpus.
When mce is injected with this option, the expected result is obtained.

Usage:
Broadcast [on]
           command  broadcast  cpu  bank  status  mcg_status  addr    misc
    (qemu) mce      -b         1    1     0xbd00  0x05        0x1234  0x8c

Broadcast [off]
           command  cpu  bank  status  mcg_status  addr    misc
    (qemu) mce      1    1     0xbd00  0x05        0x1234  0x8c

Signed-off-by: Jin Dongming jin.dongm...@np.css.fujitsu.com
---
 cpu-all.h |3 ++-
 hmp-commands.hx   |6 +++---
 monitor.c |7 +--
 target-i386/helper.c  |   20 ++--
 target-i386/kvm.c |   16 
 target-i386/kvm_x86.h |5 -
 6 files changed, 44 insertions(+), 13 deletions(-)

diff --git a/cpu-all.h b/cpu-all.h
index 30ae17d..4ce4e83 100644
--- a/cpu-all.h
+++ b/cpu-all.h
@@ -964,6 +964,7 @@ int cpu_memory_rw_debug(CPUState *env, target_ulong addr,
 uint8_t *buf, int len, int is_write);
 
 void cpu_inject_x86_mce(CPUState *cenv, int bank, uint64_t status,
-uint64_t mcg_status, uint64_t addr, uint64_t misc);
+uint64_t mcg_status, uint64_t addr, uint64_t misc,
+int broadcast);
 
 #endif /* CPU_ALL_H */
diff --git a/hmp-commands.hx b/hmp-commands.hx
index d989ddb..dea5e78 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -1054,9 +1054,9 @@ ETEXI
 
 {
 .name   = "mce",
-.args_type  = "cpu_index:i,bank:i,status:l,mcg_status:l,addr:l,misc:l",
-.params = "cpu bank status mcgstatus addr misc",
-.help   = "inject a MCE on the given CPU",
+.args_type  = 
"broadcast:-b,cpu_index:i,bank:i,status:l,mcg_status:l,addr:l,misc:l",
+.params = "[-b] cpu bank status mcgstatus addr misc",
+.help   = "inject a MCE on the given CPU [and broadcast to other 
CPUs with -b option]",
 .mhandler.cmd = do_inject_mce,
 },
 
diff --git a/monitor.c b/monitor.c
index ba45fb5..acdce49 100644
--- a/monitor.c
+++ b/monitor.c
@@ -2264,12 +2264,15 @@ static void do_inject_mce(Monitor *mon, const QDict 
*qdict)
 uint64_t mcg_status = qdict_get_int(qdict, "mcg_status");
 uint64_t addr = qdict_get_int(qdict, "addr");
 uint64_t misc = qdict_get_int(qdict, "misc");
+int broadcast = qdict_get_try_bool(qdict, "broadcast", 0);
 
-for (cenv = first_cpu; cenv != NULL; cenv = cenv->next_cpu)
+for (cenv = first_cpu; cenv != NULL; cenv = cenv->next_cpu) {
 if (cenv->cpu_index == cpu_index && cenv->mcg_cap) {
-cpu_inject_x86_mce(cenv, bank, status, mcg_status, addr, misc);
+cpu_inject_x86_mce(cenv, bank, status, mcg_status, addr, misc,
+   broadcast);
 break;
 }
+} 
 }
 #endif
 
diff --git a/target-i386/helper.c b/target-i386/helper.c
index 9c07c38..985d532 100644
--- a/target-i386/helper.c
+++ b/target-i386/helper.c
@@ -1071,18 +1071,34 @@ static void qemu_inject_x86_mce(CPUState *cenv, int 
bank, uint64_t status,
 }
 
 void cpu_inject_x86_mce(CPUState *cenv, int bank, uint64_t status,
-uint64_t mcg_status, uint64_t addr, uint64_t misc)
+uint64_t mcg_status, uint64_t addr, uint64_t misc,
+int broadcast)
 {
 unsigned bank_num = cenv->mcg_cap & 0xff;
+CPUState *env;
+int flag = 0;
 
 if (bank >= bank_num || !(status & MCI_STATUS_VAL)) {
 return;
 }
 
 if (kvm_enabled()) {
-kvm_inject_x86_mce(cenv, bank, status, mcg_status, addr, misc, 0);
+if (broadcast) {
+flag |= MCE_BROADCAST;
+}
+
+kvm_inject_x86_mce(cenv, bank, status, mcg_status, addr, misc, flag);
 } else {
 qemu_inject_x86_mce(cenv, bank, status, mcg_status, addr, misc);
+if (broadcast) {
+for (env = first_cpu; env != NULL; env = env->next_cpu) {
+if (cenv == env) {
+continue;
+}
+
+qemu_inject_x86_mce(env, 1, 0xa000, 0, 0, 0);
+}
+}
 }
 }
 #endif /* !CONFIG_USER_ONLY */
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 940600c..d866dce 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -264,11 +264,13 @@ static void kvm_do_inject_x86_mce(void *_data)
 }
 

[PATCH 3/3] Add function for checking mca broadcast of CPU.

2010-12-10 Thread Jin Dongming
Add a function for checking whether the current CPU supports mca broadcast.

Signed-off-by: Jin Dongming jin.dongm...@np.css.fujitsu.com
---
 target-i386/cpu.h|1 +
 target-i386/helper.c |   33 +
 target-i386/kvm.c|6 +-
 3 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index 1f1151d..f16860f 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -762,6 +762,7 @@ int cpu_x86_exec(CPUX86State *s);
 void cpu_x86_close(CPUX86State *s);
 void x86_cpu_list (FILE *f, fprintf_function cpu_fprintf, const char *optarg);
 void x86_cpudef_setup(void);
+int cpu_x86_support_mca_broadcast(CPUState *env);
 
 int cpu_get_pic_interrupt(CPUX86State *s);
 /* MSDOS compatibility mode FPU exception support */
diff --git a/target-i386/helper.c b/target-i386/helper.c
index 985d532..b86a6c5 100644
--- a/target-i386/helper.c
+++ b/target-i386/helper.c
@@ -112,6 +112,32 @@ void cpu_x86_close(CPUX86State *env)
 qemu_free(env);
 }
 
+static void cpu_x86_version(CPUState *env, int *family, int *model)
+{
+int cpuver = env->cpuid_version;
+
+if (family == NULL || model == NULL) {
+return;
+}
+
+*family = (cpuver >> 8) & 0x0f;
+*model = ((cpuver >> 12) & 0xf0) + ((cpuver >> 4) & 0x0f);
+}
+
+/* Broadcast MCA signal for processor version 06H_EH and above */
+int cpu_x86_support_mca_broadcast(CPUState *env)
+{
+int family = 0;
+int model = 0;
+
+cpu_x86_version(env, family, model);
+if ((family == 6 && model >= 14) || family > 6) {
+return 1;
+}
+
+return 0;
+}
+
 /***/
 /* x86 debug */
 
@@ -1082,6 +1108,13 @@ void cpu_inject_x86_mce(CPUState *cenv, int bank, 
uint64_t status,
 return;
 }
 
+if (broadcast) {
+if (!cpu_x86_support_mca_broadcast(cenv)) {
+fprintf(stderr, "Current CPU does not support broadcast\n");
+return;
+}
+}
+
 if (kvm_enabled()) {
 if (broadcast) {
 flag |= MCE_BROADCAST;
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index d866dce..a7261c0 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -1741,13 +1741,9 @@ static void hardware_memory_error(void)
 static void kvm_mce_broadcast_rest(CPUState *env)
 {
 CPUState *cenv;
-int family, model, cpuver = env->cpuid_version;
-
-family = (cpuver >> 8) & 0xf;
-model = ((cpuver >> 12) & 0xf0) + ((cpuver >> 4) & 0xf);
 
 /* Broadcast MCA signal for processor version 06H_EH and above */
-if ((family == 6 && model >= 14) || family > 6) {
+if (cpu_x86_support_mca_broadcast(env)) {
 for (cenv = first_cpu; cenv != NULL; cenv = cenv->next_cpu) {
 if (cenv == env) {
 continue;
-- 
1.7.1.1
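
As a worked example of the family/model decoding above (the CPUID signature
here is only illustrative):

    /* cpuid_version = 0x000106A5 (an Intel Nehalem part):
     *   family = (0x106A5 >> 8) & 0x0f = 6
     *   model  = ((0x106A5 >> 12) & 0xf0) + ((0x106A5 >> 4) & 0x0f)
     *          = 0x10 + 0x0a = 0x1a = 26
     * family == 6 && model >= 14, so cpu_x86_support_mca_broadcast()
     * returns 1 and the broadcast is allowed.
     */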




[PATCH 1/3] kvm, x86: introduce kvm_mce_in_progress

2010-12-10 Thread Jin Dongming
Share the same error handling, and rename this function after the
MCIP (Machine Check In Progress) flag.

Signed-off-by: Hidetoshi Seto seto.hideto...@jp.fujitsu.com
Signed-off-by: Jin Dongming jin.dongm...@np.css.fujitsu.com
---
 target-i386/kvm.c |   15 +--
 1 files changed, 5 insertions(+), 10 deletions(-)

diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index a7261c0..d7aae8b 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -219,7 +219,7 @@ static int kvm_get_msr(CPUState *env, struct kvm_msr_entry 
*msrs, int n)
 }
 
 /* FIXME: kill this and kvm_get_msr, use env->mcg_status instead */
-static int kvm_mce_in_exception(CPUState *env)
+static int kvm_mce_in_progress(CPUState *env)
 {
 struct kvm_msr_entry msr_mcg_status = {
 .index = MSR_MCG_STATUS,
@@ -228,7 +228,8 @@ static int kvm_mce_in_exception(CPUState *env)
 
 r = kvm_get_msr(env, &msr_mcg_status, 1);
 if (r == -1 || r == 0) {
-return -1;
+fprintf(stderr, "Failed to get MCE status\n");
+return 0;
 }
 return !!(msr_mcg_status.data & MCG_STATUS_MCIP);
 }
@@ -248,10 +249,7 @@ static void kvm_do_inject_x86_mce(void *_data)
 /* If there is an MCE exception being processed, ignore this SRAO MCE */
 if ((data->env->mcg_cap & MCG_SER_P) &&
 !(data->mce->status & MCI_STATUS_AR)) {
-r = kvm_mce_in_exception(data->env);
-if (r == -1) {
-fprintf(stderr, "Failed to get MCE status\n");
-} else if (r) {
+if (kvm_mce_in_progress(data->env)) {
 return;
 }
 }
@@ -1782,10 +1780,7 @@ int kvm_on_sigbus_vcpu(CPUState *env, int code, void 
*addr)
  * If there is an MCE exception being processed, ignore
  * this SRAO MCE
  */
-r = kvm_mce_in_exception(env);
-if (r == -1) {
-fprintf(stderr, "Failed to get MCE status\n");
-} else if (r) {
+if (kvm_mce_in_progress(env)) {
 return 0;
 }
 /* Fake an Intel architectural Memory scrubbing UCR */
-- 
1.7.1.1




[PATCH 2/3] kvm, x86: kvm_mce_inj_* subroutins for templated error injections

2010-12-10 Thread Jin Dongming
Refactor the code for maintainability.

Signed-off-by: Hidetoshi Seto seto.hideto...@jp.fujitsu.com
Signed-off-by: Jin Dongming jin.dongm...@np.css.fujitsu.com
---
 target-i386/kvm.c |  111 ++---
 1 files changed, 71 insertions(+), 40 deletions(-)

diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index d7aae8b..1f3f369 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -1752,44 +1752,75 @@ static void kvm_mce_broadcast_rest(CPUState *env)
 }
 }
 }
+
+static void kvm_mce_inj_srar_dataload(CPUState *env, target_phys_addr_t paddr)
+{
+struct kvm_x86_mce mce = {
+.bank = 9,
+.status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN
+  | MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S
+  | MCI_STATUS_AR | 0x134,
+.mcg_status = MCG_STATUS_MCIP | MCG_STATUS_EIPV,
+.addr = paddr,
+.misc = (MCM_ADDR_PHYS << 6) | 0xc,
+};
+int r;
+
+r = kvm_set_mce(env, &mce);
+if (r < 0) {
+fprintf(stderr, "kvm_set_mce: %s\n", strerror(errno));
+abort();
+}
+kvm_mce_broadcast_rest(env);
+}
+
+static void kvm_mce_inj_srao_memscrub(CPUState *env, target_phys_addr_t paddr)
+{
+struct kvm_x86_mce mce = {
+.bank = 9,
+.status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN
+  | MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S
+  | 0xc0,
+.mcg_status = MCG_STATUS_MCIP | MCG_STATUS_RIPV,
+.addr = paddr,
+.misc = (MCM_ADDR_PHYS << 6) | 0xc,
+};
+int r;
+
+r = kvm_set_mce(env, &mce);
+if (r < 0) {
+fprintf(stderr, "kvm_set_mce: %s\n", strerror(errno));
+abort();
+}
+kvm_mce_broadcast_rest(env);
+}
+
+static void kvm_mce_inj_srao_memscrub2(CPUState *env, target_phys_addr_t paddr)
+{
+uint64_t status;
+
+status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN
+| MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S
+| 0xc0;
+kvm_inject_x86_mce(env, 9, status,
+   MCG_STATUS_MCIP | MCG_STATUS_RIPV, paddr,
+   (MCM_ADDR_PHYS << 6) | 0xc, ABORT_ON_ERROR);
+
+kvm_mce_broadcast_rest(env);
+}
+
 #endif
 
 int kvm_on_sigbus_vcpu(CPUState *env, int code, void *addr)
 {
 #if defined(KVM_CAP_MCE)
-struct kvm_x86_mce mce = {
-.bank = 9,
-};
 void *vaddr;
 ram_addr_t ram_addr;
 target_phys_addr_t paddr;
-int r;
 
 if ((env->mcg_cap & MCG_SER_P) && addr
 && (code == BUS_MCEERR_AR
 || code == BUS_MCEERR_AO)) {
-if (code == BUS_MCEERR_AR) {
-/* Fake an Intel architectural Data Load SRAR UCR */
-mce.status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN
-| MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S
-| MCI_STATUS_AR | 0x134;
-mce.misc = (MCM_ADDR_PHYS << 6) | 0xc;
-mce.mcg_status = MCG_STATUS_MCIP | MCG_STATUS_EIPV;
-} else {
-/*
- * If there is an MCE exception being processed, ignore
- * this SRAO MCE
- */
-if (kvm_mce_in_progress(env)) {
-return 0;
-}
-/* Fake an Intel architectural Memory scrubbing UCR */
-mce.status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN
-| MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S
-| 0xc0;
-mce.misc = (MCM_ADDR_PHYS << 6) | 0xc;
-mce.mcg_status = MCG_STATUS_MCIP | MCG_STATUS_RIPV;
-}
 vaddr = (void *)addr;
 if (qemu_ram_addr_from_host(vaddr, &ram_addr) ||
 !kvm_physical_memory_addr_from_ram(env->kvm_state, ram_addr, 
&paddr)) {
@@ -1802,13 +1833,20 @@ int kvm_on_sigbus_vcpu(CPUState *env, int code, void 
*addr)
 hardware_memory_error();
 }
 }
-mce.addr = paddr;
-r = kvm_set_mce(env, &mce);
-if (r < 0) {
-fprintf(stderr, "kvm_set_mce: %s\n", strerror(errno));
-abort();
+
+if (code == BUS_MCEERR_AR) {
+/* Fake an Intel architectural Data Load SRAR UCR */
+kvm_mce_inj_srar_dataload(env, paddr);
+} else {
+/*
+ * If there is an MCE exception being processed, ignore
+ * this SRAO MCE
+ */
+if (!kvm_mce_in_progress(env)) {
+/* Fake an Intel architectural Memory scrubbing UCR */
+kvm_mce_inj_srao_memscrub(env, paddr);
+}
 }
-kvm_mce_broadcast_rest(env);
 } else
 #endif
 {
@@ -1827,7 +1865,6 @@ int kvm_on_sigbus(int code, void *addr)
 {
 #if defined(KVM_CAP_MCE)
 if ((first_cpu->mcg_cap & MCG_SER_P) && addr && code == BUS_MCEERR_AO) {
-uint64_t status;
 void *vaddr;
 ram_addr_t ram_addr;
 target_phys_addr_t paddr;
@@ -1840,13 +1877,7 @@ int 

[PATCH 3/3] kvm, x86: introduce kvm_inject_x86_mce_on

2010-12-10 Thread Jin Dongming
Pass a table instead of multiple args.

Note:

kvm_inject_x86_mce(env, bank, status, mcg_status, addr, misc,
   abort_on_error);

is equal to:

struct kvm_x86_mce mce = {
.bank = bank,
.status = status,
.mcg_status = mcg_status,
.addr = addr,
.misc = misc,
};
kvm_inject_x86_mce_on(env, &mce, abort_on_error);

Signed-off-by: Hidetoshi Seto seto.hideto...@jp.fujitsu.com
Signed-off-by: Jin Dongming jin.dongm...@np.css.fujitsu.com
---
 target-i386/kvm.c |   58 
 1 files changed, 36 insertions(+), 22 deletions(-)

diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 1f3f369..b11337a 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -263,6 +263,23 @@ static void kvm_do_inject_x86_mce(void *_data)
 }
 }
 
+static void kvm_inject_x86_mce_on(CPUState *env, struct kvm_x86_mce *mce,
+  int flag)
+{
+struct kvm_x86_mce_data data = {
+.env = env,
+.mce = mce,
+.abort_on_error = (flag & ABORT_ON_ERROR),
+};
+
+if (!env->mcg_cap) {
+fprintf(stderr, "MCE support is not enabled!\n");
+return;
+}
+
+on_vcpu(env, kvm_do_inject_x86_mce, &data);
+}
+
 static void kvm_mce_broadcast_rest(CPUState *env);
 #endif
 
@@ -278,21 +295,11 @@ void kvm_inject_x86_mce(CPUState *cenv, int bank, 
uint64_t status,
 .addr = addr,
 .misc = misc,
 };
-struct kvm_x86_mce_data data = {
-.env = cenv,
-.mce = &mce,
-};
-
-if (!cenv->mcg_cap) {
-fprintf(stderr, "MCE support is not enabled!\n");
-return;
-}
-
 if (flag & MCE_BROADCAST) {
 kvm_mce_broadcast_rest(cenv);
 }
 
-on_vcpu(cenv, kvm_do_inject_x86_mce, &data);
+kvm_inject_x86_mce_on(cenv, &mce, flag);
 #else
+if (flag & ABORT_ON_ERROR) {
 abort();
@@ -1738,6 +1745,13 @@ static void hardware_memory_error(void)
 #ifdef KVM_CAP_MCE
 static void kvm_mce_broadcast_rest(CPUState *env)
 {
+struct kvm_x86_mce mce = {
+.bank = 1,
+.status = MCI_STATUS_VAL | MCI_STATUS_UC,
+.mcg_status = MCG_STATUS_MCIP | MCG_STATUS_RIPV,
+.addr = 0,
+.misc = 0,
+};
 CPUState *cenv;
 
 /* Broadcast MCA signal for processor version 06H_EH and above */
@@ -1746,9 +1760,7 @@ static void kvm_mce_broadcast_rest(CPUState *env)
 if (cenv == env) {
 continue;
 }
-kvm_inject_x86_mce(cenv, 1, MCI_STATUS_VAL | MCI_STATUS_UC,
-   MCG_STATUS_MCIP | MCG_STATUS_RIPV, 0, 0,
-   ABORT_ON_ERROR);
+kvm_inject_x86_mce_on(cenv, &mce, ABORT_ON_ERROR);
 }
 }
 }
@@ -1797,15 +1809,17 @@ static void kvm_mce_inj_srao_memscrub(CPUState *env, 
target_phys_addr_t paddr)
 
 static void kvm_mce_inj_srao_memscrub2(CPUState *env, target_phys_addr_t paddr)
 {
-uint64_t status;
-
-status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN
-| MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S
-| 0xc0;
-kvm_inject_x86_mce(env, 9, status,
-   MCG_STATUS_MCIP | MCG_STATUS_RIPV, paddr,
-   (MCM_ADDR_PHYS << 6) | 0xc, ABORT_ON_ERROR);
+struct kvm_x86_mce mce = {
+.bank = 9,
+.status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN
+  | MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S
+  | 0xc0,
+.mcg_status = MCG_STATUS_MCIP | MCG_STATUS_RIPV,
+.addr = paddr,
+.misc = (MCM_ADDR_PHYS << 6) | 0xc,
+};
 
+kvm_inject_x86_mce_on(env, &mce, ABORT_ON_ERROR);
 kvm_mce_broadcast_rest(env);
 }
 
-- 
1.7.1.1




Re: [PATCH] intel-iommu: Fix use after release during device attach

2010-12-10 Thread Jan Kiszka
Am 14.11.2010 10:18, Jan Kiszka wrote:
 Am 02.11.2010 08:31, Sheng Yang wrote:
 On Tuesday 02 November 2010 15:05:51 Jan Kiszka wrote:
 From: Jan Kiszka jan.kis...@siemens.com

 Obtain the new pgd pointer before releasing the page containing this
 value.

 Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 ---

 Who is taking care of this? The kvm tree?

  drivers/pci/intel-iommu.c |2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)

 diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
 index 4789f8e..35463dd 100644
 --- a/drivers/pci/intel-iommu.c
 +++ b/drivers/pci/intel-iommu.c
 @@ -3627,9 +3627,9 @@ static int intel_iommu_attach_device(struct
 iommu_domain *domain,

 pte = dmar_domain->pgd;
 if (dma_pte_present(pte)) {
 -   free_pgtable_page(dmar_domain->pgd);
 dmar_domain->pgd = (struct dma_pte *)
 phys_to_virt(dma_pte_addr(pte));
 +   free_pgtable_page(pte);
 }
 dmar_domain->agaw--;
 }

 Reviewed-by: Sheng Yang sh...@linux.intel.com

 CC iommu mailing list and David.
 
 Ping...
 
 I think this fix also qualifies for stable (.35 and .36).
 

Still not merged?

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-10 Thread Srivatsa Vaddagiri
On Thu, Dec 09, 2010 at 11:34:46PM -0500, Rik van Riel wrote:
 On 12/03/2010 09:06 AM, Srivatsa Vaddagiri wrote:
 On Fri, Dec 03, 2010 at 03:03:30PM +0100, Peter Zijlstra wrote:
 No, because they do receive service (they spend some time spinning
 before being interrupted), so the respective vruntimes will increase, at
 some point they'll pass B0 and it'll get scheduled.
 
 Is that sufficient to ensure that B0 receives its fair share (1/3 cpu in this
 case)?
 
 I have a rough idea for a simpler way to ensure
 fairness.
 
 At yield_to time, we could track in the runqueue
 structure that a task received CPU time (and on
 the other runqueue that a task donated CPU time).
 
 The balancer can count time-given-to CPUs as
 busier, and donated-time CPUs as less busy,
 moving tasks away in the unlikely event that
 the same task keeps getting CPU time given to
 it.

I think just capping donation (either on the send side or the receive side)
may be simpler here than messing with the load balancer logic.

 Conversely, it can move other tasks onto CPUs
 that have tasks on them that cannot make progress
 right now and are just donating their CPU time.
 
 Most of the time the time-given and time-received
 should balance out and there should be little to
 no influence on the load balancer. This code would
 just be there to deal with exceptional circumstances,
 to avoid the theoretical worst case people have
 described.

- vatsa


Re: [PATCH V2] qemu,kvm: Enable user space NMI injection for kvm guest

2010-12-10 Thread Jan Kiszka
Am 10.12.2010 08:42, Lai Jiangshan wrote:
 
 Make use of the new KVM_NMI IOCTL to send NMIs into the KVM guest if the
 user space raised them. (example: qemu monitor's nmi command)
 
 Signed-off-by: Lai Jiangshan la...@cn.fujitsu.com
 ---
 diff --git a/configure b/configure
 index 2917874..f6f9362 100755
 --- a/configure
 +++ b/configure
 @@ -1646,6 +1646,9 @@ if test "$kvm" != "no" ; then
  #if !defined(KVM_CAP_DESTROY_MEMORY_REGION_WORKS)
  #error Missing KVM capability KVM_CAP_DESTROY_MEMORY_REGION_WORKS
  #endif
 +#if !defined(KVM_CAP_USER_NMI)
 +#error Missing KVM capability KVM_CAP_USER_NMI
 +#endif
  int main(void) { return 0; }
  EOF
  if test "$kerneldir" != "" ; then

That's what I meant.

We also have a runtime check for KVM_CAP_DESTROY_MEMORY_REGION_WORKS on
kvm init, but IMHO adding the same for KVM_CAP_USER_NMI would be
overkill. So...

 diff --git a/target-i386/kvm.c b/target-i386/kvm.c
 index 7dfc357..755f8c9 100644
 --- a/target-i386/kvm.c
 +++ b/target-i386/kvm.c
 @@ -1417,6 +1417,13 @@ int kvm_arch_get_registers(CPUState *env)
  
  int kvm_arch_pre_run(CPUState *env, struct kvm_run *run)
  {
 +/* Inject NMI */
 +if (env->interrupt_request & CPU_INTERRUPT_NMI) {
 +env->interrupt_request &= ~CPU_INTERRUPT_NMI;
 +DPRINTF("injected NMI\n");
 +kvm_vcpu_ioctl(env, KVM_NMI);
 +}
 +
  /* Try to inject an interrupt if the guest can accept it */
  if (run->ready_for_interrupt_injection &&
  (env->interrupt_request & CPU_INTERRUPT_HARD) &&

Acked-by: Jan Kiszka jan.kis...@siemens.com

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
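
For reference, the user-space trigger for this path is the monitor's nmi
command (which takes a cpu_index argument), e.g.:

    (qemu) nmi 0

This sets CPU_INTERRUPT_NMI on CPU 0; with the patch above, kvm_arch_pre_run()
then delivers the pending NMI to the guest via the KVM_NMI ioctl.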


[RFC PATCH V2 0/5] macvtap TX zero copy between guest and host kernel

2010-12-10 Thread Shirley Ma
This patchset adds support for TX zero-copy between guest and host
kernel through vhost. It significantly reduces CPU utilization on the
local host on which the guest is located (it reduced vhost-thread CPU
usage by 30-50% in a single-stream test). The patchset is based on a
previous submission and on comments from the community regarding
when/how to handle the release of guest kernel buffers. This is the
simplest approach I can think of after comparing it with several other
solutions.

This patchset includes (see the flow sketch after this list):

1. Introduce a new sock zero-copy flag, SOCK_ZEROCOPY;

2. Introduce a new device flag, NETIF_F_ZEROCOPY, for devices that can
support zero-copy;

3. Add a new struct skb_ubuf_info in skb_shared_info for a userspace
buffer-release callback, invoked when the device DMA is done for that
skb;

4. Add a vhost zero-copy callback, invoked when the skb's last refcnt
is gone; add vhost_zerocopy_add_used_and_signal to notify the guest to
release TX skb buffers.

5. Add macvtap zero-copy in the lower device when the packet being sent
is larger than 128 bytes.
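
In rough terms, the intended TX path is (a flow summary of the patches that
follow, not code):

    guest TX buffer
      -> vhost handle_tx(): passes a skb_ubuf_info via msg_control
      -> macvtap zerocopy_sg_from_iovec(): pins the user pages into skb frags
      -> lower device DMA completes and the skb is freed
      -> skb_release_data() invokes ubuf.callback
      -> vhost_zerocopy_add_used_and_signal(): returns the buffers to the guest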

The patchset has passed the netperf/netserver test on Chelsio, and testing
continues on other 10GbE NICs, like Intel ixgbe, Mellanox mlx4...
I will provide guest-to-host and host-to-guest performance data next week.

However, when running a stress test, vhost and virtio_net seem to get out of
sync: the virtio_net interrupt was somehow disabled, and it stopped sending
any packets. This problem has bothered me for a long, long time; I will
continue to look into it.

Please review this.

Thanks
Shirley



[RFC PATCH V2 1/5] Add a new sock flag for zero-copy

2010-12-10 Thread Shirley Ma
Signed-off-by: Shirley Ma x...@us.ibm.com

 include/net/sock.h |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index a6338d0..ff34ea7 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -543,6 +543,7 @@ enum sock_flags {
SOCK_TIMESTAMPING_SYS_HARDWARE, /* %SOF_TIMESTAMPING_SYS_HARDWARE */
SOCK_FASYNC, /* fasync() active */
SOCK_RXQ_OVFL,
+   SOCK_ZEROCOPY,
 };
 
 static inline void sock_copy_flags(struct sock *nsk, struct sock *osk)




[RFC PATCH V2 2/5] Add a new device flag for zero copy

2010-12-10 Thread Shirley Ma
Signed-off-by: Shirley Ma x...@us.ibm.com
---

 include/linux/netdevice.h |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index d8fd2c2..7207665 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -857,6 +857,9 @@ struct net_device {
 #define NETIF_F_NTUPLE (1 << 27) /* N-tuple filters supported */
 #define NETIF_F_RXHASH (1 << 28) /* Receive hashing offload */
 
+/* bit 29 is for device to map userspace buffers -- zerocopy */
+#define NETIF_F_ZEROCOPY   (1 << 29)
+
/* Segmentation offload features */
 #define NETIF_F_GSO_SHIFT  16
 #define NETIF_F_GSO_MASK   0x00ff




[RFC PATCH V2 3/5] Add userspace buffers callback in skb_share_info

2010-12-10 Thread Shirley Ma
Signed-off-by: Shirley Ma x...@us.ibm.com
---

 include/linux/skbuff.h |   13 +
 net/core/skbuff.c  |   13 -
 2 files changed, 25 insertions(+), 1 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index e6ba898..938a7cb 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -183,6 +183,15 @@ enum {
 SKBTX_DRV_NEEDS_SK_REF = 1 << 3,
 };
 
+/* The callback notifies userspace to release buffers when skb DMA is done in
+ * lower device, the desc is used to track userspace buffer index.
+ */
+struct skb_ubuf_info {
+   /* support buffers allocation from userspace */
+   void(*callback)(struct sk_buff *);
+   size_t  desc;
+};
+
 /* This data is invariant across clones and lives at
  * the end of the header data, ie. at skb->end.
  */
@@ -205,6 +214,10 @@ struct skb_shared_info {
/* Intermediate layers must ensure that destructor_arg
 * remains valid until skb destructor */
void *  destructor_arg;
+
+   /* DMA mapping from userspace buffers */
+   struct skb_ubuf_info ubuf;
+
/* must be last field, see pskb_expand_head() */
skb_frag_t  frags[MAX_SKB_FRAGS];
 };
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 104f844..f9468a0 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -210,6 +210,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t 
gfp_mask,
shinfo = skb_shinfo(skb);
memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
 atomic_set(&shinfo->dataref, 1);
+   shinfo->ubuf.callback = NULL;
 
if (fclone) {
struct sk_buff *child = skb + 1;
@@ -329,6 +330,15 @@ static void skb_release_data(struct sk_buff *skb)
 
if (skb_has_frag_list(skb))
skb_drop_fraglist(skb);
+   
+   /*
+* if skb buf is from userspace, we need to notify the caller
+* that the lower device DMA is done;
+*/
+   if (skb_shinfo(skb)->ubuf.callback) {
+   skb_shinfo(skb)->ubuf.callback(skb);
+   skb_shinfo(skb)->ubuf.callback = NULL;
+   }
 
 kfree(skb->head);
}
@@ -492,6 +502,7 @@ bool skb_recycle_check(struct sk_buff *skb, int skb_size)
shinfo = skb_shinfo(skb);
memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
 atomic_set(&shinfo->dataref, 1);
+   shinfo->ubuf.callback = NULL;
 
 memset(skb, 0, offsetof(struct sk_buff, tail));
 skb->data = skb->head + NET_SKB_PAD;





[RFC PATCH V2 4/5] Add vhost zero copy callback to release guest kernel buffers

2010-12-10 Thread Shirley Ma
This patch uses msg_control to pass the vhost callback to macvtap (any better
idea for passing this in a simple way?). vhost doesn't notify the guest to
release buffers until the underlying lower device DMA is done for those
buffers. The vq cannot be reset while there is any outstanding reference.

Signed-off-by: Shirley Ma x...@us.ibm.com
---

 drivers/vhost/net.c   |   13 ++-
 drivers/vhost/vhost.c |   56 +
 drivers/vhost/vhost.h |7 ++
 3 files changed, 75 insertions(+), 1 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index f442668..6779a1c 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -128,6 +128,7 @@ static void handle_tx(struct vhost_net *net)
int err, wmem;
size_t hdr_size;
struct socket *sock;
+   struct skb_ubuf_info pend;
 
/* TODO: check that we are running from vhost_worker?
 * Not sure it's worth it, it's straight-forward enough. */
@@ -189,6 +190,13 @@ static void handle_tx(struct vhost_net *net)
   iov_length(vq->hdr, s), hdr_size);
break;
}
+   /* use msg_control to pass vhost zerocopy ubuf info here */
+   if (sock_flag(sock->sk, SOCK_ZEROCOPY)) {
+   pend.callback = vq->callback;
+   pend.desc = head;
+   msg.msg_control = &pend;
+   msg.msg_controllen = sizeof(pend);
+   }
/* TODO: Check specific error and bomb out unless ENOBUFS? */
 err = sock->ops->sendmsg(NULL, sock, &msg, len);
 if (unlikely(err < 0)) {
@@ -199,7 +207,10 @@ static void handle_tx(struct vhost_net *net)
if (err != len)
 pr_debug("Truncated TX packet: "
  " len %d != %zd\n", err, len);
-   vhost_add_used_and_signal(&net->dev, vq, head, 0);
+   if (sock_flag(sock->sk, SOCK_ZEROCOPY))
+   vhost_zerocopy_add_used_and_signal(vq);
+   else
+   vhost_add_used_and_signal(&net->dev, vq, head, 0);
 total_len += len;
 if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
 vhost_poll_queue(&vq->poll);
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 94701ff..b0074bc 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -170,6 +170,8 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 vq->call_ctx = NULL;
 vq->call = NULL;
 vq->log_ctx = NULL;
+   atomic_set(&vq->refcnt, 0);
+   vq->upend_cnt = 0;
 }
 
 static int vhost_worker(void *data)
@@ -273,6 +275,9 @@ long vhost_dev_init(struct vhost_dev *dev,
 dev->vqs[i].heads = NULL;
 dev->vqs[i].dev = dev;
 mutex_init(&dev->vqs[i].mutex);
+   spin_lock_init(&dev->vqs[i].zerocopy_lock);
+   dev->vqs[i].upend_cnt = 0;
+   atomic_set(&dev->vqs[i].refcnt, 0);
 vhost_vq_reset(dev, dev->vqs + i);
 if (dev->vqs[i].handle_kick)
 vhost_poll_init(&dev->vqs[i].poll,
@@ -370,10 +375,37 @@ long vhost_dev_reset_owner(struct vhost_dev *dev)
return 0;
 }
 
+void vhost_zerocopy_add_used_and_signal(struct vhost_virtqueue *vq)
+{
+   struct vring_used_elem heads[64];
+   int count, left, mod;
+   unsigned long flags;
+
+   count = (vq->num > 64) ? 64 : vq->num;
+   mod = vq->ubuf_cnt / count;
+   /* notify guest when number of descriptors greater than count */
+   if (mod == 0)
+   return;
+   /* 
+* avoid holding spin lock by notifying guest x64 buffers first
+*/
+   vhost_add_used_and_signal_n(vq->dev, vq, vq->heads, count * mod);
+   /* reset the counter when notifying guest the rest*/
+   left = vq->ubuf_cnt - mod * count;
+   if (left > 0) {
+   spin_lock_irqsave(&vq->zerocopy_lock, flags);
+   memcpy(heads, &vq->heads[mod * count], left * sizeof 
*vq->heads);
+   vq->ubuf_cnt = 0;
+   spin_unlock_irqrestore(&vq->zerocopy_lock, flags);
+   vhost_add_used_and_signal_n(vq->dev, vq, heads, left);
+   }
+}
+
 /* Caller should have device mutex */
 void vhost_dev_cleanup(struct vhost_dev *dev)
 {
int i;
+   unsigned long begin = jiffies;
 for (i = 0; i < dev->nvqs; ++i) {
 if (dev->vqs[i].kick && dev->vqs[i].handle_kick) {
 vhost_poll_stop(&dev->vqs[i].poll);
@@ -389,6 +421,12 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
 eventfd_ctx_put(dev->vqs[i].call_ctx);
 if (dev->vqs[i].call)
 fput(dev->vqs[i].call);
+   /* wait for all lower device DMAs done, then notify guest */
+   if (atomic_read(&dev->vqs[i].refcnt)) {
+   if 

[RFC PATCH V2 5/5] Add TX zero copy in macvtap

2010-12-10 Thread Shirley Ma
macvtap enables zero-copy only when the buffer size is greater than
GOODCOPY_LEN (128).

Signed-off-by: Shirley Ma x...@us.ibm.com
---

 drivers/net/macvtap.c |  128 -
 1 files changed, 116 insertions(+), 12 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 4256727..2ec9692 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -60,6 +60,7 @@ static struct proto macvtap_proto = {
  */
 static dev_t macvtap_major;
 #define MACVTAP_NUM_DEVS 65536
+#define GOODCOPY_LEN  (L1_CACHE_BYTES < 128 ? 128 : L1_CACHE_BYTES)
 static struct class *macvtap_class;
 static struct cdev macvtap_cdev;
 
@@ -338,6 +339,7 @@ static int macvtap_open(struct inode *inode, struct file 
*file)
 {
 struct net *net = current->nsproxy->net_ns;
struct net_device *dev = dev_get_by_index(net, iminor(inode));
+   struct macvlan_dev *vlan = netdev_priv(dev);
struct macvtap_queue *q;
int err;
 
@@ -367,6 +369,16 @@ static int macvtap_open(struct inode *inode, struct file 
*file)
 q->flags = IFF_VNET_HDR | IFF_NO_PI | IFF_TAP;
 q->vnet_hdr_sz = sizeof(struct virtio_net_hdr);
 
+   /*
+* so far only VM uses macvtap, enable zero copy between guest
+* kernel and host kernel when lower device supports high memory
+* DMA
+*/
+   if (vlan) {
+   if (vlan->lowerdev->features & NETIF_F_ZEROCOPY)
+   sock_set_flag(&q->sk, SOCK_ZEROCOPY);
+   }
+
err = macvtap_set_queue(dev, file, q);
if (err)
 sock_put(&q->sk);
@@ -431,6 +443,80 @@ static inline struct sk_buff *macvtap_alloc_skb(struct 
sock *sk, size_t prepad,
return skb;
 }
 
+/* set skb frags from iovec, this can move to core network code for reuse */
+static int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec 
*from,
+ int offset, size_t count)
+{
+   int len = iov_length(from, count) - offset;
+   int copy = skb_headlen(skb);
+   int size, offset1 = 0;
+   int i = 0;
+   skb_frag_t *f;
+
+   /* Skip over from offset */
+   while (offset >= from->iov_len) {
+   offset -= from->iov_len;
+   ++from;
+   --count;
+   }
+
+   /* copy up to skb headlen */
+   while (copy > 0) {
+   size = min_t(unsigned int, copy, from->iov_len - offset);
+   if (copy_from_user(skb->data + offset1, from->iov_base + offset,
+  size))
+   return -EFAULT;
+   if (copy > size) {
+   ++from;
+   --count;
+   }
+   copy -= size;
+   offset1 += size;
+   offset = 0;
+   }
+
+   if (len == offset1)
+   return 0;
+
+   while (count--) {
+   struct page *page[MAX_SKB_FRAGS];
+   int num_pages;
+   unsigned long base;
+
+   len = from->iov_len - offset1;
+   if (!len) {
+   offset1 = 0;
+   ++from;
+   continue;
+   }
+   base = (unsigned long)from->iov_base + offset1;
+   size = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
+   num_pages = get_user_pages_fast(base, size, 0, &page[i]);
+   if ((num_pages != size) ||
+   (num_pages > MAX_SKB_FRAGS - skb_shinfo(skb)->nr_frags))
+   /* put_page is in skb free */
+   return -EFAULT;
+   while (len) {
+   f = &skb_shinfo(skb)->frags[i];
+   f->page = page[i];
+   f->page_offset = base & ~PAGE_MASK;
+   f->size = min_t(int, len, PAGE_SIZE - f->page_offset);
+   skb->data_len += f->size;
+   skb->len += f->size;
+   skb->truesize += f->size;
+   skb_shinfo(skb)->nr_frags++;
+   /* increase sk_wmem_alloc */
+   atomic_add(f->size, &skb->sk->sk_wmem_alloc);
+   base += f->size;
+   len -= f->size;
+   i++;
+   }
+   offset1 = 0;
+   ++from;
+   }
+   return 0;   
+}
+
 /*
  * macvtap_skb_from_vnet_hdr and macvtap_skb_to_vnet_hdr should
  * be shared with the tun/tap driver.
@@ -514,17 +600,19 @@ static int macvtap_skb_to_vnet_hdr(const struct sk_buff 
*skb,
 
 
 /* Get packet from user space buffer */
-static ssize_t macvtap_get_user(struct macvtap_queue *q,
-   const struct iovec *iv, size_t count,
-   int noblock)
+static ssize_t macvtap_get_user(struct macvtap_queue *q, struct msghdr *m,
+   const struct iovec *iv, unsigned long 

Re: [RFC PATCH V2 0/5] macvtap TX zero copy between guest and host kernel

2010-12-10 Thread Shirley Ma
This patchset has been built and tested against the most recent Linus git
tree, but I haven't run checkpatch yet. I would like to know first whether
this approach is acceptable.

Thanks
Shirley



Re: [RFC PATCH V2 5/5] Add TX zero copy in macvtap

2010-12-10 Thread Eric Dumazet
On Friday, December 10, 2010, at 02:13 -0800, Shirley Ma wrote:

 + while (len) {
 + f = &skb_shinfo(skb)->frags[i];
 + f->page = page[i];
 + f->page_offset = base & ~PAGE_MASK;
 + f->size = min_t(int, len, PAGE_SIZE - f->page_offset);
 + skb->data_len += f->size;
 + skb->len += f->size;
 + skb->truesize += f->size;
 + skb_shinfo(skb)->nr_frags++;
 + /* increase sk_wmem_alloc */
 + atomic_add(f->size, &skb->sk->sk_wmem_alloc);
 + base += f->size;
 + len -= f->size;
 + i++;
 + }

You could make one atomic_add() outside of the loop, and factorize many
things...

atomic_add(len, &skb->sk->sk_wmem_alloc);
skb->data_len += len;
skb->len += len;
skb->truesize += len;
while (len) {
...
}
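
Spelled out, the factored version might look like this (a sketch based on the
suggestion above, reusing the patch's own variables; untested):

    atomic_add(len, &skb->sk->sk_wmem_alloc);
    skb->data_len += len;
    skb->len += len;
    skb->truesize += len;
    while (len) {
            f = &skb_shinfo(skb)->frags[i];
            f->page = page[i];
            f->page_offset = base & ~PAGE_MASK;
            f->size = min_t(int, len, PAGE_SIZE - f->page_offset);
            skb_shinfo(skb)->nr_frags++;
            base += f->size;
            len -= f->size;
            i++;
    }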




Re: [PATCH v2 2/2] qemu,qmp: convert do_inject_nmi() to QObject, QError

2010-12-10 Thread Markus Armbruster
Lai Jiangshan la...@cn.fujitsu.com writes:

 Convert do_inject_nmi() to QObject/QError; we need to use it (via libvirt).

 Changes from v1:
 - Add documentation.
 - Add error handling when the cpu index is invalid.

 Signed-off-by:  Lai Jiangshan la...@cn.fujitsu.com
 ---
 diff --git a/hmp-commands.hx b/hmp-commands.hx
 index 23024ba..f86d9fe 100644
 --- a/hmp-commands.hx
 +++ b/hmp-commands.hx
 @@ -724,7 +724,8 @@ ETEXI
  .args_type  = "cpu_index:i",
  .params = "cpu",
  .help   = "inject an NMI on the given CPU",
 -.mhandler.cmd = do_inject_nmi,
 +.user_print = monitor_user_noop,
 +.mhandler.cmd_new = do_inject_nmi,
  },
  #endif
  STEXI
 diff --git a/monitor.c b/monitor.c
 index ec31eac..f375eb3 100644
 --- a/monitor.c
 +++ b/monitor.c
 @@ -2119,7 +2119,7 @@ static void do_wav_capture(Monitor *mon, const QDict 
 *qdict)
  #endif
  
  #if defined(TARGET_I386)
 -static void do_inject_nmi(Monitor *mon, const QDict *qdict)
 +static int do_inject_nmi(Monitor *mon, const QDict *qdict, QObject 
 **ret_data)
  {
  CPUState *env;
  int cpu_index = qdict_get_int(qdict, "cpu_index");
 @@ -2127,8 +2127,11 @@ static void do_inject_nmi(Monitor *mon, const QDict 
 *qdict)
  for (env = first_cpu; env != NULL; env = env->next_cpu)
  if (env->cpu_index == cpu_index) {
  cpu_interrupt(env, CPU_INTERRUPT_NMI);
 -break;
 +return 0;
  }
 +
 +qerror_report(QERR_INVALID_CPU_INDEX, cpu_index);
 +return -1;

do_cpu_set() reports an invalid index like this:

qerror_report(QERR_INVALID_PARAMETER_VALUE, "index",
  "a CPU number");

What about sticking to that?

[...]


Re: [PATCH v2 1/2] QError: new QERR_INVALID_CPU_INDEX

2010-12-10 Thread Luiz Capitulino
On Fri, 10 Dec 2010 14:36:01 +0800
Lai Jiangshan la...@cn.fujitsu.com wrote:

 
 Signed-off-by:  Lai Jiangshan la...@cn.fujitsu.com

As Markus said, we report this as an invalid parameter in do_cpu(); we can do
the same for inject-nmi.

 ---
 diff --git a/qerror.c b/qerror.c
 index ac2cdaf..f59fb58 100644
 --- a/qerror.c
 +++ b/qerror.c
 @@ -117,6 +117,10 @@ static const QErrorStringTable qerror_table[] = {
  .desc  = "Invalid block format '%(name)'",
  },
  {
 +.error_fmt = QERR_INVALID_CPU_INDEX,
 +.desc  = "Invalid CPU index '%(cpu_index)'",
 +},
 +{
  .error_fmt = QERR_INVALID_PARAMETER,
  .desc  = "Invalid parameter '%(name)'",
  },
 diff --git a/qerror.h b/qerror.h
 index 943a24b..9117dda 100644
 --- a/qerror.h
 +++ b/qerror.h
 @@ -102,6 +102,9 @@ QError *qobject_to_qerror(const QObject *obj);
  #define QERR_INVALID_BLOCK_FORMAT \
  "{ 'class': 'InvalidBlockFormat', 'data': { 'name': %s } }"
  
 +#define QERR_INVALID_CPU_INDEX \
 +"{ 'class': 'InvalidCPUIndex', 'data': { 'cpu_index': %d } }"
 +
  #define QERR_INVALID_PARAMETER \
  "{ 'class': 'InvalidParameter', 'data': { 'name': %s } }"
  
 



Re: [PATCH v2 2/2] qemu,qmp: convert do_inject_nmi() to QObject, QError

2010-12-10 Thread Luiz Capitulino
On Fri, 10 Dec 2010 14:36:08 +0800
Lai Jiangshan la...@cn.fujitsu.com wrote:

 +SQMP
 +inject_nmi
 +----------
 +
 +Inject an NMI on the given CPU (x86 only).
 +
 +Arguments:
 +
 +- "cpu_index": the index of the CPU to inject the NMI into (json-int)
 +
 +Example:
 +
 +-> { "execute": "inject_nmi", "arguments": { "cpu_index": 0 } }
 +<- { "return": {} }
 +
 +EQMP
 +

Avi, Anthony, can you please review this? Do we expect some kind of ack from
the guest? Do we expect it to respond in some way?

Also note that the current series defines only one error condition: invalid
cpu index. Can this fail in other ways?


Re: [PATCH 09/28] nVMX: Add VMCS fields to the vmcs12

2010-12-10 Thread Nadav Har'El
On Thu, Dec 09, 2010, Avi Kivity wrote about "Re: [PATCH 09/28] nVMX: Add VMCS 
fields to the vmcs12":
 And please address all my earlier comments; there's no point in me 
 reviewing the same thing again and again.

Hi,

Agreed. Like I said previously, I am keeping a detailed list of things you
already asked me to change, and once in a while addressing one issue and
replying about it. None of your previous comments have been forgotten, or
ignored.

Nadav.

-- 
Nadav Har'El|Friday, Dec 10 2010, 3 Tevet 5771
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |Long periods of drought are always
http://nadav.harel.org.il   |followed by rain.


[PATCH 5/5] kvm/svm: copy instruction bytes from VMCB

2010-12-10 Thread Andre Przywara
In case of a nested page fault or an intercepted #PF, newer SVM
implementations provide a copy of the faulting instruction bytes
in the VMCB.
Use these bytes to feed the instruction emulator and avoid the costly
guest instruction fetch in this case.

Signed-off-by: Andre Przywara andre.przyw...@amd.com
---
 arch/x86/include/asm/kvm_host.h |3 +++
 arch/x86/include/asm/svm.h  |4 +++-
 arch/x86/kvm/emulate.c  |1 +
 arch/x86/kvm/svm.c  |   20 
 arch/x86/kvm/vmx.c  |7 +++
 5 files changed, 34 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2b89195..baaf063 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -586,6 +586,9 @@ struct kvm_x86_ops {
void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset);
 
void (*get_exit_info)(struct kvm_vcpu *vcpu, u64 *info1, u64 *info2);
+
+   int (*prefetch_instruction)(struct kvm_vcpu *vcpu);
+
const struct trace_print_flags *exit_reasons_str;
 };
 
diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 589fc25..6d64b1d 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -81,7 +81,9 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
u64 lbr_ctl;
u64 reserved_5;
u64 next_rip;
-   u8 reserved_6[816];
+   u8 insn_len;
+   u8 insn_bytes[15];
+   u8 reserved_6[800];
 };
 
 
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 6366735..70385ee 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -2623,6 +2623,7 @@ x86_decode_insn(struct x86_emulate_ctxt *ctxt)
 c->eip = ctxt->eip;
 c->fetch.start = c->fetch.end = c->eip;
 ctxt->cs_base = seg_base(ctxt, ops, VCPU_SREG_CS);
+   kvm_x86_ops->prefetch_instruction(ctxt->vcpu);
 
switch (mode) {
case X86EMUL_MODE_REAL:
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 73f1a6d..685b264 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -464,6 +464,24 @@ static void skip_emulated_instruction(struct kvm_vcpu 
*vcpu)
svm_set_interrupt_shadow(vcpu, 0);
 }
 
+static int svm_prefetch_instruction(struct kvm_vcpu *vcpu)
+{
+   struct vcpu_svm *svm = to_svm(vcpu);
+   uint8_t len;
+   struct fetch_cache *fetch;
+
+   len = svm->vmcb->control.insn_len & 0x0F;
+   if (len == 0)
+   return 1;
+
+   fetch = &svm->vcpu.arch.emulate_ctxt.decode.fetch;
+   fetch->start = kvm_rip_read(&svm->vcpu);
+   fetch->end = fetch->start + len;
+   memcpy(fetch->data, svm->vmcb->control.insn_bytes, len);
+
+   return 0;
+}
+
 static void svm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr,
bool has_error_code, u32 error_code,
bool reinject)
@@ -3848,6 +3866,8 @@ static struct kvm_x86_ops svm_x86_ops = {
.adjust_tsc_offset = svm_adjust_tsc_offset,
 
.set_tdp_cr3 = set_tdp_cr3,
+
+   .prefetch_instruction = svm_prefetch_instruction,
 };
 
 static int __init svm_init(void)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index e5ef924..7572751 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1009,6 +1009,11 @@ static void skip_emulated_instruction(struct kvm_vcpu 
*vcpu)
vmx_set_interrupt_shadow(vcpu, 0);
 }
 
+static int vmx_prefetch_instruction(struct kvm_vcpu *vcpu)
+{
+   return 1;
+}
+
 static void vmx_queue_exception(struct kvm_vcpu *vcpu, unsigned nr,
bool has_error_code, u32 error_code,
bool reinject)
@@ -4362,6 +4367,8 @@ static struct kvm_x86_ops vmx_x86_ops = {
.adjust_tsc_offset = vmx_adjust_tsc_offset,
 
.set_tdp_cr3 = vmx_set_cr3,
+
+   .prefetch_instruction = vmx_prefetch_instruction,
 };
 
 static int __init vmx_init(void)
-- 
1.6.4
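
Putting the pieces together, the intended flow on SVM is roughly (a summary
sketch, not code from the patch):

    /* nested #PF or intercepted #PF on DecodeAssist-capable hardware:
     *   vmcb->control.insn_len / insn_bytes[] carry the faulting instruction
     *   -> svm_prefetch_instruction() preloads the emulator's fetch cache
     *   -> x86_decode_insn() finds fetch.start..fetch.end already filled
     *      and skips the guest-memory instruction fetch
     * On VMX, vmx_prefetch_instruction() simply returns 1 (no bytes
     * available), so the emulator falls back to the normal guest fetch.
     */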




[PATCH -v2 0/5] kvm/svm: implement new DecodeAssist features

2010-12-10 Thread Andre Przywara
Hi,

version 2 of the DecodeAssist patches.
Changes over version 1:
- goes on top of the CR8 handling fix I sent out earlier this week
  (required for proper handling of CR8 exceptions)
- handles exception cases properly (for mov cr and mov dr)
- uses X86_FEATURE_ names instead of SVM_FEATURE names (for boot_cpu_has)
  (thanks to Joerg for spotting this)
- use static_cpu_has where appropriate
- some minor code cleanups (for instance cr register calculation)
- move prefetch callback into x86_decode_insn and out of every fetch
  I refrained from ditching the callback altogether, as I don't like extending
  every emulate_instruction call with NULL, 0. But if this is
  desperately needed, I can still change it.
- rename vendor specific prefetch function names


Upcoming AMD CPUs will have an SVM enhancement called DecodeAssist,
which provides more information when intercepting certain events.
This information allows us to skip the instruction fetching and
decoding and to handle the intercept immediately.
This patch set implements all the features which are documented
in the recent AMD manual (APM vol. 2). For details see the patches.

Please review and apply.

Regards,
Andre.




[PATCH 1/5] kvm/svm: add new SVM feature bit names

2010-12-10 Thread Andre Przywara
The recent APM Vol. 2 and the recent AMD CPUID specification describe
new CPUID feature bits for SVM. Name them here for later use.

Signed-off-by: Andre Przywara andre.przyw...@amd.com
---
 arch/x86/kvm/svm.c |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index ed5950c..298ff79 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -51,6 +51,10 @@ MODULE_LICENSE("GPL");
 #define SVM_FEATURE_LBRV   (1 <<  1)
 #define SVM_FEATURE_SVML   (1 <<  2)
 #define SVM_FEATURE_NRIP   (1 <<  3)
+#define SVM_FEATURE_TSC_RATE   (1 <<  4)
+#define SVM_FEATURE_VMCB_CLEAN (1 <<  5)
+#define SVM_FEATURE_FLUSH_ASID (1 <<  6)
+#define SVM_FEATURE_DECODE_ASSIST  (1 <<  7)
 #define SVM_FEATURE_PAUSE_FILTER   (1 << 10)
 
 #define NESTED_EXIT_HOST   0   /* Exit handled on host level */
-- 
1.6.4




[PATCH 3/5] kvm/svm: enhance mov DR intercept handler

2010-12-10 Thread Andre Przywara
Newer SVM implementations provide the GPR number in the VMCB, so
that the emulation path is no longer necessary to handle debug
register access intercepts. Implement the handling in svm.c and
use it when the info is provided.

Signed-off-by: Andre Przywara andre.przyw...@amd.com
---
 arch/x86/kvm/svm.c |   61 ++-
 1 files changed, 45 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index ee5f100..ecb4acf 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -2680,6 +2680,35 @@ static int cr0_write_interception(struct vcpu_svm *svm)
return r;
 }
 
+static int dr_interception(struct vcpu_svm *svm)
+{
+   int reg, dr;
+   unsigned long val;
+   int err;
+
+   if (!boot_cpu_has(X86_FEATURE_DECODEASSISTS))
+   return emulate_on_interception(svm);
+
+   reg = svm->vmcb->control.exit_info_1 & SVM_EXITINFO_REG_MASK;
+   dr = svm->vmcb->control.exit_code - SVM_EXIT_READ_DR0;
+
+   if (dr >= 16) { /* mov to DRn */
+   val = kvm_register_read(&svm->vcpu, reg);
+   err = kvm_set_dr(&svm->vcpu, dr - 16, val);
+   } else {
+   err = kvm_get_dr(&svm->vcpu, dr, &val);
+   if (!err)
+   kvm_register_write(&svm->vcpu, reg, val);
+   }
+
+   if (!err)
+   skip_emulated_instruction(&svm->vcpu);
+   else
+   kvm_inject_gp(&svm->vcpu, 0);
+
+   return 1;
+}
+
 static int cr8_write_interception(struct vcpu_svm *svm)
 {
 struct kvm_run *kvm_run = svm->vcpu.run;
@@ -2943,22 +2972,22 @@ static int (*svm_exit_handlers[])(struct vcpu_svm *svm) 
= {
[SVM_EXIT_WRITE_CR3]= cr_interception,
[SVM_EXIT_WRITE_CR4]= cr_interception,
[SVM_EXIT_WRITE_CR8]= cr8_write_interception,
-   [SVM_EXIT_READ_DR0] = emulate_on_interception,
-   [SVM_EXIT_READ_DR1] = emulate_on_interception,
-   [SVM_EXIT_READ_DR2] = emulate_on_interception,
-   [SVM_EXIT_READ_DR3] = emulate_on_interception,
-   [SVM_EXIT_READ_DR4] = emulate_on_interception,
-   [SVM_EXIT_READ_DR5] = emulate_on_interception,
-   [SVM_EXIT_READ_DR6] = emulate_on_interception,
-   [SVM_EXIT_READ_DR7] = emulate_on_interception,
-   [SVM_EXIT_WRITE_DR0]= emulate_on_interception,
-   [SVM_EXIT_WRITE_DR1]= emulate_on_interception,
-   [SVM_EXIT_WRITE_DR2]= emulate_on_interception,
-   [SVM_EXIT_WRITE_DR3]= emulate_on_interception,
-   [SVM_EXIT_WRITE_DR4]= emulate_on_interception,
-   [SVM_EXIT_WRITE_DR5]= emulate_on_interception,
-   [SVM_EXIT_WRITE_DR6]= emulate_on_interception,
-   [SVM_EXIT_WRITE_DR7]= emulate_on_interception,
+   [SVM_EXIT_READ_DR0] = dr_interception,
+   [SVM_EXIT_READ_DR1] = dr_interception,
+   [SVM_EXIT_READ_DR2] = dr_interception,
+   [SVM_EXIT_READ_DR3] = dr_interception,
+   [SVM_EXIT_READ_DR4] = dr_interception,
+   [SVM_EXIT_READ_DR5] = dr_interception,
+   [SVM_EXIT_READ_DR6] = dr_interception,
+   [SVM_EXIT_READ_DR7] = dr_interception,
+   [SVM_EXIT_WRITE_DR0]= dr_interception,
+   [SVM_EXIT_WRITE_DR1]= dr_interception,
+   [SVM_EXIT_WRITE_DR2]= dr_interception,
+   [SVM_EXIT_WRITE_DR3]= dr_interception,
+   [SVM_EXIT_WRITE_DR4]= dr_interception,
+   [SVM_EXIT_WRITE_DR5]= dr_interception,
+   [SVM_EXIT_WRITE_DR6]= dr_interception,
+   [SVM_EXIT_WRITE_DR7]= dr_interception,
[SVM_EXIT_EXCP_BASE + DB_VECTOR]= db_interception,
[SVM_EXIT_EXCP_BASE + BP_VECTOR]= bp_interception,
[SVM_EXIT_EXCP_BASE + UD_VECTOR]= ud_interception,
-- 
1.6.4




[PATCH 4/5] kvm/svm: implement enhanced INVLPG intercept

2010-12-10 Thread Andre Przywara
When the DecodeAssist feature is available, the linear address
is provided in the VMCB on INVLPG intercepts. Use it directly to
avoid any decoding and emulation.
This is only useful for shadow paging, though.

Signed-off-by: Andre Przywara andre.przyw...@amd.com
---
 arch/x86/kvm/svm.c |7 ++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index ecb4acf..73f1a6d 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -2586,7 +2586,12 @@ static int iret_interception(struct vcpu_svm *svm)
 
 static int invlpg_interception(struct vcpu_svm *svm)
 {
-	return emulate_instruction(&svm->vcpu, 0, 0, 0) == EMULATE_DONE;
+	if (!static_cpu_has(X86_FEATURE_DECODEASSISTS))
+		return emulate_instruction(&svm->vcpu, 0, 0, 0) == EMULATE_DONE;
+
+	kvm_mmu_invlpg(&svm->vcpu, svm->vmcb->control.exit_info_1);
+	skip_emulated_instruction(&svm->vcpu);
+	return 1;
 }
 
 static int emulate_on_interception(struct vcpu_svm *svm)
-- 
1.6.4




[PATCH 2/5] kvm/svm: enhance MOV CR intercept handler

2010-12-10 Thread Andre Przywara
Newer SVM implementations provide the GPR number in the VMCB, so
that the emulation path is no longer necessary for handling CR
register access intercepts. Implement the handling in svm.c and
use it when the information is provided.

Signed-off-by: Andre Przywara andre.przyw...@amd.com
---
 arch/x86/include/asm/svm.h |2 +
 arch/x86/kvm/svm.c |   91 ++-
 2 files changed, 82 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 11dbca7..589fc25 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -256,6 +256,8 @@ struct __attribute__ ((__packed__)) vmcb {
 #define SVM_EXITINFOSHIFT_TS_REASON_JMP 38
 #define SVM_EXITINFOSHIFT_TS_HAS_ERROR_CODE 44
 
+#define SVM_EXITINFO_REG_MASK 0x0F
+
 #defineSVM_EXIT_READ_CR0   0x000
 #defineSVM_EXIT_READ_CR3   0x003
 #defineSVM_EXIT_READ_CR4   0x004
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 298ff79..ee5f100 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -2594,12 +2594,81 @@ static int emulate_on_interception(struct vcpu_svm *svm)
	return emulate_instruction(&svm->vcpu, 0, 0, 0) == EMULATE_DONE;
 }
 
+static int cr_interception(struct vcpu_svm *svm)
+{
+	int reg, cr;
+	unsigned long val;
+	int err;
+
+	if (!static_cpu_has(X86_FEATURE_DECODEASSISTS))
+		return emulate_on_interception(svm);
+
+	/* bit 63 is the valid bit, as not all instructions (like lmsw)
+	   provide the information */
+	if (unlikely((svm->vmcb->control.exit_info_1 & (1ULL << 63)) == 0))
+		return emulate_on_interception(svm);
+
+	reg = svm->vmcb->control.exit_info_1 & SVM_EXITINFO_REG_MASK;
+	cr = svm->vmcb->control.exit_code - SVM_EXIT_READ_CR0;
+
+	err = 0;
+	if (cr >= 16) { /* mov to cr */
+		cr -= 16;
+		val = kvm_register_read(&svm->vcpu, reg);
+		switch (cr) {
+		case 0:
+			err = kvm_set_cr0(&svm->vcpu, val);
+			break;
+		case 3:
+			err = kvm_set_cr3(&svm->vcpu, val);
+			break;
+		case 4:
+			err = kvm_set_cr4(&svm->vcpu, val);
+			break;
+		case 8:
+			err = kvm_set_cr8(&svm->vcpu, val);
+			break;
+		default:
+			WARN(1, "unhandled write to CR%d", cr);
+			return EMULATE_FAIL;
+		}
+	} else { /* mov from cr */
+		switch (cr) {
+		case 0:
+			val = kvm_read_cr0(&svm->vcpu);
+			break;
+		case 2:
+			val = svm->vcpu.arch.cr2;
+			break;
+		case 3:
+			val = svm->vcpu.arch.cr3;
+			break;
+		case 4:
+			val = kvm_read_cr4(&svm->vcpu);
+			break;
+		case 8:
+			val = kvm_get_cr8(&svm->vcpu);
+			break;
+		default:
+			WARN(1, "unhandled read from CR%d", cr);
+			return EMULATE_FAIL;
+		}
+		kvm_register_write(&svm->vcpu, reg, val);
+	}
+	if (!err)
+		skip_emulated_instruction(&svm->vcpu);
+	else
+		kvm_inject_gp(&svm->vcpu, 0);
+
+	return 1;
+}
+
 static int cr0_write_interception(struct vcpu_svm *svm)
 {
	struct kvm_vcpu *vcpu = &svm->vcpu;
	int r;
 
-	r = emulate_instruction(&svm->vcpu, 0, 0, 0);
+	r = cr_interception(svm);
 
	if (svm->nested.vmexit_rip) {
		kvm_register_write(vcpu, VCPU_REGS_RIP, svm->nested.vmexit_rip);
@@ -2608,7 +2677,7 @@ static int cr0_write_interception(struct vcpu_svm *svm)
		svm->nested.vmexit_rip = 0;
	}
 
-   return r == EMULATE_DONE;
+   return r;
 }
 
 static int cr8_write_interception(struct vcpu_svm *svm)
@@ -2618,13 +2687,13 @@ static int cr8_write_interception(struct vcpu_svm *svm)
 
	u8 cr8_prev = kvm_get_cr8(&svm->vcpu);
	/* instruction emulation calls kvm_set_cr8() */
-	r = emulate_instruction(&svm->vcpu, 0, 0, 0);
+	r = cr_interception(svm);
	if (irqchip_in_kernel(svm->vcpu.kvm)) {
		clr_cr_intercept(svm, INTERCEPT_CR8_WRITE);
-		return r == EMULATE_DONE;
+		return r;
	}
	if (cr8_prev <= kvm_get_cr8(&svm->vcpu))
-		return r == EMULATE_DONE;
+		return r;
	kvm_run->exit_reason = KVM_EXIT_SET_TPR;
return 0;
 }
@@ -2865,14 +2934,14 @@ static int pause_interception(struct vcpu_svm *svm)
 }
 
 static int (*svm_exit_handlers[])(struct vcpu_svm *svm) = {
-   [SVM_EXIT_READ_CR0] = emulate_on_interception,

[PATCH 0/3] Provide unmapped page cache control (v2)

2010-12-10 Thread Balbir Singh

The following series implements page cache control; it is a
split-out version of patch 1 of version 3 of the page cache
optimization patches posted earlier
(previous posting: https://lkml.org/lkml/2010/11/30/79).

The previous revision received a lot of comments; I've tried to
address as many of those as possible in this revision. The
last series was reviewed by Christoph Lameter.

There were comments about overlap with Nick's changes. I don't
feel these changes impact Nick's work, and integration can/will
be considered as the patches evolve, if need be.

Detailed Description

This patch implements unmapped page cache control via preferred
page cache reclaim. The current patch hooks into kswapd and reclaims
page cache if the user has requested for unmapped page control.
This is useful in the following scenario
- In a virtualized environment with cache=writethrough, we see
  double caching (one copy in the host and one in the guest). As
  we try to scale guests, cache usage across the system grows.
  The goal of this patch is to reclaim page cache when Linux is running
  as a guest and get the host to hold the page cache and manage it.
  There might be temporary duplication, but in the long run, memory
  in the guests would be used for mapped pages.
- The option is controlled via a boot option and the administrator
  can selectively turn it on, on a need to use basis.

A lot of the code is borrowed from the zone_reclaim_mode logic in
__zone_reclaim(). One might argue that with ballooning and KSM
this feature is not very useful, but even with ballooning we need
extra logic to balloon multiple VMs, and it is hard
to figure out the correct amount of memory to balloon. With these
patches applied, each guest has a sufficient amount of free memory
available, that can be easily seen and reclaimed by the balloon driver.
The additional memory in the guest can be reused for additional
applications or used to start additional guests/balance memory in
the host.

KSM currently does not de-duplicate host and guest page cache. The goal
of this patch is to help automatically balance unmapped page cache when
instructed to do so.

There are some magic numbers in use in the code, UNMAPPED_PAGE_RATIO
and the number of pages to reclaim when the unmapped_page_control
argument is supplied. These numbers were chosen to avoid reaping
page cache too aggressively or too frequently, while still providing control.

The sysctl for min_unmapped_ratio provides further control from
within the guest on the amount of unmapped pages to reclaim.
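
For reference, the threshold test at the heart of this is small; below
is a minimal sketch of the idea (the helper name and the clamping are
illustrative only, not code from these patches):

#include <linux/mmzone.h>
#include <linux/vmstat.h>

/*
 * Illustrative sketch only: reclaim unmapped page cache once it
 * exceeds the per-zone threshold derived from vm.min_unmapped_ratio.
 */
static bool zone_unmapped_pages_excessive(struct zone *zone)
{
	unsigned long file_pages  = zone_page_state(zone, NR_FILE_PAGES);
	unsigned long file_mapped = zone_page_state(zone, NR_FILE_MAPPED);
	unsigned long unmapped = 0;

	/* NR_FILE_MAPPED can transiently exceed NR_FILE_PAGES; clamp. */
	if (file_pages > file_mapped)
		unmapped = file_pages - file_mapped;

	return unmapped > zone->min_unmapped_pages;
}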

Data from the previous patchsets can be found at
https://lkml.org/lkml/2010/11/30/79

Size measurement

CONFIG_UNMAPPED_PAGECACHE_CONTROL and CONFIG_NUMA enabled
# size mm/built-in.o 
   text    data     bss     dec    hex filename
 419431 1883047  140888 2443366 254866 mm/built-in.o

CONFIG_UNMAPPED_PAGECACHE_CONTROL disabled, CONFIG_NUMA enabled
# size mm/built-in.o 
   text    data     bss     dec    hex filename
 418908 1883023  140888 2442819 254643 mm/built-in.o


---

Balbir Singh (3):
  Move zone_reclaim() outside of CONFIG_NUMA
  Refactor zone_reclaim, move reusable functionality outside
  Provide control over unmapped pages


 Documentation/kernel-parameters.txt |8 ++
 include/linux/mmzone.h  |4 +
 include/linux/swap.h|   21 +-
 init/Kconfig|   12 +++
 kernel/sysctl.c |   20 +++--
 mm/page_alloc.c |9 ++
 mm/vmscan.c |  132 +++
 7 files changed, 175 insertions(+), 31 deletions(-)

-- 
Three Cheers,
Balbir


[PATCH 1/3] Move zone_reclaim() outside of CONFIG_NUMA (v2)

2010-12-10 Thread Balbir Singh
Changelog v2
Moved sysctl for min_unmapped_ratio as well

This patch moves zone_reclaim and associated helpers
outside CONFIG_NUMA. This infrastructure is reused
in the patches for page cache control that follow.

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
---
 include/linux/mmzone.h |4 ++--
 include/linux/swap.h   |4 ++--
 kernel/sysctl.c|   18 +-
 mm/vmscan.c|2 --
 4 files changed, 13 insertions(+), 15 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4890662..aeede91 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -302,12 +302,12 @@ struct zone {
 */
unsigned long   lowmem_reserve[MAX_NR_ZONES];
 
-#ifdef CONFIG_NUMA
-   int node;
/*
 * zone reclaim becomes active if more unmapped pages exist.
 */
unsigned long   min_unmapped_pages;
+#ifdef CONFIG_NUMA
+   int node;
unsigned long   min_slab_pages;
 #endif
struct per_cpu_pageset __percpu *pageset;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 84375e4..ac5c06e 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -253,11 +253,11 @@ extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
 extern long vm_total_pages;
 
+extern int sysctl_min_unmapped_ratio;
+extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
 #ifdef CONFIG_NUMA
 extern int zone_reclaim_mode;
-extern int sysctl_min_unmapped_ratio;
 extern int sysctl_min_slab_ratio;
-extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
 #else
 #define zone_reclaim_mode 0
 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index a00fdef..e40040e 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1211,15 +1211,6 @@ static struct ctl_table vm_table[] = {
 		.extra1		= &zero,
 	},
 #endif
-#ifdef CONFIG_NUMA
-	{
-		.procname	= "zone_reclaim_mode",
-		.data		= &zone_reclaim_mode,
-		.maxlen		= sizeof(zone_reclaim_mode),
-		.mode		= 0644,
-		.proc_handler	= proc_dointvec,
-		.extra1		= &zero,
-	},
 	{
 		.procname	= "min_unmapped_ratio",
 		.data		= &sysctl_min_unmapped_ratio,
@@ -1229,6 +1220,15 @@ static struct ctl_table vm_table[] = {
 		.extra1		= &zero,
 		.extra2		= &one_hundred,
 	},
+#ifdef CONFIG_NUMA
+	{
+		.procname	= "zone_reclaim_mode",
+		.data		= &zone_reclaim_mode,
+		.maxlen		= sizeof(zone_reclaim_mode),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+		.extra1		= &zero,
+	},
 	{
 		.procname	= "min_slab_ratio",
 		.data		= &sysctl_min_slab_ratio,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 42a4859..e841cae 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2740,7 +2740,6 @@ static int __init kswapd_init(void)
 
 module_init(kswapd_init)
 
-#ifdef CONFIG_NUMA
 /*
  * Zone reclaim mode
  *
@@ -2950,7 +2949,6 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 
return ret;
 }
-#endif
 
 /*
  * page_evictable - test whether a page is evictable

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/3] Refactor zone_reclaim (v2)

2010-12-10 Thread Balbir Singh
Move reusable functionality outside of zone_reclaim.
Make zone_reclaim_unmapped_pages modular

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
---
 mm/vmscan.c |   35 +++
 1 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e841cae..4e2ad05 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2815,6 +2815,27 @@ static long zone_pagecache_reclaimable(struct zone *zone)
 }
 
 /*
+ * Helper function to reclaim unmapped pages, we might add something
+ * similar to this for slab cache as well. Currently this function
+ * is shared with __zone_reclaim()
+ */
+static inline void
+zone_reclaim_unmapped_pages(struct zone *zone, struct scan_control *sc,
+			    unsigned long nr_pages)
+{
+	int priority;
+	/*
+	 * Free memory by calling shrink zone with increasing
+	 * priorities until we have enough memory freed.
+	 */
+	priority = ZONE_RECLAIM_PRIORITY;
+	do {
+		shrink_zone(priority, zone, sc);
+		priority--;
+	} while (priority >= 0 && sc->nr_reclaimed < nr_pages);
+}
+
+/*
  * Try to free up some pages from this zone through reclaim.
  */
 static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
@@ -2823,7 +2844,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	const unsigned long nr_pages = 1 << order;
 	struct task_struct *p = current;
 	struct reclaim_state reclaim_state;
-	int priority;
 	struct scan_control sc = {
 		.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
 		.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
@@ -2847,17 +2867,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	reclaim_state.reclaimed_slab = 0;
 	p->reclaim_state = &reclaim_state;
 
-	if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages) {
-		/*
-		 * Free memory by calling shrink zone with increasing
-		 * priorities until we have enough memory freed.
-		 */
-		priority = ZONE_RECLAIM_PRIORITY;
-		do {
-			shrink_zone(priority, zone, &sc);
-			priority--;
-		} while (priority >= 0 && sc.nr_reclaimed < nr_pages);
-	}
+	if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages)
+		zone_reclaim_unmapped_pages(zone, &sc, nr_pages);
 
 	nr_slab_pages0 = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
 	if (nr_slab_pages0 > zone->min_slab_pages) {



[PATCH 3/3] Provide control over unmapped pages (v2)

2010-12-10 Thread Balbir Singh
Changelog v2
1. Use a config option to enable the code (Andrew Morton)
2. Explain the magic tunables in the code, or at least attempt
   to explain them (General comment)
3. Hint uses of the boot parameter with unlikely (Andrew Morton)
4. Use better names (balanced is not a good naming convention)
5. Updated Documentation/kernel-parameters.txt (Andrew Morton)

Provide control using zone_reclaim() and a boot parameter. The
code reuses functionality from zone_reclaim() to isolate unmapped
pages and reclaim them as a priority, ahead of other mapped pages.

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
---
 Documentation/kernel-parameters.txt |8 +++
 include/linux/swap.h|   21 ++--
 init/Kconfig|   12 
 kernel/sysctl.c |2 +
 mm/page_alloc.c |9 +++
 mm/vmscan.c |   97 +++
 6 files changed, 142 insertions(+), 7 deletions(-)

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index dd8fe2b..f52b0bd 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2515,6 +2515,14 @@ and is between 256 and 4096 characters. It is defined in the file
[X86]
Set unknown_nmi_panic=1 early on boot.
 
+	unmapped_page_control
+			[KNL] Available if CONFIG_UNMAPPED_PAGECACHE_CONTROL
+			is enabled. It controls the amount of unmapped memory
+			that is present in the system. This boot option plus
+			vm.min_unmapped_ratio (sysctl) provide granular control
+			over how much unmapped page cache can exist in the system
+			before kswapd starts reclaiming unmapped page cache pages.
+
usbcore.autosuspend=
[USB] The autosuspend time delay (in seconds) used
for newly-detected USB devices (default 2).  This
diff --git a/include/linux/swap.h b/include/linux/swap.h
index ac5c06e..773d7e5 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -253,19 +253,32 @@ extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
 extern long vm_total_pages;
 
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
 extern int sysctl_min_unmapped_ratio;
 extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
-#ifdef CONFIG_NUMA
-extern int zone_reclaim_mode;
-extern int sysctl_min_slab_ratio;
 #else
-#define zone_reclaim_mode 0
 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
 {
return 0;
 }
 #endif
 
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
+extern bool should_reclaim_unmapped_pages(struct zone *zone);
+#else
+static inline bool should_reclaim_unmapped_pages(struct zone *zone)
+{
+   return false;
+}
+#endif
+
+#ifdef CONFIG_NUMA
+extern int zone_reclaim_mode;
+extern int sysctl_min_slab_ratio;
+#else
+#define zone_reclaim_mode 0
+#endif
+
 extern int page_evictable(struct page *page, struct vm_area_struct *vma);
 extern void scan_mapping_unevictable_pages(struct address_space *);
 
diff --git a/init/Kconfig b/init/Kconfig
index 3eb22ad..78c9169 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -782,6 +782,18 @@ endif # NAMESPACES
 config MM_OWNER
bool
 
+config UNMAPPED_PAGECACHE_CONTROL
+	bool "Provide control over unmapped page cache"
+	default n
+	help
+	  This option adds support for controlling unmapped page cache
+	  via a boot parameter (unmapped_page_control). The boot parameter,
+	  together with the sysctl (vm.min_unmapped_ratio), controls the
+	  total number of unmapped pages in the system. This feature is
+	  useful if you want to limit the amount of unmapped page cache
+	  or want to reduce page cache duplication in a virtualized
+	  environment. If unsure, say 'N'.
+
 config SYSFS_DEPRECATED
	bool "enable deprecated sysfs features to support old userspace tools"
depends on SYSFS
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index e40040e..ab2c60a 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1211,6 +1211,7 @@ static struct ctl_table vm_table[] = {
		.extra1		= &zero,
},
 #endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
{
		.procname	= "min_unmapped_ratio",
		.data		= &sysctl_min_unmapped_ratio,
@@ -1220,6 +1221,7 @@ static struct ctl_table vm_table[] = {
		.extra1		= &zero,
		.extra2		= &one_hundred,
},
+#endif
 #ifdef CONFIG_NUMA
{
		.procname	= "zone_reclaim_mode",
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1845a97..1c9fbab 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1662,6 +1662,9 @@ 

Re: [RFC PATCH 0/3] directed yield for Pause Loop Exiting

2010-12-10 Thread Rik van Riel

On 12/10/2010 12:03 AM, Balbir Singh wrote:


This is a good problem statement, there are other things to consider
as well

1. If a hard limit feature is enabled underneath, donating the
timeslice would probably not make too much sense in that case


The idea is to get the VCPU that is holding the lock to run
ASAP, so it can release the lock.


2. The implict assumption is that spinning is bad, but for locks
held for short durations, the assumption is not true. I presume
by the problem statement above, the h/w does the detection of
when to pause, but that is not always correct as you suggest above.


The hardware waits a certain number of spins before it traps
to the virt host.  Short-held locks, held by a virtual CPU
that is running, will not trigger the exception.


3. With respect to donating timeslices, don't scheduler cgroups
and job isolation address that problem today?


No.

--
All rights reversed


Re: [RFC PATCH 2/3] sched: add yield_to function

2010-12-10 Thread Rik van Riel

On 12/10/2010 03:39 AM, Srivatsa Vaddagiri wrote:

On Thu, Dec 09, 2010 at 11:34:46PM -0500, Rik van Riel wrote:

On 12/03/2010 09:06 AM, Srivatsa Vaddagiri wrote:

On Fri, Dec 03, 2010 at 03:03:30PM +0100, Peter Zijlstra wrote:

No, because they do receive service (they spend some time spinning
before being interrupted), so the respective vruntimes will increase, at
some point they'll pass B0 and it'll get scheduled.


Is that sufficient to ensure that B0 receives its fair share (1/3 cpu in this
case)?


I have a rough idea for a simpler way to ensure
fairness.

At yield_to time, we could track in the runqueue
structure that a task received CPU time (and on
the other runqueue that a task donated CPU time).

The balancer can count time-given-to CPUs as
busier, and donated-time CPUs as less busy,
moving tasks away in the unlikely event that
the same task keeps getting CPU time given to
it.


I think just capping donation (either on send side or receive side) may be more
simpler here than to mess with load balancer logic.


Do you have any ideas on how to implement this in a simple
enough way that it may be acceptable upstream? :)
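
For illustration, the bookkeeping could be as small as the sketch
below (the structure and helper are invented names, not an actual
patch):

/* Illustrative only: invented names, not an actual patch. */
struct yield_account {
	u64 time_received;	/* CPU time received via yield_to() */
	u64 time_donated;	/* CPU time given away via yield_to() */
};

/* Called at yield_to() time with the length of the donated slice;
 * the load balancer would treat time_received as extra load and
 * time_donated as load credit when comparing runqueues. */
static void account_directed_yield(struct yield_account *dst,
				   struct yield_account *src, u64 delta_ns)
{
	dst->time_received += delta_ns;
	src->time_donated  += delta_ns;
}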

--
All rights reversed


Re: [RFC PATCH V2 5/5] Add TX zero copy in macvtap

2010-12-10 Thread Shirley Ma
On Fri, 2010-12-10 at 11:27 +0100, Eric Dumazet wrote:
 You could make one atomic_add() outside of the loop, and factorize
 many things...
 
 atomic_add(len, &skb->sk->sk_wmem_alloc);
 skb->data_len += len;
 skb->len += len;
 skb->truesize += len;
 while (len) {
 ...
 }

Yep, thanks, will update it!

Shirley

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH V2 5/5] Add TX zero copy in macvtap

2010-12-10 Thread Shirley Ma
On Fri, 2010-12-10 at 08:25 -0800, Shirley Ma wrote:
 On Fri, 2010-12-10 at 11:27 +0100, Eric Dumazet wrote:
  You could make one atomic_add() outside of the loop, and factorize
  many things...
  
  atomic_add(len, &skb->sk->sk_wmem_alloc);
  skb->data_len += len;
  skb->len += len;
  skb->truesize += len;
  while (len) {
  ...
  }
 
 Yep, thanks, will update it! 

Maybe I should use total_len when skb frag mapping is done, something
like:

int total_len = 0;
...
total_len += len;
...
skb->data_len += total_len;
skb->len += total_len;
skb->truesize += total_len;
atomic_add(total_len, &skb->sk->sk_wmem_alloc);

Shirley



Re: [RFC PATCH V2 5/5] Add TX zero copy in macvtap

2010-12-10 Thread Eric Dumazet
On Friday 10 December 2010 at 08:25 -0800, Shirley Ma wrote:
 On Fri, 2010-12-10 at 11:27 +0100, Eric Dumazet wrote:
  You could make one atomic_add() outside of the loop, and factorize
  many things...
  
  atomic_add(len, &skb->sk->sk_wmem_alloc);
  skb->data_len += len;
  skb->len += len;
  skb->truesize += len;
  while (len) {
  ...
  }
 
 Yep, thanks, will update it!

Also take a look at skb_fill_page_desc() helper, and maybe
skb_add_rx_frag() too.

The atomic op should be factorized for sure, but other adds might be
done by helpers to keep code short.
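
Something along these lines, say (sketch only; page pinning,
refcounting and error handling are omitted, and the function name is
invented):

#include <linux/skbuff.h>

/* Sketch: attach already-pinned user pages as skb frags with
 * skb_fill_page_desc(), then do the length/accounting updates
 * once, outside the loop. */
static void skb_attach_user_pages(struct sk_buff *skb, struct page **pages,
				  int npages, int offset, int total_len)
{
	int i, remaining = total_len;

	for (i = 0; i < npages && remaining > 0; i++) {
		int size = min_t(int, remaining, PAGE_SIZE - offset);

		skb_fill_page_desc(skb, i, pages[i], offset, size);
		remaining -= size;
		offset = 0;	/* only the first frag can start mid-page */
	}

	skb->data_len += total_len;
	skb->len      += total_len;
	skb->truesize += total_len;
	atomic_add(total_len, &skb->sk->sk_wmem_alloc);
}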





Re: linux-next: Tree for November 22 (kvm)

2010-12-10 Thread Randy Dunlap
On Mon, 29 Nov 2010 08:08:30 -1000 Zachary Amsden wrote:

 On 11/29/2010 07:52 AM, Randy Dunlap wrote:
  On 11/29/10 09:47, Zachary Amsden wrote:
 
  On 11/29/2010 06:35 AM, Avi Kivity wrote:
   
  On 11/29/2010 06:33 PM, Randy Dunlap wrote:
 
  On Mon, 22 Nov 2010 13:26:27 -0800 Randy Dunlap wrote:
 
   
On Mon, 22 Nov 2010 13:49:11 +1100 Stephen Rothwell wrote:
 
 
Hi all,
 
Changes since 20101119:
   
 
kvm.c:(.init.text+0x11f49): undefined reference to `kvm_register_clock'
when CONFIG_KVM_CLOCK is not enabled.
 
 
  BUild error still present in linux-next-2010-NOV-29.
 
   
  Glauber, Zach?
 
 
  I can only speculate this reference is being called from smpboot without
  CONFIG guarding?
   
  Sorry, looks like I dropped the first line of the error messages:
 
  arch/x86/built-in.o: In function `kvm_smp_prepare_boot_cpu':
  kvm.c:(.init.text+0xad38): undefined reference to `kvm_register_clock'
 
  from arch/x86/kernel/kvm.c:
 
  #ifdef CONFIG_SMP
  static void __init kvm_smp_prepare_boot_cpu(void)
  {
   WARN_ON(kvm_register_clock("primary cpu clock"));
  kvm_guest_cpu_init();
  native_smp_prepare_boot_cpu();
  }
 
  so it looks like you are correct...
 
 
 Looks like this is the appropriate fix:
 
 #ifdef CONFIG_SMP
 static void __init kvm_smp_prepare_boot_cpu(void)
 {
 #ifdef CONFIG_KVM_CLOCK
    WARN_ON(kvm_register_clock("primary cpu clock"));
 #endif
   kvm_guest_cpu_init();
   native_smp_prepare_boot_cpu();
 }


Can we get this fix merged, please?

Build error is still happening in linux-next 20101210.

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***


Re: [PATCH] intel-iommu: Fix use after release during device attach

2010-12-10 Thread Chris Wright
* Jan Kiszka (jan.kis...@siemens.com) wrote:
  --- a/drivers/pci/intel-iommu.c
  +++ b/drivers/pci/intel-iommu.c
  @@ -3627,9 +3627,9 @@ static int intel_iommu_attach_device(struct
  iommu_domain *domain,
 
		pte = dmar_domain->pgd;
		if (dma_pte_present(pte)) {
-			free_pgtable_page(dmar_domain->pgd);
			dmar_domain->pgd = (struct dma_pte *)
				phys_to_virt(dma_pte_addr(pte));

While here, might as well remove the unnecessary cast.

+			free_pgtable_page(pte);
		}
		dmar_domain->agaw--;
	}
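
i.e., with the cast dropped the hunk would read roughly like this
(untested sketch):

	pte = dmar_domain->pgd;
	if (dma_pte_present(pte)) {
		/* phys_to_virt() returns void *, so no cast is needed */
		dmar_domain->pgd = phys_to_virt(dma_pte_addr(pte));
		free_pgtable_page(pte);
	}
	dmar_domain->agaw--;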
 
  Reviewed-by: Sheng Yang sh...@linux.intel.com

Acked-by: Chris Wright chr...@sous-sol.org

  CC iommu mailing list and David.
  
  Ping...
  
  I think this fix also qualifies for stable (.35 and .36).
  
 
 Still not merged?

David, do you plan to pick this one up?

thanks,
-chris


Re: [PATCH v16 02/17]Add a new struct for device to manipulate external buffer.

2010-12-10 Thread David Miller
From: xiaohui@intel.com
Date: Wed,  1 Dec 2010 16:08:13 +0800

 From: Xin Xiaohui xiaohui@intel.com
 
 Add a structure in structure net_device, the new field is
 named as mp_port. It's for mediate passthru (zero-copy).
 It contains the capability for the net device driver,
 a socket, and an external buffer creator, external means
 skb buffer belongs to the device may not be allocated from
 kernel space.
 
 Signed-off-by: Xin Xiaohui xiaohui@intel.com
 Signed-off-by: Zhao Yu yzhao81...@gmail.com
 Reviewed-by: Jeff Dike jd...@linux.intel.com

Please eliminate whatever is causing this indentation of your
commit messages.

There should be no special indentation of the commit message.


Re: [PATCH v16 03/17] Add a ndo_mp_port_prep pointer to net_device_ops.

2010-12-10 Thread David Miller
From: xiaohui@intel.com
Date: Wed,  1 Dec 2010 16:08:14 +0800

 +#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
 + int (*ndo_mp_port_prep)(struct net_device *dev,
 + struct mp_port *port);
 +#endif

Please rename this config option so that it is clear, by name, that
this option is for a networking facility.

F.e. CONFIG_NET_MEDIATE_PASSTHRU


Re: [RFC PATCH 3/3] kvm: use yield_to instead of sleep in kvm_vcpu_on_spin

2010-12-10 Thread Avi Kivity

On 12/09/2010 07:07 PM, Rik van Riel wrote:

Right. May be clearer by using a for () loop instead of the goto.



And open coding kvm_for_each_vcpu ?

Somehow I suspect that won't add to clarity...


No, I meant having a for (pass = 0; pass < 2; ++pass) and nesting 
kvm_for_each_vcpu() in it.
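
Roughly like this (sketch only; the predicate and the yield wrapper
are placeholders for the RFC's selection logic):

	int i, pass;
	struct kvm_vcpu *v;

	for (pass = 0; pass < 2; pass++) {
		kvm_for_each_vcpu(i, v, kvm) {
			/* pass 0: vcpus after the last boosted one;
			 * pass 1: the rest */
			if (!is_yield_candidate(v, pass))	/* placeholder */
				continue;
			if (try_directed_yield(v))		/* placeholder */
				return;
		}
	}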


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [RFC PATCH 0/3] directed yield for Pause Loop Exiting

2010-12-10 Thread Avi Kivity

On 12/10/2010 07:03 AM, Balbir Singh wrote:


  Scheduler people, please flame me with anything I may have done
  wrong, so I can do it right for a next version :)


This is a good problem statement, there are other things to consider
as well

1. If a hard limit feature is enabled underneath, donating the
timeslice would probably not make too much sense in that case


What's the alternative?

Consider a two vcpu guest with a 50% hard cap.  Suppose the workload 
involves ping-ponging within the guest.  If the scheduler decides to 
schedule the vcpus without any overlap, then the throughput will be 
dictated by the time slice.  If we allow donation, throughput is limited 
by context switch latency.



--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
