date:20090617

From: Weidong Han weidong@intel.com

In free_assigned_device, kvm_remove_ioperm_data won't be
called, because the check skips type IORESOURCE_IO. For
IORESOURCE_MEM, it should destroy the registered memory,
otherwise it may be failed to create new memory slot because
it is already there. hot add and hot remove a device several
times can trigger this failure.

This patch fixes the above issues.

Signed-off-by: Weidong Han weidong@intel.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 65920d0..357a946 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -561,19 +561,26 @@ static void free_assigned_device(AssignedDevInfo *adev)
 PCIRegion *pci_region = dev-real_device.regions[i];
 AssignedDevRegion *region = dev-v_addrs[i];
 
-if (!pci_region-valid || !(pci_region-type  IORESOURCE_MEM))
+if (!pci_region-valid)
 continue;
 
-kvm_remove_ioperm_data(region-u.r_baseport, region-r_size);
-
-if (region-u.r_virtbase) {
-int ret = munmap(region-u.r_virtbase,
- (pci_region-size + 0xFFF)  0xF000);
-if (ret != 0)
-fprintf(stderr,
-Failed to unmap assigned device region: %s\n,
-strerror(errno));
-}
+if (pci_region-type  IORESOURCE_IO) {
+kvm_remove_ioperm_data(region-u.r_baseport, region-r_size);
+continue;
+} else if (pci_region-type  IORESOURCE_MEM) {
+if (region-e_size  0)
+kvm_destroy_phys_mem(kvm_context, region-e_physbase,
+ TARGET_PAGE_ALIGN(region-e_size));
+
+if (region-u.r_virtbase) {
+int ret = munmap(region-u.r_virtbase,
+ (pci_region-size + 0xFFF)  0xF000);
+if (ret != 0)
+fprintf(stderr,
+   Failed to unmap assigned device region: %s\n,
+   strerror(errno));
+}
+   }
 }
 
 if (dev-real_device.config_fd) {
--
To unsubscribe from this list: send the line unsubscribe kvm-commits in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[COMMIT master] Do not use env-halted to decide where halted state should be handled

From: Gleb Natapov g...@redhat.com

Use kvm_irqchip_in_kernel() for that. If irq chip is not handled by
userspace kernel should be entered even when CPU is halted.

Signed-off-by: Gleb Natapov g...@redhat.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/hw/apic.c b/hw/apic.c
index c5d97b2..f186202 100644
--- a/hw/apic.c
+++ b/hw/apic.c
@@ -467,8 +467,7 @@ static void apic_init_ipi(APICState *s)
 
 cpu_reset(s-cpu_env);
 
-if (!(s-apicbase  MSR_IA32_APICBASE_BSP) 
-(!kvm_enabled() || !qemu_kvm_irqchip_in_kernel()))
+if (!(s-apicbase  MSR_IA32_APICBASE_BSP))
 s-cpu_env-halted = 1;
 
 if (kvm_enabled()  !qemu_kvm_irqchip_in_kernel())
diff --git a/qemu-kvm.c b/qemu-kvm.c
index 9dc0a01..59de374 100644
--- a/qemu-kvm.c
+++ b/qemu-kvm.c
@@ -1725,6 +1725,8 @@ static int has_work(CPUState *env)
 {
 if (!vm_running || (env  env-kvm_cpu_state.stopped))
return 0;
+if (kvm_irqchip_in_kernel(kvm_context))
+return 1;
 if (!env-halted)
return 1;
 return kvm_arch_has_work(env);
@@ -1898,8 +1900,6 @@ static int kvm_main_loop_cpu(CPUState *env)
 setup_kernel_sigmask(env);
 
 pthread_mutex_lock(qemu_mutex);
-if (kvm_irqchip_in_kernel(kvm_context))
-   env-halted = 0;
 
 kvm_qemu_init_env(env);
 #ifdef TARGET_I386
@@ -1920,7 +1920,7 @@ static int kvm_main_loop_cpu(CPUState *env)
if (env-kvm_cpu_state.sipi_needed)
update_regs_for_sipi(env);
 }
-   if (!env-halted)
+   if (!env-halted || kvm_irqchip_in_kernel(kvm_context))
kvm_cpu_exec(env);
env-exit_request = 0;
 env-exception_index = EXCP_INTERRUPT;
--
To unsubscribe from this list: send the line unsubscribe kvm-commits in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[COMMIT master] Drop unnecessary check of kvm_cpu_state.init

From: Gleb Natapov g...@redhat.com

env-kvm_cpu_state.init is always zero here.

Signed-off-by: Gleb Natapov g...@redhat.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/qemu-kvm.c b/qemu-kvm.c
index eefca61..9dc0a01 100644
--- a/qemu-kvm.c
+++ b/qemu-kvm.c
@@ -1920,7 +1920,7 @@ static int kvm_main_loop_cpu(CPUState *env)
if (env-kvm_cpu_state.sipi_needed)
update_regs_for_sipi(env);
 }
-   if (!env-halted  !env-kvm_cpu_state.init)
+   if (!env-halted)
kvm_cpu_exec(env);
env-exit_request = 0;
 env-exception_index = EXCP_INTERRUPT;
--
To unsubscribe from this list: send the line unsubscribe kvm-commits in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[COMMIT master] Retrieve mp state info in cpu_synchronize_state()

From: Gleb Natapov g...@redhat.com

And set env-halted based on the value to show accurate vcpu state in
QEMU monitor.

Signed-off-by: Gleb Natapov g...@redhat.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/qemu-kvm.c b/qemu-kvm.c
index 60e2e76..10d44a6 100644
--- a/qemu-kvm.c
+++ b/qemu-kvm.c
@@ -1708,6 +1708,33 @@ void kvm_save_registers(CPUState *env)
 on_vcpu(env, kvm_do_save_registers, env);
 }
 
+static void kvm_do_load_mpstate(void *_env)
+{
+CPUState *env = _env;
+
+kvm_arch_load_mpstate(env);
+}
+
+void kvm_load_mpstate(CPUState *env)
+{
+if (kvm_enabled()  qemu_system_ready)
+on_vcpu(env, kvm_do_load_mpstate, env);
+}
+
+static void kvm_do_save_mpstate(void *_env)
+{
+CPUState *env = _env;
+
+kvm_arch_save_mpstate(env);
+env-halted = (env-mp_state == KVM_MP_STATE_HALTED);
+}
+
+void kvm_save_mpstate(CPUState *env)
+{
+if (kvm_enabled())
+on_vcpu(env, kvm_do_save_mpstate, env);
+}
+
 int kvm_cpu_exec(CPUState *env)
 {
 int r;
diff --git a/qemu-kvm.h b/qemu-kvm.h
index 1bae0b9..b6db1bd 100644
--- a/qemu-kvm.h
+++ b/qemu-kvm.h
@@ -219,11 +219,13 @@ static inline int kvm_sync_vcpus(void) { return 0; }
 static inline void kvm_arch_get_registers(CPUState *env)
 {
 kvm_save_registers(env);
+kvm_save_mpstate(env);
 }
 
 static inline void kvm_arch_put_registers(CPUState *env)
 {
 kvm_load_registers(env);
+kvm_load_mpstate(env);
 }
 
 static inline void cpu_synchronize_state(CPUState *env, int modified)
--
To unsubscribe from this list: send the line unsubscribe kvm-commits in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[COMMIT master] Merge branch 'master' of git://git.sv.gnu.org/qemu

From: Avi Kivity a...@redhat.com

* 'master' of git://git.sv.gnu.org/qemu: (58 commits)
  exec.c: remove unnecessary #if NB_MMU_MODES
  Fix vga_screen_dump_blank() PPM generation
  Prevent CD-ROM media eject while device is locked
  set migration max downtime
  add non-arbitrary migration stop condition
  kvm: Fix IRQ injection into full queue
  Call qemu_bh_delete at bdrv_aio_bh_cb.
  Remove dead code
  QEMU KVM: i386: Fix the cpu reset state
  allow CPUID vendor override
  Fix help message for new configure option --enable-debug.
  provide cpu_index to env mapping
  pci: add define for communication class devices
  vnc: improve numpad support for qemu console.
  virtio blk: fix warning.
  lsi53c895a: Implement write access to DMA Byte Counter
  lsi53c895a: Implement read and write access to DMA Next Address
  lsi53c895a: Implement Scratch Byte Register
  Rename pci_register_io_region() to pci_register_bar()
  Rearrange io_mem_init()
  ...

Conflicts:
configure

Signed-off-by: Avi Kivity a...@redhat.com
--
To unsubscribe from this list: send the line unsubscribe kvm-commits in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[COMMIT master] Remove redunant check for s-cpu_env

From: Gleb Natapov g...@redhat.com

s-cpu_env cannot be zero here.

Signed-off-by: Gleb Natapov g...@redhat.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/hw/apic.c b/hw/apic.c
index f186202..eac54fd 100644
--- a/hw/apic.c
+++ b/hw/apic.c
@@ -471,8 +471,7 @@ static void apic_init_ipi(APICState *s)
 s-cpu_env-halted = 1;
 
 if (kvm_enabled()  !qemu_kvm_irqchip_in_kernel())
-   if (s-cpu_env)
-   kvm_apic_init(s-cpu_env);
+kvm_apic_init(s-cpu_env);
 }
 
 /* send a SIPI message to the CPU to start it */
--
To unsubscribe from this list: send the line unsubscribe kvm-commits in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[COMMIT master] Drop unncessary setting of env-exception_index

From: Gleb Natapov g...@redhat.com

env-exception_index is not used by kvm code.

Signed-off-by: Gleb Natapov g...@redhat.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/qemu-kvm-ia64.c b/qemu-kvm-ia64.c
index 234602c..477d24c 100644
--- a/qemu-kvm-ia64.c
+++ b/qemu-kvm-ia64.c
@@ -35,7 +35,6 @@ int kvm_arch_halt(void *opaque, kvm_vcpu_context_t vcpu)
 {
 CPUState *env = cpu_single_env;
 env-hflags |= HF_HALTED_MASK;
-env-exception_index = EXCP_HLT;
 return 1;
 }
 
@@ -135,7 +134,6 @@ void kvm_arch_cpu_reset(CPUState *env)
 } else {
env-interrupt_request = ~CPU_INTERRUPT_HARD;
env-halted = 1;
-   env-exception_index = EXCP_HLT;
 }
 }
 
diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c
index 8e6fb75..6865385 100644
--- a/qemu-kvm-x86.c
+++ b/qemu-kvm-x86.c
@@ -611,7 +611,6 @@ int kvm_arch_halt(void *opaque, kvm_vcpu_context_t vcpu)
  (env-eflags  IF_MASK)) 
!(env-interrupt_request  CPU_INTERRUPT_NMI)) {
 env-halted = 1;
-   env-exception_index = EXCP_HLT;
 }
 return 1;
 }
@@ -707,7 +706,6 @@ void kvm_arch_cpu_reset(CPUState *env)
} else {
env-interrupt_request = ~CPU_INTERRUPT_HARD;
env-halted = 1;
-   env-exception_index = EXCP_HLT;
}
 }
 }
diff --git a/qemu-kvm.c b/qemu-kvm.c
index 10d44a6..d3eba1a 100644
--- a/qemu-kvm.c
+++ b/qemu-kvm.c
@@ -1950,7 +1950,6 @@ static int kvm_main_loop_cpu(CPUState *env)
if (!env-halted || kvm_irqchip_in_kernel(kvm_context))
kvm_cpu_exec(env);
env-exit_request = 0;
-env-exception_index = EXCP_INTERRUPT;
kvm_main_loop_wait(env, 0);
 }
 pthread_mutex_unlock(qemu_mutex);
--
To unsubscribe from this list: send the line unsubscribe kvm-commits in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[COMMIT master] Update source link

From: Avi Kivity a...@redhat.com

Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/linux-2.6 b/linux-2.6
index 8fecb97..720ac78 16
--- a/linux-2.6
+++ b/linux-2.6
@@ -1 +1 @@
-Subproject commit 8fecb979727b1582ba2c1a35c34c0bb6d341fa15
+Subproject commit 720ac78b6b38277c941af04ef8f73aeea1951842
--
To unsubscribe from this list: send the line unsubscribe kvm-commits in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[COMMIT master] Update source link

From: Avi Kivity a...@redhat.com

Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/linux-2.6 b/linux-2.6
index 720ac78..77bc807 16
--- a/linux-2.6
+++ b/linux-2.6
@@ -1 +1 @@
-Subproject commit 720ac78b6b38277c941af04ef8f73aeea1951842
+Subproject commit 77bc807c54c1abf0808c692837a956e33a172f61
--
To unsubscribe from this list: send the line unsubscribe kvm-commits in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[COMMIT master] KVM: Ignore reads to K7 EVNTSEL MSRs

From: Amit Shah amit.s...@redhat.com

In commit 7fe29e0faacb650d31b9e9f538203a157bec821d we ignored the
reads to the P6 EVNTSEL MSRs. That fixed crashes on Intel machines.

Ignore the reads to K7 EVNTSEL MSRs as well to fix this on AMD
hosts.

This fixes Kaspersky antivirus crashing Windows guests on AMD hosts.

Signed-off-by: Amit Shah amit.s...@redhat.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6025e5b..71090d2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -993,6 +993,7 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 
*pdata)
case MSR_VM_HSAVE_PA:
case MSR_P6_EVNTSEL0:
case MSR_P6_EVNTSEL1:
+   case MSR_K7_EVNTSEL0:
data = 0;
break;
case MSR_MTRRcap:
--
To unsubscribe from this list: send the line unsubscribe kvm-commits in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[COMMIT master] KVM: VMX: Handle vmx instruction vmexits

From: Avi Kivity a...@redhat.com

IF a guest tries to use vmx instructions, inject a #UD to let it know the
instruction is not implemented, rather than crashing.

Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 82df867..9332938 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3104,6 +3104,12 @@ static int handle_vmcall(struct kvm_vcpu *vcpu, struct 
kvm_run *kvm_run)
return 1;
 }
 
+static int handle_vmx_insn(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
+{
+   kvm_queue_exception(vcpu, UD_VECTOR);
+   return 1;
+}
+
 static int handle_invlpg(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 {
unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
@@ -3373,6 +3379,15 @@ static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu 
*vcpu,
[EXIT_REASON_HLT] = handle_halt,
[EXIT_REASON_INVLPG]  = handle_invlpg,
[EXIT_REASON_VMCALL]  = handle_vmcall,
+   [EXIT_REASON_VMCLEAR] = handle_vmx_insn,
+   [EXIT_REASON_VMLAUNCH]= handle_vmx_insn,
+   [EXIT_REASON_VMPTRLD] = handle_vmx_insn,
+   [EXIT_REASON_VMPTRST] = handle_vmx_insn,
+   [EXIT_REASON_VMREAD]  = handle_vmx_insn,
+   [EXIT_REASON_VMRESUME]= handle_vmx_insn,
+   [EXIT_REASON_VMWRITE] = handle_vmx_insn,
+   [EXIT_REASON_VMOFF]   = handle_vmx_insn,
+   [EXIT_REASON_VMON]= handle_vmx_insn,
[EXIT_REASON_TPR_BELOW_THRESHOLD] = handle_tpr_below_threshold,
[EXIT_REASON_APIC_ACCESS] = handle_apic_access,
[EXIT_REASON_WBINVD]  = handle_wbinvd,
--
To unsubscribe from this list: send the line unsubscribe kvm-commits in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[COMMIT master] KVM: SVM: Don't save/restore host cr2

From: Avi Kivity a...@redhat.com

The host never reads cr2 in process context, so are free to clobber it.  The
vmx code does this, so we can safely remove the save/restore code.

Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index cdbd4d1..c5867fd 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -82,7 +82,6 @@ struct vcpu_svm {
 
u64 host_user_msrs[NR_HOST_SAVE_USER_MSRS];
u64 host_gs_base;
-   unsigned long host_cr2;
 
u32 *msrpm;
struct vmcb *hsave;
@@ -187,19 +186,6 @@ static inline void invlpga(unsigned long addr, u32 asid)
asm volatile (__ex(SVM_INVLPGA) :: a(addr), c(asid));
 }
 
-static inline unsigned long kvm_read_cr2(void)
-{
-   unsigned long cr2;
-
-   asm volatile (mov %%cr2, %0 : =r (cr2));
-   return cr2;
-}
-
-static inline void kvm_write_cr2(unsigned long val)
-{
-   asm volatile (mov %0, %%cr2 :: r (val));
-}
-
 static inline void force_new_asid(struct kvm_vcpu *vcpu)
 {
to_svm(vcpu)-asid_generation--;
@@ -2528,7 +2514,6 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu, struct 
kvm_run *kvm_run)
fs_selector = kvm_read_fs();
gs_selector = kvm_read_gs();
ldt_selector = kvm_read_ldt();
-   svm-host_cr2 = kvm_read_cr2();
if (!is_nested(svm))
svm-vmcb-save.cr2 = vcpu-arch.cr2;
/* required for live migration with NPT */
@@ -2615,8 +2600,6 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu, struct 
kvm_run *kvm_run)
vcpu-arch.regs[VCPU_REGS_RSP] = svm-vmcb-save.rsp;
vcpu-arch.regs[VCPU_REGS_RIP] = svm-vmcb-save.rip;
 
-   kvm_write_cr2(svm-host_cr2);
-
kvm_load_fs(fs_selector);
kvm_load_gs(gs_selector);
kvm_load_ldt(ldt_selector);
--
To unsubscribe from this list: send the line unsubscribe kvm-commits in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[COMMIT master] KVM: VMX: Only reload guest cr2 if different from host cr2

From: Avi Kivity a...@redhat.com

cr2 changes only rarely, and writing it is expensive.  Avoid the costly cr2
writes by checking if it does not already hold the desired value.

Shaves 70 cycles off the vmexit latency.

Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 6c0902e..82df867 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3636,11 +3636,16 @@ static void vmx_vcpu_run(struct kvm_vcpu *vcpu, struct 
kvm_run *kvm_run)
mov %%Rsp, %c[host_rsp](%0) \n\t
__ex(ASM_VMX_VMWRITE_RSP_RDX) \n\t
1: \n\t
+   /* Reload cr2 if changed */
+   mov %c[cr2](%0), %%Rax \n\t
+   mov %%cr2, %%Rdx \n\t
+   cmp %%Rax, %%Rdx \n\t
+   je 2f \n\t
+   mov %%Rax, %%cr2 \n\t
+   2: \n\t
/* Check if vmlaunch of vmresume is needed */
cmpl $0, %c[launched](%0) \n\t
/* Load guest registers.  Don't clobber flags. */
-   mov %c[cr2](%0), %%Rax \n\t
-   mov %%Rax, %%cr2 \n\t
mov %c[rax](%0), %%Rax \n\t
mov %c[rbx](%0), %%Rbx \n\t
mov %c[rdx](%0), %%Rdx \n\t
--
To unsubscribe from this list: send the line unsubscribe kvm-commits in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[COMMIT master] Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6

From: Avi Kivity a...@redhat.com

Conflicts:
arch/x86/include/asm/mce.h

Signed-off-by: Avi Kivity a...@redhat.com
--
To unsubscribe from this list: send the line unsubscribe kvm-commits in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[COMMIT master] x86: Add definition for IGNNE MSR

From: Alexander Graf ag...@suse.de

Hyper-V accesses MSR_IGNNE while running under KVM.

Signed-off-by: Alexander Graf ag...@suse.de
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 1692fb5..1723635 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -378,6 +378,7 @@
 /* AMD-V MSRs */
 
 #define MSR_VM_CR   0xc0010114
+#define MSR_VM_IGNNE0xc0010115
 #define MSR_VM_HSAVE_PA 0xc0010117
 
 #endif /* _ASM_X86_MSR_INDEX_H */
--
To unsubscribe from this list: send the line unsubscribe kvm-commits in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[COMMIT master] KVM: SVM: Implement INVLPGA

From: Alexander Graf ag...@suse.de

SVM adds another way to do INVLPG by ASID which Hyper-V makes use of,
so let's implement it!

For now we just do the same thing invlpg does, as asid switching
means we flush the mmu anyways. That might change one day though.

Signed-off-by: Alexander Graf ag...@suse.de
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index b25078d..75c8040 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1883,6 +1883,19 @@ static int clgi_interception(struct vcpu_svm *svm, 
struct kvm_run *kvm_run)
return 1;
 }
 
+static int invlpga_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
+{
+   struct kvm_vcpu *vcpu = svm-vcpu;
+   nsvm_printk(INVLPGA\n);
+
+   /* Let's treat INVLPGA the same as INVLPG (can be optimized!) */
+   kvm_mmu_invlpg(vcpu, vcpu-arch.regs[VCPU_REGS_RAX]);
+
+   svm-next_rip = kvm_rip_read(svm-vcpu) + 3;
+   skip_emulated_instruction(svm-vcpu);
+   return 1;
+}
+
 static int invalid_op_interception(struct vcpu_svm *svm,
   struct kvm_run *kvm_run)
 {
@@ -2228,7 +2241,7 @@ static int (*svm_exit_handlers[])(struct vcpu_svm *svm,
[SVM_EXIT_INVD] = emulate_on_interception,
[SVM_EXIT_HLT]  = halt_interception,
[SVM_EXIT_INVLPG]   = invlpg_interception,
-   [SVM_EXIT_INVLPGA]  = invalid_op_interception,
+   [SVM_EXIT_INVLPGA]  = invlpga_interception,
[SVM_EXIT_IOIO] = io_interception,
[SVM_EXIT_MSR]  = msr_interception,
[SVM_EXIT_TASK_SWITCH]  = task_switch_interception,
--
To unsubscribe from this list: send the line unsubscribe kvm-commits in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[COMMIT master] KVM: SVM: Improve nested interrupt injection

From: Alexander Graf ag...@suse.de

While trying to get Hyper-V running, I realized that the interrupt injection
mechanisms that are in place right now are not 100% correct.

This patch makes nested SVM's interrupt injection behave more like on a
real machine.

Signed-off-by: Alexander Graf ag...@suse.de
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 75c8040..c283201 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1615,7 +1615,8 @@ static int nested_svm_vmexit_real(struct vcpu_svm *svm, 
void *arg1,
/* Kill any pending exceptions */
if (svm-vcpu.arch.exception.pending == true)
nsvm_printk(WARNING: Pending Exception\n);
-   svm-vcpu.arch.exception.pending = false;
+   kvm_clear_exception_queue(svm-vcpu);
+   kvm_clear_interrupt_queue(svm-vcpu);
 
/* Restore selected save entries */
svm-vmcb-save.es = hsave-save.es;
@@ -1683,7 +1684,8 @@ static int nested_svm_vmrun(struct vcpu_svm *svm, void 
*arg1,
svm-nested_vmcb = svm-vmcb-save.rax;
 
/* Clear internal status */
-   svm-vcpu.arch.exception.pending = false;
+   kvm_clear_exception_queue(svm-vcpu);
+   kvm_clear_interrupt_queue(svm-vcpu);
 
/* Save the old vmcb, so we don't need to pick what we save, but
   can restore everything when a VMEXIT occurs */
@@ -2363,21 +2365,14 @@ static inline void svm_inject_irq(struct vcpu_svm *svm, 
int irq)
((/*control-int_vector  4*/ 0xf)  V_INTR_PRIO_SHIFT);
 }
 
-static void svm_queue_irq(struct kvm_vcpu *vcpu, unsigned nr)
-{
-   struct vcpu_svm *svm = to_svm(vcpu);
-
-   svm-vmcb-control.event_inj = nr |
-   SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_INTR;
-}
-
 static void svm_set_irq(struct kvm_vcpu *vcpu)
 {
struct vcpu_svm *svm = to_svm(vcpu);
 
-   nested_svm_intr(svm);
+   BUG_ON(!(svm-vcpu.arch.hflags  HF_GIF_MASK));
 
-   svm_queue_irq(vcpu, vcpu-arch.interrupt.nr);
+   svm-vmcb-control.event_inj = vcpu-arch.interrupt.nr |
+   SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_INTR;
 }
 
 static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
@@ -2405,13 +2400,25 @@ static int svm_interrupt_allowed(struct kvm_vcpu *vcpu)
struct vmcb *vmcb = svm-vmcb;
return (vmcb-save.rflags  X86_EFLAGS_IF) 
!(vmcb-control.int_state  SVM_INTERRUPT_SHADOW_MASK) 
-   (svm-vcpu.arch.hflags  HF_GIF_MASK);
+   (svm-vcpu.arch.hflags  HF_GIF_MASK) 
+   !is_nested(svm);
 }
 
 static void enable_irq_window(struct kvm_vcpu *vcpu)
 {
-   svm_set_vintr(to_svm(vcpu));
-   svm_inject_irq(to_svm(vcpu), 0x0);
+   struct vcpu_svm *svm = to_svm(vcpu);
+   nsvm_printk(Trying to open IRQ window\n);
+
+   nested_svm_intr(svm);
+
+   /* In case GIF=0 we can't rely on the CPU to tell us when
+* GIF becomes 1, because that's a separate STGI/VMRUN intercept.
+* The next time we get that intercept, this function will be
+* called again though and we'll get the vintr intercept. */
+   if (svm-vcpu.arch.hflags  HF_GIF_MASK) {
+   svm_set_vintr(svm);
+   svm_inject_irq(svm, 0x0);
+   }
 }
 
 static void enable_nmi_window(struct kvm_vcpu *vcpu)
@@ -2490,6 +2497,8 @@ static void svm_complete_interrupts(struct vcpu_svm *svm)
case SVM_EXITINTINFO_TYPE_EXEPT:
/* In case of software exception do not reinject an exception
   vector, but re-execute and instruction instead */
+   if (is_nested(svm))
+   break;
if (kvm_exception_is_soft(vector))
break;
if (exitintinfo  SVM_EXITINTINFO_VALID_ERR) {
--
To unsubscribe from this list: send the line unsubscribe kvm-commits in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[COMMIT master] Disable -Werror by default for kvm

From: Avi Kivity a...@redhat.com

Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/configure b/configure
index cab39bc..420e101 100755
--- a/configure
+++ b/configure
@@ -560,6 +560,8 @@ if test -z $werror ; then
 else
 werror=no
 fi
+# disable default werror for kvm
+werror=no
 fi
 
 if test $werror = yes ; then
--
To unsubscribe from this list: send the line unsubscribe kvm-commits in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[COMMIT master] Update source link

From: Avi Kivity a...@redhat.com

Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/linux-2.6 b/linux-2.6
index 77bc807..f0f1a43 16
--- a/linux-2.6
+++ b/linux-2.6
@@ -1 +1 @@
-Subproject commit 77bc807c54c1abf0808c692837a956e33a172f61
+Subproject commit f0f1a433c63883874bcf93c338deecaccf654e12
--
To unsubscribe from this list: send the line unsubscribe kvm-commits in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[COMMIT master] Backward compatibility for pt_regs.flags

From: Avi Kivity a...@redhat.com

Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/sync b/sync
index b12fb8a..b09c424 100755
--- a/sync
+++ b/sync
@@ -40,6 +40,7 @@ def __hack(data):
 'do_machine_check '
 )
 anon_inodes = anon_inodes_exit = False
+mce = False
 result = []
 def sub(regexp, repl, str):
 return re_cache(regexp).sub(repl, str)
@@ -104,6 +105,11 @@ def __hack(data):
 line = 'IF_ANON_INODES_DOES_REFCOUNTS(' + line + ')'
 if not match(r'#include'):
 line = sub(r'\blapic\n', 'l_apic', line)
+if match(r'struct pt_regs regs'):
+mce = True
+if mce and match(r'\.flags'):
+line = sub(r'flags', r'kvm_pt_regs_flags', line)
+mce = False
 w(line)
 if match(r'\tkvm_init_debug'):
 w('\thrtimer_kallsyms_resolve();')
diff --git a/x86/external-module-compat.h b/x86/external-module-compat.h
index 14eb931..881602e 100644
--- a/x86/external-module-compat.h
+++ b/x86/external-module-compat.h
@@ -577,3 +577,14 @@ static inline void kvm_do_machine_check(struct pt_regs 
*regs, long error_code)
 
 #endif
 
+/* pt_regs.flags was once pt_regs.eflags */
+
+#if LINUX_VERSION_CODE  KERNEL_VERSION(2,6,25)
+
+#define kvm_pt_regs_flags eflags
+
+#else
+
+#define kvm_pt_regs_flags flags
+
+#endif
--
To unsubscribe from this list: send the line unsubscribe kvm-commits in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[COMMIT master] Fix module version info for releases

From: Avi Kivity a...@redhat.com

For releases, we should display the release name, not kvm-devel.

Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/scripts/make-release b/scripts/make-release
index 09b4ba3..e8092e9 100755
--- a/scripts/make-release
+++ b/scripts/make-release
@@ -64,7 +64,13 @@ git --git-dir=$linux_git 
--work-tree=$tmpdir/$name/linux-2.6 \
 
 cd $tmpdir/$name
 
-./sync $name
+if [[ -z $formal ]]; then
+version=kvm-devel
+else
+version=$name
+fi
+
+./sync $name -v $version
 
 rm -rf $tmpdir/$name/linux-2.6
 
--
To unsubscribe from this list: send the line unsubscribe kvm-commits in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Configuration vs. compat hints [was Re: [Qemu-devel] [PATCHv3 03/13] qemu: add routines to manage PCI capabilities]


On 06/16/2009 09:32 PM, Jamie Lokier wrote:

Avi Kivity wrote:
   

Another issue is enumeration.  Guests will present their devices in the
order they find them on the pci bus (of course enumeration is guest
specific).  So if I have 2 virtio controllers the only way I can
distinguish between them is using their pci slots.
 


virtio controllers really should have a user-suppliable string or UUID
to identify them to the guest.  Don't they?
   


virtio controllers don't exist.  When they do, they may have a UUID or 
not, but in either case guest infrastructure is in place for reporting 
the PCI slot, not the UUID.


virtio disks do have a UUID.  I don't think older versions of Windows 
will use it though, so if you reorder your slots you'll see your drive 
letters change.  Same with Linux if you don't use udev by-uuid rules.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] add sysenter/syscall emulation for 32bit compat mode


On 06/16/2009 04:25 PM, Andre Przywara wrote:

sysenter/sysexit are not supported on AMD's 32bit compat mode, whereas
syscall is not supported on Intel's 32bit compat mode. To allow cross
vendor migration we emulate the missing instructions by setting up the
processor state accordingly.
The sysenter code was originally sketched by Amit Shah, it was completed,
debugged,  syscall added and made-to-work by Christoph Egger and polished
up by Andre Przywara.
Please note that sysret does not need to be emulated, because it will be
exectued in 64bit mode and returning to 32bit compat mode works on Intel.


diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6025e5b..9066fb3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2635,11 +2635,38 @@ int emulate_instruction(struct kvm_vcpu *vcpu,
/* Reject the instructions other than VMCALL/VMMCALL when
 * try to emulate invalid opcode */
c =vcpu-arch.emulate_ctxt.decode;
-   if ((emulation_type  EMULTYPE_TRAP_UD)
-   (!(c-twobyte  c-b == 0x01
- (c-modrm_reg == 0 || c-modrm_reg == 3)
-  c-modrm_mod == 3  c-modrm_rm == 1)))
-   return EMULATE_FAIL;
+
+   if (emulation_type  EMULTYPE_TRAP_UD) {
+   if (!c-twobyte)
+   return EMULATE_FAIL;
+   switch (c-b) {
+   case 0x01: /* VMMCALL */
+   if (c-modrm_mod != 3)
+   return EMULATE_FAIL;
+   if (c-modrm_rm != 1)
+   return EMULATE_FAIL;
+   break;
+   case 0x34: /* sysenter */
+   case 0x35: /* sysexit */
+   if (c-modrm_mod != 0)
+   return EMULATE_FAIL;
+   if (c-modrm_rm != 0)
+   return EMULATE_FAIL;
+   break;
+   case 0x05: /* syscall */
+   r = 0;
+   if (c-modrm_mod != 0)
+   return EMULATE_FAIL;
+   if (c-modrm_rm != 0)
+   return EMULATE_FAIL;
+   break;
+   default:
+   return EMULATE_FAIL;
+   }
+
+   if (!(c-modrm_reg == 0 || c-modrm_reg == 3))
+   return EMULATE_FAIL;
+   }

++vcpu-stat.insn_emulation;
if (r)  {
   


This would be better encoded in the opcode table.  No need to do it for 
this patch though.



diff --git a/arch/x86/kvm/x86_emulate.c b/arch/x86/kvm/x86_emulate.c
index 22c765d..6c65462 100644
--- a/arch/x86/kvm/x86_emulate.c
+++ b/arch/x86/kvm/x86_emulate.c
@@ -32,6 +32,8 @@
  #includelinux/module.h
  #includeasm/kvm_x86_emulate.h

+#include mmu.h
+
   


Still unneeded I presume.@@ -320,8 +324,11 @@ static u32 group2_table[] = {

  };

  /* EFLAGS bit definitions. */
+#define EFLG_VM (117)
+#define EFLG_RF (116)
  #define EFLG_OF (111)
  #define EFLG_DF (110)
+#define EFLG_IF (19)
  #define EFLG_SF (17)
  #define EFLG_ZF (16)
  #define EFLG_AF (14)
@@ -1390,6 +1397,207 @@ void toggle_interruptibility(struct x86_emulate_ctxt 
*ctxt, u32 mask)
ctxt-interruptibility = mask;
  }

+#define EMUL_SYSENTER 1
+#define EMUL_SYSEXIT 2
+#define EMUL_SYSCALL 3
+
+static int
+emulate_syscalls(struct x86_emulate_ctxt *ctxt, int ins)
+{
+   struct decode_cache *c =ctxt-decode;
+   struct kvm_segment cs, ss;
+   unsigned long cr0;
+   u64 msr_data;
+   int usermode;
+
+   /* inject #UD if LOCK prefix is used */
+   if (c-lock_prefix)
+   return -1;
   


We should encode legality of the lock prefix in the opcode tables.  
Again unrelated to this patch.



+
+   /* inject #GP if in real mode or paging is disabled */
+   cr0 = ctxt-vcpu-arch.cr0;
+   if (ctxt-mode == X86EMUL_MODE_REAL || !(cr0  X86_CR0_PE)) {
+   if (ins == EMUL_SYSENTER || ins == EMUL_SYSEXIT)
+   kvm_inject_gp(ctxt-vcpu, 0);
+   return -1;
+   }
+
+   /* XXX sysenter/sysexit have not been tested in 64bit mode.
+* Therefore, we inject an #UD.
+*/
+   if (ins == EMUL_SYSENTER  ctxt-mode == X86EMUL_MODE_PROT64)
+   return -1;
   


We should return EMULATE_FAIL rather than confusing the poor guest.


+
+   /* sysexit must be called from CPL 0 */
+   if (ins == EMUL_SYSEXIT  kvm_x86_ops-get_cpl(ctxt-vcpu) != 0) {
+   kvm_inject_gp(ctxt-vcpu, 0);
+   return -1;
+   }
+
+   /* setup common descriptor content

RE: Network throughput limits for local VM - VM communication

2009-06-17 Thread Fischer, Anna

 Subject: Re: Network throughput limits for local VM - VM
 communication
 
 Fischer, Anna wrote:
  Not sure I understand. As far as I can see the packets are replicated
 on the tun/tap interface before they actually enter the bridge. So this
 is not about the bridge learning MAC addresses and flooding frames to
 unknown destinations. So I think this is different.
 
 
 Okay.
 
 You said:
 
  However, without VLANs, the tun
  interface will pass packets to all tap interfaces. It has to, as it
  doesn't know to which one the packet has to go to.
 
 Well, it shouldn't.  The tun interface should pass the packets to just
 one tap interface.
 
 Can you post the qemu command line you're using?  There's a gotcha
 there
 that can result in what you're seeing.

Sorry for the late reply on this issue. The command line I am using looks 
roughly like this:

/usr/bin/qemu-system-x86_64 -m 1024 -smp 2 -name FC10-2 -uuid 
b811b278-fae2-a3cc-d51d-8f5b078b2477 -boot c -drive 
file=,if=ide,media=cdrom,index=2 -drive 
file=/var/lib/libvirt/images/FC10-2.img,if=virtio,index=0,boot=on -net 
nic,macaddr=54:52:00:11:ae:79,model=e1000 -net tap net 
nic,macaddr=54:52:00:11:ae:78,model=e1000 -net tap  -serial pty -parallel none 
-usb -vnc 127.0.0.1:2 -k en-gb -soundhw es1370

This is my routing VM that has two network interfaces and routes packets 
between two subnets. It has one interface plugged into bridge virbr0 and the 
other interface is plugged into virbr1:

brctl show
bridge name bridge id   STP enabled interfaces
virbr0  8000.8ac1d18c63ec   no  vnet0
vnet1
virbr1  8000.2ebfcbb9ed70   no  vnet2
vnet3

If I use the e1000 virtual NIC model, I see performance drop significantly 
compared to using virtio_net. However, with virtio_net I have the network 
stalling after a few seconds of high-throughput traffic (as I mentioned in my 
previous post). Just to reiterate my scenario: I run three guests on the same 
physical machine, one guest is my routing VM that is routing IP network traffic 
between the other two guests.

I am also wondering about the fact that I do not seem to get CPU utilization 
maxed out in this case while throughput does not go any higher. I do not 
understand what is stopping KVM from using more CPU for guest I/O processing? 
There is nothing else running on my machine. I have analyzed the amount of CPU 
that each KVM thread is using, and I can see that the thread that is running 
the VCPU of the routing VM which is processing interrupts of the e1000 virtual 
network card is using the highest amount of CPU. Is there any way that I can 
optimize my network set-up? Maybe some specific configuration of the e1000 
driver within the guest? Are there any known issues with this?

I also see very difference CPU utilization and network throughput figures when 
pinning threads to CPU cores using taskset. At one point I managed to double 
the throughput, but I could not reproduce that setup for some reason. What are 
the major issues that I would need to pay attention to when pinning threads to 
cores in order to optimize my specific set-up so that I can achieve better 
network I/O performance?

Thanks for your help.

Anna
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] add sysenter/syscall emulation for 32bit compat mode


Amit Shah wrote:

Hi Andre,

On (Tue) Jun 16 2009 [15:25:13], Andre Przywara wrote:

sysenter/sysexit are not supported on AMD's 32bit compat mode, whereas
syscall is not supported on Intel's 32bit compat mode. To allow cross
vendor migration we emulate the missing instructions by setting up the
processor state accordingly.
The sysenter code was originally sketched by Amit Shah, it was completed,
debugged,  syscall added and made-to-work by Christoph Egger and polished
up by Andre Przywara.
Please note that sysret does not need to be emulated, because it will be
exectued in 64bit mode and returning to 32bit compat mode works on Intel.


Thanks for picking this up. You also had a testcase for this, right?
Yes. We started a Linux 64bit guest on an AMD host and migrated it to an 
Intel box. On both boxes we ran a test benchmark which is measuring the 
time needed for a few million syscalls (GETPID via syscall()), both as a 
64bit and a 32bit binary. Then we repeated the test the other way round 
(started on Intel, migrated to AMD).
Without the patch the 32bit test was aborted with SIGILL. With the patch 
everything went well, excecpt that the execution time was much longer.
Earlier we observed that this shouldn't be much of an issue in real 
life, as a tight-loop execution of GETPID is a real corner case.


Regards,
Andre.

--
Andre Przywara
AMD-OSRC (Dresden)
Tel: x29712

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Network throughput limits for local VM - VM communication


On 06/17/2009 10:36 AM, Fischer, Anna wrote:


/usr/bin/qemu-system-x86_64 -m 1024 -smp 2 -name FC10-2 -uuid 
b811b278-fae2-a3cc-d51d-8f5b078b2477 -boot c -drive 
file=,if=ide,media=cdrom,index=2 -drive 
file=/var/lib/libvirt/images/FC10-2.img,if=virtio,index=0,boot=on -net 
nic,macaddr=54:52:00:11:ae:79,model=e1000 -net tap net 
nic,macaddr=54:52:00:11:ae:78,model=e1000 -net tap  -serial pty -parallel none 
-usb -vnc 127.0.0.1:2 -k en-gb -soundhw es1370

   


Okay, like I suspected, qemu has a trap here and you walked into it.  
The -net option plugs the device you specify into a virtual hub.  The 
command line you provided plugs the two virtual NICs and the two tap 
devices into one virtual hub, so any packet received from any of the 
four clients will be propagated to the other three.


To get this to work right, specify the vlan= parameter which says which 
virtual hub a component is plugged into.  Note this has nothing to do 
with 802.blah vlans.


So your command line should look like

   qemu ... -net nic,...,vlan=0 -net tap,...,vlan=0 -net nic,...,vlan=1 
-net tap,...,vlan=1


This will give you two virtual hubs, each bridging a virtual nic to a 
tap device.




This is my routing VM that has two network interfaces and routes packets 
between two subnets. It has one interface plugged into bridge virbr0 and the other 
interface is plugged into virbr1:

brctl show
bridge name bridge id   STP enabled interfaces
virbr0  8000.8ac1d18c63ec   no  vnet0
 vnet1
virbr1  8000.2ebfcbb9ed70   no  vnet2
 vnet3
   


Please redo the tests with qemu vlans but without 802.blah vlans, so we 
see what happens without packet duplication.



If I use the e1000 virtual NIC model, I see performance drop significantly 
compared to using virtio_net. However, with virtio_net I have the network 
stalling after a few seconds of high-throughput traffic (as I mentioned in my 
previous post). Just to reiterate my scenario: I run three guests on the same 
physical machine, one guest is my routing VM that is routing IP network traffic 
between the other two guests.

I am also wondering about the fact that I do not seem to get CPU utilization 
maxed out in this case while throughput does not go any higher. I do not 
understand what is stopping KVM from using more CPU for guest I/O processing? 
There is nothing else running on my machine. I have analyzed the amount of CPU 
that each KVM thread is using, and I can see that the thread that is running 
the VCPU of the routing VM which is processing interrupts of the e1000 virtual 
network card is using the highest amount of CPU. Is there any way that I can 
optimize my network set-up? Maybe some specific configuration of the e1000 
driver within the guest? Are there any known issues with this?
   


There are known issues with lack of flow control while sending packets 
out of a guest.  If the guest runs tcp that tends to correct for it, but 
if you run a lower level protocol that doesn't have its own flow 
control, the guest may spend a lot of cpu generating packets that are 
eventually dropped.  We are working on fixing this.



I also see very difference CPU utilization and network throughput figures when 
pinning threads to CPU cores using taskset. At one point I managed to double 
the throughput, but I could not reproduce that setup for some reason. What are 
the major issues that I would need to pay attention to when pinning threads to 
cores in order to optimize my specific set-up so that I can achieve better 
network I/O performance?
   


It's black magic, unfortunately.  But please retry with the fixed 
configuration and we'll continue from there.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

RE: Network throughput limits for local VM - VM communication

2009-06-17 Thread Fischer, Anna

 Subject: Re: Network throughput limits for local VM - VM
 communication
 
 On 06/17/2009 10:36 AM, Fischer, Anna wrote:
 
  /usr/bin/qemu-system-x86_64 -m 1024 -smp 2 -name FC10-2 -uuid
 b811b278-fae2-a3cc-d51d-8f5b078b2477 -boot c -drive
 file=,if=ide,media=cdrom,index=2 -drive
 file=/var/lib/libvirt/images/FC10-2.img,if=virtio,index=0,boot=on -net
 nic,macaddr=54:52:00:11:ae:79,model=e1000 -net tap net
 nic,macaddr=54:52:00:11:ae:78,model=e1000 -net tap  -serial pty -
 parallel none -usb -vnc 127.0.0.1:2 -k en-gb -soundhw es1370
 
 
 
 Okay, like I suspected, qemu has a trap here and you walked into it.
 The -net option plugs the device you specify into a virtual hub.  The
 command line you provided plugs the two virtual NICs and the two tap
 devices into one virtual hub, so any packet received from any of the
 four clients will be propagated to the other three.
 
 To get this to work right, specify the vlan= parameter which says which
 virtual hub a component is plugged into.  Note this has nothing to do
 with 802.blah vlans.
 
 So your command line should look like
 
 qemu ... -net nic,...,vlan=0 -net tap,...,vlan=0 -net
 nic,...,vlan=1
 -net tap,...,vlan=1
 
 This will give you two virtual hubs, each bridging a virtual nic to a
 tap device.
 
  This is my routing VM that has two network interfaces and routes
 packets between two subnets. It has one interface plugged into bridge
 virbr0 and the other interface is plugged into virbr1:
 
  brctl show
  bridge name bridge id   STP enabled interfaces
  virbr0  8000.8ac1d18c63ec   no  vnet0
   vnet1
  virbr1  8000.2ebfcbb9ed70   no  vnet2
   vnet3
 
 
 Please redo the tests with qemu vlans but without 802.blah vlans, so we
 see what happens without packet duplication.

Avi, thanks for your quick reply. I do use the vlan= parameter now, and yes, I 
do not see packet duplication any more, so everything you said is right and I 
do understand now why I was seeing packets on both bridges before. So this has 
nothing to do with tun/tap then but just with the way QEMU virtual hubs work. 
I didn't know about any details on that before.

Even with vlan= enabled, I am still having the same issues with weird CPU 
utilization and low throughput that I have described below.
 
  If I use the e1000 virtual NIC model, I see performance drop
 significantly compared to using virtio_net. However, with virtio_net I
 have the network stalling after a few seconds of high-throughput
 traffic (as I mentioned in my previous post). Just to reiterate my
 scenario: I run three guests on the same physical machine, one guest is
 my routing VM that is routing IP network traffic between the other two
 guests.
 
  I am also wondering about the fact that I do not seem to get CPU
 utilization maxed out in this case while throughput does not go any
 higher. I do not understand what is stopping KVM from using more CPU
 for guest I/O processing? There is nothing else running on my machine.
 I have analyzed the amount of CPU that each KVM thread is using, and I
 can see that the thread that is running the VCPU of the routing VM
 which is processing interrupts of the e1000 virtual network card is
 using the highest amount of CPU. Is there any way that I can optimize
 my network set-up? Maybe some specific configuration of the e1000
 driver within the guest? Are there any known issues with this?
 
 
 There are known issues with lack of flow control while sending packets
 out of a guest.  If the guest runs tcp that tends to correct for it,
 but
 if you run a lower level protocol that doesn't have its own flow
 control, the guest may spend a lot of cpu generating packets that are
 eventually dropped.  We are working on fixing this.

For the tests I run now (with vlan= enabled) I am actually using both TCP and 
UDP, and I see the problem with virtio_net for both protocols. What I am 
wondering about though is that I do not seem to have any problems if I 
communicate directly between the two guests (if I plug then into the same 
bridge and put them onto the same networks), so why do I only see the problem 
of stalling network communication when there is a routing VM in the network 
path? Is this just because the system is even more overloaded in that case? Or 
could this be an issue related to a dual NIC configuration or the fact that I 
run multiple bridges on the same physical machine?

When you say We are working on fixing this. - which code parts are you 
working on? Is this in the QEMU network I/O processing code or is this 
virtio_net related?

  I also see very difference CPU utilization and network throughput
 figures when pinning threads to CPU cores using taskset. At one point I
 managed to double the throughput, but I could not reproduce that setup
 for some reason. What are the major issues that I would

Re: Configuration vs. compat hints [was Re: [Qemu-devel] [PATCHv3 03/13] qemu: add routines to manage PCI capabilities]

2009-06-17 Thread Mark McLoughlin

On Tue, 2009-06-16 at 19:44 +0100, Jamie Lokier wrote:
 Mark McLoughlin wrote:
   Worst case we hardcode those numbers (gasp, faint).
  
  Maybe we can just add the open slots to the -help output. That'd be nice
  and clean.

I was being sarcastic - libvirt currently must parse qemu -help, and
even has some test infrastructure to check that it works with various
versions of qemu. Extending this would not be nice and clean :-)

 I particularly don't like the idea of arcane machine-dependent slot
 allocation knowledge living in libvirt, because it needs to be in Qemu
 anyway for non-libvirt users.  No point in having two implementations
 of something tricky and likely to have machine quirks, if one will do.

Indeed.

Cheers,
Mark.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

smp guest questions


After seeing words from Avi about that smp guests
are ok now, I descided to try.  And immediately
got a few questions.

Running on a Phenom 9750 machine (PhenomI), AMD780G
chipset.  Host is 2.6.29 x86-64, qemu-kvm 0.10.5,
guests are linux with kvm paravirt bits enabled, also
dynticks (on both host and guest).


When booting a 2-CPU guest, I see in dmesg:

PM-Timer running at invalid rate: 200% of normal - aborting.

and indeed, in available_clocksource there's no pmtimer.
Should I be concerned?  It does not look healthy.


Some time later, I see stuff like:

hrtimer: interrupt too slow, forcing clock min delta to 47210997 ns

Which reminds me issues I had with broken hpet (time goes
back-n-forth with similar messages shown in dmesg, but
about hpet not hrtimer).  Also does not look healthy.


I haven't seen either of the two messages above on any of
single-processor guests so far, at least with recent kernels
and kvm userspace, only on smp (2 cpu for now).

Thanks!

/mjt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Configuration vs. compat hints [was Re: [Qemu-devel] [PATCHv3 03/13] qemu: add routines to manage PCI capabilities]


On 06/17/2009 11:33 AM, Mark McLoughlin wrote:

I particularly don't like the idea of arcane machine-dependent slot
allocation knowledge living in libvirt, because it needs to be in Qemu
anyway for non-libvirt users.  No point in having two implementations
of something tricky and likely to have machine quirks, if one will do.
 


Indeed.
   


I don't understand this.  Management already has to allocate MAC 
addresses, UUIDs, IDE interface and master/slave role, SCSI 
LUNs/targets/whatever.  It has to understand NUMA (if not do actual 
allocation).  Even if it doesn't allocate the slots, it has to be able 
to query them so it can tell the user which NIC or controller is 
connected where, or to do hotunplug.  It has to understand that there is 
a limitation on the number of slots, and know what that limitation is 
(unless it feels that launching an overcommitted guest and showing an 
error to the user is preferable to not allowing the user to overcommit 
in the first place.


If you'll review my patent application for pci slot allocation, you'll 
see the following line:


  slot_nr = nb_allocated_slots++; /* Allocate pci slot */

while there is a lot of complicated setup code before that (see the 
prior art section as well), I believe licensees could well implement the 
algorithm in two short months, including testing.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: smp guest questions


On 06/17/2009 11:38 AM, Michael Tokarev wrote:

After seeing words from Avi about that smp guests
are ok now, I descided to try.  And immediately
got a few questions.

Running on a Phenom 9750 machine (PhenomI), AMD780G
chipset.  Host is 2.6.29 x86-64, qemu-kvm 0.10.5,
guests are linux with kvm paravirt bits enabled, also
dynticks (on both host and guest).


When booting a 2-CPU guest, I see in dmesg:

PM-Timer running at invalid rate: 200% of normal - aborting.

and indeed, in available_clocksource there's no pmtimer.
Should I be concerned?  It does not look healthy.



It's a bug, please post guest details (kernel version, bitness).

Copying Marcelo.



Some time later, I see stuff like:

hrtimer: interrupt too slow, forcing clock min delta to 47210997 ns

Which reminds me issues I had with broken hpet (time goes
back-n-forth with similar messages shown in dmesg, but
about hpet not hrtimer).  Also does not look healthy.


I haven't seen either of the two messages above on any of
single-processor guests so far, at least with recent kernels
and kvm userspace, only on smp (2 cpu for now).


Please also post host /proc/cpuifo.

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Configuration vs. compat hints [was Re: [Qemu-devel] [PATCHv3 03/13] qemu: add routines to manage PCI capabilities]

2009-06-17 Thread Mark McLoughlin

On Wed, 2009-06-17 at 12:03 +0300, Avi Kivity wrote:
 On 06/17/2009 11:33 AM, Mark McLoughlin wrote:
  I particularly don't like the idea of arcane machine-dependent slot
  allocation knowledge living in libvirt, because it needs to be in Qemu
  anyway for non-libvirt users.  No point in having two implementations
  of something tricky and likely to have machine quirks, if one will do.
 
  Indeed.
 
 I don't understand this.

Take note of the arcane machine-dependent slot allocation knowledge
bit.

If the algorithm in for management apps is as simple as query qemu for
available slots and sequentially allocate slots, then that's perfectly
fine.

If management apps need to hard-code which slots are available on
different targets and different qemu versions, or restrictions on which
devices can use which slots, or knowledge that some devices can be
multi-function, or ... anything like that is just lame.

Cheers,
Mark.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Configuration vs. compat hints [was Re: [Qemu-devel] [PATCHv3 03/13] qemu: add routines to manage PCI capabilities]


On 06/17/2009 12:18 PM, Mark McLoughlin wrote:

On Wed, 2009-06-17 at 12:03 +0300, Avi Kivity wrote:
   

On 06/17/2009 11:33 AM, Mark McLoughlin wrote:
 

I particularly don't like the idea of arcane machine-dependent slot
allocation knowledge living in libvirt, because it needs to be in Qemu
anyway for non-libvirt users.  No point in having two implementations
of something tricky and likely to have machine quirks, if one will do.
 

Indeed.
   

I don't understand this.
 


Take note of the arcane machine-dependent slot allocation knowledge
bit.

If the algorithm in for management apps is as simple as query qemu for
available slots and sequentially allocate slots, then that's perfectly
fine.
   


That's the thinking.


If management apps need to hard-code which slots are available on
different targets and different qemu versions, or restrictions on which
devices can use which slots, or knowledge that some devices can be
multi-function, or ... anything like that is just lame.
   


You can't abstract these things away.  If you can't put a NIC in slot 4, 
and you have 7 slots, then you cannot have 7 NICs.  Having qemu allocate 
the slot numbers does not absolve management from knowing this 
limitation and preventing the user from creating a machine with 7 slots.


Likewise, management will have to know which devices are multi-function, 
since that affects their hotpluggability.  Ditto if some slot if faster 
than others, if you want to make use of this information you have to let 
the upper layers know.


It could be done using an elaborate machine description that qemu 
exposes to management coupled with a constraint solver that optimizes 
the machine layout according to user specifications and hardware 
limitations.  Or we could take the view that real life is not perfect 
(especially where computers are involved), add some machine specific 
knowledge, and spend the rest of the summer at the beach.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: smp guest questions


Avi Kivity wrote:

On 06/17/2009 11:38 AM, Michael Tokarev wrote:

After seeing words from Avi about that smp guests
are ok now, I descided to try.  And immediately
got a few questions.

Running on a Phenom 9750 machine (PhenomI), AMD780G
chipset.  Host is 2.6.29 x86-64, qemu-kvm 0.10.5,
guests are linux with kvm paravirt bits enabled, also
dynticks (on both host and guest).


When booting a 2-CPU guest, I see in dmesg:

PM-Timer running at invalid rate: 200% of normal - aborting.

and indeed, in available_clocksource there's no pmtimer.
Should I be concerned?  It does not look healthy.



It's a bug, please post guest details (kernel version, bitness).


The guest kernel is also 2.6.29[.5], but this time it's x86-32
(compiled for P4).  kvm userspace is also 32bits (historical) --
only host kernel is 64bit for now.  I'll try to do some more
experiments later today on a test machine (this is a production
box) -- hopefully that same issue will occur on another
machine :)


Copying Marcelo.



Some time later, I see stuff like:

hrtimer: interrupt too slow, forcing clock min delta to 47210997 ns

Which reminds me issues I had with broken hpet (time goes
back-n-forth with similar messages shown in dmesg, but
about hpet not hrtimer).  Also does not look healthy.


I haven't seen either of the two messages above on any of
single-processor guests so far, at least with recent kernels
and kvm userspace, only on smp (2 cpu for now).


Please also post host /proc/cpuifo.


HOST cpuinfo (only for 4th core, other cores are similar):
processor   : 3
vendor_id   : AuthenticAMD
cpu family  : 16
model   : 2
model name  : AMD Phenom(tm) 9750 Quad-Core Processor
stepping: 3
cpu MHz : 1200.000
(yes ondemand cpufreq is in effect - nominal frequency is 2400.
I had no issues with cpufreq on this box so far, including all
the guests).
cache size  : 512 KB
physical id : 0
siblings: 4
core id : 3
cpu cores   : 4
apicid  : 3
initial apicid  : 3
fpu : yes
fpu_exception   : yes
cpuid level : 5
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb 
rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc pni monitor cx16 
lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch 
osvw ibs
bogomips: 4812.67
TLB size: 1024 4K pages
clflush size: 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate



cpuinfo on GUEST (also for only one CPU):

processor   : 1
vendor_id   : AuthenticAMD
cpu family  : 6
model   : 2
model name  : QEMU Virtual CPU version 0.10.5
stepping: 3
cpu MHz : 2405.894
cache size  : 512 KB
fdiv_bug: no
hlt_bug : no
f00f_bug: no
coma_bug: no
fpu : yes
fpu_exception   : yes
cpuid level : 2
wp  : yes
flags   : fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat 
pse36 clflush mmx fxsr sse sse2 syscall lm pni hypervisor
bogomips: 4811.78
clflush size: 64
power management:


Thanks!

/mjt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Configuration vs. compat hints [was Re: [Qemu-devel] [PATCHv3 03/13] qemu: add routines to manage PCI capabilities]

2009-06-17 Thread Jamie Lokier

Avi Kivity wrote:
 On 06/16/2009 09:32 PM, Jamie Lokier wrote:
 Avi Kivity wrote:

 Another issue is enumeration.  Guests will present their devices in the
 order they find them on the pci bus (of course enumeration is guest
 specific).  So if I have 2 virtio controllers the only way I can
 distinguish between them is using their pci slots.
 
 virtio controllers really should have a user-suppliable string or UUID
 to identify them to the guest.  Don't they?
 
 virtio controllers don't exist.  When they do, they may have a UUID or 
 not, but in either case guest infrastructure is in place for reporting 
 the PCI slot, not the UUID.
 
 virtio disks do have a UUID.  I don't think older versions of Windows 
 will use it though, so if you reorder your slots you'll see your drive 
 letters change.  Same with Linux if you don't use udev by-uuid rules.

I guess I meant virtio disks, so that's ok.

-- Jamie
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch 1/2] KVM: convert custom marker based tracing to event traces


On 06/17/2009 01:13 AM, Marcelo Tosatti wrote:

This allows use of the powerful ftrace infrastructure.

See Documentation/trace/ for usage information.

   


This only applied with -C0, so I'm worried something's broken.  Please 
regenerate.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCHv4 01/13] qemu: make default_write_config use mask table

2009-06-17 Thread Michael S. Tsirkin

On Tue, Jun 16, 2009 at 03:40:13PM -0500, Anthony Liguori wrote:
 Michael S. Tsirkin wrote:
 Change much of hw/pci to use symbolic constants and a table-driven
 design: add a mask table with writable bits set and readonly bits unset.
 Detect change by comparing original and new registers.

 This makes it easy to support capabilities where read-only/writeable
 bit layout differs between devices, depending on capabilities present.

 As a result, writing a single byte in BAR registers now works as
 it should. Writing to upper limit registers in the bridge
 also works as it should. Code is also shorter.

 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 This series introduces warning (virtio_load decl/def does not match).

Seems to match and build cleanly with -Werror for me.
What warning do you see?


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch 2/2] KVM: remove old KVMTRACE support code


On 06/17/2009 01:13 AM, Marcelo Tosatti wrote:

Return EOPNOTSUPP for KVM_TRACE_ENABLE/PAUSE/DISABLE ioctls.

   


The first feature we remove, we don't have a KVM_CAP for.

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Configuration vs. compat hints [was Re: [Qemu-devel] [PATCHv3 03/13] qemu: add routines to manage PCI capabilities]

2009-06-17 Thread Jamie Lokier

Avi Kivity wrote:
 If management apps need to hard-code which slots are available on
 different targets and different qemu versions, or restrictions on which
 devices can use which slots, or knowledge that some devices can be
 multi-function, or ... anything like that is just lame.

 
 You can't abstract these things away.  If you can't put a NIC in slot 4, 
 and you have 7 slots, then you cannot have 7 NICs.  Having qemu allocate 
 the slot numbers does not absolve management from knowing this 
 limitation and preventing the user from creating a machine with 7 slots.
 
 Likewise, management will have to know which devices are multi-function, 
 since that affects their hotpluggability.  Ditto if some slot if faster 
 than others, if you want to make use of this information you have to let 
 the upper layers know.
 
 It could be done using an elaborate machine description that qemu 
 exposes to management coupled with a constraint solver that optimizes 
 the machine layout according to user specifications and hardware 
 limitations.  Or we could take the view that real life is not perfect 
 (especially where computers are involved), add some machine specific 
 knowledge, and spend the rest of the summer at the beach.

To be honest, an elaborate machine description is probably fine...

A fancy constraint solver is not required.  A simple one strikes me as
about as simple as what you'd hard-code anyway, but with fewer special
cases.

Note that the result can fail due to things like insufficient address
space for all the device BARs even when they _are_ in the right slots.
Especially if there are lots of slots, or bridges which can provide
unlimited slots.

That is arcane: device-dependent, CPU-dependent, machine-dependent,
RAM-size dependent (in a non-linear way), device-option-dependent and
probably QEMU-version-dependent too.

It would be nice if libvirt (et al) would prevent the user from
creating a VM with insufficient BAR space for that machine, but I'm
not sure how to do it sanely, without arcane knowledge getting about.

Maybe that idea of a .so shared by qemu and libvirt, to manipulate
device configurations, is a sane one after all.

-- Jamie
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Network throughput limits for local VM - VM communication

On 06/17/2009 11:12 AM, Fischer, Anna wrote:

For the tests I run now (with vlan= enabled) I am actually using both TCP and
UDP, and I see the problem with virtio_net for both protocols. What I am
wondering about though is that I do not seem to have any problems if I
communicate directly between the two guests (if I plug then into the same
bridge and put them onto the same networks), so why do I only see the problem
of stalling network communication when there is a routing VM in the network
path? Is this just because the system is even more overloaded in that case? Or
could this be an issue related to a dual NIC configuration or the fact that I
run multiple bridges on the same physical machine?

My guess is that somewhere there's a queue that's shorter that the
virtio queue, or its usable size fluctuates (because it is shared with
something else). So TCP flow control doesn't work, and UDP doesn't have
a chance.

When you say We are working on fixing this. - which code parts are you
working on? Is this in the QEMU network I/O processing code or is this virtio_net related?

tap. virtio, qemu, maybe more. It's a difficult problem.

Retry with the fixed configuration? You mean setting the vlan= parameter? I
have already used the vlan= parameter for the latest tests, and so the CPU utilization
issues I am talking about are happening with that configuration.

Yeah.

Can you compare total data sent and received as seen by the guests?
That would confirm that packets being dropped causes the slowdown.

--
Do not meddle in the internals of kernels, for they are subtle and quick to
panic.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[patch 0/2] replace kvmtrace with event traces rebased

-- 

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[patch 2/2] KVM: remove old KVMTRACE support code

Return EOPNOTSUPP for KVM_TRACE_ENABLE/PAUSE/DISABLE ioctls.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com


Index: kvm/virt/kvm/kvm_main.c
===
--- kvm.orig/virt/kvm/kvm_main.c
+++ kvm/virt/kvm/kvm_main.c
@@ -2367,7 +2367,7 @@ static long kvm_dev_ioctl(struct file *f
case KVM_TRACE_ENABLE:
case KVM_TRACE_PAUSE:
case KVM_TRACE_DISABLE:
-   r = kvm_trace_ioctl(ioctl, arg);
+   r = -EOPNOTSUPP;
break;
default:
return kvm_arch_dev_ioctl(filp, ioctl, arg);
@@ -2717,7 +2717,6 @@ EXPORT_SYMBOL_GPL(kvm_init);
 
 void kvm_exit(void)
 {
-   kvm_trace_cleanup();
tracepoint_synchronize_unregister();
misc_deregister(kvm_dev);
kmem_cache_destroy(kvm_vcpu_cache);
Index: kvm/include/linux/kvm_host.h
===
--- kvm.orig/include/linux/kvm_host.h
+++ kvm/include/linux/kvm_host.h
@@ -482,37 +482,6 @@ struct kvm_stats_debugfs_item {
 extern struct kvm_stats_debugfs_item debugfs_entries[];
 extern struct dentry *kvm_debugfs_dir;
 
-#define KVMTRACE_5D(evt, vcpu, d1, d2, d3, d4, d5, name) \
-   trace_mark(kvm_trace_##name, %u %p %u %u %u %u %u %u, KVM_TRC_##evt, \
-   vcpu, 5, d1, d2, d3, d4, d5)
-#define KVMTRACE_4D(evt, vcpu, d1, d2, d3, d4, name) \
-   trace_mark(kvm_trace_##name, %u %p %u %u %u %u %u %u, KVM_TRC_##evt, \
-   vcpu, 4, d1, d2, d3, d4, 0)
-#define KVMTRACE_3D(evt, vcpu, d1, d2, d3, name) \
-   trace_mark(kvm_trace_##name, %u %p %u %u %u %u %u %u, KVM_TRC_##evt, \
-   vcpu, 3, d1, d2, d3, 0, 0)
-#define KVMTRACE_2D(evt, vcpu, d1, d2, name) \
-   trace_mark(kvm_trace_##name, %u %p %u %u %u %u %u %u, KVM_TRC_##evt, \
-   vcpu, 2, d1, d2, 0, 0, 0)
-#define KVMTRACE_1D(evt, vcpu, d1, name) \
-   trace_mark(kvm_trace_##name, %u %p %u %u %u %u %u %u, KVM_TRC_##evt, \
-   vcpu, 1, d1, 0, 0, 0, 0)
-#define KVMTRACE_0D(evt, vcpu, name) \
-   trace_mark(kvm_trace_##name, %u %p %u %u %u %u %u %u, KVM_TRC_##evt, \
-   vcpu, 0, 0, 0, 0, 0, 0)
-
-#ifdef CONFIG_KVM_TRACE
-int kvm_trace_ioctl(unsigned int ioctl, unsigned long arg);
-void kvm_trace_cleanup(void);
-#else
-static inline
-int kvm_trace_ioctl(unsigned int ioctl, unsigned long arg)
-{
-   return -EINVAL;
-}
-#define kvm_trace_cleanup() ((void)0)
-#endif
-
 #ifdef KVM_ARCH_WANT_MMU_NOTIFIER
 static inline int mmu_notifier_retry(struct kvm_vcpu *vcpu, unsigned long 
mmu_seq)
 {
Index: kvm/virt/kvm/kvm_trace.c
===
--- kvm.orig/virt/kvm/kvm_trace.c
+++ /dev/null
@@ -1,285 +0,0 @@
-/*
- * kvm trace
- *
- * It is designed to allow debugging traces of kvm to be generated
- * on UP / SMP machines.  Each trace entry can be timestamped so that
- * it's possible to reconstruct a chronological record of trace events.
- * The implementation refers to blktrace kernel support.
- *
- * Copyright (c) 2008 Intel Corporation
- * Copyright (C) 2006 Jens Axboe ax...@kernel.dk
- *
- * Authors: Feng(Eric) Liu, eric.e@intel.com
- *
- * Date:Feb 2008
- */
-
-#include linux/module.h
-#include linux/relay.h
-#include linux/debugfs.h
-#include linux/ktime.h
-
-#include linux/kvm_host.h
-
-#define KVM_TRACE_STATE_RUNNING(1  0)
-#define KVM_TRACE_STATE_PAUSE  (1  1)
-#define KVM_TRACE_STATE_CLEARUP(1  2)
-
-struct kvm_trace {
-   int trace_state;
-   struct rchan *rchan;
-   struct dentry *lost_file;
-   atomic_t lost_records;
-};
-static struct kvm_trace *kvm_trace;
-
-struct kvm_trace_probe {
-   const char *name;
-   const char *format;
-   u32 timestamp_in;
-   marker_probe_func *probe_func;
-};
-
-static inline int calc_rec_size(int timestamp, int extra)
-{
-   int rec_size = KVM_TRC_HEAD_SIZE;
-
-   rec_size += extra;
-   return timestamp ? rec_size += KVM_TRC_CYCLE_SIZE : rec_size;
-}
-
-static void kvm_add_trace(void *probe_private, void *call_data,
- const char *format, va_list *args)
-{
-   struct kvm_trace_probe *p = probe_private;
-   struct kvm_trace *kt = kvm_trace;
-   struct kvm_trace_rec rec;
-   struct kvm_vcpu *vcpu;
-   inti, size;
-   u32extra;
-
-   if (unlikely(kt-trace_state != KVM_TRACE_STATE_RUNNING))
-   return;
-
-   rec.rec_val = TRACE_REC_EVENT_ID(va_arg(*args, u32));
-   vcpu= va_arg(*args, struct kvm_vcpu *);
-   rec.pid = current-tgid;
-   rec.vcpu_id = vcpu-vcpu_id;
-
-   extra   = va_arg(*args, u32);
-   WARN_ON(!(extra = KVM_TRC_EXTRA_MAX));
-   extra

[patch 1/2] KVM: convert custom marker based tracing to event traces

This allows use of the powerful ftrace infrastructure.

See Documentation/trace/ for usage information.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: kvm/arch/x86/kvm/lapic.c
===
--- kvm.orig/arch/x86/kvm/lapic.c
+++ kvm/arch/x86/kvm/lapic.c
@@ -34,6 +34,7 @@
 #include asm/atomic.h
 #include kvm_cache_regs.h
 #include irq.h
+#include trace.h
 
 #ifndef CONFIG_X86_64
 #define mod_64(x, y) ((x) - (y) * div64_u64(x, y))
@@ -515,8 +516,6 @@ static u32 __apic_read(struct kvm_lapic 
 {
u32 val = 0;
 
-   KVMTRACE_1D(APIC_ACCESS, apic-vcpu, (u32)offset, handler);
-
if (offset = LAPIC_MMIO_LENGTH)
return 0;
 
@@ -562,6 +561,8 @@ static void apic_mmio_read(struct kvm_io
}
result = __apic_read(apic, offset  ~0xf);
 
+   trace_kvm_apic_read(offset, result);
+
switch (len) {
case 1:
case 2:
@@ -657,7 +658,7 @@ static void apic_mmio_write(struct kvm_i
 
offset = 0xff0;
 
-   KVMTRACE_1D(APIC_ACCESS, apic-vcpu, (u32)offset, handler);
+   trace_kvm_apic_write(offset, val);
 
switch (offset) {
case APIC_ID:   /* Local APIC ID */
Index: kvm/arch/x86/kvm/svm-trace.h
===
--- /dev/null
+++ kvm/arch/x86/kvm/svm-trace.h
@@ -0,0 +1,51 @@
+#define exit_reasons svm_exit_reasons
+#define svm_exit_reasons   \
+   {SVM_EXIT_READ_CR0, read_cr0},\
+   {SVM_EXIT_READ_CR3, read_cr3},\
+   {SVM_EXIT_READ_CR4, read_cr4},\
+   {SVM_EXIT_READ_CR8, read_cr8},\
+   {SVM_EXIT_WRITE_CR0,write_cr0},   \
+   {SVM_EXIT_WRITE_CR3,write_cr3},   \
+   {SVM_EXIT_WRITE_CR4,write_cr4},   \
+   {SVM_EXIT_WRITE_CR8,write_cr8},   \
+   {SVM_EXIT_READ_DR0, read_dr0},\
+   {SVM_EXIT_READ_DR1, read_dr1},\
+   {SVM_EXIT_READ_DR2, read_dr2},\
+   {SVM_EXIT_READ_DR3, read_dr3},\
+   {SVM_EXIT_WRITE_DR0,write_dr0},   \
+   {SVM_EXIT_WRITE_DR1,write_dr1},   \
+   {SVM_EXIT_WRITE_DR2,write_dr2},   \
+   {SVM_EXIT_WRITE_DR3,write_dr3},   \
+   {SVM_EXIT_WRITE_DR5,write_dr5},   \
+   {SVM_EXIT_WRITE_DR7,write_dr7},   \
+   {SVM_EXIT_EXCP_BASE + DB_VECTOR,DB excp}, \
+   {SVM_EXIT_EXCP_BASE + BP_VECTOR,BP excp}, \
+   {SVM_EXIT_EXCP_BASE + UD_VECTOR,UD excp}, \
+   {SVM_EXIT_EXCP_BASE + PF_VECTOR,PF excp}, \
+   {SVM_EXIT_EXCP_BASE + NM_VECTOR,NM excp}, \
+   {SVM_EXIT_EXCP_BASE + MC_VECTOR,MC excp}, \
+   {SVM_EXIT_INTR, interrupt},   \
+   {SVM_EXIT_NMI,  nmi}, \
+   {SVM_EXIT_SMI,  smi}, \
+   {SVM_EXIT_INIT, init},\
+   {SVM_EXIT_VINTR,vintr},   \
+   {SVM_EXIT_CPUID,cpuid},   \
+   {SVM_EXIT_INVD, invd},\
+   {SVM_EXIT_HLT,  hlt}, \
+   {SVM_EXIT_INVLPG,   invlpg},  \
+   {SVM_EXIT_INVLPGA,  invlpga}, \
+   {SVM_EXIT_IOIO, io},  \
+   {SVM_EXIT_MSR,  msr}, \
+   {SVM_EXIT_TASK_SWITCH,  task_switch}, \
+   {SVM_EXIT_SHUTDOWN, shutdown},\
+   {SVM_EXIT_VMRUN,vmrun},   \
+   {SVM_EXIT_VMMCALL,  hypercall},   \
+   {SVM_EXIT_VMLOAD,   vmload},  \
+   {SVM_EXIT_VMSAVE,   vmsave},  \
+   {SVM_EXIT_STGI, stgi},\
+   {SVM_EXIT_CLGI, clgi},\
+   {SVM_EXIT_SKINIT,   skinit},  \
+   {SVM_EXIT_WBINVD,   wbinvd},  \
+   {SVM_EXIT_MONITOR,  monitor}, \
+   {SVM_EXIT_MWAIT,mwait},   \
+   {SVM_EXIT_NPF,

Re: KVM: init bsp_vcpu before kvm_arch_vcpu_init

2009-06-17 Thread Gleb Natapov

On Tue, Jun 16, 2009 at 11:33:16AM -0300, Marcelo Tosatti wrote:
 
 On x86 mp_state is initialized by kvm_arch_vcpu_init. Right
 now kvm_vcpu_is_bsp returns false because kvm-bsp_vcpu has
 not been initialized, so vcpu_id == 0 ends up with mp_state ==
 KVM_MP_STATE_UNINITIALIZED.
 
 Gleb do you see a better way to fix this?
 
I have two, not necessary better ways. The first one is to change
kvm_vcpu_is_bsp() to do kvm-bsp_vcpu_id == vcpu-vcpu_id. Another one
is to understand why mp_state is set to runnable for bsp here at all. May
be we can drop this use of kvm_vcpu_is_bsp() since mp_state will be set
to RUNNABLE in kvm_arch_vcpu_ioctl_set_sregs() anyway?

 
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
 
 
 diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
 index 8939ffa..7225064 100644
 --- a/virt/kvm/kvm_main.c
 +++ b/virt/kvm/kvm_main.c
 @@ -773,6 +773,13 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm 
 *kvm, unsigned id)
   struct page *page;
   int r;
  
 + mutex_lock(kvm-lock);
 +#ifdef CONFIG_KVM_APIC_ARCHITECTURE
 + if (kvm-bsp_vcpu_id == id)
 + kvm-bsp_vcpu = vcpu;
 +#endif
 + mutex_unlock(kvm-lock);
 +
   mutex_init(vcpu-mutex);
   vcpu-cpu = -1;
   vcpu-kvm = kvm;
 @@ -1760,14 +1767,12 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, 
 u32 id)
   smp_wmb();
   atomic_inc(kvm-online_vcpus);
  
 -#ifdef CONFIG_KVM_APIC_ARCHITECTURE
 - if (kvm-bsp_vcpu_id == id)
 - kvm-bsp_vcpu = vcpu;
 -#endif
   mutex_unlock(kvm-lock);
   return r;
  
  vcpu_destroy:
 + if (kvm-bsp_vcpu_id == id)
 + kvm-bsp_vcpu = NULL;
   mutex_unlock(kvm-lock);
   kvm_arch_vcpu_destroy(vcpu);
   return r;

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch 0/2] replace kvmtrace with event traces rebased


On 06/17/2009 03:22 PM, Marcelo Tosatti wrote:

Applied, thanks.

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: KVM: init bsp_vcpu before kvm_arch_vcpu_init

On Wed, Jun 17, 2009 at 03:29:05PM +0300, Gleb Natapov wrote:
 On Tue, Jun 16, 2009 at 11:33:16AM -0300, Marcelo Tosatti wrote:
  
  On x86 mp_state is initialized by kvm_arch_vcpu_init. Right
  now kvm_vcpu_is_bsp returns false because kvm-bsp_vcpu has
  not been initialized, so vcpu_id == 0 ends up with mp_state ==
  KVM_MP_STATE_UNINITIALIZED.
  
  Gleb do you see a better way to fix this?
  
 I have two, not necessary better ways. The first one is to change
 kvm_vcpu_is_bsp() to do kvm-bsp_vcpu_id == vcpu-vcpu_id. 

Thats much better.

 Another one is to understand why mp_state is set to runnable for bsp
 here at all. May be we can drop this use of kvm_vcpu_is_bsp() since
 mp_state will be set to RUNNABLE in kvm_arch_vcpu_ioctl_set_sregs()
 anyway?

Testcase is kvmctl without -p, which does not do
kvm_arch_vcpu_ioctl_set_sregs. I suppose that is valid? (in practice its
no big deal since kvmctl can be updated, and qemu-kvm does set_sregs,
but..).

KVM: use vcpu_id instead of bsp_vcpu pointer in kvm_vcpu_is_bsp

Change kvm_vcpu_is_bsp to use vcpu_id instead of bsp_vcpu pointer, which
is only initialized at the end of kvm_vm_ioctl_create_vcpu.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com


diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 1b48092..026ed0a 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -569,7 +569,7 @@ static inline void kvm_irqfd_release(struct kvm *kvm) {}
 #ifdef CONFIG_KVM_APIC_ARCHITECTURE
 static inline bool kvm_vcpu_is_bsp(struct kvm_vcpu *vcpu)
 {
-   return vcpu-kvm-bsp_vcpu == vcpu;
+   return vcpu-kvm-bsp_vcpu_id == vcpu-vcpu_id;
 }
 #endif
 #endif
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch 2/2] KVM: remove old KVMTRACE support code


On 06/17/2009 03:22 PM, Marcelo Tosatti wrote:

Return EOPNOTSUPP for KVM_TRACE_ENABLE/PAUSE/DISABLE ioctls.

   


I dropped this patch, breaks the powerpc build.  I guess it should be 
simple to convert powerpc?


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

a few words about drive/pci add/remove


I tried drive hot add/remove today.

And immediately had a few.. issues.
qemu-kvm-0.10.5.

First of all, there are several definitions
of drive_add and several others in `help'
monitor command output:

This one appears to be real:
drive_add pci_addr=[[domain:]bus:]slot
[file=file][,if=type][,bus=n]
[,unit=m][,media=d][index=i]
[,cyls=c,heads=h,secs=s[,trans=t]]
[snapshot=on|off][,cache=on|off] -- add drive to PCI storage controller

And this, a few lines below it, is fake:
drive_add pcibus pcidevfn [file=file][,if=type][,bus=n]
[,unit=m][,media=d][index=i]
[,cyls=c,heads=h,secs=s[,trans=t]]

Note also the cache argument is not the same
as it is for -drive command-line argument.


Ok, so trying the first option (after looking
at the source and after noticing the first correct
form), I realized it does not work.  Or at least
I can't get it to work.  It either says the PCI
slot mentioned is in use (when specifying
pci_addr=slot which is present in the guest),
or that the given slot does not exists.

Ok, so I discovered also pci_add.  That one worked,
and I've got new pci device in guest which recognized
it nicely.

Now, after realizing I added the wrong file ;), I
tried to remove it -- and got an instant warning
in the guest:

WARNING: at drivers/base/core.c:122 device_release+0x5f/0x70()
Device 'virtio8' does not have a release() function, it is broken and must be 
fixed.
Modules linked in: acpiphp dock pci_hotplug xfs e1000 button ext3 jbd mbcache 
virtio_blk virtio_net virtio_pci virtio_ring virtio
Pid: 39, comm: kacpi_notify Tainted: G S2.6.27-i686smp #2.6.27.25
 [c012b11f] warn_slowpath+0x6f/0xa0
 [c0110030] generic_set_mtrr+0x50/0x110
 [c011fd2e] __wake_up+0x3e/0x60
 [c01d2195] release_sysfs_dirent+0x45/0xb0
 [c01d23e1] sysfs_addrm_finish+0x1b1/0x1f0
 [c01d2447] remove_dir+0x27/0x40
 [c01d24cf] sysfs_remove_dir+0x5f/0x70
...
 [f880a44d] virtio_pci_remove+0x11/0x4c [virtio_pci]
...

When trying to re-add another (right this time) drive
the same way (using pci_addr=auto), the guest crashed.

JFYI for now.

Thanks.

/mjt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 0/6] add sysenter/syscall emulation for 32bit compat mode

sysenter/sysexit are not supported on AMD's 32bit compat mode, whereas
syscall is not supported on Intel's 32bit compat mode. To allow cross
vendor migration we emulate the missing instructions by setting up the
processor state accordingly.
The sysenter code was originally sketched by Amit Shah, it was completed,
debugged,  syscall added and made-to-work by Christoph Egger and polished
up by Andre Przywara.
Please note that sysret does not need to be emulated, because it will be
exectued in 64bit mode and returning to 32bit compat mode works on Intel.

This has been tested with GETPIDs in a tight loop in compat mode on both
Intel and AMD boxes. Additionally a 32-bit userland was booted under a
64-bit kernel and then cross-vendor migrated.

Please apply or comment ;-) 

Regards,
Andre.

Signed-off-by: Amit Shah amit.s...@redhat.com
Signed-off-by: Christoph Egger christoph.eg...@amd.com
Signed-off-by: Andre Przywara andre.przyw...@amd.com


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 1/6] allow emulation of syscalls instructions on #UD

Add the opcodes for syscall, sysenter and sysexit to the list of instructions
handled by the undefined opcode handler.

Signed-off-by: Christoph Egger christoph.eg...@amd.com
Signed-off-by: Amit Shah amit.s...@redhat.com
Signed-off-by: Andre Przywara andre.przyw...@amd.com
---
 arch/x86/kvm/x86.c |   33 ++---
 1 files changed, 26 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6025e5b..88e159c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2632,14 +2632,33 @@ int emulate_instruction(struct kvm_vcpu *vcpu,
 
r = x86_decode_insn(vcpu-arch.emulate_ctxt, emulate_ops);
 
-   /* Reject the instructions other than VMCALL/VMMCALL when
-* try to emulate invalid opcode */
+   /* Only allow emulation of specific instructions on #UD
+* (namely VMMCALL, sysenter, sysexit, syscall)*/
c = vcpu-arch.emulate_ctxt.decode;
-   if ((emulation_type  EMULTYPE_TRAP_UD) 
-   (!(c-twobyte  c-b == 0x01 
- (c-modrm_reg == 0 || c-modrm_reg == 3) 
-  c-modrm_mod == 3  c-modrm_rm == 1)))
-   return EMULATE_FAIL;
+   if (emulation_type  EMULTYPE_TRAP_UD) {
+   if (!c-twobyte)
+   return EMULATE_FAIL;
+   switch (c-b) {
+   case 0x01: /* VMMCALL */
+   if (c-modrm_mod != 3 || c-modrm_rm != 1)
+   return EMULATE_FAIL;
+   break;
+   case 0x34: /* sysenter */
+   case 0x35: /* sysexit */
+   if (c-modrm_mod != 0 || c-modrm_rm != 0)
+   return EMULATE_FAIL;
+   break;
+   case 0x05: /* syscall */
+   if (c-modrm_mod != 0 || c-modrm_rm != 0)
+   return EMULATE_FAIL;
+   break;
+   default:
+   return EMULATE_FAIL;
+   }
+
+   if (!(c-modrm_reg == 0 || c-modrm_reg == 3))
+   return EMULATE_FAIL;
+   }
 
++vcpu-stat.insn_emulation;
if (r)  {
-- 
1.6.1.3


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 2/6] add missing EFLAGS bit definitions

Signed-off-by: Christoph Egger christoph.eg...@amd.com
Signed-off-by: Amit Shah amit.s...@redhat.com
Signed-off-by: Andre Przywara andre.przyw...@amd.com
---
 arch/x86/kvm/x86_emulate.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/x86_emulate.c b/arch/x86/kvm/x86_emulate.c
index 22c765d..e387c83 100644
--- a/arch/x86/kvm/x86_emulate.c
+++ b/arch/x86/kvm/x86_emulate.c
@@ -320,8 +320,11 @@ static u32 group2_table[] = {
 };
 
 /* EFLAGS bit definitions. */
+#define EFLG_VM (117)
+#define EFLG_RF (116)
 #define EFLG_OF (111)
 #define EFLG_DF (110)
+#define EFLG_IF (19)
 #define EFLG_SF (17)
 #define EFLG_ZF (16)
 #define EFLG_AF (14)
-- 
1.6.1.3


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 3/6] prepare for emulation of syscall instructions

Add the flags needed for syscall, sysenter and sysexit to the opcode table.
Catch (but for now ignore) the opcodes in the emulation switch/case.

Signed-off-by: Andre Przywara andre.przyw...@amd.com
Signed-off-by: Amit Shah amit.s...@redhat.com
Signed-off-by: Christoph Egger christoph.eg...@amd.com
---
 arch/x86/kvm/x86_emulate.c |   17 +++--
 1 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86_emulate.c b/arch/x86/kvm/x86_emulate.c
index e387c83..328ccba 100644
--- a/arch/x86/kvm/x86_emulate.c
+++ b/arch/x86/kvm/x86_emulate.c
@@ -32,6 +32,8 @@
 #include linux/module.h
 #include asm/kvm_x86_emulate.h
 
+#include mmu.h   /* for is_long_mode() */
+
 /*
  * Opcode effective-address decode tables.
  * Note that we only emulate instructions that have at least one memory
@@ -209,7 +211,7 @@ static u32 opcode_table[256] = {
 
 static u32 twobyte_table[256] = {
/* 0x00 - 0x0F */
-   0, Group | GroupDual | Group7, 0, 0, 0, 0, ImplicitOps, 0,
+   0, Group | GroupDual | Group7, 0, 0, 0, ImplicitOps, ImplicitOps, 0,
ImplicitOps, ImplicitOps, 0, 0, 0, ImplicitOps | ModRM, 0, 0,
/* 0x10 - 0x1F */
0, 0, 0, 0, 0, 0, 0, 0, ImplicitOps | ModRM, 0, 0, 0, 0, 0, 0, 0,
@@ -217,7 +219,9 @@ static u32 twobyte_table[256] = {
ModRM | ImplicitOps, ModRM, ModRM | ImplicitOps, ModRM, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
/* 0x30 - 0x3F */
-   ImplicitOps, 0, ImplicitOps, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+   ImplicitOps, 0, ImplicitOps, 0,
+   ImplicitOps, ImplicitOps, 0, 0,
+   0, 0, 0, 0, 0, 0, 0, 0,
/* 0x40 - 0x47 */
DstReg | SrcMem | ModRM | Mov, DstReg | SrcMem | ModRM | Mov,
DstReg | SrcMem | ModRM | Mov, DstReg | SrcMem | ModRM | Mov,
@@ -1988,6 +1992,9 @@ twobyte_insn:
goto cannot_emulate;
}
break;
+   case 0x05:  /* syscall */
+   goto cannot_emulate;
+   break;
case 0x06:
emulate_clts(ctxt-vcpu);
c-dst.type = OP_NONE;
@@ -2054,6 +2061,12 @@ twobyte_insn:
rc = X86EMUL_CONTINUE;
c-dst.type = OP_NONE;
break;
+   case 0x34:  /* sysenter */
+   goto cannot_emulate;
+   break;
+   case 0x35:  /* sysexit */
+   goto cannot_emulate;
+   break;
case 0x40 ... 0x4f: /* cmov */
c-dst.val = c-dst.orig_val = c-src.val;
if (!test_cc(c-b, ctxt-eflags))
-- 
1.6.1.3


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 5/6] add sysenter emulation

Handle #UD intercept of the sysenter instruction in 32bit compat mode on
an AMD host.
Setup the segment descriptors for CS and SS and the EIP/ESP registers
according to the manual.

Signed-off-by: Christoph Egger christoph.eg...@amd.com
Signed-off-by: Amit Shah amit.s...@redhat.com
Signed-off-by: Andre Przywara andre.przyw...@amd.com
---
 arch/x86/kvm/x86_emulate.c |   72 +++-
 1 files changed, 71 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/x86_emulate.c b/arch/x86/kvm/x86_emulate.c
index 89bd53e..2f62aaa 100644
--- a/arch/x86/kvm/x86_emulate.c
+++ b/arch/x86/kvm/x86_emulate.c
@@ -1481,6 +1481,73 @@ emulate_syscall(struct x86_emulate_ctxt *ctxt)
return 0;
 }
 
+static int
+emulate_sysenter(struct x86_emulate_ctxt *ctxt)
+{
+   struct decode_cache *c = ctxt-decode;
+   struct kvm_segment cs, ss;
+   u64 msr_data;
+
+   /* inject #UD if LOCK prefix is used */
+   if (c-lock_prefix)
+   return -1;
+
+   /* inject #GP if in real mode or paging is disabled */
+   if (ctxt-mode == X86EMUL_MODE_REAL ||
+   !(ctxt-vcpu-arch.cr0  X86_CR0_PE)) {
+   kvm_inject_gp(ctxt-vcpu, 0);
+   return -1;
+   }
+
+   /* XXX sysenter/sysexit have not been tested in 64bit mode.
+   * Therefore, we inject an #UD.
+   */
+   if (ctxt-mode == X86EMUL_MODE_PROT64)
+   return -1;
+
+   setup_syscalls_segments(ctxt, cs, ss);
+
+   kvm_x86_ops-get_msr(ctxt-vcpu, MSR_IA32_SYSENTER_CS, msr_data);
+   switch (ctxt-mode) {
+   case X86EMUL_MODE_PROT32:
+   if ((msr_data  0xfffc) == 0x0) {
+   kvm_inject_gp(ctxt-vcpu, 0);
+   return -1;
+   }
+   break;
+   case X86EMUL_MODE_PROT64:
+   if (msr_data == 0x0) {
+   kvm_inject_gp(ctxt-vcpu, 0);
+   return -1;
+   }
+   break;
+   }
+
+   ctxt-eflags = ~(EFLG_VM | EFLG_IF | EFLG_RF);
+   cs.selector = (u16)msr_data;
+   cs.selector = ~SELECTOR_RPL_MASK;
+   ss.selector = cs.selector + 8;
+   ss.selector = ~SELECTOR_RPL_MASK;
+   if (ctxt-mode == X86EMUL_MODE_PROT64
+   || is_long_mode(ctxt-vcpu)) {
+   cs.db = 0;
+   cs.l = 1;
+   cs.limit = 0x;
+   ss.limit = 0x;
+   }
+
+   kvm_x86_ops-set_segment(ctxt-vcpu, cs, VCPU_SREG_CS);
+   kvm_x86_ops-set_segment(ctxt-vcpu, ss, VCPU_SREG_SS);
+
+   kvm_x86_ops-get_msr(ctxt-vcpu, MSR_IA32_SYSENTER_EIP, msr_data);
+   c-eip = msr_data;
+
+   kvm_x86_ops-get_msr(ctxt-vcpu, MSR_IA32_SYSENTER_ESP, msr_data);
+   c-regs[VCPU_REGS_RSP] = msr_data;
+
+   return 0;
+}
+
 int
 x86_emulate_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops)
 {
@@ -2149,7 +2216,10 @@ twobyte_insn:
c-dst.type = OP_NONE;
break;
case 0x34:  /* sysenter */
-   goto cannot_emulate;
+   if (emulate_sysenter(ctxt) == -1)
+   goto cannot_emulate;
+   else
+   goto writeback;
break;
case 0x35:  /* sysexit */
goto cannot_emulate;
-- 
1.6.1.3


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 4/6] add syscall emulation

Handle #UD intercept of the syscall instruction in 32bit compat mode on
an Intel host.
Setup the segment descriptors for CS and SS and the EIP/ESP registers
according to the manual. Save the RIP and EFLAGS to the correct registers.

Signed-off-by: Christoph Egger christoph.eg...@amd.com
Signed-off-by: Andre Przywara andre.przyw...@amd.com
---
 arch/x86/kvm/x86_emulate.c |   89 +++-
 1 files changed, 88 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/x86_emulate.c b/arch/x86/kvm/x86_emulate.c
index 328ccba..89bd53e 100644
--- a/arch/x86/kvm/x86_emulate.c
+++ b/arch/x86/kvm/x86_emulate.c
@@ -1397,6 +1397,90 @@ void toggle_interruptibility(struct x86_emulate_ctxt 
*ctxt, u32 mask)
ctxt-interruptibility = mask;
 }
 
+static inline void
+setup_syscalls_segments(struct x86_emulate_ctxt *ctxt,
+   struct kvm_segment *cs, struct kvm_segment *ss)
+{
+   memset(cs, 0, sizeof(struct kvm_segment));
+   kvm_x86_ops-get_segment(ctxt-vcpu, cs, VCPU_SREG_CS);
+   memset(ss, 0, sizeof(struct kvm_segment));
+
+   cs-l = 0;  /* will be adjusted later */
+   cs-base = 0;   /* flat segment */
+   cs-g = 1;  /* 4kb granularity */
+   cs-limit = 0xf;/* 4GB limit */
+   cs-type = 0x0b;/* Read, Execute, Accessed */
+   cs-s = 1;
+   cs-dpl = 0;/* will be adjusted later */
+   cs-present = 1;
+   cs-db = 1;
+
+   ss-unusable = 0;
+   ss-base = 0;   /* flat segment */
+   ss-limit = 0xf;/* 4GB limit */
+   ss-g = 1;  /* 4kb granularity */
+   ss-s = 1;
+   ss-type = 0x03;/* Read/Write, Accessed */
+   ss-db = 1; /* 32bit stack segment */
+   ss-dpl = 0;
+   ss-present = 1;
+}
+
+static int
+emulate_syscall(struct x86_emulate_ctxt *ctxt)
+{
+   struct decode_cache *c = ctxt-decode;
+   struct kvm_segment cs, ss;
+   u64 msr_data;
+
+   /* syscall is not available in real mode */
+   if (c-lock_prefix || ctxt-mode == X86EMUL_MODE_REAL
+   || !(ctxt-vcpu-arch.cr0  X86_CR0_PE))
+   return -1;
+
+   setup_syscalls_segments(ctxt, cs, ss);
+
+   kvm_x86_ops-get_msr(ctxt-vcpu, MSR_STAR, msr_data);
+   msr_data = 32;
+   cs.selector = (u16)(msr_data  0xfffc);
+   ss.selector = (u16)(msr_data + 8);
+
+   if (is_long_mode(ctxt-vcpu)) {
+   cs.db = 0;
+   cs.l = 1;
+   if (ctxt-mode == X86EMUL_MODE_PROT64) {
+   /* Intel cares about granularity (g bit),
+* so we don't set the effective limit.
+*/
+   cs.g = 1;
+   cs.limit = 0x;
+   }
+   }
+   kvm_x86_ops-set_segment(ctxt-vcpu, cs, VCPU_SREG_CS);
+   kvm_x86_ops-set_segment(ctxt-vcpu, ss, VCPU_SREG_SS);
+
+   c-regs[VCPU_REGS_RCX] = c-eip;
+   if (is_long_mode(ctxt-vcpu)) {
+   c-regs[VCPU_REGS_R11] = ctxt-eflags  ~EFLG_RF;
+
+   kvm_x86_ops-get_msr(ctxt-vcpu,
+   ctxt-mode == X86EMUL_MODE_PROT64 ?
+   MSR_LSTAR : MSR_CSTAR, msr_data);
+   c-eip = msr_data;
+
+   kvm_x86_ops-get_msr(ctxt-vcpu, MSR_SYSCALL_MASK, msr_data);
+   ctxt-eflags = ~(msr_data | EFLG_RF);
+   } else {
+   /* legacy mode */
+   kvm_x86_ops-get_msr(ctxt-vcpu, MSR_STAR, msr_data);
+   c-eip = (u32)msr_data;
+
+   ctxt-eflags = ~(EFLG_VM | EFLG_IF | EFLG_RF);
+   }
+
+   return 0;
+}
+
 int
 x86_emulate_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops)
 {
@@ -1993,7 +2077,10 @@ twobyte_insn:
}
break;
case 0x05:  /* syscall */
-   goto cannot_emulate;
+   if (emulate_syscall(ctxt) == -1)
+   goto cannot_emulate;
+   else
+   goto writeback;
break;
case 0x06:
emulate_clts(ctxt-vcpu);
-- 
1.6.1.3


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 6/6] add sysexit emulation

Handle #UD intercept of the sysexit instruction in 64bit mode returning to
32bit compat mode on an AMD host.
Setup the segment descriptors for CS and SS and the EIP/ESP registers
according to the manual.

Signed-off-by: Christoph Egger christoph.eg...@amd.com
Signed-off-by: Amit Shah amit.s...@redhat.com
Signed-off-by: Andre Przywara andre.przyw...@amd.com
---
 arch/x86/kvm/x86_emulate.c |   79 +++-
 1 files changed, 78 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/x86_emulate.c b/arch/x86/kvm/x86_emulate.c
index 2f62aaa..7df05cc 100644
--- a/arch/x86/kvm/x86_emulate.c
+++ b/arch/x86/kvm/x86_emulate.c
@@ -1548,6 +1548,80 @@ emulate_sysenter(struct x86_emulate_ctxt *ctxt)
return 0;
 }
 
+static int
+emulate_sysexit(struct x86_emulate_ctxt *ctxt)
+{
+   struct decode_cache *c = ctxt-decode;
+   struct kvm_segment cs, ss;
+   u64 msr_data;
+   int usermode;
+
+   /* inject #UD if LOCK prefix is used */
+   if (c-lock_prefix)
+   return -1;
+
+   /* inject #GP if in real mode or paging is disabled */
+   if (ctxt-mode == X86EMUL_MODE_REAL
+   || !(ctxt-vcpu-arch.cr0  X86_CR0_PE)) {
+   kvm_inject_gp(ctxt-vcpu, 0);
+   return -1;
+   }
+
+   /* sysexit must be called from CPL 0 */
+   if (kvm_x86_ops-get_cpl(ctxt-vcpu) != 0) {
+   kvm_inject_gp(ctxt-vcpu, 0);
+   return -1;
+   }
+
+   setup_syscalls_segments(ctxt, cs, ss);
+
+   if ((c-rex_prefix  0x8) != 0x0)
+   usermode = X86EMUL_MODE_PROT64;
+   else
+   usermode = X86EMUL_MODE_PROT32;
+
+   /* We don't care about cs.g/ss.g bits
+* (= 4kb granularity) so we have to set the effective
+* limit here or we get a #GP in the guest, otherwise.
+*/
+   cs.limit = 0x;
+   ss.limit = 0x;
+
+   cs.dpl = 3;
+   ss.dpl = 3;
+   kvm_x86_ops-get_msr(ctxt-vcpu, MSR_IA32_SYSENTER_CS, msr_data);
+   switch (usermode) {
+   case X86EMUL_MODE_PROT32:
+   cs.selector = (u16)(msr_data + 16);
+   if ((msr_data  0xfffc) == 0x0) {
+   kvm_inject_gp(ctxt-vcpu, 0);
+   return -1;
+   }
+   ss.selector = (u16)(msr_data + 24);
+   break;
+   case X86EMUL_MODE_PROT64:
+   cs.selector = (u16)(msr_data + 32);
+   if (msr_data == 0x0) {
+   kvm_inject_gp(ctxt-vcpu, 0);
+   return -1;
+   }
+   ss.selector = cs.selector + 8;
+   cs.db = 0;
+   cs.l = 1;
+   break;
+   }
+   cs.selector |= SELECTOR_RPL_MASK;
+   ss.selector |= SELECTOR_RPL_MASK;
+
+   kvm_x86_ops-set_segment(ctxt-vcpu, cs, VCPU_SREG_CS);
+   kvm_x86_ops-set_segment(ctxt-vcpu, ss, VCPU_SREG_SS);
+
+   c-eip = ctxt-vcpu-arch.regs[VCPU_REGS_RDX];
+   c-regs[VCPU_REGS_RSP] = ctxt-vcpu-arch.regs[VCPU_REGS_RCX];
+
+   return 0;
+}
+
 int
 x86_emulate_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops)
 {
@@ -,7 +2296,10 @@ twobyte_insn:
goto writeback;
break;
case 0x35:  /* sysexit */
-   goto cannot_emulate;
+   if (emulate_sysexit(ctxt) == -1)
+   goto cannot_emulate;
+   else
+   goto writeback;
break;
case 0x40 ... 0x4f: /* cmov */
c-dst.val = c-dst.orig_val = c-src.val;
-- 
1.6.1.3


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

KVM: protect concurrent make_all_cpus_request



make_all_cpus_request contains a race condition which can 
trigger false request completed status, as follows:

CPU0  CPU1

if (test_and_set_bit(req,vcpu-requests))
   if 
(test_and_set_bit(req,vcpu-requests))
   ..  return
proceed to smp_call_function_many(wait=1)

Use a spinlock to serialize concurrent CPUs.

Signed-off-by: Andrea Arcangeli aarca...@redhat.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 1b48092..2451f48 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -124,6 +124,7 @@ struct kvm_kernel_irq_routing_entry {
 
 struct kvm {
spinlock_t mmu_lock;
+   spinlock_t requests_lock;
struct rw_semaphore slots_lock;
struct mm_struct *mm; /* userspace tied to this vm */
int nmemslots;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8939ffa..6847bb7 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -739,6 +739,7 @@ static bool make_all_cpus_request(struct kvm *kvm, unsigned 
int req)
cpumask_clear(cpus);
 
me = get_cpu();
+   spin_lock(kvm-requests_lock);
kvm_for_each_vcpu(i, vcpu, kvm) {
if (test_and_set_bit(req, vcpu-requests))
continue;
@@ -752,6 +753,7 @@ static bool make_all_cpus_request(struct kvm *kvm, unsigned 
int req)
smp_call_function_many(cpus, ack_flush, NULL, 1);
else
called = false;
+   spin_unlock(kvm-requests_lock);
put_cpu();
free_cpumask_var(cpus);
return called;
@@ -972,6 +974,7 @@ static struct kvm *kvm_create_vm(void)
kvm-mm = current-mm;
atomic_inc(kvm-mm-mm_count);
spin_lock_init(kvm-mmu_lock);
+   spin_lock_init(kvm-requests_lock);
kvm_io_bus_init(kvm-pio_bus);
kvm_irqfd_init(kvm);
mutex_init(kvm-lock);
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: a few words about drive/pci add/remove

2009-06-17 Thread Mark McLoughlin

On Wed, 2009-06-17 at 17:44 +0400, Michael Tokarev wrote:

 WARNING: at drivers/base/core.c:122 device_release+0x5f/0x70()
 Device 'virtio8' does not have a release() function, it is broken and
 must be fixed.

This should be fixed in 2.6.29

Cheers,
Mark.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: KVM: init bsp_vcpu before kvm_arch_vcpu_init

2009-06-17 Thread Gleb Natapov

On Wed, Jun 17, 2009 at 10:07:59AM -0300, Marcelo Tosatti wrote:
 On Wed, Jun 17, 2009 at 03:29:05PM +0300, Gleb Natapov wrote:
  On Tue, Jun 16, 2009 at 11:33:16AM -0300, Marcelo Tosatti wrote:
   
   On x86 mp_state is initialized by kvm_arch_vcpu_init. Right
   now kvm_vcpu_is_bsp returns false because kvm-bsp_vcpu has
   not been initialized, so vcpu_id == 0 ends up with mp_state ==
   KVM_MP_STATE_UNINITIALIZED.
   
   Gleb do you see a better way to fix this?
   
  I have two, not necessary better ways. The first one is to change
  kvm_vcpu_is_bsp() to do kvm-bsp_vcpu_id == vcpu-vcpu_id. 
 
 Thats much better.
 
  Another one is to understand why mp_state is set to runnable for bsp
  here at all. May be we can drop this use of kvm_vcpu_is_bsp() since
  mp_state will be set to RUNNABLE in kvm_arch_vcpu_ioctl_set_sregs()
  anyway?
 
 Testcase is kvmctl without -p, which does not do
 kvm_arch_vcpu_ioctl_set_sregs. I suppose that is valid? (in practice its
 no big deal since kvmctl can be updated, and qemu-kvm does set_sregs,
 but..).
 
I am not sure why all those special cases are present, may be for
backwards compatibility. How can we make vcpu runnable without
initializing it first (setting cpuid/regs/sregd)? But the patch is good
regardless.

 KVM: use vcpu_id instead of bsp_vcpu pointer in kvm_vcpu_is_bsp
 
 Change kvm_vcpu_is_bsp to use vcpu_id instead of bsp_vcpu pointer, which
 is only initialized at the end of kvm_vm_ioctl_create_vcpu.
 
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
 
 
 diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
 index 1b48092..026ed0a 100644
 --- a/include/linux/kvm_host.h
 +++ b/include/linux/kvm_host.h
 @@ -569,7 +569,7 @@ static inline void kvm_irqfd_release(struct kvm *kvm) {}
  #ifdef CONFIG_KVM_APIC_ARCHITECTURE
  static inline bool kvm_vcpu_is_bsp(struct kvm_vcpu *vcpu)
  {
 - return vcpu-kvm-bsp_vcpu == vcpu;
 + return vcpu-kvm-bsp_vcpu_id == vcpu-vcpu_id;
  }
  #endif
  #endif

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: a few words about drive/pci add/remove


Mark McLoughlin wrote:

On Wed, 2009-06-17 at 17:44 +0400, Michael Tokarev wrote:


WARNING: at drivers/base/core.c:122 device_release+0x5f/0x70()
Device 'virtio8' does not have a release() function, it is broken and
must be fixed.


This should be fixed in 2.6.29


Yes indeed, 2.6.29 adds/removes pci devices just fine.

Can we apply this to 2.6.27.y long-term stable release too?

Thanks!

/mjt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

2009-06-17 Thread Michael S. Tsirkin

On Tue, Jun 16, 2009 at 02:09:38PM -0400, Gregory Haskins wrote:
  What do you mean by copy_to_user(other-mm) here?  If you are going to 
  switch
  to another mm, then I think current-mm must be valid (I think it's not 
  enough
  that you can sleep). So preemptible() might not be enough.

 
 I dont currently use switch_mm, if that is what you mean.  I save the
 task_struct into the appropriate context so current-mm doesn't matter
 to me.  I never use it.  All I need (afaik) is to acquire the proper
 mutex first.  I am not an MM expert, so perhaps I have this wrong but it
 does appear to work properly even from kthread context.
 
 -Greg
 
 

I think I saw get_user_pages + memcpy in your patch. Yes, that works
without switching contexts but it's slower than copy to user if you are
in the right context, and AFAIK it's slower than get_user_pages_fast.

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

Michael S. Tsirkin wrote:
 On Tue, Jun 16, 2009 at 02:09:38PM -0400, Gregory Haskins wrote:
   
 What do you mean by copy_to_user(other-mm) here?  If you are going to 
 switch
 to another mm, then I think current-mm must be valid (I think it's not 
 enough
 that you can sleep). So preemptible() might not be enough.
   
   
 I dont currently use switch_mm, if that is what you mean.  I save the
 task_struct into the appropriate context so current-mm doesn't matter
 to me.  I never use it.  All I need (afaik) is to acquire the proper
 mutex first.  I am not an MM expert, so perhaps I have this wrong but it
 does appear to work properly even from kthread context.

 -Greg


 

 I think I saw get_user_pages + memcpy in your patch. Yes, that works
 without switching contexts but it's slower than copy to user if you are
 in the right context, and AFAIK it's slower than get_user_pages_fast.

   
Yeah, understood.  It will definitely be interesting to try that
optimization with switch_mm that you suggested.

On that front, would if (p == current) still work even if we don't
have the signal_task() hint?



signature.asc
Description: OpenPGP digital signature

[KVM-AUTOTEST PATCH] KVM test: support 'cache' and 'serial' params to the qemu -drive option

2009-06-17 Thread Michael Goldish

Support 'cache' and 'serial'.
Both can be controlled directly from the config file, e.g.:
drive_cache = writeback

or for a specific image 'image1':
drive_cache_image1 = writeback

or for 'image1' of VM 'vm1':
drive_cache_image1_vm1 = writeback

The cache options are listed in the QEMU documentation.
(currently none, writeback, or writethrough are supported.)

Signed-off-by: Michael Goldish mgold...@redhat.com
---
 client/tests/kvm/kvm_vm.py |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/client/tests/kvm/kvm_vm.py b/client/tests/kvm/kvm_vm.py
index 9aea3fb..e98ffd7 100644
--- a/client/tests/kvm/kvm_vm.py
+++ b/client/tests/kvm/kvm_vm.py
@@ -226,6 +226,10 @@ class VM:
image_dir)
 if image_params.get(drive_format):
 qemu_cmd += ,if=%s % image_params.get(drive_format)
+if image_params.get(drive_cache):
+qemu_cmd += ,cache=%s % image_params.get(drive_cache)
+if image_params.get(drive_serial):
+qemu_cmd += ,serial=%s % image_params.get(drive_serial)
 if image_params.get(image_snapshot) == yes:
 qemu_cmd += ,snapshot=on
 if image_params.get(image_boot) == yes:
-- 
1.5.4.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

RE: Network throughput limits for local VM - VM communication

2009-06-17 Thread Fischer, Anna

Subject: Re: Network throughput limits for local VM - VM
communication

On 06/17/2009 11:12 AM, Fischer, Anna wrote:

For the tests I run now (with vlan= enabled) I am actually using both
TCP and UDP, and I see the problem with virtio_net for both protocols.
What I am wondering about though is that I do not seem to have any
problems if I communicate directly between the two guests (if I plug
then into the same bridge and put them onto the same networks), so why
do I only see the problem of stalling network communication when there
is a routing VM in the network path? Is this just because the system is
even more overloaded in that case? Or could this be an issue related to
a dual NIC configuration or the fact that I run multiple bridges on the
same physical machine?

When you say We are working on fixing this. - which code parts are
you working on? Is this in the QEMU network I/O processing code or is
this virtio_net related?

tap. virtio, qemu, maybe more. It's a difficult problem.

Retry with the fixed configuration? You mean setting the vlan=
parameter? I have already used the vlan= parameter for the latest
tests, and so the CPU utilization issues I am talking about are
happening with that configuration.

Yeah.

Can you compare total data sent and received as seen by the guests?
That would confirm that packets being dropped causes the slowdown.

Yes, I will check on that and report back.

It still does not answer my question on why I only see low CPU utilization
numbers with the e1000 virtual device model. There is no network stalling or
packet drops or any other obvious issues when running with that model, but I am
still seeing low CPU utilization numbers. What is preventing KVM here to use
more of the host CPU capacity when the host is not doing anything else but run
virtual machines? Is there any way that I can get higher CPU utilization out of
KVM?

Thanks,
Anna
N�r��yb�X��ǧv�^�)޺{.n�+h����ܨ}���Ơz�j:+v���zZ+��+zf���h���~i���z��w���?��)ߢf

Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

2009-06-17 Thread Michael S. Tsirkin

On Wed, Jun 17, 2009 at 11:02:06AM -0400, Gregory Haskins wrote:
 Michael S. Tsirkin wrote:
  On Tue, Jun 16, 2009 at 02:09:38PM -0400, Gregory Haskins wrote:

  What do you mean by copy_to_user(other-mm) here?  If you are going to 
  switch
  to another mm, then I think current-mm must be valid (I think it's not 
  enough
  that you can sleep). So preemptible() might not be enough.


  I dont currently use switch_mm, if that is what you mean.  I save the
  task_struct into the appropriate context so current-mm doesn't matter
  to me.  I never use it.  All I need (afaik) is to acquire the proper
  mutex first.  I am not an MM expert, so perhaps I have this wrong but it
  does appear to work properly even from kthread context.
 
  -Greg
 
 
  
 
  I think I saw get_user_pages + memcpy in your patch. Yes, that works
  without switching contexts but it's slower than copy to user if you are
  in the right context, and AFAIK it's slower than get_user_pages_fast.
 

 Yeah, understood.  It will definitely be interesting to try that
 optimization with switch_mm that you suggested.

BTW, I'm kind of confused - in your patch you do get_task_struct: does this
guarantee that mm is not going aways?

 On that front, would if (p == current) still work even if we don't
 have the signal_task() hint?
 

Donnu.

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

Michael S. Tsirkin wrote:
 On Wed, Jun 17, 2009 at 11:02:06AM -0400, Gregory Haskins wrote:
   
 Michael S. Tsirkin wrote:
 
 On Tue, Jun 16, 2009 at 02:09:38PM -0400, Gregory Haskins wrote:
   
   
 What do you mean by copy_to_user(other-mm) here?  If you are going to 
 switch
 to another mm, then I think current-mm must be valid (I think it's not 
 enough
 that you can sleep). So preemptible() might not be enough.
   
   
   
 I dont currently use switch_mm, if that is what you mean.  I save the
 task_struct into the appropriate context so current-mm doesn't matter
 to me.  I never use it.  All I need (afaik) is to acquire the proper
 mutex first.  I am not an MM expert, so perhaps I have this wrong but it
 does appear to work properly even from kthread context.

 -Greg


 
 
 I think I saw get_user_pages + memcpy in your patch. Yes, that works
 without switching contexts but it's slower than copy to user if you are
 in the right context, and AFAIK it's slower than get_user_pages_fast.

   
   
 Yeah, understood.  It will definitely be interesting to try that
 optimization with switch_mm that you suggested.
 

 BTW, I'm kind of confused - in your patch you do get_task_struct: does this
 guarantee that mm is not going aways?
   

Well, again, I am not an expert on MM, but I would think so.  If p holds
a reference to p-mm (which I would think it would have to), and we hold
a reference to p, p-mm should be indirectly stable coincident with the
lifetime of our p reference.  OTOH, I have not actually looked at
whether p actually takes a p-mm reference or not via inspection, so
this is just an educated guess at this point. ;)

I guess if this is broken, the solution is simple:  Modify the code to
take an mm reference as well.

-Greg



signature.asc
Description: OpenPGP digital signature

Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

On Tue, 16 Jun 2009, Gregory Haskins wrote:

 Davide Libenzi wrote:
  On Tue, 16 Jun 2009, Gregory Haskins wrote:
 

  Does this all make sense?
  
 
  This conversation has been *really* long, and I haven't had time to look 
  at the patch yet. But looking at the amount of changes, and the amount of 
  even more changes talked in this thread, there's a very slim chance that 
  I'll ACK the eventfd code.
  You may want to consider a solution that does not litter eventfd code that 
  much.
 
 
  - Davide
 
 

 Hi Davide,
 
 I understand your position and value your time/insight into looking at
 this things.
 
 Despite the current ongoing discussion, I still stand that the current
 patch is my proposed solution (though I have yet to convince Michael). 
 But in any case,  if you have the time, please look it over because I
 still think its the right direction to head in.

I don't think so. You basically upload a bunch of stuff it could have been 
inside your irqfd into eventfd. Now the eventfd_signal() can magically 
sleep, or not, depending on what the signal functions do. This makes up a 
pretty aweful interface if you ask me.
A lot simpler and cleaner if eventfd_signal(), like all the wake up 
functions inside the kernel, can be called from atomic context. Always, 
not sometimes.



- Davide


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: KVM: protect concurrent make_all_cpus_request

2009-06-17 Thread Andrea Arcangeli

On Wed, Jun 17, 2009 at 10:53:47AM -0300, Marcelo Tosatti wrote:
 
 
 make_all_cpus_request contains a race condition which can 
 trigger false request completed status, as follows:
 
 CPU0  CPU1
 
 if (test_and_set_bit(req,vcpu-requests))
  if 
 (test_and_set_bit(req,vcpu-requests))
..  return
 proceed to smp_call_function_many(wait=1)
 
 Use a spinlock to serialize concurrent CPUs.

Agreed, this is the safest...
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

Hi Davide,

Davide Libenzi wrote:
 On Tue, 16 Jun 2009, Gregory Haskins wrote:

   
 Davide Libenzi wrote:
 
 On Tue, 16 Jun 2009, Gregory Haskins wrote:

   
   
 Does this all make sense?
 
 
 This conversation has been *really* long, and I haven't had time to look 
 at the patch yet. But looking at the amount of changes, and the amount of 
 even more changes talked in this thread, there's a very slim chance that 
 I'll ACK the eventfd code.
 You may want to consider a solution that does not litter eventfd code that 
 much.


 - Davide


   
   
 Hi Davide,

 I understand your position and value your time/insight into looking at
 this things.

 Despite the current ongoing discussion, I still stand that the current
 patch is my proposed solution (though I have yet to convince Michael). 
 But in any case,  if you have the time, please look it over because I
 still think its the right direction to head in.
 

 I don't think so. You basically upload a bunch of stuff it could have been 
 inside your irqfd into eventfd.

Can you elaborate?  I currently do not see how I could do the proposed
concept inside of irqfd while still using eventfd.  Of course, that
would be possible if we fork irqfd from eventfd,  and perhaps this is
what you are proposing.  As previously stated I don't want to give up on
the prospect of re-using it quite yet, so bear with me. :)

The issue with eventfd, as I see it, is that eventfd uses a
spin_lock_irqsave (by virtue of the wait-queue stuff) across the
signal callback (which today is implemented as a wake-up).  This
spin_lock implicitly creates a non-preemptible critical section that
occurs independently of whether eventfd_signal() itself is invoked from
a sleepable context or not.

What I strive to achieve is to remove the creation of this internal
critical section.  If eventfd_signal() is called from atomic context, so
be it.  We will detect this in the callback and be forced to take the
slow-path, and I am ok with that.  *But*, if eventfd_signal() (or
f_ops-write(), for that matter) are called from a sleepable context
*and* eventfd doesn't introduce its own critical section (such as with
my srcu patch), we can potentially optimize within the callback by
executing serially instead of deferring (e.g. via a workqueue).

(Note: The issue of changing eventfd_signal interface is a separate
tangent that Michael proposed, and is not something I am advocating.  In
the current proposal, eventfd_signal() retains its exact semantics as it
has in mainline).

  Now the eventfd_signal() can magically 
 sleep, or not, depending on what the signal functions do. This makes up a 
 pretty aweful interface if you ask me.
 A lot simpler and cleaner if eventfd_signal(), like all the wake up 
 functions inside the kernel, can be called from atomic context. Always, 
 not sometimes.
   

It can!  :) This is not changing from whats in mainline today (covered
above).

Thanks Davide,
-Greg




signature.asc
Description: OpenPGP digital signature

Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

On Wed, 17 Jun 2009, Gregory Haskins wrote:

 Can you elaborate?  I currently do not see how I could do the proposed
 concept inside of irqfd while still using eventfd.  Of course, that
 would be possible if we fork irqfd from eventfd,  and perhaps this is
 what you are proposing.  As previously stated I don't want to give up on
 the prospect of re-using it quite yet, so bear with me. :)
 
 The issue with eventfd, as I see it, is that eventfd uses a
 spin_lock_irqsave (by virtue of the wait-queue stuff) across the
 signal callback (which today is implemented as a wake-up).  This
 spin_lock implicitly creates a non-preemptible critical section that
 occurs independently of whether eventfd_signal() itself is invoked from
 a sleepable context or not.
 
 What I strive to achieve is to remove the creation of this internal
 critical section.  If eventfd_signal() is called from atomic context, so
 be it.  We will detect this in the callback and be forced to take the
 slow-path, and I am ok with that.  *But*, if eventfd_signal() (or
 f_ops-write(), for that matter) are called from a sleepable context
 *and* eventfd doesn't introduce its own critical section (such as with
 my srcu patch), we can potentially optimize within the callback by
 executing serially instead of deferring (e.g. via a workqueue).

Since when the scheduling (assuming it's not permanently running on 
another core due to high frequency work post) of a kernel thread is such 
a big impact that interfaces need to be redesigned for that?
How much the (possible, but not certain) kernel thread context switch time 
weighs in the overall KVM IRQ service time?



 It can!  :) This is not changing from whats in mainline today (covered
 above).

It can/could, if the signal() function takes very accurate care of doing 
the magic atomic check.



- Davide


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[RFC][PATCH 0/3] qemu-kvm fixes (+userspace support for gbpages)

Hi,

this patchset contains two fixes to qemu-kvm userspace and the patch to make
gbpages working with qemu-kvm. I am not really sure if this is the right way to
do it or if I missed something. Therefore this patchset is RFC.

Thanks,

Joerg


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 1/3] qemu-kvm: fix cpuid bitmask in kvm_setup_cpuid()

Fix the bitmask in kvm_setup_cpuid() to a value which represents only the
common bits between cpuid function 0x0001 and 0x8001.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
---
 target-i386/libkvm.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/target-i386/libkvm.c b/target-i386/libkvm.c
index 0f4e009..6391caf 100644
--- a/target-i386/libkvm.c
+++ b/target-i386/libkvm.c
@@ -644,7 +644,7 @@ uint32_t kvm_get_supported_cpuid(kvm_context_t kvm, 
uint32_t function, int reg)
 */
if (function == 0x8001) {
cpuid_1_edx = 
kvm_get_supported_cpuid(kvm, 1, R_EDX);
-   ret |= cpuid_1_edx  0xdfeff7ff;
+   ret |= cpuid_1_edx  0x0183f3ff;
}
break;
}
-- 
1.6.3.1


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 2/3] qemu-kvm: align cpu-features with kvm_supported_cpuid

The cpuid features exposed to the guest are currently not aligned with the bits
returned by the supported_cpuid ioctl. This patch fixes it.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
---
 qemu-kvm-x86.c |   28 +---
 1 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c
index 729d600..95774c1 100644
--- a/qemu-kvm-x86.c
+++ b/qemu-kvm-x86.c
@@ -578,6 +578,13 @@ int kvm_arch_qemu_init_env(CPUState *cenv)
 if (i == 0xd  copy.regs[R_EAX] == 0)
 break;
 }
+} else if (i == 1) {
+do_cpuid_ent(cpuid_ent[cpuid_nent], i, 0, copy);
+kvm_trim_features(cpuid_ent[cpuid_nent].edx,
+  kvm_arch_get_supported_cpuid(cenv, 1, R_EDX));
+kvm_trim_features(cpuid_ent[cpuid_nent].ecx,
+  kvm_arch_get_supported_cpuid(cenv, 1, R_ECX));
+cpuid_nent++;
 } else
 do_cpuid_ent(cpuid_ent[cpuid_nent++], i, 0, copy);
 }
@@ -586,20 +593,19 @@ int kvm_arch_qemu_init_env(CPUState *cenv)
 qemu_kvm_cpuid_on_env(copy);
 limit = copy.regs[R_EAX];
 
-for (i = 0x8000; i = limit; ++i)
-   do_cpuid_ent(cpuid_ent[cpuid_nent++], i, 0, copy);
+for (i = 0x8000; i = limit; ++i) {
+   do_cpuid_ent(cpuid_ent[cpuid_nent], i, 0, copy);
+if (i == 0x8001) {
+kvm_trim_features(cpuid_ent[cpuid_nent].edx,
+  kvm_arch_get_supported_cpuid(cenv, 0x8001, 
R_EDX));
+kvm_trim_features(cpuid_ent[cpuid_nent].ecx,
+  kvm_arch_get_supported_cpuid(cenv, 0x8001, 
R_ECX));
+}
+++cpuid_nent;
+}
 
 kvm_setup_cpuid2(cenv-kvm_cpu_state.vcpu_ctx, cpuid_nent, cpuid_ent);
 
-kvm_trim_features(cenv-cpuid_features,
-  kvm_arch_get_supported_cpuid(cenv, 1, R_EDX));
-kvm_trim_features(cenv-cpuid_ext_features,
-  kvm_arch_get_supported_cpuid(cenv, 1, R_ECX));
-kvm_trim_features(cenv-cpuid_ext2_features,
-  kvm_arch_get_supported_cpuid(cenv, 0x8001, R_EDX));
-kvm_trim_features(cenv-cpuid_ext3_features,
-  kvm_arch_get_supported_cpuid(cenv, 0x8001, R_ECX));
-
 return 0;
 }
 
-- 
1.6.3.1


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 3/3] qemu-kvm: add support for gbpages to kvm userspace

This patch adds a command line parameter to expose the gbpages cpuid bit to the
guest if the kvm kernel module supports it.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
---
 qemu-kvm.c   |1 +
 qemu-kvm.h   |1 +
 qemu-options.hx  |2 ++
 target-i386/helper.c |3 +++
 vl.c |4 
 5 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/qemu-kvm.c b/qemu-kvm.c
index 2aeb17c..01a889d 100644
--- a/qemu-kvm.c
+++ b/qemu-kvm.c
@@ -33,6 +33,7 @@ int kvm_irqchip = 1;
 int kvm_pit = 1;
 int kvm_pit_reinject = 1;
 int kvm_nested = 0;
+int kvm_gbpages = 0;
 kvm_context_t kvm_context;
 
 pthread_mutex_t qemu_mutex = PTHREAD_MUTEX_INITIALIZER;
diff --git a/qemu-kvm.h b/qemu-kvm.h
index fa40542..fc06b96 100644
--- a/qemu-kvm.h
+++ b/qemu-kvm.h
@@ -151,6 +151,7 @@ extern int kvm_irqchip;
 extern int kvm_pit;
 extern int kvm_pit_reinject;
 extern int kvm_nested;
+extern int kvm_gbpages;
 extern kvm_context_t kvm_context;
 
 struct ioperm_data {
diff --git a/qemu-options.hx b/qemu-options.hx
index edd99db..fe9e9d3 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -1594,6 +1594,8 @@ DEF(pcidevice, HAS_ARG, QEMU_OPTION_pcidevice,
 #endif
 DEF(enable-nesting, 0, QEMU_OPTION_enable_nesting,
 -enable-nesting enable support for running a VM inside the VM (AMD 
only)\n)
+DEF(enable-gbpages, 0, QEMU_OPTION_enable_gbpages,
+-enable-gbpages enable support for 1GB pages in the guest (if 
supported)\n)
 DEF(nvram, HAS_ARG, QEMU_OPTION_nvram,
 -nvram FILE  provide ia64 nvram contents\n)
 DEF(tdf, 0, QEMU_OPTION_tdf,
diff --git a/target-i386/helper.c b/target-i386/helper.c
index 6dc0111..9fb2ae4 100644
--- a/target-i386/helper.c
+++ b/target-i386/helper.c
@@ -1607,6 +1607,9 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
uint32_t count,
 *ecx = ~4UL;
 /* 3dnow */
 *edx = ~0xc000;
+   /* enable gbpages and let kvm disable it if unsupported */
+   if (kvm_gbpages)
+   *edx |= (1UL  26UL);
 }
 break;
 case 0x8002:
diff --git a/vl.c b/vl.c
index 845ed54..262ff1b 100644
--- a/vl.c
+++ b/vl.c
@@ -5621,6 +5621,10 @@ int main(int argc, char **argv, char **envp)
kvm_nested = 1;
break;
}
+   case QEMU_OPTION_enable_gbpages: {
+   kvm_gbpages = 1;
+   break;
+   }
 #if defined(TARGET_I386) || defined(TARGET_X86_64) || defined(TARGET_IA64) || 
defined(__linux__)
 case QEMU_OPTION_pcidevice:
if (assigned_devices_index = MAX_DEV_ASSIGN_CMDLINE) {
-- 
1.6.3.1


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

Davide Libenzi wrote:
 On Wed, 17 Jun 2009, Gregory Haskins wrote:

   
 Can you elaborate?  I currently do not see how I could do the proposed
 concept inside of irqfd while still using eventfd.  Of course, that
 would be possible if we fork irqfd from eventfd,  and perhaps this is
 what you are proposing.  As previously stated I don't want to give up on
 the prospect of re-using it quite yet, so bear with me. :)

 The issue with eventfd, as I see it, is that eventfd uses a
 spin_lock_irqsave (by virtue of the wait-queue stuff) across the
 signal callback (which today is implemented as a wake-up).  This
 spin_lock implicitly creates a non-preemptible critical section that
 occurs independently of whether eventfd_signal() itself is invoked from
 a sleepable context or not.

 What I strive to achieve is to remove the creation of this internal
 critical section.  If eventfd_signal() is called from atomic context, so
 be it.  We will detect this in the callback and be forced to take the
 slow-path, and I am ok with that.  *But*, if eventfd_signal() (or
 f_ops-write(), for that matter) are called from a sleepable context
 *and* eventfd doesn't introduce its own critical section (such as with
 my srcu patch), we can potentially optimize within the callback by
 executing serially instead of deferring (e.g. via a workqueue).
 

 Since when the scheduling (assuming it's not permanently running on 
 another core due to high frequency work post) of a kernel thread is such 
 a big impact that interfaces need to be redesigned for that?
   

Low-latency applications, for instance, care about this.  Even one
context switch can add a substantial overhead.

 How much the (possible, but not certain) kernel thread context switch time 
 weighs in the overall KVM IRQ service time?
   

Generally each one is costing me about 7us on average.  For something
like high-speed networking, we have a path that has about 30us of
base-line overhead.  So one additional ctx-switch puts me at base+7 ( =
~37us), two puts me in base+2*7 (= ~44us).  So in that context (no pun
intended ;), it hurts quite a bit.  I'll be the first to admit that not
everyone (most?) will care about latency, though.  But FWIW, I do.



   
 It can!  :) This is not changing from whats in mainline today (covered
 above).
 

 It can/could, if the signal() function takes very accurate care of doing 
 the magic atomic check.
   

True, but thats the notifiee's burden, not eventfd's.  And its always
going to be opt-in.  Even today, someone is free to either try to sleep
(which will oops on the might_sleep()), or try to check if they can
sleep (it will always say they can't due to the eventfd wqh spinlock). 
The only thing that changes with my patch is that someone that opts in
to check if they can sleep may find that they sometimes (mostly?) can.

In any case, I suspect that there are no other clients of eventfd other
than standard wait-queue sleepers, and irqfd.  The wake_up() code is
never going to care anyway, so this really comes down to future users of
the notification interfaces (irqfd today, iosignalfd in the near-term). 
Therefore, there really aren't any legacy issues to deal with that I am
aware of, even if they mattered.

Thanks Davide,
-Greg



signature.asc
Description: OpenPGP digital signature

Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

On Wed, 17 Jun 2009, Gregory Haskins wrote:

 Davide Libenzi wrote:
 
  How much the (possible, but not certain) kernel thread context switch time 
  weighs in the overall KVM IRQ service time?

 
 Generally each one is costing me about 7us on average.  For something
 like high-speed networking, we have a path that has about 30us of
 base-line overhead.  So one additional ctx-switch puts me at base+7 ( =
 ~37us), two puts me in base+2*7 (= ~44us).  So in that context (no pun
 intended ;), it hurts quite a bit.  I'll be the first to admit that not
 everyone (most?) will care about latency, though.  But FWIW, I do.

And how a frame reception is handled in Linux nowadays?



 True, but thats the notifiee's burden, not eventfd's.  And its always
 going to be opt-in.  Even today, someone is free to either try to sleep
 (which will oops on the might_sleep()), ...

No, today you just can't sleep. As you can't sleep in any 
callback-registered wakeups, like epoll, for example.



- Davide


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

Davide Libenzi wrote:
 On Wed, 17 Jun 2009, Gregory Haskins wrote:

   
 Davide Libenzi wrote:

 
 How much the (possible, but not certain) kernel thread context switch time 
 weighs in the overall KVM IRQ service time?
   
   
 Generally each one is costing me about 7us on average.  For something
 like high-speed networking, we have a path that has about 30us of
 base-line overhead.  So one additional ctx-switch puts me at base+7 ( =
 ~37us), two puts me in base+2*7 (= ~44us).  So in that context (no pun
 intended ;), it hurts quite a bit.  I'll be the first to admit that not
 everyone (most?) will care about latency, though.  But FWIW, I do.
 

 And how a frame reception is handled in Linux nowadays?
   

I am not clear on what you are asking here.  So in case this was a
sincere question on how things work, here is a highlight of the typical
flow of a packet that ingresses on a physical adapter and routes to KVM
via vbus.

a) interrupt from eth to host wakes the cpu out of idle, enters
interrupt-context.
b) ISR runs, disables interrupts on the eth, schedules softirq-net-rx
c) ISR completes, kernel checks softirq state before IRET, runs pending
softirq-net-rx in interrupt context to NAPI-poll the ethernet
d) packet is pulled out of eth into a layer-2 bridge, and switched to
the appropriate switch-port (which happens to be a venet-tap (soon to be
virtio-net based) device.  The packet is queued to the tap as an xmit
request, and the tap's tx-thread is awoken.
e) the xmit call returns, the napi-poll completes, and the
softirq-net-rx terminates.  The kernel does an IRET to exit interrupt
context.
f) the scheduler runs and sees the tap's tx-thread is ready to run,
schedules it in.
g) the tx-thread awakens, dequeues the posted skb, copies it to the
virtio-ring, and finally raises an interrupt on irqfd with eventfd_signal().

At this point, all of the data has been posted to the virtio-ring in
shared memory between the host and guest.  All that is left is to inject
the interrupt so the guest knows to process the ring.  We call the
eventfd_signal() from kthread context.  However, the callback to inject
the interrupt is invoked with the wqh spinlock held so we are forced to
defer the interrupt injection to a workqueue so the kvm-lock can be
safely acquired.  This adds additional latency (~7us) in a path that is
only a handful of microseconds to being with.  I can post LTTV
screenshots, if it would be helpful to visualize this.

The reasons we need the workqueue is:

A) kvm is currently implemented with a mutex for IRQ injection, so its
sleepy
B) the wqh used in eventfd wake_up() creates an non-preemptible section
when the callback is invoked.

We can (and should) address (A), but this is but one example in a common
pattern.  We will run into similar issues in other places (such as with
iosignalfd), so I would like to address (B) as well.  My patch attempts
to do that by re-implementing the callback mechanism as something other
than wait-queue based. It then adds a wait-queue as a default client of
the new interface.

Therefore, the eventfd clients that want traditional vanilla wakeups can
go right ahead and use the non-preemptible wait-queue methods as they
always have.  However, the clients that want (potentially) preemptive
callbacks can use the new interface: eventfd_notifier_register(), armed
with the knowledge that it can only sleep if the eventfd_signal() caller
was not in-atomic at the time.  Thus the dynamic state check ala
preemptible().  Without the check, the client should assume it is not
preemptible, as always.


   
 True, but thats the notifiee's burden, not eventfd's.  And its always
 going to be opt-in.  Even today, someone is free to either try to sleep
 (which will oops on the might_sleep()), ...
 

 No, today you just can't sleep. As you can't sleep in any 
 callback-registered wakeups, like epoll, for example.
   

Right, understood, and note that this is precisely why I said it would
oops.  What I was specifically trying to highlight is that its not like
this change imposes new requirements on the existing callbacks.  I also
wanted to highlight that its not really eventfd's concern what the
callback tries to do, per se (if it wants to sleep it can try, it just
wont work).  Any reasonable wakeup callback in existence would already
assume it cannot sleep, and they wouldn't even try to begin with.

On the other hand, what I am introducing here (specifically to eventfd
callbacks, not wait-queues [*]) is the possibility of removing this
restriction under the proper circumstances.  It would only be apparent
of this change iff the callback in question tried to test for this (e.g.
checking preemptible()) state.  It is thus opt-in, and existing code
does not break.

Thanks Davide,
-Greg

[*]: I contemplated solving this problem generally at the wait-queue
level, but quickly abandoned the idea.  The reason is that something
like *RCU has the properties of being particularly lightweight

Re: [PATCH] Activate Virtualization On Demand v2

2009-06-17 Thread Alexander Graf



On 16.06.2009, at 17:13, Avi Kivity wrote:


On 06/16/2009 05:08 PM, Alexander Graf wrote:

Please tell me you tested suspend/resume with/without VMs and cpu
hotunplug/hotplug.



I tested cpu hotplugging. On the last round I tested suspend/ 
resume, but

this time I couldn't because my machine can't do suspend :-(.
So I'll try hard and find a machine I can test it on for the next  
round.




I can test suspend/resume for you if you don't have a friendly  
machine.  I have a personal interest in keeping it working :)


Thinking about it again - there's only the atomic dec_and_test vs.  
read thing and the suspend test missing.


Is the atomic operation as is really that confusing? If not, we can  
keep the patch as is and you simply try s2ram on your notebook :-).  
I'm pretty sure it works - it used to.


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] Don't expose hypervisor bit when running nested SVM

2009-06-17 Thread Alexander Graf

Hyper-V refuses to run in hypervisor mode when it finds the hypervisor bit
set, because it assumes it's running as a guest.

While the perfect way of not setting the hypervisor would be an option to the
-cpu parameter, this is reasonable sane for now. Let's deal with the -cpu
way when we get to -cpu host.

Signed-off-by: Alexander Graf ag...@suse.de
---
 target-i386/helper.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/target-i386/helper.c b/target-i386/helper.c
index 2c5af3c..7da0e24 100644
--- a/target-i386/helper.c
+++ b/target-i386/helper.c
@@ -1513,7 +1513,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
uint32_t count,
 *edx = env-cpuid_features;
 
 /* Hypervisor present bit required for Microsoft SVVP */
-if (kvm_enabled())
+if (kvm_enabled()  !kvm_nested)
 *ecx |= (1  31);
 break;
 case 2:
-- 
1.6.0.2

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface