Re: [PULL 0/5] ppc patch queue 2013-04-17

2013-04-22 Thread Gleb Natapov
On Wed, Apr 17, 2013 at 09:04:04PM +0200, Alexander Graf wrote:
 Hi Marcelo / Gleb,
 
 This is my current patch queue for ppc.  Please pull for 3.10.
 
Pulled. Thanks.

 Changes include:
 
   - KVM: PPC: Fix in-kernel MMIO loads
   - KVM: PPC: BookE: Fix 64-bit guest kernels with SMP
 
 Alex
 
 
 The following changes since commit 79558f112fc0352e057f7b5e158e3d88b8b62c60:
   Alexander Graf (1):
 KVM: ARM: Fix kvm_vm_ioctl_irq_line
 
 are available in the git repository at:
 
   git://github.com/agraf/linux-2.6.git kvm-ppc-next
 
 Alexander Graf (2):
   Merge commit 'origin/next' into kvm-ppc-next
   Merge commit 'origin/next' into kvm-ppc-next
 
 Bharat Bhushan (1):
   Added ONE_REG interface for debug instruction
 
 Scott Wood (1):
   kvm/ppc: don't call complete_mmio_load when it's a store
 
 Stuart Yoder (1):
   KVM: PPC: emulate dcbst
 
  arch/powerpc/include/asm/kvm_book3s.h |2 ++
  arch/powerpc/include/asm/kvm_booke.h  |2 ++
  arch/powerpc/include/uapi/asm/kvm.h   |4 
  arch/powerpc/kvm/book3s.c |6 ++
  arch/powerpc/kvm/booke.c  |6 ++
  arch/powerpc/kvm/emulate.c|2 ++
  arch/powerpc/kvm/powerpc.c|1 -
  7 files changed, 22 insertions(+), 1 deletions(-)

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/11] KVM: nVMX: shadow VMCS support, v5

2013-04-22 Thread Gleb Natapov
On Thu, Apr 18, 2013 at 02:34:25PM +0300, Abel Gordon wrote:
 This series of patches implements shadow-vmcs capability for nested VMX.
 
Applied, thanks.

 Shadow-vmcs - background and overview:
 
  In Intel VMX, vmread and vmwrite privileged instructions are used by the
  hypervisor to read and modify the guest and host specifications (VMCS). In a
 nested virtualization environment, L1 executes multiple vmread and vmwrite
 instructions to handle a single L2 exit. Each vmread and vmwrite executed by
 L1 traps (causes an exit) to the L0 hypervisor (KVM). L0 emulates the
 instruction behaviour and resumes L1 execution.
 
  Removing the need to trap and emulate these special instructions reduces the
  number of exits and improves nested virtualization performance. As it was
  first evaluated in [1], exit-less vmread and vmwrite can reduce nested
  virtualization overhead by up to 40%.
  
  Intel introduced a new feature to their processors called shadow-vmcs.  Using
  shadow-vmcs, L0 can configure the processor to let L1 running in guest-mode
  access VMCS12 fields using vmread and vmwrite instructions but without
  causing an exit to L0. The VMCS12 fields' data is stored in a shadow-vmcs
  controlled by L0.
 
 Shadow-vmcs - design considerations: 
 
  A shadow-vmcs is processor-dependent and must be accessed by L0 or L1 using
  vmread and vmwrite instructions. With nested virtualization we aim to
  abstract the hardware from the L1 hypervisor. Thus, to avoid hardware
  dependencies we preferred to keep the software-defined VMCS12 format as part
  of L1 address space and hold the processor-specific shadow-vmcs format only
  in L0 address space. In other words, the shadow-vmcs is used by L0 as an
  accelerator, but the format and content are never exposed to L1 directly. L0
  syncs the content of the processor-specific shadow vmcs with the content of
  the software-controlled VMCS12 format.
 
  We could have kept the processor-specific shadow-vmcs format in L1 address
  space to avoid using the software-defined VMCS12 format; however, this type
  of design/implementation would have created hardware dependencies and
  would complicate other capabilities (e.g. Live Migration of L1).
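
  For reference, a rough sketch of one sync direction (shadow vmcs into
  VMCS12), not taken from the series itself: the field list, the
  vmcs12_write_any() helper and the nested.shadow_vmcs pointer are
  illustrative assumptions, while vmcs_load()/vmcs_readl() are accessors
  already used throughout arch/x86/kvm/vmx.c.

  /*
   * Illustrative sketch only.  Real code also has to dispatch on field
   * width (16/32/64/natural) and on the direction of the sync.
   */
  static const unsigned long shadow_fields[] = {
  	GUEST_RIP, GUEST_RSP, GUEST_CR0,	/* ...and the rest... */
  };

  static void copy_shadow_to_vmcs12(struct vcpu_vmx *vmx)
  {
  	int i;

  	/* Make the shadow vmcs current so vmread targets it. */
  	vmcs_load(vmx->nested.shadow_vmcs);

  	for (i = 0; i < ARRAY_SIZE(shadow_fields); i++)
  		vmcs12_write_any(&vmx->vcpu, shadow_fields[i],
  				 vmcs_readl(shadow_fields[i]));

  	/* Switch back to the VMCS used to actually run the vcpu. */
  	vmcs_load(vmx->loaded_vmcs->vmcs);
  }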
 
 Changes since v1:
  1) Added sync_shadow_vmcs flag used to indicate when the content of VMCS12
     must be copied to the shadow vmcs. The flag value is checked during
     vmx_vcpu_run.
  2) Code quality improvements
 
 Changes since v2:
  1) Allocate shadow vmcs only once per VCPU on handle_vmxon and re-use the 
 same instance for multiple VMCS12s
  2) More code quality improvements
 
 Changes since v3:
  1) Fixed VMXON emulation (new patch).
     Previous nVMX code didn't verify if L1 is already in root mode (VMXON
     was previously called). Now we call nested_vmx_failValid if VMX is
     already ON. This is required to avoid host leaks (due to shadow vmcs
     allocation) if L1 repeatedly executes VMXON.
  2) Improved comment: clarified we do not shadow fields that are modified
     when L1 executes vmx instructions like the VM_INSTRUCTION_ERROR field.
 
 Changes since v4:
  1) Fixed free_nested: we now free the shadow vmcs also 
 when there is no current vmcs.
  
 Acknowledgments:
 
  Many thanks to
  Natapov, Gleb g...@redhat.com 
  Xu, Dongxiao dongxiao...@intel.com
  Nakajima, Jun jun.nakaj...@intel.com
  Har'El, Nadav na...@harel.org.il 
   
  for the insightful discussions, comments and reviews.
 
 
  These patches were easily created and maintained using
  Patchouli -- patch creator
  http://patchouli.sourceforge.net/
 
 
 [1] The Turtles Project: Design and Implementation of Nested Virtualization,
 http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf
 
 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: x86: fix error return code in kvm_arch_vcpu_init()

2013-04-22 Thread Gleb Natapov
On Thu, Apr 18, 2013 at 07:41:00AM +0800, Wei Yongjun wrote:
 From: Wei Yongjun yongjun_...@trendmicro.com.cn
 
 Fix to return a negative error code from the error handling
 case instead of 0, as returned elsewhere in this function.
 
 Signed-off-by: Wei Yongjun yongjun_...@trendmicro.com.cn
Applied, thanks.

 ---
  arch/x86/kvm/x86.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)
 
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index e172132..c45d656 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -6706,8 +6706,10 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
   }
   vcpu->arch.mcg_cap = KVM_MAX_MCE_BANKS;
  
 - if (!zalloc_cpumask_var(&vcpu->arch.wbinvd_dirty_mask, GFP_KERNEL))
 + if (!zalloc_cpumask_var(&vcpu->arch.wbinvd_dirty_mask, GFP_KERNEL)) {
 + r = -ENOMEM;
   goto fail_free_mce_banks;
 + }
  
   r = fx_init(vcpu);
   if (r)

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv3] KVM: x86: Fix memory leak in vmx.c

2013-04-22 Thread Gleb Natapov
On Thu, Apr 18, 2013 at 09:38:14AM -0700, Andrew Honig wrote:
 
 If userspace creates and destroys multiple VMs within the same process
 we leak 20k of memory in the userspace process context per VM.  This
 patch frees the memory in kvm_arch_destroy_vm.  If the process exits
 without closing the VM file descriptor or the file descriptor has been
 shared with another process then we don't free the memory.
 
 It's still possible for a user space process to leak memory if the last
 process to close the fd for the VM is not the process that created it.
 However, this is an unexpected case that's only caused by a user space
 process that's misbehaving.
 
 Signed-off-by: Andrew Honig aho...@google.com
Applied, thanks.

 ---
  arch/x86/kvm/x86.c |   17 +
  1 file changed, 17 insertions(+)
 
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index 8ffac42..3b389bf 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -6856,6 +6856,23 @@ void kvm_arch_sync_events(struct kvm *kvm)
  
  void kvm_arch_destroy_vm(struct kvm *kvm)
  {
 + if (current->mm == kvm->mm) {
 + /*
 +  * Free memory regions allocated on behalf of userspace,
 +  * unless the memory map has changed due to process exit
 +  * or fd copying.
 +  */
 + struct kvm_userspace_memory_region mem;
 + memset(&mem, 0, sizeof(mem));
 + mem.slot = APIC_ACCESS_PAGE_PRIVATE_MEMSLOT;
 + kvm_set_memory_region(kvm, &mem);
 +
 + mem.slot = IDENTITY_PAGETABLE_PRIVATE_MEMSLOT;
 + kvm_set_memory_region(kvm, &mem);
 +
 + mem.slot = TSS_PRIVATE_MEMSLOT;
 + kvm_set_memory_region(kvm, &mem);
 + }
   kvm_iommu_unmap_guest(kvm);
   kfree(kvm->arch.vpic);
   kfree(kvm->arch.vioapic);
 -- 
 1.7.10.4
 
 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 5/6] KVM: nVMX: Fix conditions for NMI injection

2013-04-22 Thread Gleb Natapov
On Sun, Apr 14, 2013 at 09:04:26PM +0200, Jan Kiszka wrote:
 From: Jan Kiszka jan.kis...@siemens.com
 
 The logic for checking if interrupts can be injected has to be applied
 also on NMIs. The difference is that if NMI interception is on these
 events are consumed and blocked by the VM exit.
 
 Signed-off-by: Jan Kiszka jan.kis...@siemens.com
Applied, thanks.

 ---
  arch/x86/kvm/vmx.c |   26 ++
  1 files changed, 26 insertions(+), 0 deletions(-)
 
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index 74dea94..9ad30d8 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -4190,6 +4190,12 @@ static bool nested_exit_on_intr(struct kvm_vcpu *vcpu)
   PIN_BASED_EXT_INTR_MASK;
  }
  
 +static bool nested_exit_on_nmi(struct kvm_vcpu *vcpu)
 +{
 + return get_vmcs12(vcpu)->pin_based_vm_exec_control &
 + PIN_BASED_NMI_EXITING;
 +}
 +
  static void enable_irq_window(struct kvm_vcpu *vcpu)
  {
   u32 cpu_based_vm_exec_control;
 @@ -4315,6 +4321,26 @@ static void vmx_set_nmi_mask(struct kvm_vcpu *vcpu, 
 bool masked)
  
  static int vmx_nmi_allowed(struct kvm_vcpu *vcpu)
  {
 + if (is_guest_mode(vcpu)) {
 + struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
 +
 + if (to_vmx(vcpu)->nested.nested_run_pending)
 + return 0;
 + if (nested_exit_on_nmi(vcpu)) {
 + nested_vmx_vmexit(vcpu);
 + vmcs12->vm_exit_reason = EXIT_REASON_EXCEPTION_NMI;
 + vmcs12->vm_exit_intr_info = NMI_VECTOR |
 + INTR_TYPE_NMI_INTR | INTR_INFO_VALID_MASK;
 + /*
 +  * The NMI-triggered VM exit counts as injection:
 +  * clear this one and block further NMIs.
 +  */
 + vcpu->arch.nmi_pending = 0;
 + vmx_set_nmi_mask(vcpu, true);
 + return 0;
 + }
 + }
 +
   if (!cpu_has_virtual_nmis() && to_vmx(vcpu)->soft_vnmi_blocked)
   return 0;
  
 -- 
 1.7.3.4

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] kvm/svm: emulate SVM_KEY MSR

2013-04-22 Thread Gleb Natapov
On Mon, Apr 22, 2013 at 09:24:12AM +0530, Prasad Joshi wrote:
 On Sun, Apr 21, 2013 at 2:49 PM, Gleb Natapov g...@redhat.com wrote:
  On Sun, Apr 21, 2013 at 02:12:21PM +0530, prasadjoshi.li...@gmail.com wrote:
  From: Prasad Joshi prasadjoshi.li...@gmail.com
 
  Write only SVM_KEY can be used to create a password protected mechanism
  to clear VM_CR.LOCK.
 
  This key can only be set when VM_CR.LOCK is unset. Once this key is set,
  a write to it compares the written value with already set key. The
  VM_CR.LOCK is unset only if the key matches. All reads from this MSR
  must return 0.
 
  Signed-off-by: Prasad Joshi prasadjoshi.li...@gmail.com
  ---
   arch/x86/include/uapi/asm/msr-index.h |1 +
   arch/x86/kvm/svm.c|   31 
  +++
   2 files changed, 32 insertions(+)
 
  diff --git a/arch/x86/include/uapi/asm/msr-index.h 
  b/arch/x86/include/uapi/asm/msr-index.h
  index 892ce40..eda8346 100644
  --- a/arch/x86/include/uapi/asm/msr-index.h
  +++ b/arch/x86/include/uapi/asm/msr-index.h
  @@ -170,6 +170,7 @@
 
    #define MSR_AMD64_PATCH_LEVEL    0x0000008b
    #define MSR_AMD64_TSC_RATIO      0xc0000104
   +#define MSR_AMD64_SVM_KEY        0xc0010118
    #define MSR_AMD64_NB_CFG         0xc001001f
    #define MSR_AMD64_PATCH_LOADER   0xc0010020
   #define MSR_AMD64_OSVW_ID_LENGTH 0xc0010140
  diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
  index fcdfdea..73b2dc2 100644
  --- a/arch/x86/kvm/svm.c
  +++ b/arch/x86/kvm/svm.c
  @@ -150,6 +150,8 @@ struct vcpu_svm {
 
struct nested_state nested;
 
  + u64 svm_key; /* Write only SVM Key */
  +
bool nmi_singlestep;
 
unsigned int3_injected;
  @@ -3079,6 +3081,13 @@ static int svm_get_msr(struct kvm_vcpu *vcpu, 
  unsigned ecx, u64 *data)
case MSR_VM_CR:
*data = svm->nested.vm_cr_msr;
break;
  + case MSR_AMD64_SVM_KEY:
  + /*
  +  * SVM Key is password protected mechanism to manage
  +  * VM_CR.LOCK and therefore reading it must always return 0.
  +  */
  + *data = 0;
  + break;
case MSR_IA32_UCODE_REV:
*data = 0x01000065;
break;
  @@ -3132,6 +3141,26 @@ static int svm_set_vm_cr(struct kvm_vcpu *vcpu, u64 
  data)
return 0;
   }
 
  +static int svm_set_svm_key(struct kvm_vcpu *vcpu, u64 data)
  +{
  + struct vcpu_svm *svm = to_svm(vcpu);
  +
  + if (!boot_cpu_has(X86_FEATURE_SVML))
  + return 1;
  +
  Since the feature is emulated why do you care if host cpu supports it?
 
 Thanks Gleb. I changed this to checking SVML feature on the guest CPU.
 
 Something like this should work
 
  if (!(cpuid_edx(SVM_CPUID_FUNC) & SVM_FEATURE_SVML))
 return 1;
 
This still checks host cpu. You need to use kvm_find_cpuid_entry().
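
A minimal sketch of what such a guest-CPUID-based check could look like,
assuming the SVM_CPUID_FUNC and SVM_FEATURE_SVML constants already used in
arch/x86/kvm/svm.c; the helper name and placement are illustrative, not the
actual respin:

static bool guest_has_svml(struct kvm_vcpu *vcpu)
{
	struct kvm_cpuid_entry2 *entry;

	/* Look at the CPUID the guest was given, not the host CPU. */
	entry = kvm_find_cpuid_entry(vcpu, SVM_CPUID_FUNC, 0);
	if (!entry)
		return false;

	/* SVM feature flags are reported in EDX of leaf 0x8000000A. */
	return entry->edx & SVM_FEATURE_SVML;
}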

 
  + if (!(svm->nested.vm_cr_msr & SVM_VM_CR_SVM_LOCK_MASK) && data) {
  + /* vm_cr.lock not set: set the password key */
  + svm->svm_key = data;
  + } else {
  + /* vm_cr.lock is set */
  + if (data && svm->svm_key == data) {
  + /* password key matched: reset the lock */
  + svm->nested.vm_cr_msr &= ~SVM_VM_CR_SVM_LOCK_MASK;
  + }
  + }
  That's not enough. First of all, KVM does not currently check
  SVMDIS before allowing EFER_SVME to be set. Second, svm_set_vm_cr()
  erroneously ignores writes to VM_CR SVMDIS/LOCK if SVMDIS is set, while it
  should check LOCK.
 
 I was aware of these changes (BUGs), however thought they must go in
 another patch.
It should go to another patch, but send them together with the series
that implement LOCK feature. Without the fixes LOCK feature does not
work.

 Any ways, I will include them in the next patch series.
 
Thanks.

  Third when write is done by a userspace during reset the
  key check should be ignored.
 
 I missed this part. I think, CPU reset must set svm_key to 0 and reset
 the VM_CR.LOCK. Both of these can be part of init_vmcb() which is
 called during CPU reset. It should ensure that the no key is
 registered during CPU reset.
init_vmcb() is called as part of kvm_vcpu_reset(). kvm_vcpu_reset() is
called during vcpu creation and when the vcpu receives an INIT signal (if
the irqchip is in kernel). An INIT signal does not cause LOCK to be unset,
otherwise the lock would be pointless. Power-up reset is emulated by
userspace configuring the vcpu with reset state, so userspace should
be able to set any value there.
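
A hedged sketch pulling these review points together, not the actual respin:
it assumes the guest_has_svml() helper sketched above and the struct msr_data
plumbing that svm_set_msr() already receives (msr->host_initiated is set for
writes coming from userspace, e.g. while configuring reset state).

static int svm_set_svm_key(struct kvm_vcpu *vcpu, struct msr_data *msr)
{
	struct vcpu_svm *svm = to_svm(vcpu);
	u64 data = msr->data;

	if (!guest_has_svml(vcpu))
		return 1;

	if (msr->host_initiated) {
		/* Userspace (reset) may set any value without a key check. */
		svm->svm_key = data;
		return 0;
	}

	if (!(svm->nested.vm_cr_msr & SVM_VM_CR_SVM_LOCK_MASK)) {
		/* LOCK clear: a non-zero write installs the password key. */
		if (data)
			svm->svm_key = data;
	} else if (data && svm->svm_key == data) {
		/* LOCK set and key matched: clear the lock. */
		svm->nested.vm_cr_msr &= ~SVM_VM_CR_SVM_LOCK_MASK;
	}

	return 0;
}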

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 1/2] KVM: nVMX: Validate EFER values for VM_ENTRY/EXIT_LOAD_IA32_EFER

2013-04-22 Thread Paolo Bonzini
Il 20/04/2013 10:52, Jan Kiszka ha scritto:
 As we may emulate the loading of EFER on VM-entry and VM-exit, implement
 the checks that VMX performs on the guest and host values on vmlaunch/
 vmresume. Factor out kvm_valid_efer for this purpose which checks for
 set reserved bits.
 
 Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 ---
 
 Changes in v2:
  - refactored if clauses as requested by Paolo
  - fixed typo in comment found by Marcelo
 
  arch/x86/include/asm/kvm_host.h |1 +
  arch/x86/kvm/vmx.c  |   40 
 +++
  arch/x86/kvm/x86.c  |   29 ++-
  3 files changed, 60 insertions(+), 10 deletions(-)
 
 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
 index 599f98b..18635ae 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -809,6 +809,7 @@ static inline int emulate_instruction(struct kvm_vcpu 
 *vcpu,
  }
  
  void kvm_enable_efer_bits(u64);
 +bool kvm_valid_efer(struct kvm_vcpu *vcpu, u64 efer);
  int kvm_get_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *data);
  int kvm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
  
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index 19aebc7..e3b951f 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -7327,6 +7327,7 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool 
 launch)
   struct vcpu_vmx *vmx = to_vmx(vcpu);
   int cpu;
   struct loaded_vmcs *vmcs02;
 + bool ia32e;
  
   if (!nested_vmx_check_permission(vcpu) ||
   !nested_vmx_check_vmcs12(vcpu))
 @@ -7415,6 +7416,45 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool 
 launch)
   }
  
   /*
 +  * If the “load IA32_EFER” VM-entry control is 1, the following checks
 +  * are performed on the field for the IA32_EFER MSR:
 +  * - Bits reserved in the IA32_EFER MSR must be 0.
 +  * - Bit 10 (corresponding to IA32_EFER.LMA) must equal the value of
 +  *   the IA-32e mode guest VM-exit control. It must also be identical
 +  *   to bit 8 (LME) if bit 31 in the CR0 field (corresponding to
 +  *   CR0.PG) is 1.
 +  */
 + if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_EFER) {
 + ia32e = (vmcs12->vm_entry_controls & VM_ENTRY_IA32E_MODE) != 0;
 + if (!kvm_valid_efer(vcpu, vmcs12->guest_ia32_efer) ||
 + ia32e != !!(vmcs12->guest_ia32_efer & EFER_LMA) ||
 + ((vmcs12->guest_cr0 & X86_CR0_PG) &&
 +  ia32e != !!(vmcs12->guest_ia32_efer & EFER_LME))) {
 + nested_vmx_entry_failure(vcpu, vmcs12,
 + EXIT_REASON_INVALID_STATE, ENTRY_FAIL_DEFAULT);
 + return 1;
 + }
 + }
 +
 + /*
 +  * If the load IA32_EFER VM-exit control is 1, bits reserved in the
 +  * IA32_EFER MSR must be 0 in the field for that register. In addition,
 +  * the values of the LMA and LME bits in the field must each be that of
 +  * the host address-space size VM-exit control.
 +  */
 + if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_EFER) {
 + ia32e = (vmcs12->vm_exit_controls &
 +  VM_EXIT_HOST_ADDR_SPACE_SIZE) != 0;
 + if (!kvm_valid_efer(vcpu, vmcs12->host_ia32_efer) ||
 + ia32e != !!(vmcs12->host_ia32_efer & EFER_LMA) ||
 + ia32e != !!(vmcs12->host_ia32_efer & EFER_LME)) {
 + nested_vmx_entry_failure(vcpu, vmcs12,
 + EXIT_REASON_INVALID_STATE, ENTRY_FAIL_DEFAULT);
 + return 1;
 + }
 + }

Looks good, difficult to do better with C's operator precedence rules.

Paolo

 + /*
* We're finally done with prerequisite checking, and can start with
* the nested entry.
*/
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index 50e2e10..482784d 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -845,23 +845,17 @@ static const u32 emulated_msrs[] = {
   MSR_IA32_MCG_CTL,
  };
  
 -static int set_efer(struct kvm_vcpu *vcpu, u64 efer)
 +bool kvm_valid_efer(struct kvm_vcpu *vcpu, u64 efer)
  {
 - u64 old_efer = vcpu-arch.efer;
 -
   if (efer & efer_reserved_bits)
 - return 1;
 -
 - if (is_paging(vcpu)
 -  && (vcpu->arch.efer & EFER_LME) != (efer & EFER_LME))
 - return 1;
 + return false;
  
   if (efer & EFER_FFXSR) {
   struct kvm_cpuid_entry2 *feat;
  
   feat = kvm_find_cpuid_entry(vcpu, 0x80000001, 0);
   if (!feat || !(feat->edx & bit(X86_FEATURE_FXSR_OPT)))
 - return 1;
 + return false;
   }
  
   if (efer & EFER_SVME) {
   struct kvm_cpuid_entry2 *feat;
  
   feat = kvm_find_cpuid_entry(vcpu, 0x80000001, 0);
   if (!feat || !(feat->ecx &

Re: [PATCH v7 2/3] tcm_vhost: Add helper to check if endpoint is setup

2013-04-22 Thread Asias He
On Thu, Apr 18, 2013 at 10:38:23AM +0300, Michael S. Tsirkin wrote:
 On Thu, Apr 18, 2013 at 04:32:30PM +0800, Asias He wrote:
  On Thu, Apr 18, 2013 at 10:09:53AM +0300, Michael S. Tsirkin wrote:
   On Thu, Apr 18, 2013 at 09:05:53AM +0800, Asias He wrote:
Signed-off-by: Asias He as...@redhat.com
---
 drivers/vhost/tcm_vhost.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/drivers/vhost/tcm_vhost.c b/drivers/vhost/tcm_vhost.c
index 88ebb79..8f05528 100644
--- a/drivers/vhost/tcm_vhost.c
+++ b/drivers/vhost/tcm_vhost.c
@@ -111,6 +111,24 @@ static bool tcm_vhost_check_feature(struct 
vhost_scsi *vs, int feature)
return ret;
 }
 
+static bool tcm_vhost_check_endpoint(struct vhost_virtqueue *vq)
+{
+   bool ret = false;
+
+   /*
+   * We can handle the vq only after the endpoint is setup by calling the
+   * VHOST_SCSI_SET_ENDPOINT ioctl.
+   *
+   * TODO: Check that we are running from vhost_worker which acts
+   * as read-side critical section for vhost kind of RCU.
+   * See the comments in struct vhost_virtqueue in drivers/vhost/vhost.h
+   */
+   if (rcu_dereference_check(vq->private_data, 1))
+   ret = true;
+
+   return ret;
+}
+
   
   You can't use RCU in this way. You need to actually
   derefence. Instead, just move this within vq mutex
   and do rcu_dereference_protected. document that this function
   requires vq mutex.
  
  Any difference with: 
  
 vs_tpg = rcu_dereference_check(vq->private_data, 1);
 
 Yes, it's not the same. rcu_dereference_protected says
 this is generally RCU but here I use it under lock.
 rcu_dereference_check can only be used if you
 really dereference later, and you don't.

So you want the helper to be this?

static bool tcm_vhost_check_endpoint()
{
struct tcm_vhost_tpg **vs_tpg;

vs_tpg = rcu_dereference_check(vq->private_data, 1);
if (vs_tpg)
return true;
else
return false
}

And what different code will the compiler generate with this:

static bool tcm_vhost_check_endpoint()
{

if (rcu_dereference_check(vq->private_data, 1))
return true
else
return false
}


 
 if (vs_tpg)
 return true
 else
  return false;
  
  Note, tcm_vhost_check_endpoint() is called in multiple places. Having a
  helper is more convenient.
 
 I'm fine with the helper, but fix the implementation
 and fix all callers to have vq mutex when calling.

All the callers are in the vhost thread. Why does the vq mutex need to be taken?

 static int tcm_vhost_check_true(struct se_portal_group *se_tpg)
 {
return 1;
-- 
1.8.1.4
  
  -- 
  Asias

-- 
Asias
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v2] kvm: Emulate MOVBE

2013-04-22 Thread Paolo Bonzini
Il 21/04/2013 14:23, Borislav Petkov ha scritto:
 On Sun, Apr 21, 2013 at 01:46:50PM +0200, Borislav Petkov wrote:
 We probably need something with copying values to a temp variable or so.
 
 Basically something like that:
 
 case 2:
 /*
  * From MOVBE definition: ...When the operand size is 16 bits,
  * the upper word of the destination register remains unchanged
  * ...
  *
  * Both casting ->valptr and ->val to u16 breaks strict aliasing
  * rules so we have to do the operation almost per hand.
  */
 tmp = (u16)ctxt->src.val;
 ctxt->dst.val &= ~0xffffUL;
 ctxt->dst.val |= (unsigned long)swab16(tmp);
   break;
 
 This passes all gcc checks, even the stricter ones when building with W=3.

I thought the valptr one was ok.  I find this one more readable, too.
What does the generated code look like?

Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v7 3/3] tcm_vhost: Add hotplug/hotunplug support

2013-04-22 Thread Asias He
On Sat, Apr 20, 2013 at 10:01:31PM +0300, Michael S. Tsirkin wrote:
 On Fri, Apr 19, 2013 at 10:34:10AM +0800, Asias He wrote:
  On Thu, Apr 18, 2013 at 11:21:54AM +0300, Michael S. Tsirkin wrote:
   On Thu, Apr 18, 2013 at 04:59:08PM +0800, Asias He wrote:
On Thu, Apr 18, 2013 at 10:34:29AM +0300, Michael S. Tsirkin wrote:
 On Thu, Apr 18, 2013 at 09:05:54AM +0800, Asias He wrote:
  In commit 365a7150094 ([SCSI] virtio-scsi: hotplug support for
  virtio-scsi), hotplug support is added to virtio-scsi.
  
  This patch adds hotplug and hotunplug support to tcm_vhost.
  
  You can create or delete a LUN in targetcli to hotplug or hotunplug 
  a
  LUN in guest.
  
  Changes in v7:
  - Add vhost_work_flush for vs-vs_event_work to this series
  
  Changes in v6:
  - Pass tcm_vhost_evt to tcm_vhost_do_evt_work
  
  Changes in v5:
  - Switch to int from u64 to vs_events_nr
  - Set s-vs_events_dropped flag in tcm_vhost_allocate_evt
  - Do not nest dev mutex within vq mutex
  - Use vs_events_lock to protect vs_events_dropped and vs_events_nr
  - Rebase to target/master
  
  Changes in v4:
  - Drop tcm_vhost_check_endpoint in tcm_vhost_send_evt
  - Add tcm_vhost_check_endpoint in vhost_scsi_evt_handle_kick
  
  Changes in v3:
  - Separate the bug fix to another thread
  
  Changes in v2:
  - Remove code duplication in tcm_vhost_{hotplug,hotunplug}
  - Fix racing of vs_events_nr
  - Add flush fix patch to this series
  
  Signed-off-by: Asias He as...@redhat.com
  Reviewed-by: Stefan Hajnoczi stefa...@redhat.com
  ---
   drivers/vhost/tcm_vhost.c | 208 
  +-
   drivers/vhost/tcm_vhost.h |  10 +++
   2 files changed, 214 insertions(+), 4 deletions(-)
  
  diff --git a/drivers/vhost/tcm_vhost.c b/drivers/vhost/tcm_vhost.c
  index 8f05528..da2021b 100644
  --- a/drivers/vhost/tcm_vhost.c
  +++ b/drivers/vhost/tcm_vhost.c
  @@ -66,11 +66,13 @@ enum {
* TODO: debug and remove the workaround.
*/
   enum {
  -   VHOST_SCSI_FEATURES = VHOST_FEATURES & (~VIRTIO_RING_F_EVENT_IDX)
  +   VHOST_SCSI_FEATURES = (VHOST_FEATURES & (~VIRTIO_RING_F_EVENT_IDX)) |
  + (1ULL << VIRTIO_SCSI_F_HOTPLUG)
   };
   
   #define VHOST_SCSI_MAX_TARGET  256
   #define VHOST_SCSI_MAX_VQ  128
  +#define VHOST_SCSI_MAX_EVENT   128
   
   struct vhost_scsi {
   /* Protected by vhost_scsi->dev.mutex */
  @@ -82,6 +84,13 @@ struct vhost_scsi {
   
  struct vhost_work vs_completion_work; /* cmd completion work 
  item */
  struct llist_head vs_completion_list; /* cmd completion queue */
  +
  +   struct vhost_work vs_event_work; /* evt injection work item */
  +   struct llist_head vs_event_list; /* evt injection queue */
  +
  +   struct mutex vs_events_lock; /* protect vs_events_dropped,events_nr */
 
 Looking at this code, there are just so many locks now.
 This does not make me feel safe :)
 At least, please document lock nesting.
 

Do you see a real problem?
   
   Complexity is a real problem. My head already spins. No I don't see a
   bug, but we need to simplify locking.
   
   And I think I see a nice way to do this:
   1. move away from a global work to per-event work - so no list
   2. remove dynamic allocation of events - so no events_nr
   3. set/test overrun flag under the appropriate vq mutex
   
   I think that's ideal.  But we can move there in small steps.  As a first
   step - why can't we always take the vq mutex lock and drop
   vs_events_lock?
  
  There are really different ways to solve the same problem. You are
  welcome to implement your ideas.
 
 I have a problem with this patch. It just seems too tricky,
 while it is really trying to do a very simple thing.

How tricky is it?

 There's new lock nesting.  

I used dev mutex in the past, but it does not work because it introduces
a deadlock in vhost_scsi_set_endpoint. vs_events is per-device state,
so I think it deserves a per device lock instead of a per queue lock.

 There are new places which do lock, read
 data, unlock, use data.

With vs_events_lock?

 There's new RCU. 

It is not a *new* RCU. It uses the existing one vq-private_data.

 All this feels fragile, and
 will be hard to maintain even though I don't see bugs.

 So I would prefer it if you tried to simplify
 this code making it easier to maintain. I sent some ideas on how to
 do this, but if you like, do it some other way.

I think it is pretty simple already.

 I also made some comments about the coding style but these
 are less important (except goto again I guess) if these
 will be te only issues, I'd say we can apply.
 
 
  +   bool vs_events_dropped; /* any missed events */
  +  

Re: [PATCH v3 00/15] KVM: MMU: fast zap all shadow pages

2013-04-22 Thread Gleb Natapov
On Sun, Apr 21, 2013 at 10:09:29PM +0800, Xiao Guangrong wrote:
 On 04/21/2013 09:03 PM, Gleb Natapov wrote:
  On Tue, Apr 16, 2013 at 02:32:38PM +0800, Xiao Guangrong wrote:
  This patchset is based on my previous two patchset:
  [PATCH 0/2] KVM: x86: avoid potential soft lockup and unneeded mmu reload
  (https://lkml.org/lkml/2013/4/1/2)
 
  [PATCH v2 0/6] KVM: MMU: fast invalid all mmio sptes
  (https://lkml.org/lkml/2013/4/1/134)
 
  Changlog:
  V3:
completely redesign the algorithm, please see below.
 
  This looks pretty complicated. Is it still needed in order to avoid soft
  lockups after avoid potential soft lockup and unneeded mmu reload patch?
 
 Yes.
 
 I discussed this point with Marcelo:
 
 ==
 BTW, to be honest, I do not think spin_needbreak is a good way - it does
 not fix the hot-lock contention and it just occupies more cpu time to avoid
 possible soft lock-ups.
 
 Especially, zap-all-shadow-pages can let other vcpus fault and vcpus contend
 for mmu-lock; then zap-all-shadow-pages releases mmu-lock and waits, and
 other vcpus create page tables again. zap-all-shadow-pages needs a long time
 to finish; in the worst case it can never complete on intensive vcpu and
 memory usage.
 
So what about mixed approach: use generation numbers and reload roots to
quickly invalidate all shadow pages and then do kvm_mmu_zap_all_invalid().
kvm_mmu_zap_all_invalid() is a new function that invalidates only shadow
pages with stale generation number (and uses lock break technique). It
may traverse active_mmu_pages from tail to head since new shadow pages
will be added to the head of the list or it may use invalid slot rmap to
find exactly what should be invalidated.
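
A rough sketch of that idea, not an actual patch: tag shadow pages with the
generation they were created in, bump a global generation to invalidate
everything at once, then lazily zap only stale pages with lock breaks.  The
mmu_valid_gen fields are assumptions introduced here for illustration;
kvm_mmu_prepare_zap_page()/kvm_mmu_commit_zap_page() and cond_resched_lock()
are the existing helpers.

static void kvm_mmu_zap_all_invalid(struct kvm *kvm)
{
	struct kvm_mmu_page *sp, *node;
	LIST_HEAD(invalid_list);

	spin_lock(&kvm->mmu_lock);
restart:
	/* New pages go to the head, so walk from the tail toward the head. */
	list_for_each_entry_safe_reverse(sp, node,
					 &kvm->arch.active_mmu_pages, link) {
		if (sp->mmu_valid_gen == kvm->arch.mmu_valid_gen)
			break;	/* the remaining pages are all current */

		kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);

		if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
			kvm_mmu_commit_zap_page(kvm, &invalid_list);
			cond_resched_lock(&kvm->mmu_lock);
			goto restart;
		}
	}
	kvm_mmu_commit_zap_page(kvm, &invalid_list);
	spin_unlock(&kvm->mmu_lock);
}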

 I still think the right way to fix this kind of thing is optimization for
 mmu-lock.
 ==
 
 Which parts scare you? Let's find a way to optimize for it. ;). For example,
 if you do not like unmap_memslot_rmap_nolock(), we can simplify it - We can
 use walk_shadow_page_lockless_begin() and walk_shadow_page_lockless_end() to
 protect spte instead of kvm->being_unmaped_rmap.
 

kvm->being_unmaped_rmap is particularly tricky, although it looks
correct. Additional indirection with rmap ops also does not help following
the code. I'd rather have if(slot is invalid) in a couple of places where
things should be done differently. In most places it will be WARN_ON(slot
is invalid).

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v2] kvm: Emulate MOVBE

2013-04-22 Thread Borislav Petkov
On Mon, Apr 22, 2013 at 10:53:42AM +0200, Paolo Bonzini wrote:
 Il 21/04/2013 14:23, Borislav Petkov ha scritto:
  On Sun, Apr 21, 2013 at 01:46:50PM +0200, Borislav Petkov wrote:
  We probably need something with copying values to a temp variable or so.
  
  Basically something like that:
  
  case 2:
  /*
   * From MOVBE definition: ...When the operand size is 16 bits,
   * the upper word of the destination register remains unchanged
   * ...
   *
   * Both casting ->valptr and ->val to u16 breaks strict aliasing
   * rules so we have to do the operation almost per hand.
   */
  tmp = (u16)ctxt->src.val;
  ctxt->dst.val &= ~0xffffUL;
  ctxt->dst.val |= (unsigned long)swab16(tmp);
  break;
  
  This passes all gcc checks, even the stricter ones when building with W=3.
 
 I thought the valptr one was ok.

Yep, it looked like that too. And, it could actually really be ok and
the gcc's warning here is bogus. I'll try to talk to gcc people about
it.

 I find this one more readable, too. What does the generated code look
 like?

Well, so so:

	movzwl  112(%rdi), %eax   # ctxt_5(D)->src.D.27823.val, tmp87
	movq    240(%rdi), %rdx   # ctxt_5(D)->dst.D.27823.val, tmp89
	xorw    %dx, %dx          # tmp89
	rolw    $8, %ax           #, tmp87
	movzwl  %ax, %eax         # tmp87, tmp91

I have a hard time understanding why it is adding this insn here - it can
simply drop it and continue with the 64-bit OR. It's not like it changes
anything...

	orq     %rdx, %rax        # tmp89, tmp91
	movq    %rax, 240(%rdi)   # tmp91, ctxt_5(D)->dst.D.27823.val

Btw, I wanted to ask: when kvm commits the results, does it look at
ctxt->op_bytes to know exactly how many bytes to write to the guest?
Because if it does, we can save ourselves the trouble here.

Or does it simply write both the full sizeof(unsigned long) bytes of
->src.val and ->dst.val to the guest?

Thanks.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v2] kvm: Emulate MOVBE

2013-04-22 Thread Gleb Natapov
On Mon, Apr 22, 2013 at 11:38:10AM +0200, Borislav Petkov wrote:
 On Mon, Apr 22, 2013 at 10:53:42AM +0200, Paolo Bonzini wrote:
  Il 21/04/2013 14:23, Borislav Petkov ha scritto:
   On Sun, Apr 21, 2013 at 01:46:50PM +0200, Borislav Petkov wrote:
   We probably need something with copying values to a temp variable or so.
   
   Basically something like that:
   
   case 2:
   /*
 * From MOVBE definition: ...When the operand size is 16 bits,
 * the upper word of the destination register remains unchanged
 * ...
 *
 * Both casting ->valptr and ->val to u16 breaks strict aliasing
 * rules so we have to do the operation almost per hand.
 */
    tmp = (u16)ctxt->src.val;
    ctxt->dst.val &= ~0xffffUL;
    ctxt->dst.val |= (unsigned long)swab16(tmp);
 break;
   
   This passes all gcc checks, even the stricter ones when building with W=3.
  
  I thought the valptr one was ok.
 
 Yep, it looked like that too. And, it could actually really be ok and
 the gcc's warning here is bogus. I'll try to talk to gcc people about
 it.
 
  I find this one more readable, too. How does the generated code look
  like?
 
 Well, so so:
 
   	movzwl  112(%rdi), %eax   # ctxt_5(D)->src.D.27823.val, tmp87
   	movq    240(%rdi), %rdx   # ctxt_5(D)->dst.D.27823.val, tmp89
   	xorw    %dx, %dx          # tmp89
   	rolw    $8, %ax           #, tmp87
   	movzwl  %ax, %eax         # tmp87, tmp91
  
  I have a hard time understanding why it is adding this insn here - it can
  simply drop it and continue with the 64-bit OR. It's not like it changes
  anything...
  
   	orq     %rdx, %rax        # tmp89, tmp91
   	movq    %rax, 240(%rdi)   # tmp91, ctxt_5(D)->dst.D.27823.val
 
 Btw, I wanted to ask: when kvm commits the results, does it look at
  ctxt->op_bytes to know exactly how many bytes to write to the guest?
 Because if it does, we can save ourselves the trouble here.
 
 Or does it simply write both the full sizeof(unsigned long) bytes of
  ->src.val and ->dst.val to the guest?
 
No, it does this in case of register operand:

static void write_register_operand(struct operand *op)
{
/* The 4-byte case *is* correct: in 64-bit mode we zero-extend.  */
switch (op->bytes) {
case 1:
        *(u8 *)op->addr.reg = (u8)op->val;
        break;
case 2:
        *(u16 *)op->addr.reg = (u16)op->val;
        break;
case 4:
        *op->addr.reg = (u32)op->val;
        break;  /* 64b: zero-extend */
case 8:
        *op->addr.reg = op->val;
break;
}
}

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v2] kvm: Emulate MOVBE

2013-04-22 Thread Borislav Petkov
On Mon, Apr 22, 2013 at 12:42:46PM +0300, Gleb Natapov wrote:
  Btw, I wanted to ask: when kvm commits the results, does it look at
   ctxt->op_bytes to know exactly how many bytes to write to the guest?
  Because if it does, we can save ourselves the trouble here.
  
  Or does it simply write both the full sizeof(unsigned long) bytes of
   ->src.val and ->dst.val to the guest?
  
 No, it does this in case of register operand:
 
 static void write_register_operand(struct operand *op)
 {
 /* The 4-byte case *is* correct: in 64-bit mode we zero-extend.  */
 switch (op-bytes) {
 case 1:
 *(u8 *)op-addr.reg = (u8)op-val;
 break;
 case 2:
 *(u16 *)op-addr.reg = (u16)op-val;
 break;
 case 4:
 *op-addr.reg = (u32)op-val;
 break;  /* 64b: zero-extend */
 case 8:
 *op-addr.reg = op-val;
 break;
 }
 }

Ok, and for OP_MEM it does look at ctxt->dst.bytes in writeback(),
AFAICT. And I see other emulated instructions like POPF, for example, do
this:

ctxt->dst.bytes = ctxt->op_bytes;

Which means, we can drop all the bullshit in em_movbe and even destroy
some of the bytes in dst.val but only write out the correct ones. Which
means, a simpler code and a lot less jumping through hoops.

Would that be the more accepted practice?

Thanks.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] KVM: nVMX: Validate EFER values for VM_ENTRY/EXIT_LOAD_IA32_EFER

2013-04-22 Thread Gleb Natapov
On Sun, Apr 14, 2013 at 12:44:09PM +0200, Jan Kiszka wrote:
 From: Jan Kiszka jan.kis...@siemens.com
 
 As we may emulate the loading of EFER on VM-entry and VM-exit, implement
 the checks that VMX performs on the guest and host values on vmlaunch/
 vmresume. Factor out kvm_valid_efer for this purpose which checks for
 set reserved bits.
 
 Signed-off-by: Jan Kiszka jan.kis...@siemens.com
Applied v2 of 1/2 and 2/2.

 ---
  arch/x86/include/asm/kvm_host.h |1 +
  arch/x86/kvm/vmx.c  |   38 ++
  arch/x86/kvm/x86.c  |   29 +++--
  3 files changed, 58 insertions(+), 10 deletions(-)
 
 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
 index b2c7263..28a458f 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -805,6 +805,7 @@ static inline int emulate_instruction(struct kvm_vcpu 
 *vcpu,
  }
  
  void kvm_enable_efer_bits(u64);
 +bool kvm_valid_efer(struct kvm_vcpu *vcpu, u64 efer);
  int kvm_get_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *data);
  int kvm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
  
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index cc2ba55..0d13b29 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -7257,6 +7257,44 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool 
 launch)
   }
  
   /*
 +  * If the “load IA32_EFER” VM-entry control is 1, the following checks
 +  * are performed on the field for the IA32_EFER MSR:
 +  * - Bits reserved in the IA32_EFER MSR must be 0.
 +  * - Bit 10 (corresponding to IA32_EFER.LMA) must equal the value of
 +  *   the IA-32e mode guest VM-exit control. It must also be identical
 +  *   to bit 8 (LME) if bit 31 in the CR0 field (corresponding to
 +  *   CR0.PG) is 1.1
 +  */
 + if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_EFER &&
 + (!kvm_valid_efer(vcpu, vmcs12->guest_ia32_efer) ||
 +  !!(vmcs12->vm_entry_controls & VM_ENTRY_IA32E_MODE) !=
 +  !!(vmcs12->guest_ia32_efer & EFER_LMA) ||
 +  (vmcs12->guest_cr0 & X86_CR0_PG &&
 +   !!(vmcs12->vm_entry_controls & VM_ENTRY_IA32E_MODE) !=
 +   !!(vmcs12->guest_ia32_efer & EFER_LME)))) {
 + nested_vmx_entry_failure(vcpu, vmcs12,
 + EXIT_REASON_INVALID_STATE, ENTRY_FAIL_DEFAULT);
 + return 1;
 + }
 +
 + /*
 +  * If the load IA32_EFER VM-exit control is 1, bits reserved in the
 +  * IA32_EFER MSR must be 0 in the field for that register. In addition,
 +  * the values of the LMA and LME bits in the field must each be that of
 +  * the host address-space size VM-exit control.
 +  */
 + if (vmcs12-vm_exit_controls  VM_EXIT_LOAD_IA32_EFER 
 + (!kvm_valid_efer(vcpu, vmcs12-host_ia32_efer) ||
 +  !!(vmcs12-vm_exit_controls  VM_EXIT_HOST_ADDR_SPACE_SIZE) !=
 +  !!(vmcs12-host_ia32_efer  EFER_LMA) ||
 +  !!(vmcs12-vm_exit_controls  VM_EXIT_HOST_ADDR_SPACE_SIZE) !=
 +  !!(vmcs12-host_ia32_efer  EFER_LME))) {
 + nested_vmx_entry_failure(vcpu, vmcs12,
 + EXIT_REASON_INVALID_STATE, ENTRY_FAIL_DEFAULT);
 + return 1;
 + }
 +
 + /*
* We're finally done with prerequisite checking, and can start with
* the nested entry.
*/
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index eb9927e..f248a3a 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -845,23 +845,17 @@ static const u32 emulated_msrs[] = {
   MSR_IA32_MCG_CTL,
  };
  
 -static int set_efer(struct kvm_vcpu *vcpu, u64 efer)
 +bool kvm_valid_efer(struct kvm_vcpu *vcpu, u64 efer)
  {
 - u64 old_efer = vcpu-arch.efer;
 -
   if (efer & efer_reserved_bits)
 - return 1;
 -
 - if (is_paging(vcpu)
 -  && (vcpu->arch.efer & EFER_LME) != (efer & EFER_LME))
 - return 1;
 + return false;
  
   if (efer & EFER_FFXSR) {
   struct kvm_cpuid_entry2 *feat;
  
   feat = kvm_find_cpuid_entry(vcpu, 0x80000001, 0);
   if (!feat || !(feat->edx & bit(X86_FEATURE_FXSR_OPT)))
 - return 1;
 + return false;
  
   if (efer & EFER_SVME) {
  
   feat = kvm_find_cpuid_entry(vcpu, 0x80000001, 0);
   if (!feat || !(feat->ecx & bit(X86_FEATURE_SVM)))
   feat = kvm_find_cpuid_entry(vcpu, 0x8001, 0);
   if (!feat || !(feat-ecx  bit(X86_FEATURE_SVM)))
 - return 1;
 + return false;
   }
  
 + return true;
 +}
 +EXPORT_SYMBOL_GPL(kvm_valid_efer);
 +
 +static int set_efer(struct kvm_vcpu *vcpu, u64 efer)
 +{
 + u64 old_efer = vcpu->arch.efer;
 +
 + if (!kvm_valid_efer(vcpu, efer))
 + return 1;
 +
 + if (is_paging(vcpu)
 +  && (vcpu->arch.efer & EFER_LME) != (efer & EFER_LME))
 + return 1;
 +
   efer &= ~EFER_LMA;

Re: [PATCH -v2] kvm: Emulate MOVBE

2013-04-22 Thread Gleb Natapov
On Mon, Apr 22, 2013 at 11:52:03AM +0200, Borislav Petkov wrote:
 On Mon, Apr 22, 2013 at 12:42:46PM +0300, Gleb Natapov wrote:
   Btw, I wanted to ask: when kvm commits the results, does it look at
    ctxt->op_bytes to know exactly how many bytes to write to the guest?
   Because if it does, we can save ourselves the trouble here.
   
   Or does it simply write both the full sizeof(unsigned long) bytes of
    ->src.val and ->dst.val to the guest?
   
  No, it does this in case of register operand:
  
  static void write_register_operand(struct operand *op)
  {
  /* The 4-byte case *is* correct: in 64-bit mode we zero-extend.  */
  switch (op->bytes) {
  case 1:
          *(u8 *)op->addr.reg = (u8)op->val;
          break;
  case 2:
          *(u16 *)op->addr.reg = (u16)op->val;
          break;
  case 4:
          *op->addr.reg = (u32)op->val;
          break;  /* 64b: zero-extend */
  case 8:
          *op->addr.reg = op->val;
  break;
  }
  }
 
 Ok, and for OP_MEM it does look at ctxt->dst.bytes in writeback(),
 AFAICT. And I see other emulated instructions like POPF, for example, do
 this:
 
   ctxt->dst.bytes = ctxt->op_bytes;
 
 Which means, we can drop all the bullshit in em_movbe and even destroy
 some of the bytes in dst.val but only write out the correct ones. Which
 means, a simpler code and a lot less jumping through hoops.
 
 Would that be the more accepted practice?
 
For most instructions the decoder already sets op->bytes to the correct value,
given that all flags are correctly specified in the opcode table. Explicit
op->bytes setting should be done only if it cannot be expressed by
opcode flags.
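
For illustration, a minimal sketch (not Borislav's actual patch) of an
em_movbe handler that simply relies on ctxt->op_bytes as set by the decoder,
so no masking of the upper destination bits is needed in the 16-bit case;
the callback shape and the swab helpers are as used in arch/x86/kvm/emulate.c:

static int em_movbe(struct x86_emulate_ctxt *ctxt)
{
	switch (ctxt->op_bytes) {
	case 2:
		/* Writeback only stores 16 bits when dst.bytes == 2. */
		ctxt->dst.val = swab16((u16)ctxt->src.val);
		break;
	case 4:
		ctxt->dst.val = swab32((u32)ctxt->src.val);
		break;
	case 8:
		ctxt->dst.val = swab64(ctxt->src.val);
		break;
	default:
		return X86EMUL_PROPAGATE_FAULT;
	}

	return X86EMUL_CONTINUE;
}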

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Fwd: kvm

2013-04-22 Thread Gary Lloyd
Hi

I was wondering if anyone could help me with an issue with KVM and ISCSI.

If we restart a controller on our EqualLogic SAN or there are any
network interruptions on the storage network, KVM guests throw a
wobbler and their file systems go read-only (CentOS 5.9 guest
with virtio driver).

I have read a few forums that indicate you can set disk timeout values
on the guests themselves but this is not possible using the virtio
driver, which is what we are currently using.

Is there any way we can instruct KVM to pause the vm's if there is a
storage failure and resume them when the storage comes back online ?

We are currently running CentOS 6.4. There seem to be werror='stop'
and rerror='stop' options to achieve this, but if I try to put these
options in the libvirt xml file for a vm, libvirt appears to be
removing them.


Thanks
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Preemptable Ticket Spinlock

2013-04-22 Thread Peter Zijlstra
On Sun, 2013-04-21 at 17:12 -0400, Rik van Riel wrote:
 
 If we always incremented the ticket number by 2 (instead of 1), then
 we could use the lower bit of the ticket number as the spinlock.

ISTR that paravirt ticket locks already do that and use the lsb to
indicate the unlock needs to perform wakeups.

Also, since all of this is virt nonsense, shouldn't it live in the
paravirt ticket lock code and leave the native code as is?

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 00/15] KVM: MMU: fast zap all shadow pages

2013-04-22 Thread Gleb Natapov
On Sun, Apr 21, 2013 at 12:35:08PM -0300, Marcelo Tosatti wrote:
 On Sun, Apr 21, 2013 at 12:27:51PM -0300, Marcelo Tosatti wrote:
  On Sun, Apr 21, 2013 at 04:03:46PM +0300, Gleb Natapov wrote:
   On Tue, Apr 16, 2013 at 02:32:38PM +0800, Xiao Guangrong wrote:
This patchset is based on my previous two patchset:
[PATCH 0/2] KVM: x86: avoid potential soft lockup and unneeded mmu 
reload
(https://lkml.org/lkml/2013/4/1/2)

[PATCH v2 0/6] KVM: MMU: fast invalid all mmio sptes
(https://lkml.org/lkml/2013/4/1/134)

Changlog:
V3:
  completely redesign the algorithm, please see below.

   This looks pretty complicated. Is it still needed in order to avoid soft
   lockups after avoid potential soft lockup and unneeded mmu reload patch?
  
  Do not want kvm_set_memory (cases: DELETE/MOVE/CREATES) to be
  susceptible to:
  
  vcpu 1  |   kvm_set_memory
  create shadow page  
  nuke shadow page
  create shadow page
  nuke shadow page
  
  Which is guest triggerable behavior with spinlock preemption algorithm.
 
 Not only guest triggerable in the sense of a malicious guest,
 but the condition above can be induced by host workload with a non-malicious
 guest system.
 
Is the problem that newly created shadow pages are immediately zapped?
Shouldn't generation number/kvm_mmu_zap_all_invalid() idea described here
https://lkml.org/lkml/2013/4/22/111 solve this?

 Also kvm_set_memory being relatively fast with huge memory guests
 is nice (which is what Xiao's idea allows).
 

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Preemptable Ticket Spinlock

2013-04-22 Thread Rik van Riel

On 04/22/2013 07:51 AM, Peter Zijlstra wrote:

On Sun, 2013-04-21 at 17:12 -0400, Rik van Riel wrote:


If we always incremented the ticket number by 2 (instead of 1), then
we could use the lower bit of the ticket number as the spinlock.


ISTR that paravirt ticket locks already do that and use the lsb to
indicate the unlock needs to perform wakeups.

Also, since all of this is virt nonsense, shouldn't it live in the
paravirt ticket lock code and leave the native code as is?


Sure, but that is still no reason not to have the virt
implementation be as fast as possible, and share the same
data type as the non-virt implementation.

Also, is it guaranteed that the native spin_lock code has
not been called yet before we switch over to the paravirt
functions?

If the native spin_lock code has been called already at
that time, the native code would still need to be modified
to increment the ticket number by 2, so we end up with a
compatible value in each spin lock's .tickets field, and
prevent a deadlock after we switch over to the paravirt
variant.
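
A self-contained sketch of the "increment tickets by 2, keep the low bit free
as a flag" idea in userspace C with <stdatomic.h>; the names and layout are
illustrative and the kernel's arch_spinlock_t and paravirt hooks are not
modeled here.

#include <stdatomic.h>
#include <stdint.h>

#define TICKET_SLOWPATH_FLAG 1u   /* low bit free because tickets step by 2 */
#define TICKET_INC           2u

struct ticket_lock {
	_Atomic uint32_t head;   /* next ticket to be served */
	_Atomic uint32_t tail;   /* next ticket to hand out  */
};

static void ticket_lock(struct ticket_lock *lock)
{
	/* Take a ticket; the low bit of the ticket value stays clear. */
	uint32_t me = atomic_fetch_add_explicit(&lock->tail, TICKET_INC,
						memory_order_relaxed);

	/* Spin until our ticket is being served (ignore the flag bit). */
	while ((atomic_load_explicit(&lock->head, memory_order_acquire) &
		~TICKET_SLOWPATH_FLAG) != me)
		;	/* cpu_relax() / PAUSE would go here */
}

static void ticket_unlock(struct ticket_lock *lock)
{
	uint32_t head = atomic_load_explicit(&lock->head,
					     memory_order_relaxed);

	/* Advance to the next ticket, clearing any slowpath flag. */
	atomic_store_explicit(&lock->head,
			      (head & ~TICKET_SLOWPATH_FLAG) + TICKET_INC,
			      memory_order_release);
}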

--
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v7 3/3] tcm_vhost: Add hotplug/hotunplug support

2013-04-22 Thread Michael S. Tsirkin
On Mon, Apr 22, 2013 at 05:20:24PM +0800, Asias He wrote:
 On Sat, Apr 20, 2013 at 10:01:31PM +0300, Michael S. Tsirkin wrote:
  On Fri, Apr 19, 2013 at 10:34:10AM +0800, Asias He wrote:
   On Thu, Apr 18, 2013 at 11:21:54AM +0300, Michael S. Tsirkin wrote:
On Thu, Apr 18, 2013 at 04:59:08PM +0800, Asias He wrote:
 On Thu, Apr 18, 2013 at 10:34:29AM +0300, Michael S. Tsirkin wrote:
  On Thu, Apr 18, 2013 at 09:05:54AM +0800, Asias He wrote:
   In commit 365a7150094 ([SCSI] virtio-scsi: hotplug support for
   virtio-scsi), hotplug support is added to virtio-scsi.
   
   This patch adds hotplug and hotunplug support to tcm_vhost.
   
   You can create or delete a LUN in targetcli to hotplug or 
   hotunplug a
   LUN in guest.
   
   Changes in v7:
   - Add vhost_work_flush for vs-vs_event_work to this series
   
   Changes in v6:
   - Pass tcm_vhost_evt to tcm_vhost_do_evt_work
   
   Changes in v5:
   - Switch to int from u64 to vs_events_nr
   - Set s-vs_events_dropped flag in tcm_vhost_allocate_evt
   - Do not nest dev mutex within vq mutex
   - Use vs_events_lock to protect vs_events_dropped and vs_events_nr
   - Rebase to target/master
   
   Changes in v4:
   - Drop tcm_vhost_check_endpoint in tcm_vhost_send_evt
   - Add tcm_vhost_check_endpoint in vhost_scsi_evt_handle_kick
   
   Changes in v3:
   - Separate the bug fix to another thread
   
   Changes in v2:
   - Remove code duplication in tcm_vhost_{hotplug,hotunplug}
   - Fix racing of vs_events_nr
   - Add flush fix patch to this series
   
   Signed-off-by: Asias He as...@redhat.com
   Reviewed-by: Stefan Hajnoczi stefa...@redhat.com
   ---
drivers/vhost/tcm_vhost.c | 208 
   +-
drivers/vhost/tcm_vhost.h |  10 +++
2 files changed, 214 insertions(+), 4 deletions(-)
   
   diff --git a/drivers/vhost/tcm_vhost.c b/drivers/vhost/tcm_vhost.c
   index 8f05528..da2021b 100644
   --- a/drivers/vhost/tcm_vhost.c
   +++ b/drivers/vhost/tcm_vhost.c
   @@ -66,11 +66,13 @@ enum {
 * TODO: debug and remove the workaround.
 */
enum {
    - VHOST_SCSI_FEATURES = VHOST_FEATURES & (~VIRTIO_RING_F_EVENT_IDX)
    + VHOST_SCSI_FEATURES = (VHOST_FEATURES & (~VIRTIO_RING_F_EVENT_IDX)) |
    +   (1ULL << VIRTIO_SCSI_F_HOTPLUG)
};

#define VHOST_SCSI_MAX_TARGET256
#define VHOST_SCSI_MAX_VQ128
   +#define VHOST_SCSI_MAX_EVENT 128

struct vhost_scsi {
  /* Protected by vhost_scsi->dev.mutex */
   @@ -82,6 +84,13 @@ struct vhost_scsi {

 struct vhost_work vs_completion_work; /* cmd completion work 
   item */
 struct llist_head vs_completion_list; /* cmd completion queue */
   +
   + struct vhost_work vs_event_work; /* evt injection work item */
   + struct llist_head vs_event_list; /* evt injection queue */
   +
    + struct mutex vs_events_lock; /* protect vs_events_dropped,events_nr */
  
  Looking at this code, there are just so many locks now.
  This does not make me feel safe :)
  At least, please document lock nesting.
  
 
 Do you see a real problem?

Complexity is a real problem. My head already spins. No I don't see a
bug, but we need to simplify locking.

And I think I see a nice way to do this:
1. move away from a global work to per-event work - so no list
2. remove dynamic allocation of events - so no events_nr
3. set/test overrun flag under the appropriate vq mutex

I think that's ideal.  But we can move there in small steps.  As a first
step - why can't we always take the vq mutex lock and drop
vs_events_lock?
   
   There are really different ways to solve the same problem. You are
   welcome to implement you ideas.
  
  I have a problem with this patch. It just seems too tricky,
  while it is really trying to do a very simple thing.
 
 How tricky is it?

It has its own locking scheme which seems unwarranted.
Find a way to use the existing locking please.

  There's new lock nesting.  
 
 I used dev mutex in the past, but it does not work because it introduces
  a deadlock in vhost_scsi_set_endpoint. vs_events is per-device state,
 so I think it deserves a per device lock instead of a per queue lock.

We don't add a lock per field, it just does not make sense.

  There are new places which do lock, read
  data, unlock, use data.
 
 With vs_events_lock?

With this patch.

  There's new RCU. 
 
 It is not a *new* RCU. It uses the existing one vq-private_data.

Incorrectly. It should be used on the workqueue where it's
dereferenced.

  All this feels fragile, and
  will be hard to maintain even though I don't see bugs.
 
  So I would 

Re: [PATCH v7 2/3] tcm_vhost: Add helper to check if endpoint is setup

2013-04-22 Thread Michael S. Tsirkin
On Mon, Apr 22, 2013 at 04:53:27PM +0800, Asias He wrote:
 On Thu, Apr 18, 2013 at 10:38:23AM +0300, Michael S. Tsirkin wrote:
  On Thu, Apr 18, 2013 at 04:32:30PM +0800, Asias He wrote:
   On Thu, Apr 18, 2013 at 10:09:53AM +0300, Michael S. Tsirkin wrote:
On Thu, Apr 18, 2013 at 09:05:53AM +0800, Asias He wrote:
 Signed-off-by: Asias He as...@redhat.com
 ---
  drivers/vhost/tcm_vhost.c | 18 ++
  1 file changed, 18 insertions(+)
 
 diff --git a/drivers/vhost/tcm_vhost.c b/drivers/vhost/tcm_vhost.c
 index 88ebb79..8f05528 100644
 --- a/drivers/vhost/tcm_vhost.c
 +++ b/drivers/vhost/tcm_vhost.c
 @@ -111,6 +111,24 @@ static bool tcm_vhost_check_feature(struct 
 vhost_scsi *vs, int feature)
   return ret;
  }
  
 +static bool tcm_vhost_check_endpoint(struct vhost_virtqueue *vq)
 +{
 + bool ret = false;
 +
 + /*
 + * We can handle the vq only after the endpoint is setup by calling the
 + * VHOST_SCSI_SET_ENDPOINT ioctl.
 + *
 + * TODO: Check that we are running from vhost_worker which acts
 + * as read-side critical section for vhost kind of RCU.
 + * See the comments in struct vhost_virtqueue in drivers/vhost/vhost.h
 + */
 + if (rcu_dereference_check(vq->private_data, 1))
 + ret = true;
 +
 + return ret;
 +}
 +

You can't use RCU in this way. You need to actually
derefence. Instead, just move this within vq mutex
and do rcu_dereference_protected. document that this function
requires vq mutex.
   
   Any difference with: 
   
 vs_tpg = rcu_dereference_check(vq->private_data, 1);
  
  Yes, it's not the same. rcu_dereference_protected says
  this is generally RCU but here I use it under lock.
  rcu_dereference_check can only be used if you
  really dereference later, and you don't.
 
 So you want the helper to be this?
 
 static bool tcm_vhost_check_endpoint()
 {
 struct tcm_vhost_tpg **vs_tpg;
 
 vs_tpg = rcu_dereference_check(vq->private_data, 1);
 if (vs_tpg)
 return true;
 else
 return false
 }
 
 And what different code will the compiler generate with this:
 
 static bool tcm_vhost_check_endpoint()
 {
 
 if (rcu_dereference_check(vq->private_data, 1))
 return true
 else
 return false
 }
 

No, what I want is that private data is
either dereferenced (not tested against NULL)
or used under vq mutex.
In this case the second option is the easiest
and the cleanest. So move it under mutex
and then test it safely without rcu.
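
A minimal sketch of what that could look like, not a posted patch: the caller
is required to hold vq->mutex and rcu_dereference_protected() documents and
lockdep-checks that rule.

static bool tcm_vhost_check_endpoint(struct vhost_virtqueue *vq)
{
	/* Caller must hold vq->mutex. */
	return rcu_dereference_protected(vq->private_data,
					 lockdep_is_held(&vq->mutex)) != NULL;
}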

  
  if (vs_tpg)
  return true
  else
   return false;
   
   Note, tcm_vhost_check_endpoint() is called in multiple places. Having a
   helper is more convenient.
  
  I'm fine with the helper, but fix the implementation
  and fix all callers to have vq mutex when calling.
 
 All the callers are in the vhost thread. Why does the vq mutex need to be taken?

vhost thread is an implementation detail, people are experimenting with
various threading strategies.  In particular, code must
keep working when we move it to use a workqueue.

  static int tcm_vhost_check_true(struct se_portal_group *se_tpg)
  {
   return 1;
 -- 
 1.8.1.4
   
   -- 
   Asias
 
 -- 
 Asias
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 00/15] KVM: MMU: fast zap all shadow pages

2013-04-22 Thread Takuya Yoshikawa
On Mon, 22 Apr 2013 15:39:38 +0300
Gleb Natapov g...@redhat.com wrote:

   Do not want kvm_set_memory (cases: DELETE/MOVE/CREATES) to be
   susceptible to:
   
   vcpu 1|   kvm_set_memory
   create shadow page
 nuke shadow page
   create shadow page
 nuke shadow page
   
   Which is guest triggerable behavior with spinlock preemption algorithm.
  
  Not only guest triggerable in the sense of a malicious guest,
  but the condition above can be induced by host workload with a non-malicious
  guest system.
  
 Is the problem that newly created shadow pages are immediately zapped?
 Shouldn't generation number/kvm_mmu_zap_all_invalid() idea described here
 https://lkml.org/lkml/2013/4/22/111 solve this?

I guess so.  That's what Avi described when he tried to achieve
lockless TLB flushes.  Mixing that idea with Xiao's approach will
achieve reasonably nice performance, I think.

Various improvements should be added later on top of that if needed.

  Also kvm_set_memory being relatively fast with huge memory guests
  is nice (which is what Xiaos idea allows).

I agree with this point.  But if so, it should be actually measured on
such guests, even if the algorithm looks promising.

Takuya
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v2] kvm: Emulate MOVBE

2013-04-22 Thread Borislav Petkov
On Mon, Apr 22, 2013 at 12:58:12PM +0300, Gleb Natapov wrote:
 For most instructions the decoder already sets op->bytes to the correct
 value, given that all flags are correctly specified in the opcode table.
 Explicit op->bytes setting should be done only if it cannot be
 expressed by opcode flags.

MOVBE encodes operands in ModRM and operand size is determined by the
effective operand size. By looking at that switch(mode) thing near
the beginning of x86_decode_insn, we make sure ctxt->op_bytes is set
accordingly. Then, we have the following definitions for MOVBE:

+   [0xf0] = I(DstReg | SrcMem | ModRM | Mov | ThreeByte | EmulateOnUD, em_movbe),
+   [0xf1] = I(DstMem | SrcReg | ModRM | Mov | ThreeByte | EmulateOnUD, em_movbe),

and from looking at decode_operand(), it makes sure that op->bytes gets
the correct value since we have the proper {Src,Dst}{Reg,Mem} flags in
the insn definition.

So everything is fine, I'll make sure it works that way too, though,
when testing.

Thanks.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Bug 55201] host panic when creating guest, doing scp and killing QEMU process continuously

2013-04-22 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=55201


Jay Ren yongjie@intel.com changed:

   What|Removed |Added

 Status|VERIFIED|CLOSED




-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Bug 55421] igb VF can't work in KVM guest

2013-04-22 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=55421


Jay Ren yongjie@intel.com changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution||INVALID




--- Comment #5 from Jay Ren yongjie@intel.com  2013-04-22 14:14:26 ---
As this is the latest igb driver change, I'll close this bug.

There are two methods to work around or fix this issue.
1. update igbvf driver or kernel of the guest
2. use ip command to set VF's MAC before assigning to a guest
(e.g. ip link set eth0 vf 0 mac 00:1E:67:65:93:01 )

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Bug 55421] igb VF can't work in KVM guest

2013-04-22 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=55421


Jay Ren yongjie@intel.com changed:

   What|Removed |Added

 Status|RESOLVED|CLOSED




-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Bug 56971] New: [nested virt] L1 CPU Stuck when booting a L2 guest

2013-04-22 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=56971

   Summary: [nested virt] L1 CPU Stuck when booting a L2 guest
   Product: Virtualization
   Version: unspecified
Kernel Version: 3.9.0-rc3
  Platform: All
OS/Version: Linux
  Tree: Mainline
Status: NEW
  Severity: normal
  Priority: P1
 Component: kvm
AssignedTo: virtualization_...@kernel-bugs.osdl.org
ReportedBy: yongjie@intel.com
Regression: No


Created an attachment (id=99651)
 -- (https://bugzilla.kernel.org/attachment.cgi?id=99651)
call trace in L1 guest 

Environment:

Host OS (ia32/ia32e/IA64):ia32e
Guest OS (ia32/ia32e/IA64):ia32e
Guest OS Type (Linux/Windows):Linux
kvm.git next Commit:c0d1c770c05ac7051df86914f9627b68f29c1d67
qemu-kvm.git Commit:8912bdea01e8671e59fe0287314379be9c1f40ec
Host Kernel Version:3.9.0-rc3
Hardware: Sandy Bridge - EP


Bug detailed description:
--
Create an L1 guest with -cpu host, then create an L2 guest. We found the CPU
stuck in the L1 guest when booting the L2 guest.
This should be a regression.

kvm(next)+ qemu-kvm   =  result
188424ba + 8912bdea   = good
c0d1c770 + 8912bdea   = bad 


Reproduce steps:

1. create L1 guests:
qemu-system-x86_64 --enable-kvm -m 10240 -smp 8 -net
nic,macaddr=00:12:45:67:2B:1C -net tap,script=/etc/kvm/qemu-ifup
nested-kvm-rhel6u4.qcow -cpu host
2. create L2 guest
qemu-system-x86_64 --enable-kvm -m 1024 -smp 2 -net none rhel6u4.img


Current result:

L1 call trace  (CPU Stuck)

Expected result:

L1 and L2 work fine

Basic root-causing log: 
--


[  312.158002] BUG: soft lockup - CPU#7 stuck for 22s! [qemu-system-x86:2353]

[  312.158002] Modules linked in: bridge nfsv3 nfs_acl nfsv4 auth_rpcgss nfs
fuse fscache dns_resolver lockd sunrpc 8021q garp stp llc binfmt_misc uinput
ppdev parport_pc parport kvm_intel kvm crc32c_intel ghash_clmulni_intel
microcode pcspkr e1000 cirrus ttm drm_kms_helper drm i2c_piix4 i2c_core
floppy(F)

[  312.158002] CPU 7 

[  312.158002] Pid: 2353, comm: qemu-system-x86 Tainted: GF3.8.5 #1
Bochs Bochs

[  312.158002] RIP: 0010:[810c0982]  [810c0982]
smp_call_function_many+0x202/0x270

[  312.158002] RSP: 0018:88027a21dce8  EFLAGS: 0202

[  312.158002] RAX: 0008 RBX: 00088ede8700 RCX:
0004

[  312.158002] RDX: 0004 RSI: 0080 RDI:
0292

[  312.158002] RBP: 88027a21dd38 R08: 88029fdd4890 R09:
0080

[  312.158002] R10: 0002 R11:  R12:
0292

[  312.158002] R13: 0001 R14: ea0009e34640 R15:
00480008

[  312.158002] FS:  7ff26ef41700() GS:88029fdc()
knlGS:

[  312.158002] CS:  0010 DS:  ES:  CR0: 80050033

[  312.158002] CR2: 7ff271248000 CR3: 0002902a7000 CR4:
000427e0

[  312.158002] DR0:  DR1:  DR2:


[  312.158002] DR3:  DR6: 0ff0 DR7:
0400

[  312.158002] Process qemu-system-x86 (pid: 2353, threadinfo 88027a21c000,
task 88027d368000)

[  312.158002] Stack:

[  312.158002]  88027d368000 01ff88027d368000 88027a21dd48
8104f110

[  312.158002]  7ff23fff 88028ede89c0 88028ede8700
7ff267dff000

[  312.158002]  88027a21de18 7ff26800 88027a21dd68
8104ed5e

[  312.158002] Call Trace:

[  312.158002]  [8104f110] ? flush_tlb_mm_range+0x250/0x250

[  312.158002]  [8104ed5e] native_flush_tlb_others+0x2e/0x30

[  312.158002]  [8104ef30] flush_tlb_mm_range+0x70/0x250

[  312.158002]  [8115d822] tlb_flush_mmu+0xa2/0xb0

[  312.158002]  [8115e15c] tlb_finish_mmu+0x1c/0x50

[  312.158002]  [8116552a] unmap_region+0xea/0x110

[  312.158002]  [811672c2] ? __split_vma+0x1e2/0x230

[  312.158002]  [81167c14] do_munmap+0x274/0x3a0

[  312.158002]  [81167d91] vm_munmap+0x51/0x80

[  312.158002]  [81167dec] sys_munmap+0x2c/0x40

[  312.158002]  [81653999] system_call_fastpath+0x16/0x1b

[  312.158002] Code: a6 58 00 0f ae f0 4c 89 e7 ff 15 e2 08 b6 00 80 7d bf 00
0f 84 89 fe ff ff f6 43 20 01 0f 84 7f fe ff ff 66 0f 1f 44 00 00 f3 90 f6 43
20 01 75 f8 e9 6c fe ff ff 0f 1f 00 4c 89 ea 4c 89 f6 44

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Bug 56981] New: [SR-IOV] Intel I350 NIC VF cannot work in a Windows 2008 guest.

2013-04-22 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=56981

   Summary: [SR-IOV] Intel I350 NIC VF cannot work in a Windows
2008 guest.
   Product: Virtualization
   Version: unspecified
Kernel Version: 3.9.0-rc3
  Platform: All
OS/Version: Linux
  Tree: Mainline
Status: NEW
  Severity: normal
  Priority: P1
 Component: kvm
AssignedTo: virtualization_...@kernel-bugs.osdl.org
ReportedBy: yongjie@intel.com
Regression: No


Environment:

Host OS (ia32/ia32e/IA64):ia32e
Guest OS (ia32/ia32e/IA64):ia32e
Guest OS Type (Linux/Windows):Windows
kvm.git Commit:79558f112fc0352e057f7b5e158e3d88b8b62c60
qemu-kvm Commit:8912bdea01e8671e59fe0287314379be9c1f40ec
Host Kernel Version:3.9.0-rc3
Hardware: Sandy Bridge - EP


Bug detailed description:
--
The assigned Intel I350 NIC VF (an igbvf) cannot work in a Windows 2008 guest.

1. If the guest is Windows 7/8/2012 or RHEL6.x , the Intel I350 NIC VF can work
fine in the guest.
2. Intel 82576 NIC VF (also igbvf) can work in a Windows 2008 guest.


Reproduce steps:

1. ./pcistub.sh -h 0b:10.0   ( to hide a device)
2. qemu-system-x86_64 -enable-kvm -m 4096 -smp 4 -device
pci-assign,host=0b:10.0 -net none -hda win2k8.img

Current result:

Intel I350 NIC VF cannot work in win2k8 guest

Expected result:

Intel I350 NIC VF can work in win2k8 guest

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Bug 56971] [nested virt] L1 CPU Stuck when booting a L2 guest

2013-04-22 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=56971


Jay Ren yongjie@intel.com changed:

   What|Removed |Added

 Regression|No  |Yes




-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Bug 56981] [SR-IOV] Intel I350 NIC VF cannot work in a Windows 2008 guest.

2013-04-22 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=56981





--- Comment #1 from Jay Ren yongjie@intel.com  2013-04-22 14:43:00 ---
Created an attachment (id=99661)
 -- (https://bugzilla.kernel.org/attachment.cgi?id=99661)
lspci info of the igbvf in kvm host

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Bug 56981] [SR-IOV] Intel I350 NIC VF cannot work in a Windows 2008 guest.

2013-04-22 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=56981





--- Comment #2 from Jay Ren yongjie@intel.com  2013-04-22 14:43:25 ---
Created an attachment (id=99671)
 -- (https://bugzilla.kernel.org/attachment.cgi?id=99671)
host dmesg

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v7 2/3] tcm_vhost: Add helper to check if endpoint is setup

2013-04-22 Thread Asias He
On Mon, Apr 22, 2013 at 04:28:07PM +0300, Michael S. Tsirkin wrote:
 On Mon, Apr 22, 2013 at 04:53:27PM +0800, Asias He wrote:
  On Thu, Apr 18, 2013 at 10:38:23AM +0300, Michael S. Tsirkin wrote:
   On Thu, Apr 18, 2013 at 04:32:30PM +0800, Asias He wrote:
On Thu, Apr 18, 2013 at 10:09:53AM +0300, Michael S. Tsirkin wrote:
 On Thu, Apr 18, 2013 at 09:05:53AM +0800, Asias He wrote:
  Signed-off-by: Asias He as...@redhat.com
  ---
   drivers/vhost/tcm_vhost.c | 18 ++
   1 file changed, 18 insertions(+)
  
  diff --git a/drivers/vhost/tcm_vhost.c b/drivers/vhost/tcm_vhost.c
  index 88ebb79..8f05528 100644
  --- a/drivers/vhost/tcm_vhost.c
  +++ b/drivers/vhost/tcm_vhost.c
  @@ -111,6 +111,24 @@ static bool tcm_vhost_check_feature(struct 
  vhost_scsi *vs, int feature)
  return ret;
   }
   
  +static bool tcm_vhost_check_endpoint(struct vhost_virtqueue *vq)
  +{
  +   bool ret = false;
  +
  +   /*
  +   * We can handle the vq only after the endpoint is setup by 
  calling the
  +   * VHOST_SCSI_SET_ENDPOINT ioctl.
  +   *
  +   * TODO: Check that we are running from vhost_worker which acts
  +   * as read-side critical section for vhost kind of RCU.
  +   * See the comments in struct vhost_virtqueue in 
  drivers/vhost/vhost.h
  +   */
  +   if (rcu_dereference_check(vq-private_data, 1))
  +   ret = true;
  +
  +   return ret;
  +}
  +
 
 You can't use RCU in this way. You need to actually
 derefence. Instead, just move this within vq mutex
 and do rcu_dereference_protected. document that this function
 requires vq mutex.

Any difference with: 

   vs_tpg = rcu_dereference_check(vq-private_data, 1);
   
   Yes, it's not the same. rcu_dereference_protected says
   this is generally RCU but here I use it under lock.
   rcu_dereference_check can only be used if you
   really dereference later, and you don't.
  
  So you want the helper to be this?
  
  static bool tcm_vhost_check_endpoint()
  {
  struct tcm_vhost_tpg **vs_tpg;
  
  vs_tpg = rcu_dereference_check(vq-private_data, 1);
  if (vs_tpg)
  return true;
  else
  return false
  }
  
  And what difference code the compiler will generate with this:
  
  static bool tcm_vhost_check_endpoint()
  {
  
  if (rcu_dereference_check(vq-private_data, 1))
  return true
  else
  return false
  }
  
 
 No, what I want is that private data is
 either dereferenced (not tested against NULL)
 or used under vq mutex.
 In this case the second option is the easiest
 and the cleanest. So move it under mutex
 and then test it safely without rcu.

Why do you use it in -net? The test against NULL is inside the RCU
read-side critical section. What is really wrong with this?

   
   if (vs_tpg)
return ture
   else
return false;

Note, tcm_vhost_check_endpoint() is called in multiple places. Having a
helper is more convenient.
   
   I'm fine with the helper, but fix the implementation
   and fix all callers to have vq mutex when calling.
  
  All the caller are in the vhost thread. Why need vq mutex to be taken?
 
 vhost thread is an implementation detail, people are experimenting with
 various threading strategies.  In particular, code must
 keep working when we move it to use a workqueue.

Well, you should remove your vhost RCU thing first and use real RCU
explicitly. Does -net keep working after vhost changes to a workqueue?
Since when is this a requirement?

   static int tcm_vhost_check_true(struct se_portal_group *se_tpg)
   {
  return 1;
  -- 
  1.8.1.4

-- 
Asias
  
  -- 
  Asias

-- 
Asias
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v7 3/3] tcm_vhost: Add hotplug/hotunplug support

2013-04-22 Thread Asias He
On Mon, Apr 22, 2013 at 04:17:04PM +0300, Michael S. Tsirkin wrote:
 On Mon, Apr 22, 2013 at 05:20:24PM +0800, Asias He wrote:
  On Sat, Apr 20, 2013 at 10:01:31PM +0300, Michael S. Tsirkin wrote:
   On Fri, Apr 19, 2013 at 10:34:10AM +0800, Asias He wrote:
On Thu, Apr 18, 2013 at 11:21:54AM +0300, Michael S. Tsirkin wrote:
 On Thu, Apr 18, 2013 at 04:59:08PM +0800, Asias He wrote:
  On Thu, Apr 18, 2013 at 10:34:29AM +0300, Michael S. Tsirkin wrote:
   On Thu, Apr 18, 2013 at 09:05:54AM +0800, Asias He wrote:
In commit 365a7150094 ([SCSI] virtio-scsi: hotplug support for
virtio-scsi), hotplug support is added to virtio-scsi.

This patch adds hotplug and hotunplug support to tcm_vhost.

You can create or delete a LUN in targetcli to hotplug or 
hotunplug a
LUN in guest.

Changes in v7:
- Add vhost_work_flush for vs-vs_event_work to this series

Changes in v6:
- Pass tcm_vhost_evt to tcm_vhost_do_evt_work

Changes in v5:
- Switch to int from u64 to vs_events_nr
- Set s-vs_events_dropped flag in tcm_vhost_allocate_evt
- Do not nest dev mutex within vq mutex
- Use vs_events_lock to protect vs_events_dropped and 
vs_events_nr
- Rebase to target/master

Changes in v4:
- Drop tcm_vhost_check_endpoint in tcm_vhost_send_evt
- Add tcm_vhost_check_endpoint in vhost_scsi_evt_handle_kick

Changes in v3:
- Separate the bug fix to another thread

Changes in v2:
- Remove code duplication in tcm_vhost_{hotplug,hotunplug}
- Fix racing of vs_events_nr
- Add flush fix patch to this series

Signed-off-by: Asias He as...@redhat.com
Reviewed-by: Stefan Hajnoczi stefa...@redhat.com
---
 drivers/vhost/tcm_vhost.c | 208 
+-
 drivers/vhost/tcm_vhost.h |  10 +++
 2 files changed, 214 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/tcm_vhost.c 
b/drivers/vhost/tcm_vhost.c
index 8f05528..da2021b 100644
--- a/drivers/vhost/tcm_vhost.c
+++ b/drivers/vhost/tcm_vhost.c
@@ -66,11 +66,13 @@ enum {
  * TODO: debug and remove the workaround.
  */
 enum {
-	VHOST_SCSI_FEATURES = VHOST_FEATURES & (~VIRTIO_RING_F_EVENT_IDX)
+	VHOST_SCSI_FEATURES = (VHOST_FEATURES & (~VIRTIO_RING_F_EVENT_IDX)) |
+			      (1ULL << VIRTIO_SCSI_F_HOTPLUG)
 };
 
 #define VHOST_SCSI_MAX_TARGET  256
 #define VHOST_SCSI_MAX_VQ  128
+#define VHOST_SCSI_MAX_EVENT   128
 
 struct vhost_scsi {
/* Protected by vhost_scsi-dev.mutex */
@@ -82,6 +84,13 @@ struct vhost_scsi {
 
struct vhost_work vs_completion_work; /* cmd completion 
work item */
struct llist_head vs_completion_list; /* cmd completion 
queue */
+
+   struct vhost_work vs_event_work; /* evt injection work 
item */
+   struct llist_head vs_event_list; /* evt injection queue 
*/
+
+   struct mutex vs_events_lock; /* protect 
vs_events_dropped,events_nr */
   
   Looking at this code, there are just so many locks now.
   This does not make me feel safe :)
   At least, please document lock nesting.
   
  
  Do you see a real problem?
 
 Complexity is a real problem. My head already spins. No I don't see a
 bug, but we need to simplify locking.
 
 And I think I see a nice way to do this:
 1. move away from a global work to per-event work - so no list
 2. remove dynamic allocation of events - so no events_nr
 3. set/test overrun flag under the appropriate vq mutex
 
 I think that's ideal.  But we can move there in small steps.  As a 
 first
 step - why can't we always take the vq mutex lock and drop
 vs_events_lock?

There are really different ways to solve the same problem. You are
welcome to implement you ideas.
   
   I have a problem with this patch. It just seems too tricky,
   while it is really trying to do a very simple thing.
  
  How tricky is it?
 
 It has its own locking scheme which seems unwarranted.
 Find a way to use the existing locking please.

Why is this tricky?

   There's new lock nesting.  
  
  I used dev mutex in the past, but it does not work because it introduces
  a deadlock in vhost_scsi_set_endpoint. vs_events is a per device stat,
  so I think it deserves a per device lock instead of a per queue lock.
 
 We don't add a lock per field, it just does not make sense.
 
   There are new places which do lock, read
   data, unlock, use data.
  
  

KVM call agenda for 2013-04-23

2013-04-22 Thread Juan Quintela

Hi

Please send in any agenda topics you are interested in.

Later, Juan.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM call agenda for 2013-04-23

2013-04-22 Thread Paolo Bonzini
Il 22/04/2013 18:11, Juan Quintela ha scritto:
 
 Hi
 
 Please send in any agenda topics you are interested in.

* 1.5 pending patches

Paolo

 Later, Juan.
 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v7 2/3] tcm_vhost: Add helper to check if endpoint is setup

2013-04-22 Thread Michael S. Tsirkin
On Mon, Apr 22, 2013 at 11:00:19PM +0800, Asias He wrote:
 On Mon, Apr 22, 2013 at 04:28:07PM +0300, Michael S. Tsirkin wrote:
  On Mon, Apr 22, 2013 at 04:53:27PM +0800, Asias He wrote:
   On Thu, Apr 18, 2013 at 10:38:23AM +0300, Michael S. Tsirkin wrote:
On Thu, Apr 18, 2013 at 04:32:30PM +0800, Asias He wrote:
 On Thu, Apr 18, 2013 at 10:09:53AM +0300, Michael S. Tsirkin wrote:
  On Thu, Apr 18, 2013 at 09:05:53AM +0800, Asias He wrote:
   Signed-off-by: Asias He as...@redhat.com
   ---
drivers/vhost/tcm_vhost.c | 18 ++
1 file changed, 18 insertions(+)
   
   diff --git a/drivers/vhost/tcm_vhost.c b/drivers/vhost/tcm_vhost.c
   index 88ebb79..8f05528 100644
   --- a/drivers/vhost/tcm_vhost.c
   +++ b/drivers/vhost/tcm_vhost.c
   @@ -111,6 +111,24 @@ static bool tcm_vhost_check_feature(struct 
   vhost_scsi *vs, int feature)
 return ret;
}

   +static bool tcm_vhost_check_endpoint(struct vhost_virtqueue *vq)
   +{
   + bool ret = false;
   +
   + /*
   + * We can handle the vq only after the endpoint is setup by 
   calling the
   + * VHOST_SCSI_SET_ENDPOINT ioctl.
   + *
   + * TODO: Check that we are running from vhost_worker which acts
   + * as read-side critical section for vhost kind of RCU.
   + * See the comments in struct vhost_virtqueue in 
   drivers/vhost/vhost.h
   + */
   + if (rcu_dereference_check(vq-private_data, 1))
   + ret = true;
   +
   + return ret;
   +}
   +
  
  You can't use RCU in this way. You need to actually
  derefence. Instead, just move this within vq mutex
  and do rcu_dereference_protected. document that this function
  requires vq mutex.
 
 Any difference with: 
 
vs_tpg = rcu_dereference_check(vq-private_data, 1);

Yes, it's not the same. rcu_dereference_protected says
this is generally RCU but here I use it under lock.
rcu_dereference_check can only be used if you
really dereference later, and you don't.
   
   So you want the helper to be this?
   
   static bool tcm_vhost_check_endpoint()
   {
   struct tcm_vhost_tpg **vs_tpg;
   
   vs_tpg = rcu_dereference_check(vq-private_data, 1);
   if (vs_tpg)
   return true;
   else
   return false
   }
   
   And what difference code the compiler will generate with this:
   
   static bool tcm_vhost_check_endpoint()
   {
   
   if (rcu_dereference_check(vq-private_data, 1))
   return true
   else
   return false
   }
   
  
  No, what I want is that private data is
  either dereferenced (not tested against NULL)
  or used under vq mutex.
  In this case the second option is the easiest
  and the cleanest. So move it under mutex
  and then test it safely without rcu.
 
 Why you use it in -net? The test against NULL is under in the rcu
 read critical section. What is wrong with this really?

Look at the code and see that barriers in rcu_dereference
assume that pointer really is dereferenced.
You don't dereference it, so rcu_dereference_XXX is wrong.
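
Roughly this shape (a minimal sketch, assuming the caller already holds
vq->mutex; illustration only, not a patch):

static bool tcm_vhost_check_endpoint(struct vhost_virtqueue *vq)
{
	/* Must be called with vq->mutex held. */
	return rcu_dereference_protected(vq->private_data,
					 lockdep_is_held(&vq->mutex)) != NULL;
}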


if (vs_tpg)
   return ture
else
 return false;
 
 Note, tcm_vhost_check_endpoint() is called in multiple places. Having 
 a
 helper is more convenient.

I'm fine with the helper, but fix the implementation
and fix all callers to have vq mutex when calling.
   
   All the caller are in the vhost thread. Why need vq mutex to be taken?
  
  vhost thread is an implementation detail, people are experimenting with
  various threading strategies.  In particular, code must
  keep working when we move it to use a workqueue.
 
 Well, you should remove your vhost RCU thing first and use real RCU
 explicitly.

I don't think it's useful or possible.

 Does -net keep working after vhost changes to workqueue?

I think so, yes.

 Since when this becomes a requirement?

It's the requirement to use the abstractions that are in place. Devices
should use the interfaces and not assume vhost implementation, it
happens to be the case but it's not guaranteed.

static int tcm_vhost_check_true(struct se_portal_group *se_tpg)
{
 return 1;
   -- 
   1.8.1.4
 
 -- 
 Asias
   
   -- 
   Asias
 
 -- 
 Asias
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] KVM call agenda for 2013-04-23

2013-04-22 Thread Eric Blake
On 04/22/2013 10:11 AM, Juan Quintela wrote:
 
 Hi
 
 Please send in any agenda topics you are interested in.

* What can libvirt expect in 1.5 for introspection of command-line support?
* What are the rules for adding optional parameters to existing QMP
commands?  Would it help if we had introspection of QMP commands?

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature


Re: Preemptable Ticket Spinlock

2013-04-22 Thread Jiannan Ouyang
On Mon, Apr 22, 2013 at 1:58 AM, Raghavendra K T
raghavendra...@linux.vnet.ibm.com wrote:

  The intuition is to

 downgrade a fair lock to an unfair lock automatically upon preemption,
 and preserve the fairness otherwise.


 I also hope being little unfair, does not affect the original intention
 of introducing ticket spinlocks too.
 Some discussions were here long back in this thead,
 https://lkml.org/lkml/2010/6/3/331


Good point. I also wonder why we should not simply use an unfair lock
in a virtualized environment, and whether fairness is really a big issue.
However, given that the current kernel uses ticket locks, I assume
fairness is a necessary spinlock feature.

Regarding the fairness of preemptable-lock, I did a user-space experiment
using 8 pCPUs competing on one spinlock, and counted the lock acquisitions
per thread. Results show that the acquisition counts are *almost* evenly
distributed between threads with preemptable-lock.
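
Roughly the kind of harness used (a simplified sketch, not the actual test
code; a plain pthread spinlock stands in for the lock under test):

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NTHREADS 8

static pthread_spinlock_t lock;
static volatile int stop;
static long counts[NTHREADS];

static void *worker(void *arg)
{
	long id = (long)arg;

	while (!stop) {
		pthread_spin_lock(&lock);
		counts[id]++;			/* stand-in critical section */
		pthread_spin_unlock(&lock);
	}
	return NULL;
}

int main(void)
{
	pthread_t t[NTHREADS];

	pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
	for (long i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, worker, (void *)i);
	sleep(5);				/* fixed measurement window */
	stop = 1;
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(t[i], NULL);
	for (int i = 0; i < NTHREADS; i++)
		printf("thread %d: %ld acquisitions\n", i, counts[i]);
	return 0;
}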


 AFAIU, the experiments are on non PLE machines and it would be worth
 experimenting on PLE machines too. and also bigger machines.
 (we may get some surprises there otherwise).
 'll wait for your next iteration of the patches with using lower bit
 changes.


Yes, they are on PLE machines. The current implementation and evaluation
are still at the proof-of-concept stage. More experiments (with PLE,
bigger machines, etc.) are needed to better understand the lock behavior.

 +#define TIMEOUT_UNIT (1<<14)


 This value seem to be at the higher end. But I hope you have experimented
 enough to come up with this. Better again to test all these tunables?? on
 PLE machines too.


I actually didn't tune this parameter at all... But yes, finding a better
value would be necessary if this were production code.

   static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
   {
  	register struct __raw_tickets inc = { .tail = 1 };
  +	unsigned int timeout = 0;
  +	__ticket_t current_head;

  	inc = xadd(&lock->tickets, inc);
  -
  +	if (likely(inc.head == inc.tail))
  +		goto spin;
  +
  +	timeout = TIMEOUT_UNIT * (inc.tail - inc.head);
  +	do {
  +		current_head = ACCESS_ONCE(lock->tickets.head);
  +		if (inc.tail <= current_head) {
  +			goto spin;
  +		} else if (inc.head != current_head) {
  +			inc.head = current_head;
  +			timeout = TIMEOUT_UNIT * (inc.tail - inc.head);


 Good idea indeed to base the loop on head and tail difference.. But for
 virtualization I believe this directly proportional notion is little
 tricky too.


Could you explain your concern a little bit more?

Thanks
--Jiannan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


suggesting wording fixes for virtio-spec 0.9.5

2013-04-22 Thread Laszlo Ersek
Hi,

(I'm not subscribed to either list,)

using the word descriptor is misleading in the following sections:

  2.4.1.2 Updating The Available Ring

  [...] However, in general we can add many descriptors before we
  update the idx field (at which point they become visible to the
  device), so we keep a counter of how many we've added: [...]

and

  2.4.1.3 Updating The Index Field

  Once the idx field of the virtqueue is updated, the device will be
  able to access the descriptor entries we've created and the memory
  they refer to. [...]

(The word descriptor in the above language is the reason I
mis-implemented the virtio-blk guest driver in OVMF.)

In fact the available ring tracks *head* descriptors only. I suggest

  s/many descriptors/many separate descriptor chains/
  s/descriptor entries/separate descriptor chains/

for the above.

Similarly, 2.3.4 Available Ring should start with something like

  The available ring describes what descriptor chains we are offering
  the device: each entry of the available ring refers to the head
  descriptor of a separate descriptor chain.
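
To make the head-only point concrete, a small sketch (layout as in the
spec / linux/virtio_ring.h; the helper name and barrier are simplified
for illustration):

#include <stdint.h>

struct vring_avail {
	uint16_t flags;
	uint16_t idx;      /* only ever incremented by the driver          */
	uint16_t ring[];   /* each entry: index of a HEAD descriptor       */
};

/* Offer one descriptor chain whose head descriptor index is 'head'. */
static void vring_avail_push(struct vring_avail *avail, uint16_t qsz,
			     uint16_t head)
{
	avail->ring[avail->idx % qsz] = head;	/* publish the head index only  */
	__sync_synchronize();			/* entry visible before idx bump */
	avail->idx++;				/* device may now use the chain  */
}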

Thanks
Laszlo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: virtio-net mq vq initialization

2013-04-22 Thread Sasha Levin
On 04/15/2013 01:58 AM, Jason Wang wrote:
 Initializing them only when they're actually needed will do the trick here.
 I don't see so much memory allocation with qemu, the main problem I
 guess here is kvmtool does not support mergeable rx buffers ( does it?
 ). So guest must allocate 64K per packet which would be even worse in
 multiqueue case. If mergeable rx buffer was used, the receive buffer
 will only occupy about 256 * 4K (1M) which seems pretty acceptable here.

Yup, kvmtool doesn't do mergable RX at the moment.

I'll add support for that in, thanks for pointing it out!


Thanks,
Sasha
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Preemptable Ticket Spinlock

2013-04-22 Thread Peter Zijlstra
On Mon, 2013-04-22 at 08:52 -0400, Rik van Riel wrote:
 On 04/22/2013 07:51 AM, Peter Zijlstra wrote:
  On Sun, 2013-04-21 at 17:12 -0400, Rik van Riel wrote:
 
  If we always incremented the ticket number by 2 (instead of 1), then
  we could use the lower bit of the ticket number as the spinlock.
 
  ISTR that paravirt ticket locks already do that and use the lsb to
  indicate the unlock needs to perform wakeups.
 
  Also, since all of this is virt nonsense, shouldn't it live in the
  paravirt ticket lock code and leave the native code as is?
 
 Sure, but that is still no reason not to have the virt
 implementation be as fast as possible, and share the same
 data type as the non-virt implementation.

It has to share the same data-type..

 Also, is it guaranteed that the native spin_lock code has
 not been called yet before we switch over to the paravirt
 functions?
 
 If the native spin_lock code has been called already at
 that time, the native code would still need to be modified
 to increment the ticket number by 2, so we end up with a
 compatible value in each spin lock's .tickets field, and
 prevent a deadlock after we switch over to the paravirt
 variant.

I thought the stuff already made it upstream, but apparently not; the
lastest posting I'm aware of is here:

  https://lkml.org/lkml/2012/5/2/105

That stuff changes the normal ticket increment as well.. 

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Preemptable Ticket Spinlock

2013-04-22 Thread Rik van Riel

On 04/22/2013 03:49 PM, Peter Zijlstra wrote:

On Mon, 2013-04-22 at 08:52 -0400, Rik van Riel wrote:



If the native spin_lock code has been called already at
that time, the native code would still need to be modified
to increment the ticket number by 2, so we end up with a
compatible value in each spin lock's .tickets field, and
prevent a deadlock after we switch over to the paravirt
variant.


I thought the stuff already made it upstream, but apparently not; the
lastest posting I'm aware of is here:

   https://lkml.org/lkml/2012/5/2/105

That stuff changes the normal ticket increment as well..


Jiannan,

It looks like the patch above could make a good patch
1 (or 2) in your patch series :)

--
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Preemptable Ticket Spinlock

2013-04-22 Thread Jiannan Ouyang
On Mon, Apr 22, 2013 at 3:56 PM, Rik van Riel r...@redhat.com wrote:
 On 04/22/2013 03:49 PM, Peter Zijlstra wrote:

 On Mon, 2013-04-22 at 08:52 -0400, Rik van Riel wrote:


 If the native spin_lock code has been called already at
 that time, the native code would still need to be modified
 to increment the ticket number by 2, so we end up with a
 compatible value in each spin lock's .tickets field, and
 prevent a deadlock after we switch over to the paravirt
 variant.


 I thought the stuff already made it upstream, but apparently not; the
 lastest posting I'm aware of is here:

https://lkml.org/lkml/2012/5/2/105

 That stuff changes the normal ticket increment as well..


 Jiannan,

 It looks like the patch above could make a good patch
 1 (or 2) in your patch series :)

 --
 All rights reversed

Yes.
I'm going to move my code, updated with Rik's suggestions, to paravirt
ops based on Jeremy's patch.
I'll post a new patch series soon.

Thanks to everyone for the great feedback!
--Jiannan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Preemptable Ticket Spinlock

2013-04-22 Thread Peter Zijlstra
On Mon, 2013-04-22 at 15:56 -0400, Rik van Riel wrote:
 On 04/22/2013 03:49 PM, Peter Zijlstra wrote:
  On Mon, 2013-04-22 at 08:52 -0400, Rik van Riel wrote:
 
  If the native spin_lock code has been called already at
  that time, the native code would still need to be modified
  to increment the ticket number by 2, so we end up with a
  compatible value in each spin lock's .tickets field, and
  prevent a deadlock after we switch over to the paravirt
  variant.
 
  I thought the stuff already made it upstream, but apparently not; the
  lastest posting I'm aware of is here:
 
 https://lkml.org/lkml/2012/5/2/105
 
  That stuff changes the normal ticket increment as well..
 
 Jiannan,
 
 It looks like the patch above could make a good patch
 1 (or 2) in your patch series :)

I much prefer the entire series from Jeremy since it maintains the
ticket semantics and doesn't degrade the lock to unfair under
contention.

Now I suppose there's a reason it's not been merged yet, and I suspect
it's the !paravirt hotpath impact, which wasn't rightly justified or somesuch,
so maybe someone can work on that or so.. dunno.


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Preemptable Ticket Spinlock

2013-04-22 Thread Rik van Riel

On 04/22/2013 04:08 PM, Peter Zijlstra wrote:

On Mon, 2013-04-22 at 15:56 -0400, Rik van Riel wrote:

On 04/22/2013 03:49 PM, Peter Zijlstra wrote:

On Mon, 2013-04-22 at 08:52 -0400, Rik van Riel wrote:



If the native spin_lock code has been called already at
that time, the native code would still need to be modified
to increment the ticket number by 2, so we end up with a
compatible value in each spin lock's .tickets field, and
prevent a deadlock after we switch over to the paravirt
variant.


I thought the stuff already made it upstream, but apparently not; the
lastest posting I'm aware of is here:

https://lkml.org/lkml/2012/5/2/105

That stuff changes the normal ticket increment as well..


Jiannan,

It looks like the patch above could make a good patch
1 (or 2) in your patch series :)


I much prefer the entire series from Jeremy since it maintains the
ticket semantics and doesn't degrade the lock to unfair under
contention.

Now I suppose there's a reason its not been merged yet and I suspect
its !paravirt hotpath impact which wasn't rightly justified or somesuch
so maybe someone can work on that or so.. dunno.


IIRC one of the reasons was that the performance improvement wasn't
as obvious.  Rescheduling VCPUs takes a fair amount of time, quite
probably more than the typical hold time of a spinlock.

--
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Preemptable Ticket Spinlock

2013-04-22 Thread Peter Zijlstra
On Mon, 2013-04-22 at 16:32 -0400, Rik van Riel wrote:
 
 IIRC one of the reasons was that the performance improvement wasn't
 as obvious.  Rescheduling VCPUs takes a fair amount of time, quite
 probably more than the typical hold time of a spinlock.

IIRC it would spin for a while before blocking..

/me goes re-read some of that thread...

Ah, its because PLE is curing most of it.. !PLE it had huge gains but
apparently nobody cares about !PLE hardware anymore :-)

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Preemptable Ticket Spinlock

2013-04-22 Thread Jiannan Ouyang
On Mon, Apr 22, 2013 at 4:08 PM, Peter Zijlstra pet...@infradead.org wrote:


 I much prefer the entire series from Jeremy since it maintains the
 ticket semantics and doesn't degrade the lock to unfair under
 contention.

 Now I suppose there's a reason its not been merged yet and I suspect
 its !paravirt hotpath impact which wasn't rightly justified or somesuch
 so maybe someone can work on that or so.. dunno.



In my paper, I compared preemptable-lock and the pv_lock on KVM from
Raghu and Jeremy.
Results show that:
- preemptable-lock improves performance significantly without paravirt support
- preemptable-lock can also be paravirtualized, which outperforms
pv_lock, especially when overcommitted by 3 or more
- pv-preemptable-lock has much less performance variance compared to
pv_lock, because it adapts to preemption within the VM,
  rather than using rescheduling, which increases VM interference

It would still be very interesting to conduct more experiments to
compare these two, to see if the fairness enforced by pv_lock is
mandatory, if preemptable-lock outperforms pv_lock in most cases,
and how they work with PLE.

--Jiannan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Preemptable Ticket Spinlock

2013-04-22 Thread Peter Zijlstra
On Mon, 2013-04-22 at 22:44 +0200, Peter Zijlstra wrote:
 On Mon, 2013-04-22 at 16:32 -0400, Rik van Riel wrote:
  
  IIRC one of the reasons was that the performance improvement wasn't
  as obvious.  Rescheduling VCPUs takes a fair amount of time, quite
  probably more than the typical hold time of a spinlock.
 
 IIRC it would spin for a while before blocking..
 
 /me goes re-read some of that thread...
 
 Ah, its because PLE is curing most of it.. !PLE it had huge gains but
 apparently nobody cares about !PLE hardware anymore :-)

Hmm.. it looked like under light overcommit the paravirt ticket lock
still had some gain (~10%) and of course it brings the fairness thing
which is always good.

I can only imagine the mess unfair + vcpu preemption can bring to guest
tasks.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Preemptable Ticket Spinlock

2013-04-22 Thread Rik van Riel

On 04/22/2013 04:46 PM, Jiannan Ouyang wrote:


It would still be very interesting to conduct more experiments to
compare these two, to see if the fairness enforced by pv_lock is
mandatory, and if preemptable-lock outperforms pv_lock in most cases,
and how do they work with PLE.


Given the fairly high cost of rescheduling a VCPU (which is likely
to include an IPI), versus the short hold time of most spinlocks,
I have the strong suspicion that your approach would win.

The fairness is only compromised in a limited way and in certain
circumstances, so I am not too worried about that.

--
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Preemptable Ticket Spinlock

2013-04-22 Thread Jiannan Ouyang
On Mon, Apr 22, 2013 at 4:44 PM, Peter Zijlstra pet...@infradead.org wrote:
 On Mon, 2013-04-22 at 16:32 -0400, Rik van Riel wrote:

 IIRC one of the reasons was that the performance improvement wasn't
 as obvious.  Rescheduling VCPUs takes a fair amount of time, quite
 probably more than the typical hold time of a spinlock.

 IIRC it would spin for a while before blocking..

 /me goes re-read some of that thread...

 Ah, its because PLE is curing most of it.. !PLE it had huge gains but
 apparently nobody cares about !PLE hardware anymore :-)


For now, I don't know how well it works with PLE. But I think it
should save the VMEXIT time on PLE machines.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Preemptable Ticket Spinlock

2013-04-22 Thread Rik van Riel

On 04/22/2013 04:48 PM, Peter Zijlstra wrote:


Hmm.. it looked like under light overcommit the paravirt ticket lock
still had some gain (~10%) and of course it brings the fairness thing
which is always good.

I can only imagine the mess unfair + vcpu preemption can bring to guest
tasks.


If you think unfairness + vcpu preemption is bad, you haven't
tried full fairness + vcpu preemption :)

--
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Preemptable Ticket Spinlock

2013-04-22 Thread Chegu Vinod

On 4/22/2013 1:50 PM, Jiannan Ouyang wrote:

On Mon, Apr 22, 2013 at 4:44 PM, Peter Zijlstra pet...@infradead.org wrote:

On Mon, 2013-04-22 at 16:32 -0400, Rik van Riel wrote:

IIRC one of the reasons was that the performance improvement wasn't
as obvious.  Rescheduling VCPUs takes a fair amount of time, quite
probably more than the typical hold time of a spinlock.

IIRC it would spin for a while before blocking..

/me goes re-read some of that thread...

Ah, its because PLE is curing most of it.. !PLE it had huge gains but
apparently nobody cares about !PLE hardware anymore :-)


For now, I don't know how good it can work with PLE. But I think it
should save the time of VMEXIT on PLE machine.
.

Thanks for sharing your patch. I am waiting for your v2 patch(es) and will
then send you any review feedback. Hoping to verify your changes on a
large box (PLE enabled) and get back to you with some data...


Thanks
Vinod
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Preemptable Ticket Spinlock

2013-04-22 Thread Peter Zijlstra
On Mon, 2013-04-22 at 16:46 -0400, Jiannan Ouyang wrote:
 On Mon, Apr 22, 2013 at 4:08 PM, Peter Zijlstra pet...@infradead.org wrote:
 
 
  I much prefer the entire series from Jeremy since it maintains the
  ticket semantics and doesn't degrade the lock to unfair under
  contention.
 
  Now I suppose there's a reason its not been merged yet and I suspect
  its !paravirt hotpath impact which wasn't rightly justified or somesuch
  so maybe someone can work on that or so.. dunno.
 
 
 
 In my paper, I comparable preemptable-lock and pv_lock on KVM from
 Raghu and Jeremy.

Which pv_lock? The current pv spinlock mess is basically the old unfair
thing. The later patch series I referred to earlier implemented a
paravirt ticket lock, that should perform much better under overcommit.

 Results show that:
 - preemptable-lock improves performance significantly without paravirt support

But completely wrecks our native spinlock implementation so that's not
going to happen of course ;-)

 - preemptable-lock can also be paravirtualized, which outperforms
 pv_lock, especially when overcommited by 3 or more

See above.. 

 - pv-preemptable-lock has much less performance variance compare to
 pv_lock, because it adapts to preemption within  VM,
   other than using rescheduling that increase VM interference

I would say it has a _much_ worse worst case (and thus worse variance)
than the paravirt ticket implementation from Jeremy. While full
paravirt ticket lock results in vcpu scheduling it does maintain
fairness.

If you drop strict fairness you can end up in unbounded starvation
cases and those are very ugly indeed.

 It would still be very interesting to conduct more experiments to
 compare these two, to see if the fairness enforced by pv_lock is
 mandatory, and if preemptable-lock outperforms pv_lock in most cases,
 and how do they work with PLE.

Be more specific, pv_lock as currently upstream is a trainwreck mostly
done because pure ticket spinners and vcpu-preemption are even worse.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Preemptable Ticket Spinlock

2013-04-22 Thread Peter Zijlstra
On Mon, 2013-04-22 at 16:49 -0400, Rik van Riel wrote:
 Given the fairly high cost of rescheduling a VCPU (which is likely
 to include an IPI), versus the short hold time of most spinlocks,
 I have the strong suspicion that your approach would win.

  https://lkml.org/lkml/2012/5/2/101

If you schedule too often your SPIN_THRESHOLD is far too low.

Anyway.. performance can't be that bad, otherwise Jeremy wouldn't have
spent as much time on it as he did.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Preemptable Ticket Spinlock

2013-04-22 Thread Jiannan Ouyang
On Mon, Apr 22, 2013 at 4:55 PM, Peter Zijlstra pet...@infradead.org wrote:


 Which pv_lock? The current pv spinlock mess is basically the old unfair
 thing. The later patch series I referred to earlier implemented a
 paravirt ticket lock, that should perform much better under overcommit.


Yes, it is a paravirt *ticket* spinlock. I got the patch from
Raghavendra K T through email
http://lwn.net/Articles/495597/
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Preemptable Ticket Spinlock

2013-04-22 Thread Andi Kleen
Rik van Riel r...@redhat.com writes:

 If we always incremented the ticket number by 2 (instead of 1), then
 we could use the lower bit of the ticket number as the spinlock.

Spinning on a single bit is very inefficient, as you need to do
try lock in a loop which is very unfriendly to the MESI state protocol.
It's much better to have at least three states and allow
spinning-while-reading-only.

This is typically very visible on systems with 2S.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Preemptable Ticket Spinlock

2013-04-22 Thread Rik van Riel

On 04/22/2013 04:55 PM, Peter Zijlstra wrote:

On Mon, 2013-04-22 at 16:46 -0400, Jiannan Ouyang wrote:



- pv-preemptable-lock has much less performance variance compare to
pv_lock, because it adapts to preemption within  VM,
   other than using rescheduling that increase VM interference


I would say it has a _much_ worse worst case (and thus worse variance)
than the paravirt ticket implementation from Jeremy. While full
paravirt ticket lock results in vcpu scheduling it does maintain
fairness.

If you drop strict fairness you can end up in unbounded starvation
cases and those are very ugly indeed.


If needed, Jiannan's scheme could easily be bounded to prevent
infinite starvation. For example, we could allow only the first
8 CPUs in line to jump the queue.

However, given the way that virtual CPUs get scheduled in and
out all the time, I suspect starvation is not a worry, and we
will not need the additional complexity to deal with it.

You may want to play around with virtualization a bit, to get
a feel for how things work in virt land.

--
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Preemptable Ticket Spinlock

2013-04-22 Thread Rik van Riel

On 04/22/2013 05:56 PM, Andi Kleen wrote:

Rik van Riel r...@redhat.com writes:


If we always incremented the ticket number by 2 (instead of 1), then
we could use the lower bit of the ticket number as the spinlock.


Spinning on a single bit is very inefficient, as you need to do
try lock in a loop which is very unfriendly to the MESI state protocol.
It's much better to have at least three states and allow
spinning-while-reading-only.

This is typically very visible on systems with 2S.


Absolutely, the spinning should be read-only, until the CPU
sees that the desired bit is clear.  MESI-friendly spinning
is essential.
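
For illustration, a test-and-test-and-set sketch of that read-mostly
spinning (a generic kernel-style bit lock, not the actual ticket layout;
with tickets incremented by 2, bit 0 would be the bit spun on here):

static inline void bit_spin_lock_sketch(unsigned long *word)
{
	while (test_and_set_bit_lock(0, word)) {	/* RMW try-lock, taken rarely */
		while (test_bit(0, word))		/* read-only inner spin:      */
			cpu_relax();			/* cache line stays Shared    */
	}
}

static inline void bit_spin_unlock_sketch(unsigned long *word)
{
	clear_bit_unlock(0, word);
}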

--
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 15/17] KVM: PPC: Support irq routing and irqfd for in-kernel MPIC

2013-04-22 Thread Scott Wood

On 04/18/2013 07:15:46 PM, Alexander Graf wrote:


On 18.04.2013, at 23:39, Scott Wood wrote:

 Do we really want any default routes?  There's no platform notion  
of GSI
 here, so how is userspace to know how the kernel set it up (or what  
GSIs
 are free to be used for new routes) -- other than the read the  
code

 answer I got when I asked about x86?  :-P

The default routes really are just expose all pins 1:1 as GSI. I  
think it makes sense to have a simple default for user space that  
doesn't want to mess with irq routing.


What GSIs are free for new routes doesn't matter. Routes are always  
completely rewritten as a whole from user space. So when user space  
goes in and wants to change only a single line it has to lay out the  
full map itself anyway.


It looks like you already write the routes in your QEMU patches, so I'd  
like to avoid adding MPIC default routes in KVM to keep things simple.   
It's legacy baggage from day one.  With default routes, what happens if  
we later support instantiating multiple interrupt controllers?


-Scott
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 00/15] KVM: MMU: fast zap all shadow pages

2013-04-22 Thread Xiao Guangrong
On 04/22/2013 05:21 PM, Gleb Natapov wrote:
 On Sun, Apr 21, 2013 at 10:09:29PM +0800, Xiao Guangrong wrote:
 On 04/21/2013 09:03 PM, Gleb Natapov wrote:
 On Tue, Apr 16, 2013 at 02:32:38PM +0800, Xiao Guangrong wrote:
 This patchset is based on my previous two patchset:
 [PATCH 0/2] KVM: x86: avoid potential soft lockup and unneeded mmu reload
 (https://lkml.org/lkml/2013/4/1/2)

 [PATCH v2 0/6] KVM: MMU: fast invalid all mmio sptes
 (https://lkml.org/lkml/2013/4/1/134)

 Changlog:
 V3:
   completely redesign the algorithm, please see below.

 This looks pretty complicated. Is it still needed in order to avoid soft
 lockups after avoid potential soft lockup and unneeded mmu reload patch?

 Yes.

 I discussed this point with Marcelo:

 ==
 BTW, to be honest, I do not think spin_needbreak is a good way - it does
 not fix the hot-lock contention and it just occupies more cpu time to avoid
 possible soft lock-ups.

 Especially, zap-all-shadow-pages can let other vcpus fault and contend for
 mmu-lock; then zap-all-shadow-pages releases mmu-lock and waits, and the
 other vcpus create page tables again. zap-all-shadow-pages needs a long time
 to finish; in the worst case it can never complete under intensive vcpu and
 memory usage.

 So what about mixed approach: use generation numbers and reload roots to
 quickly invalidate all shadow pages and then do kvm_mmu_zap_all_invalid().
 kvm_mmu_zap_all_invalid() is a new function that invalidates only shadow
 pages with stale generation number (and uses lock break technique). It
 may traverse active_mmu_pages from tail to head since new shadow pages
 will be added to the head of the list or it may use invalid slot rmap to
 find exactly what should be invalidated.

I prefer unmapping the invalid rmap instead of zapping stale shadow pages
in kvm_mmu_zap_all_invalid(); the former is faster.

This way may help but is still not ideal: after reloading the mmu with the
new generation number, all of the vcpus will be faulting for a long time,
and contending for mmu-lock is not good even with the lock-break technique.

I think we can do this step first, then unmap the invalid rmap outside of
mmu-lock later.
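
Just to illustrate the generation-number step being discussed (a rough
sketch; the mmu_valid_gen field names are assumptions for the example,
not actual code):

static void kvm_mmu_invalidate_all_pages_sketch(struct kvm *kvm)
{
	spin_lock(&kvm->mmu_lock);
	kvm->arch.mmu_valid_gen++;	/* every existing sp becomes stale    */
	kvm_reload_remote_mmus(kvm);	/* vcpus drop their roots and refault */
	spin_unlock(&kvm->mmu_lock);
}

/* Later, with lock breaking, only stale pages need to be zapped: */
static bool sp_is_obsolete_sketch(struct kvm *kvm, struct kvm_mmu_page *sp)
{
	return sp->mmu_valid_gen != kvm->arch.mmu_valid_gen;
}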

 
 I still think the right way to fix this kind of thing is optimization for
 mmu-lock.
 ==

 Which parts scare you? Let's find a way to optimize for it. ;). For example,
 if you do not like unmap_memslot_rmap_nolock(), we can simplify it - We can
 use walk_shadow_page_lockless_begin() and walk_shadow_page_lockless_end() to
 protect spte instead of kvm->being_unmaped_rmap.

 
 kvm->being_unmaped_rmap is particularly tricky, although looks

Okay. Will use walk_shadow_page_lockless_begin() and
walk_shadow_page_lockless_end() instead.

 correct. Additional indirection with rmap ops also does not help following
 the code. I'd rather have if(slot is invalid) in a couple of places where
 things should be done differently. In most places it will be WARN_ON(slot
 is invalid).

Less change, good to me, will do. ;)

Thanks!

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] kvm tools: virtio-net mergable rx buffers

2013-04-22 Thread Sasha Levin
Support mergable rx buffers for virtio-net. This helps reduce the amount
of memory the guest kernel has to allocate per rx vq.

Signed-off-by: Sasha Levin sasha.le...@oracle.com
---
 tools/kvm/include/kvm/uip.h  |  4 ++--
 tools/kvm/include/kvm/util.h |  3 +++
 tools/kvm/net/uip/core.c | 54 +---
 tools/kvm/net/uip/tcp.c  |  2 +-
 tools/kvm/net/uip/udp.c  |  2 +-
 tools/kvm/util/util.c| 15 
 tools/kvm/virtio/net.c   | 37 ++
 7 files changed, 55 insertions(+), 62 deletions(-)

diff --git a/tools/kvm/include/kvm/uip.h b/tools/kvm/include/kvm/uip.h
index ac248d2..fa82f10 100644
--- a/tools/kvm/include/kvm/uip.h
+++ b/tools/kvm/include/kvm/uip.h
@@ -252,7 +252,7 @@ struct uip_tcp_socket {
 };
 
 struct uip_tx_arg {
-   struct virtio_net_hdr *vnet;
+   struct virtio_net_hdr_mrg_rxbuf *vnet;
struct uip_info *info;
struct uip_eth *eth;
int vnet_len;
@@ -332,7 +332,7 @@ static inline u16 uip_eth_hdrlen(struct uip_eth *eth)
 }
 
 int uip_tx(struct iovec *iov, u16 out, struct uip_info *info);
-int uip_rx(struct iovec *iov, u16 in, struct uip_info *info);
+int uip_rx(unsigned char *buffer, u32 length, struct uip_info *info);
 int uip_init(struct uip_info *info);
 
 int uip_tx_do_ipv4_udp_dhcp(struct uip_tx_arg *arg);
diff --git a/tools/kvm/include/kvm/util.h b/tools/kvm/include/kvm/util.h
index 0df9f0d..6f8ac83 100644
--- a/tools/kvm/include/kvm/util.h
+++ b/tools/kvm/include/kvm/util.h
@@ -22,6 +22,7 @@
 #include <sys/param.h>
 #include <sys/types.h>
 #include <linux/types.h>
+#include <sys/uio.h>
 
 #ifdef __GNUC__
 #define NORETURN __attribute__((__noreturn__))
@@ -94,4 +95,6 @@ struct kvm;
 void *mmap_hugetlbfs(struct kvm *kvm, const char *htlbfs_path, u64 size);
 void *mmap_anon_or_hugetlbfs(struct kvm *kvm, const char *hugetlbfs_path, u64 
size);
 
+int memcpy_toiovecend(const struct iovec *iov, int iovlen, unsigned char 
*kdata, size_t len);
+
 #endif /* KVM__UTIL_H */
diff --git a/tools/kvm/net/uip/core.c b/tools/kvm/net/uip/core.c
index 4e5bb82..d9e9993 100644
--- a/tools/kvm/net/uip/core.c
+++ b/tools/kvm/net/uip/core.c
@@ -7,7 +7,7 @@
 
 int uip_tx(struct iovec *iov, u16 out, struct uip_info *info)
 {
-   struct virtio_net_hdr *vnet;
+   struct virtio_net_hdr_mrg_rxbuf *vnet;
struct uip_tx_arg arg;
int eth_len, vnet_len;
struct uip_eth *eth;
@@ -74,63 +74,21 @@ int uip_tx(struct iovec *iov, u16 out, struct uip_info 
*info)
return vnet_len + eth_len;
 }
 
-int uip_rx(struct iovec *iov, u16 in, struct uip_info *info)
+int uip_rx(unsigned char *buffer, u32 length, struct uip_info *info)
 {
-   struct virtio_net_hdr *vnet;
-   struct uip_eth *eth;
struct uip_buf *buf;
-   int vnet_len;
-   int eth_len;
-   char *p;
int len;
-   int cnt;
-   int i;
 
/*
 * Sleep until there is a buffer for guest
 */
buf = uip_buf_get_used(info);
 
-   /*
-* Fill device to guest buffer, vnet hdr first
-*/
-   vnet_len = iov[0].iov_len;
-   vnet = iov[0].iov_base;
-   if (buf->vnet_len > vnet_len) {
-   len = -1;
-   goto out;
-   }
-   memcpy(vnet, buf->vnet, buf->vnet_len);
-
-   /*
-* Then, the real eth data
-* Note: Be sure buf->eth_len is not bigger than the buffer len that guest provides
-*/
-   cnt = buf->eth_len;
-   p = buf->eth;
-   for (i = 1; i < in; i++) {
-   eth_len = iov[i].iov_len;
-   eth = iov[i].iov_base;
-   if (cnt > eth_len) {
-   memcpy(eth, p, eth_len);
-   cnt -= eth_len;
-   p += eth_len;
-   } else {
-   memcpy(eth, p, cnt);
-   cnt -= cnt;
-   break;
-   }
-   }
-
-   if (cnt) {
-   pr_warning("uip_rx error");
-   len = -1;
-   goto out;
-   }
+   memcpy(buffer, buf->vnet, buf->vnet_len);
+   memcpy(buffer + buf->vnet_len, buf->eth, buf->eth_len);
 
	len = buf->vnet_len + buf->eth_len;
 
-out:
uip_buf_set_free(info, buf);
return len;
 }
@@ -172,8 +130,8 @@ int uip_init(struct uip_info *info)
}
 
list_for_each_entry(buf, buf_head, list) {
-   buf->vnet   = malloc(sizeof(struct virtio_net_hdr));
-   buf->vnet_len   = sizeof(struct virtio_net_hdr);
+   buf->vnet   = malloc(sizeof(struct virtio_net_hdr_mrg_rxbuf));
+   buf->vnet_len   = sizeof(struct virtio_net_hdr_mrg_rxbuf);
	buf->eth= malloc(1024*64 + sizeof(struct uip_pseudo_hdr));
	buf->eth_len= 1024*64 + sizeof(struct uip_pseudo_hdr);
 
diff --git a/tools/kvm/net/uip/tcp.c b/tools/kvm/net/uip/tcp.c
index 9044f40..a99b95e 100644
--- a/tools/kvm/net/uip/tcp.c
+++ 
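
(For context, the mergeable-rx-buffer header this series switches to is the
standard one from <linux/virtio_net.h>, used when VIRTIO_NET_F_MRG_RXBUF is
negotiated; shown here only for reference:)

struct virtio_net_hdr_mrg_rxbuf {
	struct virtio_net_hdr hdr;
	__u16 num_buffers;	/* how many descriptor chains make up one packet */
};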

[PATCH] virtio-net: fill only rx queues which are being used

2013-04-22 Thread Sasha Levin
Due to MQ support we may allocate a whole bunch of rx queues but
never use them. With this patch we'll save the space used by
the receive buffers until they are actually in use:

sh-4.2# free -h
 total   used   free sharedbuffers cached
Mem:  490M35M   455M 0B 0B   4.1M
-/+ buffers/cache:31M   459M
Swap:   0B 0B 0B
sh-4.2# ethtool -L eth0 combined 8
sh-4.2# free -h
 total   used   free sharedbuffers cached
Mem:  490M   162M   327M 0B 0B   4.1M
-/+ buffers/cache:   158M   331M
Swap:   0B 0B 0B

Signed-off-by: Sasha Levin sasha.le...@oracle.com
---
 drivers/net/virtio_net.c | 16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 6bfc511..4d82d17 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -581,7 +581,7 @@ static void refill_work(struct work_struct *work)
bool still_empty;
int i;
 
-   for (i = 0; i < vi->max_queue_pairs; i++) {
+   for (i = 0; i < vi->curr_queue_pairs; i++) {
		struct receive_queue *rq = &vi->rq[i];
 
		napi_disable(&rq->napi);
@@ -636,7 +636,7 @@ static int virtnet_open(struct net_device *dev)
struct virtnet_info *vi = netdev_priv(dev);
int i;
 
-   for (i = 0; i < vi->max_queue_pairs; i++) {
+   for (i = 0; i < vi->curr_queue_pairs; i++) {
		/* Make sure we have some buffers: if oom use wq. */
		if (!try_fill_recv(&vi->rq[i], GFP_KERNEL))
			schedule_delayed_work(&vi->refill, 0);
@@ -900,6 +900,7 @@ static int virtnet_set_queues(struct virtnet_info *vi, u16 
queue_pairs)
struct scatterlist sg;
struct virtio_net_ctrl_mq s;
	struct net_device *dev = vi->dev;
+   int i;
 
	if (!vi->has_cvq || !virtio_has_feature(vi->vdev, VIRTIO_NET_F_MQ))
		return 0;
@@ -912,8 +913,13 @@ static int virtnet_set_queues(struct virtnet_info *vi, u16 queue_pairs)
	dev_warn(&dev->dev, "Fail to set num of queue pairs to %d\n",
		 queue_pairs);
	return -EINVAL;
-   } else
+   } else {
+	if (queue_pairs > vi->curr_queue_pairs)
+		for (i = 0; i < queue_pairs; i++)
+			if (!try_fill_recv(&vi->rq[i], GFP_KERNEL))
+				schedule_delayed_work(&vi->refill, 0);
	vi->curr_queue_pairs = queue_pairs;
+   }
 
return 0;
 }
@@ -1568,7 +1574,7 @@ static int virtnet_probe(struct virtio_device *vdev)
}
 
/* Last of all, set up some receive buffers. */
-   for (i = 0; i < vi->max_queue_pairs; i++) {
+   for (i = 0; i < vi->curr_queue_pairs; i++) {
	try_fill_recv(&vi->rq[i], GFP_KERNEL);
 
/* If we didn't even get one input buffer, we're useless. */
@@ -1692,7 +1698,7 @@ static int virtnet_restore(struct virtio_device *vdev)
 
	netif_device_attach(vi->dev);
 
-   for (i = 0; i < vi->max_queue_pairs; i++)
+   for (i = 0; i < vi->curr_queue_pairs; i++)
	if (!try_fill_recv(&vi->rq[i], GFP_KERNEL))
		schedule_delayed_work(&vi->refill, 0);
 
-- 
1.8.2.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [Bug 56981] New: [SR-IOV] Intel I350 NIC VF cannot work in a Windows 2008 guest.

2013-04-22 Thread Ren, Yongjie
Added the e1000-devel list to see whether this is a known issue.

Best Regards,
 Yongjie (Jay)


 -Original Message-
 From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org]
 On Behalf Of bugzilla-dae...@bugzilla.kernel.org
 Sent: Monday, April 22, 2013 10:41 PM
 To: kvm@vger.kernel.org
 Subject: [Bug 56981] New: [SR-IOV] Intel I350 NIC VF cannot work in a
 Windows 2008 guest.
 
 https://bugzilla.kernel.org/show_bug.cgi?id=56981
 
Summary: [SR-IOV] Intel I350 NIC VF cannot work in a
 Windows
 2008 guest.
Product: Virtualization
Version: unspecified
 Kernel Version: 3.9.0-rc3
   Platform: All
 OS/Version: Linux
   Tree: Mainline
 Status: NEW
   Severity: normal
   Priority: P1
  Component: kvm
 AssignedTo: virtualization_...@kernel-bugs.osdl.org
 ReportedBy: yongjie@intel.com
 Regression: No
 
 
 Environment:
 
 Host OS (ia32/ia32e/IA64):ia32e
 Guest OS (ia32/ia32e/IA64):ia32e
 Guest OS Type (Linux/Windows):Windows
 kvm.git Commit:79558f112fc0352e057f7b5e158e3d88b8b62c60
 qemu-kvm Commit:8912bdea01e8671e59fe0287314379be9c1f40ec
 Host Kernel Version:3.9.0-rc3
 Hardware: Sandy Bridge - EP
 
 
 Bug detailed description:
 --
 The assigned Intel I350 NIC VF (an igbvf) does not work in a Windows 2008
 guest.
 
 1. If the guest is Windows 7/8/2012 or RHEL6.x, the Intel I350 NIC VF works
 fine in the guest.
 2. Intel 82576 NIC VF (also igbvf) can work in a Windows 2008 guest.
 
 
 Reproduce steps:
 
 1. ./pcistub.sh -h 0b:10.0   ( to hide a device)
 2. qemu-system-x86_64 -enable-kvm -m 4096 -smp 4 -device
 pci-assign,host=0b:10.0 -net none -hda win2k8.img
 
 Current result:
 
 Intel I350 NIC VF cannot work in win2k8 guest
 
 Expected result:
 
 Intel I350 NIC VF can work in win2k8 guest
 
 --
 Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
 --- You are receiving this mail because: ---
 You are watching the assignee of the bug.
 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Preemptable Ticket Spinlock

2013-04-22 Thread Raghavendra K T

On 04/23/2013 01:19 AM, Peter Zijlstra wrote:

On Mon, 2013-04-22 at 08:52 -0400, Rik van Riel wrote:

On 04/22/2013 07:51 AM, Peter Zijlstra wrote:

On Sun, 2013-04-21 at 17:12 -0400, Rik van Riel wrote:


If we always incremented the ticket number by 2 (instead of 1), then
we could use the lower bit of the ticket number as the spinlock.


ISTR that paravirt ticket locks already do that and use the lsb to
indicate the unlock needs to perform wakeups.

Also, since all of this is virt nonsense, shouldn't it live in the
paravirt ticket lock code and leave the native code as is?


Sure, but that is still no reason not to have the virt
implementation be as fast as possible, and share the same
data type as the non-virt implementation.


It has to share the same data-type..


Also, is it guaranteed that the native spin_lock code has
not been called yet before we switch over to the paravirt
functions?

If the native spin_lock code has been called already at
that time, the native code would still need to be modified
to increment the ticket number by 2, so we end up with a
compatible value in each spin lock's .tickets field, and
prevent a deadlock after we switch over to the paravirt
variant.


 I thought the stuff had already made it upstream, but apparently not; the
 latest posting I'm aware of is here:

   https://lkml.org/lkml/2012/5/2/105

That stuff changes the normal ticket increment as well..
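
(For reference, the encoding Peter refers to, tickets taken in steps of two
with the low bit flagging "kick waiters on unlock", looks roughly like this in
the pv-ticketlock series; a sketch, not the exact upstream definitions:)

/* Tickets advance by TICKET_LOCK_INC, so bit 0 is free to mark that the
 * unlock path must kick blocked waiters. */
#define TICKET_LOCK_INC		2
#define TICKET_SLOWPATH_FLAG	((__ticket_t)1)

typedef struct arch_spinlock {
	union {
		__ticketpair_t head_tail;
		struct __raw_tickets {
			__ticket_t head, tail;
		} tickets;
	};
} arch_spinlock_t;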



The pv-ticket spinlock work went on hold after Avi acked it, because:

though on non-PLE machines we get a huge advantage, on PLE machines the
benefit was not as impressive (~10%, as stated in the email chain) compared
to the complexity of the patches.

So Avi suggested trying PLE improvements first, and those are going upstream.

https://lkml.org/lkml/2012/7/18/247
https://lkml.org/lkml/2013/1/22/104
https://lkml.org/lkml/2013/2/6/345 (on the way in kvm tree)

Current status of PV spinlock:
I have the rebased patches of pv spinlocks and experimenting with latest 
kernel.I have
Gleb's irq delivery incorporated into the patch series. But I am 
thinknig whether I can

improve some guest side logic in unlock.
I will probably setup a githup and post the link soon.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Preemptable Ticket Spinlock

2013-04-22 Thread Raghavendra K T

On 04/22/2013 10:12 PM, Jiannan Ouyang wrote:

On Mon, Apr 22, 2013 at 1:58 AM, Raghavendra K T
raghavendra...@linux.vnet.ibm.com wrote:


[...]


   static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
   {
  register struct __raw_tickets inc = { .tail = 1 };
+   unsigned int timeout = 0;
+   __ticket_t current_head;

   inc = xadd(&lock->tickets, inc);
-
+   if (likely(inc.head == inc.tail))
+   goto spin;
+
+   timeout =  TIMEOUT_UNIT * (inc.tail - inc.head);


Forgot to mention this: for the immediate-wait case,
you can busy-loop instead of using a timeout (I mean

timeout =  TIMEOUT_UNIT * (inc.tail - inc.head - 1);

This idea was used by Rik in his spinlock backoff patches.


+   do {
+   current_head = ACCESS_ONCE(lock->tickets.head);
+   if (inc.tail <= current_head) {
+   goto spin;
+   } else if (inc.head != current_head) {
+   inc.head = current_head;
+   timeout =  TIMEOUT_UNIT * (inc.tail - inc.head);



It is indeed a good idea to base the loop on the head/tail difference. But for
virtualization I believe this directly proportional notion is a little
tricky too.



Could you explain your concern a little bit more?



Consider a big machine with 2 VMs running.
If the nth vcpu of, say, VM1 is waiting in the queue, the question is:

do we have to have all n VCPUs busy-looping and thus burning
sigma(n*(n+1) * TIMEOUT_UNIT) cycles?

OR

is a vcpu far back in the queue worth giving its time back, so that
some other vcpu of VM1 doing useful work, or a vcpu of VM2, can
benefit from it?


I mean, the farther back a vcpu is in the queue, let it yield voluntarily
(an inversely proportional notion, just because it is a vcpu). And of course
for some n < THRESHOLD we can still keep the directly proportional wait idea.


Does this idea sound good?
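
(A rough sketch of the inverse-proportional idea above, with made-up names and
an illustrative threshold: a vcpu near the head of the queue busy-waits
proportionally, while one far back gives the physical cpu away instead of
burning it. hypervisor_yield() is a placeholder, not a real API.)

#define TIMEOUT_UNIT		(1 << 10)
#define QUEUE_YIELD_THRESHOLD	4		/* illustrative value */

static __always_inline void wait_for_ticket(arch_spinlock_t *lock,
					    struct __raw_tickets inc)
{
	for (;;) {
		__ticket_t head = ACCESS_ONCE(lock->tickets.head);
		unsigned int depth = (__ticket_t)(inc.tail - head);
		unsigned int spin;

		if (!depth)
			return;			/* our turn */

		if (depth < QUEUE_YIELD_THRESHOLD) {
			/* Near the head: directly proportional busy-wait. */
			for (spin = TIMEOUT_UNIT * depth; spin; spin--)
				cpu_relax();
		} else {
			/* Far from the head: yield to the hypervisor,
			 * e.g. via a halt/yield hypercall. */
			hypervisor_yield(lock, inc.tail);
		}
	}
}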

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 00/15] KVM: MMU: fast zap all shadow pages

2013-04-22 Thread Marcelo Tosatti
On Mon, Apr 22, 2013 at 10:45:53PM +0900, Takuya Yoshikawa wrote:
 On Mon, 22 Apr 2013 15:39:38 +0300
 Gleb Natapov g...@redhat.com wrote:
 
 Do not want kvm_set_memory (cases: DELETE/MOVE/CREATES) to be
 susceptible to:

vcpu 1  |   kvm_set_memory
create shadow page  
nuke shadow page
create shadow page
nuke shadow page

Which is guest triggerable behavior with spinlock preemption algorithm.
   
   Not only guest-triggerable in the sense of a malicious guest; the
   condition above can also be induced by host workload with a non-malicious
   guest system.
   
  Is the problem that newly created shadow pages are immediately zapped?
  Shouldn't generation number/kvm_mmu_zap_all_invalid() idea described here
  https://lkml.org/lkml/2013/4/22/111 solve this?
 
 I guess so.  That's what Avi described when he tried to achieve
 lockless TLB flushes.  Mixing that idea with Xiao's approach will
 achieve reasonably nice performance, I think.

Yes.

 Various improvements should be added later on top of that if needed.
 
   Also, kvm_set_memory being relatively fast with huge-memory guests
   is nice (which is what Xiao's idea allows).
 
 I agree with this point.  But if so, it should be actually measured on
 such guests, even if the algorithm looks promising.

Works for me.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] KVM: ARM: Fix API documentation for ONE_REG encoding

2013-04-22 Thread Christoffer Dall
Unless I'm mistaken, the size field was encoded 4 bits off and a wrong
value was used for 64-bit FP registers.

Signed-off-by: Christoffer Dall cd...@cs.columbia.edu
---
 Documentation/virtual/kvm/api.txt |   12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index 976eb65..7145ee9 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1797,22 +1797,22 @@ ARM registers are mapped using the lower 32 bits.  The 
upper 16 of that
 is the register group type, or coprocessor number:
 
 ARM core registers have the following id bit patterns:
-  0x4002 0000 0010 <index into the kvm_regs struct:16>
+  0x4020 0000 0010 <index into the kvm_regs struct:16>
 
 ARM 32-bit CP15 registers have the following id bit patterns:
-  0x4002 0000 000F <zero:1> <crn:4> <crm:4> <opc1:4> <opc2:3>
+  0x4020 0000 000F <zero:1> <crn:4> <crm:4> <opc1:4> <opc2:3>
 
 ARM 64-bit CP15 registers have the following id bit patterns:
-  0x4003 0000 000F <zero:1> <zero:4> <crm:4> <opc1:4> <zero:3>
+  0x4030 0000 000F <zero:1> <zero:4> <crm:4> <opc1:4> <zero:3>
 
 ARM CCSIDR registers are demultiplexed by CSSELR value:
-  0x4002 0000 0011 00 <csselr:8>
+  0x4020 0000 0011 00 <csselr:8>
 
 ARM 32-bit VFP control registers have the following id bit patterns:
-  0x4002 0000 0012 1 <regno:12>
+  0x4020 0000 0012 1 <regno:12>
 
 ARM 64-bit FP registers have the following id bit patterns:
-  0x4002 0000 0012 0 <regno:12>
+  0x4030 0000 0012 0 <regno:12>
 
 4.69 KVM_GET_ONE_REG
 
-- 
1.7.10.4
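
(For readers decoding these patterns, a sketch of how a 32-bit core-register
id is assembled; the constants are assumed from the uapi headers of that era,
so double-check <linux/kvm.h> and arch/arm/include/uapi/asm/kvm.h:)

#include <stdint.h>

/* KVM_REG_ARM | KVM_REG_SIZE_U32 | KVM_REG_ARM_CORE | index
 * = 0x4020 0000 0010 0000 | index, matching the corrected pattern above. */
static inline uint64_t arm_core_reg_id(uint16_t index)
{
	return 0x4000000000000000ULL	/* KVM_REG_ARM: arch field      */
	     | 0x0020000000000000ULL	/* KVM_REG_SIZE_U32: bits 52-55 */
	     | (0x0010ULL << 16)	/* KVM_REG_ARM_CORE: group      */
	     | index;
}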

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: KVM: kvm_set_slave_cpu: Invalid argument when trying direct interrupt delivery

2013-04-22 Thread Yangminqiang
HI Tomoki,

Thanks for your config file, but it is for linux-3.5-rc4, while the patches you
posted to the community were based on linux-3.6, as described in the following
link:
  http://thread.gmane.org/gmane.linux.kernel/1353803
I also tested the config file on linux-3.6; it still does not work.

Could you please provide the details of your setup, for example:
- kernel version and config
- the version and url of your patch
- qemu command line
- other info
so that I can try and reproduce your benchmark :-).

I think your patch would be a good workaround for realtime and
performance improvement on hardware which does not support apicv.

BTW, I noticed the benchmark result in your slides, which shows
the throughput of a 10Gbps NIC exceeding 10Gbps. How did that happen?

My benchmark also shows an interesting data point: the throughput of a pass-
through device is sometimes higher than bare metal. (I use NetPIPE)

Thanks a lot.

Yang Minqiang

 -Original Message-
 From: Tomoki Sekiyama [mailto:tomoki.sekiy...@hds.com]
 Sent: Thursday, April 11, 2013 4:37 AM
 To: Yangminqiang
 Cc: kvm@vger.kernel.org; Haofeng; Luohao (brian)
 Subject: Re: KVM: kvm_set_slave_cpu: Invalid argument when trying direct
 interrupt delivery
 
 Hi,
 
 
 On 4/7/13 6:09 , Yangminqiang yangminqi...@huawei.com wrote:
 Hi Tomoki,
 
 I offlined cpu2 and cpu3 on my machine and continued to try your patch.
  I run the vm without a pass-through device since I only want to know the
 interrupt latency improvement. (Am I right?)
 
 This should be OK.
 
 my qemu parameter:
 ./x86_64-softmmu/qemu-system-x86_64 -enable-kvm -m 1024 -cpu
 qemu64,+x2apic -no-kvm-pit -serial pty -nographic -drive
 file=/mnt/sdb/vms/testfc/testfc.qcow2,if=virtio,index=0,format=qcow2
 -spice
 port=12000,addr=186.100.8.171,disable-ticketing,plaintext-channel=main,
 pla
 intext-channel=playback,plaintext-channel=record,image-compression=aut
 o
 _gl
 z -no-kvm-pit
 
 Using video (-spice) could cause a large latency, because non-passthrough
 devices cause larger latency with this patch.
 Please use the serial console (but avoid serial I/O while benchmarking,
 because serial I/O also goes through a non-passthrough device).
 
 You may need to turn off CPU low-power features, or run some infinite
 loop (e.g. yes > /dev/null) to keep the cpu core from going into power-saving
 mode.
 
 
 cyclictest:
cyclictest -m -p 99 -n -l 10 -h 3000 -q
 
 
 but I got a very bad result:
   avg latency: 2+ us
   max latency: 5+ us
 
 and got
 
 Message from syslogd@kvmsteven at Apr  7 05:43:30 ...
  kernel:[ 2201.151817] BUG: soft lockup - CPU#18 stuck for 22s!
 [qemu-system-x86:2365]
 
 This patch is not yet tested with various kernel CONFIG_* settings, and some
 configs may cause issues like this...
 I was using Fedora-based config (attached). Could you try with this?
 
 my setup:
 host kernel: 3.6.0-rc4+ and your patches guest kernel: 3.6.11.1-rt32
 qemu: qemu-kvm-1.0 with your patch
 
 BTW, I am sure that my rt-kernel works well, which got 12us max latency
 as a host OS.
 
 Could you please provide me more detail about your benchmark so I
 could reproduce your benchmark result?
 
 Thanks,
 Steven



Re: [PATCH] virtio-net: fill only rx queues which are being used

2013-04-22 Thread Rusty Russell
Sasha Levin sasha.le...@oracle.com writes:
 Due to MQ support we may allocate a whole bunch of rx queues but
  never use them. With this patch we'll save the space used by
  the receive buffers until they are actually in use:

Idea is good, implementation needs a tiny tweak:

 @@ -912,8 +913,13 @@ static int virtnet_set_queues(struct virtnet_info *vi, 
 u16 queue_pairs)
   dev_warn(&dev->dev, "Fail to set num of queue pairs to %d\n",
 queue_pairs);
   return -EINVAL;
  - } else
  + } else {
  + if (queue_pairs > vi->curr_queue_pairs)
  + for (i = 0; i < queue_pairs; i++)
  + if (!try_fill_recv(&vi->rq[i], GFP_KERNEL))
  + schedule_delayed_work(&vi->refill, 0);
    vi->curr_queue_pairs = queue_pairs;
 + }
  
   return 0;
  }

You don't want to refill existing queues, so you don't need the if.

for (i = vi->curr_queue_pairs; i < queue_pairs; i++) {
	if (!try_fill_recv(&vi->rq[i], GFP_KERNEL))
		schedule_delayed_work(&vi->refill, 0);

We don't free up buffers when we're reducing queues, but I consider that
a corner case.
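
(Folding that suggestion back in, the else branch of virtnet_set_queues()
would presumably end up looking something like this sketch:)

	} else {
		/* Only the queues newly brought into use need filling. */
		for (i = vi->curr_queue_pairs; i < queue_pairs; i++)
			if (!try_fill_recv(&vi->rq[i], GFP_KERNEL))
				schedule_delayed_work(&vi->refill, 0);
		vi->curr_queue_pairs = queue_pairs;
	}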

Thanks,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v7 3/3] tcm_vhost: Add hotplug/hotunplug support

2013-04-22 Thread Rusty Russell
Asias He as...@redhat.com writes:
 On Mon, Apr 22, 2013 at 04:17:04PM +0300, Michael S. Tsirkin wrote:
+  evt = kzalloc(sizeof(*evt), GFP_KERNEL);
   
   I think kzalloc not needed here, you init all fields.
  
   Not really! evt->event.lun[4-7] is not initialized. It needs to be
   0.
 
 So that is 4 bytes just init them when you set rest of lun.

It is not in the fast path. You can do it this way but not a must.
   
   I think that's a bit cleaner than relying on kzalloc to zero-initialize.
  
  I think it is a bit shorter, 4 lines saved. 
 
 OTOH you don't have to think "what about the rest of the bytes?" then
 hunt through the code and go "ah, it's kzalloc".
 Looks like it's a style issue but I prefer explicit initialization.

 It just makes no sense to argue about this. If you really want, you
 can make patches to remove all the lines where kzalloc is used because it is
 cleaner.

I prefer explicit initialization too, FWIW.

+static void tcm_vhost_do_evt_work(struct vhost_scsi *vs,
+  struct tcm_vhost_evt *evt)
+{
+  struct vhost_virtqueue *vq = &vs->vqs[VHOST_SCSI_VQ_EVT];
+  struct virtio_scsi_event *event = &evt->event;
+  struct virtio_scsi_event __user *eventp;
+  unsigned out, in;
+  int head, ret;
+
+  if (!tcm_vhost_check_endpoint(vq))
+  return;
+
+  mutex_lock(&vs->vs_events_lock);
+  mutex_lock(&vq->mutex);
+again:
+  vhost_disable_notify(&vs->dev, vq);
+  head = vhost_get_vq_desc(&vs->dev, vq, vq->iov,
+  ARRAY_SIZE(vq->iov), &out, &in,
+  NULL, NULL);
+  if (head < 0) {
+  vs->vs_events_dropped = true;
+  goto out;
+  }
+  if (head == vq->num) {
+  if (vhost_enable_notify(&vs->dev, vq))
+  goto again;
   
   Please write loops with while() or for().
   Not with goto. goto is for error handling.
  
  This makes extra indention which is more ugly.
 
 I don't care. No loops with goto and that's a hard rule.

It is not a loop.
   
   If same code repeats, it's a loop.
  
   What we really need here is to call vhost_get_vq_desc once. It's not a loop
   like the other places that use while() or for() with vhost_get_vq_desc.
 
 Exactly. Like a mutex for example? We only need to
 lock it once. See __mutex_lock_common - uses a for loop,
 doesn't it?

 So what?

 So fix this please, use while or for or even until.

 I do not think it is necessary.

I tend to use again: in such corner cases, where I don't expect it to
be invoked very often.  A do loop looks too normal.

I do exactly the same thing in virtio-net.c, for the same case.

Would you like me to review the entire thing?

Cheers,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: suggesting wording fixes for virtio-spec 0.9.5

2013-04-22 Thread Rusty Russell
Laszlo Ersek ler...@redhat.com writes:
 Hi,

 (I'm not subscribed to either list,)

 using the word "descriptor" is misleading in the following sections:

Yes, I like the use of 'descriptor chains'.  This is a definite
improvement.

Here's the diff I ended up with (massaged to minimize it).

Thanks!
Rusty.

--- virtio-spec.txt-old 2013-04-23 13:22:21.339158214 +0930
+++ virtio-spec.txt 2013-04-23 13:34:14.055176464 +0930
@@ -482,10 +482,10 @@
 
 2.3.4 Available Ring
 
-The available ring refers to what descriptors we are offering the 
-device: it refers to the head of a descriptor chain. The “flags” 
+The available ring refers to what descriptor chains we are offering the
+device: each entry refers to the head of a descriptor chain. The “flags”
 field is currently 0 or 1: 1 indicating that we do not need an 
-interrupt when the device consumes a descriptor from the 
+interrupt when the device consumes a descriptor chain from the
 available ring. Alternatively, the guest can ask the device to 
 delay interrupts until an entry with an index specified by the “
 used_event” field is written in the used ring (equivalently, 
@@ -671,16 +671,16 @@
 
 avail-ring[avail-idx % qsz] = head;
 
-However, in general we can add many descriptors before we update 
-the “idx” field (at which point they become visible to the 
-device), so we keep a counter of how many we've added:
+However, in general we can add many separate descriptor chains before we update
+the “idx” field (at which point they become visible to the device),
+so we keep a counter of how many we've added:
 
 avail-ring[(avail-idx + added++) % qsz] = head;
 
 2.4.1.3 Updating The Index Field
 
 Once the idx field of the virtqueue is updated, the device will 
-be able to access the descriptor entries we've created and the 
+be able to access the descriptor chains we've created and the 
 memory they refer to. This is why a memory barrier is generally 
 used before the idx update, to ensure it sees the most up-to-date 
 copy.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v7 3/3] tcm_vhost: Add hotplug/hotunplug support

2013-04-22 Thread Asias He
On Tue, Apr 23, 2013 at 01:48:51PM +0930, Rusty Russell wrote:
 Asias He as...@redhat.com writes:
  On Mon, Apr 22, 2013 at 04:17:04PM +0300, Michael S. Tsirkin wrote:
 +evt = kzalloc(sizeof(*evt), GFP_KERNEL);

I think kzalloc not needed here, you init all fields.
   
    Not really! evt->event.lun[4-7] is not initialized. It needs to
    be 0.
  
  So that is 4 bytes just init them when you set rest of lun.
 
 It is not in the fast path. You can do it this way but not a must.

I think that's a bit cleaner than relying on kzalloc to 
zero-initialize.
   
   I think it is a bit shorter, 4 lines saved. 
  
  OTOH you don't have to think "what about the rest of the bytes?" then
  hunt through the code and go "ah, it's kzalloc".
  Looks like it's a style issue but I prefer explicit initialization.
 
  It just makes no sense to argue about this. If you really want, you
  can make patches to remove all the lines where kzalloc is used because it is
  cleaner.
 
 I prefer explicit initialization too, FWIW.

I do not really care. I am OK with either way.

 +static void tcm_vhost_do_evt_work(struct vhost_scsi *vs,
 +struct tcm_vhost_evt *evt)
 +{
 +struct vhost_virtqueue *vq = &vs->vqs[VHOST_SCSI_VQ_EVT];
 +struct virtio_scsi_event *event = &evt->event;
 +struct virtio_scsi_event __user *eventp;
 +unsigned out, in;
 +int head, ret;
 +
 +if (!tcm_vhost_check_endpoint(vq))
 +return;
 +
 +mutex_lock(&vs->vs_events_lock);
 +mutex_lock(&vq->mutex);
 +again:
 +vhost_disable_notify(&vs->dev, vq);
 +head = vhost_get_vq_desc(&vs->dev, vq, vq->iov,
 +ARRAY_SIZE(vq->iov), &out, &in,
 +NULL, NULL);
 +if (head < 0) {
 +vs->vs_events_dropped = true;
 +goto out;
 +}
 +if (head == vq->num) {
 +if (vhost_enable_notify(&vs->dev, vq))
 +goto again;

Please write loops with while() or for().
Not with goto. goto is for error handling.
   
   This makes extra indention which is more ugly.
  
  I don't care. No loops with goto and that's a hard rule.
 
 It is not a loop.

If same code repeats, it's a loop.
   
    What we really need here is to call vhost_get_vq_desc once. It's not a loop
    like the other places that use while() or for() with vhost_get_vq_desc.
  
  Exactly. Like a mutex for example? We only need to
  lock it once. See __mutex_lock_common - uses a for loop,
  doesn't it?
 
  So what?
 
  So fix this please, use while or for or even until.
 
  I do not think it is necessary.
 
 I tend to use again: in such corner cases, where I don't expect it to
 be invoked very often.  A do loop looks too normal.
 
 I do exactly the same thing in virtio-net.c, for the same case.
 
 Would you like me to review the entire thing?

Sure. Thanks Rusty!

 Cheers,
 Rusty.

-- 
Asias
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] virtio-net: fill only rx queues which are being used

2013-04-22 Thread Sasha Levin
On 04/23/2013 12:13 AM, Rusty Russell wrote:
 Sasha Levin sasha.le...@oracle.com writes:
 Due to MQ support we may allocate a whole bunch of rx queues but
  never use them. With this patch we'll save the space used by
 the receive buffers until they are actually in use:
 
 Idea is good, implementation needs a tiny tweak:
 
 @@ -912,8 +913,13 @@ static int virtnet_set_queues(struct virtnet_info *vi, 
 u16 queue_pairs)
  dev_warn(&dev->dev, "Fail to set num of queue pairs to %d\n",
   queue_pairs);
  return -EINVAL;
 -} else
 +} else {
 +if (queue_pairs > vi->curr_queue_pairs)
 +for (i = 0; i < queue_pairs; i++)
 +if (!try_fill_recv(&vi->rq[i], GFP_KERNEL))
 +schedule_delayed_work(&vi->refill, 0);
  vi->curr_queue_pairs = queue_pairs;
 +}
  
  return 0;
  }
 
 You don't want to refill existing queues, so you don't need the if.
 
  for (i = vi->curr_queue_pairs; i < queue_pairs; i++) {
    if (!try_fill_recv(&vi->rq[i], GFP_KERNEL))
      schedule_delayed_work(&vi->refill, 0);

That makes more sense, I'll resend.

 We don't free up buffers when we're reducing queues, but I consider that
 a corner case.

It didn't bother anyone up until now, and the spec doesn't state anything
about it - so I preferred to just leave that alone. Unless the ARM folks
would find it useful?


Thanks,
Sasha
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Preemptable Ticket Spinlock

2013-04-22 Thread Raghavendra K T

On 04/23/2013 02:31 AM, Peter Zijlstra wrote:

On Mon, 2013-04-22 at 16:49 -0400, Rik van Riel wrote:

Given the fairly high cost of rescheduling a VCPU (which is likely
to include an IPI), versus the short hold time of most spinlocks,
I have the strong suspicion that your approach would win.


   https://lkml.org/lkml/2012/5/2/101

If you schedule too often your SPIN_THRESHOLD is far too low.

Anyway.. performance can't be that bad, otherwise Jeremey would have
spend as much time on it as he did.


When I experimented last time, the ideal SPIN_THRESHOLD for a PLE machine
was around 4k-8k. Jeremy's experiment was on a non-PLE machine AFAIK,
where a 2k threshold complemented the PLE feature in a nice way.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] Para-Virtualized Clock Usage

2013-04-22 Thread Gleb Natapov
On Mon, Apr 22, 2013 at 04:58:01PM +, Joji Mekkattuparamban (joji) wrote:
 Greetings,
 
 I have an SMP guest application, running on the 2.6.27 Linux kernel. The
 application, originally written for bare metal, makes extensive use of the
 TSC by directly invoking rdtsc from user space for timestamp purposes.
 While running on KVM (RHEL version 6.3), we are running into TSC issues on
 some hardware. As a solution, I am considering migrating to the pvclock. I am
 wondering if there is an example of migrating from TSC to the pvclock. Any
 pointers?
 
Wrong list, you should ask KVM (copied). Recent kernels have pvclock
vdso support which means that gettimeofday() uses it without entering
the kernel. Marcelo?
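
(For anyone following the migration question, a minimal sketch: instead of
issuing rdtsc directly, take timestamps through clock_gettime(), which uses the
vDSO fast path on pvclock-capable kernels; link with -lrt on older glibc.)

#include <stdio.h>
#include <time.h>

int main(void)
{
	struct timespec ts;

	/* vDSO fast path on recent pvclock-capable kernels: no kernel entry,
	 * and stable across hosts, unlike a raw rdtsc from user space. */
	clock_gettime(CLOCK_MONOTONIC, &ts);
	printf("%lld.%09ld\n", (long long)ts.tv_sec, ts.tv_nsec);
	return 0;
}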

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Preemptable Ticket Spinlock

2013-04-22 Thread Gleb Natapov
On Mon, Apr 22, 2013 at 07:08:06PM -0400, Rik van Riel wrote:
 On 04/22/2013 04:55 PM, Peter Zijlstra wrote:
 On Mon, 2013-04-22 at 16:46 -0400, Jiannan Ouyang wrote:
 
 - pv-preemptable-lock has much less performance variance compared to
 pv_lock, because it adapts to preemption within the VM,
rather than using rescheduling that increases VM interference
 
 I would say it has a _much_ worse worst case (and thus worse variance)
 than the paravirt ticket implementation from Jeremy. While full
 paravirt ticket lock results in vcpu scheduling it does maintain
 fairness.
 
 If you drop strict fairness you can end up in unbounded starvation
 cases and those are very ugly indeed.
 
 If needed, Jiannan's scheme could easily be bounded to prevent
 infinite starvation. For example, we could allow only the first
 8 CPUs in line to jump the queue.
 
 However, given the way that virtual CPUs get scheduled in and
 out all the time, I suspect starvation is not a worry, and we
 will not need the additional complexity to deal with it.
 
FWIW RHEL6 uses unfair spinlocks when it runs as a guest. We never got
reports about problems due to this at any scale.

 You may want to play around with virtualization a bit, to get
 a feel for how things work in virt land.
 
 -- 
 All rights reversed

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 15/17] KVM: PPC: Support irq routing and irqfd for in-kernel MPIC

2013-04-22 Thread Scott Wood

On 04/18/2013 07:15:46 PM, Alexander Graf wrote:


On 18.04.2013, at 23:39, Scott Wood wrote:

  Do we really want any default routes?  There's no platform notion of GSI
  here, so how is userspace to know how the kernel set it up (or what GSIs
  are free to be used for new routes) -- other than the "read the code"
  answer I got when I asked about x86?  :-P

The default routes really just expose all pins 1:1 as GSIs. I
think it makes sense to have a simple default for user space that
doesn't want to mess with irq routing.


What GSIs are free for new routes doesn't matter. Routes are always
completely rewritten as a whole from user space. So when user space
goes in and wants to change only a single line it has to lay out the
full map itself anyway.


It looks like you already write the routes in your QEMU patches, so I'd  
like to avoid adding MPIC default routes in KVM to keep things simple.   
It's legacy baggage from day one.  With default routes, what happens if  
we later support instantiating multiple interrupt controllers?
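
(A sketch of what laying out such a 1:1 pin-to-GSI map from user space could
look like via KVM_SET_GSI_ROUTING; the helper name is made up and most error
handling is omitted.)

#include <linux/kvm.h>
#include <stdlib.h>
#include <sys/ioctl.h>

static int set_identity_routes(int vm_fd, unsigned int nr_pins)
{
	struct kvm_irq_routing *table;
	unsigned int i;
	int ret;

	table = calloc(1, sizeof(*table) +
			  nr_pins * sizeof(struct kvm_irq_routing_entry));
	if (!table)
		return -1;

	table->nr = nr_pins;
	for (i = 0; i < nr_pins; i++) {
		table->entries[i].gsi			= i;
		table->entries[i].type			= KVM_IRQ_ROUTING_IRQCHIP;
		table->entries[i].u.irqchip.irqchip	= 0;	/* first chip */
		table->entries[i].u.irqchip.pin		= i;
	}

	ret = ioctl(vm_fd, KVM_SET_GSI_ROUTING, table);
	free(table);
	return ret;
}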


-Scott
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html