Re: [kvm-kmod PATCH] Provide pci_enable_msix_exact() for kernels < 3.15
On 2014-05-06 20:35, gso...@gmail.com wrote:
> Signed-off-by: Gabriel Somlo <so...@cmu.edu>
> ---
> Jan,
>
> After today's pull from kvm, I also need this to build against my
> Fedora 20 kernel (3.13.10-200.fc20.x86_64).

Which version did you pull? Neither next nor queue contain this change yet.

> Thanks,
> Gabriel
>
>  x86/external-module-compat.h | 5 +
>  1 file changed, 5 insertions(+)
>
> diff --git a/x86/external-module-compat.h b/x86/external-module-compat.h
> index dde9463..e62486b 100644
> --- a/x86/external-module-compat.h
> +++ b/x86/external-module-compat.h
> @@ -1480,6 +1480,11 @@ static inline int __register_hotcpu_notifier(struct notifier_block *nb)
>  {
>  	return 0;
>  }
> +static inline int pci_enable_msix_exact(struct pci_dev *dev,
> +					struct msix_entry *entries, int nvec)
> +{
> +	return -ENOSYS;
> +}

First, this is not x86-specific, thus should go to external-module-compat-comm.h. Second, pci_enable_msix_exact was introduced with 3.14. And then I think we can (and should) provide the original content of this wrapper once it is needed, not a stub.

Right now I'm seeing different build problems:

http://buildbot.kiszka.org/kvm-kmod/builders/4-latest-kvm/builds/1788/steps/3.11-x86-64/logs/stdio

Didn't look into details yet.

Jan

--
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [kvm-kmod PATCH] Provide pci_enable_msix_exact() for kernels < 3.15
On 07/05/2014 08:29, Jan Kiszka wrote:
>> Signed-off-by: Gabriel Somlo <so...@cmu.edu>
>> ---
>> Jan,
>>
>> After today's pull from kvm, I also need this to build against my
>> Fedora 20 kernel (3.13.10-200.fc20.x86_64).
>
> Which version did you pull? Neither next nor queue contain this change yet.

It went in via kvm/master. It was submitted before 3.15, but I didn't send it during the merge window because:

1) we had to wait for the function itself to get in via the PCI tree;
2) Marcelo didn't know about this patch.

Paolo
Re: KVM exit on UD interception
On 06/05/2014 22:11, Alexandru Duţu wrote:
> What is puzzling though is the fact that even if there is an instruction
> that can't be emulated by KVM, re-entering virtualized mode is still
> attempted. Re-entering will produce a double fault,

That would be a bug. It should inject the #UD into the guest.

Paolo
Re: [PATCH 1/6] KVM: PPC: Book3S PR: Ignore PMU SPRs
On Fri, May 02, 2014 at 10:35:09AM +0200, Alexander Graf wrote:
> On 05/01/2014 12:12 AM, Paul Mackerras wrote:
>> On Tue, Apr 29, 2014 at 06:17:37PM +0200, Alexander Graf wrote:
>>> When we expose a POWER8 CPU into the guest, it will start accessing
>>> PMU SPRs that we don't emulate. Just ignore accesses to them.
>>>
>>> Signed-off-by: Alexander Graf <ag...@suse.de>
>>
>> This patch is OK as it stands, but in fact the architecture says that
>> kernel accesses to unimplemented SPRs are mostly supposed to be no-ops
>> rather than causing a trap (mostly == excluding mtspr to 0 or mfspr
>> from 0, 4, 5 or 6). I have a patch to implement that, which I'll post.
>
> I think what we want is a flag similar to x86 where we can force ignore
> unknown SPRs, but leave it at triggering an interrupt as default.
>
> We usually have to be at least aware of unknown SPRs and check that not
> implementing them is ok for the guest. Debugging a program interrupt
> because of an unknown SPR is usually a lot easier than debugging a
> breaking guest because it was using the SPR as storage and we didn't
> back it by anything.

That has not been my experience, for accesses by the Linux kernel early in the boot process; usually we end up in a loop of ISI interrupts because the HPT isn't set up yet, with the original interrupt cause (and PC) lost long ago.

The Power ISA was changed in version 2.05 (POWER6) to specify that accesses to unimplemented SPRs by privileged code must be no-ops on server processors. Before that, the architecture allowed either an illegal instruction interrupt or boundedly undefined behaviour (which would include a no-op). So, if we're emulating POWERx for x >= 6, to be correct we need to do the no-op behaviour, even if we retain the option of making them trap for debugging purposes. Of course, at the moment we basically never look at what specific CPU we're emulating, but maybe now we have to.

Regards,
Paul.
Re: [PATCH] KVM: PPC: BOOK3S: PR: Fix WARN_ON with debug options on
Paul Mackerras <pau...@samba.org> writes:
> On Sun, May 04, 2014 at 10:56:08PM +0530, Aneesh Kumar K.V wrote:
>> With debug option sleep inside atomic section checking enabled we get
>> the below WARN_ON during a PR KVM boot. This is because upstream now
>> has PREEMPT_COUNT enabled even if we have preempt disabled. Fix the
>> warning by adding preempt_disable/enable around floating point and
>> altivec enable.
>
> This worries me a bit. In this code:
>
>> 	if (msr & MSR_FP) {
>> +		preempt_disable();
>> 		enable_kernel_fp();
>> 		load_fp_state(&vcpu->arch.fp);
>> 		t->fp_save_area = &vcpu->arch.fp;
>> +		preempt_enable();
>
> What would happen if we actually did get preempted at this point?
> Wouldn't we lose the FP state we just loaded?

I was not sure we have got CONFIG_PREEMPT working with kvm. So I was not looking at the preempted case. But yes, if we have PREEMPT enabled it would break as per the current code.

> In other words, how come we're not already preempt-disabled at this point?

I don't see us disabling preempt in the code path. Are you suggesting that we should be preempt-disabled for the whole duration of kvmppc_handle_ext?

-aneesh
Re: [RFC PATCH 1/4] KVM: emulate: speed up do_insn_fetch
On 07/05/2014 04:30, Bandan Das wrote:
>> +	if (unlikely(ctxt->_eip == fc->end)) {
>
> Is this really going to be unlikely?

Yes, it happens at most once per instruction and only for instructions that cross pages.

Paolo
Re: [RFC PATCH 2/4] KVM: emulate: avoid repeated calls to do_insn_fetch_bytes
On 07/05/2014 06:21, Bandan Das wrote:
>> +		if (rc != X86EMUL_CONTINUE)
>> +			goto done;
>> +	}
>> +
>> 	while (size--) {
>> -		if (unlikely(ctxt->_eip == fc->end)) {
>> -			rc = do_insn_fetch_bytes(ctxt);
>> -			if (rc != X86EMUL_CONTINUE)
>> -				return rc;
>> -		}
>> 		*dest++ = *src++;
>> 		ctxt->_eip++;
>> 		continue;
>>
>> @@ -4273,7 +4282,7 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len)
>> 	if (insn_len > 0)
>> 		memcpy(ctxt->fetch.data, insn, insn_len);
>> 	else {
>> -		rc = do_insn_fetch_bytes(ctxt);
>> +		rc = do_insn_fetch_bytes(ctxt, 1);
>
> Is this saying that if the cache is full, then we fetch one more byte?

No, it is saying that if the instruction is being executed for the first time (we can execute it multiple times if we reenter a repeated instruction after a userspace exit) we try to get at least one byte from RIP. Most of the time, do_insn_fetch_bytes will fetch 15 bytes, which is the maximum length of an instruction. Passing op_size == 1 matches this change in do_insn_fetch_bytes:

-	if (unlikely(size == 0))
+	if (unlikely(size < op_size))

Paolo
Re: [RFC PATCH 3/4] KVM: emulate: avoid per-byte copying in instruction fetches
On 07/05/2014 06:36, Bandan Das wrote:
>> +	_x = *(_type __aligned(1) *) &_fc->data[ctxt->_eip - _fc->start]; \
>
> For my own understanding, how does the __aligned help here?

Except for 16-byte SSE accesses, x86 doesn't distinguish aligned and unaligned accesses. You can read 4 bytes at 0x2345 and the processor will do the right thing. Still, it's better to tell the compiler what we're doing.

> Wouldn't that result in unaligned accesses that will actually impact
> performance?

These accesses *can* and will in fact be unaligned. For example, say you have

	mov ax,0x1234

which is 0xb8 0x34 0x12. When you read it into the fetch cache, you will have data[0] == 0xb8, data[1] == 0x34, data[2] == 0x12. Fetching the 16-bit immediate from data[1] will then be an unaligned access.

Paolo
Re: KVM Nested L2 guest startup problems
On 04/05/2014 18:33, Hu Yaohui wrote:
> I experienced a similar problem that was related to nested code having
> some bugs related to apicv and other new vmx features. For example, the
> code enabled posted interrupts to run L2 even when the feature was not
> exposed to L1 and L1 didn't use it. Try changing prepare_vmcs02 to force
> disabling posted_interrupts; code should look like:
>
> 	exec_control = vmcs12->pin_based_vm_exec_control;
> 	exec_control |= vmcs_config.pin_based_exec_ctrl;
> 	exec_control &= ~(PIN_BASED_VMX_PREEMPTION_TIMER |
> 			  PIN_BASED_POSTED_INTR);
> 	vmcs_write32(PIN_BASED_VM_EXEC_CONTROL, exec_control);
>
> ... and also ...
>
> 	exec_control &= ~(SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
> 			  SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |
> 			  SECONDARY_EXEC_APIC_REGISTER_VIRT |
> 			  SECONDARY_EXEC_PAUSE_LOOP_EXITING);

PLE should be left enabled, I think. Apart from that, I'll change the suggestion into a patch. Thanks!

Paolo
Re: [PATCH 2/2] ARM: KVM: user_mem_abort: support stage 2 MMIO page mapping
Kim, Christoffer,

On Tue, May 06 2014 at 7:04:48 pm BST, Christoffer Dall <christoffer.d...@linaro.org> wrote:
> On Tue, Mar 25, 2014 at 05:08:14PM -0500, Kim Phillips wrote:
>> Use the correct memory type for device MMIO mappings: PAGE_S2_DEVICE.
>>
>> Signed-off-by: Kim Phillips <kim.phill...@linaro.org>
>> ---
>>  arch/arm/kvm/mmu.c | 11 ++++++++---
>>  1 file changed, 8 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>> index 7789857..a354610 100644
>> --- a/arch/arm/kvm/mmu.c
>> +++ b/arch/arm/kvm/mmu.c
>> @@ -652,6 +652,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>  	struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
>>  	struct vm_area_struct *vma;
>>  	pfn_t pfn;
>> +	pgprot_t mem_type = PAGE_S2;
>>
>>  	write_fault = kvm_is_write_fault(kvm_vcpu_get_hsr(vcpu));
>>  	if (fault_status == FSC_PERM && !write_fault) {
>> @@ -702,6 +703,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>  	if (is_error_pfn(pfn))
>>  		return -EFAULT;
>>
>> +	if (kvm_is_mmio_pfn(pfn))
>> +		mem_type = PAGE_S2_DEVICE;
>> +
>>  	spin_lock(&kvm->mmu_lock);
>>  	if (mmu_notifier_retry(kvm, mmu_seq))
>>  		goto out_unlock;
>> @@ -709,7 +713,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>
>>  	hugetlb = transparent_hugepage_adjust(&pfn, &fault_ipa);
>>  	if (hugetlb) {
>> -		pmd_t new_pmd = pfn_pmd(pfn, PAGE_S2);
>> +		pmd_t new_pmd = pfn_pmd(pfn, mem_type);
>>  		new_pmd = pmd_mkhuge(new_pmd);
>>  		if (writable) {
>>  			kvm_set_s2pmd_writable(&new_pmd);
>>  			kvm_set_pfn_dirty(pfn);
>> @@ -718,13 +722,14 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>  		}
>>  		coherent_icache_guest_page(kvm, hva & PMD_MASK, PMD_SIZE);
>>  		ret = stage2_set_pmd_huge(kvm, memcache, fault_ipa, &new_pmd);
>>  	} else {
>> -		pte_t new_pte = pfn_pte(pfn, PAGE_S2);
>> +		pte_t new_pte = pfn_pte(pfn, mem_type);
>>  		if (writable) {
>>  			kvm_set_s2pte_writable(&new_pte);
>>  			kvm_set_pfn_dirty(pfn);
>>  		}
>>  		coherent_icache_guest_page(kvm, hva, PAGE_SIZE);
>> -		ret = stage2_set_pte(kvm, memcache, fault_ipa, &new_pte, false);
>> +		ret = stage2_set_pte(kvm, memcache, fault_ipa, &new_pte,
>> +				     mem_type == PAGE_S2_DEVICE);
>>  	}
>> --
>> 1.9.1
>
> I think this looks reasonable.
>
> Acked-by: Christoffer Dall <christoffer.d...@linaro.org>

I feel like I'm missing some context here, and the commit message is way too terse for me to make sense of it. So far, we can only get into user_mem_abort on a Stage-2 fault (translation or permission) for memory. How can we suddenly get here for a *device* fault? Do we get a special kind of memslot?

I'm not saying the patch does anything wrong, but I'd like to understand the rationale behind it. On its own, it doesn't make much sense.

Thanks,

M.

--
Jazz is not dead. It just smells funny.
Re: [PATCH v3 9/9] kvmtool: virtio: enable arm/arm64 support for bi-endianness
On 6 May 2014 19:38, Peter Maydell <peter.mayd...@linaro.org> wrote:
> On 6 May 2014 18:25, Marc Zyngier <marc.zyng...@arm.com> wrote:
>> On Tue, May 06 2014 at 3:28:07 pm BST, Will Deacon <will.dea...@arm.com> wrote:
>>> On Thu, Apr 24, 2014 at 07:17:23PM +0100, Marc Zyngier wrote:
>>>> +	reg.addr = (u64)&data;
>>>> +	if (ioctl(vcpu->vcpu_fd, KVM_GET_ONE_REG, &reg) < 0)
>>>> +		die("KVM_GET_ONE_REG failed (SCTLR_EL1)");
>>>> +
>>>> +	return (data & SCTLR_EL1_EE_MASK) ? VIRTIO_ENDIAN_BE : VIRTIO_ENDIAN_LE;
>>>
>>> This rules out guests where userspace and kernelspace can run with
>>> different endianness. Whilst Linux doesn't currently do this, can we
>>> support it here? It all gets a bit hairy if the guest is using a
>>> stage-1 SMMU to let userspace play with a virtio device...
>>
>> Yeah, I suppose we could check either EE or E0 depending on the mode
>> when the access was made. We already have all the information, just
>> need to handle the case. I'll respin the series.
>
> How virtio implementations should determine their endianness is a spec
> question, I think; at any rate QEMU and kvmtool ought to agree on how
> it's done. I think the most recent suggestion on the QEMU mailing list
> (for PPC) is that we should care about the guest kernel endianness,
> but I don't know if anybody thought of the pass-through-to-userspace
> usecase...

Current opinion on the qemu-devel thread seems to be that we should just define that the endianness of the virtio device is the endianness of the guest kernel at the point where the guest triggers a reset of the virtio device by writing zero to the QueuePFN or Status registers.

thanks
-- PMM
Re: [PATCH v3 9/9] kvmtool: virtio: enable arm/arm64 support for bi-endianness
On 07.05.2014 11:34, Peter Maydell <peter.mayd...@linaro.org> wrote:
> [...]
> Current opinion on the qemu-devel thread seems to be that we should just
> define that the endianness of the virtio device is the endianness of the
> guest kernel at the point where the guest triggers a reset of the virtio
> device by writing zero to the QueuePFN or Status registers.

Virtio by design has full access to guest physical memory. It doesn't route DMA via PCI. So user space drivers simply don't make sense here.

Alex
Re: [PATCH v3 9/9] kvmtool: virtio: enable arm/arm64 support for bi-endianness
On Wed, May 07 2014 at 10:34:30 am BST, Peter Maydell <peter.mayd...@linaro.org> wrote:
> [...]
> Current opinion on the qemu-devel thread seems to be that we should just
> define that the endianness of the virtio device is the endianness of the
> guest kernel at the point where the guest triggers a reset of the virtio
> device by writing zero to the QueuePFN or Status registers.

On AArch32, we only have the CPSR.E bit to select the endianness. Are we going to simply explode if the access comes from userspace?

On AArch64, we can either select the kernel endianness, or userspace endianness. Are we going to go a different route just for the sake of enforcing kernel access?

I'm inclined to think of userspace access as a valid use case.

M.

--
Jazz is not dead. It just smells funny.
Re: [PATCH v3 9/9] kvmtool: virtio: enable arm/arm64 support for bi-endianness
On 07.05.2014 11:52, Marc Zyngier <marc.zyng...@arm.com> wrote:
> [...]
> On AArch32, we only have the CPSR.E bit to select the endianness. Are we
> going to simply explode if the access comes from userspace?
>
> On AArch64, we can either select the kernel endianness, or userspace
> endianness. Are we going to go a different route just for the sake of
> enforcing kernel access?
>
> I'm inclined to think of userspace access as a valid use case.

It's not for virtio-legacy. It'll be much more productive to influence virtio-1.0 to not redo the same mistakes than enabling even more hackery with the legacy one.

Alex
Re: [PATCH v3 9/9] kvmtool: virtio: enable arm/arm64 support for bi-endianness
On Wed, May 07 2014 at 10:42:54 am BST, Alexander Graf <ag...@suse.de> wrote:
> [...]
> Virtio by design has full access to guest physical memory. It doesn't
> route DMA via PCI. So user space drivers simply don't make sense here.

Huh? What if my guest has userspace using an idmap, with Stage-1 MMU for isolation only (much like an MPU)? R-class guests, anyone?

Agreed, this is not the general use case, but that doesn't seem to be completely unrealistic either.

M.

--
Jazz is not dead. It just smells funny.
Re: [PATCH v3 9/9] kvmtool: virtio: enable arm/arm64 support for bi-endianness
On 05/07/2014 11:57 AM, Marc Zyngier wrote:
> [...]
> Huh? What if my guest has userspace using an idmap, with Stage-1 MMU for
> isolation only (much like an MPU)? R-class guests, anyone?
>
> Agreed, this is not the general use case, but that doesn't seem to be
> completely unrealistic either.

Yes, and once that user tries the same without idmap, virtio ends up overwriting random memory. It's just not a good idea and I'd much rather see us solve this properly with virtio 1.0 really.

Of course, alternatively we can continue bikeshedding about this until everything becomes moot because we switched to virtio 1.0 ;).

Rusty / Michael, virtio 1.0 does go via normal DMA channels, right?

Alex
Re: [PATCH v3 9/9] kvmtool: virtio: enable arm/arm64 support for bi-endianness
On 7 May 2014 10:52, Marc Zyngier <marc.zyng...@arm.com> wrote:
> On Wed, May 07 2014 at 10:34:30 am BST, Peter Maydell
> <peter.mayd...@linaro.org> wrote:
>> Current opinion on the qemu-devel thread seems to be that we should just
>> define that the endianness of the virtio device is the endianness of the
>> guest kernel at the point where the guest triggers a reset of the virtio
>> device by writing zero to the QueuePFN or Status registers.
>
> On AArch32, we only have the CPSR.E bit to select the endianness. Are we
> going to simply explode if the access comes from userspace?

There's SCTLR.EE in AArch32, right?

> On AArch64, we can either select the kernel endianness, or userspace
> endianness. Are we going to go a different route just for the sake of
> enforcing kernel access? I'm inclined to think of userspace access as a
> valid use case.

I don't actually care much about the details of what we decide; I just want us to be consistent between QEMU and kvmtool and (to the extent that architectural differences permit) consistent between PPC and ARM. At the moment we seem to be heading in gratuitously different directions.

thanks
-- PMM
Re: [PATCH v3 9/9] kvmtool: virtio: enable arm/arm64 support for bi-endianness
On Wed, May 07 2014 at 10:55:45 am BST, Alexander Graf <ag...@suse.de> wrote:
> [...]
>> On AArch32, we only have the CPSR.E bit to select the endianness. Are we
>> going to simply explode if the access comes from userspace?
>>
>> On AArch64, we can either select the kernel endianness, or userspace
>> endianness. Are we going to go a different route just for the sake of
>> enforcing kernel access? I'm inclined to think of userspace access as a
>> valid use case.
>
> It's not for virtio-legacy. It'll be much more productive to influence
> virtio-1.0 to not redo the same mistakes than enabling even more hackery
> with the legacy one.

Are you saying I shouldn't improve an existing code base and implement a useful feature, and should instead work on some new fancy stuff for which there is no platform support, no kernel support, and not an official spec either?

Watch me.

M.

--
Jazz is not dead. It just smells funny.
Re: [PATCH v3 9/9] kvmtool: virtio: enable arm/arm64 support for bi-endianness
On Wed, May 07, 2014 at 12:11:13PM +0200, Alexander Graf wrote: On 05/07/2014 11:57 AM, Marc Zyngier wrote: On Wed, May 07 2014 at 10:42:54 am BST, Alexander Graf ag...@suse.de wrote: Am 07.05.2014 um 11:34 schrieb Peter Maydell peter.mayd...@linaro.org: On 6 May 2014 19:38, Peter Maydell peter.mayd...@linaro.org wrote: On 6 May 2014 18:25, Marc Zyngier marc.zyng...@arm.com wrote: On Tue, May 06 2014 at 3:28:07 pm BST, Will Deacon will.dea...@arm.com wrote: On Thu, Apr 24, 2014 at 07:17:23PM +0100, Marc Zyngier wrote: +reg.addr = (u64)data; +if (ioctl(vcpu-vcpu_fd, KVM_GET_ONE_REG, reg) 0) +die(KVM_GET_ONE_REG failed (SCTLR_EL1)); + +return (data SCTLR_EL1_EE_MASK) ? VIRTIO_ENDIAN_BE : VIRTIO_ENDIAN_LE; This rules out guests where userspace and kernelspace can run with different endinness. Whilst Linux doesn't currently do this, can we support it here? It all gets a bit hairy if the guest is using a stage-1 SMMU to let userspace play with a virtio device... Yeah, I suppose we could check either EE or E0 depending on the mode when the access was made. We already have all the information, just need to handle the case. I'll respin the series. How virtio implementations should determine their endianness is a spec question, I think; at any rate QEMU and kvmtool ought to agree on how it's done. I think the most recent suggestion on the QEMU mailing list (for PPC) is that we should care about the guest kernel endianness, but I don't know if anybody thought of the pass-through-to-userspace usecase... Current opinion on the qemu-devel thread seems to be that we should just define that the endianness of the virtio device is the endianness of the guest kernel at the point where the guest triggers a reset of the virtio device by writing zero the QueuePFN or Status registers. Virtio by design has full access to guest physical memory. It doesn't route DMA via PCI. So user space drivers simply don't make sense here. Huh? 
What if my guest has userspace using an idmap, with a Stage-1 MMU for isolation only (much like an MPU)? R-class guests, anyone? Agreed, this is not the general use case, but that doesn't seem to be completely unrealistic either. Yes, and once that user tries the same without idmap, virtio ends up overwriting random memory. It's just not a good idea and I'd much rather see us solve this properly with virtio 1.0 really. Of course alternatively we can continue bikeshedding about this until everything becomes moot because we switched to virtio 1.0 ;). Rusty / Michael, virtio 1.0 does go via normal DMA channels, right? Alex By default it doesn't at the moment; in particular IOMMUs really seem to hurt performance when enabled, and many guests seem to be dumb and enable it everywhere if present. By design IOMMUs can protect you from malicious devices, which is relevant if you assign a device but of course isn't relevant for qemu as virtio is part of qemu atm. In virtio 1.0 it's possible for a device to have required features which drivers must ack. So we'll be able to have the hypervisor tell the guest that it requires DMA for some VFs. You would then be able to mix a fast virtio PF bypassing the IOMMU and handled in the kernel, and a slow virtio VF going through an IOMMU and handled in userspace. -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvmclock: Ensure time in migration never goes backward
Hi all, On 06/05/14 08:16, Alexander Graf wrote: On 06.05.14 01:23, Marcelo Tosatti wrote: 1) By what algorithm do you retrieve and compare time in the kvmclock guest structure and KVM_GET_CLOCK? What are the results of the comparison? And was any backwards time visible in the guest? I've managed to get my hands on a broken migration stream from Nick. There I looked at the curr_clocksource structure and saw that the last seen time on the kvmclock clock source was greater than the value that the kvmclock device migrated. We've been seeing live migration failures where the guest sees time go backwards (= massive forward leap to the kernel, apparently) for a while now, affecting perhaps 5-10% of migrations we'd do (usually a large proportion of the migrations on a few hosts, rather than an even spread); initially in December, when we tried an upgrade to QEMU 1.7.1 and a 3.mumble (3.10?) kernel, from 1.5.0 and Debian's 3.2. My testing at the time seemed to indicate that either upgrade - qemu or kernel - caused the problems to show up. Guest symptoms are that the kernel enters a tight loop in __run_timers and stays there. In the end, I gave up and downgraded us again without any clear idea of what was happening, or why. In April, we finally got together a fairly reliable test case. This patch resolves the guest hangs in that test, and I've also been able to conduct 1000 migrations of production guests without seeing the issue recur. So, Tested-by: Nick Thomas n...@bytemark.co.uk /Nick
Re: [PATCH v3 9/9] kvmtool: virtio: enable arm/arm64 support for bi-endianness
On Wed, May 07 2014 at 11:11:13 am BST, Alexander Graf ag...@suse.de wrote: On 05/07/2014 11:57 AM, Marc Zyngier wrote: Huh? What if my guest has userspace using an idmap, with a Stage-1 MMU for isolation only (much like an MPU)? R-class guests, anyone? Agreed, this is not the general use case, but that doesn't seem to be completely unrealistic either. Yes, and once that user tries the same without idmap, virtio ends up overwriting random memory. And how different is that from the kernel suddenly deciding to use VAs instead of PAs? Just as broken. Are we going to prevent the kernel from using virtio? It's just not a good idea and I'd much rather see us solve this properly with virtio 1.0 really. Again, what is virtio 1.0 doing here? M. -- Jazz is not dead. It just smells funny.
Re: [PATCH v3 9/9] kvmtool: virtio: enable arm/arm64 support for bi-endianness
On Wed, 07 May 2014 10:52:01 +0100 Marc Zyngier marc.zyng...@arm.com wrote: On Wed, May 07 2014 at 10:34:30 am BST, Peter Maydell peter.mayd...@linaro.org wrote: On 6 May 2014 19:38, Peter Maydell peter.mayd...@linaro.org wrote: On 6 May 2014 18:25, Marc Zyngier marc.zyng...@arm.com wrote: On Tue, May 06 2014 at 3:28:07 pm BST, Will Deacon will.dea...@arm.com wrote: On Thu, Apr 24, 2014 at 07:17:23PM +0100, Marc Zyngier wrote: + reg.addr = (u64)&data; + if (ioctl(vcpu->vcpu_fd, KVM_GET_ONE_REG, &reg) < 0) + die("KVM_GET_ONE_REG failed (SCTLR_EL1)"); + + return (data & SCTLR_EL1_EE_MASK) ? VIRTIO_ENDIAN_BE : VIRTIO_ENDIAN_LE; This rules out guests where userspace and kernelspace can run with different endianness. Whilst Linux doesn't currently do this, can we support it here? It all gets a bit hairy if the guest is using a stage-1 SMMU to let userspace play with a virtio device... Yeah, I suppose we could check either EE or E0 depending on the mode when the access was made. We already have all the information, just need to handle the case. I'll respin the series. How virtio implementations should determine their endianness is a spec question, I think; at any rate QEMU and kvmtool ought to agree on how it's done. I think the most recent suggestion on the QEMU mailing list (for PPC) is that we should care about the guest kernel endianness, but I don't know if anybody thought of the pass-through-to-userspace use case... Current opinion on the qemu-devel thread seems to be that we should just define that the endianness of the virtio device is the endianness of the guest kernel at the point where the guest triggers a reset of the virtio device by writing zero to the QueuePFN or Status registers. On AArch32, we only have the CPSR.E bit to select the endianness. Are we going to simply explode if the access comes from userspace? On AArch64, we can either select the kernel endianness, or userspace endianness. Are we going to go a different route just for the sake of enforcing kernel access?
I'm inclined to think of userspace access as a valid use case. M. All the fuzz is not really about enforcing kernel access... PPC also has a current endianness selector (MSR_LE) but it only makes sense if you are in the cpu context. Initial versions of the virtio bi-endian support for QEMU PPC64 used an arbitrary cpu: in this case, the only sensible thing to look at to support kernel-based virtio is the interrupt endianness selector (LPCR_ILE), because it gives a safe hint of the kernel endianness. The patch set has evolved and now uses current_cpu at device reset time. As a consequence, we are not necessarily tied to the kernel LPCR_ILE selector, I guess. Cheers. -- Gregory Kurz kurzg...@fr.ibm.com gk...@linux.vnet.ibm.com Software Engineer @ IBM/Meiosys http://www.ibm.com Tel +33 (0)562 165 496 Anarchy is about taking complete responsibility for yourself. Alan Moore.
Re: [PATCH v3 9/9] kvmtool: virtio: enable arm/arm64 support for bi-endianness
On Wed, May 07 2014 at 11:10:56 am BST, Peter Maydell peter.mayd...@linaro.org wrote: On 7 May 2014 10:52, Marc Zyngier marc.zyng...@arm.com wrote: On Wed, May 07 2014 at 10:34:30 am BST, Peter Maydell peter.mayd...@linaro.org wrote: Current opinion on the qemu-devel thread seems to be that we should just define that the endianness of the virtio device is the endianness of the guest kernel at the point where the guest triggers a reset of the virtio device by writing zero to the QueuePFN or Status registers. On AArch32, we only have the CPSR.E bit to select the endianness. Are we going to simply explode if the access comes from userspace? There's SCTLR.EE in AArch32, right? Indeed, good point. On AArch64, we can either select the kernel endianness, or userspace endianness. Are we going to go a different route just for the sake of enforcing kernel access? I'm inclined to think of userspace access as a valid use case. I don't actually care much about the details of what we decide; I just want us to be consistent between QEMU and kvmtool and (to the extent that architectural differences permit) consistent between PPC and ARM. At the moment we seem to be heading in gratuitously different directions. My point is: is there any good technical reason for deciding not to support guest user space access, other than religious matters about the latest incarnation of The Holy Virtio Spec? Thanks, M. -- Jazz is not dead. It just smells funny.
Re: [PATCH v3 9/9] kvmtool: virtio: enable arm/arm64 support for bi-endianness
On Wed, May 07 2014 at 11:40:54 am BST, Greg Kurz gk...@linux.vnet.ibm.com wrote: On Wed, 07 May 2014 10:52:01 +0100 Marc Zyngier marc.zyng...@arm.com wrote: On Wed, May 07 2014 at 10:34:30 am BST, Peter Maydell peter.mayd...@linaro.org wrote: On 6 May 2014 19:38, Peter Maydell peter.mayd...@linaro.org wrote: On 6 May 2014 18:25, Marc Zyngier marc.zyng...@arm.com wrote: On Tue, May 06 2014 at 3:28:07 pm BST, Will Deacon will.dea...@arm.com wrote: On Thu, Apr 24, 2014 at 07:17:23PM +0100, Marc Zyngier wrote: + reg.addr = (u64)&data; + if (ioctl(vcpu->vcpu_fd, KVM_GET_ONE_REG, &reg) < 0) + die("KVM_GET_ONE_REG failed (SCTLR_EL1)"); + + return (data & SCTLR_EL1_EE_MASK) ? VIRTIO_ENDIAN_BE : VIRTIO_ENDIAN_LE; This rules out guests where userspace and kernelspace can run with different endianness. Whilst Linux doesn't currently do this, can we support it here? It all gets a bit hairy if the guest is using a stage-1 SMMU to let userspace play with a virtio device... Yeah, I suppose we could check either EE or E0 depending on the mode when the access was made. We already have all the information, just need to handle the case. I'll respin the series. How virtio implementations should determine their endianness is a spec question, I think; at any rate QEMU and kvmtool ought to agree on how it's done. I think the most recent suggestion on the QEMU mailing list (for PPC) is that we should care about the guest kernel endianness, but I don't know if anybody thought of the pass-through-to-userspace use case... Current opinion on the qemu-devel thread seems to be that we should just define that the endianness of the virtio device is the endianness of the guest kernel at the point where the guest triggers a reset of the virtio device by writing zero to the QueuePFN or Status registers. On AArch32, we only have the CPSR.E bit to select the endianness. Are we going to simply explode if the access comes from userspace? On AArch64, we can either select the kernel endianness, or userspace endianness.
Are we going to go a different route just for the sake of enforcing kernel access? I'm inclined to think of userspace access as a valid use case. M. Hi Gregory, All the fuzz is not really about enforcing kernel access... PPC also has a current endianness selector (MSR_LE) but it only makes sense if you are in the cpu context. Initial versions of the virtio bi-endian support for QEMU PPC64 used an arbitrary cpu: in this case, the only sensible thing to look at to support kernel-based virtio is the interrupt endianness selector (LPCR_ILE), because it gives a safe hint of the kernel endianness. The patch set has evolved and now uses current_cpu at device reset time. As a consequence, we are not necessarily tied to the kernel LPCR_ILE selector, I guess. That makes a lot of sense, thanks for explaining that. You're basically doing the exact same thing we do with kvmtool on ARM. So if we have similar architectural features on both sides, why don't we support both kernel and userspace access? Cheers, M. -- Jazz is not dead. It just smells funny.
Re: KVM Nested L2 guest startup problems
On Wed, May 7, 2014 at 11:58 AM, Paolo Bonzini pbonz...@redhat.com wrote: Il 04/05/2014 18:33, Hu Yaohui ha scritto: I experienced a similar problem that was related to the nested code having some bugs related to apicv and other new vmx features. For example, the code enabled posted interrupts to run L2 even when the feature was not exposed to L1 and L1 didn't use it. Try changing prepare_vmcs02 to force disabling posted_interrupts; the code should look like: exec_control = vmcs12->pin_based_vm_exec_control; exec_control |= vmcs_config.pin_based_exec_ctrl; exec_control &= ~(PIN_BASED_VMX_PREEMPTION_TIMER | PIN_BASED_POSTED_INTR); vmcs_write32(PIN_BASED_VM_EXEC_CONTROL, exec_control); ... and also ... exec_control &= ~(SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES | SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY | SECONDARY_EXEC_APIC_REGISTER_VIRT | SECONDARY_EXEC_PAUSE_LOOP_EXITING); PLE should be left enabled, I think. Well... the PLE settings L0 uses to run L1 (vmcs01) may be different from the PLE settings L1 configured to run L2 (vmcs12). For example, L0 can use a ple_gap to run L1 that is bigger than the ple_gap L1 configured to run L2. Or L0 can use a ple_window to run L1 that is smaller than the ple_window L1 configured to run L2. So it seems PLE should never be exposed to L1, or appropriate nested handling needs to be implemented. Note the handling may become complex because in some cases a PLE exit from L2 should be handled directly by L0 and not passed to L1... remember nested preemption timer support :) ? Apart from that, I'll change the suggestion into a patch. Great! Thanks! Paolo
Re: [PATCH] KVM: PPC: BOOK3S: PR: Fix WARN_ON with debug options on
On 05/07/2014 07:56 AM, Paul Mackerras wrote: On Sun, May 04, 2014 at 10:56:08PM +0530, Aneesh Kumar K.V wrote: With the debug option for sleep-inside-atomic-section checking enabled, we get the below WARN_ON during a PR KVM boot. This is because upstream now has PREEMPT_COUNT enabled even if we have preempt disabled. Fix the warning by adding preempt_disable/enable around floating point and altivec enable. This worries me a bit. In this code: if (msr & MSR_FP) { + preempt_disable(); enable_kernel_fp(); load_fp_state(&vcpu->arch.fp); t->fp_save_area = &vcpu->arch.fp; + preempt_enable(); What would happen if we actually did get preempted at this point? Wouldn't we lose the FP state we just loaded? In other words, how come we're not already preempt-disabled at this point? This is probably because we're trying to confuse Linux :). The entry path happens with interrupts hard disabled, but preempt enabled so that Linux doesn't consider the guest time as non-preemptible. That's the only call I could find where preempt is logically enabled (though it really isn't). Alex
Re: KVM Nested L2 guest startup problems
Il 07/05/2014 13:16, Abel Gordon ha scritto: PLE should be left enabled, I think. Well... the PLE settings L0 uses to run L1 (vmcs01) may be different from the PLE settings L1 configured to run L2 (vmcs12). For example, L0 can use a ple_gap to run L1 that is bigger than the ple_gap L1 configured to run L2. Or L0 can use a ple_window to run L1 that is smaller than the ple_window L1 configured to run L2. That's correct. We should leave PLE enabled while running L2, but hide the feature altogether from L1. Paolo So it seems PLE should never be exposed to L1, or appropriate nested handling needs to be implemented. Note the handling may become complex because in some cases a PLE exit from L2 should be handled directly by L0 and not passed to L1... remember nested preemption timer support :) ?
Re: KVM Nested L2 guest startup problems
Il 07/05/2014 13:37, Paolo Bonzini ha scritto: Il 07/05/2014 13:16, Abel Gordon ha scritto: PLE should be left enabled, I think. Well... the PLE settings L0 uses to run L1 (vmcs01) may be different from the PLE settings L1 configured to run L2 (vmcs12). For example, L0 can use a ple_gap to run L1 that is bigger than the ple_gap L1 configured to run L2. Or L0 can use a ple_window to run L1 that is smaller than the ple_window L1 configured to run L2. That's correct. We should leave PLE enabled while running L2, but hide the feature altogether from L1. ... which we already do. The only secondary execution controls we allow are APIC page, unrestricted guest, WBINVD exits, and of course EPT. Paolo
Re: [PATCH v3 9/9] kvmtool: virtio: enable arm/arm64 support for bi-endianness
On 05/07/2014 12:46 PM, Marc Zyngier wrote: On Wed, May 07 2014 at 11:10:56 am BST, Peter Maydell peter.mayd...@linaro.org wrote: On 7 May 2014 10:52, Marc Zyngier marc.zyng...@arm.com wrote: On Wed, May 07 2014 at 10:34:30 am BST, Peter Maydell peter.mayd...@linaro.org wrote: Current opinion on the qemu-devel thread seems to be that we should just define that the endianness of the virtio device is the endianness of the guest kernel at the point where the guest triggers a reset of the virtio device by writing zero to the QueuePFN or Status registers. On AArch32, we only have the CPSR.E bit to select the endianness. Are we going to simply explode if the access comes from userspace? There's SCTLR.EE in AArch32, right? Indeed, good point. On AArch64, we can either select the kernel endianness, or userspace endianness. Are we going to go a different route just for the sake of enforcing kernel access? I'm inclined to think of userspace access as a valid use case. I don't actually care much about the details of what we decide; I just want us to be consistent between QEMU and kvmtool and (to the extent that architectural differences permit) consistent between PPC and ARM. At the moment we seem to be heading in gratuitously different directions. My point is: is there any good technical reason for deciding not to support guest user space access, other than religious matters about the latest incarnation of The Holy Virtio Spec? Yes, because it can't be isolated as per the current spec. User space has no business in physical addresses. And since so far I haven't heard of a single case where people on ARM are either a) nesting virtualization or b) running different endian user space, I don't think this point is valid. Virtio 1.0 is defined to be little endian only, so we don't need all that messy magic logic anymore. By the time people will do nesting or different endian user space we will most likely be in virtio 1.0 land. Shoehorning in anything in between is just a waste of time. If you'd like to see a constructed case where your logic falls apart, I can easily give you one too (because the whole thing is just insanely fragile). Imagine you have nesting. Your L1 guest passes its virtio device into the L2 guest with idmap. The L1 guest wants to trace MMIO accesses, so it traps on every access and delivers it on its own. L2 is LE, L1 is BE. Virtio gets initialized BE even though the guest that really wants to access it is LE. Alex
Re: [PATCH v3 9/9] kvmtool: virtio: enable arm/arm64 support for bi-endianness
On 07/05/14 12:49, Alexander Graf wrote: On 05/07/2014 12:46 PM, Marc Zyngier wrote: On Wed, May 07 2014 at 11:10:56 am BST, Peter Maydell peter.mayd...@linaro.org wrote: On 7 May 2014 10:52, Marc Zyngier marc.zyng...@arm.com wrote: On Wed, May 07 2014 at 10:34:30 am BST, Peter Maydell peter.mayd...@linaro.org wrote: Current opinion on the qemu-devel thread seems to be that we should just define that the endianness of the virtio device is the endianness of the guest kernel at the point where the guest triggers a reset of the virtio device by writing zero to the QueuePFN or Status registers. On AArch32, we only have the CPSR.E bit to select the endianness. Are we going to simply explode if the access comes from userspace? There's SCTLR.EE in AArch32, right? Indeed, good point. On AArch64, we can either select the kernel endianness, or userspace endianness. Are we going to go a different route just for the sake of enforcing kernel access? I'm inclined to think of userspace access as a valid use case. I don't actually care much about the details of what we decide; I just want us to be consistent between QEMU and kvmtool and (to the extent that architectural differences permit) consistent between PPC and ARM. At the moment we seem to be heading in gratuitously different directions. My point is: is there any good technical reason for deciding not to support guest user space access, other than religious matters about the latest incarnation of The Holy Virtio Spec? Yes, because it can't be isolated as per the current spec. User space has no business in physical addresses. And since so far I haven't heard of a single case where people on ARM are either a) nesting virtualization or b) running different endian user space, I don't think this point is valid. Virtio 1.0 is defined to be little endian only, so we don't need all that messy magic logic anymore. Alex, please read my lips: at the moment, I don't care about virtio-1.0. At all. Doesn't register. And hammering it on and on won't change a thing (yes, I've rewritten this sentence at least five times to remove all the fscking swear words). By the time people will do nesting or different endian user space we will most likely be in virtio 1.0 land. Shoehorning in anything in between is just a waste of time. If you don't want to support it on your pet platform/environment, fine. If you'd like to see a constructed case where your logic falls apart, I can easily give you one too (because the whole thing is just insanely fragile). Imagine you have nesting. Your L1 guest passes its virtio device into the L2 guest with idmap. The L1 guest wants to trace MMIO accesses, so it traps on every access and delivers it on its own. L2 is LE, L1 is BE. Virtio gets initialized BE even though the guest that really wants to access it is LE. Then it is a bug in your L1 that doesn't properly emulate accesses it traps. Not that I care, really. That being said, I'm going to stop replying to this thread, and instead go back to writing code, posting it, and getting on with my life in virtio-legacy land. Thanks, M. -- Jazz is not dead. It just smells funny...
Re: [PATCH v3 9/9] kvmtool: virtio: enable arm/arm64 support for bi-endianness
On 7 May 2014 12:04, Marc Zyngier marc.zyng...@arm.com wrote: On Wed, May 07 2014 at 11:40:54 am BST, Greg Kurz gk...@linux.vnet.ibm.com wrote: All the fuzz is not really about enforcing kernel access... PPC also has a current endianness selector (MSR_LE) but it only makes sense if you are in the cpu context. Initial versions of the virtio bi-endian support for QEMU PPC64 used an arbitrary cpu: in this case, the only sensible thing to look at to support kernel-based virtio is the interrupt endianness selector (LPCR_ILE), because it gives a safe hint of the kernel endianness. The patch set has evolved and now uses current_cpu at device reset time. As a consequence, we are not necessarily tied to the kernel LPCR_ILE selector, I guess. Ah yes, I'd forgotten the history behind why we ended up looking at interrupt endianness. That makes a lot of sense, thanks for explaining that. You're basically doing the exact same thing we do with kvmtool on ARM. So if we have similar architectural features on both sides, why don't we support both kernel and userspace access? I don't think that we really need to get into whether userspace access is or is not a good idea -- endianness of the CPU which does the virtio reset at the point when it does that reset is a nice simple rule that should generalise across architectures, so why make it more complicated than that? thanks -- PMM
Re: [PATCH v3 9/9] kvmtool: virtio: enable arm/arm64 support for bi-endianness
On 7 May 2014 13:16, Marc Zyngier marc.zyng...@arm.com wrote: That being said, I'm going to stop replying to this thread, and instead go back to writing code, posting it, and getting on with my life in virtio-legacy land. Some of us are trying to have a conversation in this thread about virtio-legacy behaviour :-) thanks -- PMM
Re: [PATCH v3 9/9] kvmtool: virtio: enable arm/arm64 support for bi-endianness
On 07/05/14 13:17, Peter Maydell wrote: On 7 May 2014 12:04, Marc Zyngier marc.zyng...@arm.com wrote: On Wed, May 07 2014 at 11:40:54 am BST, Greg Kurz gk...@linux.vnet.ibm.com wrote: All the fuzz is not really about enforcing kernel access... PPC also has a current endianness selector (MSR_LE) but it only makes sense if you are in the cpu context. Initial versions of the virtio bi-endian support for QEMU PPC64 used an arbitrary cpu: in this case, the only sensible thing to look at to support kernel-based virtio is the interrupt endianness selector (LPCR_ILE), because it gives a safe hint of the kernel endianness. The patch set has evolved and now uses current_cpu at device reset time. As a consequence, we are not necessarily tied to the kernel LPCR_ILE selector, I guess. Ah yes, I'd forgotten the history behind why we ended up looking at interrupt endianness. That makes a lot of sense, thanks for explaining that. You're basically doing the exact same thing we do with kvmtool on ARM. So if we have similar architectural features on both sides, why don't we support both kernel and userspace access? I don't think that we really need to get into whether userspace access is or is not a good idea -- endianness of the CPU which does the virtio reset at the point when it does that reset is a nice simple rule that should generalise across architectures, so why make it more complicated than that? This definition looks pretty good to me. Simple and to the point. Thanks, M. -- Jazz is not dead. It just smells funny...
Re: [PATCH v3 9/9] kvmtool: virtio: enable arm/arm64 support for bi-endianness
On Wed, 7 May 2014 13:17:51 +0100 Peter Maydell peter.mayd...@linaro.org wrote: On 7 May 2014 12:04, Marc Zyngier marc.zyng...@arm.com wrote: On Wed, May 07 2014 at 11:40:54 am BST, Greg Kurz gk...@linux.vnet.ibm.com wrote: All the fuzz is not really about enforcing kernel access... PPC also has a current endianness selector (MSR_LE) but it only makes sense if you are in the cpu context. Initial versions of the virtio bi-endian support for QEMU PPC64 used an arbitrary cpu: in this case, the only sensible thing to look at to support kernel-based virtio is the interrupt endianness selector (LPCR_ILE), because it gives a safe hint of the kernel endianness. The patch set has evolved and now uses current_cpu at device reset time. As a consequence, we are not necessarily tied to the kernel LPCR_ILE selector, I guess. Ah yes, I'd forgotten the history behind why we ended up looking at interrupt endianness. That makes a lot of sense, thanks for explaining that. You're basically doing the exact same thing we do with kvmtool on ARM. So if we have similar architectural features on both sides, why don't we support both kernel and userspace access? I don't think that we really need to get into whether userspace access is or is not a good idea -- endianness of the CPU which does the virtio reset at the point when it does that reset is a nice simple rule that should generalise across architectures, so why make it more complicated than that? thanks -- PMM I am convinced... and feeling a bit guilty for all the noise ;) I'll come back with a new virtio patch set for QEMU that does just what you say. -- Gregory Kurz kurzg...@fr.ibm.com gk...@linux.vnet.ibm.com Software Engineer @ IBM/Meiosys http://www.ibm.com Tel +33 (0)562 165 496 Anarchy is about taking complete responsibility for yourself. Alan Moore.
[PATCH 5/5] KVM: x86: Fix wrong masking on relative jump/call
Relative jumps and calls do the masking according to the operand size, and not according to the address size as the KVM emulator does today. In 64-bit mode, the resulting RIP is always 64-bit. Otherwise it is masked according to the instruction operand size. Note that when a 16-bit address size is used, bits 63:32 are unmodified. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/kvm/emulate.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 6833b41..e406705 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -506,7 +506,9 @@ static void rsp_increment(struct x86_emulate_ctxt *ctxt, int inc) static inline void jmp_rel(struct x86_emulate_ctxt *ctxt, int rel) { - register_address_increment(ctxt, &ctxt->_eip, rel); + /* 64-bit mode relative jumps are always 64-bit; otherwise mask */ + int op_bytes = ctxt->mode == X86EMUL_MODE_PROT64 ? 8 : ctxt->op_bytes; + masked_increment(&ctxt->_eip, op_bytes, rel); } static u32 desc_limit_scaled(struct desc_struct *desc) -- 1.9.1
[PATCH 1/5] KVM: x86: Emulator does not calculate address correctly
In long mode, when the address size is 4 bytes, the linear address is not truncated as the emulator mistakenly does. Instead, the offset within the segment (the ea field) should be truncated according to the address size. As the Intel SDM says: In 64-bit mode, the effective address components are added and the effective address is truncated ... before adding the full 64-bit segment base. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/kvm/emulate.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index e8a5840..743e8e3 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -631,7 +631,8 @@ static int __linearize(struct x86_emulate_ctxt *ctxt, u16 sel; unsigned cpl; - la = seg_base(ctxt, addr.seg) + addr.ea; + la = seg_base(ctxt, addr.seg) + + (ctxt->ad_bytes == 8 ? addr.ea : (u32)addr.ea); switch (ctxt->mode) { case X86EMUL_MODE_PROT64: if (((signed long)la << 16) >> 16 != la) @@ -678,7 +679,7 @@ static int __linearize(struct x86_emulate_ctxt *ctxt, } break; } - if (fetch ? ctxt->mode != X86EMUL_MODE_PROT64 : ctxt->ad_bytes != 8) + if (ctxt->mode != X86EMUL_MODE_PROT64) la &= (u32)-1; if (insn_aligned(ctxt, size) && ((la & (size - 1)) != 0)) return emulate_gp(ctxt, 0); -- 1.9.1
[PATCH 0/5] KVM: x86: Fix exit handler and emulation bugs
This series of patches fixes various scenarios in which KVM does not follow the x86 specification. Patches #4 and #5 are related; they reflect a new revision of the previously submitted patch that dealt with the wrong masking of registers in long-mode. Patch #3 is a follow-up to the previously submitted patch that fixed the wrong reserved page table masks. Patches #3 and #5 were not tested in a manner that actually checks the modified behavior. Not all the paths in patch #4 were tested. Thanks for reviewing the patches. Nadav Amit (5): KVM: x86: Emulator does not calculate address correctly KVM: vmx: handle_dr does not handle RSP correctly KVM: x86: Mark bit 7 in long-mode PDPTE according to 1GB pages support KVM: x86: Wrong register masking in 64-bit mode KVM: x86: Fix wrong masking on relative jump/call arch/x86/kvm/cpuid.h | 7 +++ arch/x86/kvm/emulate.c | 47 +-- arch/x86/kvm/mmu.c | 8 ++-- arch/x86/kvm/vmx.c | 2 +- 4 files changed, 43 insertions(+), 21 deletions(-) -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/5] KVM: x86: Wrong register masking in 64-bit mode
32-bit operations are zero extended in 64-bit mode. Currently, the code does not handle them correctly and keeps the high bits. In 16-bit mode, the high 32-bits are kept intact. In addition, although it is not well-documented, when address override prefix is used with REP-string instruction, RCX high half is zeroed even if ECX was zero on the first iteration (as if an assignment was performed to ECX). Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/kvm/emulate.c | 38 +++--- 1 file changed, 23 insertions(+), 15 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 743e8e3..6833b41 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -434,9 +434,21 @@ static int emulator_check_intercept(struct x86_emulate_ctxt *ctxt, return ctxt-ops-intercept(ctxt, info, stage); } -static void assign_masked(ulong *dest, ulong src, ulong mask) +static void assign_masked(ulong *dest, ulong src, int bytes) { - *dest = (*dest ~mask) | (src mask); + switch (bytes) { + case 2: + *dest = (u16)src | (*dest ~0xul); + break; + case 4: + *dest = (u32)src; + break; + case 8: + *dest = src; + break; + default: + BUG(); + } } static inline unsigned long ad_mask(struct x86_emulate_ctxt *ctxt) @@ -476,26 +488,20 @@ register_address(struct x86_emulate_ctxt *ctxt, unsigned long reg) return address_mask(ctxt, reg); } -static void masked_increment(ulong *reg, ulong mask, int inc) +static void masked_increment(ulong *reg, int size, int inc) { - assign_masked(reg, *reg + inc, mask); + assign_masked(reg, *reg + inc, size); } static inline void register_address_increment(struct x86_emulate_ctxt *ctxt, unsigned long *reg, int inc) { - ulong mask; - - if (ctxt-ad_bytes == sizeof(unsigned long)) - mask = ~0UL; - else - mask = ad_mask(ctxt); - masked_increment(reg, mask, inc); + masked_increment(reg, ctxt-ad_bytes, inc); } static void rsp_increment(struct x86_emulate_ctxt *ctxt, int inc) { - masked_increment(reg_rmw(ctxt, VCPU_REGS_RSP), stack_mask(ctxt), 
inc); + masked_increment(reg_rmw(ctxt, VCPU_REGS_RSP), stack_size(ctxt), inc); } static inline void jmp_rel(struct x86_emulate_ctxt *ctxt, int rel) @@ -1712,17 +1718,17 @@ static int em_enter(struct x86_emulate_ctxt *ctxt) if (rc != X86EMUL_CONTINUE) return rc; assign_masked(reg_rmw(ctxt, VCPU_REGS_RBP), reg_read(ctxt, VCPU_REGS_RSP), - stack_mask(ctxt)); + stack_size(ctxt)); assign_masked(reg_rmw(ctxt, VCPU_REGS_RSP), reg_read(ctxt, VCPU_REGS_RSP) - frame_size, - stack_mask(ctxt)); + stack_size(ctxt)); return X86EMUL_CONTINUE; } static int em_leave(struct x86_emulate_ctxt *ctxt) { assign_masked(reg_rmw(ctxt, VCPU_REGS_RSP), reg_read(ctxt, VCPU_REGS_RBP), - stack_mask(ctxt)); + stack_size(ctxt)); return emulate_pop(ctxt, reg_rmw(ctxt, VCPU_REGS_RBP), ctxt-op_bytes); } @@ -4570,6 +4576,8 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt) if (ctxt-rep_prefix (ctxt-d String)) { /* All REP prefixes have the same first termination condition */ if (address_mask(ctxt, reg_read(ctxt, VCPU_REGS_RCX)) == 0) { + assign_masked(reg_write(ctxt, VCPU_REGS_RCX), 0, + ctxt-ad_bytes); ctxt-eip = ctxt-_eip; goto done; } -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/5] KVM: vmx: handle_dr does not handle RSP correctly
The RSP register is not automatically cached, causing a mov DR instruction that uses RSP to fail. Instead, the regular register accessor interface should be used. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/kvm/vmx.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 72b8012..0e2793f 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -5142,7 +5142,7 @@ static int handle_dr(struct kvm_vcpu *vcpu) return 1; kvm_register_write(vcpu, reg, val); } else - if (kvm_set_dr(vcpu, dr, vcpu->arch.regs[reg])) + if (kvm_set_dr(vcpu, dr, kvm_register_read(vcpu, reg))) return 1; skip_emulated_instruction(vcpu); -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/5] KVM: x86: Mark bit 7 in long-mode PDPTE according to 1GB pages support
In long-mode, bit 7 in the PDPTE is reserved only if 1GB pages are not supported by the CPU. Currently the bit is considered by KVM as always reserved. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/kvm/cpuid.h | 7 +++ arch/x86/kvm/mmu.c | 8 ++-- 2 files changed, 13 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h index eeecbed..f908731 100644 --- a/arch/x86/kvm/cpuid.h +++ b/arch/x86/kvm/cpuid.h @@ -88,4 +88,11 @@ static inline bool guest_cpuid_has_x2apic(struct kvm_vcpu *vcpu) return best && (best->ecx & bit(X86_FEATURE_X2APIC)); } +static inline bool guest_cpuid_has_gbpages(struct kvm_vcpu *vcpu) +{ + struct kvm_cpuid_entry2 *best; + + best = kvm_find_cpuid_entry(vcpu, 0x80000001, 0); + return best && (best->edx & bit(X86_FEATURE_GBPAGES)); +} #endif diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 65f2400..9314678 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -22,6 +22,7 @@ #include "mmu.h" #include "x86.h" #include "kvm_cache_regs.h" +#include "cpuid.h" #include <linux/kvm_host.h> #include <linux/types.h> @@ -3516,11 +3517,14 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu, { int maxphyaddr = cpuid_maxphyaddr(vcpu); u64 exb_bit_rsvd = 0; + u64 gbpages_bit_rsvd = 0; context->bad_mt_xwr = 0; if (!context->nx) exb_bit_rsvd = rsvd_bits(63, 63); + if (!guest_cpuid_has_gbpages(vcpu)) + gbpages_bit_rsvd = rsvd_bits(7, 7); switch (context->root_level) { case PT32_ROOT_LEVEL: /* no rsvd bits for 2 level 4K page table entries */ @@ -3557,14 +3561,14 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu, context->rsvd_bits_mask[0][3] = exb_bit_rsvd | rsvd_bits(maxphyaddr, 51) | rsvd_bits(7, 7); context->rsvd_bits_mask[0][2] = exb_bit_rsvd | - rsvd_bits(maxphyaddr, 51) | rsvd_bits(7, 7); + gbpages_bit_rsvd | rsvd_bits(maxphyaddr, 51); context->rsvd_bits_mask[0][1] = exb_bit_rsvd | rsvd_bits(maxphyaddr, 51); context->rsvd_bits_mask[0][0] = exb_bit_rsvd | rsvd_bits(maxphyaddr, 51); context->rsvd_bits_mask[1][3] = context->rsvd_bits_mask[0][3]; context->rsvd_bits_mask[1][2] = exb_bit_rsvd | - rsvd_bits(maxphyaddr, 51) | + gbpages_bit_rsvd | rsvd_bits(maxphyaddr, 51) | rsvd_bits(13, 29); context->rsvd_bits_mask[1][1] = exb_bit_rsvd | rsvd_bits(maxphyaddr, 51) | -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [kvm-kmod PATCH] Provide pci_enable_msix_exact() for kernels 3.15
On Wed, May 07, 2014 at 08:29:19AM +0200, Jan Kiszka wrote: On 2014-05-06 20:35, gso...@gmail.com wrote: Signed-off-by: Gabriel Somlo so...@cmu.edu --- Jan, After today's pull from kvm, I also need this to build against my Fedora 20 kernel (3.13.10-200.fc20.x86_64). Which version did you pull? Neither next nor queue contain this change yet. I have 89ca3b881987f5a4be4c5dbaa7f0df12bbdde2fd (Linux 3.15-rc4) as the top commit. Right now I'm seeing different build problems: http://buildbot.kiszka.org/kvm-kmod/builders/4-latest-kvm/builds/1788/steps/3.11-x86-64/logs/stdio Didn't look into details yet. That's kernel 3.11, afaict, so entirely different problem :) x86/external-module-compat.h | 5 + 1 file changed, 5 insertions(+) diff --git a/x86/external-module-compat.h b/x86/external-module-compat.h index dde9463..e62486b 100644 --- a/x86/external-module-compat.h +++ b/x86/external-module-compat.h @@ -1480,6 +1480,11 @@ static inline int __register_hotcpu_notifier(struct notifier_block *nb) { return 0; } +static inline int pci_enable_msix_exact(struct pci_dev *dev, + struct msix_entry *entries, int nvec) +{ + return -ENOSYS; +} First, this is not x86-specific, thus should go to external-module-compat-comm.h. Second, pci_enable_msix_exact was introduced with 3.14. And then I think we can (and should) provide the original content of this wrapper once it is needed, not a stub. The patch I sent fixed building for 3.13.10-200.fc20.x86_64; as soon as I upgraded to 3.14.2-200.fc20.x86_64 (about half hour later), I could build for it just fine, no errors. 
So I don't really *need* this fix anymore, but given what you said and that it works fine on 3.14 without a patch (at least for me), the patch *should* then probably be: diff --git a/external-module-compat-comm.h b/external-module-compat-comm.h index 086127d..94b18cc 100644 --- a/external-module-compat-comm.h +++ b/external-module-compat-comm.h @@ -1401,3 +1401,11 @@ static inline void guest_exit(void) #if LINUX_VERSION_CODE < KERNEL_VERSION(3,13,0) static inline void smp_mb__after_srcu_read_unlock(void) {} #endif + +#if LINUX_VERSION_CODE < KERNEL_VERSION(3,14,0) +static inline int pci_enable_msix_exact(struct pci_dev *dev, + struct msix_entry *entries, int nvec) +{ + return -ENOSYS; +} +#endif -- except I can't test it anymore without rebooting to 3.13, which I can't do right away :) Regards, --G -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] KVM: x86: Emulator does not calculate address correctly
Il 07/05/2014 14:32, Nadav Amit ha scritto: In long-mode, when the address size is 4 bytes, the linear address is not truncated as the emulator mistakenly does. Instead, the offset within the segment (the ea field) should be truncated according to the address size. As Intel SDM says: In 64-bit mode, the effective address components are added and the effective address is truncated ... before adding the full 64-bit segment base. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/kvm/emulate.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index e8a5840..743e8e3 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -631,7 +631,8 @@ static int __linearize(struct x86_emulate_ctxt *ctxt, u16 sel; unsigned cpl; - la = seg_base(ctxt, addr.seg) + addr.ea; + la = seg_base(ctxt, addr.seg) + + (ctxt-ad_bytes == 8 ? addr.ea : (u32)addr.ea); I think you need fetch || ctxt-ad_bytes == 8 here. Paolo switch (ctxt-mode) { case X86EMUL_MODE_PROT64: if (((signed long)la 16) 16 != la) @@ -678,7 +679,7 @@ static int __linearize(struct x86_emulate_ctxt *ctxt, } break; } - if (fetch ? ctxt-mode != X86EMUL_MODE_PROT64 : ctxt-ad_bytes != 8) + if (ctxt-mode != X86EMUL_MODE_PROT64) la = (u32)-1; if (insn_aligned(ctxt, size) ((la (size - 1)) != 0)) return emulate_gp(ctxt, 0); -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] kvm/x86: implement hv EOI assist
It seems that it's easy to implement the EOI assist on top of the PV EOI feature: simply convert the page address to the format expected by PV EOI. Notes: -No EOI required is set only if interrupt injected is edge triggered; this is true because level interrupts are going through IOAPIC which disables PV EOI. In any case, if guest triggers EOI the bit will get cleared on exit. -For migration, set of HV_X64_MSR_APIC_ASSIST_PAGE sets KVM_PV_EOI_EN internally, so restoring HV_X64_MSR_APIC_ASSIST_PAGE seems sufficient In any case, bit is cleared on exit so worst case it's never re-enabled -no handling of PV EOI data is performed at HV_X64_MSR_EOI write; HV_X64_MSR_EOI is a separate optimization - it's an X2APIC replacement that lets you do EOI with an MSR and not IO. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- patch is unchanged from RFC except for documenting the design points in the commit log, do address Marcelo's comments. This passed basic testing. Migration was not tested. I think at this point it's a good idea to merge this so it can get tested and used by more people. diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 8ae1ff5..d84d750fc 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1890,6 +1890,8 @@ static int set_msr_hyperv(struct kvm_vcpu *vcpu, u32 msr, u64 data) if (!(data HV_X64_MSR_APIC_ASSIST_PAGE_ENABLE)) { vcpu-arch.hv_vapic = data; + if (kvm_lapic_enable_pv_eoi(vcpu, 0)) + return 1; break; } gfn = data HV_X64_MSR_APIC_ASSIST_PAGE_ADDRESS_SHIFT; @@ -1900,6 +1902,8 @@ static int set_msr_hyperv(struct kvm_vcpu *vcpu, u32 msr, u64 data) return 1; vcpu-arch.hv_vapic = data; mark_page_dirty(vcpu-kvm, gfn); + if (kvm_lapic_enable_pv_eoi(vcpu, gfn_to_gpa(gfn) | KVM_MSR_ENABLED)) + return 1; break; } case HV_X64_MSR_EOI: -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 5/5] KVM: x86: Fix wrong masking on relative jump/call
Nadav Amit na...@cs.technion.ac.il writes: Relative jumps and calls do the masking according to the operand size, and not according to the address size as the KVM emulator does today. In 64-bit mode, the resulting RIP is always 64-bit. Otherwise it is masked according to the instruction operand-size. Note that when 16-bit address size is used, bits 63:32 are unmodified. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/kvm/emulate.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 6833b41..e406705 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -506,7 +506,9 @@ static void rsp_increment(struct x86_emulate_ctxt *ctxt, int inc) static inline void jmp_rel(struct x86_emulate_ctxt *ctxt, int rel) { - register_address_increment(ctxt, ctxt-_eip, rel); + /* 64-bit mode relative jumps are always 64-bit; otherwise mask */ + int op_bytes = ctxt-mode == X86EMUL_MODE_PROT64 ? 8 : ctxt-op_bytes; Just a nit, probably break this up for readability ? + masked_increment(ctxt-eip, op_bytes, rel); } static u32 desc_limit_scaled(struct desc_struct *desc) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/5] KVM: x86: Wrong register masking in 64-bit mode
Nadav Amit na...@cs.technion.ac.il writes: 32-bit operations are zero extended in 64-bit mode. Currently, the code does not handle them correctly and keeps the high bits. In 16-bit mode, the high 32-bits are kept intact. In addition, although it is not well-documented, when address override prefix It would be helpful to have a pointer in the SDM especially for cases that are not well-documented. is used with REP-string instruction, RCX high half is zeroed even if ECX was zero on the first iteration (as if an assignment was performed to ECX). Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/kvm/emulate.c | 38 +++--- 1 file changed, 23 insertions(+), 15 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 743e8e3..6833b41 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -434,9 +434,21 @@ static int emulator_check_intercept(struct x86_emulate_ctxt *ctxt, return ctxt-ops-intercept(ctxt, info, stage); } -static void assign_masked(ulong *dest, ulong src, ulong mask) +static void assign_masked(ulong *dest, ulong src, int bytes) { - *dest = (*dest ~mask) | (src mask); + switch (bytes) { + case 2: + *dest = (u16)src | (*dest ~0xul); + break; + case 4: + *dest = (u32)src; + break; + case 8: + *dest = src; + break; + default: + BUG(); IIRC, Paolo mentioned that a WARN() is preferable. But I see a lot other places where BUG() is called, maybe, he can confirm. 
+ } } static inline unsigned long ad_mask(struct x86_emulate_ctxt *ctxt) @@ -476,26 +488,20 @@ register_address(struct x86_emulate_ctxt *ctxt, unsigned long reg) return address_mask(ctxt, reg); } -static void masked_increment(ulong *reg, ulong mask, int inc) +static void masked_increment(ulong *reg, int size, int inc) { - assign_masked(reg, *reg + inc, mask); + assign_masked(reg, *reg + inc, size); } static inline void register_address_increment(struct x86_emulate_ctxt *ctxt, unsigned long *reg, int inc) { - ulong mask; - - if (ctxt-ad_bytes == sizeof(unsigned long)) - mask = ~0UL; - else - mask = ad_mask(ctxt); - masked_increment(reg, mask, inc); + masked_increment(reg, ctxt-ad_bytes, inc); } static void rsp_increment(struct x86_emulate_ctxt *ctxt, int inc) { - masked_increment(reg_rmw(ctxt, VCPU_REGS_RSP), stack_mask(ctxt), inc); + masked_increment(reg_rmw(ctxt, VCPU_REGS_RSP), stack_size(ctxt), inc); } static inline void jmp_rel(struct x86_emulate_ctxt *ctxt, int rel) @@ -1712,17 +1718,17 @@ static int em_enter(struct x86_emulate_ctxt *ctxt) if (rc != X86EMUL_CONTINUE) return rc; assign_masked(reg_rmw(ctxt, VCPU_REGS_RBP), reg_read(ctxt, VCPU_REGS_RSP), - stack_mask(ctxt)); + stack_size(ctxt)); assign_masked(reg_rmw(ctxt, VCPU_REGS_RSP), reg_read(ctxt, VCPU_REGS_RSP) - frame_size, - stack_mask(ctxt)); + stack_size(ctxt)); return X86EMUL_CONTINUE; } static int em_leave(struct x86_emulate_ctxt *ctxt) { assign_masked(reg_rmw(ctxt, VCPU_REGS_RSP), reg_read(ctxt, VCPU_REGS_RBP), - stack_mask(ctxt)); + stack_size(ctxt)); return emulate_pop(ctxt, reg_rmw(ctxt, VCPU_REGS_RBP), ctxt-op_bytes); } @@ -4570,6 +4576,8 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt) if (ctxt-rep_prefix (ctxt-d String)) { /* All REP prefixes have the same first termination condition */ if (address_mask(ctxt, reg_read(ctxt, VCPU_REGS_RCX)) == 0) { + assign_masked(reg_write(ctxt, VCPU_REGS_RCX), 0, + ctxt-ad_bytes); ctxt-eip = ctxt-_eip; goto done; } -- To unsubscribe from this list: send 
the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] ARM: KVM: user_mem_abort: support stage 2 MMIO page mapping
On Wed, May 07, 2014 at 10:00:21AM +0100, Marc Zyngier wrote: Kim, Christoffer, On Tue, May 06 2014 at 7:04:48 pm BST, Christoffer Dall christoffer.d...@linaro.org wrote: On Tue, Mar 25, 2014 at 05:08:14PM -0500, Kim Phillips wrote: Use the correct memory type for device MMIO mappings: PAGE_S2_DEVICE. Signed-off-by: Kim Phillips kim.phill...@linaro.org --- arch/arm/kvm/mmu.c | 11 --- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c index 7789857..a354610 100644 --- a/arch/arm/kvm/mmu.c +++ b/arch/arm/kvm/mmu.c @@ -652,6 +652,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, struct kvm_mmu_memory_cache *memcache = vcpu-arch.mmu_page_cache; struct vm_area_struct *vma; pfn_t pfn; + pgprot_t mem_type = PAGE_S2; write_fault = kvm_is_write_fault(kvm_vcpu_get_hsr(vcpu)); if (fault_status == FSC_PERM !write_fault) { @@ -702,6 +703,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, if (is_error_pfn(pfn)) return -EFAULT; + if (kvm_is_mmio_pfn(pfn)) + mem_type = PAGE_S2_DEVICE; + spin_lock(kvm-mmu_lock); if (mmu_notifier_retry(kvm, mmu_seq)) goto out_unlock; @@ -709,7 +713,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, hugetlb = transparent_hugepage_adjust(pfn, fault_ipa); if (hugetlb) { - pmd_t new_pmd = pfn_pmd(pfn, PAGE_S2); + pmd_t new_pmd = pfn_pmd(pfn, mem_type); new_pmd = pmd_mkhuge(new_pmd); if (writable) { kvm_set_s2pmd_writable(new_pmd); @@ -718,13 +722,14 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, coherent_icache_guest_page(kvm, hva PMD_MASK, PMD_SIZE); ret = stage2_set_pmd_huge(kvm, memcache, fault_ipa, new_pmd); } else { - pte_t new_pte = pfn_pte(pfn, PAGE_S2); + pte_t new_pte = pfn_pte(pfn, mem_type); if (writable) { kvm_set_s2pte_writable(new_pte); kvm_set_pfn_dirty(pfn); } coherent_icache_guest_page(kvm, hva, PAGE_SIZE); - ret = stage2_set_pte(kvm, memcache, fault_ipa, new_pte, false); + ret = 
stage2_set_pte(kvm, memcache, fault_ipa, new_pte, + mem_type == PAGE_S2_DEVICE); } -- 1.9.1 I think this looks reasonable. Acked-by: Christoffer Dall christoffer.d...@linaro.org I feel like I'm missing some context here, and the commit message is way too terse for me to make sense of it. So far, we can only get into user_mem_abort on a Stage-2 fault (translation or permission) for memory. How can we suddenly get here for a *device* fault? Do we get a special kind of memslot? I'm not saying the patch does anything wrong, but I'd like to understand the rationale behind it. On its own, it doesn't make much sense. Think device passthrough. There's nothing preventing user space from setting up a memory region to point to device memory (through VFIO or /dev/mem). If that's done, we should enforce device memory properties so writes don't linger around in the cache to be written some time later when that device memory potentially doesn't belong to the VM anymore. This is just one tiny piece of all of them to make device passthrough work, and we could hold off with this patch until we have something more complete. On the other hand, we need to start somewhere, and this is hardly intrusive and is functionally correct even though you don't have a full device passthrough setup. -Christoffer -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v9 00/19] qspinlock: a 4-byte queue spinlock with PV support
On 04/27/2014 02:09 PM, Raghavendra K T wrote: For kvm part feel free to add: Tested-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com V9 testing has shown no hangs. I was able to do some performance testing; here are the results. Overall we are seeing good improvement for the pv-unfair version.

System: 32 cpu Sandy Bridge with HT on (4 node machine with 32 GB each). Guest: 8GB with 16 vcpu/VM. Average was taken over 8-10 data points.

Base = 3.15-rc2 with PARAVIRT_SPINLOCKS = y
A = 3.15-rc2 + qspinlock v9 patch with QUEUE_SPINLOCK = y, PARAVIRT_SPINLOCKS = y, PARAVIRT_UNFAIR_LOCKS = y (unfair lock)
B = 3.15-rc2 + qspinlock v9 patch with QUEUE_SPINLOCK = y, PARAVIRT_SPINLOCKS = n, PARAVIRT_UNFAIR_LOCKS = n (queue spinlock without paravirt)
C = 3.15-rc2 + qspinlock v9 patch with QUEUE_SPINLOCK = y, PARAVIRT_SPINLOCKS = y, PARAVIRT_UNFAIR_LOCKS = n (queue spinlock with paravirt)

Ebizzy % improvements
overcommit        A         B         C
0.5x         4.4265    2.0611    1.5824
1.0x         0.9015   -7.7828    4.5443
1.5x        46.1162   -2.9845   -3.5046
2.0x        99.8150   -2.7116    4.7461

Dbench % improvements
overcommit        A         B         C
0.5x         3.2617    3.5436    2.5676
1.0x         0.6302    2.2342    5.2201
1.5x         5.0027    4.8275    3.8375
2.0x        23.8242    4.5782   12.6067

Absolute values of base results (overcommit, value, stdev):
Ebizzy (records/sec with 120 sec run)
0.5x  20941.8750 (2%)
1.0x  17623.8750 (5%)
1.5x   5874.7778 (15%)
2.0x   3581.8750 (7%)
Dbench (throughput in MB/sec)
0.5x  10009.6610 (5%)
1.0x   6583.0538 (1%)
1.5x   3991.9622 (4%)
2.0x   2527.0613 (2.5%)

Thanks for the testing. I will include your Tested-by tag in the next version. -Longman -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v10 02/19] qspinlock, x86: Enable x86-64 to use queue spinlock
This patch makes the necessary changes at the x86 architecture specific layer to enable the use of queue spinlock for x86-64. As x86-32 machines are typically not multi-socket. The benefit of queue spinlock may not be apparent. So queue spinlock is not enabled. Currently, there is some incompatibilities between the para-virtualized spinlock code (which hard-codes the use of ticket spinlock) and the queue spinlock. Therefore, the use of queue spinlock is disabled when the para-virtualized spinlock is enabled. The arch/x86/include/asm/qspinlock.h header file includes some x86 specific optimization which will make the queue spinlock code perform better than the generic implementation. Signed-off-by: Waiman Long waiman.l...@hp.com Signed-off-by: Peter Zijlstra pet...@infradead.org --- arch/x86/Kconfig |1 + arch/x86/include/asm/qspinlock.h | 29 + arch/x86/include/asm/spinlock.h |5 + arch/x86/include/asm/spinlock_types.h |4 4 files changed, 39 insertions(+), 0 deletions(-) create mode 100644 arch/x86/include/asm/qspinlock.h diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 25d2c6f..95c9c4e 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -29,6 +29,7 @@ config X86 select ARCH_SUPPORTS_NUMA_BALANCING select ARCH_SUPPORTS_INT128 if X86_64 select ARCH_WANTS_PROT_NUMA_PROT_NONE + select ARCH_USE_QUEUE_SPINLOCK select HAVE_IDE select HAVE_OPROFILE select HAVE_PCSPKR_PLATFORM diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h new file mode 100644 index 000..e4a4f5d --- /dev/null +++ b/arch/x86/include/asm/qspinlock.h @@ -0,0 +1,29 @@ +#ifndef _ASM_X86_QSPINLOCK_H +#define _ASM_X86_QSPINLOCK_H + +#include asm-generic/qspinlock_types.h + +#if !defined(CONFIG_X86_OOSTORE) !defined(CONFIG_X86_PPRO_FENCE) + +#definequeue_spin_unlock queue_spin_unlock +/** + * queue_spin_unlock - release a queue spinlock + * @lock : Pointer to queue spinlock structure + * + * No special memory barrier other than a compiler one is needed for the + * x86 
architecture. A compiler barrier is added at the end to make sure + * that the clearing the lock bit is done ASAP without artificial delay + * due to compiler optimization. + */ +static inline void queue_spin_unlock(struct qspinlock *lock) +{ + barrier(); + ACCESS_ONCE(*(u8 *)lock) = 0; + barrier(); +} + +#endif /* !CONFIG_X86_OOSTORE !CONFIG_X86_PPRO_FENCE */ + +#include asm-generic/qspinlock.h + +#endif /* _ASM_X86_QSPINLOCK_H */ diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h index 0f62f54..958d20f 100644 --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -42,6 +42,10 @@ extern struct static_key paravirt_ticketlocks_enabled; static __always_inline bool static_key_false(struct static_key *key); +#ifdef CONFIG_QUEUE_SPINLOCK +#include asm/qspinlock.h +#else + #ifdef CONFIG_PARAVIRT_SPINLOCKS static inline void __ticket_enter_slowpath(arch_spinlock_t *lock) @@ -180,6 +184,7 @@ static __always_inline void arch_spin_lock_flags(arch_spinlock_t *lock, { arch_spin_lock(lock); } +#endif /* CONFIG_QUEUE_SPINLOCK */ static inline void arch_spin_unlock_wait(arch_spinlock_t *lock) { diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h index 4f1bea1..7960268 100644 --- a/arch/x86/include/asm/spinlock_types.h +++ b/arch/x86/include/asm/spinlock_types.h @@ -23,6 +23,9 @@ typedef u32 __ticketpair_t; #define TICKET_SHIFT (sizeof(__ticket_t) * 8) +#ifdef CONFIG_QUEUE_SPINLOCK +#include asm-generic/qspinlock_types.h +#else typedef struct arch_spinlock { union { __ticketpair_t head_tail; @@ -33,6 +36,7 @@ typedef struct arch_spinlock { } arch_spinlock_t; #define __ARCH_SPIN_LOCK_UNLOCKED { { 0 } } +#endif /* CONFIG_QUEUE_SPINLOCK */ #include asm/rwlock.h -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v10 19/19] pvqspinlock, x86: Enable PV qspinlock for XEN
This patch adds the necessary XEN specific code to allow XEN to support the CPU halting and kicking operations needed by the queue spinlock PV code. Signed-off-by: Waiman Long waiman.l...@hp.com --- arch/x86/xen/spinlock.c | 147 +-- kernel/Kconfig.locks|2 +- 2 files changed, 143 insertions(+), 6 deletions(-) diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c index d1b6a32..2a259bb 100644 --- a/arch/x86/xen/spinlock.c +++ b/arch/x86/xen/spinlock.c @@ -17,6 +17,12 @@ #include xen-ops.h #include debugfs.h +static DEFINE_PER_CPU(int, lock_kicker_irq) = -1; +static DEFINE_PER_CPU(char *, irq_name); +static bool xen_pvspin = true; + +#ifndef CONFIG_QUEUE_SPINLOCK + enum xen_contention_stat { TAKEN_SLOW, TAKEN_SLOW_PICKUP, @@ -100,12 +106,9 @@ struct xen_lock_waiting { __ticket_t want; }; -static DEFINE_PER_CPU(int, lock_kicker_irq) = -1; -static DEFINE_PER_CPU(char *, irq_name); static DEFINE_PER_CPU(struct xen_lock_waiting, lock_waiting); static cpumask_t waiting_cpus; -static bool xen_pvspin = true; __visible void xen_lock_spinning(struct arch_spinlock *lock, __ticket_t want) { int irq = __this_cpu_read(lock_kicker_irq); @@ -213,6 +216,118 @@ static void xen_unlock_kick(struct arch_spinlock *lock, __ticket_t next) } } +#else /* CONFIG_QUEUE_SPINLOCK */ + +#ifdef CONFIG_XEN_DEBUG_FS +static u32 kick_nohlt_stats; /* Kick but not halt count */ +static u32 halt_qhead_stats; /* Queue head halting count */ +static u32 halt_qnode_stats; /* Queue node halting count */ +static u32 halt_abort_stats; /* Halting abort count */ +static u32 wake_kick_stats;/* Wakeup by kicking count */ +static u32 wake_spur_stats;/* Spurious wakeup count*/ +static u64 time_blocked; /* Total blocking time */ + +static inline void xen_halt_stats(enum pv_lock_stats type) +{ + if (type == PV_HALT_QHEAD) + add_smp(halt_qhead_stats, 1); + else if (type == PV_HALT_QNODE) + add_smp(halt_qnode_stats, 1); + else /* type == PV_HALT_ABORT */ + add_smp(halt_abort_stats, 1); +} + +static inline void 
xen_lock_stats(enum pv_lock_stats type) +{ + if (type == PV_WAKE_KICKED) + add_smp(wake_kick_stats, 1); + else if (type == PV_WAKE_SPURIOUS) + add_smp(wake_spur_stats, 1); + else /* type == PV_KICK_NOHALT */ + add_smp(kick_nohlt_stats, 1); +} + +static inline u64 spin_time_start(void) +{ + return sched_clock(); +} + +static inline void spin_time_accum_blocked(u64 start) +{ + u64 delta; + + delta = sched_clock() - start; + add_smp(time_blocked, delta); +} +#else /* CONFIG_XEN_DEBUG_FS */ +static inline void xen_halt_stats(enum pv_lock_stats type) +{ +} + +static inline void xen_lock_stats(enum pv_lock_stats type) +{ +} + +static inline u64 spin_time_start(void) +{ + return 0; +} + +static inline void spin_time_accum_blocked(u64 start) +{ +} +#endif /* CONFIG_XEN_DEBUG_FS */ + +static void xen_kick_cpu(int cpu) +{ + xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR); +} + +/* + * Halt the current CPU release it back to the host + */ +static void xen_halt_cpu(enum pv_lock_stats type, s8 *state, s8 sval) +{ + int irq = __this_cpu_read(lock_kicker_irq); + unsigned long flags; + u64 start; + + /* If kicker interrupts not initialized yet, just spin */ + if (irq == -1) + return; + + /* +* Make sure an interrupt handler can't upset things in a +* partially setup state. +*/ + local_irq_save(flags); + start = spin_time_start(); + + xen_halt_stats(type); + /* clear pending */ + xen_clear_irq_pending(irq); + + /* Allow interrupts while blocked */ + local_irq_restore(flags); + /* +* Don't halt if the CPU state has been changed. +*/ + if (ACCESS_ONCE(*state) != sval) { + xen_halt_stats(PV_HALT_ABORT); + return; + } + /* +* If an interrupt happens here, it will leave the wakeup irq +* pending, which will cause xen_poll_irq() to return +* immediately. 
+*/ + + /* Block until irq becomes pending (or perhaps a spurious wakeup) */ + xen_poll_irq(irq); + spin_time_accum_blocked(start); +} +#endif /* CONFIG_QUEUE_SPINLOCK */ + static irqreturn_t dummy_handler(int irq, void *dev_id) { BUG(); @@ -258,7 +373,6 @@ void xen_uninit_lock_cpu(int cpu) per_cpu(irq_name, cpu) = NULL; } - /* * Our init of PV spinlocks is split in two init functions due to us * using paravirt patching and jump labels patching and having to do @@ -275,8 +389,15 @@ void __init xen_init_spinlocks(void) return; } printk(KERN_DEBUG
[PATCH v10 18/19] pvqspinlock, x86: Enable PV qspinlock for KVM
This patch adds the necessary KVM-specific code to allow KVM to support the CPU halting and kicking operations needed by the queue spinlock PV code. Two KVM guests of 20 CPU cores (2 nodes) were created for performance testing in one of the following three configurations:
1) Only 1 VM is active
2) Both VMs are active and they share the same 20 physical CPUs (200% overcommit)
3) Both VMs are active and they share 30 physical CPUs (10 dedicated and 10 shared - 133% overcommit)
The tests run included the disk workload of the AIM7 benchmark on both ext4 and xfs RAM disks at 3000 users on a 3.15-rc1 based kernel. The ebizzy -m test was also run and its performance data were recorded. With two VMs running, the idle=poll kernel option was added to simulate a busy guest. The entry "unfair + PV qspinlock" below means that both the unfair lock and PV spinlock configuration options were turned on.

AIM7 XFS Disk Test (no overcommit)
  kernel                   JPM      Real Time   Sys Time   Usr Time
  ------                   ---      ---------   --------   --------
  PV ticketlock            2489626     7.23      101.08      5.30
  qspinlock                2531646     7.11      100.75      5.43
  PV qspinlock             2507.20                101.94      5.40
  unfair qspinlock         2549575     7.06       99.81      5.35
  unfair + PV qspinlock    2486188     7.24      101.55      5.51

AIM7 XFS Disk Test (133% overcommit)
  kernel                   JPM      Real Time   Sys Time   Usr Time
  ------                   ---      ---------   --------   --------
  PV ticketlock            1114551    16.15      220.17     10.75
  qspinlock                1159047    15.53      216.60     10.24
  PV qspinlock             1170351    15.38      216.16     11.03
  unfair qspinlock         1188119    15.15      209.37     10.82
  unfair + PV qspinlock    1178782    15.27      211.37     11.25

AIM7 XFS Disk Test (200% overcommit)
  kernel                   JPM      Real Time   Sys Time   Usr Time
  ------                   ---      ---------   --------   --------
  PV ticketlock             587467    30.64      444.95     11.92
  qspinlock                 593276    30.34      439.39     14.59
  PV qspinlock              601403    29.93      426.04     14.49
  unfair qspinlock          654070    27.52      400.82     10.86
  unfair + PV qspinlock     614334    29.30      393.38     28.56

AIM7 EXT4 Disk Test (no overcommit)
  kernel                   JPM      Real Time   Sys Time   Usr Time
  ------                   ---      ---------   --------   --------
  PV ticketlock            2002225     9.07      105.62      5.43
  qspinlock                2006689     8.97      105.65      5.26
  PV qspinlock             2002225     8.99      103.19      5.19
  unfair qspinlock         1988950     9.05      103.81      5.03
  unfair + PV qspinlock    1993355     9.03      107.99      5.68

AIM7 EXT4 Disk Test (133% overcommit)
  kernel                   JPM      Real Time   Sys Time   Usr Time
  ------                   ---      ---------   --------   --------
  PV ticketlock             987383    18.23      221.63      8.89
  qspinlock                1050788    17.13      206.87      8.35
  PV qspinlock             1058823    17.00      205.22      9.18
  unfair qspinlock         1161290    15.50      184.22      8.84
  unfair + PV qspinlock    1122894    16.03      195.86      9.34

AIM7 EXT4 Disk Test (200% overcommit)
  kernel                   JPM      Real Time   Sys Time   Usr Time
  ------                   ---      ---------   --------   --------
  PV ticketlock             420757    42.78      565.96      5.84
  qspinlock                 427452    42.11      543.08     11.12
  PV qspinlock              420659    42.79      548.30     10.56
  unfair qspinlock          504909    35.65      466.71      5.38
  unfair + PV qspinlock     500974    35.93      469.02      6.77

EBIZZY-M Test (no overcommit)
  kernel                   Rec/s    Real Time   Sys Time   Usr Time
  ------                   -----    ---------   --------   --------
  PV ticketlock            1230       10.00       88.34      1.42
  qspinlock                1212       10.00       68.25      1.47
  PV qspinlock             1265       10.00       91.50      1.41
  unfair qspinlock         1304       10.00       77.94      1.49
  unfair + PV qspinlock    1445       10.00       75.45      1.68

EBIZZY-M Test (133% overcommit)
  kernel                   Rec/s    Real Time   Sys Time   Usr Time
  ------                   -----    ---------   --------   --------
  PV ticketlock             467       10.00       88.16      0.73
  qspinlock                 463       10.00       89.44      0.78
  PV qspinlock              441       10.00       95.10      0.74
  unfair qspinlock         1233       10.00       35.76      1.76
  unfair + PV qspinlock    1555       10.00       32.12      1.96

EBIZZY-M Test (200% overcommit)
  kernel                   Rec/s    Real Time   Sys Time
[PATCH v10 14/19] pvqspinlock, x86: Rename paravirt_ticketlocks_enabled
This patch renames the paravirt_ticketlocks_enabled static key to a more generic paravirt_spinlocks_enabled name. Signed-off-by: Waiman Long waiman.l...@hp.com --- arch/x86/include/asm/spinlock.h |4 ++-- arch/x86/kernel/kvm.c|2 +- arch/x86/kernel/paravirt-spinlocks.c |4 ++-- arch/x86/xen/spinlock.c |2 +- 4 files changed, 6 insertions(+), 6 deletions(-) diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h index 958d20f..428d0d1 100644 --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -39,7 +39,7 @@ /* How long a lock should spin before we consider blocking */ #define SPIN_THRESHOLD (1 15) -extern struct static_key paravirt_ticketlocks_enabled; +extern struct static_key paravirt_spinlocks_enabled; static __always_inline bool static_key_false(struct static_key *key); #ifdef CONFIG_QUEUE_SPINLOCK @@ -150,7 +150,7 @@ static inline void __ticket_unlock_slowpath(arch_spinlock_t *lock, static __always_inline void arch_spin_unlock(arch_spinlock_t *lock) { if (TICKET_SLOWPATH_FLAG - static_key_false(paravirt_ticketlocks_enabled)) { + static_key_false(paravirt_spinlocks_enabled)) { arch_spinlock_t prev; prev = *lock; diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 0331cb3..7ab8ab3 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -817,7 +817,7 @@ static __init int kvm_spinlock_init_jump(void) if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT)) return 0; - static_key_slow_inc(paravirt_ticketlocks_enabled); + static_key_slow_inc(paravirt_spinlocks_enabled); printk(KERN_INFO KVM setup paravirtual spinlock\n); return 0; diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c index 7dfd02d..6d36731 100644 --- a/arch/x86/kernel/paravirt-spinlocks.c +++ b/arch/x86/kernel/paravirt-spinlocks.c @@ -17,8 +17,8 @@ struct pv_lock_ops pv_lock_ops = { }; EXPORT_SYMBOL(pv_lock_ops); -struct static_key paravirt_ticketlocks_enabled = STATIC_KEY_INIT_FALSE; 
-EXPORT_SYMBOL(paravirt_ticketlocks_enabled); +struct static_key paravirt_spinlocks_enabled = STATIC_KEY_INIT_FALSE; +EXPORT_SYMBOL(paravirt_spinlocks_enabled); #endif #ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c index 0ba5f3b..d1b6a32 100644 --- a/arch/x86/xen/spinlock.c +++ b/arch/x86/xen/spinlock.c @@ -293,7 +293,7 @@ static __init int xen_init_spinlocks_jump(void) if (!xen_domain()) return 0; - static_key_slow_inc(paravirt_ticketlocks_enabled); + static_key_slow_inc(paravirt_spinlocks_enabled); return 0; } early_initcall(xen_init_spinlocks_jump); -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v10 15/19] pvqspinlock, x86: Add PV data structure methods
This patch modifies the para-virtualization (PV) infrastructure code of the x86-64 architecture to support the PV queue spinlock. Three new virtual methods are added to support PV qspinlock: 1) kick_cpu - schedule in a virtual CPU 2) halt_cpu - schedule out a virtual CPU 3) lockstat - update statistical data for debugfs Signed-off-by: Waiman Long waiman.l...@hp.com --- arch/x86/include/asm/paravirt.h | 18 +- arch/x86/include/asm/paravirt_types.h | 17 + arch/x86/kernel/paravirt-spinlocks.c |6 ++ 3 files changed, 40 insertions(+), 1 deletions(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index cd6e161..d71e123 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -711,7 +711,23 @@ static inline void __set_fixmap(unsigned /* enum fixed_addresses */ idx, } #if defined(CONFIG_SMP) defined(CONFIG_PARAVIRT_SPINLOCKS) +#ifdef CONFIG_QUEUE_SPINLOCK +static __always_inline void __queue_kick_cpu(int cpu) +{ + PVOP_VCALL1(pv_lock_ops.kick_cpu, cpu); +} + +static __always_inline void +__queue_halt_cpu(enum pv_lock_stats type, s8 *state, s8 sval) +{ + PVOP_VCALL3(pv_lock_ops.halt_cpu, type, state, sval); +} +static __always_inline void __queue_lockstat(enum pv_lock_stats type) +{ + PVOP_VCALL1(pv_lock_ops.lockstat, type); +} +#else static __always_inline void __ticket_lock_spinning(struct arch_spinlock *lock, __ticket_t ticket) { @@ -723,7 +739,7 @@ static __always_inline void __ticket_unlock_kick(struct arch_spinlock *lock, { PVOP_VCALL2(pv_lock_ops.unlock_kick, lock, ticket); } - +#endif #endif #ifdef CONFIG_X86_32 diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index 7549b8b..549b3a0 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -333,9 +333,26 @@ struct arch_spinlock; typedef u16 __ticket_t; #endif +#ifdef CONFIG_QUEUE_SPINLOCK +enum pv_lock_stats { + PV_HALT_QHEAD, /* Queue head halting */ + PV_HALT_QNODE, /* Other 
queue node halting */ + PV_HALT_ABORT, /* Halting aborted */ + PV_WAKE_KICKED, /* Wakeup by kicking*/ + PV_WAKE_SPURIOUS, /* Spurious wakeup */ + PV_KICK_NOHALT /* Kick but CPU not halted */ +}; +#endif + struct pv_lock_ops { +#ifdef CONFIG_QUEUE_SPINLOCK + void (*kick_cpu)(int cpu); + void (*halt_cpu)(enum pv_lock_stats type, s8 *state, s8 sval); + void (*lockstat)(enum pv_lock_stats type); +#else struct paravirt_callee_save lock_spinning; void (*unlock_kick)(struct arch_spinlock *lock, __ticket_t ticket); +#endif }; /* This contains all the paravirt structures: we get a convenient diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c index 6d36731..8d15bea 100644 --- a/arch/x86/kernel/paravirt-spinlocks.c +++ b/arch/x86/kernel/paravirt-spinlocks.c @@ -11,9 +11,15 @@ #ifdef CONFIG_PARAVIRT_SPINLOCKS struct pv_lock_ops pv_lock_ops = { #ifdef CONFIG_SMP +#ifdef CONFIG_QUEUE_SPINLOCK + .kick_cpu = paravirt_nop, + .halt_cpu = paravirt_nop, + .lockstat = paravirt_nop, +#else .lock_spinning = __PV_IS_CALLEE_SAVE(paravirt_nop), .unlock_kick = paravirt_nop, #endif +#endif }; EXPORT_SYMBOL(pv_lock_ops); -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v10 16/19] pvqspinlock: Enable coexistence with the unfair lock
This patch enables the coexistence of both the PV qspinlock and unfair lock. When both are enabled, however, only the lock fastpath will perform lock stealing whereas the slowpath will have that disabled to get the best of both features. We also need to transition a CPU spinning too long in the pending bit code path back to the regular queuing code path so that it can be properly halted by the PV qspinlock code. Signed-off-by: Waiman Long waiman.l...@hp.com --- kernel/locking/qspinlock.c | 74 ++-- 1 files changed, 64 insertions(+), 10 deletions(-) diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index 0c86a6f..fb05e64 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -72,6 +72,30 @@ struct qnode { #define qhead mcs.locked /* The queue head flag */ /* + * Allow spinning loop count only if either PV spinlock or unfair lock is + * configured. + */ +#if defined(CONFIG_PARAVIRT_UNFAIR_LOCKS) || defined(CONFIG_PARAVIRT_SPINLOCKS) +#defineDEF_LOOP_CNT(c) int c = 0 +#defineINC_LOOP_CNT(c) (c)++ +#defineLOOP_CNT(c) c +#else +#defineDEF_LOOP_CNT(c) +#defineINC_LOOP_CNT(c) +#defineLOOP_CNT(c) 0 +#endif + +/* + * Check the pending bit spinning threshold only if PV qspinlock is enabled + */ +#define PSPIN_THRESHOLD(1 10) +#ifdef CONFIG_PARAVIRT_SPINLOCKS +#define pv_qspinlock_enabled() static_key_false(paravirt_spinlocks_enabled) +#else +#define pv_qspinlock_enabled() false +#endif + +/* * Per-CPU queue node structures; we can never have more than 4 nested * contexts: task, softirq, hardirq, nmi. * @@ -302,9 +326,6 @@ cmpxchg_tail(struct qspinlock *lock, u32 old, u32 new) * starvation. 
*/ #ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS -#define DEF_LOOP_CNT(c)int c = 0 -#define INC_LOOP_CNT(c)(c)++ -#define LOOP_CNT(c)c #define LSTEAL_MIN (1 3) #define LSTEAL_MAX (1 10) #define LSTEAL_MIN_MASK(LSTEAL_MIN - 1) @@ -330,7 +351,11 @@ static inline void unfair_init_vars(struct qnode *node) static inline void unfair_set_vars(struct qnode *node, struct qnode *prev, u32 prev_tail) { - if (!static_key_false(paravirt_unfairlocks_enabled)) + /* +* Disable waiter lock stealing if PV spinlock is enabled +*/ + if (pv_qspinlock_enabled() || + !static_key_false(paravirt_unfairlocks_enabled)) return; node-qprev = prev; @@ -356,7 +381,11 @@ unfair_set_vars(struct qnode *node, struct qnode *prev, u32 prev_tail) */ static inline int unfair_check_and_clear_tail(struct qspinlock *lock, u32 tail) { - if (!static_key_false(paravirt_unfairlocks_enabled)) + /* +* Disable waiter lock stealing if PV spinlock is enabled +*/ + if (pv_qspinlock_enabled() || + !static_key_false(paravirt_unfairlocks_enabled)) return false; /* @@ -385,7 +414,11 @@ unfair_get_lock(struct qspinlock *lock, struct qnode *node, u32 tail, int count) int isqhead; struct qnode *next; - if (!static_key_false(paravirt_unfairlocks_enabled) || + /* +* Disable waiter lock stealing if PV spinlock is enabled +*/ + if (pv_qspinlock_enabled() || + !static_key_false(paravirt_unfairlocks_enabled) || ((count node-lsteal_mask) != node-lsteal_mask)) return false; @@ -463,9 +496,6 @@ unfair_get_lock(struct qspinlock *lock, struct qnode *node, u32 tail, int count) } #else /* CONFIG_PARAVIRT_UNFAIR_LOCKS */ -#defineDEF_LOOP_CNT(c) -#defineINC_LOOP_CNT(c) -#defineLOOP_CNT(c) 0 static void unfair_init_vars(struct qnode *node) {} static void unfair_set_vars(struct qnode *node, struct qnode *prev, @@ -582,9 +612,28 @@ static inline int trylock_pending(struct qspinlock *lock, u32 *pval) * store-release that clears the locked bit and create lock * sequentiality; this because not all clear_pending_set_locked() * implementations imply full 
barriers. +* +* When PV qspinlock is enabled, exit the pending bit code path and +* go back to the regular queuing path if the lock isn't available +* within a certain threshold. */ - while ((val = smp_load_acquire(lock-val.counter)) _Q_LOCKED_MASK) + if (pv_qspinlock_enabled()) + retry = PSPIN_THRESHOLD; + while ((val = smp_load_acquire(lock-val.counter)) _Q_LOCKED_MASK) { + if (pv_qspinlock_enabled() (--retry == 0)) { + /* +* Clear the pending bit and exit +*/ + for (;;) { + new = val ~_Q_PENDING_MASK; + old =
[PATCH v10 10/19] qspinlock, x86: Allow unfair spinlock in a virtual guest
Locking is always an issue in a virtualized environment because of 2 different types of problems:
1) Lock holder preemption
2) Lock waiter preemption
One solution to the lock waiter preemption problem is to allow unfair locks in a virtualized environment. In this case, a new lock acquirer can come and steal the lock if the next-in-line CPU to get the lock is scheduled out. A simple unfair lock is the test-and-set byte lock where a lock acquirer constantly spins on the lock word and attempts to grab it when the lock is freed. This simple unfair lock has 2 main problems:
1) The constant spinning on the lock word puts a lot of cacheline contention traffic on the affected cacheline, thus slowing tasks that need to access the cacheline.
2) Lock starvation is a real possibility, especially if the number of virtual CPUs is large.
A simple unfair queue spinlock can be implemented by allowing lock stealing in the fast path. The slowpath will still be the same as before and all the pending lock acquirers will have to wait in the queue in FIFO order. This cannot completely solve the lock waiter preemption problem, but it does help to alleviate the impact of this problem. To illustrate the performance impact of the various approaches, the disk workload of the AIM7 benchmark and the ebizzy test were run on a 4-socket 40-core Westmere-EX system (bare metal, HT off, ramdisk) on a 3.14 based kernel. The table below shows the performance of the different kernel flavors.
AIM7 XFS Disk Test
  kernel                    JPM      Real Time   Sys Time   Usr Time
  ------                    ---      ---------   --------   --------
  ticketlock                5678233     3.17       96.61      5.81
  qspinlock                 5750799     3.13       94.83      5.97
  simple test-and-set       5625000     3.20       98.29      5.93
  simple unfair qspinlock   5750799     3.13       95.91      5.98

AIM7 EXT4 Disk Test
  kernel                    JPM      Real Time   Sys Time   Usr Time
  ------                    ---      ---------   --------   --------
  ticketlock                1114551    16.15      509.72      7.11
  qspinlock                 2184466     8.24      232.99      6.01
  simple test-and-set        593081    30.35      967.55      9.00
  simple unfair qspinlock   2292994     7.85      222.84      5.89

Ebizzy -m test
  kernel                    records/s  Real Time   Sys Time   Usr Time
  ------                    ---------  ---------   --------   --------
  ticketlock                  2075       10.00      216.35      3.49
  qspinlock                   3023       10.00      198.20      4.80
  simple test-and-set         1667       10.00      198.93      2.89
  simple unfair qspinlock     2915       10.00      165.68      4.31

The disk-xfs workload spent only about 2.88% of CPU time in _raw_spin_lock() whereas the disk-ext4 workload spent 57.8% of CPU time in _raw_spin_lock(). It can be seen that there wasn't too much difference in performance with low spinlock contention in the disk-xfs workload. With heavy spinlock contention, the performance of the simple test-and-set lock can plummet when compared with the ticket and queue spinlocks.

An unfair lock in a native environment is generally not a good idea as there is a possibility of lock starvation for a heavily contended lock. This patch adds a new configuration option for the x86 architecture to enable the use of the unfair queue spinlock (PARAVIRT_UNFAIR_LOCKS) in a para-virtualized guest. A jump label (paravirt_unfairlocks_enabled) is used to switch between a fair and an unfair version of the spinlock code. This jump label will only be enabled in a virtual guest where the X86_FEATURE_HYPERVISOR feature bit is set.

Enabling this configuration feature causes a slight decrease in the performance of an uncontended lock-unlock operation of about 1-2%, mainly due to the use of a static key. However, uncontended lock-unlock operations are really just a tiny percentage of a real workload. So there should be no noticeable change in application performance.
With the unfair locking activated on bare metal 4-socket Westmere-EX box, the execution times (in ms) of a spinlock micro-benchmark were as follows: # ofTicket Fair Unfair simpleUnfair taskslock queue lockqueue lockbyte lock -- --- ----- 1 135135 137 137 2 890 1082 421 718 3 1932 2248 708 1263 4 2829 2819 1030 1916 5 3834 3522 1323 2327 6 4963 4173 1723 2938 7 6299 4875 2067 3292 8 7691 5563 2360 3768 Executing one task per node, the performance data were: # ofTicket Fair Unfair simpleUnfair nodeslock queue lockqueue
[PATCH v10 12/19] unfair qspinlock: Variable frequency lock stealing mechanism
In order to fully resolve the lock waiter preemption problem in virtual guests, it is necessary to enable lock stealing in the lock waiters. A simple test-and-set lock, however, has 2 main problems:
1) The constant spinning on the lock word puts a lot of cacheline contention traffic on the affected cacheline, thus slowing tasks that need to access the cacheline.
2) Lock starvation is a real possibility, especially if the number of virtual CPUs is large.
To alleviate these problems, this patch implements a variable frequency (from 1/8 to 1/1024) lock stealing mechanism for the lock waiters in the queue. The node next to the queue head tries to steal the lock once every 8 iterations of the pause loop. The next one in the queue has half the lock stealing frequency (once every 16 iterations) and so on until it reaches a maximum of once every 1024 iterations. This mechanism reduces the cacheline contention problem on the lock word while trying to maintain as much of a FIFO order as possible. Signed-off-by: Waiman Long waiman.l...@hp.com --- kernel/locking/qspinlock.c | 147 +++- 1 files changed, 146 insertions(+), 1 deletions(-) diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index a14241e..06dd486 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -63,6 +63,11 @@ */ struct qnode { struct mcs_spinlock mcs; +#ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS + int lsteal_mask;/* Lock stealing frequency mask */ + u32 prev_tail; /* Tail code of previous node */ + struct qnode *qprev; /* Previous queue node addr */ +#endif }; #define qhead mcs.locked /* The queue head flag */ @@ -215,6 +220,139 @@ xchg_tail(struct qspinlock *lock, u32 tail, u32 *pval) } #endif /* _Q_PENDING_BITS == 8 */ +/* + + * Inline functions for supporting unfair queue lock * + + */ +/* + * Unfair lock support in a virtualized guest + * + * An unfair lock can be implemented using a simple test-and-set lock like + * what is being done in a read-write lock. 
This simple scheme has 2 major + * problems: + * 1) It needs constant reading and occasionally writing to the lock word + * thus putting a lot of cacheline contention traffic on the affected + * cacheline. + * 2) Lock starvation is a real possibility especially if the number of + * virtual CPUs is large. + * + * To reduce the undesirable side effects of an unfair lock, the queue + * unfair spinlock implements a more elaborate scheme. Lock stealing is + * allowed in the following places: + * 1) In the spin_lock and spin_trylock fastpaths + * 2) When spinning in the waiter queue before becoming the queue head + * + * A lock acquirer has only one chance of stealing the lock in the spin_lock + * and spin_trylock fastpath. If the attempt fails for spin_lock, the task + * will be queued in the wait queue. + * + * Even in the wait queue, the task can still attempt to steal the lock + * periodically at a frequency about inversely and logarithmically proportional + * to its distance from the queue head. In other word, the closer it is to + * the queue head, the higher a chance it has of stealing the lock. This + * scheme reduces the load on the lock cacheline while trying to maintain + * a somewhat FIFO way of getting the lock so as to reduce the chance of lock + * starvation. 
+ */ +#ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS +#define DEF_LOOP_CNT(c)int c = 0 +#define INC_LOOP_CNT(c)(c)++ +#define LOOP_CNT(c)c +#define LSTEAL_MIN (1 3) +#define LSTEAL_MAX (1 10) +#define LSTEAL_MIN_MASK(LSTEAL_MIN - 1) +#define LSTEAL_MAX_MASK(LSTEAL_MAX - 1) + +/** + * unfair_init_vars - initialize unfair relevant fields in queue node structure + * @node: Current queue node address + */ +static inline void unfair_init_vars(struct qnode *node) +{ + node-qprev = NULL; + node-prev_tail = 0; + node-lsteal_mask = LSTEAL_MIN_MASK; +} + +/** + * unfair_set_vars - set unfair related fields in the queue node structure + * @node : Current queue node address + * @prev : Previous queue node address + * @prev_tail: Previous tail code + */ +static inline void +unfair_set_vars(struct qnode *node, struct qnode *prev, u32 prev_tail) +{ + if (!static_key_false(paravirt_unfairlocks_enabled)) + return; + + node-qprev = prev; + node-prev_tail = prev_tail; + /* +* This node will spin double the number of time of the previous node +* before attempting to steal the lock until it reaches a maximum. +*/ + node-lsteal_mask = prev-qhead ? LSTEAL_MIN_MASK : +
[PATCH v10 13/19] unfair qspinlock: Enable lock stealing in lock waiters
The simple unfair queue lock cannot completely solve the lock waiter preemption problem as a preempted CPU at the front of the queue will block forward progress in all the other CPUs behind it in the queue. To allow those CPUs to move forward, it is necessary to enable lock stealing for those lock waiters as well. Enabling those lock waiters to try to steal the lock increases the cacheline pressure on the lock word. As a result, performance can suffer on a workload with heavy spinlock contention. The tables below shows the the performance (jobs/minutes) of other kernel flavors of a 3.14-based kernel at 3000 users of the AIM7 disk workload on a 4-socket Westmere-EX bare-metal system. The ebizzy test was run. AIM7 XFS Disk Test kernel JPMReal Time Sys TimeUsr Time - ---- ticketlock56782333.17 96.61 5.81 qspinlock 57507993.13 94.83 5.97 simple test-and-set 56250003.20 98.29 5.93 simple unfair 57507993.13 95.91 5.98 qspinlock complex unfair56782333.17 97.40 5.93 qspinlock AIM7 EXT4 Disk Test kernel JPMReal Time Sys TimeUsr Time - ---- ticketlock1114551 16.15 509.72 7.11 qspinlock 21844668.24 232.99 6.01 simple test-and-set593081 30.35 967.55 9.00 simple unfair 22929947.85 222.84 5.89 qspinlock complex unfair 972447 18.51 589.88 6.65 qspinlock Ebizzy -m test kernel records/s Real Time Sys TimeUsr Time -- - ticketlock 2075 10.00 216.35 3.49 qspinlock 3023 10.00 198.20 4.80 simple test-and-set1667 10.00 198.93 2.89 simple unfair 2915 10.00 165.68 4.31 qspinlock complex unfair 1965 10.00 191.96 3.17 qspinlock With heavy spinlock contention, the complex unfair lock is faster than the simple test-and-set lock, but it is still slower than the baseline ticketlock. The table below shows the execution times (in ms) of a spinlock micro-benchmark on the same 4-socket Westmere-EX system. 
# ofTicket Fair Unfair simple Unfair complex taskslock queue lockqueue lock queue lock -- --- ---- - 1 135135 137137 2 890 1082 421663 3 1932 2248 708 1263 4 2829 2819 1030 1806 5 3834 3522 1323 2315 6 4963 4173 1723 2831 7 6299 4875 2067 2878 8 7691 5563 2360 3256 Executing one task per node, the performance data were: # ofTicket Fair Unfair simple Unfair complex nodeslock queue lockqueue lock queue lock -- --- ---- - 1135135 137137 2 4603 1034 670888 3 10940 12087 1389 2041 4 21555 10507 1869 4307 Signed-off-by: Waiman Long waiman.l...@hp.com --- kernel/locking/qspinlock.c | 160 ++-- 1 files changed, 154 insertions(+), 6 deletions(-) diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index 06dd486..0c86a6f 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -166,6 +166,23 @@ xchg_tail(struct qspinlock *lock, u32 tail, u32 *pval) return (u32)xchg(l-tail, tail _Q_TAIL_OFFSET) _Q_TAIL_OFFSET; } +/* + * cmpxchg_tail - Put in the new tail code if it matches the old one + * @lock : Pointer to queue spinlock structure + * @old : The old tail code value + * @new : The new tail code value + * Return: true if operation succeeds, fail otherwise + */ +static __always_inline int +cmpxchg_tail(struct qspinlock *lock, u32 old, u32 new) +{ + struct __qspinlock *l = (void *)lock; + + old = _Q_TAIL_OFFSET; + new = _Q_TAIL_OFFSET; + return cmpxchg(l-tail, old, new) == old; +} + #else /* _Q_PENDING_BITS == 8 */ /** @@ -218,6 +235,35 @@ xchg_tail(struct qspinlock *lock, u32 tail, u32 *pval) *pval = new; return old; } + +/* + * cmpxchg_tail - Put in the new tail code if it matches the old one + * @lock : Pointer to queue spinlock structure + * @old : The old tail code value + * @new : The new tail code value + *
[PATCH v10 17/19] pvqspinlock: Add qspinlock para-virtualization support
This patch adds base para-virtualization support to the queue spinlock in the same way as was done in the PV ticket lock code. In essence, the lock waiters will spin for a specified number of times (QSPIN_THRESHOLD = 2^14) and then halt themselves. The queue head waiter, unlike the other waiters, will spin 2*QSPIN_THRESHOLD times before halting itself. Before being halted, the queue head waiter will set a flag (_Q_LOCKED_SLOWPATH) in the lock byte to indicate that the unlock slowpath has to be invoked. In the unlock slowpath, the current lock holder will find the queue head by following the previous node pointer links stored in the queue node structure until it finds one that has the qhead flag turned on. It then attempts to kick the CPU of the queue head. After the queue head acquires the lock, it will also check the status of the next node and set the _Q_LOCKED_SLOWPATH flag if it has been halted. Enabling the PV code does have a performance impact on spinlock acquisitions and releases. The following table shows the execution time (in ms) of a spinlock micro-benchmark that does lock/unlock operations 5M times for each task versus the number of contending tasks on a Westmere-EX system.

  # of        Ticket lock              Queue lock
  tasks   PV off/PV on/%Change     PV off/PV on/%Change
  -----   --------------------     --------------------
    1        135/  179/+33%           137/  168/+23%
    2       1045/ 1103/ +6%          1161/ 1248/ +7%
    3       1827/ 2683/+47%          2357/ 2600/+10%
    4       2689/ 4191/+56%          2882/ 3115/ +8%
    5       3736/ 5830/+56%          3493/ 3571/ +2%
    6       4942/ 7609/+54%          4239/ 4198/ -1%
    7       6304/ 9570/+52%          4931/ 4895/ -1%
    8       7736/11323/+46%          5632/ 5588/ -1%

It can be seen that the ticket lock PV code has a fairly big decrease in performance when there are 3 or more contending tasks. The queue spinlock PV code, on the other hand, only has a relatively minor drop in performance with 1-4 contending tasks. With 5 or more contending tasks, there is practically no difference in performance. When coupled with the unfair lock, the queue spinlock can be much faster than the PV ticket lock. 
When both the unfair lock and PV spinlock features is turned on, lock stealing will still be allowed in the fastpath, but not in the slowpath. Signed-off-by: Waiman Long waiman.l...@hp.com --- arch/x86/include/asm/pvqspinlock.h | 306 arch/x86/include/asm/qspinlock.h | 33 kernel/locking/qspinlock.c | 91 ++- 3 files changed, 427 insertions(+), 3 deletions(-) create mode 100644 arch/x86/include/asm/pvqspinlock.h diff --git a/arch/x86/include/asm/pvqspinlock.h b/arch/x86/include/asm/pvqspinlock.h new file mode 100644 index 000..fea21aa --- /dev/null +++ b/arch/x86/include/asm/pvqspinlock.h @@ -0,0 +1,306 @@ +#ifndef _ASM_X86_PVQSPINLOCK_H +#define _ASM_X86_PVQSPINLOCK_H + +/* + * Queue Spinlock Para-Virtualization (PV) Support + * + * +--++-+ next ++ + * | Lock ||Queue|---|Next| + * |Holder|---|Head |---|Node| + * +--+ prev_tail +-+ prev_tail ++ + * + * The PV support code for queue spinlock is roughly the same as that + * of the ticket spinlock. Each CPU waiting for the lock will spin until it + * reaches a threshold. When that happens, it will put itself to halt so + * that the hypervisor can reuse the CPU cycles in some other guests as well + * as returning other hold-up CPUs faster. + * + * A major difference between the two versions of PV spinlock is the fact + * that the spin threshold of the queue spinlock is half of that of the + * ticket spinlock. However, the queue head will spin twice as long as the + * other nodes before it puts itself to halt. The reason for that is to + * increase halting chance of heavily contended locks to favor lightly + * contended locks (queue depth of 1 or less). + * + * There are 2 places where races can happen: + * 1) Halting of the queue head CPU (in pv_head_spin_check) and the CPU + * kicking by the lock holder in the unlock path (in pv_kick_node). + * 2) Halting of the queue node CPU (in pv_queue_spin_check) and the + * the status check by the previous queue head (in pv_halt_check). 
+ * See the comments on those functions to see how the races are being + * addressed. + */ + +/* + * Spin threshold for queue spinlock + */ +#defineQSPIN_THRESHOLD (1U14) +#define MAYHALT_THRESHOLD (QSPIN_THRESHOLD - 0x10) + +/* + * CPU state flags + */ +#define PV_CPU_ACTIVE 1 /* This CPU is active*/ +#define PV_CPU_KICKED 2 /* This CPU is being kicked */ +#define PV_CPU_HALTED -1 /* This CPU is halted*/ + +/* + * Additional fields to be added to the qnode structure + */ +#if CONFIG_NR_CPUS = (1 16) +#define _cpuid_t u32 +#else +#define _cpuid_t u16
[PATCH v10 11/19] qspinlock: Split the MCS queuing code into a separate slowerpath
With the pending addition of more codes to support unfair lock and PV spinlock, the complexity of the slowpath function increases to the point that the number of scratch-pad registers in the x86-64 architecture is not enough and so those additional non-scratch-pad registers will need to be used. This has the downside of requiring saving and restoring of those registers in the prolog and epilog of the slowpath function slowing down the nominally faster pending bit and trylock code path at the beginning of the slowpath function. This patch separates out the actual MCS queuing code into a slowerpath function. This avoids the slow down of the pending bit and trylock code path at the expense of a little bit of additional overhead to the MCS queuing code path. Signed-off-by: Waiman Long waiman.l...@hp.com --- kernel/locking/qspinlock.c | 120 +-- 1 files changed, 70 insertions(+), 50 deletions(-) diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index 10e87e1..a14241e 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -335,57 +335,23 @@ static inline int trylock_pending(struct qspinlock *lock, u32 *pval) } /** - * queue_spin_lock_slowpath - acquire the queue spinlock + * queue_spin_lock_slowerpath - a slower path for acquiring queue spinlock * @lock: Pointer to queue spinlock structure - * @val: Current value of the queue spinlock 32-bit word - * - * (queue tail, pending bit, lock bit) - * - * fast :slow :unlock - * : : - * uncontended (0,0,0) -:-- (0,0,1) --:-- (*,*,0) - * : | ^.--. / : - * : v \ \| : - * pending :(0,1,1) +-- (0,1,0) \ | : - * : | ^--' | | : - * : v | | : - * uncontended :(n,x,y) +-- (n,0,0) --' | : - * queue : | ^--' | : - * : v | : - * contended :(*,x,y) +-- (*,0,0) --- (*,0,1) -' : - * queue : ^--' : - * - * The pending bit processing is in the trylock_pending() function - * whereas the uncontended and contended queue processing is in the - * queue_spin_lock_slowpath() function. 
+ * @val : Current value of the queue spinlock 32-bit word + * @node: Pointer to the queue node + * @tail: The tail code * + * The reason for splitting a slowerpath from slowpath is to avoid the + * unnecessary overhead of non-scratch pad register pushing and popping + * due to increased complexity with unfair and PV spinlock from slowing + * down the nominally faster pending bit and trylock code path. So this + * function is not inlined. */ -void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val) +static noinline void queue_spin_lock_slowerpath(struct qspinlock *lock, + u32 val, struct qnode *node, u32 tail) { - struct qnode *prev, *next, *node; - u32 old, tail; - int idx; - - BUILD_BUG_ON(CONFIG_NR_CPUS = (1U _Q_TAIL_CPU_BITS)); - - if (trylock_pending(lock, val)) - return; /* Lock acquired */ - - node = this_cpu_ptr(qnodes[0]); - idx = node-mcs.count++; - tail = encode_tail(smp_processor_id(), idx); - - node += idx; - node-qhead = 0; - node-mcs.next = NULL; - - /* -* We touched a (possibly) cold cacheline in the per-cpu queue node; -* attempt the trylock once more in the hope someone let go while we -* weren't watching. -*/ - if (queue_spin_trylock(lock)) - goto release; + struct qnode *prev, *next; + u32 old; /* * we already touched the queueing cacheline; don't bother with pending @@ -442,7 +408,7 @@ retry_queue_wait: } old = atomic_cmpxchg(lock-val, val, _Q_LOCKED_VAL); if (old == val) - goto release; /* No contention */ + return; /* No contention */ else if (old _Q_LOCKED_MASK) goto retry_queue_wait; @@ -450,14 +416,68 @@ retry_queue_wait: } /* -* contended path; wait for next, release. +* contended path; wait for next, return. */ while (!(next = (struct qnode *)ACCESS_ONCE(node-mcs.next))) arch_mutex_cpu_relax(); arch_mcs_spin_unlock_contended(next-qhead); +} + +/** + * queue_spin_lock_slowpath - acquire the queue spinlock + * @lock: Pointer to queue spinlock structure + * @val: Current value of the queue spinlock
[PATCH v10 05/19] qspinlock: Optimize for smaller NR_CPUS
From: Peter Zijlstra pet...@infradead.org When we allow for a max NR_CPUS 2^14 we can optimize the pending wait-acquire and the xchg_tail() operations. By growing the pending bit to a byte, we reduce the tail to 16bit. This means we can use xchg16 for the tail part and do away with all the repeated compxchg() operations. This in turn allows us to unconditionally acquire; the locked state as observed by the wait loops cannot change. And because both locked and pending are now a full byte we can use simple stores for the state transition, obviating one atomic operation entirely. All this is horribly broken on Alpha pre EV56 (and any other arch that cannot do single-copy atomic byte stores). Signed-off-by: Peter Zijlstra pet...@infradead.org Signed-off-by: Waiman Long waiman.l...@hp.com --- include/asm-generic/qspinlock_types.h | 13 kernel/locking/qspinlock.c| 107 +--- 2 files changed, 110 insertions(+), 10 deletions(-) diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h index ed5d89a..4914abe 100644 --- a/include/asm-generic/qspinlock_types.h +++ b/include/asm-generic/qspinlock_types.h @@ -38,6 +38,14 @@ typedef struct qspinlock { /* * Bitfields in the atomic value: * + * When NR_CPUS 16K + * 0- 7: locked byte + * 8: pending + * 9-15: not used + * 16-17: tail index + * 18-31: tail cpu (+1) + * + * When NR_CPUS = 16K * 0- 7: locked byte * 8: pending * 9-10: tail index @@ -50,7 +58,11 @@ typedef struct qspinlock { #define _Q_LOCKED_MASK _Q_SET_MASK(LOCKED) #define _Q_PENDING_OFFSET (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS) +#if CONFIG_NR_CPUS (1U 14) +#define _Q_PENDING_BITS8 +#else #define _Q_PENDING_BITS1 +#endif #define _Q_PENDING_MASK_Q_SET_MASK(PENDING) #define _Q_TAIL_IDX_OFFSET (_Q_PENDING_OFFSET + _Q_PENDING_BITS) @@ -61,6 +73,7 @@ typedef struct qspinlock { #define _Q_TAIL_CPU_BITS (32 - _Q_TAIL_CPU_OFFSET) #define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU) +#define _Q_TAIL_OFFSET _Q_TAIL_IDX_OFFSET #define _Q_TAIL_MASK 
(_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK) #define _Q_LOCKED_VAL (1U _Q_LOCKED_OFFSET) diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index a49b82b..3e908f7 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -22,6 +22,7 @@ #include linux/percpu.h #include linux/hardirq.h #include linux/mutex.h +#include asm/byteorder.h #include asm/qspinlock.h /* @@ -48,6 +49,9 @@ * We can further change the first spinner to spin on a bit in the lock word * instead of its node; whereby avoiding the need to carry a node from lock to * unlock, and preserving API. + * + * N.B. The current implementation only supports architectures that allow + * atomic operations on smaller 8-bit and 16-bit data types. */ #include mcs_spinlock.h @@ -85,6 +89,87 @@ static inline struct mcs_spinlock *decode_tail(u32 tail) #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK) +/* + * By using the whole 2nd least significant byte for the pending bit, we + * can allow better optimization of the lock acquisition for the pending + * bit holder. + */ +#if _Q_PENDING_BITS == 8 + +struct __qspinlock { + union { + atomic_t val; + struct { +#ifdef __LITTLE_ENDIAN + u16 locked_pending; + u16 tail; +#else + u16 tail; + u16 locked_pending; +#endif + }; + }; +}; + +/** + * clear_pending_set_locked - take ownership and clear the pending bit. 
+ * @lock: Pointer to queue spinlock structure + * @val : Current value of the queue spinlock 32-bit word + * + * *,1,0 - *,0,1 + */ +static __always_inline void +clear_pending_set_locked(struct qspinlock *lock, u32 val) +{ + struct __qspinlock *l = (void *)lock; + + ACCESS_ONCE(l-locked_pending) = _Q_LOCKED_VAL; +} + +/* + * xchg_tail - Put in the new queue tail code word retrieve previous one + * @lock : Pointer to queue spinlock structure + * @tail : The new queue tail code word + * @pval : Pointer to current value of the queue spinlock 32-bit word + * Return: The previous queue tail code word + * + * xchg(lock, tail) + * + * p,*,* - n,*,* ; prev = xchg(lock, node) + */ +static __always_inline u32 +xchg_tail(struct qspinlock *lock, u32 tail, u32 *pval) +{ + struct __qspinlock *l = (void *)lock; + + return (u32)xchg(l-tail, tail _Q_TAIL_OFFSET) _Q_TAIL_OFFSET; +} + +#else /* _Q_PENDING_BITS == 8 */ + +/** + * clear_pending_set_locked - take ownership and clear the pending bit. + * @lock: Pointer to queue spinlock structure + * @val : Current value of the queue spinlock 32-bit word + * + * *,1,0 - *,0,1 + */ +static __always_inline void +clear_pending_set_locked(struct qspinlock *lock, u32 val) +{ +
[PATCH v10 09/19] qspinlock: Prepare for unfair lock support
If unfair lock is supported, the lock acquisition loop at the end of the
queue_spin_lock_slowpath() function may need to detect the fact that the
lock can be stolen. Code is added for the stolen lock detection.

A new qhead macro is also defined as a shorthand for mcs.locked.

Signed-off-by: Waiman Long waiman.l...@hp.com
---
 kernel/locking/qspinlock.c |   26 +++++++++++++++++++++---------
 1 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index e98d7d4..9e7659e 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -64,6 +64,7 @@
 struct qnode {
 	struct mcs_spinlock mcs;
 };
+#define qhead	mcs.locked	/* The queue head flag */
 
 /*
  * Per-CPU queue node structures; we can never have more than 4 nested
@@ -216,18 +217,20 @@ xchg_tail(struct qspinlock *lock, u32 tail, u32 *pval)
 
 /**
  * get_qlock - Set the lock bit and own the lock
- * @lock: Pointer to queue spinlock structure
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 otherwise
  *
  * This routine should only be called when the caller is the only one
  * entitled to acquire the lock.
*/ -static __always_inline void get_qlock(struct qspinlock *lock) +static __always_inline int get_qlock(struct qspinlock *lock) { struct __qspinlock *l = (void *)lock; barrier(); ACCESS_ONCE(l-locked) = _Q_LOCKED_VAL; barrier(); + return 1; } /** @@ -365,7 +368,7 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val) tail = encode_tail(smp_processor_id(), idx); node += idx; - node-mcs.locked = 0; + node-qhead = 0; node-mcs.next = NULL; /* @@ -391,7 +394,7 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val) prev = decode_tail(old); ACCESS_ONCE(prev-mcs.next) = (struct mcs_spinlock *)node; - while (!smp_load_acquire(node-mcs.locked)) + while (!smp_load_acquire(node-qhead)) arch_mutex_cpu_relax(); } @@ -403,6 +406,7 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val) * * *,x,y - *,0,0 */ +retry_queue_wait: while ((val = smp_load_acquire(lock-val.counter)) _Q_LOCKED_PENDING_MASK) arch_mutex_cpu_relax(); @@ -419,12 +423,20 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val) */ for (;;) { if (val != tail) { - get_qlock(lock); - break; + /* +* The get_qlock function will only failed if the +* lock was stolen. +*/ + if (get_qlock(lock)) + break; + else + goto retry_queue_wait; } old = atomic_cmpxchg(lock-val, val, _Q_LOCKED_VAL); if (old == val) goto release; /* No contention */ + else if (old _Q_LOCKED_MASK) + goto retry_queue_wait; val = old; } @@ -435,7 +447,7 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val) while (!(next = (struct qnode *)ACCESS_ONCE(node-mcs.next))) arch_mutex_cpu_relax(); - arch_mcs_spin_unlock_contended(next-mcs.locked); + arch_mcs_spin_unlock_contended(next-qhead); release: /* -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v10 04/19] qspinlock: Extract out the exchange of tail code word
This patch extracts the logic for the exchange of new and previous tail code words into a new xchg_tail() function which can be optimized in a later patch. Signed-off-by: Waiman Long waiman.l...@hp.com --- include/asm-generic/qspinlock_types.h |2 + kernel/locking/qspinlock.c| 61 + 2 files changed, 41 insertions(+), 22 deletions(-) diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h index bd25081..ed5d89a 100644 --- a/include/asm-generic/qspinlock_types.h +++ b/include/asm-generic/qspinlock_types.h @@ -61,6 +61,8 @@ typedef struct qspinlock { #define _Q_TAIL_CPU_BITS (32 - _Q_TAIL_CPU_OFFSET) #define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU) +#define _Q_TAIL_MASK (_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK) + #define _Q_LOCKED_VAL (1U _Q_LOCKED_OFFSET) #define _Q_PENDING_VAL (1U _Q_PENDING_OFFSET) diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index 6467bfc..a49b82b 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -86,6 +86,34 @@ static inline struct mcs_spinlock *decode_tail(u32 tail) #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK) /** + * xchg_tail - Put in the new queue tail code word retrieve previous one + * @lock : Pointer to queue spinlock structure + * @tail : The new queue tail code word + * @pval : Pointer to current value of the queue spinlock 32-bit word + * Return: The previous queue tail code word + * + * xchg(lock, tail) + * + * p,*,* - n,*,* ; prev = xchg(lock, node) + */ +static __always_inline u32 +xchg_tail(struct qspinlock *lock, u32 tail, u32 *pval) +{ + u32 old, new, val = *pval; + + for (;;) { + new = (val _Q_LOCKED_PENDING_MASK) | tail; + old = atomic_cmpxchg(lock-val, val, new); + if (old == val) + break; + + val = old; + } + *pval = new; + return old; +} + +/** * trylock_pending - try to acquire queue spinlock using the pending bit * @lock : Pointer to queue spinlock structure * @pval : Pointer to value of the queue spinlock 32-bit word @@ 
-196,36 +224,25 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val) node-next = NULL; /* -* we already touched the queueing cacheline; don't bother with pending -* stuff. -* -* trylock || xchg(lock, node) -* -* 0,0,0 - 0,0,1 ; trylock -* p,y,x - n,y,x ; prev = xchg(lock, node) +* We touched a (possibly) cold cacheline in the per-cpu queue node; +* attempt the trylock once more in the hope someone let go while we +* weren't watching. */ - for (;;) { - new = _Q_LOCKED_VAL; - if (val) - new = tail | (val _Q_LOCKED_PENDING_MASK); - - old = atomic_cmpxchg(lock-val, val, new); - if (old == val) - break; - - val = old; - } + if (queue_spin_trylock(lock)) + goto release; /* -* we won the trylock; forget about queueing. +* we already touched the queueing cacheline; don't bother with pending +* stuff. +* +* p,*,* - n,*,* */ - if (new == _Q_LOCKED_VAL) - goto release; + old = xchg_tail(lock, tail, val); /* * if there was a previous node; link it and wait. */ - if (old ~_Q_LOCKED_PENDING_MASK) { + if (old _Q_TAIL_MASK) { prev = decode_tail(old); ACCESS_ONCE(prev-next) = node; -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v10 01/19] qspinlock: A simple generic 4-byte queue spinlock
This patch introduces a new generic queue spinlock implementation that can serve as an alternative to the default ticket spinlock. Compared with the ticket spinlock, this queue spinlock should be almost as fair. It has about the same speed in the single-thread case and can be much faster under high contention, especially when the spinlock is embedded within the data structure to be protected. Only in light to moderate contention, where the average queue depth is around 1-3, will this queue spinlock potentially be a bit slower due to the higher slowpath overhead. This queue spinlock is especially suited to NUMA machines with a large number of cores, as the chance of spinlock contention is much higher on those machines. The cost of contention is also higher because of slower inter-node memory traffic. Because spinlocks are acquired with preemption disabled, the process will not be migrated to another CPU while it is trying to get a spinlock. Ignoring interrupt handling, a CPU can only be contending on one spinlock at any one time. Counting soft IRQ, hard IRQ and NMI contexts, a CPU can have a maximum of 4 concurrent lock waiting activities. By allocating a set of per-cpu queue nodes and using them to form a waiting queue, we can encode the queue node address into a much smaller 24-bit value (comprising CPU number and queue node index), leaving one byte for the lock. Please note that the queue node is only needed while waiting for the lock. Once the lock is acquired, the queue node can be released for later reuse. 
Signed-off-by: Waiman Long waiman.l...@hp.com Signed-off-by: Peter Zijlstra pet...@infradead.org --- include/asm-generic/qspinlock.h | 118 include/asm-generic/qspinlock_types.h | 61 ++ kernel/Kconfig.locks |7 + kernel/locking/Makefile |1 + kernel/locking/mcs_spinlock.h |1 + kernel/locking/qspinlock.c| 197 + 6 files changed, 385 insertions(+), 0 deletions(-) create mode 100644 include/asm-generic/qspinlock.h create mode 100644 include/asm-generic/qspinlock_types.h create mode 100644 kernel/locking/qspinlock.c diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h new file mode 100644 index 000..e8a7ae8 --- /dev/null +++ b/include/asm-generic/qspinlock.h @@ -0,0 +1,118 @@ +/* + * Queue spinlock + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P. + * + * Authors: Waiman Long waiman.l...@hp.com + */ +#ifndef __ASM_GENERIC_QSPINLOCK_H +#define __ASM_GENERIC_QSPINLOCK_H + +#include asm-generic/qspinlock_types.h + +/** + * queue_spin_is_locked - is the spinlock locked? + * @lock: Pointer to queue spinlock structure + * Return: 1 if it is locked, 0 otherwise + */ +static __always_inline int queue_spin_is_locked(struct qspinlock *lock) +{ + return atomic_read(lock-val); +} + +/** + * queue_spin_value_unlocked - is the spinlock structure unlocked? + * @lock: queue spinlock structure + * Return: 1 if it is unlocked, 0 otherwise + * + * N.B. 
Whenever there are tasks waiting for the lock, it is considered + * locked wrt the lockref code to avoid lock stealing by the lockref + * code and change things underneath the lock. This also allows some + * optimizations to be applied without conflict with lockref. + */ +static __always_inline int queue_spin_value_unlocked(struct qspinlock lock) +{ + return !atomic_read(lock.val); +} + +/** + * queue_spin_is_contended - check if the lock is contended + * @lock : Pointer to queue spinlock structure + * Return: 1 if lock contended, 0 otherwise + */ +static __always_inline int queue_spin_is_contended(struct qspinlock *lock) +{ + return atomic_read(lock-val) ~_Q_LOCKED_MASK; +} +/** + * queue_spin_trylock - try to acquire the queue spinlock + * @lock : Pointer to queue spinlock structure + * Return: 1 if lock acquired, 0 if failed + */ +static __always_inline int queue_spin_trylock(struct qspinlock *lock) +{ + if (!atomic_read(lock-val) + (atomic_cmpxchg(lock-val, 0, _Q_LOCKED_VAL) == 0)) + return 1; + return 0; +} + +extern void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val); + +/** + * queue_spin_lock - acquire a queue spinlock + * @lock: Pointer to queue spinlock structure + */ +static __always_inline void queue_spin_lock(struct qspinlock *lock) +{ + u32 val; + +
[PATCH v10 03/19] qspinlock: Add pending bit
From: Peter Zijlstra pet...@infradead.org Because the qspinlock needs to touch a second cacheline; add a pending bit and allow a single in-word spinner before we punt to the second cacheline. Signed-off-by: Peter Zijlstra pet...@infradead.org Signed-off-by: Waiman Long waiman.l...@hp.com --- include/asm-generic/qspinlock_types.h | 12 +++- kernel/locking/qspinlock.c| 121 +++-- 2 files changed, 110 insertions(+), 23 deletions(-) diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h index f66f845..bd25081 100644 --- a/include/asm-generic/qspinlock_types.h +++ b/include/asm-generic/qspinlock_types.h @@ -39,8 +39,9 @@ typedef struct qspinlock { * Bitfields in the atomic value: * * 0- 7: locked byte - * 8- 9: tail index - * 10-31: tail cpu (+1) + * 8: pending + * 9-10: tail index + * 11-31: tail cpu (+1) */ #define_Q_SET_MASK(type) (((1U _Q_ ## type ## _BITS) - 1)\ _Q_ ## type ## _OFFSET) @@ -48,7 +49,11 @@ typedef struct qspinlock { #define _Q_LOCKED_BITS 8 #define _Q_LOCKED_MASK _Q_SET_MASK(LOCKED) -#define _Q_TAIL_IDX_OFFSET (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS) +#define _Q_PENDING_OFFSET (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS) +#define _Q_PENDING_BITS1 +#define _Q_PENDING_MASK_Q_SET_MASK(PENDING) + +#define _Q_TAIL_IDX_OFFSET (_Q_PENDING_OFFSET + _Q_PENDING_BITS) #define _Q_TAIL_IDX_BITS 2 #define _Q_TAIL_IDX_MASK _Q_SET_MASK(TAIL_IDX) @@ -57,5 +62,6 @@ typedef struct qspinlock { #define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU) #define _Q_LOCKED_VAL (1U _Q_LOCKED_OFFSET) +#define _Q_PENDING_VAL (1U _Q_PENDING_OFFSET) #endif /* __ASM_GENERIC_QSPINLOCK_TYPES_H */ diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index b97a1ad..6467bfc 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -83,23 +83,97 @@ static inline struct mcs_spinlock *decode_tail(u32 tail) return per_cpu_ptr(mcs_nodes[idx], cpu); } +#define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK) + +/** + * trylock_pending - try 
to acquire queue spinlock using the pending bit + * @lock : Pointer to queue spinlock structure + * @pval : Pointer to value of the queue spinlock 32-bit word + * Return: 1 if lock acquired, 0 otherwise + */ +static inline int trylock_pending(struct qspinlock *lock, u32 *pval) +{ + u32 old, new, val = *pval; + + /* +* trylock || pending +* +* 0,0,0 - 0,0,1 ; trylock +* 0,0,1 - 0,1,1 ; pending +*/ + for (;;) { + /* +* If we observe any contention; queue. +*/ + if (val ~_Q_LOCKED_MASK) + return 0; + + new = _Q_LOCKED_VAL; + if (val == new) + new |= _Q_PENDING_VAL; + + old = atomic_cmpxchg(lock-val, val, new); + if (old == val) + break; + + *pval = val = old; + } + + /* +* we won the trylock +*/ + if (new == _Q_LOCKED_VAL) + return 1; + + /* +* we're pending, wait for the owner to go away. +* +* *,1,1 - *,1,0 +*/ + while ((val = atomic_read(lock-val)) _Q_LOCKED_MASK) + arch_mutex_cpu_relax(); + + /* +* take ownership and clear the pending bit. +* +* *,1,0 - *,0,1 +*/ + for (;;) { + new = (val ~_Q_PENDING_MASK) | _Q_LOCKED_VAL; + + old = atomic_cmpxchg(lock-val, val, new); + if (old == val) + break; + + val = old; + } + return 1; +} + /** * queue_spin_lock_slowpath - acquire the queue spinlock * @lock: Pointer to queue spinlock structure * @val: Current value of the queue spinlock 32-bit word * - * (queue tail, lock bit) + * (queue tail, pending bit, lock bit) + * + * fast :slow :unlock + * : : + * uncontended (0,0,0) -:-- (0,0,1) --:-- (*,*,0) + * : | ^.--. / : + * : v \ \| : + * pending :(0,1,1) +-- (0,1,0) \ | : + * : | ^--' | | : + * : v | | : + * uncontended :(n,x,y) +-- (n,0,0) --' | : + * queue : | ^--' | : + * : v | : + * contended :(*,x,y) +-- (*,0,0) --- (*,0,1) -'
[PATCH v10 08/19] qspinlock: Make a new qnode structure to support virtualization
In order to support additional virtualization features like unfair lock and para-virtualized spinlock, it is necessary to store additional CPU specific data into the queue node structure. As a result, a new qnode structure is created and the mcs_spinlock structure is now part of the new structure. It is also necessary to expand arch_mcs_spin_lock_contended() to the underlying while loop as additional code will need to be inserted into the loop. Signed-off-by: Waiman Long waiman.l...@hp.com --- kernel/locking/qspinlock.c | 36 +++- 1 files changed, 23 insertions(+), 13 deletions(-) diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index 0ee1a23..e98d7d4 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -57,12 +57,21 @@ #include mcs_spinlock.h /* + * To have additional features for better virtualization support, it is + * necessary to store additional data in the queue node structure. So + * a new queue node structure will have to be defined and used here. + */ +struct qnode { + struct mcs_spinlock mcs; +}; + +/* * Per-CPU queue node structures; we can never have more than 4 nested * contexts: task, softirq, hardirq, nmi. * * Exactly fits one cacheline. 
*/ -static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[4]); +static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[4]); /* * We must be able to distinguish between no-tail and the tail at 0:0, @@ -79,12 +88,12 @@ static inline u32 encode_tail(int cpu, int idx) return tail; } -static inline struct mcs_spinlock *decode_tail(u32 tail) +static inline struct qnode *decode_tail(u32 tail) { int cpu = (tail _Q_TAIL_CPU_OFFSET) - 1; int idx = (tail _Q_TAIL_IDX_MASK) _Q_TAIL_IDX_OFFSET; - return per_cpu_ptr(mcs_nodes[idx], cpu); + return per_cpu_ptr(qnodes[idx], cpu); } #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK) @@ -342,7 +351,7 @@ static inline int trylock_pending(struct qspinlock *lock, u32 *pval) */ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val) { - struct mcs_spinlock *prev, *next, *node; + struct qnode *prev, *next, *node; u32 old, tail; int idx; @@ -351,13 +360,13 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val) if (trylock_pending(lock, val)) return; /* Lock acquired */ - node = this_cpu_ptr(mcs_nodes[0]); - idx = node-count++; + node = this_cpu_ptr(qnodes[0]); + idx = node-mcs.count++; tail = encode_tail(smp_processor_id(), idx); node += idx; - node-locked = 0; - node-next = NULL; + node-mcs.locked = 0; + node-mcs.next = NULL; /* * We touched a (possibly) cold cacheline in the per-cpu queue node; @@ -380,9 +389,10 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val) */ if (old _Q_TAIL_MASK) { prev = decode_tail(old); - ACCESS_ONCE(prev-next) = node; + ACCESS_ONCE(prev-mcs.next) = (struct mcs_spinlock *)node; - arch_mcs_spin_lock_contended(node-locked); + while (!smp_load_acquire(node-mcs.locked)) + arch_mutex_cpu_relax(); } /* @@ -422,15 +432,15 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val) /* * contended path; wait for next, release. 
*/ - while (!(next = ACCESS_ONCE(node-next))) + while (!(next = (struct qnode *)ACCESS_ONCE(node-mcs.next))) arch_mutex_cpu_relax(); - arch_mcs_spin_unlock_contended(next-locked); + arch_mcs_spin_unlock_contended(next-mcs.locked); release: /* * release the node */ - this_cpu_dec(mcs_nodes[0].count); + this_cpu_dec(qnodes[0].mcs.count); } EXPORT_SYMBOL(queue_spin_lock_slowpath); -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v10 06/19] qspinlock: prolong the stay in the pending bit path
There is a problem in the current trylock_pending() function. When the lock is free, but the pending bit holder hasn't grabbed the lock and cleared the pending bit yet, the trylock_pending() function will fail. As a result, the regular queuing code path will be used most of the time, even when there are only 2 tasks contending for the lock. Assuming that the pending bit holder is going to get the lock and clear the pending bit soon, it is actually better to wait than to be queued up, which has a higher overhead.

This patch modifies the trylock_pending() function to wait until the pending bit holder gets the lock and clears the pending bit. In case both the lock and pending bits are set, the new code will also wait a bit to see if either one is cleared. If there is no change, it will quit and be queued.

The following tables show the before-patch execution time (in ms) of a micro-benchmark where 5M iterations of the lock/unlock cycles were run on a 10-core Westmere-EX x86-64 CPU with 2 different types of loads - standalone (lock and protected data in different cachelines) and embedded (lock and protected data in the same cacheline).

  [Standalone/Embedded - same node]
  # of tasks   Ticket lock   Queue lock    %Change
  ----------   -----------   ----------    -------
       1        135/ 111      135/ 101       0%/  -9%
       2        890/ 779     1885/1990    +112%/+156%
       3       1932/1859     2333/2341     +21%/ +26%
       4       2829/2726     2900/2923      +3%/  +7%
       5       3834/3761     3655/3648      -5%/  -3%
       6       4963/4976     4336/4326     -13%/ -13%
       7       6299/6269     5057/5064     -20%/ -19%
       8       7691/7569     5786/5798     -25%/ -23%

With 1 task per NUMA node, the execution times are:

  [Standalone - different nodes]
  # of nodes   Ticket lock   Queue lock   %Change
  ----------   -----------   ----------   -------
       1           135           135         0%
       2          4604          5087       +10%
       3         10940         12224       +12%
       4         21555         10555       -51%

It can be seen that the queue spinlock is slower than the ticket spinlock when there are 2 or 3 contending tasks. In all the other cases, the queue spinlock is either equal to or faster than the ticket spinlock. 
With this patch, the performance data for 2 contending tasks are: [Standalone/Embedded] # of tasksTicket lock Queue lock %Change ----- -- --- 2 890/779984/871+11%/+12% [Standalone - different nodes] # of nodesTicket lock Queue lock %Change ----- -- --- 2 4604 1364 -70% It can be seen that the queue spinlock performance for 2 contending tasks is now comparable to ticket spinlock on the same node, but much faster when in different nodes. With 3 contending tasks, however, the ticket spinlock is still quite a bit faster. Signed-off-by: Waiman Long waiman.l...@hp.com --- kernel/locking/qspinlock.c | 31 +-- 1 files changed, 29 insertions(+), 2 deletions(-) diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index 3e908f7..e734acb 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -212,6 +212,7 @@ xchg_tail(struct qspinlock *lock, u32 tail, u32 *pval) static inline int trylock_pending(struct qspinlock *lock, u32 *pval) { u32 old, new, val = *pval; + int retry = 1; /* * trylock || pending @@ -221,11 +222,37 @@ static inline int trylock_pending(struct qspinlock *lock, u32 *pval) */ for (;;) { /* -* If we observe any contention; queue. +* If we observe that the queue is not empty, +* return and be queued. */ - if (val ~_Q_LOCKED_MASK) + if (val _Q_TAIL_MASK) return 0; + if (val == (_Q_LOCKED_VAL|_Q_PENDING_VAL)) { + /* +* If both the lock and pending bits are set, we wait +* a while to see if that either bit will be cleared. +* If that is no change, we return and be queued. +*/ + if (!retry) + return 0; + retry--; + cpu_relax(); + cpu_relax(); + *pval = val = atomic_read(lock-val); + continue; + } else if (val == _Q_PENDING_VAL) { + /* +* Pending bit is set, but not the lock bit. +* Assuming
[PATCH v10 07/19] qspinlock: Use a simple write to grab the lock, if applicable
Currently, atomic_cmpxchg() is used to get the lock. However, this is not really necessary if there is more than one task in the queue and the queue head doesn't need to reset the queue code word. For that case, a simple write to set the lock bit is enough, as the queue head will be the only one eligible to get the lock as long as it checks that both the lock and pending bits are not set. The current pending bit waiting code will ensure that the bit will not be set as soon as the queue code word (tail) in the lock is set.

With that change, there is some slight improvement in the performance of the queue spinlock in the 5M loop micro-benchmark run on a 4-socket Westmere-EX machine, as shown in the tables below.

  [Standalone/Embedded - same node]
  # of tasks   Before patch   After patch   %Change
  ----------   ------------   -----------   -------
       3        2324/2321      2248/2265    -3%/-2%
       4        2890/2896      2819/2831    -2%/-2%
       5        3611/3595      3522/3512    -2%/-2%
       6        4281/4276      4173/4160    -3%/-3%
       7        5018/5001      4875/4861    -3%/-3%
       8        5759/5750      5563/5568    -3%/-3%

  [Standalone/Embedded - different nodes]
  # of tasks   Before patch   After patch   %Change
  ----------   ------------   -----------   -------
       3       12242/12237    12087/12093   -1%/-1%
       4       10688/10696    10507/10521   -2%/-2%

It was also found that this change produced a much bigger performance improvement in the newer IvyBridge-EX chip and essentially closed the performance gap between the ticket spinlock and queue spinlock.

The disk workload of the AIM7 benchmark was run on a 4-socket Westmere-EX machine with both ext4 and xfs RAM disks at 3000 users on a 3.14 based kernel. The results of the test runs were:

  AIM7 XFS Disk Test
  kernel         JPM       Real Time   Sys Time   Usr Time
  ----------   -------     ---------   --------   --------
  ticketlock   5678233       3.17        96.61      5.81
  qspinlock    5750799       3.13        94.83      5.97

  AIM7 EXT4 Disk Test
  kernel         JPM       Real Time   Sys Time   Usr Time
  ----------   -------     ---------   --------   --------
  ticketlock   1114551      16.15       509.72      7.11
  qspinlock    2184466       8.24       232.99      6.01

The ext4 filesystem run had a much higher spinlock contention than the xfs filesystem run. 
The ebizzy -m test was also run with the following results: kernel records/s Real Time Sys TimeUsr Time -- - ticketlock 2075 10.00 216.35 3.49 qspinlock 3023 10.00 198.20 4.80 Signed-off-by: Waiman Long waiman.l...@hp.com --- kernel/locking/qspinlock.c | 61 +++ 1 files changed, 44 insertions(+), 17 deletions(-) diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index e734acb..0ee1a23 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -94,23 +94,29 @@ static inline struct mcs_spinlock *decode_tail(u32 tail) * can allow better optimization of the lock acquisition for the pending * bit holder. */ -#if _Q_PENDING_BITS == 8 - struct __qspinlock { union { atomic_t val; - struct { #ifdef __LITTLE_ENDIAN + u8 locked; + struct { u16 locked_pending; u16 tail; + }; #else + struct { u16 tail; u16 locked_pending; -#endif }; + struct { + u8 reserved[3]; + u8 locked; + }; +#endif }; }; +#if _Q_PENDING_BITS == 8 /** * clear_pending_set_locked - take ownership and clear the pending bit. * @lock: Pointer to queue spinlock structure @@ -200,6 +206,22 @@ xchg_tail(struct qspinlock *lock, u32 tail, u32 *pval) #endif /* _Q_PENDING_BITS == 8 */ /** + * get_qlock - Set the lock bit and own the lock + * @lock: Pointer to queue spinlock structure + * + * This routine should only be called when the caller is the only one + * entitled to acquire the lock. + */ +static __always_inline void get_qlock(struct qspinlock *lock) +{ + struct __qspinlock *l = (void *)lock; + + barrier(); + ACCESS_ONCE(l-locked) = _Q_LOCKED_VAL; + barrier(); +} + +/** * trylock_pending - try to acquire queue spinlock using the pending bit * @lock : Pointer to queue spinlock structure * @pval : Pointer to value of the queue spinlock 32-bit word @@ -321,7 +343,7 @@ static inline int trylock_pending(struct qspinlock *lock, u32 *pval)
[PATCH v10 00/19] qspinlock: a 4-byte queue spinlock with PV support
v9->v10:
 - Make some minor changes to qspinlock.c to accommodate review feedback.
 - Change author to PeterZ for 2 of the patches.
 - Include Raghavendra KT's test results in patch 18.

v8->v9:
 - Integrate PeterZ's version of the queue spinlock patch with some
   modifications: http://lkml.kernel.org/r/20140310154236.038181...@infradead.org
 - Break the more complex patches into smaller ones to ease review effort.
 - Fix a race condition in the PV qspinlock code.

v7->v8:
 - Remove one unneeded atomic operation from the slowpath, thus
   improving performance.
 - Simplify some of the code and add more comments.
 - Test for the X86_FEATURE_HYPERVISOR CPU feature bit to enable/disable
   the unfair lock.
 - Reduce the unfair lock slowpath lock stealing frequency depending on
   its distance from the queue head.
 - Add performance data for the IvyBridge-EX CPU.

v6->v7:
 - Remove an atomic operation from the 2-task contending code.
 - Shorten the names of some macros.
 - Make the queue waiter attempt to steal the lock when unfair lock is
   enabled.
 - Remove the lock holder kick from the PV code and fix a race condition.
 - Run the unfair lock PV code on overcommitted KVM guests to collect
   performance data.

v5->v6:
 - Change the optimized 2-task contending code to make it fairer at the
   expense of a bit of performance.
 - Add a patch to support unfair queue spinlock for Xen.
 - Modify the PV qspinlock code to follow what was done in the PV
   ticketlock.
 - Add performance data for the unfair lock as well as the PV support
   code.

v4->v5:
 - Move the optimized 2-task contending code to the generic file to
   enable more architectures to use it without code duplication.
 - Address some of the style-related comments by PeterZ.
 - Allow the use of unfair queue spinlock in a real para-virtualized
   execution environment.
 - Add para-virtualization support to the qspinlock code by ensuring
   that the lock holder and queue head stay alive as much as possible.
v3->v4:
 - Remove debugging code and fix a configuration error.
 - Simplify the qspinlock structure and streamline the code to make it
   perform a bit better.
 - Add an x86 version of asm/qspinlock.h for holding x86-specific
   optimizations.
 - Add an optimized x86 code path for 2 contending tasks to improve low
   contention performance.

v2->v3:
 - Simplify the code by using numerous mode only without an unfair
   option.
 - Use the latest smp_load_acquire()/smp_store_release() barriers.
 - Move the queue spinlock code to kernel/locking.
 - Make the use of queue spinlock the default for x86-64 without user
   configuration.
 - Additional performance tuning.

v1->v2:
 - Add some more comments to document what the code does.
 - Add a numerous CPU mode to support >= 16K CPUs.
 - Add a configuration option to allow lock stealing which can further
   improve performance in many cases.
 - Enable wakeup of the queue head CPU at unlock time for the
   non-numerous CPU mode.

This patch set has 3 different sections:
 1) Patches 1-7: Introduces a queue-based spinlock implementation that
    can replace the default ticket spinlock without increasing the size
    of the spinlock data structure. As a result, critical kernel data
    structures that embed spinlock won't increase in size and break data
    alignments.
 2) Patches 8-13: Enables the use of unfair queue spinlock in a virtual
    guest. This can resolve some of the locking-related performance
    issues due to the fact that the next CPU to get the lock may have
    been scheduled out for a period of time.
 3) Patches 14-19: Enable qspinlock para-virtualization support by
    halting the waiting CPUs after spinning for a certain amount of
    time. The unlock code will detect a sleeping waiter and wake it up.
    This is essentially the same logic as the PV ticketlock code.

The queue spinlock has slightly better performance than the ticket spinlock in the uncontended case. Its performance can be much better with moderate to heavy contention.
This patch set has the potential of improving the performance of all workloads that have moderate to heavy spinlock contention. The queue spinlock is especially suitable for NUMA machines with at least 2 sockets, though a noticeable performance benefit probably won't show up on machines with fewer than 4 sockets.

The purpose of this patch set is not to solve any particular spinlock contention problem. Those need to be solved by refactoring the code to make more efficient use of the lock, or by using locks of finer granularity. The main purpose is to make the lock contention problems more tolerable until someone can spend the time and effort to fix them.

Peter Zijlstra (2):
  qspinlock: Add pending bit
  qspinlock: Optimize for smaller NR_CPUS

Waiman Long (17):
  qspinlock: A simple generic 4-byte queue spinlock
  qspinlock, x86: Enable x86-64 to use queue spinlock
  qspinlock: Extract out the exchange of tail code word
  qspinlock: prolong the stay in the pending bit path
  qspinlock:
[PATCH 2/9] arm64: move DBG_MDSCR_* to asm/debug-monitors.h
In order to be able to use the DBG_MDSCR_* macros from the KVM code, move the relevant definitions to the obvious include file. Also move the debug_el enum to a portion of the file that is guarded by #ifndef __ASSEMBLY__, in order to be able to use that file from assembly code.

Signed-off-by: Marc Zyngier marc.zyng...@arm.com
---
 arch/arm64/include/asm/debug-monitors.h | 19 ++-
 arch/arm64/kernel/debug-monitors.c      |  9 -
 2 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/arch/arm64/include/asm/debug-monitors.h b/arch/arm64/include/asm/debug-monitors.h
index 6e9b5b3..7fb3437 100644
--- a/arch/arm64/include/asm/debug-monitors.h
+++ b/arch/arm64/include/asm/debug-monitors.h
@@ -18,6 +18,15 @@

 #ifdef __KERNEL__

+/* Low-level stepping controls. */
+#define DBG_MDSCR_SS		(1 << 0)
+#define DBG_SPSR_SS		(1 << 21)
+
+/* MDSCR_EL1 enabling bits */
+#define DBG_MDSCR_KDE		(1 << 13)
+#define DBG_MDSCR_MDE		(1 << 15)
+#define DBG_MDSCR_MASK		~(DBG_MDSCR_KDE | DBG_MDSCR_MDE)
+
 #define	DBG_ESR_EVT(x)		(((x) >> 27) & 0x7)

 /* AArch64 */
@@ -73,11 +82,6 @@
 #define CACHE_FLUSH_IS_SAFE	1

-enum debug_el {
-	DBG_ACTIVE_EL0 = 0,
-	DBG_ACTIVE_EL1,
-};
-
 /* AArch32 */
 #define DBG_ESR_EVT_BKPT	0x4
 #define DBG_ESR_EVT_VECC	0x5
@@ -115,6 +119,11 @@ void unregister_break_hook(struct break_hook *hook);

 u8 debug_monitors_arch(void);

+enum debug_el {
+	DBG_ACTIVE_EL0 = 0,
+	DBG_ACTIVE_EL1,
+};
+
 void enable_debug_monitors(enum debug_el el);
 void disable_debug_monitors(enum debug_el el);

diff --git a/arch/arm64/kernel/debug-monitors.c b/arch/arm64/kernel/debug-monitors.c
index a7fb874..e022f87 100644
--- a/arch/arm64/kernel/debug-monitors.c
+++ b/arch/arm64/kernel/debug-monitors.c
@@ -30,15 +30,6 @@
 #include <asm/cputype.h>
 #include <asm/system_misc.h>

-/* Low-level stepping controls. */
-#define DBG_MDSCR_SS		(1 << 0)
-#define DBG_SPSR_SS		(1 << 21)
-
-/* MDSCR_EL1 enabling bits */
-#define DBG_MDSCR_KDE		(1 << 13)
-#define DBG_MDSCR_MDE		(1 << 15)
-#define DBG_MDSCR_MASK		~(DBG_MDSCR_KDE | DBG_MDSCR_MDE)
-
 /* Determine debug architecture. */
 u8 debug_monitors_arch(void)
 {
--
1.8.3.4
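For readers unfamiliar with these masks, here is a small sketch of how the moved DBG_MDSCR_* definitions are typically combined. The `mdscr_disable_debug` helper is hypothetical, not part of the patch; only the macro definitions come from it.

```c
#include <stdint.h>

/* Definitions as moved into asm/debug-monitors.h by this patch. */
#define DBG_MDSCR_SS   (1 << 0)
#define DBG_MDSCR_KDE  (1 << 13)
#define DBG_MDSCR_MDE  (1 << 15)
#define DBG_MDSCR_MASK ~(DBG_MDSCR_KDE | DBG_MDSCR_MDE)

/* Hypothetical helper: clear the monitor/kernel debug enable bits
 * while preserving the rest of an MDSCR_EL1 value (e.g. the SS bit). */
static uint32_t mdscr_disable_debug(uint32_t mdscr)
{
    return mdscr & DBG_MDSCR_MASK;
}
```

Because DBG_MDSCR_MASK expands to `~(KDE | MDE)`, ANDing with it removes exactly those two enable bits and nothing else, which is why KVM wants the macros visible outside debug-monitors.c.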
[PATCH 1/9] arm64: KVM: rename pm_fake handler to trap_wi_raz
pm_fake doesn't quite describe what the handler does (ignoring writes and returning 0 for reads). As we're about to use it (a lot) in a different context, rename it with an (admittedly cryptic) name that makes sense for all users.

Signed-off-by: Marc Zyngier marc.zyng...@arm.com
---
 arch/arm64/kvm/sys_regs.c | 83 ---
 1 file changed, 43 insertions(+), 40 deletions(-)

diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 0324458..fc8d4e3 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -163,18 +163,9 @@ static bool access_sctlr(struct kvm_vcpu *vcpu,
 	return true;
 }

-/*
- * We could trap ID_DFR0 and tell the guest we don't support performance
- * monitoring. Unfortunately the patch to make the kernel check ID_DFR0 was
- * NAKed, so it will read the PMCR anyway.
- *
- * Therefore we tell the guest we have 0 counters. Unfortunately, we
- * must always support PMCCNTR (the cycle counter): we just RAZ/WI for
- * all PM registers, which doesn't crash the guest kernel at least.
- */
-static bool pm_fake(struct kvm_vcpu *vcpu,
-		    const struct sys_reg_params *p,
-		    const struct sys_reg_desc *r)
+static bool trap_wi_raz(struct kvm_vcpu *vcpu,
+			const struct sys_reg_params *p,
+			const struct sys_reg_desc *r)
 {
 	if (p->is_write)
 		return ignore_write(vcpu, p);
@@ -201,6 +192,17 @@ static void reset_mpidr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r)
 /*
  * Architected system registers.
  * Important: Must be sorted ascending by Op0, Op1, CRn, CRm, Op2
+ *
+ * We could trap ID_DFR0 and tell the guest we don't support performance
+ * monitoring. Unfortunately the patch to make the kernel check ID_DFR0 was
+ * NAKed, so it will read the PMCR anyway.
+ *
+ * Therefore we tell the guest we have 0 counters. Unfortunately, we
+ * must always support PMCCNTR (the cycle counter): we just RAZ/WI for
+ * all PM registers, which doesn't crash the guest kernel at least.
+ *
+ * Same goes for the whole debug infrastructure, which probably breaks
+ * some guest functionality. This should be fixed.
  */
 static const struct sys_reg_desc sys_reg_descs[] = {
 	/* DC ISW */
@@ -260,10 +262,10 @@ static const struct sys_reg_desc sys_reg_descs[] = {
 	/* PMINTENSET_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b1001), CRm(0b1110), Op2(0b001),
-	  pm_fake },
+	  trap_wi_raz },
 	/* PMINTENCLR_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b1001), CRm(0b1110), Op2(0b010),
-	  pm_fake },
+	  trap_wi_raz },

 	/* MAIR_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b1010), CRm(0b0010), Op2(0b000),
@@ -292,43 +294,43 @@ static const struct sys_reg_desc sys_reg_descs[] = {
 	/* PMCR_EL0 */
 	{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1100), Op2(0b000),
-	  pm_fake },
+	  trap_wi_raz },
 	/* PMCNTENSET_EL0 */
 	{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1100), Op2(0b001),
-	  pm_fake },
+	  trap_wi_raz },
 	/* PMCNTENCLR_EL0 */
 	{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1100), Op2(0b010),
-	  pm_fake },
+	  trap_wi_raz },
 	/* PMOVSCLR_EL0 */
 	{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1100), Op2(0b011),
-	  pm_fake },
+	  trap_wi_raz },
 	/* PMSWINC_EL0 */
 	{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1100), Op2(0b100),
-	  pm_fake },
+	  trap_wi_raz },
 	/* PMSELR_EL0 */
 	{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1100), Op2(0b101),
-	  pm_fake },
+	  trap_wi_raz },
 	/* PMCEID0_EL0 */
 	{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1100), Op2(0b110),
-	  pm_fake },
+	  trap_wi_raz },
 	/* PMCEID1_EL0 */
 	{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1100), Op2(0b111),
-	  pm_fake },
+	  trap_wi_raz },
 	/* PMCCNTR_EL0 */
 	{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1101), Op2(0b000),
-	  pm_fake },
+	  trap_wi_raz },
 	/* PMXEVTYPER_EL0 */
 	{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1101), Op2(0b001),
-	  pm_fake },
+	  trap_wi_raz },
 	/* PMXEVCNTR_EL0 */
 	{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1101), Op2(0b010),
-	  pm_fake },
+	  trap_wi_raz },
 	/* PMUSERENR_EL0 */
 	{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1110), Op2(0b000),
-	  pm_fake },
+	  trap_wi_raz },
 	/*
PMOVSSET_EL0 */ { Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1110), Op2(0b011), - pm_fake }, + trap_wi_raz }, /* TPIDR_EL0 */ { Op0(0b11), Op1(0b011), CRn(0b1101), CRm(0b), Op2(0b010), @@ -374,19 +376,20 @@ static const struct sys_reg_desc cp15_regs[] = { { Op1( 0), CRn( 7), CRm(10), Op2( 2), access_dcsw }, { Op1( 0), CRn( 7), CRm(14), Op2( 2), access_dcsw
[PATCH 9/9] arm64: KVM: enable trapping of all debug registers
Enable trapping of the debug registers, preventing guests from messing with the host state (and allowing guests to use the debug infrastructure as well).

Signed-off-by: Marc Zyngier marc.zyng...@arm.com
---
 arch/arm64/kvm/hyp.S | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S
index f9d5a1d..037d5d9 100644
--- a/arch/arm64/kvm/hyp.S
+++ b/arch/arm64/kvm/hyp.S
@@ -753,6 +753,14 @@ __kvm_hyp_code_start:
 	mrs	x2, mdcr_el2
 	and	x2, x2, #MDCR_EL2_HPMN_MASK
 	orr	x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR)
+	orr	x2, x2, #(MDCR_EL2_TDRA | MDCR_EL2_TDOSA)
+
+	// Check for KVM_ARM64_DEBUG_DIRTY, and set debug to trap
+	// if not dirty.
+	ldr	x3, [x0, #VCPU_DEBUG_FLAGS]
+	tbnz	x3, #KVM_ARM64_DEBUG_DIRTY_SHIFT, 1f
+	orr	x2, x2, #MDCR_EL2_TDA
+1:
 	msr	mdcr_el2, x2
 .endm
--
1.8.3.4
[PATCH 4/9] arm64: KVM: common infrastructure for handling AArch32 CP14/CP15
As we're about to trap a bunch of CP14 registers, let's rework the CP15 handling so it can be generalized and work with multiple tables. Signed-off-by: Marc Zyngier marc.zyng...@arm.com --- arch/arm64/include/asm/kvm_asm.h| 2 +- arch/arm64/include/asm/kvm_coproc.h | 3 +- arch/arm64/include/asm/kvm_host.h | 9 ++- arch/arm64/kvm/handle_exit.c| 4 +- arch/arm64/kvm/sys_regs.c | 121 +--- 5 files changed, 111 insertions(+), 28 deletions(-) diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h index e6b159a..12f9dd7 100644 --- a/arch/arm64/include/asm/kvm_asm.h +++ b/arch/arm64/include/asm/kvm_asm.h @@ -93,7 +93,7 @@ #define c10_AMAIR0 (AMAIR_EL1 * 2) /* Aux Memory Attr Indirection Reg */ #define c10_AMAIR1 (c10_AMAIR0 + 1)/* Aux Memory Attr Indirection Reg */ #define c14_CNTKCTL(CNTKCTL_EL1 * 2) /* Timer Control Register (PL1) */ -#define NR_CP15_REGS (NR_SYS_REGS * 2) +#define NR_COPRO_REGS (NR_SYS_REGS * 2) #define ARM_EXCEPTION_IRQ0 #define ARM_EXCEPTION_TRAP 1 diff --git a/arch/arm64/include/asm/kvm_coproc.h b/arch/arm64/include/asm/kvm_coproc.h index 9a59301..0b52377 100644 --- a/arch/arm64/include/asm/kvm_coproc.h +++ b/arch/arm64/include/asm/kvm_coproc.h @@ -39,7 +39,8 @@ void kvm_register_target_sys_reg_table(unsigned int target, struct kvm_sys_reg_target_table *table); int kvm_handle_cp14_load_store(struct kvm_vcpu *vcpu, struct kvm_run *run); -int kvm_handle_cp14_access(struct kvm_vcpu *vcpu, struct kvm_run *run); +int kvm_handle_cp14_32(struct kvm_vcpu *vcpu, struct kvm_run *run); +int kvm_handle_cp14_64(struct kvm_vcpu *vcpu, struct kvm_run *run); int kvm_handle_cp15_32(struct kvm_vcpu *vcpu, struct kvm_run *run); int kvm_handle_cp15_64(struct kvm_vcpu *vcpu, struct kvm_run *run); int kvm_handle_sys_reg(struct kvm_vcpu *vcpu, struct kvm_run *run); diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index 4737961..31cff7a 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ 
b/arch/arm64/include/asm/kvm_host.h @@ -86,7 +86,7 @@ struct kvm_cpu_context { struct kvm_regs gp_regs; union { u64 sys_regs[NR_SYS_REGS]; - u32 cp15[NR_CP15_REGS]; + u32 copro[NR_COPRO_REGS]; }; }; @@ -141,7 +141,12 @@ struct kvm_vcpu_arch { #define vcpu_gp_regs(v)((v)-arch.ctxt.gp_regs) #define vcpu_sys_reg(v,r) ((v)-arch.ctxt.sys_regs[(r)]) -#define vcpu_cp15(v,r) ((v)-arch.ctxt.cp15[(r)]) +/* + * CP14 and CP15 live in the same array, as they are backed by the + * same system registers. + */ +#define vcpu_cp14(v,r) ((v)-arch.ctxt.copro[(r)]) +#define vcpu_cp15(v,r) ((v)-arch.ctxt.copro[(r)]) struct kvm_vm_stat { u32 remote_tlb_flush; diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c index 7bc41ea..f0ca49f 100644 --- a/arch/arm64/kvm/handle_exit.c +++ b/arch/arm64/kvm/handle_exit.c @@ -69,9 +69,9 @@ static exit_handle_fn arm_exit_handlers[] = { [ESR_EL2_EC_WFI]= kvm_handle_wfx, [ESR_EL2_EC_CP15_32]= kvm_handle_cp15_32, [ESR_EL2_EC_CP15_64]= kvm_handle_cp15_64, - [ESR_EL2_EC_CP14_MR]= kvm_handle_cp14_access, + [ESR_EL2_EC_CP14_MR]= kvm_handle_cp14_32, [ESR_EL2_EC_CP14_LS]= kvm_handle_cp14_load_store, - [ESR_EL2_EC_CP14_64]= kvm_handle_cp14_access, + [ESR_EL2_EC_CP14_64]= kvm_handle_cp14_64, [ESR_EL2_EC_HVC32] = handle_hvc, [ESR_EL2_EC_SMC32] = handle_smc, [ESR_EL2_EC_HVC64] = handle_hvc, diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c index 618d4fb..feafd8d 100644 --- a/arch/arm64/kvm/sys_regs.c +++ b/arch/arm64/kvm/sys_regs.c @@ -474,6 +474,10 @@ static const struct sys_reg_desc sys_reg_descs[] = { NULL, reset_val, FPEXC32_EL2, 0x70 }, }; +/* Trapped cp14 registers */ +static const struct sys_reg_desc cp14_regs[] = { +}; + /* * Trapped cp15 registers. 
TTBR0/TTBR1 get a double encoding, * depending on the way they are accessed (as a 32bit or a 64bit @@ -581,26 +585,19 @@ int kvm_handle_cp14_load_store(struct kvm_vcpu *vcpu, struct kvm_run *run) return 1; } -int kvm_handle_cp14_access(struct kvm_vcpu *vcpu, struct kvm_run *run) -{ - kvm_inject_undefined(vcpu); - return 1; -} - -static void emulate_cp15(struct kvm_vcpu *vcpu, -const struct sys_reg_params *params) +static int emulate_cp(struct kvm_vcpu *vcpu, + const struct sys_reg_params *params, + const struct sys_reg_desc *table, + size_t num) { - size_t num; - const struct sys_reg_desc *table, *r; + const struct sys_reg_desc *r; - table = get_target_table(vcpu-arch.target, false, num); + if (!table) +
[PATCH 8/9] arm64: KVM: implement lazy world switch for debug registers
Implement switching of the debug registers. While the number of registers is massive, CPUs usually don't implement them all (A57 has 6 breakpoints and 4 watchpoints, which gives us a total of 22 registers only). Also, we only save/restore them when MDSCR_EL1 has debug enabled, or when we've flagged the debug registers as dirty. It means that most of the time, we only save/restore MDSCR_EL1. Signed-off-by: Marc Zyngier marc.zyng...@arm.com --- arch/arm64/kernel/asm-offsets.c | 1 + arch/arm64/kvm/hyp.S| 449 +++- 2 files changed, 444 insertions(+), 6 deletions(-) diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c index 646f888..ae73a83 100644 --- a/arch/arm64/kernel/asm-offsets.c +++ b/arch/arm64/kernel/asm-offsets.c @@ -120,6 +120,7 @@ int main(void) DEFINE(VCPU_ESR_EL2, offsetof(struct kvm_vcpu, arch.fault.esr_el2)); DEFINE(VCPU_FAR_EL2, offsetof(struct kvm_vcpu, arch.fault.far_el2)); DEFINE(VCPU_HPFAR_EL2, offsetof(struct kvm_vcpu, arch.fault.hpfar_el2)); + DEFINE(VCPU_DEBUG_FLAGS, offsetof(struct kvm_vcpu, arch.debug_flags)); DEFINE(VCPU_HCR_EL2, offsetof(struct kvm_vcpu, arch.hcr_el2)); DEFINE(VCPU_IRQ_LINES, offsetof(struct kvm_vcpu, arch.irq_lines)); DEFINE(VCPU_HOST_CONTEXT,offsetof(struct kvm_vcpu, arch.host_cpu_context)); diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S index 2c56012..f9d5a1d 100644 --- a/arch/arm64/kvm/hyp.S +++ b/arch/arm64/kvm/hyp.S @@ -21,6 +21,7 @@ #include asm/assembler.h #include asm/memory.h #include asm/asm-offsets.h +#include asm/debug-monitors.h #include asm/fpsimdmacros.h #include asm/kvm.h #include asm/kvm_asm.h @@ -215,6 +216,7 @@ __kvm_hyp_code_start: mrs x22,amair_el1 mrs x23,cntkctl_el1 mrs x24,par_el1 + mrs x25,mdscr_el1 stp x4, x5, [x3] stp x6, x7, [x3, #16] @@ -226,7 +228,202 @@ __kvm_hyp_code_start: stp x18, x19, [x3, #112] stp x20, x21, [x3, #128] stp x22, x23, [x3, #144] - str x24, [x3, #160] + stp x24, x25, [x3, #160] +.endm + +.macro save_debug + // x2: base address for cpu context 
+ // x3: tmp register + + mrs x26, id_aa64dfr0_el1 + ubfxx24, x26, #12, #4 // Extract BRPs + ubfxx25, x26, #20, #4 // Extract WRPs + mov w26, #15 + sub w24, w26, w24 // How many BPs to skip + sub w25, w26, w25 // How many WPs to skip + + add x3, x2, #CPU_SYSREG_OFFSET(DBGBCR0_EL1) + + adr x26, 1f + add x26, x26, x24, lsl #2 + br x26 +1: + mrs x20, dbgbcr15_el1 + mrs x19, dbgbcr14_el1 + mrs x18, dbgbcr13_el1 + mrs x17, dbgbcr12_el1 + mrs x16, dbgbcr11_el1 + mrs x15, dbgbcr10_el1 + mrs x14, dbgbcr9_el1 + mrs x13, dbgbcr8_el1 + mrs x12, dbgbcr7_el1 + mrs x11, dbgbcr6_el1 + mrs x10, dbgbcr5_el1 + mrs x9, dbgbcr4_el1 + mrs x8, dbgbcr3_el1 + mrs x7, dbgbcr2_el1 + mrs x6, dbgbcr1_el1 + mrs x5, dbgbcr0_el1 + + adr x26, 1f + add x26, x26, x24, lsl #2 + br x26 + +1: + str x20, [x3, #(15 * 8)] + str x19, [x3, #(14 * 8)] + str x18, [x3, #(13 * 8)] + str x17, [x3, #(12 * 8)] + str x16, [x3, #(11 * 8)] + str x15, [x3, #(10 * 8)] + str x14, [x3, #(9 * 8)] + str x13, [x3, #(8 * 8)] + str x12, [x3, #(7 * 8)] + str x11, [x3, #(6 * 8)] + str x10, [x3, #(5 * 8)] + str x9, [x3, #(4 * 8)] + str x8, [x3, #(3 * 8)] + str x7, [x3, #(2 * 8)] + str x6, [x3, #(1 * 8)] + str x5, [x3, #(0 * 8)] + + add x3, x2, #CPU_SYSREG_OFFSET(DBGBVR0_EL1) + + adr x26, 1f + add x26, x26, x24, lsl #2 + br x26 +1: + mrs x20, dbgbvr15_el1 + mrs x19, dbgbvr14_el1 + mrs x18, dbgbvr13_el1 + mrs x17, dbgbvr12_el1 + mrs x16, dbgbvr11_el1 + mrs x15, dbgbvr10_el1 + mrs x14, dbgbvr9_el1 + mrs x13, dbgbvr8_el1 + mrs x12, dbgbvr7_el1 + mrs x11, dbgbvr6_el1 + mrs x10, dbgbvr5_el1 + mrs x9, dbgbvr4_el1 + mrs x8, dbgbvr3_el1 + mrs x7, dbgbvr2_el1 + mrs x6, dbgbvr1_el1 + mrs x5, dbgbvr0_el1 + + adr x26, 1f + add x26, x26, x24, lsl #2 + br x26 + +1: + str x20, [x3, #(15 * 8)] + str x19, [x3, #(14 * 8)] + str x18, [x3, #(13 * 8)] + str x17, [x3, #(12 * 8)] + str x16, [x3, #(11 * 8)] + str x15, [x3, #(10 * 8)] + str x14, [x3, #(9 * 8)] + str x13,
[PATCH 6/9] arm64: KVM: check ordering of all system register tables
We now have multiple tables for the various system registers we trap. Make sure we check the order of all of them, as it is critical that we get the order right (been there, done that...).

Signed-off-by: Marc Zyngier marc.zyng...@arm.com
---
 arch/arm64/kvm/sys_regs.c | 22 --
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 91ca0e4..c27a5cd 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -1280,14 +1280,32 @@ int kvm_arm_copy_sys_reg_indices(struct kvm_vcpu *vcpu, u64 __user *uindices)
 	return write_demux_regids(uindices);
 }

+static int check_sysreg_table(const struct sys_reg_desc *table, unsigned int n)
+{
+	unsigned int i;
+
+	for (i = 1; i < n; i++) {
+		if (cmp_sys_reg(&table[i-1], &table[i]) >= 0) {
+			kvm_err("sys_reg table %p out of order (%d)\n", table, i - 1);
+			return 1;
+		}
+	}
+
+	return 0;
+}
+
 void kvm_sys_reg_table_init(void)
 {
 	unsigned int i;
 	struct sys_reg_desc clidr;

 	/* Make sure tables are unique and in order. */
-	for (i = 1; i < ARRAY_SIZE(sys_reg_descs); i++)
-		BUG_ON(cmp_sys_reg(&sys_reg_descs[i-1], &sys_reg_descs[i]) >= 0);
+	BUG_ON(check_sysreg_table(sys_reg_descs, ARRAY_SIZE(sys_reg_descs)));
+	BUG_ON(check_sysreg_table(cp14_regs, ARRAY_SIZE(cp14_regs)));
+	BUG_ON(check_sysreg_table(cp14_64_regs, ARRAY_SIZE(cp14_64_regs)));
+	BUG_ON(check_sysreg_table(cp15_regs, ARRAY_SIZE(cp15_regs)));
+	BUG_ON(check_sysreg_table(cp15_64_regs, ARRAY_SIZE(cp15_64_regs)));
+	BUG_ON(check_sysreg_table(invariant_sys_regs, ARRAY_SIZE(invariant_sys_regs)));

 	/* We abuse the reset function to overwrite the table itself. */
 	for (i = 0; i < ARRAY_SIZE(invariant_sys_regs); i++)
--
1.8.3.4
[PATCH 5/9] arm64: KVM: use separate tables for AArch32 32 and 64bit traps
An interesting feature of the CP14 encoding is that there is an overlap between 32 and 64bit registers, meaning they cannot live in the same table as we did for CP15. Create separate tables for 64bit CP14 and CP15 registers, and let the top level handler use the right one. Signed-off-by: Marc Zyngier marc.zyng...@arm.com --- arch/arm64/kvm/sys_regs.c | 13 ++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c index feafd8d..91ca0e4 100644 --- a/arch/arm64/kvm/sys_regs.c +++ b/arch/arm64/kvm/sys_regs.c @@ -478,13 +478,16 @@ static const struct sys_reg_desc sys_reg_descs[] = { static const struct sys_reg_desc cp14_regs[] = { }; +/* Trapped cp14 64bit registers */ +static const struct sys_reg_desc cp14_64_regs[] = { +}; + /* * Trapped cp15 registers. TTBR0/TTBR1 get a double encoding, * depending on the way they are accessed (as a 32bit or a 64bit * register). */ static const struct sys_reg_desc cp15_regs[] = { - { Op1( 0), CRn( 0), CRm( 2), Op2( 0), access_vm_reg, NULL, c2_TTBR0 }, { Op1( 0), CRn( 1), CRm( 0), Op2( 0), access_sctlr, NULL, c1_SCTLR }, { Op1( 0), CRn( 2), CRm( 0), Op2( 0), access_vm_reg, NULL, c2_TTBR0 }, { Op1( 0), CRn( 2), CRm( 0), Op2( 1), access_vm_reg, NULL, c2_TTBR1 }, @@ -525,6 +528,10 @@ static const struct sys_reg_desc cp15_regs[] = { { Op1( 0), CRn(10), CRm( 3), Op2( 1), access_vm_reg, NULL, c10_AMAIR1 }, { Op1( 0), CRn(13), CRm( 0), Op2( 1), access_vm_reg, NULL, c13_CID }, +}; + +static const struct sys_reg_desc cp15_64_regs[] = { + { Op1( 0), CRn( 0), CRm( 2), Op2( 0), access_vm_reg, NULL, c2_TTBR0 }, { Op1( 1), CRn( 0), CRm( 2), Op2( 0), access_vm_reg, NULL, c2_TTBR1 }, }; @@ -740,7 +747,7 @@ int kvm_handle_cp15_64(struct kvm_vcpu *vcpu, struct kvm_run *run) target_specific = get_target_table(vcpu-arch.target, false, num); return kvm_handle_cp_64(vcpu, - cp15_regs, ARRAY_SIZE(cp15_regs), + cp15_64_regs, ARRAY_SIZE(cp15_64_regs), target_specific, num); } @@ -758,7 +765,7 @@ 
int kvm_handle_cp15_32(struct kvm_vcpu *vcpu, struct kvm_run *run)

 int kvm_handle_cp14_64(struct kvm_vcpu *vcpu, struct kvm_run *run)
 {
 	return kvm_handle_cp_64(vcpu,
-				cp14_regs, ARRAY_SIZE(cp14_regs),
+				cp14_64_regs, ARRAY_SIZE(cp14_64_regs),
 				NULL, 0);
 }
--
1.8.3.4
Re: [PATCH 1/5] KVM: x86: Emulator does not calculate address correctly
On 5/7/14, 4:57 PM, Paolo Bonzini wrote: Il 07/05/2014 14:32, Nadav Amit ha scritto: In long-mode, when the address size is 4 bytes, the linear address is not truncated as the emulator mistakenly does. Instead, the offset within the segment (the ea field) should be truncated according to the address size. As the Intel SDM says: "In 64-bit mode, the effective address components are added and the effective address is truncated ... before adding the full 64-bit segment base."

Signed-off-by: Nadav Amit na...@cs.technion.ac.il
---
 arch/x86/kvm/emulate.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index e8a5840..743e8e3 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -631,7 +631,8 @@ static int __linearize(struct x86_emulate_ctxt *ctxt,
 	u16 sel;
 	unsigned cpl;

-	la = seg_base(ctxt, addr.seg) + addr.ea;
+	la = seg_base(ctxt, addr.seg) +
+	     (ctxt->ad_bytes == 8 ? addr.ea : (u32)addr.ea);

I think you need fetch || ctxt->ad_bytes == 8 here.

Paolo

Yes. I did not test the fetch scenario. I intend to do so soon to avoid such mistakes.

Nadav
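The fix, with Paolo's amendment for fetches, can be sketched outside the emulator as follows. `linearize` here is a simplified, hypothetical model of `__linearize` that ignores segment limits and canonical checks; it only demonstrates the truncation order.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t u32;
typedef uint64_t u64;

/* With a 32-bit address size in long mode, the effective address (ea)
 * is truncated to 32 bits *before* the segment base is added.
 * Instruction fetches always use the full effective address. */
static u64 linearize(u64 seg_base, u64 ea, int ad_bytes, bool fetch)
{
    if (ad_bytes != 8 && !fetch)
        ea = (u32)ea;          /* truncate the offset, not the sum */
    return seg_base + ea;
}
```

The buggy behavior was equivalent to `(u32)(seg_base + ea)`, i.e. truncating the final linear address, which gives a different result whenever the segment base is non-zero.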
[PATCH 3/9] arm64: KVM: add trap handlers for AArch64 debug registers
Add handlers for all the AArch64 debug registers that are accessible from EL0 or EL1. The trapping code keeps track of the state of the debug registers, allowing for the switch code to implement a lazy switching strategy. Signed-off-by: Marc Zyngier marc.zyng...@arm.com --- arch/arm64/include/asm/kvm_asm.h | 28 ++-- arch/arm64/include/asm/kvm_host.h | 3 + arch/arm64/kvm/sys_regs.c | 130 +- 3 files changed, 151 insertions(+), 10 deletions(-) diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h index 9fcd54b..e6b159a 100644 --- a/arch/arm64/include/asm/kvm_asm.h +++ b/arch/arm64/include/asm/kvm_asm.h @@ -43,14 +43,25 @@ #defineAMAIR_EL1 19 /* Aux Memory Attribute Indirection Register */ #defineCNTKCTL_EL1 20 /* Timer Control Register (EL1) */ #definePAR_EL1 21 /* Physical Address Register */ +#define MDSCR_EL1 22 /* Monitor Debug System Control Register */ +#define DBGBCR0_EL123 /* Debug Breakpoint Control Registers (0-15) */ +#define DBGBCR15_EL1 38 +#define DBGBVR0_EL139 /* Debug Breakpoint Value Registers (0-15) */ +#define DBGBVR15_EL1 54 +#define DBGWCR0_EL155 /* Debug Watchpoint Control Registers (0-15) */ +#define DBGWCR15_EL1 70 +#define DBGWVR0_EL171 /* Debug Watchpoint Value Registers (0-15) */ +#define DBGWVR15_EL1 86 +#define MDCCINT_EL187 /* Monitor Debug Comms Channel Interrupt Enable Reg */ + /* 32bit specific registers. 
Keep them at the end of the range */ -#defineDACR32_EL2 22 /* Domain Access Control Register */ -#defineIFSR32_EL2 23 /* Instruction Fault Status Register */ -#defineFPEXC32_EL2 24 /* Floating-Point Exception Control Register */ -#defineDBGVCR32_EL225 /* Debug Vector Catch Register */ -#defineTEECR32_EL1 26 /* ThumbEE Configuration Register */ -#defineTEEHBR32_EL127 /* ThumbEE Handler Base Register */ -#defineNR_SYS_REGS 28 +#defineDACR32_EL2 88 /* Domain Access Control Register */ +#defineIFSR32_EL2 89 /* Instruction Fault Status Register */ +#defineFPEXC32_EL2 90 /* Floating-Point Exception Control Register */ +#defineDBGVCR32_EL291 /* Debug Vector Catch Register */ +#defineTEECR32_EL1 92 /* ThumbEE Configuration Register */ +#defineTEEHBR32_EL193 /* ThumbEE Handler Base Register */ +#defineNR_SYS_REGS 94 /* 32bit mapping */ #define c0_MPIDR (MPIDR_EL1 * 2) /* MultiProcessor ID Register */ @@ -87,6 +98,9 @@ #define ARM_EXCEPTION_IRQ0 #define ARM_EXCEPTION_TRAP 1 +#define KVM_ARM64_DEBUG_DIRTY_SHIFT0 +#define KVM_ARM64_DEBUG_DIRTY (1 KVM_ARM64_DEBUG_DIRTY_SHIFT) + #ifndef __ASSEMBLY__ struct kvm; struct kvm_vcpu; diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index 0a1d697..4737961 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -101,6 +101,9 @@ struct kvm_vcpu_arch { /* Exception Information */ struct kvm_vcpu_fault_info fault; + /* Debug state */ + u64 debug_flags; + /* Pointer to host CPU context */ kvm_cpu_context_t *host_cpu_context; diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c index fc8d4e3..618d4fb 100644 --- a/arch/arm64/kvm/sys_regs.c +++ b/arch/arm64/kvm/sys_regs.c @@ -30,6 +30,7 @@ #include asm/kvm_mmu.h #include asm/cacheflush.h #include asm/cputype.h +#include asm/debug-monitors.h #include trace/events/kvm.h #include sys_regs.h @@ -173,6 +174,58 @@ static bool trap_wi_raz(struct kvm_vcpu *vcpu, return read_zero(vcpu, p); } +static bool 
trap_oslsr_el1(struct kvm_vcpu *vcpu, + const struct sys_reg_params *p, + const struct sys_reg_desc *r) +{ + if (p-is_write) { + return ignore_write(vcpu, p); + } else { + *vcpu_reg(vcpu, p-Rt) = (1 3); + return true; + } +} + +static bool trap_dbgauthstatus_el1(struct kvm_vcpu *vcpu, + const struct sys_reg_params *p, + const struct sys_reg_desc *r) +{ + if (p-is_write) { + return ignore_write(vcpu, p); + } else { + *vcpu_reg(vcpu, p-Rt) = 0x; /* Implemented and disabled */ + return true; + } +} + +/* + * Trap handler for DBG[BW][CV]Rn_EL1 and MDSCR_EL1. We track the + * dirtiness of the registers. + */ +static bool trap_debug_regs(struct kvm_vcpu *vcpu, + const struct sys_reg_params *p, + const struct sys_reg_desc *r) +{ + /* +* The best thing to do would be to trap MDSCR_EL1
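The lazy strategy these trap handlers feed can be modeled in a few lines of user-space C. This is a sketch only: `struct vcpu_model`, `trap_debug_write`, and `world_switch` are hypothetical stand-ins for the kernel's `kvm_vcpu_arch`, `trap_debug_regs`, and the hyp switch code; the flag names come from the patch.

```c
#include <stdint.h>

#define KVM_ARM64_DEBUG_DIRTY_SHIFT 0
#define KVM_ARM64_DEBUG_DIRTY (1 << KVM_ARM64_DEBUG_DIRTY_SHIFT)

struct vcpu_model {           /* hypothetical stand-in for kvm_vcpu_arch */
    uint64_t debug_flags;
    int full_switches;        /* counts expensive full save/restore passes */
};

/* A trapped write to any debug register marks the state dirty. */
static void trap_debug_write(struct vcpu_model *v)
{
    v->debug_flags |= KVM_ARM64_DEBUG_DIRTY;
}

/* World switch: only pay for the full debug register file when dirty.
 * MDSCR_EL1 alone would be saved/restored unconditionally. */
static void world_switch(struct vcpu_model *v)
{
    if (v->debug_flags & KVM_ARM64_DEBUG_DIRTY)
        v->full_switches++;   /* save/restore all DBG[BW][CV]Rn here */
}
```

The point of the pattern is that a guest which never touches its debug registers never triggers the expensive path, which is what makes the per-switch cost of patch 8/9 acceptable.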
[PATCH 7/9] arm64: KVM: add trap handlers for AArch32 debug registers
Add handlers for all the AArch32 debug registers that are accessible from EL0 or EL1. The code follows the same strategy as the AArch64 counterpart with regards to tracking the dirty state of the debug registers. Signed-off-by: Marc Zyngier marc.zyng...@arm.com --- arch/arm64/include/asm/kvm_asm.h | 9 +++ arch/arm64/kvm/sys_regs.c | 137 ++- 2 files changed, 145 insertions(+), 1 deletion(-) diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h index 12f9dd7..993a7db 100644 --- a/arch/arm64/include/asm/kvm_asm.h +++ b/arch/arm64/include/asm/kvm_asm.h @@ -93,6 +93,15 @@ #define c10_AMAIR0 (AMAIR_EL1 * 2) /* Aux Memory Attr Indirection Reg */ #define c10_AMAIR1 (c10_AMAIR0 + 1) /* Aux Memory Attr Indirection Reg */ #define c14_CNTKCTL (CNTKCTL_EL1 * 2) /* Timer Control Register (PL1) */ + +#define cp14_DBGDSCRext (MDSCR_EL1 * 2) +#define cp14_DBGBCR0 (DBGBCR0_EL1 * 2) +#define cp14_DBGBVR0 (DBGBVR0_EL1 * 2) +#define cp14_DBGBXVR0 (cp14_DBGBVR0 + 1) +#define cp14_DBGWCR0 (DBGWCR0_EL1 * 2) +#define cp14_DBGWVR0 (DBGWVR0_EL1 * 2) +#define cp14_DBGDCCINT (MDCCINT_EL1 * 2) + #define NR_COPRO_REGS (NR_SYS_REGS * 2) #define ARM_EXCEPTION_IRQ 0 diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c index c27a5cd..4861fe4 100644 --- a/arch/arm64/kvm/sys_regs.c +++ b/arch/arm64/kvm/sys_regs.c @@ -474,12 +474,148 @@ static const struct sys_reg_desc sys_reg_descs[] = { NULL, reset_val, FPEXC32_EL2, 0x70 }, }; +static bool trap_dbgidr(struct kvm_vcpu *vcpu, + const struct sys_reg_params *p, + const struct sys_reg_desc *r) +{ + if (p->is_write) { + return ignore_write(vcpu, p); + } else { + u64 dfr = read_cpuid(ID_AA64DFR0_EL1); + u64 pfr = read_cpuid(ID_AA64PFR0_EL1); + u32 el3 = !!((pfr >> 12) & 0xf); + + *vcpu_reg(vcpu, p->Rt) = ((((dfr >> 20) & 0xf) << 28) | + (((dfr >> 12) & 0xf) << 24) | + (((dfr >> 28) & 0xf) << 20) | + (6 << 16) | (el3 << 14) | (el3 << 12)); + return true; + } +} + +static bool trap_debug32(struct kvm_vcpu *vcpu, + const struct sys_reg_params *p, + const struct sys_reg_desc *r) +{ + if (p->is_write) { + vcpu_cp14(vcpu, r->reg) = *vcpu_reg(vcpu, p->Rt); + vcpu->arch.debug_flags |= KVM_ARM64_DEBUG_DIRTY; + } else { + *vcpu_reg(vcpu, p->Rt) = vcpu_cp14(vcpu, r->reg); + } + + return true; +} + +#define DBG_BCR_BVR_WCR_WVR(n) \ + /* DBGBCRn */ \ + { Op1( 0), CRn( 0), CRm((n)), Op2( 4), trap_debug32, \ + NULL, (cp14_DBGBCR0 + (n) * 2) }, \ + /* DBGBVRn */ \ + { Op1( 0), CRn( 0), CRm((n)), Op2( 5), trap_debug32, \ + NULL, (cp14_DBGBVR0 + (n) * 2) }, \ + /* DBGWVRn */ \ + { Op1( 0), CRn( 0), CRm((n)), Op2( 6), trap_debug32, \ + NULL, (cp14_DBGWVR0 + (n) * 2) }, \ + /* DBGWCRn */ \ + { Op1( 0), CRn( 0), CRm((n)), Op2( 7), trap_debug32, \ + NULL, (cp14_DBGWCR0 + (n) * 2) } + +#define DBGBXVR(n) \ + { Op1( 0), CRn( 1), CRm((n)), Op2( 1), trap_debug32, \ + NULL, cp14_DBGBXVR0 + n * 2 } + /* Trapped cp14 registers */ static const struct sys_reg_desc cp14_regs[] = { + /* DBGIDR */ + { Op1( 0), CRn( 0), CRm( 0), Op2( 0), trap_dbgidr }, + /* DBGDTRRXext */ + { Op1( 0), CRn( 0), CRm( 0), Op2( 2), trap_wi_raz }, + + DBG_BCR_BVR_WCR_WVR(0), + /* DBGDSCRint */ + { Op1( 0), CRn( 0), CRm( 1), Op2( 0), trap_wi_raz }, + DBG_BCR_BVR_WCR_WVR(1), + /* DBGDCCINT */ + { Op1( 0), CRn( 0), CRm( 2), Op2( 0), trap_debug32 }, + /* DBGDSCRext */ + { Op1( 0), CRn( 0), CRm( 2), Op2( 2), trap_debug32 }, + DBG_BCR_BVR_WCR_WVR(2), + /* DBGDTRTXext */ + { Op1( 0), CRn( 0), CRm( 3), Op2( 2), trap_wi_raz }, + DBG_BCR_BVR_WCR_WVR(3), + DBG_BCR_BVR_WCR_WVR(4), + DBG_BCR_BVR_WCR_WVR(5), + /* DBGWFAR */ + { Op1( 0), CRn( 0), CRm( 6), Op2( 0), trap_wi_raz }, + /* DBGOSECCR */ + { Op1( 0), CRn( 0), CRm( 6), Op2( 2), trap_wi_raz }, + DBG_BCR_BVR_WCR_WVR(6), + /* DBGVCR */ + { Op1( 0), CRn( 0), CRm( 7), Op2( 0), trap_debug32 }, + DBG_BCR_BVR_WCR_WVR(7), + DBG_BCR_BVR_WCR_WVR(8), + DBG_BCR_BVR_WCR_WVR(9), + DBG_BCR_BVR_WCR_WVR(10), + DBG_BCR_BVR_WCR_WVR(11), +
Re: KVM Nested L2 guest startup problems
On 07/05/2014 17:30, Abel Gordon wrote: ... which we already do. The only secondary execution controls we allow are APIC page, unrestricted guest, WBINVD exits, and of course EPT. But we don't verify if L1 tries to enable the feature for L1 (even if it's not exposed)... Or do we ? Yes, we do: if (!vmx_control_verify(vmcs12->cpu_based_vm_exec_control, nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high) || !vmx_control_verify(vmcs12->secondary_vm_exec_control, nested_vmx_secondary_ctls_low, nested_vmx_secondary_ctls_high) || !vmx_control_verify(vmcs12->pin_based_vm_exec_control, nested_vmx_pinbased_ctls_low, nested_vmx_pinbased_ctls_high) || !vmx_control_verify(vmcs12->vm_exit_controls, nested_vmx_exit_ctls_low, nested_vmx_exit_ctls_high) || !vmx_control_verify(vmcs12->vm_entry_controls, nested_vmx_entry_ctls_low, nested_vmx_entry_ctls_high)) { nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD); return 1; } Paolo
Re: [PATCH 0/9] arm64: KVM: debug infrastructure support
On 7 May 2014 16:20, Marc Zyngier marc.zyng...@arm.com wrote: This patch series adds debug support, a key feature missing from the KVM/arm64 port. The main idea is to keep track of whether the debug registers are dirty (changed by the guest) or not. In this case, perform the usual save/restore dance, for one run only. It means we only have a penalty if a guest is actually using the debug registers. The huge amount of registers is properly frightening, but CPUs actually only implement a subset of them. Also, there is a number of registers we don't bother emulating (things having to do with external debug and OSlock). Presumably these registers now appear in the userspace interface too, yes? Did you check that they all cope with the "migration reads all register values on the source and then writes them on the destination in arbitrary order" semantics without further fiddling? (I have a note that says that at least OSLAR_EL1/OSLSR_EL1 won't work that way, for instance.) thanks -- PMM
Re: [PATCH 1/9] arm64: KVM: rename pm_fake handler to trap_wi_raz
On 07/05/14 16:34, Peter Maydell wrote: On 7 May 2014 16:20, Marc Zyngier marc.zyng...@arm.com wrote: pm_fake doesn't quite describe what the handler does (ignoring writes and returning 0 for reads). As we're about to use it (a lot) in a different context, rename it with a (admittedly cryptic) name that makes sense for all users. -/* - * We could trap ID_DFR0 and tell the guest we don't support performance - * monitoring. Unfortunately the patch to make the kernel check ID_DFR0 was - * NAKed, so it will read the PMCR anyway. - * - * Therefore we tell the guest we have 0 counters. Unfortunately, we - * must always support PMCCNTR (the cycle counter): we just RAZ/WI for - * all PM registers, which doesn't crash the guest kernel at least. - */ -static bool pm_fake(struct kvm_vcpu *vcpu, - const struct sys_reg_params *p, - const struct sys_reg_desc *r) +static bool trap_wi_raz(struct kvm_vcpu *vcpu, + const struct sys_reg_params *p, + const struct sys_reg_desc *r) The standard term for this is RAZ/WI, not WI/RAZ, so why not trap_raz_wi ? Good point. I'll update it. Thanks, M. -- Jazz is not dead. It just smells funny...
Re: [PATCH 4/5] KVM: x86: Wrong register masking in 64-bit mode
On 07/05/2014 14:32, Nadav Amit wrote: 32-bit operations are zero extended in 64-bit mode. Currently, the code does not handle them correctly and keeps the high bits. In 16-bit mode, the high 32-bits are kept intact. In addition, although it is not well-documented, when address override prefix is used with REP-string instruction, RCX high half is zeroed even if ECX was zero on the first iteration (as if an assignment was performed to ECX). Is this true even for REPZ and ZF=0 or REPNZ and ZF=1? Paolo
Re: [PATCH 0/9] arm64: KVM: debug infrastructure support
On 07/05/14 16:42, Peter Maydell wrote: On 7 May 2014 16:20, Marc Zyngier marc.zyng...@arm.com wrote: This patch series adds debug support, a key feature missing from the KVM/arm64 port. The main idea is to keep track of whether the debug registers are dirty (changed by the guest) or not. In this case, perform the usual save/restore dance, for one run only. It means we only have a penalty if a guest is actually using the debug registers. The huge amount of registers is properly frightening, but CPUs actually only implement a subset of them. Also, there is a number of registers we don't bother emulating (things having to do with external debug and OSlock). Presumably these registers now appear in the userspace interface too, yes? Did you check that they all cope with the "migration reads all register values on the source and then writes them on the destination in arbitrary order" semantics without further fiddling? (I have a note that says that at least OSLAR_EL1/OSLSR_EL1 won't work that way, for instance.) The only registers that are exported to userspace are MDSCR_EL1, DBG{B,W}{C,V}Rn_EL1 and MDCCINT_EL1 (and their 32bit counterparts). They should be fine being saved/restored in any order, as long as you're not running the vcpu in between. The OSL*_EL1 registers are just ignored so far. So far, it is unclear how they would be supported in a guest, and I don't have any test environment for them. Should the need arise, support can be added on top of what we have now. M. -- Jazz is not dead. It just smells funny...
Re: [PATCH 5/5] KVM: x86: Fix wrong masking on relative jump/call
On 07/05/2014 14:32, Nadav Amit wrote: Relative jumps and calls do the masking according to the operand size, and not according to the address size as the KVM emulator does today. In 64-bit mode, the resulting RIP is always 64-bit. Otherwise it is masked according to the instruction operand-size. Note that when 16-bit address size is used, bits 63:32 are unmodified. The SDM says "If the operand-size attribute is 16, the upper two bytes of the EIP register are cleared, resulting in a maximum instruction pointer size of 16 bits." I'm not sure whether that should also imply that 63:32 are _not_ unmodified (because you do a 32-bit write not a 16-bit one), but in any case it looks like masked_increment is not the right function. Paolo
Re: [PATCH 5/5] KVM: x86: Fix wrong masking on relative jump/call
On 5/7/14, 5:43 PM, Bandan Das wrote: Nadav Amit na...@cs.technion.ac.il writes: Relative jumps and calls do the masking according to the operand size, and not according to the address size as the KVM emulator does today. In 64-bit mode, the resulting RIP is always 64-bit. Otherwise it is masked according to the instruction operand-size. Note that when 16-bit address size is used, bits 63:32 are unmodified. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/kvm/emulate.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 6833b41..e406705 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -506,7 +506,9 @@ static void rsp_increment(struct x86_emulate_ctxt *ctxt, int inc) static inline void jmp_rel(struct x86_emulate_ctxt *ctxt, int rel) { - register_address_increment(ctxt, &ctxt->_eip, rel); + /* 64-bit mode relative jumps are always 64-bit; otherwise mask */ + int op_bytes = ctxt->mode == X86EMUL_MODE_PROT64 ? 8 : ctxt->op_bytes; Just a nit, probably break this up for readability ? I will make it more readable on the next version. Thanks, Nadav
Re: [PATCH 11/11] kvm tools: Modify term_putc to write more than one char
Hello. On 06-05-2014 19:51, Andreas Herrmann wrote: From: David Daney david.da...@cavium.com It is a performance enhancement. When running in a simulator, each system call to write a character takes a lot of time. Batching them up decreases the overhead (in the root kernel) of each virtio console write. Signed-off-by: David Daney david.da...@cavium.com Signed-off-by: Andreas Herrmann andreas.herrm...@caviumnetworks.com --- tools/kvm/term.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/tools/kvm/term.c b/tools/kvm/term.c index 3de410b..b153eed 100644 --- a/tools/kvm/term.c +++ b/tools/kvm/term.c @@ -52,11 +52,14 @@ int term_getc(struct kvm *kvm, int term) int term_putc(char *addr, int cnt, int term) { int ret; + int num_remaining = cnt; - while (cnt--) { - ret = write(term_fds[term][TERM_FD_OUT], addr++, 1); + while (num_remaining) { + ret = write(term_fds[term][TERM_FD_OUT], addr, num_remaining); if (ret < 0) return 0; Perhaps 'return cnt - num_remaining' instead? + num_remaining -= ret; + addr += ret; } return cnt; WBR, Sergei
Re: [PATCH 0/5] KVM: x86: Fix exit handler and emulation bugs
On 07/05/2014 14:32, Nadav Amit wrote: This series of patches fixes various scenarios in which KVM does not follow x86 specifications. Patches #4 and #5 are related; they reflect a new revision of the previously submitted patch that dealt with the wrong masking of registers in long-mode. Patch #3 is a follow-up to the previously submitted patch that fixed the wrong reserved page table masks. Patches #3 and #5 were not tested in a manner that actually checks the modified behavior. Not all the paths in patch #4 were tested. Thanks for reviewing the patches. Nadav Amit (5): KVM: x86: Emulator does not calculate address correctly KVM: vmx: handle_dr does not handle RSP correctly KVM: x86: Mark bit 7 in long-mode PDPTE according to 1GB pages support KVM: x86: Wrong register masking in 64-bit mode KVM: x86: Fix wrong masking on relative jump/call arch/x86/kvm/cpuid.h | 7 +++ arch/x86/kvm/emulate.c | 47 +-- arch/x86/kvm/mmu.c | 8 ++-- arch/x86/kvm/vmx.c | 2 +- 4 files changed, 43 insertions(+), 21 deletions(-) I'll apply patches 2 and 3. Thanks for your work! Paolo
Re: [PATCH] kvm/x86: implement hv EOI assist
On 07/05/2014 15:29, Michael S. Tsirkin wrote: It seems that it's easy to implement the EOI assist on top of the PV EOI feature: simply convert the page address to the format expected by PV EOI. Notes: -No EOI required is set only if interrupt injected is edge triggered; this is true because level interrupts are going through IOAPIC which disables PV EOI. In any case, if guest triggers EOI the bit will get cleared on exit. -For migration, set of HV_X64_MSR_APIC_ASSIST_PAGE sets KVM_PV_EOI_EN internally, so restoring HV_X64_MSR_APIC_ASSIST_PAGE seems sufficient In any case, bit is cleared on exit so worst case it's never re-enabled -no handling of PV EOI data is performed at HV_X64_MSR_EOI write; HV_X64_MSR_EOI is a separate optimization - it's an X2APIC replacement that lets you do EOI with an MSR and not IO. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- patch is unchanged from RFC except for documenting the design points in the commit log, to address Marcelo's comments. This passed basic testing. Migration was not tested. I think at this point it's a good idea to merge this so it can get tested and used by more people. diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 8ae1ff5..d84d750fc 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1890,6 +1890,8 @@ static int set_msr_hyperv(struct kvm_vcpu *vcpu, u32 msr, u64 data) if (!(data & HV_X64_MSR_APIC_ASSIST_PAGE_ENABLE)) { vcpu->arch.hv_vapic = data; + if (kvm_lapic_enable_pv_eoi(vcpu, 0)) + return 1; break; } gfn = data >> HV_X64_MSR_APIC_ASSIST_PAGE_ADDRESS_SHIFT; @@ -1900,6 +1902,8 @@ static int set_msr_hyperv(struct kvm_vcpu *vcpu, u32 msr, u64 data) return 1; vcpu->arch.hv_vapic = data; mark_page_dirty(vcpu->kvm, gfn); + if (kvm_lapic_enable_pv_eoi(vcpu, gfn_to_gpa(gfn) | KVM_MSR_ENABLED)) + return 1; break; } case HV_X64_MSR_EOI: Applying to kvm/queue, thanks.
Paolo
Re: [PATCH 4/5] KVM: x86: Wrong register masking in 64-bit mode
On 5/7/14, 5:50 PM, Bandan Das wrote: Nadav Amit na...@cs.technion.ac.il writes: 32-bit operations are zero extended in 64-bit mode. Currently, the code does not handle them correctly and keeps the high bits. In 16-bit mode, the high 32-bits are kept intact. In addition, although it is not well-documented, when address override prefix It would be helpful to have a pointer in the SDM especially for cases that are not well-documented. is used with REP-string instruction, RCX high half is zeroed even if ECX was zero on the first iteration (as if an assignment was performed to ECX). Signed-off-by: Nadav Amit na...@cs.technion.ac.il First, as for the different masking/zero-extending behavior in different modes, I guess this behavior is documented in SDM Volume 1, Section 3.3.7: "Address Calculations in 64-Bit Mode". It appears (from experiments on bare-metal) that it also regards ESI/EDI/ECX on REP-prefix instructions and I presume it regards ESP on stack operations and ECX on loop operations. I will ensure it soon. Second, the behavior of zero-extending RCX even if ECX is initially zero was experienced on native environment (Intel SandyBridge), and we found no erratum regarding it. I actually found no documentation that describes this behavior. The term "not well-documented" is probably misleading, and will be clarified on v2. Thanks, Nadav
[Bug 73721] KVM hv-time
https://bugzilla.kernel.org/show_bug.cgi?id=73721 Paolo Bonzini bonz...@gnu.org changed: What|Removed |Added Status|NEW |RESOLVED CC||bonz...@gnu.org Resolution|--- |CODE_FIX --- Comment #3 from Paolo Bonzini bonz...@gnu.org --- Already fixed by commit b351c39cc9e0151cee9b8d52a1e714928faabb38, but Marcelo forgot to CC stable. I'll propose the patch myself.
Re: [kvm-kmod PATCH] Provide pci_enable_msix_exact() for kernels 3.15
On 07/05/2014 15:19, Gabriel L. Somlo wrote: On Wed, May 07, 2014 at 08:29:19AM +0200, Jan Kiszka wrote: On 2014-05-06 20:35, gso...@gmail.com wrote: Signed-off-by: Gabriel Somlo so...@cmu.edu --- Jan, After today's pull from kvm, I also need this to build against my Fedora 20 kernel (3.13.10-200.fc20.x86_64). Which version did you pull? Neither next nor queue contain this change yet. I have 89ca3b881987f5a4be4c5dbaa7f0df12bbdde2fd (Linux 3.15-rc4) as the top commit. Right now I'm seeing different build problems: http://buildbot.kiszka.org/kvm-kmod/builders/4-latest-kvm/builds/1788/steps/3.11-x86-64/logs/stdio Didn't look into details yet. That's kernel 3.11, afaict, so entirely different problem :) Probably time to kill srcu.c altogether, since it's available on 2.6.19. Paolo
Re: [PATCH 2/9] arm64: move DBG_MDSCR_* to asm/debug-monitors.h
On Wed, May 07, 2014 at 04:20:47PM +0100, Marc Zyngier wrote: In order to be able to use the DBG_MDSCR_* macros from the KVM code, move the relevant definitions to the obvious include file. Also move the debug_el enum to a portion of the file that is guarded by #ifndef __ASSEMBLY__ in order to use that file from assembly code. Signed-off-by: Marc Zyngier marc.zyng...@arm.com --- arch/arm64/include/asm/debug-monitors.h | 19 ++- arch/arm64/kernel/debug-monitors.c | 9 - 2 files changed, 14 insertions(+), 14 deletions(-) Acked-by: Will Deacon will.dea...@arm.com Will
[PATCH] kvm: x86: emulate monitor and mwait instructions as nop
Treat monitor and mwait instructions as nop, which is architecturally correct (but inefficient) behavior. We do this to prevent misbehaving guests (e.g. OS X <= 10.7) from receiving invalid opcode faults after failing to check for monitor/mwait availability via cpuid. Since mwait-based idle loops relying on these nop-emulated instructions would keep the host CPU pegged at 100%, do NOT advertise their presence via cpuid, preventing compliant guests from ever using them inadvertently. Signed-off-by: Gabriel L. Somlo so...@cmu.edu --- On Wed, May 07, 2014 at 05:30:47PM +0200, Paolo Bonzini wrote: On 07/05/2014 17:05, Michael S. Tsirkin wrote: 2. Emulate monitor and mwait as nop, but continue to claim they are not supported via CPUID. That's the patch you cited. Not sure though whether that sort of undocumented functionality would be OK with the KVM crowd, though :) I'd go for this one. It seems unlikely a guest wants to get an exception intentionally. Paolo? That's okay, but please add a printk_once the first time mwait is called. OK, here's a first pass at an official submission. I have two questions: 1. I can't test svm.c (on AMD). As such, I'm not sure the skip_emulated_instruction() call in my own version of nop_interception() is necessary. If not, I could probably just call the already existing nop_on_interception() (line 1926 or thereabouts in svm.c), which just returns 1 without skipping anything. 2. I get "defined but not used" warnings on invalid_op_interception() (svm.c) and handle_invalid_op() (vmx.c). Apparently monitor/mwait are currently the only VM exit reasons which lead to an invalid opcode exception. Should my patch just nuke those functions (so that if anyone needs them in the future they'd have to re-add them), or comment them out, or call them after the "return 1;" statement in the monitor/mwait functions to shut up gcc, or ???
:) Thanks much, Gabriel arch/x86/kvm/cpuid.c | 2 ++ arch/x86/kvm/svm.c | 22 -- arch/x86/kvm/vmx.c | 22 -- 3 files changed, 42 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index f47a104..d094fc6 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -283,6 +283,8 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function, 0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW); /* cpuid 1.ecx */ const u32 kvm_supported_word4_x86_features = + /* NOTE: MONITOR (and MWAIT) are emulated as NOP, +* but *not* advertised to guests via CPUID ! */ F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ | 0 /* DS-CPL, VMX, SMX, EST */ | 0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ | diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 7f4f9c2..1976488 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -3287,6 +3287,24 @@ static int pause_interception(struct vcpu_svm *svm) return 1; } +static int nop_interception(struct vcpu_svm *svm) +{ + skip_emulated_instruction(&(svm->vcpu)); + return 1; +} + +static int monitor_interception(struct vcpu_svm *svm) +{ + printk_once(KERN_WARNING "kvm: MONITOR instruction emulated as NOP!\n"); + return nop_interception(svm); +} + +static int mwait_interception(struct vcpu_svm *svm) +{ + printk_once(KERN_WARNING "kvm: MWAIT instruction emulated as NOP!\n"); + return nop_interception(svm); +} + static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = { [SVM_EXIT_READ_CR0] = cr_interception, [SVM_EXIT_READ_CR3] = cr_interception, @@ -3344,8 +3362,8 @@ static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = { [SVM_EXIT_CLGI] = clgi_interception, [SVM_EXIT_SKINIT] = skinit_interception, [SVM_EXIT_WBINVD] = emulate_on_interception, - [SVM_EXIT_MONITOR] = invalid_op_interception, - [SVM_EXIT_MWAIT]= invalid_op_interception, + [SVM_EXIT_MONITOR] = monitor_interception, + [SVM_EXIT_MWAIT]= mwait_interception, [SVM_EXIT_XSETBV] = xsetbv_interception,
[SVM_EXIT_NPF] = pf_interception, }; diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 33e8c02..060b384 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -5669,6 +5669,24 @@ static int handle_pause(struct kvm_vcpu *vcpu) return 1; } +static int handle_nop(struct kvm_vcpu *vcpu) +{ + skip_emulated_instruction(vcpu); + return 1; +} + +static int handle_mwait(struct kvm_vcpu *vcpu) +{ + printk_once(KERN_WARNING kvm: MWAIT instruction emulated as
Re: [PATCH] kvm: x86: emulate monitor and mwait instructions as nop
On Wed, May 07, 2014 at 02:10:59PM -0400, Gabriel L. Somlo wrote: Treat monitor and mwait instructions as nop, which is architecturally correct (but inefficient) behavior. We do this to prevent misbehaving guests (e.g. OS X <= 10.7) from receiving invalid opcode faults after failing to check for monitor/mwait availability via cpuid. Since mwait-based idle loops relying on these nop-emulated instructions would keep the host CPU pegged at 100%, do NOT advertise their presence via cpuid, preventing compliant guests from ever using them inadvertently. Signed-off-by: Gabriel L. Somlo so...@cmu.edu If we really want to be paranoid and worry about guests that use this strange way to trigger invalid opcode, we can make it possible for userspace to enable/disable this hack, and teach qemu to set it. That would make it even safer than it was. Not sure it's worth it, just a thought. --- On Wed, May 07, 2014 at 05:30:47PM +0200, Paolo Bonzini wrote: On 07/05/2014 17:05, Michael S. Tsirkin wrote: 2. Emulate monitor and mwait as nop, but continue to claim they are not supported via CPUID. That's the patch you cited. Not sure though whether that sort of undocumented functionality would be OK with the KVM crowd, though :) I'd go for this one. It seems unlikely a guest wants to get an exception intentionally. Paolo? That's okay, but please add a printk_once the first time mwait is called. OK, here's a first pass at an official submission. I have two questions: 1. I can't test svm.c (on AMD). As such, I'm not sure the skip_emulated_instruction() call in my own version of nop_interception() is necessary. If not, I could probably just call the already existing nop_on_interception() (line 1926 or thereabouts in svm.c), which just returns 1 without skipping anything. 2. I get "defined but not used" warnings on invalid_op_interception() (svm.c) and handle_invalid_op() (vmx.c).
Apparently monitor/mwait are currently the only VM exit reasons which lead to an invalid opcode exception. Should my patch just nuke those functions (so that if anyone needs them in the future they'd have to re-add them), or comment them out, or call them after the "return 1;" statement in the monitor/mwait functions to shut up gcc, or ??? :) Thanks much, Gabriel arch/x86/kvm/cpuid.c | 2 ++ arch/x86/kvm/svm.c | 22 -- arch/x86/kvm/vmx.c | 22 -- 3 files changed, 42 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index f47a104..d094fc6 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -283,6 +283,8 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function, 0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW); /* cpuid 1.ecx */ const u32 kvm_supported_word4_x86_features = + /* NOTE: MONITOR (and MWAIT) are emulated as NOP, + * but *not* advertised to guests via CPUID ! */ F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ | 0 /* DS-CPL, VMX, SMX, EST */ | 0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ | diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 7f4f9c2..1976488 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -3287,6 +3287,24 @@ static int pause_interception(struct vcpu_svm *svm) return 1; } +static int nop_interception(struct vcpu_svm *svm) +{ + skip_emulated_instruction(&(svm->vcpu)); + return 1; +} + +static int monitor_interception(struct vcpu_svm *svm) +{ + printk_once(KERN_WARNING "kvm: MONITOR instruction emulated as NOP!\n"); + return nop_interception(svm); +} + +static int mwait_interception(struct vcpu_svm *svm) +{ + printk_once(KERN_WARNING "kvm: MWAIT instruction emulated as NOP!\n"); + return nop_interception(svm); +} + static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = { [SVM_EXIT_READ_CR0] = cr_interception, [SVM_EXIT_READ_CR3] = cr_interception, @@ -3344,8 +3362,8 @@ static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = {
[SVM_EXIT_CLGI] = clgi_interception, [SVM_EXIT_SKINIT] = skinit_interception, [SVM_EXIT_WBINVD] = emulate_on_interception, - [SVM_EXIT_MONITOR] = invalid_op_interception, - [SVM_EXIT_MWAIT]= invalid_op_interception, + [SVM_EXIT_MONITOR] = monitor_interception, + [SVM_EXIT_MWAIT]= mwait_interception, [SVM_EXIT_XSETBV] = xsetbv_interception, [SVM_EXIT_NPF] = pf_interception, }; diff --git a/arch/x86/kvm/vmx.c
Re: [PATCH] kvm: x86: emulate monitor and mwait instructions as nop
On 05/07/2014 08:15 PM, Michael S. Tsirkin wrote: On Wed, May 07, 2014 at 02:10:59PM -0400, Gabriel L. Somlo wrote: Treat monitor and mwait instructions as nop, which is architecturally correct (but inefficient) behavior. We do this to prevent misbehaving guests (e.g. OS X <= 10.7) from receiving invalid opcode faults after failing to check for monitor/mwait availability via cpuid. Since mwait-based idle loops relying on these nop-emulated instructions would keep the host CPU pegged at 100%, do NOT advertise their presence via cpuid, preventing compliant guests from ever using them inadvertently. Signed-off-by: Gabriel L. Somlo so...@cmu.edu If we really want to be paranoid and worry about guests that use this strange way to trigger invalid opcode, we can make it possible for userspace to enable/disable this hack, and teach qemu to set it. That would make it even safer than it was. Not sure it's worth it, just a thought. Since we don't trap on non-exposed other instructions (new SSE and whatdoiknow) I don't think it's really bad to just expose MONITOR/MWAIT as nops. Alex --- On Wed, May 07, 2014 at 05:30:47PM +0200, Paolo Bonzini wrote: On 07/05/2014 17:05, Michael S. Tsirkin wrote: 2. Emulate monitor and mwait as nop, but continue to claim they are not supported via CPUID. That's the patch you cited. Not sure though whether that sort of undocumented functionality would be OK with the KVM crowd, though :) I'd go for this one. It seems unlikely a guest wants to get an exception intentionally. Paolo? That's okay, but please add a printk_once the first time mwait is called. OK, here's a first pass at an official submission. I have two questions: 1. I can't test svm.c (on AMD). As such, I'm not sure the Joerg, mind to take a quick look here? Alex skip_emulated_instruction() call in my own version of nop_interception() is necessary.
If not, I could probably just call the already existing nop_on_interception() (line 1926 or thereabouts in svm.c), which just returns returns 1 without skipping anything. 2. I get defined but not used warnings on invalid_op_interception() (svm.c) and handle_invalid_op() (vmx.c). Apparently monitor/mwait are currently the only VM exit reasons which lead to an invalid opcode exception. Should my patch just nuke those functions (so that if anyone needs them in the future they'd have to re-add them), or comment them out, or call them after the return 1; statement in the monitor/mwait functions to shut up gcc, or ??? :) Thanks much, Gabriel arch/x86/kvm/cpuid.c | 2 ++ arch/x86/kvm/svm.c | 22 -- arch/x86/kvm/vmx.c | 22 -- 3 files changed, 42 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index f47a104..d094fc6 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -283,6 +283,8 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function, 0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW); /* cpuid 1.ecx */ const u32 kvm_supported_word4_x86_features = + /* NOTE: MONITOR (and MWAIT) are emulated as NOP, +* but *not* advertised to guests via CPUID ! 
*/ F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ | 0 /* DS-CPL, VMX, SMX, EST */ | 0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ | diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 7f4f9c2..1976488 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -3287,6 +3287,24 @@ static int pause_interception(struct vcpu_svm *svm) return 1; } +static int nop_interception(struct vcpu_svm *svm) +{ + skip_emulated_instruction(&(svm->vcpu)); + return 1; +} + +static int monitor_interception(struct vcpu_svm *svm) +{ + printk_once(KERN_WARNING "kvm: MONITOR instruction emulated as NOP!\n"); + return nop_interception(svm); +} + +static int mwait_interception(struct vcpu_svm *svm) +{ + printk_once(KERN_WARNING "kvm: MWAIT instruction emulated as NOP!\n"); + return nop_interception(svm); +} + static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = { [SVM_EXIT_READ_CR0] = cr_interception, [SVM_EXIT_READ_CR3] = cr_interception, @@ -3344,8 +3362,8 @@ static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = { [SVM_EXIT_CLGI] = clgi_interception, [SVM_EXIT_SKINIT] = skinit_interception, [SVM_EXIT_WBINVD] = emulate_on_interception, - [SVM_EXIT_MONITOR] = invalid_op_interception, - [SVM_EXIT_MWAIT]= invalid_op_interception, + [SVM_EXIT_MONITOR] = monitor_interception, + [SVM_EXIT_MWAIT]
Re: [PATCH v10 00/19] qspinlock: a 4-byte queue spinlock with PV support
On Wed, May 07, 2014 at 11:01:28AM -0400, Waiman Long wrote:

v9->v10:
- Make some minor changes to qspinlock.c to accommodate review feedback.
- Change author to PeterZ for 2 of the patches.
- Include Raghavendra KT's test results in patch 18.

Any chance you can post these on a git tree? Thanks.
Re: [PATCH v10 18/19] pvqspinlock, x86: Enable PV qspinlock for KVM
Raghavendra KT had done some performance testing on this patch with the following results:

Overall we are seeing good improvement for pv-unfair version.

System: 32 cpu sandybridge with HT on (4 node with 32 GB each)
Guest : 8GB with 16 vcpu/VM. Average was taken over 8-10 data points.

Base = 3.15-rc2 with PRAVIRT_SPINLOCK = y

A = 3.15-rc2 + qspinlock v9 patch with QUEUE_SPINLOCK = y
    PRAVIRT_SPINLOCK = y PARAVIRT_UNFAIR_LOCKS = y (unfair lock)

B = 3.15-rc2 + qspinlock v9 patch with QUEUE_SPINLOCK = y
    PRAVIRT_SPINLOCK = n PARAVIRT_UNFAIR_LOCKS = n
    (queue spinlock without paravirt)

C = 3.15-rc2 + qspinlock v9 patch with QUEUE_SPINLOCK = y
    PRAVIRT_SPINLOCK = y PARAVIRT_UNFAIR_LOCKS = n
    (queue spinlock with paravirt)

Could you do s/PRAVIRT/PARAVIRT/ please?

Ebizzy %improvements
overcommit        A         B         C
0.5x           4.4265    2.0611    1.5824
1.0x           0.9015   -7.7828    4.5443
1.5x          46.1162   -2.9845   -3.5046
2.0x          99.8150   -2.7116    4.7461

Considering B sucks

Dbench %improvements
overcommit        A         B         C
0.5x           3.2617    3.5436    2.5676
1.0x           0.6302    2.2342    5.2201
1.5x           5.0027    4.8275    3.8375
2.0x          23.8242    4.5782   12.6067

Absolute values of base results: (overcommit, value, stdev)
Ebizzy (records / sec with 120 sec run)
0.5x  20941.8750 (2%)
1.0x  17623.8750 (5%)
1.5x   5874.7778 (15%)
2.0x   3581.8750 (7%)
Dbench (throughput in MB/sec)
0.5x  10009.6610 (5%)
1.0x   6583.0538 (1%)
1.5x   3991.9622 (4%)
2.0x   2527.0613 (2.5%)

Signed-off-by: Waiman Long waiman.l...@hp.com
Tested-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
---
 arch/x86/kernel/kvm.c | 135 +
 kernel/Kconfig.locks  |   2 +-
 2 files changed, 136 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 7ab8ab3..eef427b 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -567,6 +567,7 @@ static void kvm_kick_cpu(int cpu)
 	kvm_hypercall2(KVM_HC_KICK_CPU, flags, apicid);
 }
+#ifndef CONFIG_QUEUE_SPINLOCK
 enum kvm_contention_stat {
 	TAKEN_SLOW,
 	TAKEN_SLOW_PICKUP,
@@ -794,6 +795,134 @@ static void kvm_unlock_kick(struct arch_spinlock *lock,
 			   __ticket_t ticket)
 		}
 	}
 }
+#else /* !CONFIG_QUEUE_SPINLOCK */
+
+#ifdef CONFIG_KVM_DEBUG_FS
+static struct dentry *d_spin_debug;
+static struct dentry *d_kvm_debug;
+static u32 kick_nohlt_stats;	/* Kick but not halt count	*/
+static u32 halt_qhead_stats;	/* Queue head halting count	*/
+static u32 halt_qnode_stats;	/* Queue node halting count	*/
+static u32 halt_abort_stats;	/* Halting abort count		*/
+static u32 wake_kick_stats;	/* Wakeup by kicking count	*/
+static u32 wake_spur_stats;	/* Spurious wakeup count	*/
+static u64 time_blocked;	/* Total blocking time		*/
+
+static int __init kvm_spinlock_debugfs(void)
+{
+	d_kvm_debug = debugfs_create_dir("kvm-guest", NULL);
+	if (!d_kvm_debug) {
+		printk(KERN_WARNING
+		       "Could not create 'kvm' debugfs directory\n");
+		return -ENOMEM;
+	}
+	d_spin_debug = debugfs_create_dir("spinlocks", d_kvm_debug);
+
+	debugfs_create_u32("kick_nohlt_stats",
+			   0644, d_spin_debug, &kick_nohlt_stats);
+	debugfs_create_u32("halt_qhead_stats",
+			   0644, d_spin_debug, &halt_qhead_stats);
+	debugfs_create_u32("halt_qnode_stats",
+			   0644, d_spin_debug, &halt_qnode_stats);
+	debugfs_create_u32("halt_abort_stats",
+			   0644, d_spin_debug, &halt_abort_stats);
+	debugfs_create_u32("wake_kick_stats",
+			   0644, d_spin_debug, &wake_kick_stats);
+	debugfs_create_u32("wake_spur_stats",
+			   0644, d_spin_debug, &wake_spur_stats);
+	debugfs_create_u64("time_blocked",
+			   0644, d_spin_debug, &time_blocked);
+	return 0;
+}
+
+static inline void kvm_halt_stats(enum pv_lock_stats type)
+{
+	if (type == PV_HALT_QHEAD)
+		add_smp(&halt_qhead_stats, 1);
+	else if (type == PV_HALT_QNODE)
+		add_smp(&halt_qnode_stats, 1);
+	else /* type == PV_HALT_ABORT */
+		add_smp(&halt_abort_stats, 1);
+}
+
+static inline void kvm_lock_stats(enum pv_lock_stats type)
+{
+	if (type == PV_WAKE_KICKED)
+		add_smp(&wake_kick_stats, 1);
+	else if (type == PV_WAKE_SPURIOUS)
+		add_smp(&wake_spur_stats, 1);
+	else /* type == PV_KICK_NOHALT */
+		add_smp(&kick_nohlt_stats, 1);
+}
+
+static inline u64 spin_time_start(void)
+{
+	return sched_clock();
+}
+
+static inline void spin_time_accum_blocked(u64 start)
+{
+	u64 delta;
+
Re: [PATCH] kvm: x86: emulate monitor and mwait instructions as nop
On 07/05/2014 20:10, Gabriel L. Somlo wrote:

1. I can't test svm.c (on AMD). As such, I'm not sure the skip_emulated_instruction() call in my own version of nop_interception() is necessary. If not, I could probably just call the already existing nop_on_interception() (line 1926 or thereabouts in svm.c), which just returns 1 without skipping anything.

Yes, it's necessary.

2. I get "defined but not used" warnings on invalid_op_interception() (svm.c) and handle_invalid_op() (vmx.c). Apparently monitor/mwait are currently the only VM exit reasons which lead to an invalid opcode exception. Should my patch just nuke those functions (so that if anyone needs them in the future they'd have to re-add them), or comment them out, or call them after the return 1; statement in the monitor/mwait functions to shut up gcc, or ??? :)

Nuke it. :)

Thanks for working on OS X virtualization, weird guests are always the source of interesting quirks!

Paolo
[PATCH v2] kvm: x86: emulate monitor and mwait instructions as nop
Treat monitor and mwait instructions as nop, which is architecturally correct (but inefficient) behavior. We do this to prevent misbehaving guests (e.g. OS X <= 10.7) from crashing after they fail to check for monitor/mwait availability via cpuid.

Since mwait-based idle loops relying on these nop-emulated instructions would keep the host CPU pegged at 100%, do NOT advertise their presence via cpuid, to prevent compliant guests from using them inadvertently.

Signed-off-by: Gabriel L. Somlo so...@cmu.edu
---
New in v2: remove invalid_op handler functions which were only used to handle exits caused by monitor and mwait

On Wed, May 07, 2014 at 08:31:27PM +0200, Alexander Graf wrote:
On 05/07/2014 08:15 PM, Michael S. Tsirkin wrote:

If we really want to be paranoid and worry about guests that use this strange way to trigger invalid opcode, we can make it possible for userspace to enable/disable this hack, and teach qemu to set it. That would make it even safer than it was. Not sure it's worth it, just a thought.

Since we don't trap on non-exposed other instructions (new SSE and whatdoiknow) I don't think it's really bad to just expose MONITOR/MWAIT as nops.

So AFAICT, linux prefers to use mwait for idling if cpuid tells it that it's available. If we keep telling everyone that we do NOT have monitor and mwait available, compliant guests will never end up using them, and this hack would remain completely invisible to them, which is good (better to use hlt-based idle loops when you're a vm guest, that would actually allow the host to relax while you're halted :)

So the only time anyone would be able to tell we have this hack would be when they're about to receive an invalid opcode for using monitor/mwait in violation of what CPUID (would have) told them. That's what happens to OS X prior to 10.8, which is when I'm hypothesizing the Apple devs began to seriously think about their OS running as a vm guest (on fusion and parallels)...
Instead of killing the misbehaving guest with an invalid opcode, we'd allow them to peg the host CPU with their monitor == mwait == nop idle loop instead, which, at least on OS X, should be tolerable long enough to run 'rm -rf System/Library/Extensions/AppleIntelCPUPowerManagement.kext' and reboot the guest, after which things would settle down by reverting the guest to a hlt-based idle loop.

The only reason I can think of to add functionality for enabling/disabling this hack would be to protect against a malicious guest which would use mwait *on purpose* to peg the host CPU. But a malicious guest could just run for(;;); in ring 0 and accomplish the same goal, so we wouldn't really gain anything in exchange for the added complexity...

Thanks, Gabriel

 arch/x86/kvm/cpuid.c | 2 ++
 arch/x86/kvm/svm.c   | 28 
 arch/x86/kvm/vmx.c   | 20 
 3 files changed, 38 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index f47a104..d094fc6 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -283,6 +283,8 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 		0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW);
 	/* cpuid 1.ecx */
 	const u32 kvm_supported_word4_x86_features =
+		/* NOTE: MONITOR (and MWAIT) are emulated as NOP,
+		 * but *not* advertised to guests via CPUID ! */
 		F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
 		0 /* DS-CPL, VMX, SMX, EST */ |
 		0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 7f4f9c2..0b7d58d 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -2770,12 +2770,6 @@ static int xsetbv_interception(struct vcpu_svm *svm)
 	return 1;
 }
-static int invalid_op_interception(struct vcpu_svm *svm)
-{
-	kvm_queue_exception(&svm->vcpu, UD_VECTOR);
-	return 1;
-}
-
 static int task_switch_interception(struct vcpu_svm *svm)
 {
 	u16 tss_selector;
@@ -3287,6 +3281,24 @@ static int pause_interception(struct vcpu_svm *svm)
 	return 1;
 }
+static int nop_interception(struct vcpu_svm *svm)
+{
+	skip_emulated_instruction(&(svm->vcpu));
+	return 1;
+}
+
+static int monitor_interception(struct vcpu_svm *svm)
+{
+	printk_once(KERN_WARNING "kvm: MONITOR instruction emulated as NOP!\n");
+	return nop_interception(svm);
+}
+
+static int mwait_interception(struct vcpu_svm *svm)
+{
+	printk_once(KERN_WARNING "kvm: MWAIT instruction emulated as NOP!\n");
+	return nop_interception(svm);
+}
+
 static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = {
 	[SVM_EXIT_READ_CR0]			= cr_interception,
 	[SVM_EXIT_READ_CR3]			= cr_interception,
@@ -3344,8 +3356,8 @@ static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = {
Re: [PATCH] kvmclock: Ensure time in migration never goes backward
On Tue, May 06, 2014 at 09:18:27AM +0200, Alexander Graf wrote:
On 06.05.14 01:31, Marcelo Tosatti wrote:
On Mon, May 05, 2014 at 08:23:43PM -0300, Marcelo Tosatti wrote:
Hi Alexander,
On Mon, May 05, 2014 at 03:51:22PM +0200, Alexander Graf wrote:

When we migrate we ask the kernel about its current belief on what the guest time would be, via KVM_GET_CLOCK, which returns the time in struct kvm_clock_data. However, I've seen cases where the kvmclock guest structure indicates a time more recent than the kvm returned time.

This should not happen because the value returned by KVM_GET_CLOCK (get_kernel_ns() + kvmclock_offset) should be relatively in sync with what is seen in the guest via kvmclock read.

Yes, and it isn't. Any ideas why it's not? This patch really just uses the guest visible kvmclock time rather than the host view of it on migration.

Effective frequency of TSC clock could be higher than NTP frequency. So NTP correction would slow down the host clock.

There is definitely something very broken on the host's side since it does return a smaller time than the guest exposed interface indicates.

Alex

Can you please retrieve the values of system_timestamp, tsc_timestamp and the host clock on the trace you have? Nothing forbids backwards time even without migration, in case the problem is the difference in frequency between TSC and NTP corrected host clock (point is, should figure out what is happening).
Re: [Qemu-devel] [PATCH] kvmclock: Ensure time in migration never goes backward
On Tue, May 06, 2014 at 09:54:35PM +0200, Marcin Gibuła wrote:

Yes, and it isn't. Any ideas why it's not? This patch really just uses the guest visible kvmclock time rather than the host view of it on migration. There is definitely something very broken on the host's side since it does return a smaller time than the guest exposed interface indicates.

Don't know if it helps, but here are example values of time_at_migration and s->clock from your patch. Tested on 5 restores of a saved VM that (used to) hang:

s->clock         time_at_migration
157082235125698  157113284546655
157082235125698  157113298196976
157082235125698  157113284615117
157082235125698  157113284486601
157082235125698  157113284479740

Now, when I compare system time on guest with and without patch:
On unpatched qemu vm restores with date: Apr 18 06:56:36
On patched qemu it says: Apr 18 06:57:06

-- mg

Marcin,

Can you turn off the TSC clock on the host and attempt to reproduce the issue, without Alexander's patch? Just

echo hpet > /sys/devices/system/clocksource/clocksource0/current_clocksource

(or any other clock except tsc if you don't have hpet). TIA.