RE: [PATCH 2/4] KVM: Add SMAP support when setting CR4

2014-03-31 Thread Wu, Feng


 -Original Message-
 From: Paolo Bonzini [mailto:paolo.bonz...@gmail.com] On Behalf Of Paolo
 Bonzini
 Sent: Friday, March 28, 2014 8:03 PM
 To: Wu, Feng; g...@redhat.com; h...@zytor.com; kvm@vger.kernel.org
 Subject: Re: [PATCH 2/4] KVM: Add SMAP support when setting CR4
 
 On 28/03/2014 18:36, Feng Wu wrote:
  +   smap = kvm_read_cr4_bits(vcpu, X86_CR4_SMAP);
 
 You are overwriting this variable below, but that is not okay because
 the value of CR4 must be considered separately in each iteration.  This
 also hides an uninitialized-variable bug for smap in the EPT
 case.
 
 To avoid that, rename this variable to cr4_smap; it's probably better
 to rename smep to cr4_smep too.
 
 for (byte = 0; byte < ARRAY_SIZE(mmu->permissions); ++byte) {
 pfec = byte << 1;
 map = 0;
 wf = pfec & PFERR_WRITE_MASK;
 uf = pfec & PFERR_USER_MASK;
 ff = pfec & PFERR_FETCH_MASK;
 +   smapf = pfec & PFERR_RSVD_MASK;
 
 The reader will expect PFERR_RSVD_MASK to be zero here.  So please
 add a comment: /* PFERR_RSVD_MASK is set in pfec if ... */.
 
 for (bit = 0; bit < 8; ++bit) {
 x = bit & ACC_EXEC_MASK;
 w = bit & ACC_WRITE_MASK;
 @@ -3627,11 +3629,27 @@ static void update_permission_bitmask(struct
kvm_vcpu *vcpu,
 w |= !is_write_protection(vcpu) && !uf;
 /* Disallow supervisor fetches of user code if 
 cr4.smep */
 x &= !(smep && u && !uf);
  +
  +   /*
 +* SMAP: kernel-mode data accesses from user-mode
 +* mappings should fault. A fault is considered
 +* as a SMAP violation if all of the following
 +* conditions are true:
 +*   - X86_CR4_SMAP is set in CR4
 +*   - A user page is accessed
 +*   - Page fault in kernel mode
 +*   - !(CPL < 3 && X86_EFLAGS_AC is set)
 +*
 +*   Here, we cover the first three conditions,
 +*   we need to check CPL and X86_EFLAGS_AC in
 +*   permission_fault() dynamiccally
 
 dynamically.  These three lines however are not entirely correct.  We do
 cover the last condition here, it is in smapf.  So perhaps something like
 
  * Here, we cover the first three conditions.
 * The CPL and X86_EFLAGS_AC are in smapf, which
  * permission_fault() computes dynamically.
 
  +*/
 +   smap = smap && smapf && u && !uf;
 
 SMAP does not affect instruction fetches.  Do you need && !ff here?  Perhaps
 it's clearer to add it even if it is not strictly necessary.
 
 Please write just smap = cr4_smap && u && !uf && !ff here, and add back
 smapf below in the assignment to fault.  This makes the code more
 homogeneous.
 
  } else
  /* Not really needed: no U/S accesses on ept  */
  u = 1;
 -   fault = (ff && !x) || (uf && !u) || (wf && !w);
 +   fault = (ff && !x) || (uf && !u) || (wf && !w) || smap;
 
 ...
 
  +
  +   /*
 +* If CPL < 3, SMAP protections are disabled if EFLAGS.AC = 1.
  +*
  +* If CPL = 3, SMAP applies to all supervisor-mode data accesses
  +* (these are implicit supervisor accesses) regardless of the value
  +* of EFLAGS.AC.
  +*
  +* So we need to check CPL and EFLAGS.AC to detect whether there is
  +* a SMAP violation.
  +*/
  +
 +   smapf = ((mmu->permissions[(pfec | PFERR_RSVD_MASK) >> 1] >> pte_access) &
 +1) && !((cpl < 3) && ((rflags & X86_EFLAGS_AC) == 1));
  +
 +   return ((mmu->permissions[pfec >> 1] >> pte_access) & 1) || smapf;
 
 You do not need two separate accesses.  Just add PFERR_RSVD_MASK to pfec
 if
 the conditions for SMAP are satisfied.  There are two possibilities:
 
 1) setting PFERR_RSVD_MASK if SMAP is being enforced, that is if CPL == 3
 || AC == 0.  This is what you are doing now.
 
 2) setting PFERR_RSVD_MASK if SMAP is being overridden, that is if CPL < 3
 && AC == 1.  You then have to invert the bit in update_permission_bitmask.
 
 Please consider both choices, and pick the one that gives better code.
 
 Also, this must be written in a branchless way.  Branchless tricks are common
 throughout the MMU code because the hit rate of most branches is pretty much
 50%-50%.  This is also true in this case, at least if SMAP is in use (if it
 is not in use, we'll have AC=0 most of the time).
 
 I don't want to spoil the fun, but I don't want to waste your time either
 so I rot13'ed my solution and placed it after the signature. ;)
 
 As to nested virtualization, I reread the code and it should already work,
 because it sets 

Re: [PATCH 2/4] KVM: Add SMAP support when setting CR4

2014-03-31 Thread Paolo Bonzini

On 31/03/2014 08:16, Wu, Feng wrote:


/*
 * If CPL < 3, SMAP protections are disabled if EFLAGS.AC = 1.
 *
 * If CPL = 3, SMAP applies to all supervisor-mode data accesses
 * (these are implicit supervisor accesses) regardless of the value
 * of EFLAGS.AC.
 *
 * This computes (cpl < 3) && (rflags & X86_EFLAGS_AC), leaving
 * the result in X86_EFLAGS_AC. We then insert it in place of
 * the PFERR_RSVD_MASK bit; this bit will always be zero in pfec,
 * but it will be one in index if SMAP checks are being overridden.
 * It is important to keep this branchless.
 */
smap = (cpl - 3) & (rflags & X86_EFLAGS_AC);
index =
(pfec >> 1) +
(smap >> (X86_EFLAGS_AC_BIT - PFERR_RSVD_BIT + 1));

return (mmu->permissions[index] >> pte_access) & 1;

The direction of PFERR_RSVD_MASK is the opposite compared to your code.
---


Correct.


I am a little confused about some points of the above code:
1. smap = (cpl - 3) & (rflags & X86_EFLAGS_AC);
smap equals 1 when it is overridden and it is 0 when being enforced.


Actually, smap equals X86_EFLAGS_AC when it is overridden.  Perhaps this 
is the source of the confusion.  Note that I'm using &, not &&.



So index
will be (pfec >> 1) when SMAP is enforced, but in my understanding of this 
case, we
should use the index with PFERR_RSVD_MASK bit being 1 in mmu->permissions[]
to check the fault.
2.  smap >> (X86_EFLAGS_AC_BIT - PFERR_RSVD_BIT + 1)
I don't quite understand this line. BTW, I cannot find the definition of 
PFERR_RSVD_BIT.
Do you mean PFERR_RSVD_BIT equals 3?


Yes.  You can similarly add PFERR_PRESENT_BIT (equal to 0) etc.


I think the basic idea is using group 0 to check permission faults when !((cpl - 3)
& (rflags & X86_EFLAGS_AC)), that is SMAP is overridden,
while using group 1 to check faults when (cpl - 3) & (rflags & X86_EFLAGS_AC), 
that is SMAP is enforced.

Here is the code base on your proposal in my understanding:

---
smap = !((cpl - 3) & (rflags & X86_EFLAGS_AC));
index =
(pfec >> 1) + (smap << (PFERR_RSVD_BIT - 1)); /* assuming PFERR_RSVD_BIT == 3 */

return (mmu->permissions[index] >> pte_access) & 1;
---

Could you please have a look at it? Appreciate your help! :)


It is faster if you avoid the ! and shift right from the AC bit into 
position PFERR_RSVD_BIT - 1.  In update_permission_bitmask you can 
invert the direction of the bit when you extract it from pfec.


Paolo

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 2/4] KVM: Add SMAP support when setting CR4

2014-03-31 Thread Wu, Feng


 -Original Message-
 From: Paolo Bonzini [mailto:paolo.bonz...@gmail.com] On Behalf Of Paolo
 Bonzini
 Sent: Monday, March 31, 2014 3:29 PM
 To: Wu, Feng; g...@redhat.com; h...@zytor.com; kvm@vger.kernel.org
 Subject: Re: [PATCH 2/4] KVM: Add SMAP support when setting CR4
 
 On 31/03/2014 08:16, Wu, Feng wrote:
 
 
 
  /*
   * If CPL < 3, SMAP protections are disabled if EFLAGS.AC = 1.
   *
   * If CPL = 3, SMAP applies to all supervisor-mode data accesses
   * (these are implicit supervisor accesses) regardless of the value
   * of EFLAGS.AC.
   *
   * This computes (cpl < 3) && (rflags & X86_EFLAGS_AC), leaving
   * the result in X86_EFLAGS_AC. We then insert it in place of
   * the PFERR_RSVD_MASK bit; this bit will always be zero in pfec,
   * but it will be one in index if SMAP checks are being overridden.
   * It is important to keep this branchless.
   */
  smap = (cpl - 3) & (rflags & X86_EFLAGS_AC);
  index =
  (pfec >> 1) +
  (smap >> (X86_EFLAGS_AC_BIT - PFERR_RSVD_BIT + 1));
 
  return (mmu->permissions[index] >> pte_access) & 1;
 
  The direction of PFERR_RSVD_MASK is the opposite compared to your code.
 
 
 ---
 
 Correct.
 
  I am a little confused about some points of the above code:
  1. smap = (cpl - 3) & (rflags & X86_EFLAGS_AC);
  smap equals 1 when it is overridden and it is 0 when being enforced.
 
 Actually, smap equals X86_EFLAGS_AC when it is overridden.  Perhaps this
 is the source of the confusion.  Note that I'm using &, not &&.
 

Thanks for your explanation, you are right, I didn't notice the '&' just now; I was
thinking it was a '&&'. Sorry for the confusion.

  So index
  will be (pfec >> 1) when SMAP is enforced, but in my understanding of this
 case, we
  should use the index with PFERR_RSVD_MASK bit being 1 in mmu->permissions[]
  to check the fault.
  2.  smap >> (X86_EFLAGS_AC_BIT - PFERR_RSVD_BIT + 1)
  I don't quite understand this line. BTW, I cannot find the definition of
 PFERR_RSVD_BIT.
  Do you mean PFERR_RSVD_BIT equals 3?
 
 Yes.  You can similarly add PFERR_PRESENT_BIT (equal to 0) etc.
 
  I think the basic idea is using group 0 to check permission faults when
  !((cpl - 3) & (rflags & X86_EFLAGS_AC)), that is SMAP is overridden,
  while using group 1 to check faults when (cpl - 3) & (rflags &
  X86_EFLAGS_AC), that is SMAP is enforced.
 
  Here is the code base on your proposal in my understanding:
 
 
 
 ---
  smap = !((cpl - 3) & (rflags & X86_EFLAGS_AC));
  index =
  (pfec >> 1) + (smap << (PFERR_RSVD_BIT - 1)); /* assuming
 PFERR_RSVD_BIT == 3 */
 
  return (mmu->permissions[index] >> pte_access) & 1;
 
 
 ---
 
  Could you please have a look at it? Appreciate your help! :)
 
 It is faster if you avoid the ! and shift right from the AC bit into
 position PFERR_RSVD_BIT - 1.  In update_permission_bitmask you can
 invert the direction of the bit when you extract it from pfec.

So in that case, we should set smapf in update_permission_bitmask() this way, 
right?

smapf = !(pfec & PFERR_RSVD_MASK);

 
 Paolo



[GIT PULL] KVM changes for 3.15 merge window

2014-03-31 Thread Paolo Bonzini
Linus,

The following changes since commit 6d0abeca3242a88cab8232e4acd7e2bf088f3bc2:

  Linux 3.14-rc3 (2014-02-16 13:30:25 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/virt/kvm/kvm.git tags/kvm-3.15-1

for you to fetch changes up to 7227fc006b0df2c0d2966a7f4859b01bdf74:

  Merge branch 'kvm-ppchv-next' of 
git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into kvm-next 
(2014-03-29 15:44:05 +0100)

No other changes expected for this merge window.

Paolo



PPC and ARM do not have much going on this time.  Most of the cool stuff,
instead, is in s390 and (after a few releases) x86.

ARM has some caching fixes and PPC has transactional memory support
in guests.  MIPS has some fixes, with more probably coming in 3.16 as
QEMU will soon get support for MIPS KVM.

For x86 there are optimizations for debug registers, which trigger on
some Windows games, and other important fixes for Windows guests.  We now
expose to the guest Broadwell instruction set extensions and also Intel
MPX.  There's also a fix/workaround for OS X guests, nested virtualization
features (preemption timer), and a couple kvmclock refinements.

For s390, the main news is asynchronous page faults, together with
improvements to IRQs (floating irqs and adapter irqs) that speed up
virtio devices.


Andrew Honig (1):
  kvm: x86: fix emulator buffer overflow (CVE-2014-0049)

Andrew Jones (2):
  x86: kvm: rate-limit global clock updates
  x86: kvm: introduce periodic global clock updates

Anton Blanchard (1):
  KVM: PPC: Book3S HV: Fix KVM hang with CONFIG_KVM_XICS=n

Christian Borntraeger (4):
  KVM: s390: Provide access to program parameter
  KVM: s390: expose gbea register to userspace
  KVM: s390: Optimize ucontrol path
  KVM: s390: randomize sca address

Christoffer Dall (1):
  KVM: Specify byte order for KVM_EXIT_MMIO

Cornelia Huck (6):
  virtio-ccw: virtio-ccw adapter interrupt support.
  KVM: eventfd: Fix lock order inversion.
  KVM: Add per-vm capability enablement.
  KVM: s390: adapter interrupt sources
  KVM: s390: irq routing for adapter interrupts.
  KVM: Bump KVM_MAX_IRQ_ROUTES for s390

Dominik Dingel (7):
  KVM: s390: Add FAULT_FLAG_RETRY_NOWAIT for guest fault
  KVM: async_pf: Provide additional direct page notification
  KVM: async_pf: Allow to wait for outstanding work
  KVM: async_pf: Async page fault support on s390
  KVM: async_pf: Exploit one reg interface for pfault
  KVM: async_pf: Add missing call for async page present
  KVM: s390: Removing untriggerable BUG_ONs

Fernando Luis Vázquez Cao (1):
  kvm: remove redundant registration of BSP's hv_clock area

Gabriel L. Somlo (1):
  kvm: x86: ignore ioapic polarity

Greg Kurz (1):
  KVM: PPC: Book3S HV: Fix incorrect userspace exit on ioeventfd write

Heinz Graalfs (2):
  virtio_ccw: fix vcdev pointer handling issues
  virtio_ccw: fix hang in set offline processing

Igor Mammedov (2):
  KVM: x86 emulator: emulate MOVAPS
  KVM: x86 emulator: emulate MOVAPD

James Hogan (4):
  MIPS: KVM: asm/kvm_host.h: Clean up whitespace
  MIPS: KVM: Pass reserved instruction exceptions to guest
  MIPS: KVM: Consult HWREna before emulating RDHWR
  MIPS: KVM: Remove dead code in CP0 emulation

Jan Kiszka (4):
  KVM: nVMX: Rework interception of IRQs and NMIs
  KVM: nVMX: Fully emulate preemption timer
  KVM: nVMX: Do not inject NMI vmexits when L2 has a pending interrupt
  KVM: x86: Remove return code from enable_irq/nmi_window

Jens Freimann (7):
  KVM: s390: add and extend interrupt information data structs
  KVM: s390: add floating irq controller
  KVM: s390: limit floating irqs
  KVM: s390: Simplify online vcpus counting for stsi
  KVM: s390: get rid of local_int array
  KVM: s390: fix calculation of idle_mask array size
  KVM: s390: clear local interrupts at cpu initial reset

Laurent Dufour (1):
  KVM: PPC: Book3S: Introduce hypervisor call H_GET_TCE

Liu, Jinsong (6):
  KVM: x86: expose new instruction RDSEED to guest
  KVM: x86: expose ADX feature to guest
  KVM: x86: Fix xsave cpuid exposing bug
  KVM: x86: Intel MPX vmx and msr handle
  KVM: x86: add MSR_IA32_BNDCFGS to msrs_to_save
  KVM: x86: Enable Intel MPX for guest

Marc Zyngier (13):
  arm/arm64: KVM: detect CPU reset on CPU_PM_EXIT
  arm64: KVM: force cache clean on page fault when caches are off
  arm64: KVM: allows discrimination of AArch32 sysreg access
  arm64: KVM: trap VM system registers until MMU and caches are ON
  ARM: KVM: introduce kvm_p*d_addr_end
  arm64: KVM: flush VM pages before letting the guest enable caches
  ARM: KVM: force cache clean on page fault when caches are off
  ARM: KVM: fix handling of 

Re: [PATCH 2/4] KVM: Add SMAP support when setting CR4

2014-03-31 Thread Paolo Bonzini

On 31/03/2014 10:06, Wu, Feng wrote:


 It is faster if you avoid the ! and shift right from the AC bit into
 position PFERR_RSVD_BIT - 1.  In update_permission_bitmask you can
 invert the direction of the bit when you extract it from pfec.

So in that case, we should set smapf in update_permission_bitmask() this way, 
right?

smapf = !(pfec & PFERR_RSVD_MASK);



Yep.

Paolo


Re: [GIT PULL] KVM changes for 3.15 merge window

2014-03-31 Thread Paolo Bonzini

On 31/03/2014 10:30, Paolo Bonzini wrote:

Linus,

The following changes since commit 6d0abeca3242a88cab8232e4acd7e2bf088f3bc2:

  Linux 3.14-rc3 (2014-02-16 13:30:25 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/virt/kvm/kvm.git tags/kvm-3.15-1


Also at tags/for-linus, but request-pull chose this other tag for some 
reason.


Paolo



[PATCH v3 0/4] KVM: enable Intel SMAP for KVM

2014-03-31 Thread Feng Wu
Supervisor Mode Access Prevention (SMAP) is a new security feature
disclosed by Intel; please refer to the following document:

http://software.intel.com/sites/default/files/319433-014.pdf
 
Every access to a linear address is either a supervisor-mode access
or a user-mode access. All accesses performed while the current
privilege level (CPL) is less than 3 are supervisor-mode accesses.
If CPL = 3, accesses are generally user-mode accesses. However, some
operations implicitly access system data structures, and the resulting
accesses to those data structures are supervisor-mode accesses regardless
of CPL. Examples of such implicit supervisor accesses include the following:
accesses to the global descriptor table (GDT) or local descriptor table
(LDT) to load a segment descriptor; accesses to the interrupt descriptor
table (IDT) when delivering an interrupt or exception; and accesses to the
task-state segment (TSS) as part of a task switch or change of CPL.

If CR4.SMAP = 1, supervisor-mode data accesses are not allowed to linear
addresses that are accessible in user mode. If CPL < 3, SMAP protections
are disabled if EFLAGS.AC = 1. If CPL = 3, SMAP applies to all supervisor-mode
data accesses (these are implicit supervisor accesses) regardless of the
value of EFLAGS.AC.

This patchset passes the SMAP feature through to guests, letting them
benefit from it.

Version 1:
  * Remove SMAP bit from CR4_RESERVED_BITS.
  * Add SMAP support when setting CR4
  * Disable SMAP for guests in EPT realmode and EPT unpaging mode
  * Expose SMAP feature to guest

Version 2:
  * Change the logic of updating mmu permission bitmap for SMAP violation
  * Expose SMAP feature to guest in the last patch of this series.

Version 3:
  * Changes in update_permission_bitmask().
  * Use a branchless way suggested by Paolo Bonzini to detect SMAP
violation in permission_fault(). 

Feng Wu (4):
  KVM: Remove SMAP bit from CR4_RESERVED_BITS.
  KVM: Add SMAP support when setting CR4
  KVM: Disable SMAP for guests in EPT realmode and EPT unpaging mode
  KVM: expose SMAP feature to guest

 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/cpuid.c|  2 +-
 arch/x86/kvm/cpuid.h|  8 
 arch/x86/kvm/mmu.c  | 35 +---
 arch/x86/kvm/mmu.h  | 44 +
 arch/x86/kvm/paging_tmpl.h  |  2 +-
 arch/x86/kvm/vmx.c  | 11 ++-
 arch/x86/kvm/x86.c  |  9 -
 8 files changed, 93 insertions(+), 20 deletions(-)

-- 
1.8.3.1



[PATCH v3 1/4] KVM: Remove SMAP bit from CR4_RESERVED_BITS.

2014-03-31 Thread Feng Wu
This patch removes SMAP bit from CR4_RESERVED_BITS.

Signed-off-by: Feng Wu feng...@intel.com
---
 arch/x86/include/asm/kvm_host.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index fdf83af..4eeb049 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -60,7 +60,7 @@
  | X86_CR4_PSE | X86_CR4_PAE | X86_CR4_MCE \
  | X86_CR4_PGE | X86_CR4_PCE | X86_CR4_OSFXSR | 
X86_CR4_PCIDE \
  | X86_CR4_OSXSAVE | X86_CR4_SMEP | X86_CR4_FSGSBASE \
- | X86_CR4_OSXMMEXCPT | X86_CR4_VMXE))
+ | X86_CR4_OSXMMEXCPT | X86_CR4_VMXE | X86_CR4_SMAP))
 
 #define CR8_RESERVED_BITS (~(unsigned long)X86_CR8_TPR)
 
-- 
1.8.3.1



[PATCH v3 3/4] KVM: Disable SMAP for guests in EPT realmode and EPT unpaging mode

2014-03-31 Thread Feng Wu
SMAP is disabled if CPU is in non-paging mode in hardware.
However KVM always uses paging mode to emulate guest non-paging
mode with TDP. To emulate this behavior, SMAP needs to be
manually disabled when guest switches to non-paging mode.

Signed-off-by: Feng Wu feng...@intel.com
---
 arch/x86/kvm/vmx.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 3927528..e58cb5f 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3452,13 +3452,14 @@ static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned 
long cr4)
hw_cr4 &= ~X86_CR4_PAE;
hw_cr4 |= X86_CR4_PSE;
/*
-* SMEP is disabled if CPU is in non-paging mode in
-* hardware. However KVM always uses paging mode to
+* SMEP/SMAP is disabled if CPU is in non-paging mode
+* in hardware. However KVM always uses paging mode to
 * emulate guest non-paging mode with TDP.
-* To emulate this behavior, SMEP needs to be manually
-* disabled when guest switches to non-paging mode.
+* To emulate this behavior, SMEP/SMAP needs to be
+* manually disabled when guest switches to non-paging
+* mode.
 */
-   hw_cr4 &= ~X86_CR4_SMEP;
+   hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP);
} else if (!(cr4 & X86_CR4_PAE)) {
hw_cr4 &= ~X86_CR4_PAE;
}
-- 
1.8.3.1



[PATCH v3 4/4] KVM: expose SMAP feature to guest

2014-03-31 Thread Feng Wu
This patch exposes the SMAP feature to the guest.

Signed-off-by: Feng Wu feng...@intel.com
---
 arch/x86/kvm/cpuid.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index c697625..deb5f9b 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -303,7 +303,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 
*entry, u32 function,
/* cpuid 7.0.ebx */
const u32 kvm_supported_word9_x86_features =
F(FSGSBASE) | F(BMI1) | F(HLE) | F(AVX2) | F(SMEP) |
-   F(BMI2) | F(ERMS) | f_invpcid | F(RTM);
+   F(BMI2) | F(ERMS) | f_invpcid | F(RTM) | F(SMAP);
 
/* all calls to cpuid_count() should be made on the same cpu */
get_cpu();
-- 
1.8.3.1



[PATCH v3 2/4] KVM: Add SMAP support when setting CR4

2014-03-31 Thread Feng Wu
This patch adds SMAP handling logic when setting CR4 for guests

Thanks a lot to Paolo Bonzini for his suggestion to use the branchless
way to detect SMAP violation.

Signed-off-by: Feng Wu feng...@intel.com
---
 arch/x86/kvm/cpuid.h   |  8 
 arch/x86/kvm/mmu.c | 35 ---
 arch/x86/kvm/mmu.h | 44 
 arch/x86/kvm/paging_tmpl.h |  2 +-
 arch/x86/kvm/x86.c |  9 -
 5 files changed, 85 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index a2a1bb7..eeecbed 100644
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -48,6 +48,14 @@ static inline bool guest_cpuid_has_smep(struct kvm_vcpu 
*vcpu)
return best && (best->ebx & bit(X86_FEATURE_SMEP));
 }
 
+static inline bool guest_cpuid_has_smap(struct kvm_vcpu *vcpu)
+{
+   struct kvm_cpuid_entry2 *best;
+
+   best = kvm_find_cpuid_entry(vcpu, 7, 0);
+   return best && (best->ebx & bit(X86_FEATURE_SMAP));
+}
+
 static inline bool guest_cpuid_has_fsgsbase(struct kvm_vcpu *vcpu)
 {
struct kvm_cpuid_entry2 *best;
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9b53135..5a1ed38 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3601,20 +3601,28 @@ static void reset_rsvds_bits_mask_ept(struct kvm_vcpu 
*vcpu,
}
 }
 
-static void update_permission_bitmask(struct kvm_vcpu *vcpu,
+void update_permission_bitmask(struct kvm_vcpu *vcpu,
struct kvm_mmu *mmu, bool ept)
 {
unsigned bit, byte, pfec;
u8 map;
-   bool fault, x, w, u, wf, uf, ff, smep;
+   bool fault, x, w, u, wf, uf, ff, smapf, cr4_smap, smep, smap = 0;
 
smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP);
+   cr4_smap = kvm_read_cr4_bits(vcpu, X86_CR4_SMAP);
for (byte = 0; byte < ARRAY_SIZE(mmu->permissions); ++byte) {
pfec = byte << 1;
map = 0;
wf = pfec & PFERR_WRITE_MASK;
uf = pfec & PFERR_USER_MASK;
ff = pfec & PFERR_FETCH_MASK;
+   /*
+* The PFERR_RSVD_MASK bit is used to detect SMAP violations.
+* We will check it in permission_fault(); the bit is set in
+* pfec if the access is not subject to SMAP restrictions,
+* and cleared otherwise.
+*/
+   smapf = !(pfec & PFERR_RSVD_MASK);
for (bit = 0; bit < 8; ++bit) {
x = bit & ACC_EXEC_MASK;
w = bit & ACC_WRITE_MASK;
@@ -3627,11 +3635,32 @@ static void update_permission_bitmask(struct kvm_vcpu 
*vcpu,
w |= !is_write_protection(vcpu) && !uf;
/* Disallow supervisor fetches of user code if 
cr4.smep */
x &= !(smep && u && !uf);
+
+   /*
+* SMAP: kernel-mode data accesses from user-mode
+* mappings should fault. A fault is considered
+* as a SMAP violation if all of the following
+* conditions are true:
+*   - X86_CR4_SMAP is set in CR4
+*   - A user page is accessed
+*   - Page fault in kernel mode
+*   - !(CPL < 3 && X86_EFLAGS_AC is set)
+*
+*   Here, we cover the first three conditions.
+*   The CPL and X86_EFLAGS_AC are in smapf, which
+*   permission_fault() computes dynamically.
+*
+*   Also, SMAP does not affect instruction
+*   fetches, add the !ff check here to make it
+*   clearer.
+*/
+   smap = cr4_smap && u && !uf && !ff;
} else
/* Not really needed: no U/S accesses on ept  */
u = 1;
 
-   fault = (ff && !x) || (uf && !u) || (wf && !w);
+   fault = (ff && !x) || (uf && !u) || (wf && !w) ||
+   (smapf && smap);
map |= fault  bit;
}
mmu-permissions[byte] = map;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 2926152..822190f 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -44,11 +44,17 @@
 #define PT_DIRECTORY_LEVEL 2
 #define PT_PAGE_TABLE_LEVEL 1
 
-#define PFERR_PRESENT_MASK (1U << 0)
-#define PFERR_WRITE_MASK (1U << 1)
-#define PFERR_USER_MASK (1U << 2)
-#define PFERR_RSVD_MASK (1U << 3)
-#define PFERR_FETCH_MASK (1U << 4)
+#define PFERR_PRESENT_BIT 0
+#define PFERR_WRITE_BIT 1
+#define PFERR_USER_BIT 2

KVM call agenda for 2014-04-01

2014-03-31 Thread Juan Quintela

Hi

Please, send any topic that you are interested in covering.

Thanks, Juan.

Call details:

10:00 AM to 11:00 AM EDT
Every two weeks

If you need phone number details,  contact me privately.


Re: [Qemu-devel] KVM call agenda for 2014-04-01

2014-03-31 Thread Andreas Färber
Hi,

On 31.03.2014 12:40, Juan Quintela wrote:
 
 Please, send any topic that you are interested in covering.

I would like to discuss the state of the QEMU release process, please:

* -rc1 has not been tagged.
* Who besides Anthony could upload a tarball if we tag and create it?
* make-release fix for SeaBIOS on the list. Ping, and are more affected?

Thanks,
Andreas

-- 
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg


Re: [Qemu-devel] KVM call agenda for 2014-04-01

2014-03-31 Thread Stefan Hajnoczi
On Mon, Mar 31, 2014 at 12:51:31PM +0200, Andreas Färber wrote:
 On 31.03.2014 12:40, Juan Quintela wrote:
  
  Please, send any topic that you are interested in covering.
 
 I would like to discuss the state of the QEMU release process, please:
 
 * -rc1 has not been tagged.
 * Who besides Anthony could upload a tarball if we tag and create it?
 * make-release fix for SeaBIOS on the list. Ping, and are more affected?

+1

Stefan


Re: [Qemu-devel] KVM call agenda for 2014-04-01

2014-03-31 Thread Peter Maydell
On 31 March 2014 14:21, Christian Borntraeger borntrae...@de.ibm.com wrote:
 Another thing might be the release process in general. Currently it seems
 that everybody tries to push everything just before the hard freeze.  I had
 to debug some problems introduced _after_ soft freeze. Is there some
 interest in having a Linux-like process (merge window + stabilization)? This
 would require shorter release cycles of course.

merge window has been suggested before. I think it would be
a terrible idea for QEMU, personally. We're not the kernel in
many ways, notably dev community size and a greater tendency
to changes that have effects across the whole tree.

Soft + hard freeze is our stabilization period currently.

thanks
-- PMM


Re: [Qemu-devel] KVM call agenda for 2014-04-01

2014-03-31 Thread Christian Borntraeger
On 31/03/14 12:51, Andreas Färber wrote:
 Hi,
 
 On 31.03.2014 12:40, Juan Quintela wrote:

 Please, send any topic that you are interested in covering.
 
 I would like to discuss the state of the QEMU release process, please:
 
 * -rc1 has not been tagged.
 * Who besides Anthony could upload a tarball if we tag and create it?
 * make-release fix for SeaBIOS on the list. Ping, and are more affected?

+1 for this one.

Another thing might be the release process in general. Currently it seems
that everybody tries to push everything just before the hard freeze.  I had
to debug some problems introduced _after_ soft freeze. Is there some
interest in having a Linux-like process (merge window + stabilization)? This
would require shorter release cycles of course.

Christian






Re: [PATCH v3 2/4] KVM: Add SMAP support when setting CR4

2014-03-31 Thread Paolo Bonzini

Just a few comments...


-static void update_permission_bitmask(struct kvm_vcpu *vcpu,
+void update_permission_bitmask(struct kvm_vcpu *vcpu,
struct kvm_mmu *mmu, bool ept)
 {
unsigned bit, byte, pfec;
u8 map;
-   bool fault, x, w, u, wf, uf, ff, smep;
+   bool fault, x, w, u, wf, uf, ff, smapf, cr4_smap, smep, smap = 0;

smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP);


Can you make an additional patch to rename this to cr4_smep?


+   cr4_smap = kvm_read_cr4_bits(vcpu, X86_CR4_SMAP);
for (byte = 0; byte < ARRAY_SIZE(mmu->permissions); ++byte) {
pfec = byte << 1;
map = 0;
wf = pfec & PFERR_WRITE_MASK;
uf = pfec & PFERR_USER_MASK;
ff = pfec & PFERR_FETCH_MASK;
+   /*
+* PFERR_RSVD_MASK bit is used to detect SMAP violation.
+* We will check it in permission_fault(), this bit is
+* set in pfec for normal fault, while it is cleared for
+* SMAP violations.
+*/


This bit is set in PFEC if the access is _not_ subject to SMAP 
restrictions, and cleared otherwise.  The bit is only meaningful if

the SMAP bit is set in CR4.


+   smapf = !(pfec & PFERR_RSVD_MASK);
for (bit = 0; bit < 8; ++bit) {
x = bit & ACC_EXEC_MASK;
w = bit & ACC_WRITE_MASK;
@@ -3627,11 +3635,32 @@ static void update_permission_bitmask(struct kvm_vcpu 
*vcpu,
w |= !is_write_protection(vcpu) && !uf;
/* Disallow supervisor fetches of user code if 
cr4.smep */
x &= !(smep && u && !uf);
+
+   /*
+* SMAP: kernel-mode data accesses from user-mode
+* mappings should fault. A fault is considered
+* as a SMAP violation if all of the following
+* conditions are true:
+*   - X86_CR4_SMAP is set in CR4
+*   - A user page is accessed
+*   - Page fault in kernel mode
+*   - !(CPL < 3 && X86_EFLAGS_AC is set)


- if CPL < 3, EFLAGS.AC is clear


+*   Here, we cover the first three conditions.
+*   The CPL and X86_EFLAGS_AC check is in smapf, which
+*   permission_fault() computes dynamically.


The fourth is computed dynamically in permission_fault() and is in SMAPF.


+*   Also, SMAP does not affect instruction
+*   fetches, add the !ff check here to make it
+*   clearer.
+*/
+   smap = cr4_smap && u && !uf && !ff;
} else
/* Not really needed: no U/S accesses on ept  */
u = 1;

-   fault = (ff && !x) || (uf && !u) || (wf && !w);
+   fault = (ff && !x) || (uf && !u) || (wf && !w) ||
+   (smapf && smap);
map |= fault  bit;
}
mmu->permissions[byte] = map;
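For context, the table built above is consulted at fault time with a single shift, index, and mask. A stand-alone user-space sketch (an approximation for illustration, not the kernel's exact code; CR0.WP is assumed set so the write-protection term drops out) behaves like this:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PFERR_WRITE_MASK (1U << 1)
#define PFERR_USER_MASK  (1U << 2)
#define PFERR_RSVD_MASK  (1U << 3)
#define PFERR_FETCH_MASK (1U << 4)

#define ACC_EXEC_MASK  1U
#define ACC_WRITE_MASK 2U
#define ACC_USER_MASK  4U

/*
 * Build the 16-byte permission table the way the patch does: one byte
 * per (pfec >> 1) value, one bit per pte-access combination.  As in
 * the patch, PFERR_RSVD_MASK set in pfec means the access is NOT
 * subject to SMAP.
 */
static void build_permissions(uint8_t perm[16], bool cr4_smep, bool cr4_smap)
{
	for (unsigned byte = 0; byte < 16; ++byte) {
		unsigned pfec = byte << 1;
		bool wf = pfec & PFERR_WRITE_MASK;
		bool uf = pfec & PFERR_USER_MASK;
		bool ff = pfec & PFERR_FETCH_MASK;
		bool smapf = !(pfec & PFERR_RSVD_MASK);
		uint8_t map = 0;

		for (unsigned bit = 0; bit < 8; ++bit) {
			bool x = bit & ACC_EXEC_MASK;
			bool w = bit & ACC_WRITE_MASK;
			bool u = bit & ACC_USER_MASK;
			bool smap;

			x &= !(cr4_smep && u && !uf);
			smap = cr4_smap && u && !uf && !ff;
			map |= ((ff && !x) || (uf && !u) || (wf && !w) ||
				(smapf && smap)) << bit;
		}
		perm[byte] = map;
	}
}

/* Fault-time check: one shift, one index, one AND. */
static bool permission_fault(const uint8_t perm[16], unsigned pfec,
			     unsigned pte_access)
{
	return (perm[pfec >> 1] >> pte_access) & 1;
}
```

With cr4_smap set, a kernel-mode data access to a user page faults, while the same pfec with PFERR_RSVD_MASK folded in (the CPL/EFLAGS.AC exemption decided dynamically in permission_fault()) does not.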
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 2926152..822190f 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -44,11 +44,17 @@
 #define PT_DIRECTORY_LEVEL 2
 #define PT_PAGE_TABLE_LEVEL 1

-#define PFERR_PRESENT_MASK (1U << 0)
-#define PFERR_WRITE_MASK (1U << 1)
-#define PFERR_USER_MASK (1U << 2)
-#define PFERR_RSVD_MASK (1U << 3)
-#define PFERR_FETCH_MASK (1U << 4)
+#define PFERR_PRESENT_BIT 0
+#define PFERR_WRITE_BIT 1
+#define PFERR_USER_BIT 2
+#define PFERR_RSVD_BIT 3
+#define PFERR_FETCH_BIT 4
+
+#define PFERR_PRESENT_MASK (1U << PFERR_PRESENT_BIT)
+#define PFERR_WRITE_MASK (1U << PFERR_WRITE_BIT)
+#define PFERR_USER_MASK (1U << PFERR_USER_BIT)
+#define PFERR_RSVD_MASK (1U << PFERR_RSVD_BIT)
+#define PFERR_FETCH_MASK (1U << PFERR_FETCH_BIT)
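The rewrite is purely mechanical; a trivial stand-alone snippet (not kernel code) confirms the bit-number-derived masks are identical to the literals they replace:

```c
#include <assert.h>

#define PFERR_PRESENT_BIT 0
#define PFERR_WRITE_BIT 1
#define PFERR_USER_BIT 2
#define PFERR_RSVD_BIT 3
#define PFERR_FETCH_BIT 4

/* Masks derived from the bit numbers, as in the patch. */
#define PFERR_PRESENT_MASK (1U << PFERR_PRESENT_BIT)
#define PFERR_WRITE_MASK (1U << PFERR_WRITE_BIT)
#define PFERR_USER_MASK (1U << PFERR_USER_BIT)
#define PFERR_RSVD_MASK (1U << PFERR_RSVD_BIT)
#define PFERR_FETCH_MASK (1U << PFERR_FETCH_BIT)
```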

 int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu *vcpu, u64 addr, u64 sptes[4]);
 void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask);
@@ -73,6 +79,8 @@ int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 
addr, bool direct);
 void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context);
 void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context,
bool execonly);
+void update_permission_bitmask(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
+   bool ept);

 static inline unsigned int kvm_mmu_available_pages(struct kvm *kvm)
 {
@@ -110,10 +118,30 @@ static inline bool is_write_protection(struct kvm_vcpu 
*vcpu)
  * Will a fault with a given page-fault error code (pfec) cause a permission
  * fault with the given access (in 

Re: [PATCH 3/4] KVM: PPC: Alow kvmppc_get_last_inst() to fail

2014-03-31 Thread Alexander Graf

On 03/26/2014 09:52 PM, Scott Wood wrote:

On Thu, 2014-02-20 at 18:30 +0200, Mihai Caraman wrote:

diff --git a/arch/powerpc/kvm/book3s_paired_singles.c 
b/arch/powerpc/kvm/book3s_paired_singles.c
index a59a25a..80c533e 100644
--- a/arch/powerpc/kvm/book3s_paired_singles.c
+++ b/arch/powerpc/kvm/book3s_paired_singles.c
@@ -640,19 +640,24 @@ static int kvmppc_ps_one_in(struct kvm_vcpu *vcpu, bool 
rc,
  
  int kvmppc_emulate_paired_single(struct kvm_run *run, struct kvm_vcpu *vcpu)

  {
-   u32 inst = kvmppc_get_last_inst(vcpu);
+   u32 inst;
enum emulation_result emulated = EMULATE_DONE;
-
-   int ax_rd = inst_get_field(inst, 6, 10);
-   int ax_ra = inst_get_field(inst, 11, 15);
-   int ax_rb = inst_get_field(inst, 16, 20);
-   int ax_rc = inst_get_field(inst, 21, 25);
-   short full_d = inst_get_field(inst, 16, 31);
-
-   u64 *fpr_d = &vcpu->arch.fpr[ax_rd];
-   u64 *fpr_a = &vcpu->arch.fpr[ax_ra];
-   u64 *fpr_b = &vcpu->arch.fpr[ax_rb];
-   u64 *fpr_c = &vcpu->arch.fpr[ax_rc];
+   int ax_rd, ax_ra, ax_rb, ax_rc;
+   short full_d;
+   u64 *fpr_d, *fpr_a, *fpr_b, *fpr_c;
+
+   kvmppc_get_last_inst(vcpu, &inst);

Should probably check for failure here and elsewhere -- even though it
can't currently fail on book3s, the interface now allows it.


diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
index 5b9e906..b0d884d 100644
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -624,9 +624,10 @@ void kvmppc_giveup_ext(struct kvm_vcpu *vcpu, ulong msr)
  static int kvmppc_read_inst(struct kvm_vcpu *vcpu)
  {
ulong srr0 = kvmppc_get_pc(vcpu);
-   u32 last_inst = kvmppc_get_last_inst(vcpu);
+   u32 last_inst;
int ret;
  
+	kvmppc_get_last_inst(vcpu, &last_inst);

ret = kvmppc_ld(vcpu, &srr0, sizeof(u32), &last_inst, false);

This isn't new, but this function looks odd to me -- calling
kvmppc_get_last_inst() but ignoring last_inst, then calling kvmppc_ld()
and ignoring anything but failure.  last_inst itself is never read.  And
no comments to explain the weirdness. :-)

I get that kvmppc_get_last_inst() is probably being used for the side
effect of filling in vcpu->arch.last_inst, but why store the return
value without using it?  Why pass the address of it to kvmppc_ld(),
which seems to be used only as an indirect way of determining whether
kvmppc_get_last_inst() failed?  And that whole mechanism becomes
stranger once it becomes possible for kvmppc_get_last_inst() to directly
return failure.


If you're interested in the history of this, here's the patch :)

https://github.com/mirrors/linux-2.6/commit/c7f38f46f2a98d232147e47284cb4e7363296a3e

The idea is that we need 2 things to be good after this function:

  1) vcpu-arch.last_inst is valid
  2) if the last instruction is not readable, return failure

Hence this weird construct. I don't think it's really necessary though - 
just remove the kvmppc_ld() call and only fail read_inst() when the 
caching didn't work and we can't translate the address.
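The simplification suggested here can be sketched in stand-alone C (types, return codes, and the failing fetch are stubbed out for illustration; the real kvmppc_read_inst() in book3s_pr.c would inject an interrupt and re-enter the guest on failure rather than just returning):

```c
#include <assert.h>
#include <stdint.h>

#define KVM_INST_FETCH_FAILED 0xffffffffu

enum emulation_result { EMULATE_DONE, EMULATE_AGAIN };

/* Stub vcpu: just enough state for the sketch. */
struct kvm_vcpu {
	uint32_t cached_inst;
};

/*
 * Stubbed kvmppc_get_last_inst(): fills *inst and returns 0 on
 * success, or nonzero when the cached instruction could not be
 * fetched/translated (the failure mode the new interface exposes).
 */
static int kvmppc_get_last_inst(struct kvm_vcpu *vcpu, uint32_t *inst)
{
	*inst = vcpu->cached_inst;
	return vcpu->cached_inst == KVM_INST_FETCH_FAILED ? -1 : 0;
}

/*
 * The suggested shape of kvmppc_read_inst(): no redundant kvmppc_ld()
 * call; fail only when the instruction caching/translation failed.
 */
static enum emulation_result kvmppc_read_inst(struct kvm_vcpu *vcpu)
{
	uint32_t last_inst;

	if (kvmppc_get_last_inst(vcpu, &last_inst)) {
		/* real code: deliver an ISI to the guest here */
		return EMULATE_AGAIN;
	}
	return EMULATE_DONE;
}
```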



Alex



Re: [PATCH 4/4] KVM: PPC: Bookehv: Get vcpu's last instruction for emulation

2014-03-31 Thread Alexander Graf

On 03/26/2014 10:17 PM, Scott Wood wrote:

On Thu, 2014-02-20 at 18:30 +0200, Mihai Caraman wrote:

Load external pid (lwepx) instruction faults (when called from
KVM with guest context) need to be handled by KVM. This implies
additional code in the DO_KVM macro to identify the source of the
exception (which originates from the KVM host rather than the guest).
The hook requires checking the Exception Syndrome Register
ESR[EPID] and External PID Load Context Register EPLC[EGS] for
some exceptions (DTLB_MISS, DSI and LRAT). Doing this on the Data TLB
miss exception is obviously intrusive for the host.

Get rid of lwepx and acquire the last instruction in kvmppc_get_last_inst()
by searching for the physical address and kmapping it. This fixes an
infinite loop caused by lwepx's data TLB miss being handled in the host
and the TODO for TLB eviction and execute-but-not-read entries.

Signed-off-by: Mihai Caraman mihai.cara...@freescale.com
---
  arch/powerpc/kvm/bookehv_interrupts.S |   37 +++--
  arch/powerpc/kvm/e500_mmu_host.c  |   93 +
  2 files changed, 102 insertions(+), 28 deletions(-)

diff --git a/arch/powerpc/kvm/bookehv_interrupts.S 
b/arch/powerpc/kvm/bookehv_interrupts.S
index 20c7a54..c50490c 100644
--- a/arch/powerpc/kvm/bookehv_interrupts.S
+++ b/arch/powerpc/kvm/bookehv_interrupts.S
@@ -119,38 +119,14 @@
  1:
  
.if	\flags & NEED_EMU

-   /*
-* This assumes you have external PID support.
-* To support a bookehv CPU without external PID, you'll
-* need to look up the TLB entry and create a temporary mapping.
-*
-* FIXME: we don't currently handle if the lwepx faults.  PR-mode
-* booke doesn't handle it either.  Since Linux doesn't use
-* broadcast tlbivax anymore, the only way this should happen is
-* if the guest maps its memory execute-but-not-read, or if we
-* somehow take a TLB miss in the middle of this entry code and
-* evict the relevant entry.  On e500mc, all kernel lowmem is
-* bolted into TLB1 large page mappings, and we don't use
-* broadcast invalidates, so we should not take a TLB miss here.
-*
-* Later we'll need to deal with faults here.  Disallowing guest
-* mappings that are execute-but-not-read could be an option on
-* e500mc, but not on chips with an LRAT if it is used.
-*/
-
-   mfspr   r3, SPRN_EPLC   /* will already have correct ELPID and EGS */
PPC_STL r15, VCPU_GPR(R15)(r4)
PPC_STL r16, VCPU_GPR(R16)(r4)
PPC_STL r17, VCPU_GPR(R17)(r4)
PPC_STL r18, VCPU_GPR(R18)(r4)
PPC_STL r19, VCPU_GPR(R19)(r4)
-   mr  r8, r3
PPC_STL r20, VCPU_GPR(R20)(r4)
-   rlwimi  r8, r6, EPC_EAS_SHIFT - MSR_IR_LG, EPC_EAS
PPC_STL r21, VCPU_GPR(R21)(r4)
-   rlwimi  r8, r6, EPC_EPR_SHIFT - MSR_PR_LG, EPC_EPR
PPC_STL r22, VCPU_GPR(R22)(r4)
-   rlwimi  r8, r10, EPC_EPID_SHIFT, EPC_EPID
PPC_STL r23, VCPU_GPR(R23)(r4)
PPC_STL r24, VCPU_GPR(R24)(r4)
PPC_STL r25, VCPU_GPR(R25)(r4)
@@ -160,10 +136,15 @@
PPC_STL r29, VCPU_GPR(R29)(r4)
PPC_STL r30, VCPU_GPR(R30)(r4)
PPC_STL r31, VCPU_GPR(R31)(r4)
-   mtspr   SPRN_EPLC, r8
-   isync
-   lwepx   r9, 0, r5
-   mtspr   SPRN_EPLC, r3
+
+   /*
+* We don't use external PID support. lwepx faults would need to be
+* handled by KVM and this implies additional code in DO_KVM (for
+* DTLB_MISS, DSI and LRAT) to check ESR[EPID] and EPLC[EGS] which
+* is too intrusive for the host. Get the last instruction in
+* kvmppc_get_last_inst().
+*/
+   li  r9, KVM_INST_FETCH_FAILED
stw r9, VCPU_LAST_INST(r4)
.endif
  
diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c

index 6025cb7..1b4cb41 100644
--- a/arch/powerpc/kvm/e500_mmu_host.c
+++ b/arch/powerpc/kvm/e500_mmu_host.c
@@ -598,9 +598,102 @@ void kvmppc_mmu_map(struct kvm_vcpu *vcpu, u64 eaddr, 
gpa_t gpaddr,
}
  }
  
+#ifdef CONFIG_KVM_BOOKE_HV

+int kvmppc_ld_inst(struct kvm_vcpu *vcpu, u32 *instr)

It'd be interesting to see what the performance impact of doing this on
non-HV would be -- it would eliminate divergent code, eliminate the
MSR_DS hack, and make exec-only mappings work.


We hit the instruction emulation path a lot more often on non-HV, so 
even a slight performance impact that might not be a major bummer for HV 
would become critical for PR.


But I agree - it'd be interesting to see numbers.




+{
+   gva_t geaddr;
+   hpa_t addr;
+   hfn_t pfn;
+   hva_t eaddr;
+   u32 mas0, mas1, mas2, mas3;
+   u64 mas7_mas3;
+   struct page *page;
+   unsigned int addr_space, psize_shift;
+   bool pr;
+   unsigned long flags;
+
+   /* Search TLB for guest pc to get the real address */
+   geaddr = kvmppc_get_pc(vcpu);
+   addr_space = (vcpu->arch.shared->msr &

Re: [Qemu-devel] KVM call agenda for 2014-04-01

2014-03-31 Thread Anthony Liguori
On Mon, Mar 31, 2014 at 6:25 AM, Peter Maydell peter.mayd...@linaro.org wrote:
 On 31 March 2014 14:21, Christian Borntraeger borntrae...@de.ibm.com wrote:
 Another thing might be the release process in general. Currently it seems
 that everybody tries to push everything just before the hard freeze.  I had
 to debug some problems introduced _after_ soft freeze. Is there some
 interest in having a Linux-like process (merge window + stabilization)? This
 would require shorter release cycles of course.

 merge window has been suggested before. I think it would be
 a terrible idea for QEMU, personally. We're not the kernel in
 many ways, notably dev community size and a greater tendency
 to changes that have effects across the whole tree.

 Soft + hard freeze is our stabilization period currently.

Peter, are you willing to do the tagging and announcement for the 2.0
rcs?  I sent instructions privately and between stefanha and I we can
get your permissions sorted out.

Regards,

Anthony Liguori

 thanks
 -- PMM


Re: [Qemu-devel] KVM call agenda for 2014-04-01

2014-03-31 Thread Paolo Bonzini

On 31/03/2014 16:01, Anthony Liguori wrote:

 merge window has been suggested before. I think it would be
 a terrible idea for QEMU, personally. We're not the kernel in
 many ways, notably dev community size and a greater tendency
 to changes that have effects across the whole tree.

 Soft + hard freeze is our stabilization period currently.

Peter, are you willing to do the tagging and announcement for the 2.0
rcs?  I sent instructions privately and between stefanha and I we can
get your permissions sorted out.


I think it would be a good idea to separate the committer and release 
manager roles.  Peter is providing the community with a wonderful 
service, just like you were; putting too much work on his shoulders 
risks getting us in the same situation if anything were to affect his 
ability to provide it.


Paolo


Re: [Qemu-devel] KVM call agenda for 2014-04-01

2014-03-31 Thread Peter Maydell
On 31 March 2014 15:28, Paolo Bonzini pbonz...@redhat.com wrote:
 I think it would be a good idea to separate the committer and release
 manager roles.  Peter is providing the community with a wonderful service,
 just like you were; putting too much work on his shoulders risks getting us
 in the same situation if anything were to affect his ability to provide it.

Yes, I strongly agree with this. I think we'll do much better
if we can manage to share out responsibilities among a wider
group of people.

thanks
-- PMM


Re: [Qemu-devel] KVM call agenda for 2014-04-01

2014-03-31 Thread Andreas Färber
On 31.03.2014 16:32, Peter Maydell wrote:
 On 31 March 2014 15:28, Paolo Bonzini pbonz...@redhat.com wrote:
 I think it would be a good idea to separate the committer and release
 manager roles.  Peter is providing the community with a wonderful service,
 just like you were; putting too much work on his shoulders risks getting us
 in the same situation if anything were to affect his ability to provide it.
 
 Yes, I strongly agree with this. I think we'll do much better
 if we can manage to share out responsibilities among a wider
 group of people.

May I propose Michael Roth, who is already experienced from the N-1
stable releases?

If we can enable him to upload the tarballs created from his tags that
would also streamline the stable workflow while at it.

Regards,
Andreas

-- 
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg


Re: [Qemu-devel] KVM call agenda for 2014-04-01

2014-03-31 Thread Anthony Liguori
On Mon, Mar 31, 2014 at 7:46 AM, Andreas Färber afaer...@suse.de wrote:
 On 31.03.2014 16:32, Peter Maydell wrote:
 On 31 March 2014 15:28, Paolo Bonzini pbonz...@redhat.com wrote:
 I think it would be a good idea to separate the committer and release
 manager roles.  Peter is providing the community with a wonderful service,
 just like you were; putting too much work on his shoulders risks getting us
 in the same situation if anything were to affect his ability to provide it.

 Yes, I strongly agree with this. I think we'll do much better
 if we can manage to share out responsibilities among a wider
 group of people.

 May I propose Michael Roth, who is already experienced from the N-1
 stable releases?

 If we can enable him to upload the tarballs created from his tags that
 would also streamline the stable workflow while at it.

If mdroth is willing to take this on, I am very supportive.

Regards,

Anthony Liguori


 Regards,
 Andreas

 --
 SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
 GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg


Re: [Qemu-devel] KVM call agenda for 2014-04-01

2014-03-31 Thread Markus Armbruster
Anthony Liguori anth...@codemonkey.ws writes:

 On Mon, Mar 31, 2014 at 7:46 AM, Andreas Färber afaer...@suse.de wrote:
 On 31.03.2014 16:32, Peter Maydell wrote:
 On 31 March 2014 15:28, Paolo Bonzini pbonz...@redhat.com wrote:
 I think it would be a good idea to separate the committer and release
 manager roles.  Peter is providing the community with a wonderful service,
 just like you were; putting too much work on his shoulders risks getting us
 in the same situation if anything were to affect his ability to provide it.

 Yes, I strongly agree with this. I think we'll do much better
 if we can manage to share out responsibilities among a wider
 group of people.

 May I propose Michael Roth, who is already experienced from the N-1
 stable releases?

 If we can enable him to upload the tarballs created from his tags that
 would also streamline the stable workflow while at it.

 If mdroth is willing to take this on, I am very supportive.

Another vote of confidence from me.


Re: [RFC v2] ARM VM System Specification

2014-03-31 Thread Christoffer Dall
On Sun, Mar 30, 2014 at 03:10:50PM -0700, Olof Johansson wrote:
 On Fri, Mar 28, 2014 at 11:45 AM, Christoffer Dall
 christoffer.d...@linaro.org wrote:
  ARM VM System Specification
  ===
 
 [not quoting the whole spec here]
 
 This looks very sane to me, and aligns very well with non-virtualized images.
 
 For what it's worth, even though it's not a patch:
 
 Acked-by: Olof Johansson o...@lixom.net

Thanks, appreciate it.

 
 What's the plan on how to lock this in? Where's this document going to
 live? In the kernel sources under Documentation/, or external?
 

There were talks during LCA14 about publishing it as a Linaro
white-paper, but I would like it to be hosted by the open-source
projects that relate to it, for example the Linux kernel under
Documentation/ to represent both the guest and KVM side, and the Xen
sources too.

However, that may be a pain to update so the preferred method could be
to host it in either its current form (clear-text) or publish some
versioned PDF somewhere and simply point to it from the compliant
software pieces.

Suggestions or input welcome.

-Christoffer


Re: VDSO pvclock may increase host cpu consumption, is this a problem?

2014-03-31 Thread Andy Lutomirski
On 03/29/2014 01:47 AM, Zhanghailiang wrote:
 Hi,
 I found when Guest is idle, VDSO pvclock may increase host consumption.
 We can calculate as follows; correct me if I am wrong:
   (Host) 250 * update_pvclock_gtod = 1500 * gettimeofday (Guest)
 In Host, VDSO pvclock introduces a notifier chain, pvclock_gtod_chain in 
 timekeeping.c. It consumes nearly 900 cycles per call. So at 
 250 Hz, it may consume 225,000 cycles per second, even if no VM is created.
 In Guest, gettimeofday consumes 220 cycles per call with VDSO pvclock. If 
 no-kvmclock-vsyscall is configured, gettimeofday consumes 370 cycles per 
 call. The feature decreases consumption by 150 cycles per call.
 When gettimeofday is called 1500 times, it saves 225,000 cycles, equal to the 
 host consumption.
 Both Host and Guest are linux-3.13.6.
 So, is the host cpu consumption a problem?
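The break-even arithmetic in the quoted message can be checked mechanically; the cycle counts below are the ones measured above on linux-3.13.6, not universal constants:

```c
#include <assert.h>

/* Host: 250 timer ticks/s, ~900 cycles per update_pvclock_gtod notifier. */
enum { HOST_HZ = 250, HOST_CYCLES_PER_UPDATE = 900 };
/* Guest: gettimeofday costs ~220 cycles with vDSO pvclock, ~370 without. */
enum { GTOD_WITH_PVCLOCK = 220, GTOD_WITHOUT_PVCLOCK = 370 };

/* Cycles the host burns per second just running the notifier chain. */
static long host_overhead_per_sec(void)
{
	return (long)HOST_HZ * HOST_CYCLES_PER_UPDATE;
}

/* Guest gettimeofday calls/sec needed to amortize the host overhead. */
static long break_even_calls_per_sec(void)
{
	return host_overhead_per_sec() /
	       (GTOD_WITHOUT_PVCLOCK - GTOD_WITH_PVCLOCK);
}
```

Below 1500 guest gettimeofday calls per second, the host-side notifier overhead exceeds what the guest saves, which is the concern being raised.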

Does pvclock serve any real purpose on systems with fully-functional
TSCs?  The x86 guest implementation is awful, so it's about 2x slower
than TSC.  It could be improved a lot, but I'm not sure I understand why
it exists in the first place.

I certainly understand the goal of keeping the guest CLOCK_REALTIME in
sync with the host, but pvclock seems like overkill for that.

--Andy



Re: Demand paging for VM on KVM

2014-03-31 Thread Andrea Arcangeli
Hi Grigory,

On Thu, Mar 20, 2014 at 10:50:07AM -0700, Grigory Makarevich wrote:
 Andrea, Paolo,
 
 Thanks a lot for the comments.
 
 I like the idea of userfaultfd a lot.  For my prototype I had to solve a
 problem of accessing the ondemand page from paths where exiting is
 not safe (the emulator is one example). I solved it by sending a message over
 a netlink socket, blocking the calling thread, and waking it up once the page
 is delivered. userfaultfd might be a cleaner way to achieve the same goal.

I'm glad you like the idea of userfaultfd, the way the syscall works
tends to mirror the eventfd(2) syscall but the protocol talked on the
fd is clearly different.

So you also made the vcpu talk from the kernel to the ondemand
thread through some file descriptor and protocol. So the difference is
that it shouldn't be limited to the emulator but it shall work for all
syscalls that the IO thread may invoke and that could hit on missing
guest physical memory.

 My concerns regarding general mm solution are:
 
 - Will it work with any memory mapping schema or only with anonymous memory
 ?

Currently I have the hooks only into anonymous memory but there's no
reason why this shouldn't work for other kinds of page faults. The main
issue is not with the technicality in waiting in the page fault but
with the semantics of a page not being mapped.

If there's a hole in a filebacked mapping, things are different than
if there's a hole in anonymous memory. As the VM can unmap and free a
filebacked page at any time. But if you're ok to notify the migration
thread even in such a case, it could work the same.

It would be more tricky if we had to differentiate an initial fault
after the vma is created (in order to notify the migration thread only
for initial faults), from faults triggered after the VM unmapped and
freed the page as result of VM pressure. That would require putting a
placeholder in the pagetable instead of keeping the VM code identical
to now (which zeroes a pagetable when a page is unmapped from a
filebacked mapping).

There is also some similarity between the userfault mechanism and the
volatile ranges but volatile ranges are handled in a fine-grained way in
the pagetables, but it looked like there was no code to share in the
end. Volatile ranges provides a very different functionality from
userfault, for example they don't need to provide transparent behavior
if the memory is given as parameter to syscalls as far as I know. They
are used only to store data in memory accessed by userland through the
pagetables (not syscalls). The objective of userfault is also
fundamentally different from the objective of volatile ranges. We
cannot ever lose any data, while their whole point is to lose data if
there's VM pressure.

However if you want to extend the userfaultfd functionality to trap the
first access in the pagetables for filebacked pages, we would also
need to mangle pagetables with some placeholders to differentiate the
first fault and we may want to revisit if there's some code or
placeholder to share across the two features. Ideally support for
filebacked ranges could be a added at a second stage.

 - Before blocking the calling thread while serving the page-fault in
 host-kernel, one would need carefully release mmu semaphore (otherwise, the
 user-space might be in trouble serving the page-fault), which may be not
 that trivial.

Yes all locks must be dropped before waiting for the migration thread
and that includes the mmap_sem. I don't think we'll need to add much
complexity though, because we can rely on the behavior of page faults
that can be repeated endlessly until they succeed. It's certainly more
trivial for real faults than gup (because gup can work on more than 1
page at time so it may require some rolling back or special lock
retaking during the gup loop). With the real faults the only trick as
optimization will be to repeat it without returning to userland.

 Regarding qemu part of that:
 
 -  Yes, indeed, user-space would need to be careful accessing ondemand
 pages. However,  that should not be a problem, considering that qemu would
 need to know in advance all ondemand regions.  Though, I would expect
 some  refactoring of the qemu internal api might be required.

I don't think the migration thread would risk messing with missing
memory, but the refactoring of some API and wire protocol will be
still needed to make the protocol bidirectional and to handle the
postcopy mechanism. David is working on it.

Paolo also pointed out one case that won't be entirely transparent. If
gdb would debug the migration thread breakpointing into it, and then
the gdb user tries to touch missing memory with ptrace after it
stopped the migration thread, gdb would then soft lockup. It'll
require a sigkill to unblock. There's no real way to fix it in my
view... and overall it looks quite reasonable to end up in a soft
lockup in such a scenario; I mean, gdb has other ways to interfere with
the app in bad ways by corrupting its 

RE: mechanism to allow a driver to bind to any device

2014-03-31 Thread Stuart Yoder


 -Original Message-
 From: Alex Williamson [mailto:alex.william...@redhat.com]
 Sent: Wednesday, March 26, 2014 5:09 PM
 To: Alexander Graf
 Cc: kvm@vger.kernel.org; jan.kis...@siemens.com; will.dea...@arm.com;
 Yoder Stuart-B08248; a.r...@virtualopensystems.com; Michal Hocko; Wood
 Scott-B07421; Sethi Varun-B16395; kvm...@lists.cs.columbia.edu; Rafael J.
 Wysocki; Guenter Roeck; Dmitry Kasatkin; Tejun Heo; Bjorn Helgaas;
 Antonios Motakis; t...@virtualopensystems.com; Toshi Kani; Greg KH;
 linux-ker...@vger.kernel.org; io...@lists.linux-foundation.org; Joe
 Perches; christoffer.d...@linaro.org
 Subject: Re: mechanism to allow a driver to bind to any device
 
 On Wed, 2014-03-26 at 10:21 -0600, Alex Williamson wrote:
  On Wed, 2014-03-26 at 23:06 +0800, Alexander Graf wrote:
  
 On 26.03.2014 at 22:40, Konrad Rzeszutek Wilk
 konrad.w...@oracle.com wrote:
   
On Wed, Mar 26, 2014 at 01:40:32AM +, Stuart Yoder wrote:
Hi Greg,
   
We (Linaro, Freescale, Virtual Open Systems) are trying get an
 issue
closed that has been perculating for a while around creating a
 mechanism
that will allow kernel drivers like vfio can bind to devices of
 any type.
   
This thread with you:
http://www.spinics.net/lists/kvm-arm/msg08370.html
...seems to have died out, so am trying to get your response
and will summarize again.  Vfio drivers in the kernel (regardless
 of
bus type) need to bind to devices of any type.  The driver's
 function
is to simply export hardware resources of any type to user space.
   
There are several approaches that have been proposed:
   
You seem to have missed the one I proposed.
   
  1.  new_id -- (current approach) the user explicitly registers
  each new device type with the vfio driver using the new_id
  mechanism.
   
  Problem: multiple drivers will be resident that handle the
  same device type...and there is nothing user space hotplug
  infrastructure can do to help.
   
  2.  any id -- the vfio driver could specify a wildcard match
  of some kind in its ID match table which would allow it to
  match and bind to any possible device id.  However,
  we don't want the vfio driver grabbing _all_ devices...just
 the ones we
  explicitly want to pass to user space.
   
  The proposed patch to support this was to create a new flag
  sysfs_bind_only in struct device_driver.  When this flag
  is set, the driver can only bind to devices via the sysfs
  bind file.  This would allow the wildcard match to work.
   
  Patch is here:
  https://lkml.org/lkml/2013/12/3/253
   
  3.  Driver initiated explicit bind -- with this approach the
  vfio driver would create a private 'bind' sysfs object
  and the user would echo the requested device into it:
   
  echo 0001:03:00.0 > /sys/bus/pci/drivers/vfio-pci/vfio_bind
   
  In order to make that work, the driver would need to call
  driver_probe_device() and thus we need this patch:
  https://lkml.org/lkml/2014/2/8/175
   
4). Use the 'unbind' (from the original device) and 'bind' to the vfio driver.
  
   This is approach 2, no?
  
   
Which I think is what is currently being done. Why is that not
 sufficient?
  
   How would 'bind to vfio driver' look like?
  
The only thing I see in the URL is "That works, but it is ugly."
There is some mention of a race but I don't see how - if you do the
'unbind' on the original driver and then bind the BDF to the VFIO
driver how would you get a race?
  
   Typically on PCI, you do a
  
 - add wildcard (pci id) match to vfio driver
 - unbind driver
 - reprobe
 - device attaches to vfio driver because it is the least recent match
 - remove wildcard match from vfio driver
  
   If in between you hotplug add a card of the same type, it gets
 attached to vfio - even though the logical default driver would be the
 device specific driver.
 
  I've mentioned drivers_autoprobe in the past, but I'm not sure we're
  really factoring it into the discussion.  drivers_autoprobe allows us to
  toggle two points:
 
  a) When a new device is added whether we automatically give drivers a
  try at binding to it
 
  b) When a new driver is added whether it gets to try to bind to anything
  in the system
 
  So we do have a mechanism to avoid the race, but the problem is that it
  becomes the responsibility of userspace to:
 
  1) turn off drivers_autoprobe
  2) unbind/new_id/bind/remove_id
  3) turn on drivers_autoprobe
  4) call drivers_probe for anything added between 1) and 3)
 
  Is the question about the ugliness of the current solution whether it's
  unreasonable to ask userspace to do this?
 
  What we seem to be asking for above is more like an autoprobe flag per
  driver where there's some way for this special driver to opt out of
  auto probing.  Option 

[PATCH 1/2] kvm: support any-length wildcard ioeventfd

2014-03-31 Thread Michael S. Tsirkin
It is sometimes beneficial to ignore IO size, and only match on address.
In hindsight this would have been a better default than matching length
when KVM_IOEVENTFD_FLAG_DATAMATCH is not set. In particular, this kind
of access can be optimized on VMX: there is no need to do page lookups.
This can currently be done with many ioeventfds but in a suboptimal way.

However we can't change kernel/userspace ABI without risk of breaking
some applications.
Use len = 0 to mean ignore length for matching in a more optimal way.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 include/uapi/linux/kvm.h |  3 ++-
 arch/x86/kvm/x86.c   |  1 +
 virt/kvm/eventfd.c   | 27 ++-
 3 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 932d7f2..d73007e 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -464,7 +464,7 @@ enum {
 struct kvm_ioeventfd {
__u64 datamatch;
__u64 addr;/* legal pio/mmio address */
-   __u32 len; /* 1, 2, 4, or 8 bytes*/
+   __u32 len; /* 1, 2, 4, or 8 bytes; or 0 to ignore length */
__s32 fd;
__u32 flags;
__u8  pad[36];
@@ -675,6 +675,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_SPAPR_MULTITCE 94
 #define KVM_CAP_EXT_EMUL_CPUID 95
 #define KVM_CAP_HYPERV_TIME 96
+#define KVM_CAP_IOEVENTFD_NO_LENGTH 97
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0c6218b..8ae1ff5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2602,6 +2602,7 @@ int kvm_dev_ioctl_check_extension(long ext)
case KVM_CAP_IRQ_INJECT_STATUS:
case KVM_CAP_IRQFD:
case KVM_CAP_IOEVENTFD:
+   case KVM_CAP_IOEVENTFD_NO_LENGTH:
case KVM_CAP_PIT2:
case KVM_CAP_PIT_STATE2:
case KVM_CAP_SET_IDENTITY_MAP_ADDR:
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index abe4d60..9eadb59 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -600,7 +600,15 @@ ioeventfd_in_range(struct _ioeventfd *p, gpa_t addr, int len, const void *val)
 {
u64 _val;
 
-   if (!(addr == p->addr && len == p->length))
+   if (addr != p->addr)
+   /* address must be precise for a hit */
+   return false;
+
+   if (!p->length)
+   /* length = 0 means only look at the address, so always a hit */
+   return true;
+
+   if (len != p->length)
 /* address-range must be precise for a hit */
 return false;
 
@@ -671,9 +679,11 @@ ioeventfd_check_collision(struct kvm *kvm, struct _ioeventfd *p)
 
	list_for_each_entry(_p, &kvm->ioeventfds, list)
	if (_p->bus_idx == p->bus_idx &&
-	    _p->addr == p->addr && _p->length == p->length &&
-	    (_p->wildcard || p->wildcard ||
-	     _p->datamatch == p->datamatch))
+	    _p->addr == p->addr &&
+	    (!_p->length || !p->length ||
+	     (_p->length == p->length &&
+	      (_p->wildcard || p->wildcard ||
+	       _p->datamatch == p->datamatch))))
return true;
 
return false;
@@ -697,8 +707,9 @@ kvm_assign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
int   ret;
 
	bus_idx = ioeventfd_bus_from_flags(args->flags);
-   /* must be natural-word sized */
+   /* must be natural-word sized, or 0 to ignore length */
	switch (args->len) {
+   case 0:
case 1:
case 2:
case 4:
@@ -716,6 +727,12 @@ kvm_assign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
	if (args->flags & ~KVM_IOEVENTFD_VALID_FLAG_MASK)
return -EINVAL;
 
+   /* ioeventfd with no length can't be combined with DATAMATCH */
+   if (!args->len &&
+   args->flags & (KVM_IOEVENTFD_FLAG_PIO |
+  KVM_IOEVENTFD_FLAG_DATAMATCH))
+   return -EINVAL;
+
	eventfd = eventfd_ctx_fdget(args->fd);
if (IS_ERR(eventfd))
return PTR_ERR(eventfd);
-- 
MST

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] vmx: speed up wildcard MMIO EVENTFD

2014-03-31 Thread Michael S. Tsirkin
With KVM, MMIO is much slower than PIO, due to the need to
do page walk and emulation. But with EPT, it does not have to be: we
know the address from the VMCS so if the address is unique, we can look
up the eventfd directly, bypassing emulation.

Unfortunately, this only works if userspace does not need to match on
access length and data.  The implementation adds a separate FAST_MMIO
bus internally. This serves two purposes:
- minimize overhead for old userspace that does not use eventfd with
  length = 0
- minimize disruption in other code (since we don't know the length,
  devices on the MMIO bus only get a valid address in write, this
  way we don't need to touch all devices to teach them to handle
  an invalid length)

At the moment, this optimization only has effect for EPT on x86.

It will be possible to speed up MMIO for NPT and MMU using the same
idea in the future.

With this patch applied, on VMX MMIO EVENTFD is essentially as fast as PIO.
I was unable to detect any measurable slowdown to non-eventfd MMIO.

Making MMIO faster is important for the upcoming virtio 1.0 which
includes an MMIO signalling capability.

The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
pre-review and suggestions.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 include/linux/kvm_host.h |  1 +
 include/uapi/linux/kvm.h |  1 +
 arch/x86/kvm/vmx.c   |  4 
 virt/kvm/eventfd.c   | 16 
 virt/kvm/kvm_main.c  |  1 +
 5 files changed, 23 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index b8e9a43..c2eaa9f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -163,6 +163,7 @@ enum kvm_bus {
KVM_MMIO_BUS,
KVM_PIO_BUS,
KVM_VIRTIO_CCW_NOTIFY_BUS,
+   KVM_FAST_MMIO_BUS,
KVM_NR_BUSES
 };
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index d73007e..fc75810 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -450,6 +450,7 @@ enum {
kvm_ioeventfd_flag_nr_pio,
kvm_ioeventfd_flag_nr_deassign,
kvm_ioeventfd_flag_nr_virtio_ccw_notify,
+   kvm_ioeventfd_flag_nr_fast_mmio,
kvm_ioeventfd_flag_nr_max,
 };
 
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 3927528..78a1932 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5493,6 +5493,10 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
gpa_t gpa;
 
gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
+   if (!kvm_io_bus_write(vcpu->kvm, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) {
+   skip_emulated_instruction(vcpu);
+   return 1;
+   }
 
ret = handle_mmio_page_fault_common(vcpu, gpa, true);
if (likely(ret == RET_MMIO_PF_EMULATE))
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 9eadb59..c61f2eb 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -770,6 +770,16 @@ kvm_assign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
	if (ret < 0)
goto unlock_fail;
 
+   /* When length is ignored, MMIO is also put on a separate bus, for
+* faster lookups.
+*/
+   if (!args->len && !(args->flags & KVM_IOEVENTFD_FLAG_PIO)) {
+   ret = kvm_io_bus_register_dev(kvm, KVM_FAST_MMIO_BUS,
+ p->addr, 0, &p->dev);
+   if (ret < 0)
+   goto register_fail;
+   }
+
	kvm->buses[bus_idx]->ioeventfd_count++;
	list_add_tail(&p->list, &kvm->ioeventfds);
 
@@ -777,6 +787,8 @@ kvm_assign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
 
return 0;
 
+register_fail:
+   kvm_io_bus_unregister_dev(kvm, bus_idx, &p->dev);
 unlock_fail:
	mutex_unlock(&kvm->slots_lock);
 
@@ -816,6 +828,10 @@ kvm_deassign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
continue;
 
	kvm_io_bus_unregister_dev(kvm, bus_idx, &p->dev);
+   if (!p->length) {
+   kvm_io_bus_unregister_dev(kvm, KVM_FAST_MMIO_BUS,
+ &p->dev);
+   }
	kvm->buses[bus_idx]->ioeventfd_count--;
ioeventfd_release(p);
ret = 0;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 03a0381..22a4e1b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2920,6 +2920,7 @@ static int __kvm_io_bus_read(struct kvm_io_bus *bus, struct kvm_io_range *range,
 
return -EOPNOTSUPP;
 }
+EXPORT_SYMBOL_GPL(kvm_io_bus_write);
 
 /* kvm_io_bus_read - called under kvm->slots_lock */
 int kvm_io_bus_read(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
-- 
MST



Re: [Qemu-devel] KVM call agenda for 2014-04-01

2014-03-31 Thread Michael Roth
Quoting Andreas Färber (2014-03-31 09:46:45)
 Am 31.03.2014 16:32, schrieb Peter Maydell:
  On 31 March 2014 15:28, Paolo Bonzini pbonz...@redhat.com wrote:
  I think it would be a good idea to separate the committer and release
  manager roles.  Peter is providing the community with a wonderful service,
  just like you were; putting too much work on his shoulders risks getting us
  in the same situation if anything were to affect his ability to provide it.
  
  Yes, I strongly agree with this. I think we'll do much better
  if we can manage to share out responsibilities among a wider
  group of people.
 
 May I propose Michael Roth, who is already experienced from the N-1
 stable releases?

Sure, I would be willing.

 
 If we can enable him to upload the tarballs created from his tags that
 would also streamline the stable workflow while at it.

Agreed, though I feel a little weird about creating releases for tags that
aren't in the official repo. Would that be acceptable from a community
stand-point? I'm honestly not sure.

Otherwise I think Anthony/Peter would probably still need to process a pull
for stable-y.x branch in advance before we do the tarball/release. Would
still help simplify things a bit though by keeping tasks compartmentalized.

Anthony, Peter: in the past, prior to release, I just sent an email with
a pointer to my github branch with the stable release tagged. Would a proper
pull request (with a for-stable-x.y tag or somesuch) be preferable?

If we opt to align the stable repo updates with the actual release, what kind
of lead time would we need prior to actual release?

 
 Regards,
 Andreas
 
 -- 
 SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
 GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg



[PATCH v2 0/3] nVMX: Fixes to run Xen as L1

2014-03-31 Thread Bandan Das
Minor changes to enable Xen as a L1 hypervisor.

Tested with a Haswell host, Xen-4.3 L1 and debian6 L2

v2:
* Remove advertising single context invalidation for emulated invept.
  Patch "KVM: nVMX: check for null vmcs12 when L1 does invept" from v1
  is now obsolete and is removed
* Reorder patches "KVM: nVMX: Advertise support for interrupt acknowledgement"
  and "nVMX: Ack and write vector info to intr_info if L1 asks us to"
* Add commit description to 2/3 and change comment for nested_exit_intr_ack_set

Jan, I will send a separate unit-test patch

Bandan Das (3):
  KVM: nVMX: Don't advertise single context invalidation for invept
  KVM: nVMX: Ack and write vector info to intr_info if L1 asks us to
  KVM: nVMX: Advertise support for interrupt acknowledgement

 arch/x86/kvm/irq.c |  1 +
 arch/x86/kvm/vmx.c | 35 ---
 2 files changed, 25 insertions(+), 11 deletions(-)

-- 
1.8.3.1



[PATCH v2 1/3] KVM: nVMX: Don't advertise single context invalidation for invept

2014-03-31 Thread Bandan Das
For single context invalidation, we fall through to global
invalidation in handle_invept() except for one case - when
the operand supplied by L1 is different from what we have in
vmcs12. However, typically hypervisors will only call invept
for the currently loaded eptp, so the condition will
never be true.

Signed-off-by: Bandan Das b...@redhat.com
---
 arch/x86/kvm/vmx.c | 15 +--
 1 file changed, 5 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 3927528..3e7f60c 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2331,12 +2331,11 @@ static __init void nested_vmx_setup_ctls_msrs(void)
 VMX_EPT_INVEPT_BIT;
	nested_vmx_ept_caps &= vmx_capability.ept;
/*
-* Since invept is completely emulated we support both global
-* and context invalidation independent of what host cpu
-* supports
+* For nested guests, we don't do anything specific
+* for single context invalidation. Hence, only advertise
+* support for global context invalidation.
 */
-   nested_vmx_ept_caps |= VMX_EPT_EXTENT_GLOBAL_BIT |
-   VMX_EPT_EXTENT_CONTEXT_BIT;
+   nested_vmx_ept_caps |= VMX_EPT_EXTENT_GLOBAL_BIT;
} else
nested_vmx_ept_caps = 0;
 
@@ -6383,7 +6382,6 @@ static int handle_invept(struct kvm_vcpu *vcpu)
struct {
u64 eptp, gpa;
} operand;
-   u64 eptp_mask = ((1ull << 51) - 1) & PAGE_MASK;
 
	if (!(nested_vmx_secondary_ctls_high & SECONDARY_EXEC_ENABLE_EPT) ||
	    !(nested_vmx_ept_caps & VMX_EPT_INVEPT_BIT)) {
@@ -6423,16 +6421,13 @@ static int handle_invept(struct kvm_vcpu *vcpu)
}
 
switch (type) {
-   case VMX_EPT_EXTENT_CONTEXT:
-   if ((operand.eptp & eptp_mask) !=
-   (nested_ept_get_cr3(vcpu) & eptp_mask))
-   break;
case VMX_EPT_EXTENT_GLOBAL:
kvm_mmu_sync_roots(vcpu);
kvm_mmu_flush_tlb(vcpu);
nested_vmx_succeed(vcpu);
break;
default:
+   /* Trap single context invalidation invept calls */
BUG_ON(1);
break;
}
-- 
1.8.3.1



[PATCH v2 2/3] KVM: nVMX: Ack and write vector info to intr_info if L1 asks us to

2014-03-31 Thread Bandan Das
This feature emulates the Acknowledge interrupt on exit behavior.
We can safely emulate it for L1 to run L2 even if L0 itself has it
disabled (to run L1).

Signed-off-by: Bandan Das b...@redhat.com
---
 arch/x86/kvm/irq.c |  1 +
 arch/x86/kvm/vmx.c | 19 +++
 2 files changed, 20 insertions(+)

diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
index 484bc87..bd0da43 100644
--- a/arch/x86/kvm/irq.c
+++ b/arch/x86/kvm/irq.c
@@ -113,6 +113,7 @@ int kvm_cpu_get_interrupt(struct kvm_vcpu *v)
 
return kvm_get_apic_interrupt(v);   /* APIC */
 }
+EXPORT_SYMBOL_GPL(kvm_cpu_get_interrupt);
 
 void kvm_inject_pending_timer_irqs(struct kvm_vcpu *vcpu)
 {
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 3e7f60c..bdc8f2d 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4489,6 +4489,18 @@ static bool nested_exit_on_intr(struct kvm_vcpu *vcpu)
PIN_BASED_EXT_INTR_MASK;
 }
 
+/*
+ * In nested virtualization, check if L1 has enabled
+ * interrupt acknowledgement that writes the interrupt vector
+ * info on vmexit
+ * 
+ */
+static bool nested_exit_intr_ack_set(struct kvm_vcpu *vcpu)
+{
+   return get_vmcs12(vcpu)->vm_exit_controls &
+   VM_EXIT_ACK_INTR_ON_EXIT;
+}
+
 static bool nested_exit_on_nmi(struct kvm_vcpu *vcpu)
 {
	return get_vmcs12(vcpu)->pin_based_vm_exec_control &
@@ -8442,6 +8454,13 @@ static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason,
prepare_vmcs12(vcpu, vmcs12, exit_reason, exit_intr_info,
   exit_qualification);
 
+   if ((exit_reason == EXIT_REASON_EXTERNAL_INTERRUPT) &&
+   nested_exit_intr_ack_set(vcpu)) {
+   int irq = kvm_cpu_get_interrupt(vcpu);
+   vmcs12->vm_exit_intr_info = irq |
+   INTR_INFO_VALID_MASK | INTR_TYPE_EXT_INTR;
+   }
+
	trace_kvm_nested_vmexit_inject(vmcs12->vm_exit_reason,
   vmcs12->exit_qualification,
   vmcs12->idt_vectoring_info_field,
-- 
1.8.3.1



[PATCH v2 3/3] KVM: nVMX: Advertise support for interrupt acknowledgement

2014-03-31 Thread Bandan Das
Some Type 1 hypervisors such as Xen won't enable VMX without it present.

Signed-off-by: Bandan Das b...@redhat.com
---
 arch/x86/kvm/vmx.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index e864b7a..a2a03c5 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2273,7 +2273,8 @@ static __init void nested_vmx_setup_ctls_msrs(void)
	nested_vmx_pinbased_ctls_high &=
	~PIN_BASED_VMX_PREEMPTION_TIMER;
}
nested_vmx_exit_ctls_high |= (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
-   VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER);
+   VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER |
+ VM_EXIT_ACK_INTR_ON_EXIT);
 
/* entry controls */
rdmsr(MSR_IA32_VMX_ENTRY_CTLS,
-- 
1.8.3.1



Re: mechanism to allow a driver to bind to any device

2014-03-31 Thread Kim Phillips
On Mon, 31 Mar 2014 20:23:36 +
Stuart Yoder stuart.yo...@freescale.com wrote:

  From: Greg KH [mailto:gre...@linuxfoundation.org]
  Sent: Monday, March 31, 2014 2:47 PM
  
  On Mon, Mar 31, 2014 at 06:47:51PM +, Stuart Yoder wrote:
   I also, was at the point where I thought we should perhaps just
   go with current mechanisms and implement new_id for the platform
   bus...but Greg's recent response is 'platform devices suck' and it
  sounds
   like he would reject a new_id patch for the platform bus.  So it kind
   of feels like we are stuck.
  
  ids mean nothing in the platform device model, so having a new_id file
  for them makes no sense.
 
 They don't have IDs like PCI, but platform drivers have to match on
 something.  Platform device match tables are based on compatible strings.
 
 Example from Freescale DMA driver:
    static const struct of_device_id fsldma_of_ids[] = {
  { .compatible = "fsl,elo3-dma", },
  { .compatible = "fsl,eloplus-dma", },
  { .compatible = "fsl,elo-dma", },
  {}
    };
 
 The process of unbinding, setting a new_id, and binding to vfio would work
 just like PCI:
 
echo ffe101300.dma > /sys/bus/platform/devices/ffe101300.dma/driver/unbind
echo fsl,eloplus-dma > /sys/bus/platform/drivers/vfio-platform/new_id

In platform device land, we don't want to pursue the
new_id/match-by-compatible methodology: we know exactly which specific
device (not device types) we want bound to which driver, so we just
want to be able to simply:

echo fff51000.ethernet | sudo tee -a /sys/bus/platform/devices/fff51000.ethernet/driver/unbind
echo fff51000.ethernet | sudo tee -a /sys/bus/platform/drivers/vfio-platform/bind

and not get involved with how PCI doesn't simply do that, independent
of autoprobe/hotplug.

Kim


Re: mechanism to allow a driver to bind to any device

2014-03-31 Thread Kim Phillips
On Fri, 28 Mar 2014 11:10:23 -0600
Alex Williamson alex.william...@redhat.com wrote:

 On Fri, 2014-03-28 at 12:58 -0400, Konrad Rzeszutek Wilk wrote:
  On Wed, Mar 26, 2014 at 04:09:21PM -0600, Alex Williamson wrote:
   On Wed, 2014-03-26 at 10:21 -0600, Alex Williamson wrote:
On Wed, 2014-03-26 at 23:06 +0800, Alexander Graf wrote:
 
  Am 26.03.2014 um 22:40 schrieb Konrad Rzeszutek Wilk 
  konrad.w...@oracle.com:
  
  On Wed, Mar 26, 2014 at 01:40:32AM +, Stuart Yoder wrote:
  Hi Greg,
  
   We (Linaro, Freescale, Virtual Open Systems) are trying to get an
   issue closed that has been percolating for a while around creating a
   mechanism that will allow kernel drivers like vfio to bind to devices
   of any type.
  
  This thread with you:
  http://www.spinics.net/lists/kvm-arm/msg08370.html
  ...seems to have died out, so am trying to get your response
  and will summarize again.  Vfio drivers in the kernel (regardless 
  of
  bus type) need to bind to devices of any type.  The driver's 
  function
  is to simply export hardware resources of any type to user space.
  
  There are several approaches that have been proposed:
  
  You seem to have missed the one I proposed.
  
1.  new_id -- (current approach) the user explicitly registers
each new device type with the vfio driver using the new_id
mechanism.
  
Problem: multiple drivers will be resident that handle the
same device type...and there is nothing user space hotplug
infrastructure can do to help.
  
2.  any id -- the vfio driver could specify a wildcard match
of some kind in its ID match table which would allow it to
match and bind to any possible device id.  However,
we don't want the vfio driver grabbing _all_ devices...just 
  the ones we
explicitly want to pass to user space.
  
The proposed patch to support this was to create a new flag
sysfs_bind_only in struct device_driver.  When this flag
is set, the driver can only bind to devices via the sysfs
bind file.  This would allow the wildcard match to work.
  
Patch is here:
https://lkml.org/lkml/2013/12/3/253
  
3.  Driver initiated explicit bind -- with this approach the
vfio driver would create a private 'bind' sysfs object
and the user would echo the requested device into it:
  
  echo 0001:03:00.0 > /sys/bus/pci/drivers/vfio-pci/vfio_bind
  
In order to make that work, the driver would need to call
driver_probe_device() and thus we need this patch:
https://lkml.org/lkml/2014/2/8/175
  
  4). Use the 'unbind' (from the original device) and 'bind' to vfio 
  driver.
 
 This is approach 2, no?
 
  
  Which I think is what is currently being done. Why is that not 
  sufficient?
 
 How would 'bind to vfio driver' look like?
 
  The only thing I see in the URL is "That works, but it is ugly."
  There is some mention of a race but I don't see how - if you do the
  'unbind' on the original driver and then bind the BDF to the VFIO
  how would you get a race?
 
 Typically on PCI, you do a
 
   - add wildcard (pci id) match to vfio driver
   - unbind driver
   - reprobe
   - device attaches to vfio driver because it is the least recent match
   - remove wildcard match from vfio driver
 
 If in between you hotplug add a card of the same type, it gets 
 attached to vfio - even though the logical default driver would be 
 the device specific driver.

I've mentioned drivers_autoprobe in the past, but I'm not sure we're
really factoring it into the discussion.  drivers_autoprobe allows us to
toggle two points:

a) When a new device is added whether we automatically give drivers a
try at binding to it

b) When a new driver is added whether it gets to try to bind to anything
in the system

So we do have a mechanism to avoid the race, but the problem is that it
becomes the responsibility of userspace to:

1) turn off drivers_autoprobe
2) unbind/new_id/bind/remove_id
3) turn on drivers_autoprobe
 4) call drivers_probe for anything added between 1) and 3)

Is the question about the ugliness of the current solution whether it's
unreasonable to ask userspace to do this?
What we seem to be asking for above is more like an autoprobe flag per
driver where there's some way for this special driver to opt out of auto
probing.  Option 2. in Stuart's list does this by short-cutting ID
matching so that a match is only found when using the sysfs bind path,
option 3. enables a way 

Re: [PATCH 4/4] KVM: PPC: Bookehv: Get vcpu's last instruction for emulation

2014-03-31 Thread Scott Wood
On Mon, 2014-03-31 at 15:41 +0200, Alexander Graf wrote:
 On 03/26/2014 10:17 PM, Scott Wood wrote:
  On Thu, 2014-02-20 at 18:30 +0200, Mihai Caraman wrote:
  +  /*
  +   * Another thread may rewrite the TLB entry in parallel, don't
  +   * execute from the address if the execute permission is not set
  +   */
 
 What happens when another thread rewrites the TLB entry in parallel? 
 Does tlbsx succeed? Does it fail? Do we see failure indicated somehow? 
 Are the contents of the MAS registers consistent at this point or 
 inconsistent?

If another thread rewrites the TLB entry, then we use the new TLB entry,
just as if it had raced in hardware.  This check ensures that we don't
execute from the new TLB entry if it doesn't have execute permissions
(just as hardware would).

What happens if the new TLB entry is valid and executable, and the
instruction pointed to is valid, but doesn't trap (and thus we don't
have emulation for it)?

 There has to be a good way to detect such a race and deal with it, no?

How would you detect it?  We don't get any information from the trap
about what physical address the instruction was fetched from, or what
the instruction was.

-Scott




Re: mechanism to allow a driver to bind to any device

2014-03-31 Thread Alex Williamson
On Mon, 2014-03-31 at 17:36 -0500, Kim Phillips wrote:
 On Fri, 28 Mar 2014 11:10:23 -0600
 Alex Williamson alex.william...@redhat.com wrote:
 
  On Fri, 2014-03-28 at 12:58 -0400, Konrad Rzeszutek Wilk wrote:
   On Wed, Mar 26, 2014 at 04:09:21PM -0600, Alex Williamson wrote:
On Wed, 2014-03-26 at 10:21 -0600, Alex Williamson wrote:
 On Wed, 2014-03-26 at 23:06 +0800, Alexander Graf wrote:
  
   Am 26.03.2014 um 22:40 schrieb Konrad Rzeszutek Wilk 
   konrad.w...@oracle.com:
   
   On Wed, Mar 26, 2014 at 01:40:32AM +, Stuart Yoder wrote:
   Hi Greg,
   
    We (Linaro, Freescale, Virtual Open Systems) are trying to get an
    issue closed that has been percolating for a while around creating a
    mechanism that will allow kernel drivers like vfio to bind to devices
    of any type.
   
   This thread with you:
   http://www.spinics.net/lists/kvm-arm/msg08370.html
   ...seems to have died out, so am trying to get your response
   and will summarize again.  Vfio drivers in the kernel 
   (regardless of
   bus type) need to bind to devices of any type.  The driver's 
   function
   is to simply export hardware resources of any type to user space.
   
   There are several approaches that have been proposed:
   
   You seem to have missed the one I proposed.
   
 1.  new_id -- (current approach) the user explicitly registers
 each new device type with the vfio driver using the new_id
 mechanism.
   
 Problem: multiple drivers will be resident that handle the
 same device type...and there is nothing user space hotplug
 infrastructure can do to help.
   
 2.  any id -- the vfio driver could specify a wildcard match
 of some kind in its ID match table which would allow it to
 match and bind to any possible device id.  However,
 we don't want the vfio driver grabbing _all_ 
   devices...just the ones we
 explicitly want to pass to user space.
   
 The proposed patch to support this was to create a new flag
 sysfs_bind_only in struct device_driver.  When this flag
 is set, the driver can only bind to devices via the sysfs
 bind file.  This would allow the wildcard match to work.
   
 Patch is here:
 https://lkml.org/lkml/2013/12/3/253
   
 3.  Driver initiated explicit bind -- with this approach the
 vfio driver would create a private 'bind' sysfs object
 and the user would echo the requested device into it:
   
  echo 0001:03:00.0 > /sys/bus/pci/drivers/vfio-pci/vfio_bind
   
 In order to make that work, the driver would need to call
 driver_probe_device() and thus we need this patch:
 https://lkml.org/lkml/2014/2/8/175
   
   4). Use the 'unbind' (from the original device) and 'bind' to 
   vfio driver.
  
  This is approach 2, no?
  
   
   Which I think is what is currently being done. Why is that not 
   sufficient?
  
  How would 'bind to vfio driver' look like?
  
    The only thing I see in the URL is "That works, but it is ugly."
    There is some mention of a race but I don't see how - if you do the
    'unbind' on the original driver and then bind the BDF to the VFIO
    how would you get a race?
  
  Typically on PCI, you do a
  
- add wildcard (pci id) match to vfio driver
- unbind driver
- reprobe
 - device attaches to vfio driver because it is the least recent match
- remove wildcard match from vfio driver
  
  If in between you hotplug add a card of the same type, it gets 
  attached to vfio - even though the logical default driver would 
  be the device specific driver.
 
 I've mentioned drivers_autoprobe in the past, but I'm not sure we're
 really factoring it into the discussion.  drivers_autoprobe allows us 
 to
 toggle two points:
 
 a) When a new device is added whether we automatically give drivers a
 try at binding to it
 
 b) When a new driver is added whether it gets to try to bind to 
 anything
 in the system
 
 So we do have a mechanism to avoid the race, but the problem is that 
 it
 becomes the responsibility of userspace to:
 
 1) turn off drivers_autoprobe
 2) unbind/new_id/bind/remove_id
 3) turn on drivers_autoprobe
  4) call drivers_probe for anything added between 1) and 3)
 
 Is the question about the ugliness of the current solution whether 
 it's
 unreasonable to ask userspace to do this?
 What we seem to be asking for above is more like an autoprobe flag per
 driver where there's some way for this special 

Re: [Qemu-devel] Massive read only kvm guests when backing file was missing

2014-03-31 Thread Alejandro Comisario
Thanks Stefan and thanks Michael also.

That situation regarding the IRC was very special, since I didn't
want to tell Michael "hey, everyone on the mailing list got it and
I'm here chatting with you and you didn't", so I assumed the IRC was
9 times more pro than the mailing list, so I decided to
keep my head down and assume the communication error was on my side.

Still, IMHO, I really believe that if you are a user willing to give
KVM a chance enough to make a query on IRC, you might feel you
are not geek enough to be there - and I don't mean be there on IRC, but
trying to use the community to support you while you try KVM.

In my case, while it was very important to understand what my
chances were regarding this issue, I knew I would find my answer no
matter what because I was determined to find it; I could get mad with
10.5K guests running on my back. Yes, my experience was more from the
virsh stop/virsh start side, but still I felt I needed you guys to try
to figure this out.

Again, thanks to everyone.

best.
Alejandro Comisario


On Fri, Mar 28, 2014 at 5:47 AM, Stefan Hajnoczi stefa...@gmail.com wrote:
 On Fri, Mar 28, 2014 at 11:01:00AM +0400, Michael Tokarev wrote:
 27.03.2014 20:14, Alejandro Comisario wrote:
  Seems like virtio (kvm 1.0) doesn't expose a timeout on the guest side
  (ubuntu 12.04 on host and guest).
  So, how can I adjust the timeout on the guest?

 After a bit more talks on IRC yesterday, it turned out that the situation
 is _much_ more interesting than originally described.  The OP claims to
 have 10500 guests running off an NFS server, and that after NFS server
 downtime, the backing files were "disappeared" (whatever it means), so
 they had to restore those files.  More, the OP didn't even bother to look
 at the guest's dmesg, being busy rebooting all 10500 guests.

  This solution is the most logical one, but i cannot apply it!
  thanks for all the responses!

 I suggested that the OP actually describe the _real_ situation, instead of
 giving random half-pictures, actually take a look at the actual problem
 as reported in various places (most importantly the guest kernel log), and
 report _those_ hints to the list.  I also mentioned that, at least for some
 NFS servers, if a client has a file open on the server, and this file is
 deleted, the server will report error to the client when client tries to
 access that file, and this has nothing at all to do with timeouts of any
 kind.

 Thanks for the update and for taking time to help on IRC.  I feel you're
 being harsh on Alejandro though.

 Improving the quality of bug reports is important, but it shouldn't come at
 the expense of quality of communication.  We can't assume that everyone
 is an expert in troubleshooting KVM or Linux, and therefore we can't blame
 them; doing so will only drive people away and detract from the community.

 TL;DR post logs and error messages +1, berate him -1

 Stefan
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH v3 2/4] KVM: Add SMAP support when setting CR4

2014-03-31 Thread Wu, Feng


 -Original Message-
 From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On
 Behalf Of Paolo Bonzini
 Sent: Monday, March 31, 2014 9:31 PM
 To: Wu, Feng; g...@redhat.com; h...@zytor.com; kvm@vger.kernel.org
 Subject: Re: [PATCH v3 2/4] KVM: Add SMAP support when setting CR4
 
 Just a few comments...
 
  -static void update_permission_bitmask(struct kvm_vcpu *vcpu,
  +void update_permission_bitmask(struct kvm_vcpu *vcpu,
  struct kvm_mmu *mmu, bool ept)
   {
  unsigned bit, byte, pfec;
  u8 map;
  -   bool fault, x, w, u, wf, uf, ff, smep;
  +   bool fault, x, w, u, wf, uf, ff, smapf, cr4_smap, smep, smap = 0;
 
  smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP);
 
 Can you make an additional patch to rename this to cr4_smep?

Sure! I noticed your comments about this issue in the previous email; I was 
preparing a patch for it and will send it out today!
 
  +   cr4_smap = kvm_read_cr4_bits(vcpu, X86_CR4_SMAP);
  for (byte = 0; byte < ARRAY_SIZE(mmu->permissions); ++byte) {
  pfec = byte << 1;
  map = 0;
  wf = pfec & PFERR_WRITE_MASK;
  uf = pfec & PFERR_USER_MASK;
  ff = pfec & PFERR_FETCH_MASK;
  +   /*
  +* PFERR_RSVD_MASK bit is used to detect SMAP violation.
  +* We will check it in permission_fault(), this bit is
  +* set in pfec for normal fault, while it is cleared for
  +* SMAP violations.
  +*/
 
 This bit is set in PFEC if the access is _not_ subject to SMAP
 restrictions, and cleared otherwise.  The bit is only meaningful if
 the SMAP bit is set in CR4.
 
  +   smapf = !(pfec & PFERR_RSVD_MASK);
  for (bit = 0; bit < 8; ++bit) {
  x = bit & ACC_EXEC_MASK;
  w = bit & ACC_WRITE_MASK;
  @@ -3627,11 +3635,32 @@ static void update_permission_bitmask(struct
 kvm_vcpu *vcpu,
  w |= !is_write_protection(vcpu) && !uf;
  /* Disallow supervisor fetches of user code if 
  cr4.smep */
  x &= !(smep && u && !uf);
  +
  +   /*
  +* SMAP: kernel-mode data accesses from user-mode
  +* mappings should fault. A fault is considered
  +* as a SMAP violation if all of the following
  +* conditions are true:
  +*   - X86_CR4_SMAP is set in CR4
  +*   - A user page is accessed
  +*   - Page fault in kernel mode
  +*   - !(CPL < 3 && X86_EFLAGS_AC is set)
 
 - if CPL < 3, EFLAGS.AC is clear

Should it be "if CPL = 3 or EFLAGS.AC is clear"?

 
  +*   Here, we cover the first three conditions,
  +*   The CPL and X86_EFLAGS_AC is in smapf, which
  +*   permission_fault() computes dynamically.
 
 The fourth is computed dynamically in permission_fault() and is in SMAPF.
 
  +*   Also, SMAP does not affect instruction
  +*   fetches, add the !ff check here to make it
  +*   clearer.
  +*/
  +   smap = cr4_smap && u && !uf && !ff;
  } else
  /* Not really needed: no U/S accesses on ept  */
  u = 1;
 
  -   fault = (ff && !x) || (uf && !u) || (wf && !w);
  +   fault = (ff && !x) || (uf && !u) || (wf && !w) ||
  +   (smapf && smap);
  map |= fault << bit;
  }
  mmu-permissions[byte] = map;
  diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
  index 2926152..822190f 100644
  --- a/arch/x86/kvm/mmu.h
  +++ b/arch/x86/kvm/mmu.h
  @@ -44,11 +44,17 @@
   #define PT_DIRECTORY_LEVEL 2
   #define PT_PAGE_TABLE_LEVEL 1
 
  -#define PFERR_PRESENT_MASK (1U << 0)
  -#define PFERR_WRITE_MASK (1U << 1)
  -#define PFERR_USER_MASK (1U << 2)
  -#define PFERR_RSVD_MASK (1U << 3)
  -#define PFERR_FETCH_MASK (1U << 4)
  +#define PFERR_PRESENT_BIT 0
  +#define PFERR_WRITE_BIT 1
  +#define PFERR_USER_BIT 2
  +#define PFERR_RSVD_BIT 3
  +#define PFERR_FETCH_BIT 4
  +
  +#define PFERR_PRESENT_MASK (1U << PFERR_PRESENT_BIT)
  +#define PFERR_WRITE_MASK (1U << PFERR_WRITE_BIT)
  +#define PFERR_USER_MASK (1U << PFERR_USER_BIT)
  +#define PFERR_RSVD_MASK (1U << PFERR_RSVD_BIT)
  +#define PFERR_FETCH_MASK (1U << PFERR_FETCH_BIT)
 
   int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu *vcpu, u64 addr, u64
 sptes[4]);
   void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask);
  @@ -73,6 +79,8 @@ int handle_mmio_page_fault_common(struct
 kvm_vcpu *vcpu, u64 addr, bool direct);
   void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu
 *context);
   void 

[PATCH v4 0/4] KVM: enable Intel SMAP for KVM

2014-03-31 Thread Feng Wu
Supervisor Mode Access Prevention (SMAP) is a new security feature
disclosed by Intel; please refer to the following document:

http://software.intel.com/sites/default/files/319433-014.pdf
 
Every access to a linear address is either a supervisor-mode access
or a user-mode access. All accesses performed while the current
privilege level (CPL) is less than 3 are supervisor-mode accesses.
If CPL = 3, accesses are generally user-mode accesses. However, some
operations implicitly access system data structures, and the resulting
accesses to those data structures are supervisor-mode accesses regardless
of CPL. Examples of such implicit supervisor accesses include the following:
accesses to the global descriptor table (GDT) or local descriptor table
(LDT) to load a segment descriptor; accesses to the interrupt descriptor
table (IDT) when delivering an interrupt or exception; and accesses to the
task-state segment (TSS) as part of a task switch or change of CPL.

If CR4.SMAP = 1, supervisor-mode data accesses are not allowed to linear
addresses that are accessible in user mode. If CPL < 3, SMAP protections
are disabled if EFLAGS.AC = 1. If CPL = 3, SMAP applies to all supervisor-mode
data accesses (these are implicit supervisor accesses) regardless of the
value of EFLAGS.AC.
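
The rules above can be condensed into a single predicate. The following is a minimal illustrative sketch, not the KVM implementation (in KVM this condition is folded into the precomputed permission bitmap rather than evaluated with branches):

```c
#include <stdbool.h>

/*
 * Illustrative only: mirrors the SMAP rules described above.
 * A supervisor-mode data access to a user-accessible page faults
 * unless CPL < 3 and EFLAGS.AC is set; instruction fetches are
 * never affected by SMAP.
 */
static bool smap_violation(bool cr4_smap, unsigned cpl, bool eflags_ac,
                           bool user_page, bool user_mode_access,
                           bool fetch)
{
    if (!cr4_smap || !user_page || fetch)
        return false;          /* SMAP off, supervisor page, or ifetch */
    if (user_mode_access)
        return false;          /* only supervisor accesses can violate SMAP */
    /*
     * CPL < 3 with EFLAGS.AC set disables the protection; implicit
     * supervisor accesses at CPL 3 are always subject to it.
     */
    return !(cpl < 3 && eflags_ac);
}
```

Note the CPL 3 case: since `cpl < 3` is false there, the predicate returns true regardless of EFLAGS.AC, matching the implicit-supervisor-access rule above.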

This patchset passes the SMAP feature through to guests, letting them
benefit from it.

Version 1:
  * Remove SMAP bit from CR4_RESERVED_BITS.
  * Add SMAP support when setting CR4
  * Disable SMAP for guests in EPT realmode and EPT unpaging mode
  * Expose SMAP feature to guest

Version 2:
  * Change the logic of updating mmu permission bitmap for SMAP violation
  * Expose SMAP feature to guest in the last patch of this series.

Version 3:
  * Changes in update_permission_bitmask().
  * Use a branchless way suggested by Paolo Bonzini to detect SMAP
violation in permission_fault(). 

Version 4:
  * Changes to some comments and code style.
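
The branchless detection mentioned for v3 rests on precomputing, for every page-fault error code, whether each access-rights combination faults; the runtime check in permission_fault() then becomes a table lookup plus a shift and mask instead of a chain of branches. A rough sketch of the idea follows — the data layout, names and the toy precomputation rules are assumptions for illustration, not the actual KVM code:

```c
#include <stdint.h>

/* Assumed PFEC bits, mirroring the defines added by this series. */
#define PFERR_WRITE_MASK (1U << 1)
#define PFERR_USER_MASK  (1U << 2)

/*
 * permissions[pfec >> 1]: bit 'access' is set iff that pte-access
 * combination faults for this error code.  Filled once, e.g. when
 * CR0/CR4 change, like update_permission_bitmask() does.
 */
static uint8_t permissions[16];

static int permission_fault(unsigned pfec, unsigned pte_access)
{
    /* Branchless runtime check: one load, one shift, one mask. */
    return (permissions[pfec >> 1] >> pte_access) & 1;
}

static void update_bitmask(void)
{
    /*
     * Toy precomputation covering two rules only: a user-mode access
     * to a supervisor page faults, and a write to a read-only page
     * faults.  pte_access encoding (assumed): bit 0 = writable,
     * bit 1 = user-accessible.
     */
    for (unsigned byte = 0; byte < 16; ++byte) {
        unsigned pfec = byte << 1;
        uint8_t map = 0;
        for (unsigned access = 0; access < 4; ++access) {
            int w = access & 1, u = access & 2;
            int fault = ((pfec & PFERR_USER_MASK) && !u) ||
                        ((pfec & PFERR_WRITE_MASK) && !w);
            map |= (uint8_t)(fault << access);
        }
        permissions[byte] = map;
    }
}
```

The real series extends exactly this precomputation step with the SMAP condition (smapf && smap), which is why no extra branch appears on the fault path.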

Feng Wu (4):
  KVM: Remove SMAP bit from CR4_RESERVED_BITS.
  KVM: Add SMAP support when setting CR4
  KVM: Disable SMAP for guests in EPT realmode and EPT unpaging mode
  KVM: expose SMAP feature to guest

 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/cpuid.c|  2 +-
 arch/x86/kvm/cpuid.h|  8 
 arch/x86/kvm/mmu.c  | 34 ---
 arch/x86/kvm/mmu.h  | 44 +
 arch/x86/kvm/paging_tmpl.h  |  2 +-
 arch/x86/kvm/vmx.c  | 11 ++-
 arch/x86/kvm/x86.c  |  9 -
 8 files changed, 92 insertions(+), 20 deletions(-)

-- 
1.8.3.1



[PATCH v4 4/4] KVM: expose SMAP feature to guest

2014-03-31 Thread Feng Wu
This patch exposes SMAP feature to guest

Signed-off-by: Feng Wu feng...@intel.com
---
 arch/x86/kvm/cpuid.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index c697625..deb5f9b 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -303,7 +303,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 
*entry, u32 function,
/* cpuid 7.0.ebx */
const u32 kvm_supported_word9_x86_features =
F(FSGSBASE) | F(BMI1) | F(HLE) | F(AVX2) | F(SMEP) |
-   F(BMI2) | F(ERMS) | f_invpcid | F(RTM);
+   F(BMI2) | F(ERMS) | f_invpcid | F(RTM) | F(SMAP);
 
/* all calls to cpuid_count() should be made on the same cpu */
get_cpu();
-- 
1.8.3.1



[PATCH v4 3/4] KVM: Disable SMAP for guests in EPT realmode and EPT unpaging mode

2014-03-31 Thread Feng Wu
SMAP is disabled if CPU is in non-paging mode in hardware.
However KVM always uses paging mode to emulate guest non-paging
mode with TDP. To emulate this behavior, SMAP needs to be
manually disabled when guest switches to non-paging mode.

Signed-off-by: Feng Wu feng...@intel.com
---
 arch/x86/kvm/vmx.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 3927528..e58cb5f 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3452,13 +3452,14 @@ static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned 
long cr4)
hw_cr4 &= ~X86_CR4_PAE;
hw_cr4 |= X86_CR4_PSE;
/*
-* SMEP is disabled if CPU is in non-paging mode in
-* hardware. However KVM always uses paging mode to
+* SMEP/SMAP is disabled if CPU is in non-paging mode
+* in hardware. However KVM always uses paging mode to
 * emulate guest non-paging mode with TDP.
-* To emulate this behavior, SMEP needs to be manually
-* disabled when guest switches to non-paging mode.
+* To emulate this behavior, SMEP/SMAP needs to be
+* manually disabled when guest switches to non-paging
+* mode.
 */
-   hw_cr4 &= ~X86_CR4_SMEP;
+   hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP);
} else if (!(cr4 & X86_CR4_PAE)) {
hw_cr4 &= ~X86_CR4_PAE;
}
-- 
1.8.3.1



[PATCH v4 1/4] KVM: Remove SMAP bit from CR4_RESERVED_BITS.

2014-03-31 Thread Feng Wu
This patch removes SMAP bit from CR4_RESERVED_BITS.

Signed-off-by: Feng Wu feng...@intel.com
---
 arch/x86/include/asm/kvm_host.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index fdf83af..4eeb049 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -60,7 +60,7 @@
  | X86_CR4_PSE | X86_CR4_PAE | X86_CR4_MCE \
  | X86_CR4_PGE | X86_CR4_PCE | X86_CR4_OSFXSR | 
X86_CR4_PCIDE \
  | X86_CR4_OSXSAVE | X86_CR4_SMEP | X86_CR4_FSGSBASE \
- | X86_CR4_OSXMMEXCPT | X86_CR4_VMXE))
+ | X86_CR4_OSXMMEXCPT | X86_CR4_VMXE | X86_CR4_SMAP))
 
 #define CR8_RESERVED_BITS (~(unsigned long)X86_CR8_TPR)
 
-- 
1.8.3.1



[PATCH v4 2/4] KVM: Add SMAP support when setting CR4

2014-03-31 Thread Feng Wu
This patch adds SMAP handling logic when setting CR4 for guests

Thanks a lot to Paolo Bonzini for his suggestion to use the branchless
way to detect SMAP violation.

Signed-off-by: Feng Wu feng...@intel.com
---
 arch/x86/kvm/cpuid.h   |  8 
 arch/x86/kvm/mmu.c | 34 +++---
 arch/x86/kvm/mmu.h | 44 
 arch/x86/kvm/paging_tmpl.h |  2 +-
 arch/x86/kvm/x86.c |  9 -
 5 files changed, 84 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index a2a1bb7..eeecbed 100644
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -48,6 +48,14 @@ static inline bool guest_cpuid_has_smep(struct kvm_vcpu 
*vcpu)
return best && (best->ebx & bit(X86_FEATURE_SMEP));
 }
 
+static inline bool guest_cpuid_has_smap(struct kvm_vcpu *vcpu)
+{
+   struct kvm_cpuid_entry2 *best;
+
+   best = kvm_find_cpuid_entry(vcpu, 7, 0);
+   return best && (best->ebx & bit(X86_FEATURE_SMAP));
+}
+
 static inline bool guest_cpuid_has_fsgsbase(struct kvm_vcpu *vcpu)
 {
struct kvm_cpuid_entry2 *best;
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9b53135..a183783 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3601,20 +3601,27 @@ static void reset_rsvds_bits_mask_ept(struct kvm_vcpu 
*vcpu,
}
 }
 
-static void update_permission_bitmask(struct kvm_vcpu *vcpu,
+void update_permission_bitmask(struct kvm_vcpu *vcpu,
struct kvm_mmu *mmu, bool ept)
 {
unsigned bit, byte, pfec;
u8 map;
-   bool fault, x, w, u, wf, uf, ff, smep;
+   bool fault, x, w, u, wf, uf, ff, smapf, cr4_smap, smep, smap = 0;
 
smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP);
+   cr4_smap = kvm_read_cr4_bits(vcpu, X86_CR4_SMAP);
for (byte = 0; byte < ARRAY_SIZE(mmu->permissions); ++byte) {
pfec = byte << 1;
map = 0;
wf = pfec & PFERR_WRITE_MASK;
uf = pfec & PFERR_USER_MASK;
ff = pfec & PFERR_FETCH_MASK;
+   /*
+* PFERR_RSVD_MASK bit is set in PFEC if the access is not
+* subject to SMAP restrictions, and cleared otherwise. The
+* bit is only meaningful if the SMAP bit is set in CR4.
+*/
+   smapf = !(pfec & PFERR_RSVD_MASK);
for (bit = 0; bit < 8; ++bit) {
x = bit & ACC_EXEC_MASK;
w = bit & ACC_WRITE_MASK;
@@ -3627,11 +3634,32 @@ static void update_permission_bitmask(struct kvm_vcpu 
*vcpu,
w |= !is_write_protection(vcpu) && !uf;
/* Disallow supervisor fetches of user code if 
cr4.smep */
x &= !(smep && u && !uf);
+
+   /*
+* SMAP: kernel-mode data accesses from user-mode
+* mappings should fault. A fault is considered
+* as a SMAP violation if all of the following
+* conditions are true:
+*   - X86_CR4_SMAP is set in CR4
+*   - A user page is accessed
+*   - Page fault in kernel mode
+*   - if CPL = 3 or X86_EFLAGS_AC is clear
+*
+*   Here, we cover the first three conditions.
+*   The fourth is computed dynamically in
+*   permission_fault() and is in smapf.
+*
+*   Also, SMAP does not affect instruction
+*   fetches, add the !ff check here to make it
+*   clearer.
+*/
+   smap = cr4_smap && u && !uf && !ff;
} else
/* Not really needed: no U/S accesses on ept  */
u = 1;
 
-   fault = (ff && !x) || (uf && !u) || (wf && !w);
+   fault = (ff && !x) || (uf && !u) || (wf && !w) ||
+   (smapf && smap);
map |= fault << bit;
}
mmu-permissions[byte] = map;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 2926152..3842e70 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -44,11 +44,17 @@
 #define PT_DIRECTORY_LEVEL 2
 #define PT_PAGE_TABLE_LEVEL 1
 
-#define PFERR_PRESENT_MASK (1U << 0)
-#define PFERR_WRITE_MASK (1U << 1)
-#define PFERR_USER_MASK (1U << 2)
-#define PFERR_RSVD_MASK (1U << 3)
-#define PFERR_FETCH_MASK (1U << 4)
+#define PFERR_PRESENT_BIT 0
+#define PFERR_WRITE_BIT 1
+#define PFERR_USER_BIT 2
+#define PFERR_RSVD_BIT 3
+#define 

[PATCH] Rename variable smep to cr4_smep

2014-03-31 Thread Feng Wu
This patch is based on the smap patchset

Feng Wu (1):
  Rename variable smep to cr4_smep

 arch/x86/kvm/mmu.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

-- 
1.8.3.1



[PATCH] Rename variable smep to cr4_smep

2014-03-31 Thread Feng Wu
Rename variable smep to cr4_smep, which better reflects the
meaning of the variable.

Signed-off-by: Feng Wu feng...@intel.com
---
 arch/x86/kvm/mmu.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index a183783..6000557 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3606,9 +3606,9 @@ void update_permission_bitmask(struct kvm_vcpu *vcpu,
 {
unsigned bit, byte, pfec;
u8 map;
-   bool fault, x, w, u, wf, uf, ff, smapf, cr4_smap, smep, smap = 0;
+   bool fault, x, w, u, wf, uf, ff, smapf, cr4_smap, cr4_smep, smap = 0;
 
-   smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP);
+   cr4_smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP);
cr4_smap = kvm_read_cr4_bits(vcpu, X86_CR4_SMAP);
for (byte = 0; byte  ARRAY_SIZE(mmu-permissions); ++byte) {
pfec = byte  1;
@@ -3633,7 +3633,7 @@ void update_permission_bitmask(struct kvm_vcpu *vcpu,
/* Allow supervisor writes if !cr0.wp */
w |= !is_write_protection(vcpu) && !uf;
/* Disallow supervisor fetches of user code if 
cr4.smep */
-   x &= !(smep && u && !uf);
+   x &= !(cr4_smep && u && !uf);
 
/*
 * SMAP:kernel-mode data accesses from user-mode
-- 
1.8.3.1



Re: VDSO pvclock may increase host cpu consumption, is this a problem?

2014-03-31 Thread Marcelo Tosatti
On Mon, Mar 31, 2014 at 10:52:25AM -0700, Andy Lutomirski wrote:
 On 03/29/2014 01:47 AM, Zhanghailiang wrote:
  Hi,
  I found that when the guest is idle, VDSO pvclock may increase host consumption.
  We can calculate as follows; correct me if I am wrong.
(Host)250 * update_pvclock_gtod = 1500 * gettimeofday(Guest)
  In the host, VDSO pvclock introduces a notifier chain, pvclock_gtod_chain in 
  timekeeping.c. It consumes nearly 900 cycles per call. So in consideration 
  of 250 Hz, it may consume 225,000 cycles per second, even if no VM is created.
  In the guest, gettimeofday consumes 220 cycles per call with VDSO pvclock. If 
  no-kvmclock-vsyscall is configured, gettimeofday consumes 370 cycles 
  per call. The feature decreases consumption by 150 cycles per call. 
  When gettimeofday is called 1500 times, it decreases 225,000 cycles, equal to the 
  host consumption.
  Both host and guest are linux-3.13.6.
  So, is the host cpu consumption a problem?
 
 Does pvclock serve any real purpose on systems with fully-functional
 TSCs?  The x86 guest implementation is awful, so it's about 2x slower
 than TSC.  It could be improved a lot, but I'm not sure I understand why
 it exists in the first place.

VM migration.

Can you explain why you consider it so bad? How do you think it could be
improved?

 I certainly understand the goal of keeping the guest CLOCK_REALTIME is
 sync with the host, but pvclock seems like overkill for that.

VM migration.



Re: VDSO pvclock may increase host cpu consumption, is this a problem?

2014-03-31 Thread Andy Lutomirski
On Mar 31, 2014 8:45 PM, Marcelo Tosatti mtosa...@redhat.com wrote:

 On Mon, Mar 31, 2014 at 10:52:25AM -0700, Andy Lutomirski wrote:
  On 03/29/2014 01:47 AM, Zhanghailiang wrote:
   Hi,
   I found that when the guest is idle, VDSO pvclock may increase host consumption.
   We can calculate as follows; correct me if I am wrong.
 (Host)250 * update_pvclock_gtod = 1500 * gettimeofday(Guest)
   In the host, VDSO pvclock introduces a notifier chain, pvclock_gtod_chain in 
   timekeeping.c. It consumes nearly 900 cycles per call. So in consideration 
   of 250 Hz, it may consume 225,000 cycles per second, even if no VM is 
   created.
   In the guest, gettimeofday consumes 220 cycles per call with VDSO pvclock. If 
   no-kvmclock-vsyscall is configured, gettimeofday consumes 370 cycles 
   per call. The feature decreases consumption by 150 cycles per call.
   When gettimeofday is called 1500 times, it decreases 225,000 cycles, equal to the 
   host consumption.
   Both host and guest are linux-3.13.6.
   So, is the host cpu consumption a problem?
 
  Does pvclock serve any real purpose on systems with fully-functional
  TSCs?  The x86 guest implementation is awful, so it's about 2x slower
  than TSC.  It could be improved a lot, but I'm not sure I understand why
  it exists in the first place.

 VM migration.

Why does that need percpu stuff?  Wouldn't it be sufficient to
interrupt all CPUs (or at least all cpus running in userspace) on
migration and update the normal timing data structures?

Even better: have the VM offer to invalidate the physical page
containing the kernel's clock data on migration and interrupt one CPU.
 If another CPU races, it'll fault and wait for the guest kernel to
update its timing.

Does the current kvmclock stuff track CLOCK_MONOTONIC and
CLOCK_REALTIME separately?


 Can you explain why you consider it so bad ? How you think it could be
 improved ?

The second rdtsc_barrier looks unnecessary.  Even better, if rdtscp is
available, then rdtscp can replace rdtsc_barrier, rdtsc, and the
getcpu call.

It would also be nice to avoid having two sets of rescalings of the timing data.



  I certainly understand the goal of keeping the guest CLOCK_REALTIME is
  sync with the host, but pvclock seems like overkill for that.

 VM migration.



Re: [PATCH 4/4] KVM: PPC: Bookehv: Get vcpu's last instruction for emulation

2014-03-31 Thread Alexander Graf


 Am 01.04.2014 um 01:03 schrieb Scott Wood scottw...@freescale.com:
 
 On Mon, 2014-03-31 at 15:41 +0200, Alexander Graf wrote:
 On 03/26/2014 10:17 PM, Scott Wood wrote:
 On Thu, 2014-02-20 at 18:30 +0200, Mihai Caraman wrote:
 +/*
 + * Another thread may rewrite the TLB entry in parallel, don't
 + * execute from the address if the execute permission is not set
 + */
 
 What happens when another thread rewrites the TLB entry in parallel? 
 Does tlbsx succeed? Does it fail? Do we see failure indicated somehow? 
 Are the contents of the MAS registers consistent at this point or 
 inconsistent?
 
 If another thread rewrites the TLB entry, then we use the new TLB entry,
 just as if it had raced in hardware.  This check ensures that we don't
 execute from the new TLB entry if it doesn't have execute permissions
 (just as hardware would).
 
 What happens if the new TLB entry is valid and executable, and the
 instruction pointed to is valid, but doesn't trap (and thus we don't
 have emulation for it)?
 
 There has to be a good way to detect such a race and deal with it, no?
 
 How would you detect it?  We don't get any information from the trap
 about what physical address the instruction was fetched from, or what
 the instruction was.

Ah, this is a guest race where the guest modifies exactly the TLB in question. 
I see.

Why would this ever happen in practice (in a case where we're not executing 
malicious code)?


Alex


 
 -Scott
 
 


Re: [PATCH 3/4] KVM: PPC: Alow kvmppc_get_last_inst() to fail

2014-03-31 Thread Alexander Graf

On 03/26/2014 09:52 PM, Scott Wood wrote:

On Thu, 2014-02-20 at 18:30 +0200, Mihai Caraman wrote:

diff --git a/arch/powerpc/kvm/book3s_paired_singles.c 
b/arch/powerpc/kvm/book3s_paired_singles.c
index a59a25a..80c533e 100644
--- a/arch/powerpc/kvm/book3s_paired_singles.c
+++ b/arch/powerpc/kvm/book3s_paired_singles.c
@@ -640,19 +640,24 @@ static int kvmppc_ps_one_in(struct kvm_vcpu *vcpu, bool 
rc,
  
  int kvmppc_emulate_paired_single(struct kvm_run *run, struct kvm_vcpu *vcpu)

  {
-   u32 inst = kvmppc_get_last_inst(vcpu);
+   u32 inst;
enum emulation_result emulated = EMULATE_DONE;
-
-   int ax_rd = inst_get_field(inst, 6, 10);
-   int ax_ra = inst_get_field(inst, 11, 15);
-   int ax_rb = inst_get_field(inst, 16, 20);
-   int ax_rc = inst_get_field(inst, 21, 25);
-   short full_d = inst_get_field(inst, 16, 31);
-
-   u64 *fpr_d = &vcpu->arch.fpr[ax_rd];
-   u64 *fpr_a = &vcpu->arch.fpr[ax_ra];
-   u64 *fpr_b = &vcpu->arch.fpr[ax_rb];
-   u64 *fpr_c = &vcpu->arch.fpr[ax_rc];
+   int ax_rd, ax_ra, ax_rb, ax_rc;
+   short full_d;
+   u64 *fpr_d, *fpr_a, *fpr_b, *fpr_c;
+
+   kvmppc_get_last_inst(vcpu, &inst);

Should probably check for failure here and elsewhere -- even though it
can't currently fail on book3s, the interface now allows it.


diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
index 5b9e906..b0d884d 100644
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -624,9 +624,10 @@ void kvmppc_giveup_ext(struct kvm_vcpu *vcpu, ulong msr)
  static int kvmppc_read_inst(struct kvm_vcpu *vcpu)
  {
ulong srr0 = kvmppc_get_pc(vcpu);
-   u32 last_inst = kvmppc_get_last_inst(vcpu);
+   u32 last_inst;
int ret;
  
+	kvmppc_get_last_inst(vcpu, &last_inst);

ret = kvmppc_ld(vcpu, &srr0, sizeof(u32), &last_inst, false);

This isn't new, but this function looks odd to me -- calling
kvmppc_get_last_inst() but ignoring last_inst, then calling kvmppc_ld()
and ignoring anything but failure.  last_inst itself is never read.  And
no comments to explain the weirdness. :-)

I get that kvmppc_get_last_inst() is probably being used for the side
effect of filling in vcpu->arch.last_inst, but why store the return
value without using it?  Why pass the address of it to kvmppc_ld(),
which seems to be used only as an indirect way of determining whether
kvmppc_get_last_inst() failed?  And that whole mechanism becomes
stranger once it becomes possible for kvmppc_get_last_inst() to directly
return failure.


If you're interested in the history of this, here's the patch :)

https://github.com/mirrors/linux-2.6/commit/c7f38f46f2a98d232147e47284cb4e7363296a3e

The idea is that we need 2 things to be good after this function:

  1) vcpu->arch.last_inst is valid
  2) if the last instruction is not readable, return failure

Hence this weird construct. I don't think it's really necessary though - 
just remove the kvmppc_ld() call and only fail read_inst() when the 
caching didn't work and we can't translate the address.
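
That suggestion can be sketched as follows — fail only when the cached instruction is invalid and a fresh translation fails, with no extra kvmppc_ld() round-trip. All names, the struct shape and the sentinel value here are illustrative stand-ins, not the real kvmppc interface:

```c
/* Assumed sentinel, in the spirit of KVM_INST_FETCH_FAILED. */
#define INST_FETCH_FAILED 0xFFFFFFFFu

struct toy_vcpu {
    unsigned int last_inst;   /* cached copy of the faulting instruction */
};

/*
 * Returns 0 and guarantees v->last_inst is valid, or -1 if the guest
 * instruction cannot be read (e.g. the address no longer translates).
 * 'translatable' and 'mem_inst' stand in for a real guest-memory read.
 */
static int toy_read_inst(struct toy_vcpu *v, int translatable,
                         unsigned int mem_inst)
{
    if (v->last_inst != INST_FETCH_FAILED)
        return 0;             /* cache already holds the instruction */
    if (!translatable)
        return -1;            /* can't refill the cache: report failure */
    v->last_inst = mem_inst;  /* refill from guest memory */
    return 0;
}
```

The point is that the failure path and the cache-fill path are the same decision, so there is no need to fetch the instruction twice just to learn whether fetching can succeed.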



Alex

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4] KVM: PPC: Bookehv: Get vcpu's last instruction for emulation

2014-03-31 Thread Alexander Graf

On 03/26/2014 10:17 PM, Scott Wood wrote:

On Thu, 2014-02-20 at 18:30 +0200, Mihai Caraman wrote:

Load external pid (lwepx) instruction faults (when called from
KVM with guest context) need to be handled by KVM. This implies
additional code in the DO_KVM macro to identify the source of the
exception (which originates from the KVM host rather than the guest).
The hook requires checking the Exception Syndrome Register
ESR[EPID] and External PID Load Context Register EPLC[EGS] for
some exceptions (DTLB_MISS, DSI and LRAT). Doing this on the Data TLB
miss exception is obviously intrusive for the host.

Get rid of lwepx and acquire the last instruction in kvmppc_get_last_inst()
by searching for the physical address and kmapping it. This fixes an
infinite loop caused by lwepx's data TLB miss handled in the host
and the TODO for TLB eviction and execute-but-not-read entries.

Signed-off-by: Mihai Caraman mihai.cara...@freescale.com
---
  arch/powerpc/kvm/bookehv_interrupts.S |   37 +++--
  arch/powerpc/kvm/e500_mmu_host.c  |   93 +
  2 files changed, 102 insertions(+), 28 deletions(-)

diff --git a/arch/powerpc/kvm/bookehv_interrupts.S 
b/arch/powerpc/kvm/bookehv_interrupts.S
index 20c7a54..c50490c 100644
--- a/arch/powerpc/kvm/bookehv_interrupts.S
+++ b/arch/powerpc/kvm/bookehv_interrupts.S
@@ -119,38 +119,14 @@
  1:
  
  	.if	\flags  NEED_EMU

-   /*
-* This assumes you have external PID support.
-* To support a bookehv CPU without external PID, you'll
-* need to look up the TLB entry and create a temporary mapping.
-*
-* FIXME: we don't currently handle if the lwepx faults.  PR-mode
-* booke doesn't handle it either.  Since Linux doesn't use
-* broadcast tlbivax anymore, the only way this should happen is
-* if the guest maps its memory execute-but-not-read, or if we
-* somehow take a TLB miss in the middle of this entry code and
-* evict the relevant entry.  On e500mc, all kernel lowmem is
-* bolted into TLB1 large page mappings, and we don't use
-* broadcast invalidates, so we should not take a TLB miss here.
-*
-* Later we'll need to deal with faults here.  Disallowing guest
-* mappings that are execute-but-not-read could be an option on
-* e500mc, but not on chips with an LRAT if it is used.
-*/
-
-   mfspr   r3, SPRN_EPLC   /* will already have correct ELPID and EGS */
PPC_STL r15, VCPU_GPR(R15)(r4)
PPC_STL r16, VCPU_GPR(R16)(r4)
PPC_STL r17, VCPU_GPR(R17)(r4)
PPC_STL r18, VCPU_GPR(R18)(r4)
PPC_STL r19, VCPU_GPR(R19)(r4)
-   mr  r8, r3
PPC_STL r20, VCPU_GPR(R20)(r4)
-   rlwimi  r8, r6, EPC_EAS_SHIFT - MSR_IR_LG, EPC_EAS
PPC_STL r21, VCPU_GPR(R21)(r4)
-   rlwimi  r8, r6, EPC_EPR_SHIFT - MSR_PR_LG, EPC_EPR
PPC_STL r22, VCPU_GPR(R22)(r4)
-   rlwimi  r8, r10, EPC_EPID_SHIFT, EPC_EPID
PPC_STL r23, VCPU_GPR(R23)(r4)
PPC_STL r24, VCPU_GPR(R24)(r4)
PPC_STL r25, VCPU_GPR(R25)(r4)
@@ -160,10 +136,15 @@
PPC_STL r29, VCPU_GPR(R29)(r4)
PPC_STL r30, VCPU_GPR(R30)(r4)
PPC_STL r31, VCPU_GPR(R31)(r4)
-   mtspr   SPRN_EPLC, r8
-   isync
-   lwepx   r9, 0, r5
-   mtspr   SPRN_EPLC, r3
+
+   /*
+* We don't use external PID support. lwepx faults would need to be
+* handled by KVM and this implies additional code in DO_KVM (for
+* DTLB_MISS, DSI and LRAT) to check ESR[EPID] and EPLC[EGS], which
+* is too intrusive for the host. Get the last instruction in
+* kvmppc_get_last_inst().
+*/
+   li  r9, KVM_INST_FETCH_FAILED
stw r9, VCPU_LAST_INST(r4)
.endif
  
diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c

index 6025cb7..1b4cb41 100644
--- a/arch/powerpc/kvm/e500_mmu_host.c
+++ b/arch/powerpc/kvm/e500_mmu_host.c
@@ -598,9 +598,102 @@ void kvmppc_mmu_map(struct kvm_vcpu *vcpu, u64 eaddr, 
gpa_t gpaddr,
}
  }
  
+#ifdef CONFIG_KVM_BOOKE_HV

+int kvmppc_ld_inst(struct kvm_vcpu *vcpu, u32 *instr)

It'd be interesting to see what the performance impact of doing this on
non-HV would be -- it would eliminate divergent code, eliminate the
MSR_DS hack, and make exec-only mappings work.


We hit the instruction emulation path a lot more often on non-HV, so 
even a slight performance impact that might not be a major bummer for HV 
would become critical for PR.


But I agree - it'd be interesting to see numbers.




+{
+   gva_t geaddr;
+   hpa_t addr;
+   hfn_t pfn;
+   hva_t eaddr;
+   u32 mas0, mas1, mas2, mas3;
+   u64 mas7_mas3;
+   struct page *page;
+   unsigned int addr_space, psize_shift;
+   bool pr;
+   unsigned long flags;
+
+   /* Search TLB for guest pc to get the real address */
+   geaddr = kvmppc_get_pc(vcpu);
addr_space = (vcpu->arch.shared->msr & 

Re: [PATCH 4/4] KVM: PPC: Bookehv: Get vcpu's last instruction for emulation

2014-03-31 Thread Scott Wood
On Mon, 2014-03-31 at 15:41 +0200, Alexander Graf wrote:
 On 03/26/2014 10:17 PM, Scott Wood wrote:
  On Thu, 2014-02-20 at 18:30 +0200, Mihai Caraman wrote:
  +  /*
  +   * Another thread may rewrite the TLB entry in parallel, don't
  +   * execute from the address if the execute permission is not set
  +   */
 
 What happens when another thread rewrites the TLB entry in parallel? 
 Does tlbsx succeed? Does it fail? Do we see failure indicated somehow? 
 Are the contents of the MAS registers consistent at this point or 
 inconsistent?

If another thread rewrites the TLB entry, then we use the new TLB entry,
just as if it had raced in hardware.  This check ensures that we don't
execute from the new TLB entry if it doesn't have execute permissions
(just as hardware would).

What happens if the new TLB entry is valid and executable, and the
instruction pointed to is valid, but doesn't trap (and thus we don't
have emulation for it)?

 There has to be a good way to detect such a race and deal with it, no?

How would you detect it?  We don't get any information from the trap
about what physical address the instruction was fetched from, or what
the instruction was.

-Scott


--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4] KVM: PPC: Bookehv: Get vcpu's last instruction for emulation

2014-03-31 Thread Alexander Graf


 Am 01.04.2014 um 01:03 schrieb Scott Wood scottw...@freescale.com:
 
 On Mon, 2014-03-31 at 15:41 +0200, Alexander Graf wrote:
 On 03/26/2014 10:17 PM, Scott Wood wrote:
 On Thu, 2014-02-20 at 18:30 +0200, Mihai Caraman wrote:
 +/*
 + * Another thread may rewrite the TLB entry in parallel, don't
 + * execute from the address if the execute permission is not set
 + */
 
 What happens when another thread rewrites the TLB entry in parallel? 
 Does tlbsx succeed? Does it fail? Do we see failure indicated somehow? 
 Are the contents of the MAS registers consistent at this point or 
 inconsistent?
 
 If another thread rewrites the TLB entry, then we use the new TLB entry,
 just as if it had raced in hardware.  This check ensures that we don't
 execute from the new TLB entry if it doesn't have execute permissions
 (just as hardware would).
 
 What happens if the new TLB entry is valid and executable, and the
 instruction pointed to is valid, but doesn't trap (and thus we don't
 have emulation for it)?
 
 There has to be a good way to detect such a race and deal with it, no?
 
 How would you detect it?  We don't get any information from the trap
 about what physical address the instruction was fetched from, or what
 the instruction was.

Ah, this is a guest race where the guest modifies exactly the TLB in question. 
I see.

Why would this ever happen in practice (in a case where we're not executing 
malicious code)?


Alex


 
 -Scott
 
 