Re: [PATCH 3/4] KVM: Add SMAP support when setting CR4

2014-03-28 Thread Paolo Bonzini

On 28/03/2014 06:47, Zhang, Yang Z wrote:

 +  smap = smap && u && !uf &&
 +  !((kvm_x86_ops->get_cpl(vcpu) < 3) &&
 +  ((kvm_x86_ops->get_rflags(vcpu) &
 +  X86_EFLAGS_AC) == 1));


 Unfortunately this doesn't work.

 The reason is that changing X86_EFLAGS_AC doesn't trigger
 update_permission_bitmask.  So the value of CPL < 3 && AC = 1 must not
 be checked in update_permission_bitmask; instead, it must be included
 in the index into the permissions array.  You can reuse the
 PFERR_RSVD_MASK bit, like

smapf = pfec & PFERR_RSVD_MASK;
...
smap = smap && smapf && u && !uf;

 The VCPU can then be passed to permission_fault in order to get the
 value of the CPL and the AC bit.


Is CPL check needed? Shouldn't it already have been covered by U bit?


It is not needed but actually it is covered by !uf, I think.

Paolo
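
For illustration (a sketch, not the actual patch): the suggestion above
could take roughly the following shape in permission_fault(); the names
follow the quoted code, everything else is an assumption:

    static inline bool permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
                                        unsigned pte_access, unsigned pfec)
    {
            int cpl = kvm_x86_ops->get_cpl(vcpu);
            unsigned long rflags = kvm_x86_ops->get_rflags(vcpu);

            /* Reuse PFERR_RSVD_MASK (never set by hardware in pfec here)
             * to mean "SMAP is being enforced for this access". */
            if (!(cpl < 3 && (rflags & X86_EFLAGS_AC)))
                    pfec |= PFERR_RSVD_MASK;

            return (mmu->permissions[pfec >> 1] >> pte_access) & 1;
    }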


[Bug 71521] Host call trace when create guest.

2014-03-28 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=71521

Paolo Bonzini bonz...@gnu.org changed:

           What    |Removed |Added
 ----------------------------------
         Status    |NEW     |RESOLVED
     Resolution    |---     |CODE_FIX



Re: [Qemu-devel] Massive read only kvm guests when backing file was missing

2014-03-28 Thread Michael Tokarev
27.03.2014 20:14, Alejandro Comisario wrote:
 Seems like virtio (kvm 1.0) doesn't expose timeout on the guest side
 (ubuntu 12.04 on host and guest).
 So, how can i adjust the timeout on the guest ?

After a bit more talks on IRC yesterday, it turned out that the situation
is _much_ more interesting than originally described.  The OP claims to
have 10500 guests running off an NFS server, and that after NFS server
downtime, the backing files disappeared (whatever that means), so
they had to restore those files.  More, the OP didn't even bother to look
at the guest's dmesg, being busy rebooting all 10500 guests.

 This solution is the most logical one, but i cannot apply it!
 thanks for all the responses!

I suggested the OP to actually describe the _real_ situation, instead of
giving random half-pictures, and actually take a look at the actual problem
as reported in various places (most importantly the guest kernel log), and
report _those_ hints to the list.  I also mentioned that, at least for some
NFS servers, if a client has a file open on the server, and this file is
deleted, the server will report error to the client when client tries to
access that file, and this has nothing at all to do with timeouts of any
kind.

Thanks,

/mjt


RE: [PATCH 3/4] KVM: Add SMAP support when setting CR4

2014-03-28 Thread Wu, Feng


 -Original Message-
 From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On
 Behalf Of Paolo Bonzini
 Sent: Friday, March 28, 2014 2:23 PM
 To: Zhang, Yang Z; Wu, Feng; g...@redhat.com; h...@zytor.com;
 kvm@vger.kernel.org
 Subject: Re: [PATCH 3/4] KVM: Add SMAP support when setting CR4
 
 On 28/03/2014 06:47, Zhang, Yang Z wrote:
   +  smap = smap && u && !uf &&
   +  !((kvm_x86_ops->get_cpl(vcpu) < 3) &&
   +  ((kvm_x86_ops->get_rflags(vcpu) &
   +  X86_EFLAGS_AC) == 1));
  
   Unfortunately this doesn't work.
  
   The reason is that changing X86_EFLAGS_AC doesn't trigger
   update_permission_bitmask.  So the value of CPL < 3 && AC = 1 must not
   be checked in update_permission_bitmask; instead, it must be included
   in the index into the permissions array.  You can reuse the
   PFERR_RSVD_MASK bit, like
  
smapf = pfec & PFERR_RSVD_MASK;
...
smap = smap && smapf && u && !uf;
  
   The VCPU can then be passed to permission_fault in order to get the
   value of the CPL and the AC bit.
 
  Is CPL check needed? Shouldn't it already have been covered by U bit?
 
 It is not needed but actually it is covered by !uf, I think.

In my understanding it is needed, from Intel SDM:

Every access to a linear address is either a supervisor-mode access
or a user-mode access. All accesses performed while the current
privilege level (CPL) is less than 3 are supervisor-mode accesses.
If CPL = 3, accesses are generally user-mode accesses. However, some
operations implicitly access system data structures, and the resulting
accesses to those data structures are supervisor-mode accesses regardless
of CPL. Examples of such implicit supervisor accesses include the following:
accesses to the global descriptor table (GDT) or local descriptor table
(LDT) to load a segment descriptor; accesses to the interrupt descriptor
table (IDT) when delivering an interrupt or exception; and accesses to the
task-state segment (TSS) as part of a task switch or change of CPL.

From the above SDM text, we can see that supervisor-mode accesses can also
happen when CPL equals 3.

If CPL < 3, SMAP protections are disabled if EFLAGS.AC = 1. If CPL = 3,
SMAP applies to all supervisor-mode data accesses (these are implicit
supervisor accesses) regardless of the value of EFLAGS.AC.

So when we check the value of EFLAGS.AC, we also need to check CPL, since the
AC bit only takes effect when CPL < 3.

U==1 means user-mode accesses are allowed, while !uf means it is a fault
from a supervisor-mode access; I think neither *u* nor *uf* can reflect the
value of CPL.

Correct me if I am wrong. Thanks a lot!

 
 Paolo

Thanks,
Feng
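
To make the rule above concrete, the CPL/AC combination can be written as a
small predicate (an illustrative sketch only; the helper is hypothetical,
not from this thread):

    /* Hypothetical helper: does SMAP apply to this supervisor-mode
     * data access to a user-accessible page? */
    static bool smap_enforced(int cpl, unsigned long rflags)
    {
            /* EFLAGS.AC overrides SMAP only while CPL < 3; supervisor
             * accesses at CPL = 3 (the implicit ones) are always
             * subject to SMAP. */
            return !(cpl < 3 && (rflags & X86_EFLAGS_AC));
    }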


Re: [Qemu-devel] Massive read only kvm guests when backing file was missing

2014-03-28 Thread Stefan Hajnoczi
On Fri, Mar 28, 2014 at 11:01:00AM +0400, Michael Tokarev wrote:
 27.03.2014 20:14, Alejandro Comisario wrote:
  Seems like virtio (kvm 1.0) doesn't expose timeout on the guest side
  (ubuntu 12.04 on host and guest).
  So, how can i adjust the timeout on the guest ?
 
 After a bit more talks on IRC yesterday, it turned out that the situation
 is _much_ more interesting than originally described.  The OP claims to
 have 10500 guests running off an NFS server, and that after NFS server
 downtime, the backing files disappeared (whatever that means), so
 they had to restore those files.  More, the OP didn't even bother to look
 at the guest's dmesg, being busy rebooting all 10500 guests.
 
  This solution is the most logical one, but i cannot apply it!
  thanks for all the responses!
 
 I suggested the OP to actually describe the _real_ situation, instead of
 giving random half-pictures, and actually take a look at the actual problem
 as reported in various places (most importantly the guest kernel log), and
 report _those_ hints to the list.  I also mentioned that, at least for some
 NFS servers, if a client has a file open on the server, and this file is
 deleted, the server will report error to the client when client tries to
 access that file, and this has nothing at all to do with timeouts of any
 kind.

Thanks for the update and for taking time to help on IRC.  I feel you're
being harsh on Alejandro though.

Improving the quality of bug reports is important but it shouldn't be at
the expense of quality of communication.  We can't assume that everyone
is an expert in troubleshooting KVM or Linux.  Therefore we shouldn't blame
them; that will only drive people away and detract from the community.

TL;DR post logs and error messages +1, berate him -1

Stefan


RE: [PATCH 3/4] KVM: Add SMAP support when setting CR4

2014-03-28 Thread Wu, Feng


 -Original Message-
 From: Paolo Bonzini [mailto:paolo.bonz...@gmail.com] On Behalf Of Paolo
 Bonzini
 Sent: Thursday, March 27, 2014 7:47 PM
 To: Wu, Feng; g...@redhat.com; h...@zytor.com; kvm@vger.kernel.org
 Subject: Re: [PATCH 3/4] KVM: Add SMAP support when setting CR4
 
 On 27/03/2014 13:25, Feng Wu wrote:
  +void update_permission_bitmask(struct kvm_vcpu *vcpu,
  struct kvm_mmu *mmu, bool ept)
   {
  unsigned bit, byte, pfec;
  u8 map;
  -   bool fault, x, w, u, wf, uf, ff, smep;
  +   bool fault, x, w, u, wf, uf, ff, smep, smap;
 
  smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP);
  +   smap = kvm_read_cr4_bits(vcpu, X86_CR4_SMAP);
for (byte = 0; byte < ARRAY_SIZE(mmu->permissions); ++byte) {
pfec = byte << 1;
  map = 0;
  @@ -3617,11 +3618,26 @@ static void update_permission_bitmask(struct
 kvm_vcpu *vcpu,
w |= !is_write_protection(vcpu) && !uf;
  /* Disallow supervisor fetches of user code if 
  cr4.smep */
x &= !(smep && u && !uf);
  +
  +   /*
+* SMAP: kernel-mode data accesses from user-mode
+* mappings should fault. A fault is considered
+* as a SMAP violation if all of the following
+* conditions are true:
+*   - X86_CR4_SMAP is set in CR4
+*   - A user page is accessed
+*   - !(CPL < 3 && X86_EFLAGS_AC is set)
+*   - Page fault in kernel mode
+*/
+   smap = smap && u && !uf &&
+   !((kvm_x86_ops->get_cpl(vcpu) < 3) &&
+   ((kvm_x86_ops->get_rflags(vcpu) &
+   X86_EFLAGS_AC) == 1));
 
 Unfortunately this doesn't work.
 
 The reason is that changing X86_EFLAGS_AC doesn't trigger
 update_permission_bitmask.  So the value of CPL < 3 && AC = 1 must not
 be checked in update_permission_bitmask; instead, it must be included in
 the index into the permissions array.  You can reuse the PFERR_RSVD_MASK
 bit, like
 
   smapf = pfec & PFERR_RSVD_MASK;
   ...
   smap = smap && smapf && u && !uf;
 
 The VCPU can then be passed to permission_fault in order to get the
 value of the CPL and the AC bit.
 
 Please test nested virtualization too.  I think PFERR_RSVD_MASK should
 be removed in translate_nested_gpa.

Thanks very much for your comments. I changed the code accordingly, and
patch v2 will be sent out for review. Since setting up the environment for
nested virtualization is a little time-consuming and is in progress now,
could you please first have a look at the patch to see if it is what you
expected? Any comments are appreciated!

 
 Paolo


[PATCH 3/4] KVM: Disable SMAP for guests in EPT realmode and EPT unpaging mode

2014-03-28 Thread Feng Wu
SMAP is disabled if CPU is in non-paging mode in hardware.
However KVM always uses paging mode to emulate guest non-paging
mode with TDP. To emulate this behavior, SMAP needs to be
manually disabled when guest switches to non-paging mode.

Signed-off-by: Feng Wu feng...@intel.com
---
 arch/x86/kvm/vmx.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 3927528..e58cb5f 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3452,13 +3452,14 @@ static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned 
long cr4)
hw_cr4 &= ~X86_CR4_PAE;
hw_cr4 |= X86_CR4_PSE;
/*
-* SMEP is disabled if CPU is in non-paging mode in
-* hardware. However KVM always uses paging mode to
+* SMEP/SMAP is disabled if CPU is in non-paging mode
+* in hardware. However KVM always uses paging mode to
 * emulate guest non-paging mode with TDP.
-* To emulate this behavior, SMEP needs to be manually
-* disabled when guest switches to non-paging mode.
+* To emulate this behavior, SMEP/SMAP needs to be
+* manually disabled when guest switches to non-paging
+* mode.
 */
-   hw_cr4 &= ~X86_CR4_SMEP;
+   hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP);
} else if (!(cr4 & X86_CR4_PAE)) {
hw_cr4 &= ~X86_CR4_PAE;
}
-- 
1.8.3.1



[PATCH 1/4] KVM: Remove SMAP bit from CR4_RESERVED_BITS.

2014-03-28 Thread Feng Wu
This patch removes SMAP bit from CR4_RESERVED_BITS.

Signed-off-by: Feng Wu feng...@intel.com
---
 arch/x86/include/asm/kvm_host.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index fdf83af..4eeb049 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -60,7 +60,7 @@
  | X86_CR4_PSE | X86_CR4_PAE | X86_CR4_MCE \
 | X86_CR4_PGE | X86_CR4_PCE | X86_CR4_OSFXSR | X86_CR4_PCIDE \
  | X86_CR4_OSXSAVE | X86_CR4_SMEP | X86_CR4_FSGSBASE \
- | X86_CR4_OSXMMEXCPT | X86_CR4_VMXE))
+ | X86_CR4_OSXMMEXCPT | X86_CR4_VMXE | X86_CR4_SMAP))
 
 #define CR8_RESERVED_BITS (~(unsigned long)X86_CR8_TPR)
 
-- 
1.8.3.1



[PATCH 0/4] KVM: enable Intel SMAP for KVM

2014-03-28 Thread Feng Wu
Supervisor Mode Access Prevention (SMAP) is a new security feature
disclosed by Intel; please refer to the following document:

http://software.intel.com/sites/default/files/319433-014.pdf
 
Every access to a linear address is either a supervisor-mode access
or a user-mode access. All accesses performed while the current
privilege level (CPL) is less than 3 are supervisor-mode accesses.
If CPL = 3, accesses are generally user-mode accesses. However, some
operations implicitly access system data structures, and the resulting
accesses to those data structures are supervisor-mode accesses regardless
of CPL. Examples of such implicit supervisor accesses include the following:
accesses to the global descriptor table (GDT) or local descriptor table
(LDT) to load a segment descriptor; accesses to the interrupt descriptor
table (IDT) when delivering an interrupt or exception; and accesses to the
task-state segment (TSS) as part of a task switch or change of CPL.

If CR4.SMAP = 1, supervisor-mode data accesses are not allowed to linear
addresses that are accessible in user mode. If CPL < 3, SMAP protections
are disabled if EFLAGS.AC = 1. If CPL = 3, SMAP applies to all supervisor-mode
data accesses (these are implicit supervisor accesses) regardless of the
value of EFLAGS.AC.

This patchset passes the SMAP feature through to guests, and lets guests
benefit from it.

Version 1:
  * Remove SMAP bit from CR4_RESERVED_BITS.
  * Add SMAP support when setting CR4
  * Disable SMAP for guests in EPT realmode and EPT unpaging mode
  * Expose SMAP feature to guest

Version 2:
  * Change the logic of updating mmu permission bitmap for SMAP violation
  * Expose SMAP feature to guest in the last patch of this series.

Feng Wu (4):
  KVM: Remove SMAP bit from CR4_RESERVED_BITS.
  KVM: Add SMAP support when setting CR4
  KVM: Disable SMAP for guests in EPT realmode and EPT unpaging mode
  KVM: expose SMAP feature to guest

 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/cpuid.c|  2 +-
 arch/x86/kvm/cpuid.h|  8 
 arch/x86/kvm/mmu.c  | 24 +---
 arch/x86/kvm/mmu.h  | 26 +++---
 arch/x86/kvm/paging_tmpl.h  |  2 +-
 arch/x86/kvm/vmx.c  | 11 ++-
 arch/x86/kvm/x86.c  |  9 -
 8 files changed, 69 insertions(+), 15 deletions(-)

-- 
1.8.3.1



[PATCH 4/4] KVM: expose SMAP feature to guest

2014-03-28 Thread Feng Wu
This patch exposes the SMAP feature to the guest.

Signed-off-by: Feng Wu feng...@intel.com
---
 arch/x86/kvm/cpuid.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index c697625..deb5f9b 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -303,7 +303,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 
*entry, u32 function,
/* cpuid 7.0.ebx */
const u32 kvm_supported_word9_x86_features =
F(FSGSBASE) | F(BMI1) | F(HLE) | F(AVX2) | F(SMEP) |
-   F(BMI2) | F(ERMS) | f_invpcid | F(RTM);
+   F(BMI2) | F(ERMS) | f_invpcid | F(RTM) | F(SMAP);
 
/* all calls to cpuid_count() should be made on the same cpu */
get_cpu();
-- 
1.8.3.1



[PATCH 2/4] KVM: Add SMAP support when setting CR4

2014-03-28 Thread Feng Wu
This patch adds SMAP handling logic when setting CR4 for guests

Signed-off-by: Feng Wu feng...@intel.com
---
 arch/x86/kvm/cpuid.h   |  8 
 arch/x86/kvm/mmu.c | 24 +---
 arch/x86/kvm/mmu.h | 26 +++---
 arch/x86/kvm/paging_tmpl.h |  2 +-
 arch/x86/kvm/x86.c |  9 -
 5 files changed, 61 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index a2a1bb7..eeecbed 100644
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -48,6 +48,14 @@ static inline bool guest_cpuid_has_smep(struct kvm_vcpu 
*vcpu)
return best && (best->ebx & bit(X86_FEATURE_SMEP));
 }
 
+static inline bool guest_cpuid_has_smap(struct kvm_vcpu *vcpu)
+{
+   struct kvm_cpuid_entry2 *best;
+
+   best = kvm_find_cpuid_entry(vcpu, 7, 0);
+   return best && (best->ebx & bit(X86_FEATURE_SMAP));
+}
+
 static inline bool guest_cpuid_has_fsgsbase(struct kvm_vcpu *vcpu)
 {
struct kvm_cpuid_entry2 *best;
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9b53135..83b7f8d 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3601,20 +3601,22 @@ static void reset_rsvds_bits_mask_ept(struct kvm_vcpu 
*vcpu,
}
 }
 
-static void update_permission_bitmask(struct kvm_vcpu *vcpu,
+void update_permission_bitmask(struct kvm_vcpu *vcpu,
struct kvm_mmu *mmu, bool ept)
 {
unsigned bit, byte, pfec;
u8 map;
-   bool fault, x, w, u, wf, uf, ff, smep;
+   bool fault, x, w, u, wf, uf, ff, smapf, smep, smap;
 
smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP);
+   smap = kvm_read_cr4_bits(vcpu, X86_CR4_SMAP);
for (byte = 0; byte < ARRAY_SIZE(mmu->permissions); ++byte) {
pfec = byte << 1;
map = 0;
wf = pfec & PFERR_WRITE_MASK;
uf = pfec & PFERR_USER_MASK;
ff = pfec & PFERR_FETCH_MASK;
+   smapf = pfec & PFERR_RSVD_MASK;
for (bit = 0; bit < 8; ++bit) {
x = bit & ACC_EXEC_MASK;
w = bit & ACC_WRITE_MASK;
@@ -3627,11 +3629,27 @@ static void update_permission_bitmask(struct kvm_vcpu 
*vcpu,
w |= !is_write_protection(vcpu) && !uf;
/* Disallow supervisor fetches of user code if 
cr4.smep */
x &= !(smep && u && !uf);
+
+   /*
+    * SMAP: kernel-mode data accesses from user-mode
+    * mappings should fault. A fault is considered
+    * as a SMAP violation if all of the following
+    * conditions are true:
+    *   - X86_CR4_SMAP is set in CR4
+    *   - A user page is accessed
+    *   - Page fault in kernel mode
+    *   - !(CPL < 3 && X86_EFLAGS_AC is set)
+    *
+    *   Here, we cover the first three conditions,
+    *   we need to check CPL and X86_EFLAGS_AC in
+    *   permission_fault() dynamiccally
+    */
+   smap = smap && smapf && u && !uf;
} else
/* Not really needed: no U/S accesses on ept  */
u = 1;
 
-   fault = (ff && !x) || (uf && !u) || (wf && !w);
+   fault = (ff && !x) || (uf && !u) || (wf && !w) || smap;
map |= fault << bit;
}
mmu->permissions[byte] = map;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 2926152..9d7a0b3 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -73,6 +73,8 @@ int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 
addr, bool direct);
 void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context);
 void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context,
bool execonly);
+void update_permission_bitmask(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
+   bool ept);
 
 static inline unsigned int kvm_mmu_available_pages(struct kvm *kvm)
 {
@@ -110,10 +112,28 @@ static inline bool is_write_protection(struct kvm_vcpu 
*vcpu)
  * Will a fault with a given page-fault error code (pfec) cause a permission
  * fault with the given access (in ACC_* format)?
  */
-static inline bool permission_fault(struct kvm_mmu *mmu, unsigned pte_access,
-   unsigned pfec)
+static inline bool permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
+   unsigned pte_access, unsigned pfec)
 {
-   return (mmu->permissions[pfec >> 1] >> pte_access) & 1;
+   bool smapf;
+   int cpl = 

Re: [PATCH 1/2] KVM: PPC: Book3S HV: Make TM avoid program check

2014-03-28 Thread Paul Mackerras
On Fri, Mar 28, 2014 at 04:40:36PM +1100, Michael Neuling wrote:
 Currently, using kvmppc_set_one_reg() a transaction could be set up with
 the TEXASR Failure Summary (FS) bit not set.  When this is switched back
 in by the host, this will result in a TM Bad Thing (ie a 0x700 program
 check) when the trechkpt is run.
 
 This patch avoids that by always setting the TEXASR FS when there is an
 active transaction being started.
 
 This patch is on top of Paulus' recent KVM TM patch set.

Thanks, Mikey.

Do you mind if I fold these two patches into patch 2/8 of the set I
posted?

Paul.


Re: [PATCH 2/4] KVM: Add SMAP support when setting CR4

2014-03-28 Thread Paolo Bonzini
On 28/03/2014 18:36, Feng Wu wrote:
 + smap = kvm_read_cr4_bits(vcpu, X86_CR4_SMAP);

You are overwriting this variable below, but that is not okay because
the value of CR4 must be considered separately in each iteration.  This
also hides an uninitialized-variable bug for smap in the EPT
case.

To avoid that, rename this variable to cr4_smap; it's probably better 
to rename smep to cr4_smep too.

   for (byte = 0; byte < ARRAY_SIZE(mmu->permissions); ++byte) {
   pfec = byte << 1;
   map = 0;
   wf = pfec & PFERR_WRITE_MASK;
   uf = pfec & PFERR_USER_MASK;
   ff = pfec & PFERR_FETCH_MASK;
 + smapf = pfec & PFERR_RSVD_MASK;

The reader will expect PFERR_RSVD_MASK to be zero here.  So please
add a comment: /* PFERR_RSVD_MASK is set in pfec if ... */.

   for (bit = 0; bit < 8; ++bit) {
   x = bit & ACC_EXEC_MASK;
   w = bit & ACC_WRITE_MASK;
 @@ -3627,11 +3629,27 @@ static void update_permission_bitmask(struct kvm_vcpu 
 *vcpu,
   w |= !is_write_protection(vcpu) && !uf;
   /* Disallow supervisor fetches of user code if 
 cr4.smep */
   x &= !(smep && u && !uf);
 +
 + /*
 +  * SMAP: kernel-mode data accesses from user-mode
 +  * mappings should fault. A fault is considered
 +  * as a SMAP violation if all of the following
 +  * conditions are true:
 +  *   - X86_CR4_SMAP is set in CR4
 +  *   - A user page is accessed
 +  *   - Page fault in kernel mode
 +  *   - !(CPL < 3 && X86_EFLAGS_AC is set)
 +  *
 +  *   Here, we cover the first three conditions,
 +  *   we need to check CPL and X86_EFLAGS_AC in
 +  *   permission_fault() dynamiccally

dynamically.  These three lines however are not entirely correct.  We do
cover the last condition here, it is in smapf.  So perhaps something like

 * Here, we cover the first three conditions.
 * The CPL and X86_EFLAGS_AC is in smapf, which
 * permission_fault() computes dynamically.

 +  */
 + smap = smap && smapf && u && !uf;

SMAP does not affect instruction fetches.  Do you need && !ff here?  Perhaps
it's clearer to add it even if it is not strictly necessary.

Please write just smap = cr4_smap && u && !uf && !ff here, and add back smapf
below
in the assignment to fault.  This makes the code more homogeneous.

   } else
   /* Not really needed: no U/S accesses on ept  */
   u = 1;
 - fault = (ff && !x) || (uf && !u) || (wf && !w);
 + fault = (ff && !x) || (uf && !u) || (wf && !w) || smap;

...

 +
 + /*
 +  * If CPL < 3, SMAP protections are disabled if EFLAGS.AC = 1.
 +  *
 +  * If CPL = 3, SMAP applies to all supervisor-mode data accesses
 +  * (these are implicit supervisor accesses) regardless of the value
 +  * of EFLAGS.AC.
 +  *
 +  * So we need to check CPL and EFLAGS.AC to detect whether there is
 +  * a SMAP violation.
 +  */
 +
 + smapf = ((mmu->permissions[(pfec | PFERR_RSVD_MASK) >> 1] >>
 +  pte_access) & 1) && !((cpl < 3) && ((rflags & X86_EFLAGS_AC) == 1));
 +
 + return ((mmu->permissions[pfec >> 1] >> pte_access) & 1) || smapf;

You do not need two separate accesses.  Just add PFERR_RSVD_MASK to pfec if
the conditions for SMAP are satisfied.  There are two possibilities:

1) setting PFERR_RSVD_MASK if SMAP is being enforced, that is if CPL = 3
|| AC = 0.  This is what you are doing now.

2) setting PFERR_RSVD_MASK if SMAP is being overridden, that is if CPL < 3
&& AC = 1.  You then have to invert the bit in update_permission_bitmask.

Please consider both choices, and pick the one that gives better code.
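
(For illustration only, ignoring for a moment the branchless requirement
discussed below: option 1 might look like the following sketch; the names
follow the quoted code, the exact form is an assumption.)

    /* Option 1: mark pfec when SMAP is being enforced, i.e. unless the
     * access runs with CPL < 3 and EFLAGS.AC = 1. */
    if (!(cpl < 3 && (rflags & X86_EFLAGS_AC)))
            pfec |= PFERR_RSVD_MASK;
    return (mmu->permissions[pfec >> 1] >> pte_access) & 1;

Option 2 would set the bit in the opposite case and invert the sense of
smapf in update_permission_bitmask.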

Also, this must be written in a branchless way.  Branchless tricks are common
throughout the MMU code because the hit rate of most branches is pretty much
50%-50%.  This is also true in this case, at least if SMAP is in use (if it
is not in use, we'll have AC=0 most of the time).

I don't want to spoil the fun, but I don't want to waste your time either
so I rot13'ed my solution and placed it after the signature. ;)

As to nested virtualization, I reread the code and it should already work,
because it sets PFERR_USER_MASK.  This means uf=1, and a SMAP fault will
never trigger with uf=1.

Thanks for following this!  Please include v3 in the patch subject on
your next post!

Paolo

-- 8< --
Nqq qrsvavgvbaf sbe CSREE_*_OVG (0 sbe cerfrag, 1 

Re: [PATCH 1/2] KVM: PPC: Book3S HV: Make TM avoid program check

2014-03-28 Thread Paolo Bonzini

On 28/03/2014 12:08, Paul Mackerras wrote:

 Currently, using kvmppc_set_one_reg() a transaction could be set up with
 the TEXASR Failure Summary (FS) bit not set.  When this is switched back
 in by the host, this will result in a TM Bad Thing (ie a 0x700 program
 check) when the trechkpt is run.

 This patch avoids that by always setting the TEXASR FS when there is an
 active transaction being started.

 This patch is on top of Paulus' recent KVM TM patch set.

Thanks, Mikey.

Do you mind if I fold these two patches into patch 2/8 of the set I
posted?


In either case, am I right that this doesn't include the patches in 
kvm-ppc-queue?  I'm waiting for the pull request. :)


Paolo


Re: [RFC PATCH 2/5] KVM: x86: avoid useless set of KVM_REQ_EVENT after emulation

2014-03-28 Thread Paolo Bonzini
On 27/03/2014 12:30, Paolo Bonzini wrote:
 Despite the provisions to emulate up to 130 consecutive instructions, in
 practice KVM will emulate just one before exiting handle_invalid_guest_state,
 because x86_emulate_instruction always sets KVM_REQ_EVENT.
 
 However, we only need to do this if an interrupt could be injected,
 which happens a) if an interrupt shadow bit (STI or MOV SS) has gone
 away; b) if the interrupt flag has just been set (because other
 instructions than STI can set it without enabling an interrupt shadow).
 
 This cuts another 250-300 clock cycles from the cost of emulating an
 instruction (530-870 cycles before the patch on kvm-unit-tests,
 290-600 afterwards).
 
 Signed-off-by: Paolo Bonzini pbonz...@redhat.com
 ---
  arch/x86/kvm/x86.c | 28 ++--
  1 file changed, 18 insertions(+), 10 deletions(-)
 
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index fd31aada351b..ce9523345f2e 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -87,6 +87,7 @@ static u64 __read_mostly efer_reserved_bits = 
 ~((u64)EFER_SCE);
  
  static void update_cr8_intercept(struct kvm_vcpu *vcpu);
  static void process_nmi(struct kvm_vcpu *vcpu);
 +static void __kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags);
  
  struct kvm_x86_ops *kvm_x86_ops;
  EXPORT_SYMBOL_GPL(kvm_x86_ops);
 @@ -4856,8 +4857,10 @@ static void toggle_interruptibility(struct kvm_vcpu 
 *vcpu, u32 mask)
* means that the last instruction is an sti. We should not
* leave the flag on in this case. The same goes for mov ss
*/
 - if (!(int_shadow & mask))
 + if (unlikely(int_shadow) && !(int_shadow & mask)) {
   kvm_x86_ops->set_interrupt_shadow(vcpu, mask);
 + kvm_make_request(KVM_REQ_EVENT, vcpu);
 + }

Better:

 * means that the last instruction is an sti. We should not
 * leave the flag on in this case. The same goes for mov ss
 */
-   if (!(int_shadow & mask))
+   mask &= ~int_shadow;
+   if (unlikely(mask != int_shadow))
kvm_x86_ops->set_interrupt_shadow(vcpu, mask);
+
+   /*
+    * The interrupt window might have opened if a bit has been cleared.
+    */
+   if (unlikely(int_shadow & ~mask))
+   kvm_make_request(KVM_REQ_EVENT, vcpu);

Paolo

  }
  
  static void inject_emulated_exception(struct kvm_vcpu *vcpu)
 @@ -5083,20 +5086,18 @@ static int kvm_vcpu_check_hw_bp(unsigned long addr, 
 u32 type, u32 dr7,
   return dr6;
  }
  
 -static void kvm_vcpu_check_singlestep(struct kvm_vcpu *vcpu, int *r)
 +static void kvm_vcpu_check_singlestep(struct kvm_vcpu *vcpu, unsigned long rflags, int *r)
  {
  struct kvm_run *kvm_run = vcpu->run;
  
   /*
 -  * Use the raw value to see if TF was passed to the processor.
 -  * Note that the new value of the flags has not been saved yet.
 +  * rflags is the old, raw value of the flags.  The new value has
 +  * not been saved yet.
*
* This is correct even for TF set by the guest, because the
* processor will not generate this exception after the instruction
* that sets the TF flag.
*/
 - unsigned long rflags = kvm_x86_ops->get_rflags(vcpu);
 -
   if (unlikely(rflags & X86_EFLAGS_TF)) {
   if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) {
   kvm_run->debug.arch.dr6 = DR6_BS | DR6_FIXED_1;
 @@ -5263,13 +5264,15 @@ restart:
   r = EMULATE_DONE;
  
   if (writeback) {
 + unsigned long rflags = kvm_x86_ops->get_rflags(vcpu);
   toggle_interruptibility(vcpu, ctxt->interruptibility);
 - kvm_make_request(KVM_REQ_EVENT, vcpu);
   vcpu->arch.emulate_regs_need_sync_to_vcpu = false;
   kvm_rip_write(vcpu, ctxt->eip);
   if (r == EMULATE_DONE)
 - kvm_vcpu_check_singlestep(vcpu, r);
 - kvm_set_rflags(vcpu, ctxt->eflags);
 + kvm_vcpu_check_singlestep(vcpu, rflags, r);
 + __kvm_set_rflags(vcpu, ctxt->eflags);
 + if (unlikely((ctxt->eflags & ~rflags) & X86_EFLAGS_IF))
 + kvm_make_request(KVM_REQ_EVENT, vcpu);
   } else
   vcpu->arch.emulate_regs_need_sync_to_vcpu = true;
  
 @@ -7385,12 +7388,17 @@ unsigned long kvm_get_rflags(struct kvm_vcpu *vcpu)
  }
  EXPORT_SYMBOL_GPL(kvm_get_rflags);
  
 -void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
 +static void __kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
  {
   if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP &&
   kvm_is_linear_rip(vcpu, vcpu->arch.singlestep_rip))
   rflags |= X86_EFLAGS_TF;
   kvm_x86_ops->set_rflags(vcpu, rflags);
 +}
 +
 +void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
 +{
 + __kvm_set_rflags(vcpu, rflags);
   kvm_make_request(KVM_REQ_EVENT, vcpu);
  }
  EXPORT_SYMBOL_GPL(kvm_set_rflags);
 


[PATCH] arm: KVM: fix possible misalignment of PGDs and bounce page

2014-03-28 Thread Mark Salter
The kvm/mmu code shared by arm and arm64 uses kmalloc() to allocate
a bounce page (if hypervisor init code crosses a page boundary) and
hypervisor PGDs. The problem is that kmalloc() does not guarantee
the proper alignment. In the case of the bounce page, the page sized
buffer allocated may also cross a page boundary negating the purpose
and leading to a hang during kvm initialization. Likewise the PGDs
allocated may not meet the minimum alignment requirements of the
underlying MMU. This patch uses __get_free_page() to guarantee the
worst case alignment needs of the bounce page and PGDs on both arm
and arm64.
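
(Illustrative aside, not part of the commit message: pages from the buddy
allocator are naturally aligned to their block size, which is the property
the patch relies on. A minimal sketch:)

    /* __get_free_pages() returns a block aligned to its own size
     * (PAGE_SIZE << order), whereas kmalloc() only guarantees
     * ARCH_KMALLOC_MINALIGN. */
    unsigned long pgd = __get_free_pages(GFP_KERNEL | __GFP_ZERO, order);
    BUG_ON(pgd && (pgd & ((PAGE_SIZE << order) - 1)));  /* never triggers */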

Signed-off-by: Mark Salter msal...@redhat.com
---
 arch/arm/kvm/mmu.c | 15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 7789857..575d790 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -42,6 +42,8 @@ static unsigned long hyp_idmap_start;
 static unsigned long hyp_idmap_end;
 static phys_addr_t hyp_idmap_vector;
 
+#define pgd_order get_order(PTRS_PER_PGD * sizeof(pgd_t))
+
 #define kvm_pmd_huge(_x)   (pmd_huge(_x) || pmd_trans_huge(_x))
 
 static void kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa)
@@ -199,14 +201,14 @@ void free_boot_hyp_pgd(void)
if (boot_hyp_pgd) {
unmap_range(NULL, boot_hyp_pgd, hyp_idmap_start, PAGE_SIZE);
unmap_range(NULL, boot_hyp_pgd, TRAMPOLINE_VA, PAGE_SIZE);
-   kfree(boot_hyp_pgd);
+   free_pages((unsigned long)boot_hyp_pgd, pgd_order);
boot_hyp_pgd = NULL;
}
 
if (hyp_pgd)
unmap_range(NULL, hyp_pgd, TRAMPOLINE_VA, PAGE_SIZE);
 
-   kfree(init_bounce_page);
+   free_page((unsigned long)init_bounce_page);
init_bounce_page = NULL;
 
mutex_unlock(&kvm_hyp_pgd_mutex);
@@ -236,7 +238,7 @@ void free_hyp_pgds(void)
for (addr = VMALLOC_START; is_vmalloc_addr((void*)addr); addr += PGDIR_SIZE)
unmap_range(NULL, hyp_pgd, KERN_TO_HYP(addr), PGDIR_SIZE);
 
-   kfree(hyp_pgd);
+   free_pages((unsigned long)hyp_pgd, pgd_order);
hyp_pgd = NULL;
}
 
@@ -930,7 +932,7 @@ int kvm_mmu_init(void)
size_t len = __hyp_idmap_text_end - __hyp_idmap_text_start;
phys_addr_t phys_base;
 
-   init_bounce_page = kmalloc(PAGE_SIZE, GFP_KERNEL);
+   init_bounce_page = (void *)__get_free_page(GFP_KERNEL);
if (!init_bounce_page) {
kvm_err("Couldn't allocate HYP init bounce page\n");
err = -ENOMEM;
@@ -956,8 +958,9 @@ int kvm_mmu_init(void)
 (unsigned long)phys_base);
}
 
-   hyp_pgd = kzalloc(PTRS_PER_PGD * sizeof(pgd_t), GFP_KERNEL);
-   boot_hyp_pgd = kzalloc(PTRS_PER_PGD * sizeof(pgd_t), GFP_KERNEL);
+   hyp_pgd = (pgd_t *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, pgd_order);
+   boot_hyp_pgd = (pgd_t *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, pgd_order);
+
if (!hyp_pgd || !boot_hyp_pgd) {
kvm_err("Hyp mode PGD not allocated\n");
err = -ENOMEM;
-- 
1.8.5.3



huge 2nd stage pages and live migration

2014-03-28 Thread Mario Smarduch
Hello

I've been working on live migration for ARM-KVM, and noticed a
problem completing migration with huge 2nd stage tables.
 

After write protecting the VM, for a write fault 512 page bits
are set in dirty_bitmap[] to take into account future writes to
the huge page. The pmd is write protected again when QEMU reads the
dirty log, and the cycle repeats. With this not even an idle
32MB VM completes live migration.

If QEMU uses THPs, and 2nd stage tables use pte's, then there
is no problem, live migration is quick. I'm assuming QEMU and Guest
huge pages with 2nd stage page table pte's should work fine too.

I'm wondering how this has been solved (for any architecture)? 

- Mario


Re: [PATCH 3/4] KVM: Add SMAP support when setting CR4

2014-03-28 Thread Paolo Bonzini

On 28/03/2014 08:33, Wu, Feng wrote:

In my understanding it is needed, from Intel SDM:

Every access to a linear address is either a supervisor-mode access
or a user-mode access. All accesses performed while the current
privilege level (CPL) is less than 3 are supervisor-mode accesses.
If CPL = 3, accesses are generally user-mode accesses. However, some
operations implicitly access system data structures, and the resulting
accesses to those data structures are supervisor-mode accesses regardless
of CPL. Examples of such implicit supervisor accesses include the following:
accesses to the global descriptor table (GDT) or local descriptor table
(LDT) to load a segment descriptor; accesses to the interrupt descriptor
table (IDT) when delivering an interrupt or exception; and accesses to the
task-state segment (TSS) as part of a task switch or change of CPL.

From the above SDM text, we can see that supervisor-mode accesses can also
happen when CPL equals 3.

If CPL < 3, SMAP protections are disabled if EFLAGS.AC = 1. If CPL = 3,
SMAP applies to all supervisor-mode data accesses (these are implicit
supervisor accesses) regardless of the value of EFLAGS.AC.

So when we check the value of EFLAGS.AC, we also need to check CPL, since the
AC bit only takes effect when CPL < 3.

U==1 means user-mode accesses are allowed, while !uf means it is a fault
from a supervisor-mode access; I think neither *u* nor *uf* can reflect the
value of CPL.

Correct me if I am wrong. Thanks a lot!


You're right!

Paolo


RE: [PATCH 2/4] KVM: Add SMAP support when setting CR4

2014-03-28 Thread Wu, Feng


 -Original Message-
 From: Paolo Bonzini [mailto:paolo.bonz...@gmail.com] On Behalf Of Paolo
 Bonzini
 Sent: Friday, March 28, 2014 8:03 PM
 To: Wu, Feng; g...@redhat.com; h...@zytor.com; kvm@vger.kernel.org
 Subject: Re: [PATCH 2/4] KVM: Add SMAP support when setting CR4
 
  On 28/03/2014 18:36, Feng Wu wrote:
  +   smap = kvm_read_cr4_bits(vcpu, X86_CR4_SMAP);
 
 You are overwriting this variable below, but that is not okay because
 the value of CR4 must be considered separately in each iteration.  This
 also hides an uninitialized-variable bug for smap in the EPT
 case.
 
 To avoid that, rename this variable to cr4_smap; it's probably better
 to rename smep to cr4_smep too.
 
   for (byte = 0; byte < ARRAY_SIZE(mmu->permissions); ++byte) {
   pfec = byte << 1;
   map = 0;
   wf = pfec & PFERR_WRITE_MASK;
   uf = pfec & PFERR_USER_MASK;
   ff = pfec & PFERR_FETCH_MASK;
   +   smapf = pfec & PFERR_RSVD_MASK;
 
 The reader will expect PFERR_RSVD_MASK to be zero here.  So please
 add a comment: /* PFERR_RSVD_MASK is set in pfec if ... */.
 
   for (bit = 0; bit < 8; ++bit) {
   x = bit & ACC_EXEC_MASK;
   w = bit & ACC_WRITE_MASK;
  @@ -3627,11 +3629,27 @@ static void update_permission_bitmask(struct
 kvm_vcpu *vcpu,
   w |= !is_write_protection(vcpu) && !uf;
  /* Disallow supervisor fetches of user code if 
  cr4.smep */
   x &= !(smep && u && !uf);
  +
  +   /*
   +* SMAP: kernel-mode data accesses from user-mode
   +* mappings should fault. A fault is considered
   +* as a SMAP violation if all of the following
   +* conditions are true:
   +*   - X86_CR4_SMAP is set in CR4
   +*   - A user page is accessed
   +*   - Page fault in kernel mode
   +*   - !(CPL < 3 && X86_EFLAGS_AC is set)
  +*
  +*   Here, we cover the first three conditions,
  +*   we need to check CPL and X86_EFLAGS_AC in
  +*   permission_fault() dynamiccally
 
 dynamically.  These three lines however are not entirely correct.  We do
 cover the last condition here, it is in smapf.  So perhaps something like
 
  * Here, we cover the first three conditions.
  * The CPL and X86_EFLAGS_AC is in smapf, which
  * permission_fault() computes dynamically.
 
  +*/
   +   smap = smap && smapf && u && !uf;
 
 SMAP does not affect instruction fetches.  Do you need && !ff here?
 Perhaps
 it's clearer to add it even if it is not strictly necessary.
 
 Please write just smap = cr4_smap && u && !uf && !ff here, and add back
 smapf below
 in the assignment to fault.  This makes the code more homogeneous.
 
  } else
  /* Not really needed: no U/S accesses on ept  */
  u = 1;
   -   fault = (ff && !x) || (uf && !u) || (wf && !w);
   +   fault = (ff && !x) || (uf && !u) || (wf && !w) || smap;
 
 ...
 
  +
  +   /*
  +* If CPL < 3, SMAP protections are disabled if EFLAGS.AC = 1.
  +*
  +* If CPL = 3, SMAP applies to all supervisor-mode data accesses
  +* (these are implicit supervisor accesses) regardless of the value
  +* of EFLAGS.AC.
  +*
  +* So we need to check CPL and EFLAGS.AC to detect whether there is
  +* a SMAP violation.
  +*/
  +
  +   smapf = ((mmu->permissions[(pfec | PFERR_RSVD_MASK) >> 1] >>
  +            pte_access) & 1) && !((cpl < 3) && ((rflags & X86_EFLAGS_AC) == 1));
  +
  +   return ((mmu->permissions[pfec >> 1] >> pte_access) & 1) || smapf;
 
 You do not need two separate accesses.  Just add PFERR_RSVD_MASK to pfec
 if
 the conditions for SMAP are satisfied.  There are two possibilities:
 
 1) setting PFERR_RSVD_MASK if SMAP is being enforced, that is if CPL = 3
 || AC = 0.  This is what you are doing now.
 
  2) setting PFERR_RSVD_MASK if SMAP is being overridden, that is if CPL < 3
  && AC = 1.  You then have to invert the bit in update_permission_bitmask.
 
 Please consider both choices, and pick the one that gives better code.
 
 Also, this must be written in a branchless way.  Branchless tricks are common
 throughout the MMU code because the hit rate of most branches is pretty
 much
 50%-50%.  This is also true in this case, at least if SMAP is in use (if it
 is not in use, we'll have AC=0 most of the time).
 
 I don't want to spoil the fun, but I don't want to waste your time either
 so I rot13'ed my solution and placed it after the signature. ;)
 
 As to nested virtualization, I reread the code and it should already work,
 because it sets 

[RFC v2] ARM VM System Specification

2014-03-28 Thread Christoffer Dall
ARM VM System Specification
===

Goal

The goal of this spec is to allow suitably-built OS images to run on
all ARM virtualization solutions, such as KVM or Xen.

Recommendations in this spec are valid for aarch32 and aarch64 alike, and
they aim to be hypervisor agnostic.

Note that simply adhering to the SBSA [2] is not a valid approach, for
example because the SBSA mandates EL2, which will not be available for
VMs.  Further, this spec also covers the aarch32 execution mode, not
covered in the SBSA.


Image format

The image format, as presented to the VM, needs to be well-defined in
order for prepared disk images to be bootable across various
virtualization implementations.

The raw disk format as presented to the VM must be partitioned with a
GUID Partition Table (GPT).  The bootable software must be placed in the
EFI System Partition (ESP), using the UEFI removable media path, and
must be an EFI application complying to the UEFI Specification 2.4
Revision A [6].

The ESP partition's GPT entry's partition type GUID must be
C12A7328-F81F-11D2-BA4B-00A0C93EC93B and the file system must be
formatted as FAT32/vfat as per Section 12.3.1.1 in [6].

The removable media path is \EFI\BOOT\BOOTARM.EFI for the aarch32
execution state and is \EFI\BOOT\BOOTAA64.EFI for the aarch64 execution
state, as specified in Sections 3.3 (Boot Option Variables Default Boot
Behavior) and 3.4.1.1 (Removable Media Boot Behavior) in [6].

This ensures that tools for both Xen and KVM can load a binary UEFI
firmware which can read and boot the EFI application in the disk image.

A typical scenario will be GRUB2 packaged as an EFI application, which
mounts the system boot partition and boots Linux.


Virtual Firmware

The VM system must be UEFI compliant in order to be able to boot the EFI
application in the ESP.  It is recommended that this is achieved by
loading a UEFI binary as the first software executed by the VM, which
then executes the EFI application.  The UEFI implementation should be
compliant with UEFI Specification 2.4 Revision A [6] or later.

This document strongly recommends that the VM implementation supports
persistent environment storage for virtual firmware implementation in
order to ensure probable use cases such as adding additional disk images
to a VM or running installers to perform upgrades.

This document strongly recommends that VM implementations implement
persistent variable storage for their UEFI implementation.  Persistent
variable storage shall be a property of a VM instance, but shall not be
stored as part of a portable disk image.  Portable disk images shall
conform to the UEFI removable disk requirements from the UEFI spec and
cannot rely on a pre-configured UEFI environment.

The binary UEFI firmware implementation should not be distributed as
part of the VM image, but is specific to the VM implementation.


Hardware Description

The VM system must be UEFI compliant and therefore the UEFI system table
will provide a means to access hardware description data.

The VM implementation must provide through its UEFI implementation:

  a complete FDT which describes the entire VM system and will boot
  mainline kernels driven by device tree alone

For more information about the arm and arm64 boot conventions, see
Documentation/arm/Booting and Documentation/arm64/booting.txt in the
Linux kernel source tree.

For more information about UEFI booting, see [4] and [5].


VM Platform
---
The specification does not mandate any specific memory map.  The guest
OS must be able to enumerate all processing elements, devices, and
memory through HW description data (FDT) or a bus-specific
mechanism such as PCI.

If aarch64 physical CPUs implement support for the aarch32 execution
state in EL1 and EL0 execution, it is recommended that the VM
implementation supports booting the VM at EL1 in both aarch32 and
aarch64 execution states.

The virtual hardware platform must provide a number of mandatory
peripherals:

  Serial console:  The platform should provide a console,
  based on an emulated pl011, a virtio-console, or a Xen PV console.

  An ARM Generic Interrupt Controller v2 (GICv2) [3] or newer.  GICv2
  limits the number of virtual CPUs to 8 cores; newer GIC versions
  remove this limitation.

  The ARM virtual timer and counter should be available to the VM as
  per the ARM Generic Timers specification in the ARM ARM [1].

It is strongly recommended that the VM implementation provides a
hotpluggable bus to support hotplug of at least block and network
devices.  Suitable buses include a virtual PCIe bus and the Xen PV bus.

For the VM image to be compliant with this spec, the following applies
for the guest OS in the VM image:

  The guest OS must include support for pl011 UART, virtio-console, and
  the Xen PV console.

  The guest OS must include support for GICv2 and any available newer
  version of the GIC architecture 

Re: [RFC PATCH 1/4] ARM: KVM: on unhandled IO mem abort, route the call to the KVM MMIO bus

2014-03-28 Thread Christoffer Dall
On Thu, Mar 13, 2014 at 04:57:26PM +0100, Antonios Motakis wrote:
 On an unhandled IO memory abort, use the kvm_io_bus_* API in order to
 handle the MMIO access through any registered read/write callbacks. This
 is a dependency for eventfd support (ioeventfd and irqfd).
 
 However, accesses to the VGIC are still left implemented independently,
 since the kvm_io_bus_* API doesn't pass the VCPU pointer doing the access.
 
 Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com
 Signed-off-by: Nikolay Nikolaev n.nikol...@virtualopensystems.com
 ---
  arch/arm/kvm/mmio.c | 32 
  virt/kvm/arm/vgic.c |  5 -
  2 files changed, 36 insertions(+), 1 deletion(-)
 
 diff --git a/arch/arm/kvm/mmio.c b/arch/arm/kvm/mmio.c
 index 4cb5a93..1d17831 100644
 --- a/arch/arm/kvm/mmio.c
 +++ b/arch/arm/kvm/mmio.c
 @@ -162,6 +162,35 @@ static int decode_hsr(struct kvm_vcpu *vcpu, phys_addr_t 
 fault_ipa,
   return 0;
  }
  
 +/**
 + * kvm_handle_mmio - handle an in-kernel MMIO access
 + * @vcpu: pointer to the vcpu performing the access
 + * @run:  pointer to the kvm_run structure
 + * @mmio: pointer to the data describing the access
 + *
 + * returns true if the MMIO access has been performed in kernel space,
 + * and false if it needs to be emulated in user space.
 + */
 +static bool handle_kernel_mmio(struct kvm_vcpu *vcpu, struct kvm_run *run,
 + struct kvm_exit_mmio *mmio)
 +{
 + int ret;
 + if (mmio->is_write) {
 + ret = kvm_io_bus_write(vcpu->kvm, KVM_MMIO_BUS, mmio->phys_addr,
 + mmio->len, mmio->data);
 +
 + } else {
 + ret = kvm_io_bus_read(vcpu->kvm, KVM_MMIO_BUS, mmio->phys_addr,
 + mmio->len, mmio->data);
 + }
 + if (!ret) {
 + kvm_prepare_mmio(run, mmio);
 + kvm_handle_mmio_return(vcpu, run);
 + }
 +
 + return !ret;
 +}
 +
  int io_mem_abort(struct kvm_vcpu *vcpu, struct kvm_run *run,
phys_addr_t fault_ipa)
  {
 @@ -200,6 +229,9 @@ int io_mem_abort(struct kvm_vcpu *vcpu, struct kvm_run 
 *run,
   if (vgic_handle_mmio(vcpu, run, mmio))
   return 1;
  
 + if (handle_kernel_mmio(vcpu, run, mmio))
 + return 1;
 +

this special-casing of the vgic is now really terrible.  Is there
anything holding you back from doing the necessary restructure of the
kvm_io_bus_*() API instead?  That would allow us to get rid of the ugly
"Fix it!" comment in the vgic driver as well.

-Christoffer

   kvm_prepare_mmio(run, mmio);
   return 0;
  }
 diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
 index 8ca405c..afdecc3 100644
 --- a/virt/kvm/arm/vgic.c
 +++ b/virt/kvm/arm/vgic.c
 @@ -849,13 +849,16 @@ struct mmio_range *find_matching_range(const struct 
 mmio_range *ranges,
  }
  
  /**
 - * vgic_handle_mmio - handle an in-kernel MMIO access
 + * vgic_handle_mmio - handle an in-kernel vgic MMIO access
   * @vcpu: pointer to the vcpu performing the access
   * @run:  pointer to the kvm_run structure
   * @mmio: pointer to the data describing the access
   *
   * returns true if the MMIO access has been performed in kernel space,
   * and false if it needs to be emulated in user space.
 + *
 + * This is handled outside of kvm_handle_mmio because the kvm_io_bus only
 + * passes the VM pointer, while we need the VCPU performing the access.
   */
  bool vgic_handle_mmio(struct kvm_vcpu *vcpu, struct kvm_run *run,
 struct kvm_exit_mmio *mmio)
 -- 
 1.8.3.2
 


Re: huge 2nd stage pages and live migration

2014-03-28 Thread Christoffer Dall
On Fri, Mar 28, 2014 at 10:39:25AM -0700, Mario Smarduch wrote:
 Hello
 
 I've been working on live migration for ARM-KVM, and noticed a
 problem completing migration with huge 2nd stage tables.
  
 
 After write protecting the VM, for a write fault 512 page bits
 are set in dirty_bitmap[] to take into account future writes to
 the huge page. The pmd is write protected again when QEMU reads the
 dirty log, and the cycle repeats. With this not even an idle
 32MB VM completes live migration.
 
 If QEMU uses THPs, and 2nd stage tables use pte's, then there
 is no problem, live migration is quick. I'm assuming QEMU and Guest
 huge pages with 2nd stage page table pte's should work fine too.
 
 I'm wondering how this has been solved (for any architecture)? 
 
I don't know if there's a generic solution (have you looked at x86/PPC
what they do?), but I don't see any conceptual problem with putting the
VM into use-pte-for-2nd-stage-mappings-I'm-migrating-mode.

-Christoffer


Re: [PATCH] arm: KVM: fix possible misalignment of PGDs and bounce page

2014-03-28 Thread Christoffer Dall
On Fri, Mar 28, 2014 at 10:25:19AM -0400, Mark Salter wrote:
 The kvm/mmu code shared by arm and arm64 uses kmalloc() to allocate
 a bounce page (if hypervisor init code crosses a page boundary) and
 hypervisor PGDs. The problem is that kmalloc() does not guarantee
 the proper alignment. In the case of the bounce page, the page sized
 buffer allocated may also cross a page boundary negating the purpose
 and leading to a hang during kvm initialization. Likewise the PGDs
 allocated may not meet the minimum alignment requirements of the
 underlying MMU. This patch uses __get_free_page() to guarantee the
 worst case alignment needs of the bounce page and PGDs on both arm
 and arm64.
 
 Signed-off-by: Mark Salter msal...@redhat.com
 ---
  arch/arm/kvm/mmu.c | 15 +--
  1 file changed, 9 insertions(+), 6 deletions(-)
 
 diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
 index 7789857..575d790 100644
 --- a/arch/arm/kvm/mmu.c
 +++ b/arch/arm/kvm/mmu.c
 @@ -42,6 +42,8 @@ static unsigned long hyp_idmap_start;
  static unsigned long hyp_idmap_end;
  static phys_addr_t hyp_idmap_vector;
  
 +#define pgd_order get_order(PTRS_PER_PGD * sizeof(pgd_t))
 +
  #define kvm_pmd_huge(_x) (pmd_huge(_x) || pmd_trans_huge(_x))
  
  static void kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa)
 @@ -199,14 +201,14 @@ void free_boot_hyp_pgd(void)
   if (boot_hyp_pgd) {
   unmap_range(NULL, boot_hyp_pgd, hyp_idmap_start, PAGE_SIZE);
   unmap_range(NULL, boot_hyp_pgd, TRAMPOLINE_VA, PAGE_SIZE);
 - kfree(boot_hyp_pgd);
 + free_pages((unsigned long)boot_hyp_pgd, pgd_order);
   boot_hyp_pgd = NULL;
   }
  
   if (hyp_pgd)
   unmap_range(NULL, hyp_pgd, TRAMPOLINE_VA, PAGE_SIZE);
  
 - kfree(init_bounce_page);
 + free_page((unsigned long)init_bounce_page);
   init_bounce_page = NULL;
  
  mutex_unlock(&kvm_hyp_pgd_mutex);
 @@ -236,7 +238,7 @@ void free_hyp_pgds(void)
  for (addr = VMALLOC_START; is_vmalloc_addr((void*)addr); addr += PGDIR_SIZE)
  unmap_range(NULL, hyp_pgd, KERN_TO_HYP(addr), PGDIR_SIZE);
  
 - kfree(hyp_pgd);
 + free_pages((unsigned long)hyp_pgd, pgd_order);
   hyp_pgd = NULL;
   }
  
 @@ -930,7 +932,7 @@ int kvm_mmu_init(void)
   size_t len = __hyp_idmap_text_end - __hyp_idmap_text_start;
   phys_addr_t phys_base;
  
 - init_bounce_page = kmalloc(PAGE_SIZE, GFP_KERNEL);
 + init_bounce_page = (void *)__get_free_page(GFP_KERNEL);
   if (!init_bounce_page) {
  kvm_err("Couldn't allocate HYP init bounce page\n");
   err = -ENOMEM;
 @@ -956,8 +958,9 @@ int kvm_mmu_init(void)
(unsigned long)phys_base);
   }
  
 - hyp_pgd = kzalloc(PTRS_PER_PGD * sizeof(pgd_t), GFP_KERNEL);
 - boot_hyp_pgd = kzalloc(PTRS_PER_PGD * sizeof(pgd_t), GFP_KERNEL);
 + hyp_pgd = (pgd_t *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, pgd_order);
  + boot_hyp_pgd = (pgd_t *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, pgd_order);
 +
   if (!hyp_pgd || !boot_hyp_pgd) {
  kvm_err("Hyp mode PGD not allocated\n");
   err = -ENOMEM;
 -- 
 1.8.5.3
 
This looks right to me.  Funnily enough I seem to remember a discussion
from when we originally merged this code where someone (maybe me) argued
that kmalloc() would align to the size of the allocation, but I don't
see anything backing this up at this point.

So:

Acked-by: Christoffer Dall christoffer.d...@linaro.org

If Marc agrees I can queue this for -rc1.


Re: [PATCHv2 net] vhost: fix total length when packets are too short

2014-03-28 Thread David Miller
From: Michael S. Tsirkin m...@redhat.com
Date: Thu, 27 Mar 2014 12:00:26 +0200

 When mergeable buffers are disabled, and the
 incoming packet is too large for the rx buffer,
 get_rx_bufs returns success.
 
 This was intentional, in order to make recvmsg
 truncate the packet; handle_rx would then
 detect err != sock_len and drop it.
 
 Unfortunately we pass the original sock_len to
 recvmsg - which means we use parts of iov not fully
 validated.
 
 Fix this up by detecting this overrun and doing packet drop
 immediately.
 
 CVE-2014-0077
 
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 ---
 
 Changes from v1:
   Fix CVE# in the commit log.
   Patch is unchanged.
 
 Note: this is needed for -stable.

Applied and queued up for -stable.

 I wonder if this can still make the release.

I will try but no promises.


Re: [PATCH net] vhost: validate vhost_get_vq_desc return value

2014-03-28 Thread David Miller
From: Michael S. Tsirkin m...@redhat.com
Date: Thu, 27 Mar 2014 12:53:37 +0200

 vhost fails to validate negative error code
 from vhost_get_vq_desc causing
 a crash: we are using -EFAULT which is 0xfffffff2
 as vector size, which exceeds the allocated size.
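
(Illustrative aside, not part of the commit message: a negative errno
reinterpreted as an unsigned count is enormous, which is exactly the
failure mode described above.)

    /* sketch: why -EFAULT "exceeds the allocated size" */
    int r = -EFAULT;          /* -14 */
    unsigned int count = r;   /* becomes 0xfffffff2 = 4294967282 */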
 
 The code in question was introduced in commit
 8dd014adfea6f173c1ef6378f7e5e7924866c923
 vhost-net: mergeable buffers support
 
 CVE-2014-0055
 
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 ---
 
 This is needed in -stable.

Applied and queued up for -stable, thanks Michael.


Qemu v1.7.1 CentOS 6.4

2014-03-28 Thread Lane Eckley
Hi Everyone,

I am running into performance issues with Windows guest VM's in
conjunction with the rather old version of Qemu-KVM that is currently
being shipped with rhel 6.4 and as such I am looking to upgrade to the
latest stable release of qemu (v1.7.1 if I am not mistaken).

As it stands now I have been unsuccessful in locating a good
guide/tutorial on how to correctly update (or remove & install) v1.7.1
- Does anyone have a good tutorial on how to properly get it installed
under CentOS 6 (or rhel 6 in general)?

The reason for the upgrade is a high context switch rate when the
Windows virtual VM's are idle, causing the hypervisor machine to
utilize a lot more CPU than it should. Based on my Google skills I
have located several threads noting an upgrade to at least 0.12.4 should
help resolve the issue, however as it stands right now I really do not
have any confirmation of this.

As to CentOS, unfortunately the control panel system we are using
currently limits us to rhel 6 based distros and as such swapping to a
debian based distro or otherwise is not currently an option.

Any advice & feedback would be very much appreciated.

Thanks!

-Lane


Re: Qemu v1.7.1 CentOS 6.4

2014-03-28 Thread Brian Jackson
On 03/28/2014 03:55 PM, Lane Eckley wrote:
 Hi Everyone,

 I am running into performance issues with Windows guest VM's in
 conjunction with the rather old version of Qemu-KVM that is currently
 being shipped with rhel 6.4 and as such I am looking to upgrade to the
 latest stable release of qemu (v1.7.1 if I am not mistaken).

 As it stands now I have been unsuccessful in locating a good
 guide/tutorial on how to correctly update (or remove & install) v1.7.1
 - Does anyone have a good tutorial on how to properly get it installed
 under CentOS 6 (or rhel 6 in general)?

 The reason for the upgrade is a high context switch rate when the
 Windows virtual VM's are idle, causing the hypervisor machine to
 utilize a lot more CPU than it should. Based on my Google skills I
 have located several threads noting an upgrade to at least 0.12.4 should
 help resolve the issue, however as it stands right now I really do not
 have any confirmation of this.


RedHat very likely has backported any fixes and probably most
performance improvements from 0.12 to 1.7. I'd suggest upgrading to the
latest RHEL release and try to enable the hv-* options (hv-spinlocks,
hv-relaxed, hv-vapic, hv-time or whatever is supported by RHEL's kvm).
IIRC, there is also a document/page in the RHEL docs that talks
specifically about Windows guest performance.



 As to CentOS, unfortunately the control panel system we are using
 currently limits us to rhel 6 based distros and as such swapping to a
 debian based distro or otherwise is not currently an option.

 Any advice & feedback would be very much appreciated.

 Thanks!

 -Lane

