Re: [PATCH v6 11/15] nEPT: Add nEPT violation/misconfiguration support

2013-08-02 Thread Xiao Guangrong
On 08/01/2013 10:08 PM, Gleb Natapov wrote:
 From: Yang Zhang yang.z.zh...@intel.com
 
 Inject nEPT fault to L1 guest. This patch originally comes from Xinhao.

Reviewed-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 12/15] nEPT: MMU context for nested EPT

2013-08-02 Thread Xiao Guangrong
On 08/01/2013 10:08 PM, Gleb Natapov wrote:
 From: Nadav Har'El n...@il.ibm.com
 
 KVM's existing shadow MMU code already supports nested TDP. To use it, we
 need to set up a new MMU context for nested EPT, and create a few callbacks
 for it (nested_ept_*()). This context should also use the EPT versions of
 the page table access functions (defined in the previous patch).
 Then, we need to switch back and forth between this nested context and the
 regular MMU context when switching between L1 and L2 (when L1 runs this L2
 with EPT).

Reviewed-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com





Re: [PATCH 6/6 v2] kvm: powerpc: use caching attributes as per linux pte

2013-08-02 Thread “tiejun.chen”

On 08/01/2013 07:12 PM, Bharat Bhushan wrote:

KVM uses the same WIMG TLB attributes as the corresponding qemu pte.
For this we now search the Linux pte for the requested page and
take the caching/coherency attributes from that pte.

Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com
---
v1-v2
  - Use Linux pte for wimge rather than RAM/no-RAM mechanism

  arch/powerpc/include/asm/kvm_host.h |2 +-
  arch/powerpc/kvm/booke.c|2 +-
  arch/powerpc/kvm/e500.h |8 +---
  arch/powerpc/kvm/e500_mmu_host.c|   31 ++-
  4 files changed, 25 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 3328353..583d405 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -535,6 +535,7 @@ struct kvm_vcpu_arch {
  #endif
gpa_t paddr_accessed;
gva_t vaddr_accessed;
+   pgd_t *pgdir;

u8 io_gpr; /* GPR used as IO source/target */
u8 mmio_is_bigendian;
@@ -592,7 +593,6 @@ struct kvm_vcpu_arch {
struct list_head run_list;
struct task_struct *run_task;
struct kvm_run *kvm_run;
-   pgd_t *pgdir;

spinlock_t vpa_update_lock;
struct kvmppc_vpa vpa;
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 17722d8..eb2 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -697,7 +697,7 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct 
kvm_vcpu *vcpu)
  #endif

kvmppc_fix_ee_before_entry();
-
+   vcpu->arch.pgdir = current->mm->pgd;
ret = __kvmppc_vcpu_run(kvm_run, vcpu);

/* No need for kvm_guest_exit. It's done in handle_exit.
diff --git a/arch/powerpc/kvm/e500.h b/arch/powerpc/kvm/e500.h
index 4fd9650..fc4b2f6 100644
--- a/arch/powerpc/kvm/e500.h
+++ b/arch/powerpc/kvm/e500.h
@@ -31,11 +31,13 @@ enum vcpu_ftr {
  #define E500_TLB_NUM   2

  /* entry is mapped somewhere in host TLB */
-#define E500_TLB_VALID		(1 << 0)
+#define E500_TLB_VALID		(1 << 31)
  /* TLB1 entry is mapped by host TLB1, tracked by bitmaps */
-#define E500_TLB_BITMAP		(1 << 1)
+#define E500_TLB_BITMAP		(1 << 30)
  /* TLB1 entry is mapped by host TLB0 */
-#define E500_TLB_TLB0		(1 << 2)
+#define E500_TLB_TLB0		(1 << 29)
+/* Lower 5 bits have WIMGE value */
+#define E500_TLB_WIMGE_MASK	(0x1f)

  struct tlbe_ref {
pfn_t pfn;  /* valid only for TLB0, except briefly */
diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c
index 1c6a9d7..9b10b0b 100644
--- a/arch/powerpc/kvm/e500_mmu_host.c
+++ b/arch/powerpc/kvm/e500_mmu_host.c
@@ -64,15 +64,6 @@ static inline u32 e500_shadow_mas3_attrib(u32 mas3, int 
usermode)
return mas3;
  }

-static inline u32 e500_shadow_mas2_attrib(u32 mas2, int usermode)
-{
-#ifdef CONFIG_SMP
-	return (mas2 & MAS2_ATTRIB_MASK) | MAS2_M;
-#else
-	return mas2 & MAS2_ATTRIB_MASK;
-#endif
-}
-
  /*
   * writing shadow tlb entry to host TLB
   */
@@ -248,10 +239,12 @@ static inline int tlbe_is_writable(struct 
kvm_book3e_206_tlb_entry *tlbe)

  static inline void kvmppc_e500_ref_setup(struct tlbe_ref *ref,
 struct kvm_book3e_206_tlb_entry *gtlbe,
-pfn_t pfn)
+pfn_t pfn, int wimg)
  {
	ref->pfn = pfn;
	ref->flags |= E500_TLB_VALID;
+	/* Use guest supplied MAS2_G and MAS2_E */
+	ref->flags |= (gtlbe->mas2 & MAS2_ATTRIB_MASK) | wimg;

if (tlbe_is_writable(gtlbe))
kvm_set_pfn_dirty(pfn);
@@ -312,8 +305,7 @@ static void kvmppc_e500_setup_stlbe(

/* Force IPROT=0 for all guest mappings. */
	stlbe->mas1 = MAS1_TSIZE(tsize) | get_tlb_sts(gtlbe) | MAS1_VALID;
-	stlbe->mas2 = (gvaddr & MAS2_EPN) |
-		      e500_shadow_mas2_attrib(gtlbe->mas2, pr);
+	stlbe->mas2 = (gvaddr & MAS2_EPN) | (ref->flags & E500_TLB_WIMGE_MASK);
	stlbe->mas7_3 = ((u64)pfn << PAGE_SHIFT) |
		e500_shadow_mas3_attrib(gtlbe->mas7_3, pr);

@@ -332,6 +324,8 @@ static inline int kvmppc_e500_shadow_map(struct 
kvmppc_vcpu_e500 *vcpu_e500,
unsigned long hva;
int pfnmap = 0;
int tsize = BOOK3E_PAGESZ_4K;
+   pte_t pte;
+   int wimg = 0;

/*
 * Translate guest physical to true physical, acquiring
@@ -437,6 +431,8 @@ static inline int kvmppc_e500_shadow_map(struct 
kvmppc_vcpu_e500 *vcpu_e500,

if (likely(!pfnmap)) {
		unsigned long tsize_pages = 1 << (tsize + 10 - PAGE_SHIFT);
+   pgd_t *pgdir;
+
pfn = gfn_to_pfn_memslot(slot, gfn);
if (is_error_noslot_pfn(pfn)) {
			printk(KERN_ERR "Couldn't get real page for gfn %lx!\n",
@@ -447,9 +443,18 @@ static inline int kvmppc_e500_shadow_map(struct 
kvmppc_vcpu_e500 

Re: kernel 3.10.1 - NMI received for unknown reason

2013-08-02 Thread Stefan Pietsch
On 31.07.2013 11:20, Gleb Natapov wrote:
 On Wed, Jul 31, 2013 at 11:10:01AM +0200, Stefan Pietsch wrote:
 On 30.07.2013 07:31, Gleb Natapov wrote:

 What happens if you run perf on your host (perf record -a)?
 Do you see same NMI messages?

 It seems that perf record -a triggers some delayed NMI messages.
 They appear about 20 or 30 minutes after the command. This seems strange.
 Definitely strange. KVM guest is not running in parallel, correct? 20, 30
 minutes after perf stopped running or it is running all of the time?

No, the KVM guest is not running in parallel. But I'm not able to
clearly reproduce the NMI messages with perf record.
I start perf record -a and after some minutes I stop the recording.

After that it seems NMI messages appear within a random period of time.
So, I cannot tell what triggers the messages.



Re: [PATCH v6 14/15] nEPT: Some additional comments

2013-08-02 Thread Xiao Guangrong
On 08/01/2013 10:08 PM, Gleb Natapov wrote:
 From: Nadav Har'El n...@il.ibm.com
 
 Some additional comments to preexisting code:
 Explain who (L0 or L1) handles EPT violation and misconfiguration exits.
 Don't mention shadow on either EPT or shadow as the only two options.

Reviewed-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com



[Bug 60679] New: L2 can't boot up when creating L1 with '-cpu host' qemu option

2013-08-02 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=60679

Bug ID: 60679
   Summary: L2 can't boot up when creating L1 with '-cpu host'
qemu option
   Product: Virtualization
   Version: unspecified
Kernel Version: 3.11.0-RC1
  Hardware: All
OS: Linux
  Tree: Mainline
Status: NEW
  Severity: normal
  Priority: P1
 Component: kvm
  Assignee: virtualization_...@kernel-bugs.osdl.org
  Reporter: yongjie@intel.com
Regression: No

Created attachment 107077
  -- https://bugzilla.kernel.org/attachment.cgi?id=107077&action=edit
serial log of L1 guest

Environment:

Host OS (ia32/ia32e/IA64):ia32e
Guest OS (ia32/ia32e/IA64):ia32e
Guest OS Type (Linux/Windows):Linux
kvm.git next Commit:bf640876e21fe603f7f52b0c27d66b7716da0384
qemu-kvm uq/master Commit:0779caeb1a17f4d3ed14e2925b36ba09b084fb7b
Host Kernel Version:3.11.0-rc1
Hardware: SNB-EP


Bug detailed description:
--
Create an L1 guest with '-cpu host', then create an L2 guest; L1 hits a call
trace and L2 can't boot up.
Notes:
1. When the L1 guest is created with '-cpu qemu64,+vmx', the L2 guest works fine.
2. This should be a kvm bug.
kvm  + qemu-kvm   =  result
bf640876 + 0779caeb   =  bad
6d128e1e + 0779caeb   =  good

the first bad commit is:
commit 21feb4eb64e21f8dc91136b91ee886b978ce6421
Author: Arthur Chunqi Li yzt...@gmail.com
Date:   Mon Jul 15 16:04:08 2013 +0800

KVM: nVMX: Set segment infomation of L1 when L2 exits


Reproduce steps:

1. create L1 guest:
qemu-system-x86_64 -enable-kvm -m 4G -smp 4 -net nic,macaddr=00:12:46:09:13:56
-net tap,script=/etc/kvm/qemu-ifup nested-kvm.qcow
2. create L2 guest:
qemu-system-x86_64 -enable-kvm -m 1024 -smp 2 -net none rhel6u4.img

Current result:

L1 call trace; L2 can't boot up.


Expected result:

L1 and L2 work fine



Basic root-causing log:  (in L1 guest)
--
[   94.585378] BUG: unable to handle kernel NULL pointer dereference at
0034

[   94.586002] IP: [a010bb26] write_segment_descriptor+0x66/0xa0
[kvm]

[   94.586002] PGD 0 

[   94.586002] Oops:  [#1] SMP 

[   94.586002] Modules linked in: fuse nfsv3 nfs_acl nfsv4 auth_rpcgss nfs
fscache dns_resolver lockd sunrpc 8021q garp stp llc binfmt_misc uinput ppdev
parport_pc parport kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode
pcspkr e1000 i2c_piix4 floppy(F) cirrus(F) ttm(F) drm_kms_helper(F) drm(F)
i2c_core(F)

[   94.586002] CPU 0 

[   94.586002] Pid: 2132, comm: qemu-system-x86 Tainted: GF3.8.5 #4
Bochs Bochs

[   94.586002] RIP: 0010:[a010bb26]  [a010bb26]
write_segment_descriptor+0x66/0xa0 [kvm]

[   94.586002] RSP: 0018:880118d2bac8  EFLAGS: 00010246

[   94.586002] RAX:  RBX: 880106a79540 RCX:


[   94.586002] RDX: 1000 RSI: 0009 RDI:
00a0

[   94.586002] RBP: 880118d2baf8 R08: 0008 R09:
00a0

[   94.586002] R10:  R11:  R12:
880118d2bb38

[   94.586002] R13: 0001 R14: 0008 R15:
0008

[   94.586002] FS:  7f83d3bb5700() GS:88011fc0()
knlGS:

[   94.586002] CS:  0010 DS:  ES:  CR0: 80050033

[   94.586002] CR2: 0034 CR3: 0001069bf000 CR4:
001427f0

[   94.586002] DR0:  DR1:  DR2:


[   94.586002] DR3:  DR6: 0ff0 DR7:
0400

[   94.586002] Process qemu-system-x86 (pid: 2132, threadinfo 880118d2a000,
task 8801086e2e80)

[   94.586002] Stack:

[   94.586002]  91800027 880106a7 880106a79540
0001

[   94.586002]  0008 880118d2bb38 880118d2bb78
a010be5f

[   94.586002]  880106a79700 00089568 880106a78000
9188

[   94.586002] Call Trace:

[   94.586002]  [a010be5f] load_segment_descriptor+0x2ff/0x330 [kvm]

[   94.586002]  [a010d4c8] em_jmp_far+0x38/0x70 [kvm]

[   94.586002]  [a0108bd0] ? check_cr_read+0x40/0x40 [kvm]

[   94.586002]  [a010c211] x86_emulate_insn+0x261/0x1430 [kvm]

[   94.586002]  [a010d490] ? em_lldt+0x30/0x30 [kvm]

[   94.586002]  [a00f5ff8] x86_emulate_instruction+0x98/0x420 [kvm]

[   94.586002]  [a016a72d] vmx_handle_exit+0x20d/0x780 [kvm_intel]

[   94.586002]  [a0165dfc] ? vmx_vcpu_run+0x38c/0x5b0 [kvm_intel]

[   94.586002]  [a0110a28] ? kvm_apic_has_interrupt+0x28/0xd0 [kvm]

[   94.586002]  [a01621b0] ? vmx_invpcid_supported+0x20/0x20
[kvm_intel]

[   94.586002]  [a00f3982] kvm_arch_vcpu_ioctl_run+0x8c2/0x1140 [kvm]

[   94.586002]  [a00ef607] ? kvm_arch_vcpu_load+0x57/0x1e0 [kvm]

[   94.586002]  [a00dfcde] 

Re: [PATCH v6 01/15] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1

2013-08-02 Thread Jan Kiszka
On 2013-08-02 05:04, Zhang, Yang Z wrote:
 Gleb Natapov wrote on 2013-08-01:
 From: Nadav Har'El n...@il.ibm.com

 Recent KVM, since
 http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577
 switch the EFER MSR when EPT is used and the host and guest have different
 NX bits. So if we add support for nested EPT (L1 guest using EPT to run L2)
 and want to be able to run recent KVM as L1, we need to allow L1 to use this
 EFER switching feature.

 To do this EFER switching, KVM uses VM_ENTRY/EXIT_LOAD_IA32_EFER if
 available,
 and if it isn't, it uses the generic VM_ENTRY/EXIT_MSR_LOAD. This patch adds
 support for the former (the latter is still unsupported).

 Nested entry and exit emulation (prepare_vmcs_02 and
 load_vmcs12_host_state,
 respectively) already handled VM_ENTRY/EXIT_LOAD_IA32_EFER correctly. So
 all
 that's left to do in this patch is to properly advertise this feature to L1.

 Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are emulated by L0, by
 using
 vmx_set_efer (which itself sets one of several vmcs02 fields), so we always
 support this feature, regardless of whether the host supports it.

 Reviewed-by: Orit Wasserman owass...@redhat.com
 Signed-off-by: Nadav Har'El n...@il.ibm.com
 Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
 Signed-off-by: Xinhao Xu xinhao...@intel.com
 Signed-off-by: Yang Zhang yang.z.zh...@intel.com
 Signed-off-by: Gleb Natapov g...@redhat.com
 ---
  arch/x86/kvm/vmx.c |   23 ---
  1 file changed, 16 insertions(+), 7 deletions(-)

 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index e999dc7..27efa6a 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -2198,7 +2198,8 @@ static __init void
 nested_vmx_setup_ctls_msrs(void)
  #else
  nested_vmx_exit_ctls_high = 0;
  #endif
 -nested_vmx_exit_ctls_high |=
 VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR;
 +nested_vmx_exit_ctls_high |=
 (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
 +  VM_EXIT_LOAD_IA32_EFER);

  /* entry controls */
  rdmsr(MSR_IA32_VMX_ENTRY_CTLS,
 @@ -2207,8 +2208,8 @@ static __init void
 nested_vmx_setup_ctls_msrs(void)
  nested_vmx_entry_ctls_low =
 VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
  nested_vmx_entry_ctls_high =
  VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_IA32E_MODE;
 -nested_vmx_entry_ctls_high |=
 VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
 -
 +nested_vmx_entry_ctls_high |=
 (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR |
 +   VM_ENTRY_LOAD_IA32_EFER);
 Just saw it: we didn't expose bit 22 (save VMX-preemption timer value) in the
 VM-exit controls, but we already allowed the guest to set activate
 VMX-preemption timer in the pin-based VM-execution controls. This is wrong.

Does the presence of preemption timer support imply that saving its
value is also supported? Then we could demand this combination (ie. do
not expose preemption timer support at all to L1 if value saving is
missing) and build our preemption timer emulation on top.

There is more broken w.r.t. the VMX preemption timer; patches are welcome.
Arthur will also try to develop test cases for it. But that topic is
unrelated to this series.

Jan

 
  /* cpu-based controls */
  rdmsr(MSR_IA32_VMX_PROCBASED_CTLS,
  nested_vmx_procbased_ctls_low,
 nested_vmx_procbased_ctls_high);
 @@ -7529,10 +7530,18 @@ static void prepare_vmcs02(struct kvm_vcpu
 *vcpu, struct vmcs12 *vmcs12)
 	vcpu->arch.cr0_guest_owned_bits = ~vmcs12->cr0_guest_host_mask;
 	vmcs_writel(CR0_GUEST_HOST_MASK,
 ~vcpu->arch.cr0_guest_owned_bits);

 -/* Note: IA32_MODE, LOAD_IA32_EFER are modified by vmx_set_efer
 below */
 -vmcs_write32(VM_EXIT_CONTROLS,
 -vmcs12->vm_exit_controls | vmcs_config.vmexit_ctrl);
 -vmcs_write32(VM_ENTRY_CONTROLS, vmcs12->vm_entry_controls |
 +/* L2->L1 exit controls are emulated - the hardware exit is to L0 so
 + * we should use its exit controls. Note that VM_EXIT_LOAD_IA32_EFER
 + * bits are further modified by vmx_set_efer() below.
 + */
 +vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl);
 Should we mentioned that save vmx preemption bit must use host|guest, not 
 just host?
 
 +
 +/* vmcs12's VM_ENTRY_LOAD_IA32_EFER and VM_ENTRY_IA32E_MODE
 are
 + * emulated by vmx_set_efer(), below.
 + */
 +vmcs_write32(VM_ENTRY_CONTROLS,
 +(vmcs12->vm_entry_controls & ~VM_ENTRY_LOAD_IA32_EFER &
 +~VM_ENTRY_IA32E_MODE) |
  (vmcs_config.vmentry_ctrl & ~VM_ENTRY_IA32E_MODE));

  if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PAT)
 --
 1.7.10.4
 
 Best regards,
 Yang




signature.asc
Description: OpenPGP digital signature


Re: [PATCH 3/6 v2] kvm: powerpc: allow guest control G attribute in mas2

2013-08-02 Thread “tiejun.chen”

On 08/01/2013 07:12 PM, Bharat Bhushan wrote:

The G bit in MAS2 indicates whether the page is Guarded.
There is no reason to stop the guest from setting G, so allow it.


Could we merge patches 2 and 3 into one?

Tiejun



Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com
---
v1-v2
  - no change

  arch/powerpc/kvm/e500.h |2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/kvm/e500.h b/arch/powerpc/kvm/e500.h
index 277cb18..4fd9650 100644
--- a/arch/powerpc/kvm/e500.h
+++ b/arch/powerpc/kvm/e500.h
@@ -117,7 +117,7 @@ static inline struct kvmppc_vcpu_e500 *to_e500(struct 
kvm_vcpu *vcpu)
  #define E500_TLB_USER_PERM_MASK (MAS3_UX|MAS3_UR|MAS3_UW)
  #define E500_TLB_SUPER_PERM_MASK (MAS3_SX|MAS3_SR|MAS3_SW)
  #define MAS2_ATTRIB_MASK \
- (MAS2_X0 | MAS2_X1 | MAS2_E)
+ (MAS2_X0 | MAS2_X1 | MAS2_E | MAS2_G)
  #define MAS3_ATTRIB_MASK \
  (MAS3_U0 | MAS3_U1 | MAS3_U2 | MAS3_U3 \
   | E500_TLB_USER_PERM_MASK | E500_TLB_SUPER_PERM_MASK)





Re: [PATCH v6 15/15] nEPT: Miscelleneous cleanups

2013-08-02 Thread Xiao Guangrong
On 08/01/2013 10:08 PM, Gleb Natapov wrote:
 From: Nadav Har'El n...@il.ibm.com
 
 Some trivial code cleanups not really related to nested EPT.

Reviewed-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com



KVM Test report, kernel bf640876... qemu 0779caeb...

2013-08-02 Thread Ren, Yongjie
Hi All,

This is KVM upstream test result against kvm.git next branch and qemu-kvm.git 
uq/master branch.
 kvm.git next branch: bf640876e21fe603f7f52b0c27d66b7716da0384  based 
on kernel 3.11.0-rc1
 qemu-kvm.git uq/master branch: 0779caeb1a17f4d3ed14e2925b36ba09b084fb7b

We found two new bugs and no fixed bugs in the past two weeks.
As both of the new bugs were introduced by Arthur's patches, I have CCed him.
Arthur, can you have a look at the new bugs?

New issues (2):
1. 64bit RHEL6.4 guest crashes and reboots continuously
  https://bugs.launchpad.net/qemu-kvm/+bug/1207623
2. L2 can't boot up when creating L1 with '-cpu host' qemu option
  https://bugzilla.kernel.org/show_bug.cgi?id=60679

Fixed issue (0):

Old issues (9):
--
1. guest panic with parameter -cpu host in qemu command line (about vPMU 
issue).
  https://bugs.launchpad.net/qemu/+bug/994378
2. Can't install or boot up 32bit win8 guest.
  https://bugs.launchpad.net/qemu/+bug/1007269
3. vCPU hot-add makes the guest abort. 
  https://bugs.launchpad.net/qemu/+bug/1019179
4. Nested Virt: VMX can't be initialized in L1 Xen (Xen on KVM)
  https://bugzilla.kernel.org/show_bug.cgi?id=45931
5. Guest has no xsave feature with parameter -cpu qemu64,+xsave in qemu 
command line.
  https://bugs.launchpad.net/qemu/+bug/1042561
6. Guest hang when doing kernel build and writing data in guest.
  https://bugs.launchpad.net/qemu/+bug/1096814
7. with 'monitor pty', it needs to flush pts device after sending command to it 
  https://bugs.launchpad.net/qemu/+bug/1185228
8. [nested virt] L2 Windows guest can't boot up ('-cpu host' to start L1)
  https://bugzilla.kernel.org/show_bug.cgi?id=58921
9. [nested virt] L2 has NMI error when creating L1 with -cpu host parameter
  https://bugzilla.kernel.org/show_bug.cgi?id=58941


Test environment:
==
  Platform   IvyBridge-EP    Sandybridge-EP
  CPU Cores   32    32
  Memory size 64GB  32GB


Regards
   Yongjie Ren  (Jay)



RE: [PATCH v6 01/15] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1

2013-08-02 Thread Zhang, Yang Z
Jan Kiszka wrote on 2013-08-02:
 On 2013-08-02 05:04, Zhang, Yang Z wrote:
 Gleb Natapov wrote on 2013-08-01:
 From: Nadav Har'El n...@il.ibm.com
 
 Recent KVM, since
 http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577
 switch the EFER MSR when EPT is used and the host and guest have 
 different NX bits. So if we add support for nested EPT (L1 guest 
 using EPT to run L2) and want to be able to run recent KVM as L1, we 
 need to allow L1 to use this EFER switching feature.
 
 To do this EFER switching, KVM uses VM_ENTRY/EXIT_LOAD_IA32_EFER if 
 available, and if it isn't, it uses the generic 
 VM_ENTRY/EXIT_MSR_LOAD. This patch adds support for the former (the 
 latter is still unsupported).
 
 Nested entry and exit emulation (prepare_vmcs_02 and 
 load_vmcs12_host_state,
 respectively) already handled VM_ENTRY/EXIT_LOAD_IA32_EFER correctly.
 So all that's left to do in this patch is to properly advertise this 
 feature to L1.
 
 Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are emulated by L0, 
 by using vmx_set_efer (which itself sets one of several vmcs02 
 fields), so we always support this feature, regardless of whether 
 the host supports it.
 
 Reviewed-by: Orit Wasserman owass...@redhat.com
 Signed-off-by: Nadav Har'El n...@il.ibm.com
 Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
 Signed-off-by: Xinhao Xu xinhao...@intel.com
 Signed-off-by: Yang Zhang yang.z.zh...@intel.com
 Signed-off-by: Gleb Natapov g...@redhat.com
 ---
  arch/x86/kvm/vmx.c |   23 ---
  1 file changed, 16 insertions(+), 7 deletions(-) diff --git 
 a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index e999dc7..27efa6a 
 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -2198,7 +2198,8 @@ static __init void
 nested_vmx_setup_ctls_msrs(void)
  #else
 nested_vmx_exit_ctls_high = 0;
  #endif
 -   nested_vmx_exit_ctls_high |=
 VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR;
 +   nested_vmx_exit_ctls_high |=
 (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
 + VM_EXIT_LOAD_IA32_EFER);
 
 /* entry controls */
 rdmsr(MSR_IA32_VMX_ENTRY_CTLS,
 @@ -2207,8 +2208,8 @@ static __init void
 nested_vmx_setup_ctls_msrs(void)
 nested_vmx_entry_ctls_low = VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
 nested_vmx_entry_ctls_high =   VM_ENTRY_LOAD_IA32_PAT |
  VM_ENTRY_IA32E_MODE;
 -   nested_vmx_entry_ctls_high |=
 VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
 -
 +   nested_vmx_entry_ctls_high |=
 (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR |
 +  VM_ENTRY_LOAD_IA32_EFER);
 Just saw it: we didn't expose bit 22 (save VMX-preemption timer value)
 in the VM-exit controls, but we already allowed the guest to set activate
 VMX-preemption timer in the pin-based VM-execution controls. This is wrong.
 
 Does the presence of preemption timer support imply that saving its 
 value is also supported? Then we could demand this combination (ie. do 
 not expose preemption timer support at all to L1 if value saving is
 missing) and build our preemption timer emulation on top.
 
I don't see us saving the preemption timer value to vmcs12 in prepare_vmcs12().
Will it be saved automatically?

 There is more broken /wrt VMX preemption timer, patches are welcome.
 Arthur will also try to develop test cases for it. But that topic is 
 unrelated to this series.
 
 Jan
 
 
 /* cpu-based controls */
 rdmsr(MSR_IA32_VMX_PROCBASED_CTLS,
 nested_vmx_procbased_ctls_low,
 nested_vmx_procbased_ctls_high);
 @@ -7529,10 +7530,18 @@ static void prepare_vmcs02(struct kvm_vcpu 
 *vcpu, struct vmcs12 *vmcs12)
 	vcpu->arch.cr0_guest_owned_bits = ~vmcs12->cr0_guest_host_mask;
 	vmcs_writel(CR0_GUEST_HOST_MASK,
 ~vcpu->arch.cr0_guest_owned_bits);
 
 -   /* Note: IA32_MODE, LOAD_IA32_EFER are modified by vmx_set_efer
 below */
 -   vmcs_write32(VM_EXIT_CONTROLS,
 -   vmcs12->vm_exit_controls | vmcs_config.vmexit_ctrl);
 -   vmcs_write32(VM_ENTRY_CONTROLS, vmcs12->vm_entry_controls |
 +   /* L2->L1 exit controls are emulated - the hardware exit is to L0 so
 +* we should use its exit controls. Note that VM_EXIT_LOAD_IA32_EFER
 +* bits are further modified by vmx_set_efer() below.
 +*/
 +   vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl);
 Should we mentioned that save vmx preemption bit must use host|guest, 
 not just host?
 
 +
 +   /* vmcs12's VM_ENTRY_LOAD_IA32_EFER and VM_ENTRY_IA32E_MODE
 are
 +* emulated by vmx_set_efer(), below.
 +*/
 +   vmcs_write32(VM_ENTRY_CONTROLS,
 +   (vmcs12->vm_entry_controls & ~VM_ENTRY_LOAD_IA32_EFER &
 +   ~VM_ENTRY_IA32E_MODE) |
 	(vmcs_config.vmentry_ctrl & ~VM_ENTRY_IA32E_MODE));
 
 	if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PAT)
 --
 1.7.10.4
 
 Best regards,
 Yang



Best regards,
Yang




Re: [PATCH v6 01/15] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1

2013-08-02 Thread Jan Kiszka
On 2013-08-02 09:27, Zhang, Yang Z wrote:
 Jan Kiszka wrote on 2013-08-02:
 On 2013-08-02 05:04, Zhang, Yang Z wrote:
 Gleb Natapov wrote on 2013-08-01:
 From: Nadav Har'El n...@il.ibm.com

 Recent KVM, since
 http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577
 switch the EFER MSR when EPT is used and the host and guest have 
 different NX bits. So if we add support for nested EPT (L1 guest 
 using EPT to run L2) and want to be able to run recent KVM as L1, we 
 need to allow L1 to use this EFER switching feature.

 To do this EFER switching, KVM uses VM_ENTRY/EXIT_LOAD_IA32_EFER if 
 available, and if it isn't, it uses the generic 
 VM_ENTRY/EXIT_MSR_LOAD. This patch adds support for the former (the 
 latter is still unsupported).

 Nested entry and exit emulation (prepare_vmcs_02 and 
 load_vmcs12_host_state,
 respectively) already handled VM_ENTRY/EXIT_LOAD_IA32_EFER correctly.
 So all that's left to do in this patch is to properly advertise this 
 feature to L1.

 Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are emulated by L0, 
 by using vmx_set_efer (which itself sets one of several vmcs02 
 fields), so we always support this feature, regardless of whether 
 the host supports it.

 Reviewed-by: Orit Wasserman owass...@redhat.com
 Signed-off-by: Nadav Har'El n...@il.ibm.com
 Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
 Signed-off-by: Xinhao Xu xinhao...@intel.com
 Signed-off-by: Yang Zhang yang.z.zh...@intel.com
 Signed-off-by: Gleb Natapov g...@redhat.com
 ---
  arch/x86/kvm/vmx.c |   23 ---
  1 file changed, 16 insertions(+), 7 deletions(-) diff --git 
 a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index e999dc7..27efa6a 
 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -2198,7 +2198,8 @@ static __init void
 nested_vmx_setup_ctls_msrs(void)
  #else
nested_vmx_exit_ctls_high = 0;
  #endif
 -  nested_vmx_exit_ctls_high |=
 VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR;
 +  nested_vmx_exit_ctls_high |=
 (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
 +VM_EXIT_LOAD_IA32_EFER);

/* entry controls */
rdmsr(MSR_IA32_VMX_ENTRY_CTLS,
 @@ -2207,8 +2208,8 @@ static __init void
 nested_vmx_setup_ctls_msrs(void)
nested_vmx_entry_ctls_low = VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
nested_vmx_entry_ctls_high =   VM_ENTRY_LOAD_IA32_PAT |
  VM_ENTRY_IA32E_MODE;
 -  nested_vmx_entry_ctls_high |=
 VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
 -
 +  nested_vmx_entry_ctls_high |=
 (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR |
 + VM_ENTRY_LOAD_IA32_EFER);
  Just saw it: we didn't expose bit 22 (save VMX-preemption timer value)
  in the VM-exit controls, but we already allowed the guest to set activate
  VMX-preemption timer in the pin-based VM-execution controls. This is wrong.

 Does the presence of preemption timer support imply that saving its 
 value is also supported? Then we could demand this combination (ie. do 
 not expose preemption timer support at all to L1 if value saving is
 missing) and build our preemption timer emulation on top.

  I don't see us saving the preemption timer value to vmcs12 in
 prepare_vmcs12(). Will it be saved automatically?

No. As I said, there is more broken with our preemption timer emulation.

Jan

 
 There is more broken /wrt VMX preemption timer, patches are welcome.
 Arthur will also try to develop test cases for it. But that topic is 
 unrelated to this series.

 Jan


/* cpu-based controls */
rdmsr(MSR_IA32_VMX_PROCBASED_CTLS,
nested_vmx_procbased_ctls_low,
 nested_vmx_procbased_ctls_high);
 @@ -7529,10 +7530,18 @@ static void prepare_vmcs02(struct kvm_vcpu 
 *vcpu, struct vmcs12 *vmcs12)
	vcpu->arch.cr0_guest_owned_bits = ~vmcs12->cr0_guest_host_mask;
	vmcs_writel(CR0_GUEST_HOST_MASK,
 ~vcpu->arch.cr0_guest_owned_bits);

 -  /* Note: IA32_MODE, LOAD_IA32_EFER are modified by vmx_set_efer
 below */
 -  vmcs_write32(VM_EXIT_CONTROLS,
 -  vmcs12->vm_exit_controls | vmcs_config.vmexit_ctrl);
 -  vmcs_write32(VM_ENTRY_CONTROLS, vmcs12->vm_entry_controls |
 +  /* L2->L1 exit controls are emulated - the hardware exit is to L0 so
 +   * we should use its exit controls. Note that VM_EXIT_LOAD_IA32_EFER
 +   * bits are further modified by vmx_set_efer() below.
 +   */
 +  vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl);
 Should we mentioned that save vmx preemption bit must use host|guest, 
 not just host?

 +
 +  /* vmcs12's VM_ENTRY_LOAD_IA32_EFER and VM_ENTRY_IA32E_MODE
 are
 +   * emulated by vmx_set_efer(), below.
 +   */
 +  vmcs_write32(VM_ENTRY_CONTROLS,
 +  (vmcs12->vm_entry_controls & ~VM_ENTRY_LOAD_IA32_EFER &
 +  ~VM_ENTRY_IA32E_MODE) |
	(vmcs_config.vmentry_ctrl & ~VM_ENTRY_IA32E_MODE));
 
	if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PAT)
 --
 1.7.10.4

 Best regards,
 Yang

 
 
 Best regards,
 Yang
 
 




signature.asc
Description: OpenPGP digital signature


Re: [PATCH v6 10/15] nEPT: Nested INVEPT

2013-08-02 Thread Xiao Guangrong
On 08/01/2013 10:08 PM, Gleb Natapov wrote:

 +/* Emulate the INVEPT instruction */
 +static int handle_invept(struct kvm_vcpu *vcpu)
 +{
 + u32 vmx_instruction_info;
 + bool ok;
 + unsigned long type;
 + gva_t gva;
 + struct x86_exception e;
 + struct {
 + u64 eptp, gpa;
 + } operand;
 +
 + if (!(nested_vmx_secondary_ctls_high & SECONDARY_EXEC_ENABLE_EPT) ||
 + !(nested_vmx_ept_caps & VMX_EPT_INVEPT_BIT)) {
 + kvm_queue_exception(vcpu, UD_VECTOR);
 + return 1;
 + }
 +
 + if (!nested_vmx_check_permission(vcpu))
 + return 1;
 +
 + if (!kvm_read_cr0_bits(vcpu, X86_CR0_PE)) {
 + kvm_queue_exception(vcpu, UD_VECTOR);
 + return 1;
 + }
 +
 + /* According to the Intel VMX instruction reference, the memory
 +  * operand is read even if it isn't needed (e.g., for type==global)
 +  */
 + vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
 + if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION),
 + vmx_instruction_info, &gva))
 + return 1;
 + if (kvm_read_guest_virt(&vcpu->arch.emulate_ctxt, gva, &operand,
 + sizeof(operand), &e)) {
 + kvm_inject_page_fault(vcpu, &e);
 + return 1;
 + }
 +
 + type = kvm_register_read(vcpu, (vmx_instruction_info >> 28) & 0xf);
 +
 + switch (type) {
 + case VMX_EPT_EXTENT_GLOBAL:
 + case VMX_EPT_EXTENT_CONTEXT:
 + ok = !!(nested_vmx_ept_caps &
 + (1UL << (type + VMX_EPT_EXTENT_SHIFT)));
 + break;
 + default:
 + ok = false;
 + }
 +
 + if (ok) {
 + kvm_mmu_sync_roots(vcpu);
 + kvm_mmu_flush_tlb(vcpu);
 + nested_vmx_succeed(vcpu);
 + }
 + else
 + nested_vmx_failValid(vcpu,
 + VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
 +
 + skip_emulated_instruction(vcpu);
 + return 1;

Can this code be changed to:

switch (type) {
case VMX_EPT_EXTENT_GLOBAL:
case VMX_EPT_EXTENT_CONTEXT:
	if (nested_vmx_ept_caps &
	    (1UL << (type + VMX_EPT_EXTENT_SHIFT))) {
		kvm_mmu_sync_roots(vcpu);
		kvm_mmu_flush_tlb(vcpu);
		nested_vmx_succeed(vcpu);
		break;
	}
	/* fall through */
default:
	nested_vmx_failValid(vcpu,
		VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
}
?
That's clearer, I think.

Also, we can sync the shadow page table only if the current eptp is the required
eptp, that means:

	if (type == GLOBAL || operand.eptp == nested_ept_get_cr3(vcpu)) {
		kvm_mmu_sync_roots(vcpu);
		..
	}
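The capability test both versions of this switch rely on is a single bit per extent type in the EPT capability MSR. A minimal user-space model of just that check (constant values assumed from arch/x86/include/asm/vmx.h; this is a sketch, not the KVM code itself):

```c
#include <assert.h>
#include <stdint.h>

/* Values assumed from arch/x86/include/asm/vmx.h */
#define VMX_EPT_EXTENT_CONTEXT  1
#define VMX_EPT_EXTENT_GLOBAL   2
#define VMX_EPT_EXTENT_SHIFT    24

/* The capability MSR advertises each supported INVEPT extent at
 * bit (type + VMX_EPT_EXTENT_SHIFT); any other type is invalid. */
static int invept_type_supported(uint64_t ept_caps, unsigned long type)
{
	switch (type) {
	case VMX_EPT_EXTENT_GLOBAL:
	case VMX_EPT_EXTENT_CONTEXT:
		return !!(ept_caps & (1UL << (type + VMX_EPT_EXTENT_SHIFT)));
	default:
		return 0;
	}
}
```

With only the global-extent bit (bit 26) set in the capabilities, a context invalidation request would take the failValid path discussed above.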

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 13/15] nEPT: Advertise EPT to L1

2013-08-02 Thread Xiao Guangrong
On 08/01/2013 10:08 PM, Gleb Natapov wrote:
 From: Nadav Har'El n...@il.ibm.com
 
 Advertise the support of EPT to the L1 guest, through the appropriate MSR.
 
 This is the last patch of the basic Nested EPT feature, so as to allow
 bisection through this patch series: The guest will not see EPT support until
 this last patch, and will not attempt to use the half-applied feature.

Reviewed-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com



Re: [PATCH RFC V11 15/18] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor

2013-08-02 Thread Ingo Molnar

* Raghavendra K T raghavendra...@linux.vnet.ibm.com wrote:

 On 08/01/2013 02:34 PM, Raghavendra K T wrote:
 On 08/01/2013 01:15 PM, Gleb Natapov wrote:
 Shall I consider this as an ack for kvm part?
 
 For everything except 18/18. For that I still want to see numbers. But
 18/18 is pretty independent from the reset of the series so it should
 not stop the reset from going in.
 
 Yes. agreed.
 I am going to evaluate patch 18 separately and come with results for
 that. Now we can consider only 1-17 patches.
 
 
 Gleb,
 
 32 core machine with HT off 32 vcpu guests.
 base = 3.11-rc + patch 1 -17 pvspinlock_v11
 patched = base + patch 18
 
 +----+------------+----------+------------+----------+--------------+
          dbench  (Throughput in MB/sec, higher is better)
 +----+------------+----------+------------+----------+--------------+
          base        stdev     patched      stdev     %improvement
 +----+------------+----------+------------+----------+--------------+
  1x   14584.3800   146.9074   14705.1000   163.1060      0.82773
  2x    1713.7300    32.8750    1717.3200    45.5979      0.20948
  3x     967.8212    42.0257     971.8855    18.8532      0.41994
  4x     685.2764    25.7150     694.5881     8.3907      1.35882
 +----+------------+----------+------------+----------+--------------+

Please list stddev in percentage as well ...

a blind stab gave me these figures:

         base      stdev    patched    stdev    %improvement
  3x     967.8212  4.3%     971.8855   1.8%     0.4

That makes the improvement an order of magnitude smaller than the noise of 
the measurement ... i.e. totally inconclusive.
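The conversion being asked for here is just the relative standard deviation; a trivial sketch:

```c
#include <assert.h>

/* Relative standard deviation in percent: 100 * stdev / mean. */
static double stdev_pct(double mean, double stdev)
{
	return 100.0 * stdev / mean;
}
```

For the 3x row this gives roughly 4.3% (base) and 1.9% (patched) of noise against a 0.4% improvement, which is the mismatch being pointed out.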

Also please cut the excessive decimal points: with 2-4% noise what point 
is there in 5 decimal point results??

Thanks,

Ingo


Re: [PATCH v6 02/15] nEPT: Fix cr3 handling in nested exit and entry

2013-08-02 Thread Xiao Guangrong
On 08/01/2013 10:08 PM, Gleb Natapov wrote:
 From: Nadav Har'El n...@il.ibm.com
 
 The existing code for handling cr3 and related VMCS fields during nested
 exit and entry wasn't correct in all cases:
 
 If L2 is allowed to control cr3 (and this is indeed the case in nested EPT),
 during nested exit we must copy the modified cr3 from vmcs02 to vmcs12, and
 we forgot to do so. This patch adds this copy.
 
 If L0 isn't controlling cr3 when running L2 (i.e., L0 is using EPT), and
 whoever does control cr3 (L1 or L2) is using PAE, the processor might have
 saved PDPTEs and we should also save them in vmcs12 (and restore later).

Reviewed-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com



Re: [PATCH RFC V11 15/18] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor

2013-08-02 Thread Ingo Molnar

* Raghavendra K T raghavendra...@linux.vnet.ibm.com wrote:

 On 07/31/2013 11:54 AM, Gleb Natapov wrote:
 On Tue, Jul 30, 2013 at 10:13:12PM +0530, Raghavendra K T wrote:
 On 07/25/2013 03:08 PM, Raghavendra K T wrote:
 On 07/25/2013 02:45 PM, Gleb Natapov wrote:
 On Thu, Jul 25, 2013 at 02:47:37PM +0530, Raghavendra K T wrote:
 On 07/24/2013 06:06 PM, Raghavendra K T wrote:
 On 07/24/2013 05:36 PM, Gleb Natapov wrote:
 On Wed, Jul 24, 2013 at 05:30:20PM +0530, Raghavendra K T wrote:
 On 07/24/2013 04:09 PM, Gleb Natapov wrote:
 On Wed, Jul 24, 2013 at 03:15:50PM +0530, Raghavendra K T wrote:
 On 07/23/2013 08:37 PM, Gleb Natapov wrote:
 On Mon, Jul 22, 2013 at 11:50:16AM +0530, Raghavendra K T wrote:
 +static void kvm_lock_spinning(struct arch_spinlock *lock,
 __ticket_t want)
 [...]

[ a few hundred lines of unnecessary quotation snipped. ]

 Ingo,
 
 Do you have any concerns reg this series? please let me know if this 
 looks good now to you.

I'm inclined to NAK it for excessive quotation - who knows how many people 
left the discussion in disgust? Was it done to drive away as many 
reviewers as possible?

Anyway, see my other reply, the measurement results seem hard to interpret 
and inconclusive at the moment.

Thanks,

Ingo


Re: [PATCH RFC V11 15/18] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor

2013-08-02 Thread Raghavendra K T

On 08/02/2013 02:53 PM, Ingo Molnar wrote:


* Raghavendra K T raghavendra...@linux.vnet.ibm.com wrote:


On 08/01/2013 02:34 PM, Raghavendra K T wrote:

On 08/01/2013 01:15 PM, Gleb Natapov wrote:

Shall I consider this as an ack for kvm part?


For everything except 18/18. For that I still want to see numbers. But
18/18 is pretty independent from the reset of the series so it should
not stop the reset from going in.


Yes. agreed.
I am going to evaluate patch 18 separately and come with results for
that. Now we can consider only 1-17 patches.



Gleb,

32 core machine with HT off 32 vcpu guests.
base = 3.11-rc + patch 1 -17 pvspinlock_v11
patched = base + patch 18

+----+------------+----------+------------+----------+--------------+
         dbench  (Throughput in MB/sec, higher is better)
+----+------------+----------+------------+----------+--------------+
         base        stdev     patched      stdev     %improvement
+----+------------+----------+------------+----------+--------------+
 1x   14584.3800   146.9074   14705.1000   163.1060      0.82773
 2x    1713.7300    32.8750    1717.3200    45.5979      0.20948
 3x     967.8212    42.0257     971.8855    18.8532      0.41994
 4x     685.2764    25.7150     694.5881     8.3907      1.35882
+----+------------+----------+------------+----------+--------------+


Please list stddev in percentage as well ...


Sure. will do this from next time.



a blind stab gave me these figures:


        base      stdev    patched    stdev    %improvement
 3x     967.8212  4.3%     971.8855   1.8%     0.4


That makes the improvement an order of magnitude smaller than the noise of
the measurement ... i.e. totally inconclusive.


Okay. agreed.

I had always seen a positive effect from the patch, since it uses the PLE
handler heuristics and thus avoids the directed yield to vcpus in the
halt handler. But the current results clearly do not support that
conclusion. :(

So please drop patch 18 for now.



Also please cut the excessive decimal points: with 2-4% noise what point
is there in 5 decimal point results??


Yes.

Ingo, do you think now the patch series (patch 1 to 17) are in good
shape? or please let me know if you have any concerns to be
addressed.





Re: [PATCH RFC V11 15/18] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor

2013-08-02 Thread Gleb Natapov
On Fri, Aug 02, 2013 at 11:25:39AM +0200, Ingo Molnar wrote:
  Ingo,
  
  Do you have any concerns reg this series? please let me know if this 
  looks good now to you.
 
 I'm inclined to NAK it for excessive quotation - who knows how many people 
 left the discussion in disgust? Was it done to drive away as many 
 reviewers as possible?
 
 Anyway, see my other reply, the measurement results seem hard to interpret 
 and inconclusive at the moment.
 
That result was only for patch 18 of the series, not pvspinlock in
general.

--
Gleb.


Re: [PATCH v6 10/15] nEPT: Nested INVEPT

2013-08-02 Thread Gleb Natapov
On Fri, Aug 02, 2013 at 04:06:00PM +0800, Xiao Guangrong wrote:
 On 08/01/2013 10:08 PM, Gleb Natapov wrote:
 
  +/* Emulate the INVEPT instruction */
  +static int handle_invept(struct kvm_vcpu *vcpu)
  +{
  +   u32 vmx_instruction_info;
  +   bool ok;
  +   unsigned long type;
  +   gva_t gva;
  +   struct x86_exception e;
  +   struct {
  +   u64 eptp, gpa;
  +   } operand;
  +
  +   if (!(nested_vmx_secondary_ctls_high & SECONDARY_EXEC_ENABLE_EPT) ||
  +   !(nested_vmx_ept_caps & VMX_EPT_INVEPT_BIT)) {
  +   kvm_queue_exception(vcpu, UD_VECTOR);
  +   return 1;
  +   }
  +
  +   if (!nested_vmx_check_permission(vcpu))
  +   return 1;
  +
  +   if (!kvm_read_cr0_bits(vcpu, X86_CR0_PE)) {
  +   kvm_queue_exception(vcpu, UD_VECTOR);
  +   return 1;
  +   }
  +
  +   /* According to the Intel VMX instruction reference, the memory
  +* operand is read even if it isn't needed (e.g., for type==global)
  +*/
  +   vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
  +   if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION),
  +   vmx_instruction_info, &gva))
  +   return 1;
  +   if (kvm_read_guest_virt(&vcpu->arch.emulate_ctxt, gva, &operand,
  +   sizeof(operand), &e)) {
  +   kvm_inject_page_fault(vcpu, &e);
  +   return 1;
  +   }
  +
  +   type = kvm_register_read(vcpu, (vmx_instruction_info >> 28) & 0xf);
  +
  +   switch (type) {
  +   case VMX_EPT_EXTENT_GLOBAL:
  +   case VMX_EPT_EXTENT_CONTEXT:
  +   ok = !!(nested_vmx_ept_caps &
  +   (1UL << (type + VMX_EPT_EXTENT_SHIFT)));
  +   break;
  +   default:
  +   ok = false;
  +   }
  +
  +   if (ok) {
  +   kvm_mmu_sync_roots(vcpu);
  +   kvm_mmu_flush_tlb(vcpu);
  +   nested_vmx_succeed(vcpu);
  +   }
  +   else
  +   nested_vmx_failValid(vcpu,
  +   VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
  +
  +   skip_emulated_instruction(vcpu);
  +   return 1;
 
 Can this code be changed to:
 
   switch (type) {
   case VMX_EPT_EXTENT_GLOBAL:
   case VMX_EPT_EXTENT_CONTEXT:
    if (nested_vmx_ept_caps &
    (1UL << (type + VMX_EPT_EXTENT_SHIFT))) {
   kvm_mmu_sync_roots(vcpu);
   kvm_mmu_flush_tlb(vcpu);
   nested_vmx_succeed(vcpu);
   break;
   }
   default:
   nested_vmx_failValid(vcpu,
   VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
   }
 ?
 That's clearer, I think.
 
 Also, we can sync the shadow page table only if the current eptp is the
 required eptp, that means:
 
 if (type == GLOBAL || operand.eptp == nested_ept_get_cr3(vcpu)) {
   kvm_mmu_sync_roots(vcpu);
   ..
 }
Good point. Will do.

--
Gleb.


[Bug 60620] guest loses frequently (multiple times per day!) connectivity to network device

2013-08-02 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=60620

--- Comment #7 from Folkert van Heusden folk...@vanheusden.com ---
Moments ago another (!) guest on this system lost its connectivity.
I did a tcpdump on its network interface from within the host/hypervisor/dom0
and saw that traffic goes in but nothing comes out - e.g. no traffic from the
client comes out of the 'vnet1' interface. While I ran the tcpdump I also ran a
ping from within that guest which indeed showed only timeouts.
As there's _no_ firewall on this guest, we can rule out the possibility that it
is a firewall issue.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.


Re: [PATCH RFC V11 15/18] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor

2013-08-02 Thread Raghavendra K T

On 08/02/2013 03:24 PM, Gleb Natapov wrote:

On Fri, Aug 02, 2013 at 11:25:39AM +0200, Ingo Molnar wrote:

Ingo,

Do you have any concerns reg this series? please let me know if this
looks good now to you.


I'm inclined to NAK it for excessive quotation - who knows how many people
left the discussion in disgust? Was it done to drive away as many
reviewers as possible?


Ingo, Peter,
Sorry for the confusion caused because of nesting. I should have trimmed 
it as Peter also pointed in other thread.

I will ensure that in future mails.



Anyway, see my other reply, the measurement results seem hard to interpret
and inconclusive at the moment.


As Gleb already pointed out, patches 1-17 as a whole give a good
performance improvement. It was only patch 18 that Gleb had concerns about.




[Bug 60620] guest loses frequently (multiple times per day!) connectivity to network device

2013-08-02 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=60620

--- Comment #8 from Folkert van Heusden folk...@vanheusden.com ---
And now the firewall lost connectivity again.
- I removed all iptables rules and set the default to ACCEPT
- from within the guest I started tcpdump
- from the host I started ping
- from the guest I see the pings coming in but no ping reply goes out! the same
happened on the other guest that had this problem 10 minutes ago
- after a while the host starts doing arp requests and also does not get any
replies from it



Re: FAQ on linux-kvm.org has broken link

2013-08-02 Thread folkert
Hi,

 If the result is #2, check firewalls on host and guest.  Also try the
 following inside the guest: disable the network interface, rmmod
 virtio_net, modprobe virtio_net again, and bring the network up.

I pinged, I sniffed, I updated the bug report (it also happens with
other guests now!).

And the bring down interfaces / rmmod / modprobe / ifup works!
So I think something is wrong with virtio_net!

What shall I do now?


Folkert van Heusden

-- 
Nagios user? Check out CoffeeSaint - the versatile Nagios status
viewer! http://www.vanheusden.com/java/CoffeeSaint/
--
Phone: +31-6-41278122, PGP-key: 1F28D8AE, www.vanheusden.com


Re: KVM Test report, kernel bf640876... qemu 0779caeb...

2013-08-02 Thread Gleb Natapov
On Fri, Aug 02, 2013 at 07:19:28AM +, Ren, Yongjie wrote:
 Hi All,
 
 This is KVM upstream test result against kvm.git next branch and qemu-kvm.git 
 uq/master branch.
  kvm.git next branch: bf640876e21fe603f7f52b0c27d66b7716da0384  based 
 on kernel 3.11.0-rc1
  qemu-kvm.git uq/master branch: 
 0779caeb1a17f4d3ed14e2925b36ba09b084fb7b
 
 We found two new bugs and no fixed bug in the past two weeks. 
 As both of the new bugs were introduced by Arthur, I also CCed him.
 Arthur, can you have a look at the new bugs?
 
 New issues (2):
 1. 64bit RHEL6.4 guest crashes and reboots continuously
   https://bugs.launchpad.net/qemu-kvm/+bug/1207623
Weird. Can you ftrace it?

--
Gleb.


Re: [Bug 60620] guest loses frequently (multiple times per day!) connectivity to network device

2013-08-02 Thread Stefan Hajnoczi
On Fri, Aug 02, 2013 at 11:28:45AM +, bugzilla-dae...@bugzilla.kernel.org 
wrote:
 https://bugzilla.kernel.org/show_bug.cgi?id=60620
 
 --- Comment #9 from Folkert van Heusden folk...@vanheusden.com ---
 Good news!
 If I
 
 - bring down all interfaces in the guest (ifdown eth0...)
 - rmmod virtio_net
 - modprobe virtio_net
 - bring up the interfaces again
 + it all works again!
 
 So hopefully this helps the bug hunt?

Hi Folkert,
Please post the QEMU command-line on the host (ps aux | grep qemu) and
the output of lsmod | grep vhost_net.

Since reinitializing the guest driver fixes the issue, we now need to
find out whether the guest or the host side got stuck.

I think I asked before, but please also post any relevant lines from
dmesg on the host and from the guest.  Examples would include networking
error messages, kernel backtraces, or out-of-memory errors.

Thanks,
Stefan


Re: [Qemu-devel] KVM call agenda for 2013-08-06

2013-08-02 Thread Eduardo Habkost
On Tue, Jul 23, 2013 at 06:31:45PM +0200, Juan Quintela wrote:
 
 Hi
 
 Please, send any topic that you are interested in covering.

* libvirt requirements for:
   * Checking which CPU features are exposed/required by each CPU model
 + machine-type
   * Checking which CPU features are available on a given host
 (considering QEMU + kernel + host CPU capabilities)

I hope Jiri and Daniel can join the call.

-- 
Eduardo


Re: [PATCH 03/12] KVM: MMU: lazily drop large spte

2013-08-02 Thread Marcelo Tosatti
On Tue, Jul 30, 2013 at 09:02:01PM +0800, Xiao Guangrong wrote:
 Currently, kvm zaps the large spte if write-protected is needed, the later
 read can fault on that spte. Actually, we can make the large spte readonly
 instead of making them un-present, the page fault caused by read access can
 be avoided
 
 The idea is from Avi:
 | As I mentioned before, write-protecting a large spte is a good idea,
 | since it moves some work from protect-time to fault-time, so it reduces
 | jitter.  This removes the need for the return value.
 
 [
   It has fixed the issue reported in 6b73a9606 by stopping fast page fault
   marking the large spte to writable
 ]

Xiao,

Can you please write a comment explaining why are the problems 
with shadow vs large read-only sptes (can't recall anymore),
and then why it is now safe to do it.

Comments below.

 Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
 ---
  arch/x86/kvm/mmu.c | 36 +---
  1 file changed, 17 insertions(+), 19 deletions(-)
 
 diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
 index cf163ca..35d4b50 100644
 --- a/arch/x86/kvm/mmu.c
 +++ b/arch/x86/kvm/mmu.c
 @@ -1181,8 +1181,7 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 
 *sptep)
  
  /*
   * Write-protect on the specified @sptep, @pt_protect indicates whether
 - * spte writ-protection is caused by protecting shadow page table.
 - * @flush indicates whether tlb need be flushed.
 + * spte write-protection is caused by protecting shadow page table.
   *
   * Note: write protection is difference between drity logging and spte
   * protection:
 @@ -1191,10 +1190,9 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 
 *sptep)
   * - for spte protection, the spte can be writable only after unsync-ing
   *   shadow page.
   *
 - * Return true if the spte is dropped.
 + * Return true if tlb need be flushed.
   */
 -static bool
 -spte_write_protect(struct kvm *kvm, u64 *sptep, bool *flush, bool pt_protect)
 +static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool pt_protect)
  {
   u64 spte = *sptep;
  
 @@ -1204,17 +1202,11 @@ spte_write_protect(struct kvm *kvm, u64 *sptep, bool 
 *flush, bool pt_protect)
  
   rmap_printk(rmap_write_protect: spte %p %llx\n, sptep, *sptep);
  
 - if (__drop_large_spte(kvm, sptep)) {
 - *flush |= true;
 - return true;
 - }
 -
   if (pt_protect)
  		spte &= ~SPTE_MMU_WRITEABLE;
  	spte = spte & ~PT_WRITABLE_MASK;
  
 - *flush |= mmu_spte_update(sptep, spte);
 - return false;
 + return mmu_spte_update(sptep, spte);
  }
  
  static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
 @@ -1226,11 +1218,8 @@ static bool __rmap_write_protect(struct kvm *kvm, 
 unsigned long *rmapp,
  
   for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
   BUG_ON(!(*sptep & PT_PRESENT_MASK));
  - if (spte_write_protect(kvm, sptep, &flush, pt_protect)) {
  - sptep = rmap_get_first(*rmapp, &iter);
  - continue;
  - }
   
  + flush |= spte_write_protect(kvm, sptep, pt_protect);
   sptep = rmap_get_next(&iter);
   }
  
 @@ -2701,6 +2690,8 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, 
 int write,
   break;
   }
  
 + drop_large_spte(vcpu, iterator.sptep);
 +
   if (!is_shadow_present_pte(*iterator.sptep)) {
   u64 base_addr = iterator.addr;
  
 @@ -2855,7 +2846,7 @@ fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct 
 kvm_mmu_page *sp,
   * - false: let the real page fault path to fix it.
   */
  static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
 - u32 error_code)
 + u32 error_code, bool force_pt_level)
  {
   struct kvm_shadow_walk_iterator iterator;
   struct kvm_mmu_page *sp;
 @@ -2884,6 +2875,13 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, 
 gva_t gva, int level,
   goto exit;
  
   /*
 +  * Can not map the large spte to writable if the page is dirty
 +  * logged.
 +  */
  + if (sp->role.level > PT_PAGE_TABLE_LEVEL && force_pt_level)
 + goto exit;
 +

It is not safe to derive slot->dirty_bitmap like this: 
since dirty log is enabled via RCU update, the "is dirty bitmap enabled"
info could be stale by the time you check it here via the parameter,
so you can instantiate a large spte (because force_pt_level == false),
while you should not.
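For reference, the patched spte_write_protect() in the hunk above reduces to clearing one or two writable bits and reporting whether the spte changed, i.e. whether a TLB flush is needed. A user-space model, with bit positions assumed purely for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions assumed for illustration only; the real masks live in mmu.c. */
#define PT_WRITABLE_MASK	(1ULL << 1)
#define SPTE_MMU_WRITEABLE	(1ULL << 59)

/* Model of the patched spte_write_protect(): clear the writable bit(s). */
static uint64_t write_protect_spte(uint64_t spte, int pt_protect)
{
	if (pt_protect)
		spte &= ~SPTE_MMU_WRITEABLE;
	return spte & ~PT_WRITABLE_MASK;
}

/* True when the spte changed, standing in for mmu_spte_update()'s
 * "tlb needs to be flushed" return value. */
static int wp_needs_flush(uint64_t spte, int pt_protect)
{
	return write_protect_spte(spte, pt_protect) != spte;
}
```

An already write-protected spte reports no flush needed, which is why the loop in __rmap_write_protect() can simply accumulate the return values.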




Re: FAQ on linux-kvm.org has broken link

2013-08-02 Thread Stefan Hajnoczi
On Fri, Aug 2, 2013 at 1:37 PM, folkert folk...@vanheusden.com wrote:
 If the result is #2, check firewalls on host and guest.  Also try the
 following inside the guest: disable the network interface, rmmod
 virtio_net, modprobe virtio_net again, and bring the network up.

 I pinged, I sniffed, I updated the bug report (it also happens with
 other guests now!).

 And the bring down interfaces / rmmod / modprobe / ifup works!
 So I think something is wrong with virtio_net!

 What shall I do now?

Hi Folkert,
I wrote a reply earlier today but it was rejected because I do not have a
kernel.org bugzilla account.  If you don't mind let's continue
discussing on this mailing list - we don't know whether this is a
kernel bug yet anyway.

A couple of questions:

Please post the QEMU command-line from the host (ps aux | grep qemu).

Please confirm that vhost_net is being used on the host (lsmod | grep
vhost_net).

Please double-check both guest and host dmesg for any suspicious
messages.  It could be about networking, out-of-memory, or kernel
backtraces.

Stefan


Re: [PATCH 03/12] KVM: MMU: lazily drop large spte

2013-08-02 Thread Xiao Guangrong

On Aug 2, 2013, at 10:55 PM, Marcelo Tosatti mtosa...@redhat.com wrote:

 On Tue, Jul 30, 2013 at 09:02:01PM +0800, Xiao Guangrong wrote:
 Currently, kvm zaps the large spte if write-protected is needed, the later
 read can fault on that spte. Actually, we can make the large spte readonly
 instead of making them un-present, the page fault caused by read access can
 be avoided
 
 The idea is from Avi:
 | As I mentioned before, write-protecting a large spte is a good idea,
 | since it moves some work from protect-time to fault-time, so it reduces
 | jitter.  This removes the need for the return value.
 
 [
  It has fixed the issue reported in 6b73a9606 by stopping fast page fault
  marking the large spte to writable
 ]
 
 Xiao,
 
 Can you please write a comment explaining why are the problems 
 with shadow vs large read-only sptes (can't recall anymore),
 and then why it is now safe to do it.

Hi Marcelo,

Thanks for your review.  Yes, the bug reported in 6b73a9606 is: in this patch,
we mark the large spte as readonly when the pages are dirty logged, and the
readonly spte can be set to writable by fast page fault; but on that path it
failed to check dirty logging, so it will set the large spte to writable but
only set the first page in the dirty bitmap.

For example:

1): KVM maps 0 ~ 2M memory to guest which is pointed by SPTE and SPTE
 is writable.

2): KVM dirty log 0 ~ 2M,  then set SPTE to readonly

3): fast page fault set SPTE to writable and set page 0 to the dirty bitmap.

Then 4K ~ 2M memory is not dirty logged.
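The arithmetic behind this example: one 2M large mapping covers 512 4K pages, so marking only page 0 leaves 511 pages untracked (a sketch of the accounting, not KVM's bitmap code):

```c
#include <assert.h>

#define SMALL_PAGE	4096
#define LARGE_PAGE	(2 * 1024 * 1024)

/* 4K pages covered by one 2M large mapping. */
static int pages_in_large_mapping(void)
{
	return LARGE_PAGE / SMALL_PAGE;
}

/* Pages whose dirtying goes unrecorded when only the first 4K page
 * of the large mapping is marked in the dirty bitmap. */
static int pages_missed(void)
{
	return pages_in_large_mapping() - 1;
}
```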

In this version, we let fast page fault do not mark large spte to writable if
its page are dirty logged.  But it is still not safe as you pointed out.

 
 
  /*
 + * Can not map the large spte to writable if the page is dirty
 + * logged.
 + */
  +if (sp->role.level > PT_PAGE_TABLE_LEVEL && force_pt_level)
 +goto exit;
 +
 
 It is not safe to derive slot->dirty_bitmap like this: 
 since dirty log is enabled via RCU update, the "is dirty bitmap enabled"
 info could be stale by the time you check it here via the parameter,
 so you can instantiate a large spte (because force_pt_level == false),
 while you should not.

Good catch! This is true even if we enable dirty log under the protection
of mmu lock.

How about letting the fast page fault path only fix the small spte, that is,
changing the code to:
if (sp->role.level > PT_PAGE_TABLE_LEVEL)
	goto exit;
?




Re: RFC: vfio-pci API for PCI bus/slot (hot) resets

2013-08-02 Thread Alex Williamson
On Fri, 2013-08-02 at 15:10 +1000, Benjamin Herrenschmidt wrote:
 On Thu, 2013-08-01 at 16:18 -0600, Alex Williamson wrote:
  vfio-pci needs to support an interface to do hot resets (PCI parent
  bridge secondary bus reset).   We need this to support reset of
  co-assigned devices where one or more of the devices does not support
  function level reset.  In particular, discrete graphics cards typically
  have no reset options other than doing a link reset.  
 
 Link reset != hot reset but yeah I see your point. There is more too,
 there is fundamental reset which may or may not be something we can
 control for individual slots (ie, switch legs) depending on platform
 specific stuff...
 
  What I have below
  is a bit awkward, so I welcome other ideas to accomplish this goal.
  I've been using a blind interface based on all affected devices
  belonging to the same VFIO container for current VGA testing.  This is
  ok when all you want to do is VGA, but I'd really like to make use of
  this any time a device doesn't support a function level reset.  I've
  posted a series to the PCI list to add bus and slot reset interfaces to
  PCI-core, this API is how we expose that through VFIO to a user.  Please
  comment.  Thanks,
 
 Also in some cases, at least for us, we have a problem where the scope
 of the reset can be larger than the group ... in this case I think we
 need to forbid the reset.

Interesting, I would have ventured to guess that resets are always
contained within a single group given the PE isolation domains rooted at
a PHB.  Why disallow it though?  What I propose below would still allow
it if the user can prove that they own all of the affected groups.  Even
on x86, I think it's too restrictive to only allow reset if the effect
is contained within a group.  It's just too easy to need it for a
multi-function device.  We could hope that a device that implements ACS
also implements some sort of FLR, but that may just be wishful thinking.

 For us, it's also tied into our EEH error handling. IE. The way the
 guest will request some of these things is going to be via firmware APIs
 that we have yet to implement, but that we were also thinking of
 implementing entirely in the kernel rather than qemu for various
 reasons... IE. There is more to error handling & recovery in our case at
 least than AER and reset :-)

I've been focused on reset for device re-initialization and
repeatability, but that's a good point, we need to consider error
recovery as a use of this interface as well.

 I need to spend more time reading your proposal and see how it can work
 for us (or not...)... but we might end up just doing our own thing that
 carries the whole EEH API instead.

Obviously it would be great if it could work for you, but regardless of
whether you intend to use it I'd appreciate feedback.  Thanks,

Alex

  ---
  Mechanism to do PCI hot resets through VFIO:
  
  VFIO is fundamentally an IOMMU group and device level interface.
  There's no concept of buses, slots, or hierarchies of devices.  There
  are only IOMMU group and devices.  A bus (or slot) may contain exactly
  one IOMMU group, multiple IOMMU groups, or a portion of an IOMMU group.
  An IOMMU group may contain one or more devices.
  
  The first question is perhaps where should we create the interface to do
  a PCI hot reset.  Assuming an ioctl interface, our choices are the
  group, the container, or the device file descriptors.  Groups and
  containers are not PCI specific, so an extension on either of those
  doesn't make much sense.  They also don't have much granularity if your
  goal is to do a hot reset on the smallest subset of devices you can.
  Therefore the only choice seems to be a VFIO device level interface.
  
  The fact that a hot reset affects multiple devices also raises concerns.
  How do we make sure a user has sufficient access/privilege to perform
  this operation?  If all of the affected devices are within the same
  group, then we know the user already owns all those devices.  Using
  groups as the boundary excludes a number of use cases though.  The user
  would need to prove that they also own the other groups or devices that
  are affected by the reset.  This might be multiple groups, so the ioctl
  quickly grows to requiring a list of file descriptors be passed for
  validation.
  
  We already use the group file descriptor as a unit of ownership for
  enabling the container, so it seems like it would make sense to use it
  here too.  The alternative is a device file descriptor, but groups may
  encompass devices the user doesn't care to use and we don't want to
  require that they open a file descriptor simply to perform a hot reset.
  Groups can also contain devices that the user cannot open, for instance
  those owned by VFIO compatible drivers like pci-stub or pcieport.
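One way the "list of group file descriptors" variant could look as an ioctl payload (purely hypothetical; the struct name and layout are invented for illustration, not a proposed ABI):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical payload: the user proves ownership of every group
 * affected by the bus/slot reset by passing its open group fd. */
struct hot_reset_request {
	uint32_t argsz;		/* total size, including group_fds[] */
	uint32_t flags;
	uint32_t count;		/* number of group fds that follow */
	int32_t  group_fds[];	/* open VFIO group file descriptors */
};
```

The kernel side would walk group_fds[], verify each fd really is a VFIO group covering the devices affected by the reset, and only then issue the secondary bus reset.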
  
  The user also needs to know the set of devices affected by a hot reset,
  otherwise they have no idea which group file descriptors to pass to such
 

Re: FAQ on linux-kvm.org has broken link

2013-08-02 Thread folkert
Hi Stefan,

  I pinged, I sniffed, I updated the bug report (it also happens with
  other guests now!).
 
  And the bring down interfaces / rmmod / modprobe / ifup works!
  So I think something is wrong with virtio_net!
 
  What shall I do now?
 
 I wrote a reply earlier today but it was rejected because I do not have a
 kernel.org bugzilla account.  If you don't mind let's continue
 discussing on this mailing list - we don't know whether this is a
 kernel bug yet anyway.

no problem

 A couple of questions:
 Please post the QEMU command-line from the host (ps aux | grep qemu).

I'll post them all:
- UMTS-clone: this one works fine since it was created a week ago
- belle: this one was fine but suddenly also showed the problem
- mauer: the problem one

112   4819 1  4 Jul30 ?03:29:39 /usr/bin/kvm -S -M pc-1.1 
-enable-kvm -m 1024 -smp 1,sockets=1,cores=1,threads=1 -name UMTS-clone -uuid 
e49502f1-0c74-2a60-99dc-7602da5ee640 -no-user-config -nodefaults -chardev 
socket,id=charmonitor,path=/var/lib/libvirt/qemu/UMTS-clone.monitor,server,nowait
 -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown 
-device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive 
file=/dev/VGNEO/LV_V_UMTS-clone,if=none,id=drive-virtio-disk0,format=raw,cache=writeback
 -device 
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
 -drive 
file=/home/folkert/ISOs/wheezy.iso,if=none,id=drive-ide0-1-0,readonly=on,format=raw
 -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev 
tap,fd=20,id=hostnet0,vhost=on,vhostfd=21 -device 
virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:09:3b:b6,bus=pci.0,addr=0x3 
-chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 
-vnc 127.0.0.1:0,password -vga cirrus -device 
usb-host,hostbus=6,hostaddr=5,id=hostdev0 -device 
virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6
112  10065 1 11 Jul30 ?07:46:16 /usr/bin/kvm -S -M pc-1.1 
-enable-kvm -m 8192 -smp 12,sockets=12,cores=1,threads=1 -name belle -uuid 
16b704d7-5fbd-d67b-71e6-0d6b43f1bc0a -no-user-config -nodefaults -chardev 
socket,id=charmonitor,path=/var/lib/libvirt/qemu/belle.monitor,server,nowait 
-mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime 
-no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive 
file=/dev/VGNEO/LV_V_BELLE,if=none,id=drive-virtio-disk0,format=raw -device 
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
 -drive 
file=/dev/VGNEO/LV_V_BELLE_OS,if=none,id=drive-virtio-disk1,format=raw,cache=writeback
 -device 
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x8,drive=drive-virtio-disk1,id=virtio-disk1
 -drive 
file=/dev/VGJOURNAL/LV_J_BELLE,if=none,id=drive-ide0-0-0,format=raw,cache=writeback
 -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -drive 
if=none,id=drive-ide0-1-0,readonly=on,format=raw,cache=none -device 
ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev 
tap,fd=26,id=hostnet0,vhost=on,vhostfd=27 -device 
virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:75:4a:6f,bus=pci.0,addr=0x3 
-netdev tap,fd=28,id=hostnet1,vhost=on,vhostfd=29 -device 
virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:0a:6e:de,bus=pci.0,addr=0x7 
-chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 
-device usb-tablet,id=input0 -vnc 127.0.0.1:1,password -vga cirrus -device 
intel-hda,id=sound0,bus=pci.0,addr=0x4 -device 
hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -device 
virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6
root 13116 12830  0 19:54 pts/8    00:00:00 grep qemu
112  23453 1 57 13:16 ?03:46:51 /usr/bin/kvm -S -M pc-1.1 
-enable-kvm -m 8192 -smp 8,maxcpus=12,sockets=12,cores=1,threads=1 -name mauer 
-uuid 3a8452e6-81af-b185-63b6-2b32be17ed87 -no-user-config -nodefaults -chardev 
socket,id=charmonitor,path=/var/lib/libvirt/qemu/mauer.monitor,server,nowait 
-mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown 
-device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive 
file=/dev/VGNEO/LV_V_MAUER,if=none,id=drive-virtio-disk0,format=raw,cache=writeback
 -device 
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x8,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
 -drive 
file=/dev/VGJOURNAL/LV_J_MAUER,if=none,id=drive-virtio-disk1,format=raw,cache=writethrough
 -device 
virtio-blk-pci,scsi=off,bus=pci.0,addr=0xa,drive=drive-virtio-disk1,id=virtio-disk1
 -drive if=none,id=drive-ide0-1-0,readonly=on,format=raw -device 
ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev 
tap,fd=26,id=hostnet0,vhost=on,vhostfd=27 -device 
virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:86:d9:1f,bus=pci.0,addr=0x3 
-netdev tap,fd=28,id=hostnet1,vhost=on,vhostfd=29 -device 
virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:a3:12:8a,bus=pci.0,addr=0x4 
-netdev tap,fd=30,id=hostnet2,vhost=on,vhostfd=31 -device 

Re: RFC: vfio-pci API for PCI bus/slot (hot) resets

2013-08-02 Thread Benjamin Herrenschmidt
On Fri, 2013-08-02 at 10:36 -0600, Alex Williamson wrote:

  Also in some cases, at least for us, we have a problem where the scope
  of the reset can be larger than the group ... in this case I think we
  need to forbid the reset.
 
 Interesting, I would have ventured to guess that resets are always
 contained within a single group given the PE isolation domains rooted at
 a PHB. 

Depends how I configure them. For example, if I assign two functions to
two different PEs (SR-IOV typically), a hot reset will kill both.

For fundamental reset, depending on how the mobo is wired up and other
platform things, I might for example be limited to doing a PERST at the
PHB level, thus across multiple PEs.

  Why disallow it though?  What I propose below would still allow
 it if the user can prove that they own all of the affected groups.

Right, I mean disallow it if the PEs are not all owned by the same user.
What you propose would probably work.

   Even
 on x86, I think it's too restrictive to only allow reset if the effect
 is contained within a group.  It's just too easy to need it for a
 multi-function device.  We could hope that a device that implements ACS
 also implements some sort of FLR, but that may just be wishful thinking.

Right. Another use case is, I know of devices that need a fundamental
reset (PERST) after applying a FW update.

  For us, it's also tied into our EEH error handling. IE. The way the
  guest will request some of these things is going to be via firmware APIs
  that we have yet to implement, but that we were also thinking of
  implementing entirely in the kernel rather than qemu for various
  reasons... IE. There is more to error handling & recovery in our case at
  least than AER and reset :-)
 
 I've been focused on reset for device re-initialization and
 repeatability, but that's a good point, we need to consider error
 recovery as a use of this interface as well.

Unfortunately, with EEH we have a very specific FW interface (including
low-level diagnostic data blobs etc...) that we need to implement, so I
am not sure we can fit it in a generic interface.

We might want to attempt to list our interfaces and see. CC'ing Gavin
who is our EEH expert.

A rough cut is

 - We have the concept of frozen PE (frozen MMIO and frozen DMA, two
separate attributes). So we can query and change the freeze state on a
PE.

 - We have various levels of error severity (from informational
(corrected) to "full PHB needs a reset", with some variations in the
middle such as PEs being frozen, etc...)

 - We have diag blobs coming out of FW for the guest to log (and
possibly partially decode).

 - We have interfaces for various types of resets

  I need to spend more time reading your proposal and see how it can work
  for us (or not...)... but we might end up just doing our own thing that
  carries the whole EEH API instead.
 
 Obviously it would be great if it could work for you, but regardless of
 whether you intend to use it I'd appreciate feedback.  Thanks,

Ben.

 Alex
 
   ---
   Mechanism to do PCI hot resets through VFIO:
   
   VFIO is fundamentally an IOMMU group and device level interface.
   There's no concept of buses, slots, or hierarchies of devices.  There
   are only IOMMU group and devices.  A bus (or slot) may contain exactly
   one IOMMU group, multiple IOMMU groups, or a portion of an IOMMU group.
   An IOMMU group may contain one or more devices.
   
   The first question is perhaps where should we create the interface to do
   a PCI hot reset.  Assuming an ioctl interface, our choices are the
   group, the container, or the device file descriptors.  Groups and
   containers are not PCI specific, so an extension on either of those
   doesn't make much sense.  They also don't have much granularity if your
   goal is to do a hot reset on the smallest subset of devices you can.
   Therefore the only choice seems to be a VFIO device level interface.
   
   The fact that a hot reset affects multiple devices also raises concerns.
   How do we make sure a user has sufficient access/privilege to perform
   this operation?  If all of the affected devices are within the same
   group, then we know the user already owns all those devices.  Using
   groups as the boundary excludes a number of use cases though.  The user
   would need to prove that they also own the other groups or devices that
   are affected by the reset.  This might be multiple groups, so the ioctl
   quickly grows to requiring a list of file descriptors be passed for
   validation.
   
   We already use the group file descriptor as a unit of ownership for
   enabling the container, so it seems like it would make sense to use it
   here too.  The alternative is a device file descriptor, but groups may
   encompass devices the user doesn't care to use and we don't want to
   require that they open a file descriptor simply to perform a hot reset.
   Groups can also contain devices that the user cannot open, for instance
  those owned by VFIO compatible drivers like pci-stub or pcieport.

Re: RFC: vfio-pci API for PCI bus/slot (hot) resets

2013-08-02 Thread Bjorn Helgaas
[+cc linux-pci]

On Fri, Aug 2, 2013 at 3:28 PM, Benjamin Herrenschmidt
b...@kernel.crashing.org wrote:

 Right. Another use case is, I know of devices that need a fundamental
 reset (PERST) after applying a FW update.

This is a tangent from the real discussion here, but the question of
resetting a device after a firmware update concerns me.  Many if not
all of our current reset interfaces save  restore the architected
parts of config space around the reset.  But a reset after a firmware
update may change things like the number and type of BARs or even the
functionality of the device, so I don't think the restore is safe in
general.

I doubt this is a big problem in general, but I have found reports of
people having to do a system reset or reboot after updating, e.g.,
FPGA images.  I suppose at least some of these could be worked around
with the right hotplug incantations.

http://forums.xilinx.com/t5/PCI-Express/PC-is-hung-if-run-the-second-program-FPGA-by-JTAG/td-p/20871
http://www.alteraforum.com/forum/archive/index.php/t-28477.html
http://www.alteraforum.com/forum/showthread.php?t=35091
https://github.com/NetFPGA/NetFPGA-public/wiki/Restore-PCIE-Configuration
http://lkml.indiana.edu/hypermail/linux/kernel/1104.3/02544.html
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 03/12] KVM: MMU: lazily drop large spte

2013-08-02 Thread Marcelo Tosatti
On Fri, Aug 02, 2013 at 11:42:19PM +0800, Xiao Guangrong wrote:
 
 On Aug 2, 2013, at 10:55 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
 
  On Tue, Jul 30, 2013 at 09:02:01PM +0800, Xiao Guangrong wrote:
  Currently, kvm zaps the large spte if write-protected is needed, the later
  read can fault on that spte. Actually, we can make the large spte readonly
  instead of making them un-present, the page fault caused by read access can
  be avoided
  
  The idea is from Avi:
  | As I mentioned before, write-protecting a large spte is a good idea,
  | since it moves some work from protect-time to fault-time, so it reduces
  | jitter.  This removes the need for the return value.
  
  [
   It has fixed the issue reported in 6b73a9606 by stopping fast page fault
   marking the large spte to writable
  ]
  
  Xiao,
  
  Can you please write a comment explaining why are the problems 
  with shadow vs large read-only sptes (can't recall anymore),
  and then why it is now safe to do it.
 
 Hi Marcelo,
 
 Thanks for your review.  Yes. The bug reported in 6b73a9606 is: in this patch,
 we mark the large spte as readonly when the pages are dirty logged, and the
 readonly spte can be set to writable by fast page fault. But on that path it
 failed to check dirty logging, so it would set the large spte to writable but
 only set the first page in the dirty bitmap.
 
 For example:
 
 1): KVM maps 0 ~ 2M memory to guest which is pointed by SPTE and SPTE
  is writable.
 
 2): KVM dirty log 0 ~ 2M,  then set SPTE to readonly
 
 3): fast page fault set SPTE to writable and set page 0 to the dirty bitmap.
 
 Then 4K ~ 2M memory is not dirty logged.

Ok can you write a self contained summary of read-only large sptes (when
they are created, when destroyed, from which point they can't be created,
etc), and the interaction with shadow write protection and creation of
writeable sptes?
Its easy to get lost.

 In this version, we make fast page fault not mark the large spte writable if
 its pages are dirty logged.  But it is still not safe, as you pointed out.
 
  
  
 /*
  +   * Can not map the large spte to writable if the page is dirty
  +   * logged.
  +   */
  +  if (sp->role.level > PT_PAGE_TABLE_LEVEL && force_pt_level)
  +  goto exit;
  +
  
  It is not safe to derive slot->dirty_bitmap like this: 
  since dirty log is enabled via RCU update, the "is dirty bitmap enabled"
  info could be stale by the time you check it here via the parameter,
  so you can instantiate a large spte (because force_pt_level == false),
  while you should not.
 
 Good catch! This is true even if we enable dirty log under the protection
 of mmu lock.
 
 How about letting the fast page fault path only fix the small spte, that is,
 changing the code to:
   if (sp->role.level > PT_PAGE_TABLE_LEVEL)
   goto exit;
 ?

Sure.



Re: [PATCH 03/12] KVM: MMU: lazily drop large spte

2013-08-02 Thread Xiao Guangrong

On Aug 3, 2013, at 4:27 AM, Marcelo Tosatti mtosa...@redhat.com wrote:

 On Fri, Aug 02, 2013 at 11:42:19PM +0800, Xiao Guangrong wrote:
 
 On Aug 2, 2013, at 10:55 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
 
 On Tue, Jul 30, 2013 at 09:02:01PM +0800, Xiao Guangrong wrote:
 Currently, kvm zaps the large spte if write-protected is needed, the later
 read can fault on that spte. Actually, we can make the large spte readonly
 instead of making them un-present, the page fault caused by read access can
 be avoided
 
 The idea is from Avi:
 | As I mentioned before, write-protecting a large spte is a good idea,
 | since it moves some work from protect-time to fault-time, so it reduces
 | jitter.  This removes the need for the return value.
 
 [
 It has fixed the issue reported in 6b73a9606 by stopping fast page fault
 marking the large spte to writable
 ]
 
 Xiao,
 
 Can you please write a comment explaining why are the problems 
 with shadow vs large read-only sptes (can't recall anymore),
 and then why it is now safe to do it.
 
 Hi Marcelo,
 
  Thanks for your review.  Yes. The bug reported in 6b73a9606 is: in this
  patch, we mark the large spte as readonly when the pages are dirty logged,
  and the readonly spte can be set to writable by fast page fault. But on
  that path it failed to check dirty logging, so it would set the large spte
  to writable but only set the first page in the dirty bitmap.
 
 For example:
 
 1): KVM maps 0 ~ 2M memory to guest which is pointed by SPTE and SPTE
 is writable.
 
 2): KVM dirty log 0 ~ 2M,  then set SPTE to readonly
 
 3): fast page fault set SPTE to writable and set page 0 to the dirty bitmap.
 
 Then 4K ~ 2M memory is not dirty logged.
 
 Ok can you write a self contained summary of read-only large sptes (when
 they are created, when destroyed, from which point they can't be created,
 etc), and the interaction with shadow write protection and creation of
 writeable sptes?
 Its easy to get lost.

Okay, will do.



Re: [PATCH 5/6 v2] kvm: powerpc: booke: Add linux pte lookup like booke3s

2013-08-02 Thread Scott Wood
On Thu, 2013-08-01 at 16:42 +0530, Bharat Bhushan wrote:
 KVM need to lookup linux pte for getting TLB attributes (WIMGE).
 This is similar to how book3s does.
 This will be used in follow-up patches.
 
 Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com
 ---
 v1-v2
  - This is a new change in this version
 
  arch/powerpc/include/asm/kvm_booke.h |   73 
 ++
  1 files changed, 73 insertions(+), 0 deletions(-)
 
 diff --git a/arch/powerpc/include/asm/kvm_booke.h 
 b/arch/powerpc/include/asm/kvm_booke.h
 index d3c1eb3..903624d 100644
 --- a/arch/powerpc/include/asm/kvm_booke.h
 +++ b/arch/powerpc/include/asm/kvm_booke.h
 @@ -102,4 +102,77 @@ static inline ulong kvmppc_get_msr(struct kvm_vcpu *vcpu)
  {
  	return vcpu->arch.shared->msr;
  }
 +
 +/*
 + * Lock and read a linux PTE.  If it's present and writable, atomically
 + * set dirty and referenced bits and return the PTE, otherwise return 0.
 + */
 +static inline pte_t kvmppc_read_update_linux_pte(pte_t *p, int writing)
 +{
 + pte_t pte;
 +
 +#ifdef PTE_ATOMIC_UPDATES
 + pte_t tmp;
 +/* wait until _PAGE_BUSY is clear then set it atomically */

_PAGE_BUSY is 0 on book3e.

 +#ifdef CONFIG_PPC64
 +	__asm__ __volatile__ (
 +		"1:	ldarx	%0,0,%3\n"
 +		"	andi.	%1,%0,%4\n"
 +		"	bne-	1b\n"
 +		"	ori	%1,%0,%4\n"
 +		"	stdcx.	%1,0,%3\n"
 +		"	bne-	1b"
 +		: "=&r" (pte), "=&r" (tmp), "=m" (*p)
 +		: "r" (p), "i" (_PAGE_BUSY)
 +		: "cc");
 +#else
 +	__asm__ __volatile__ (
 +		"1:	lwarx	%0,0,%3\n"
 +		"	andi.	%1,%0,%4\n"
 +		"	bne-	1b\n"
 +		"	ori	%1,%0,%4\n"
 +		"	stwcx.	%1,0,%3\n"
 +		"	bne-	1b"
 +		: "=&r" (pte), "=&r" (tmp), "=m" (*p)
 +		: "r" (p), "i" (_PAGE_BUSY)
 +		: "cc");
 +#endif

What about 64-bit PTEs on 32-bit kernels?

In any case, this code does not belong in KVM.  It should be in the main
PPC mm code, even if KVM is the only user.

-Scott





Re: RFC: vfio-pci API for PCI bus/slot (hot) resets

2013-08-02 Thread Benjamin Herrenschmidt
On Fri, 2013-08-02 at 16:49 -0600, Bjorn Helgaas wrote:
 [+cc linux-pci]
 
 On Fri, Aug 2, 2013 at 3:28 PM, Benjamin Herrenschmidt
 b...@kernel.crashing.org wrote:
 
  Right. Another use case is, I know of devices that need a fundamental
  reset (PERST) after applying a FW update.
 
 This is a tangent from the real discussion here, but the question of
 resetting a device after a firmware update concerns me.  Many if not
  all of our current reset interfaces save & restore the architected
 parts of config space around the reset.  But a reset after a firmware
 update may change things like the number and type of BARs or even the
 functionality of the device, so I don't think the restore is safe in
 general.

Right.

 I doubt this is a big problem in general, but I have found reports of
 people having to do a system reset or reboot after updating, e.g.,
 FPGA images.  I suppose at least some of these could be worked around
 with the right hotplug incantations.

Yes.

We have that similar issue with error handling, when the driver doesn't
have the right hooks, we simulate an unplug, reset, then replug.

Maybe we could provide generic helpers to do that...

Cheers,
Ben.

 http://forums.xilinx.com/t5/PCI-Express/PC-is-hung-if-run-the-second-program-FPGA-by-JTAG/td-p/20871
 http://www.alteraforum.com/forum/archive/index.php/t-28477.html
 http://www.alteraforum.com/forum/showthread.php?t=35091
 https://github.com/NetFPGA/NetFPGA-public/wiki/Restore-PCIE-Configuration
 http://lkml.indiana.edu/hypermail/linux/kernel/1104.3/02544.html




Re: [PATCH 6/6 v2] kvm: powerpc: use caching attributes as per linux pte

2013-08-02 Thread Scott Wood
On Thu, Aug 01, 2013 at 04:42:38PM +0530, Bharat Bhushan wrote:
 diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
 index 17722d8..eb2 100644
 --- a/arch/powerpc/kvm/booke.c
 +++ b/arch/powerpc/kvm/booke.c
 @@ -697,7 +697,7 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct 
 kvm_vcpu *vcpu)
  #endif
  
   kvmppc_fix_ee_before_entry();
 -
 +	vcpu->arch.pgdir = current->mm->pgd;
   ret = __kvmppc_vcpu_run(kvm_run, vcpu);

kvmppc_fix_ee_before_entry() is supposed to be the last thing that
happens before __kvmppc_vcpu_run().

 @@ -332,6 +324,8 @@ static inline int kvmppc_e500_shadow_map(struct 
 kvmppc_vcpu_e500 *vcpu_e500,
   unsigned long hva;
   int pfnmap = 0;
   int tsize = BOOK3E_PAGESZ_4K;
 + pte_t pte;
 + int wimg = 0;
  
   /*
* Translate guest physical to true physical, acquiring
 @@ -437,6 +431,8 @@ static inline int kvmppc_e500_shadow_map(struct 
 kvmppc_vcpu_e500 *vcpu_e500,
  
   if (likely(!pfnmap)) {
   unsigned long tsize_pages = 1 << (tsize + 10 - PAGE_SHIFT);
 + pgd_t *pgdir;
 +
   pfn = gfn_to_pfn_memslot(slot, gfn);
   if (is_error_noslot_pfn(pfn)) {
   printk(KERN_ERR "Couldn't get real page for gfn %lx!\n",
 @@ -447,9 +443,18 @@ static inline int kvmppc_e500_shadow_map(struct 
 kvmppc_vcpu_e500 *vcpu_e500,
   /* Align guest and physical address to page map boundaries */
   pfn &= ~(tsize_pages - 1);
   gvaddr &= ~((tsize_pages << PAGE_SHIFT) - 1);
 + pgdir = vcpu_e500->vcpu.arch.pgdir;
 + pte = lookup_linux_pte(pgdir, hva, 1, &tsize_pages);
 + if (pte_present(pte)) {
 + wimg = (pte >> PTE_WIMGE_SHIFT) & MAS2_WIMGE_MASK;
 + } else {
 + printk(KERN_ERR "pte not present: gfn %lx, pfn %lx\n",
 + (long)gfn, pfn);
 + return -EINVAL;
 + }
   }

How does wimg get set in the pfnmap case?

Could you explain why we need to set dirty/referenced on the PTE, when we
didn't need to do that before?  All we're getting from the PTE is wimg. 
We have MMU notifiers to take care of the page being unmapped, and we've
already marked the page itself as dirty if the TLB entry is writeable.

-Scott



Re: RFC: vfio-pci API for PCI bus/slot (hot) resets

2013-08-02 Thread Alex Williamson
On Sat, 2013-08-03 at 09:15 +1000, Benjamin Herrenschmidt wrote:
 On Fri, 2013-08-02 at 16:49 -0600, Bjorn Helgaas wrote:
  [+cc linux-pci]
  
  On Fri, Aug 2, 2013 at 3:28 PM, Benjamin Herrenschmidt
  b...@kernel.crashing.org wrote:
  
   Right. Another use case is, I know of devices that need a fundamental
   reset (PERST) after applying a FW update.
  
  This is a tangent from the real discussion here, but the question of
  resetting a device after a firmware update concerns me.  Many if not
   all of our current reset interfaces save & restore the architected
  parts of config space around the reset.  But a reset after a firmware
  update may change things like the number and type of BARs or even the
  functionality of the device, so I don't think the restore is safe in
  general.
 
 Right.
 
  I doubt this is a big problem in general, but I have found reports of
  people having to do a system reset or reboot after updating, e.g.,
  FPGA images.  I suppose at least some of these could be worked around
  with the right hotplug incantations.
 
 Yes.
 
 We have that similar issue with error handling, when the driver doesn't
 have the right hooks, we simulate an unplug, reset, then replug.
 
 Maybe we could provide generic helpers to do that...

Devices going away and coming back is pretty difficult for vfio to
handle.  Perhaps helpers to rescan a device in-place would be easier.
On the QEMU side we'd need to rescan the device after each reset, which
would be rather tedious for the typical case where it doesn't change.
Thanks,

Alex



Re: [PATCH 5/6 v2] kvm: powerpc: booke: Add linux pte lookup like booke3s

2013-08-02 Thread Benjamin Herrenschmidt
On Fri, 2013-08-02 at 17:58 -0500, Scott Wood wrote:
 
 What about 64-bit PTEs on 32-bit kernels?
 
 In any case, this code does not belong in KVM.  It should be in the
 main
 PPC mm code, even if KVM is the only user.

Also don't we do similar things in BookS KVM ? At the very least that
stuff should become common. And yes, I agree, it should probably also
move to pgtable*

Cheers,
Ben.




Re: RFC: vfio-pci API for PCI bus/slot (hot) resets

2013-08-02 Thread Benjamin Herrenschmidt
On Fri, 2013-08-02 at 17:37 -0600, Alex Williamson wrote:

  Yes.
  
  We have that similar issue with error handling, when the driver doesn't
  have the right hooks, we simulate an unplug, reset, then replug.
  
  Maybe we could provide generic helpers to do that...
 
 Devices going away and coming back is pretty difficult for vfio to
 handle.  Perhaps helpers to rescan a device in-place would be easier.

Well, in the error handling case (and *maybe* in the FW update case)
we need to unbind and rebind the driver as well. The whole point is that
we have to do that because the driver doesn't have the right hooks.

In the VFIO case, we will have to implement something here so that the
VFIO driver stub doesn't get handled that treatment :-) We'll probably
need some arch specific stuff in vfio_pci unfortunately, so that the
errors get forwarded to the guest via our EEH interfaces, and let the
guest handle its error handling.

Of course that leaves an interesting problem as to what state the device
is in when it comes back to the host ...

This is made a LOT more complicated in the VFIO model than it is in the
pHyp model (our proprietary hypervisor).

Under pHyp, the PE doesn't have a concept of being used outside of a
guest, and it is always reset before being assigned/unassigned. The guest
can mess around all it wants (bus numbers, BARs, etc...) it can only
hurt itself. The hypervisor doesn't keep track of any of that.

 On the QEMU side we'd need to rescan the device after each reset, which
 would be rather tedious for the typical case where it doesn't change.

This is a direct consequence of your model :-) It makes things more
complex for us as well, but we have to bite that bullet now. Maybe we
can consider an alternate/simpler model in the future, more akin to what
pHyp does, where we unplug the device from the host when assigning it
to a guest (and the whole hierarchy below it if it's a bridge) and let
the guest do what it wants with it. 

Doing so means we no longer have to emulate/filter config space
(provided your HW handles MSI virtualization properly), care about bus
numbers, BARs, etc... nor do we need to keep track of a lot of this in
qemu. All we need is reset the whole  lot and re-probe the bus leg when
returning the devices to the host.
 
Cheers,
Ben.

 Thanks,
 
 Alex
 




RE: [PATCH 5/6 v2] kvm: powerpc: booke: Add linux pte lookup like booke3s

2013-08-02 Thread Bhushan Bharat-R65777


 -Original Message-
 From: Benjamin Herrenschmidt [mailto:b...@kernel.crashing.org]
 Sent: Saturday, August 03, 2013 4:47 AM
 To: Wood Scott-B07421
 Cc: Bhushan Bharat-R65777; ag...@suse.de; kvm-...@vger.kernel.org;
 kvm@vger.kernel.org; linuxppc-...@lists.ozlabs.org; Bhushan Bharat-R65777
 Subject: Re: [PATCH 5/6 v2] kvm: powerpc: booke: Add linux pte lookup like
 booke3s
 
 On Fri, 2013-08-02 at 17:58 -0500, Scott Wood wrote:
 
  What about 64-bit PTEs on 32-bit kernels?
 
  In any case, this code does not belong in KVM.  It should be in the
  main PPC mm code, even if KVM is the only user.
 
 Also don't we do similar things in BookS KVM ? At the very least that stuff
 should become common. And yes, I agree, it should probably also move to 
 pgtable*

One of the problems I saw was that if I put this code in asm/pgtable-32.h and
asm/pgtable-64.h, then pte_present() and other friend functions (on which this
code depends) are defined in pgtable.h. And pgtable.h includes asm/pgtable-32.h
and asm/pgtable-64.h before it defines pte_present() and friends.

OK, I will move this into asm/pgtable*.h. Initially I debated whether to put
this code in pgtable*, but finally ended up doing it here (got biased by
book3s :)).

Thanks
-Bharat

 
 Cheers,
 Ben.
 
 


RE: [PATCH 6/6 v2] kvm: powerpc: use caching attributes as per linux pte

2013-08-02 Thread Bhushan Bharat-R65777


 -Original Message-
 From: Wood Scott-B07421
 Sent: Saturday, August 03, 2013 5:05 AM
 To: Bhushan Bharat-R65777
 Cc: b...@kernel.crashing.org; ag...@suse.de; kvm-...@vger.kernel.org;
 kvm@vger.kernel.org; linuxppc-...@lists.ozlabs.org; Bhushan Bharat-R65777
 Subject: Re: [PATCH 6/6 v2] kvm: powerpc: use caching attributes as per linux
 pte
 
 On Thu, Aug 01, 2013 at 04:42:38PM +0530, Bharat Bhushan wrote:
  diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c index
  17722d8..eb2 100644
  --- a/arch/powerpc/kvm/booke.c
  +++ b/arch/powerpc/kvm/booke.c
  @@ -697,7 +697,7 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run,
  struct kvm_vcpu *vcpu)  #endif
 
  kvmppc_fix_ee_before_entry();
  -
  +   vcpu->arch.pgdir = current->mm->pgd;
  ret = __kvmppc_vcpu_run(kvm_run, vcpu);
 
 kvmppc_fix_ee_before_entry() is supposed to be the last thing that happens
 before __kvmppc_vcpu_run().
 
  @@ -332,6 +324,8 @@ static inline int kvmppc_e500_shadow_map(struct
 kvmppc_vcpu_e500 *vcpu_e500,
  unsigned long hva;
  int pfnmap = 0;
  int tsize = BOOK3E_PAGESZ_4K;
  +   pte_t pte;
  +   int wimg = 0;
 
  /*
   * Translate guest physical to true physical, acquiring @@ -437,6
  +431,8 @@ static inline int kvmppc_e500_shadow_map(struct
  kvmppc_vcpu_e500 *vcpu_e500,
 
  if (likely(!pfnmap)) {
   unsigned long tsize_pages = 1 << (tsize + 10 - PAGE_SHIFT);
  +   pgd_t *pgdir;
  +
  pfn = gfn_to_pfn_memslot(slot, gfn);
  if (is_error_noslot_pfn(pfn)) {
  printk(KERN_ERR Couldn't get real page for gfn 
  %lx!\n, @@
 -447,9
  +443,18 @@ static inline int kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500
 *vcpu_e500,
  /* Align guest and physical address to page map boundaries */
   pfn &= ~(tsize_pages - 1);
   gvaddr &= ~((tsize_pages << PAGE_SHIFT) - 1);
   +   pgdir = vcpu_e500->vcpu.arch.pgdir;
   +   pte = lookup_linux_pte(pgdir, hva, 1, &tsize_pages);
   +   if (pte_present(pte)) {
   +   wimg = (pte >> PTE_WIMGE_SHIFT) & MAS2_WIMGE_MASK;
   +   } else {
   +   printk(KERN_ERR "pte not present: gfn %lx, pfn %lx\n",
   +   (long)gfn, pfn);
   +   return -EINVAL;
   +   }
  }
 
 How does wimg get set in the pfnmap case?

Pfnmap pages are not kernel-managed, right? So should we set I+G there?

 
 Could you explain why we need to set dirty/referenced on the PTE, when we 
 didn't
 need to do that before? All we're getting from the PTE is wimg.
 We have MMU notifiers to take care of the page being unmapped, and we've 
 already
 marked the page itself as dirty if the TLB entry is writeable.

I pulled this code from book3s.

Ben, can you describe why we need this on book3s ?

Thanks
-Bharat
 
 -Scott

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/6 v2] kvm: powerpc: booke: Add linux pte lookup like booke3s

2013-08-02 Thread Benjamin Herrenschmidt
On Sat, 2013-08-03 at 02:58 +, Bhushan Bharat-R65777 wrote:
 One of the problems I saw was that if I put this code in
 asm/pgtable-32.h and asm/pgtable-64.h then pte_present() and other
 friend functions (on which this code depends) are defined in pgtable.h.
 And pgtable.h includes asm/pgtable-32.h and asm/pgtable-64.h before it
 defines pte_present() and friend functions.
 
 OK, I will move this into asm/pgtable*.h; initially I fought with myself
 to put this code in pgtable* but finally ended up doing it here (got
 biased by book3s :)).

Is there a reason why these routines can not be completely generic
in pgtable.h ?

Ben.




Re: [PATCH 6/6 v2] kvm: powerpc: use caching attributes as per linux pte

2013-08-02 Thread Benjamin Herrenschmidt
On Sat, 2013-08-03 at 03:11 +, Bhushan Bharat-R65777 wrote:
 
  
  Could you explain why we need to set dirty/referenced on the PTE, when we 
  didn't
  need to do that before? All we're getting from the PTE is wimg.
  We have MMU notifiers to take care of the page being unmapped, and we've 
  already
  marked the page itself as dirty if the TLB entry is writeable.
 
 I pulled this code from book3s.
 
 Ben, can you describe why we need this on book3s ?

If you let the guest write to the page you must set the dirty bit on the PTE
(or the struct page, at least one of them); similarly with accessed on any access.

If you don't, the VM might swap the page out without writing it back to disk
for example, assuming it contains no modified data.

Cheers,
Ben.
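Ben's rule can be sketched as a tiny self-contained model. The flag names and bit positions below are hypothetical, purely for illustration; they are not the real Linux PTE layout:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag layout for illustration -- not the real Linux PTE bits. */
#define PTE_PRESENT  (1u << 0)
#define PTE_WRITE    (1u << 1)
#define PTE_ACCESSED (1u << 2)
#define PTE_DIRTY    (1u << 3)

/*
 * Mirror of the update rule described above: any access sets ACCESSED,
 * and a write through a writable mapping also sets DIRTY.
 */
static inline uint32_t pte_update_bits(uint32_t pte, int writing)
{
	if (!(pte & PTE_PRESENT))
		return pte;            /* nothing to record for absent pages */
	pte |= PTE_ACCESSED;           /* referenced on any access */
	if (writing && (pte & PTE_WRITE))
		pte |= PTE_DIRTY;      /* modified data must be written back */
	return pte;
}
```

Under this rule a write attempt through a read-only mapping never sets DIRTY, which matches the VM's assumption that only writable, dirty PTEs need writeback before the page can be swapped out.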




Re: [RFC PATCH 00/12] KVM: MMU: locklessly wirte-protect

2013-08-02 Thread Takuya Yoshikawa
On Tue, 30 Jul 2013 21:01:58 +0800
Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com wrote:

 Background
 ==
 Currently, when we mark a memslot dirty-logged or get its dirty pages, we need to
 write-protect a large amount of guest memory. This is heavy work; especially, we need to
 hold mmu-lock, which is also required by vcpus to fix their page table faults and by the
 mmu-notifier when a host page is being changed. On a guest with extreme CPU / memory use,
 this becomes a scalability issue.
 
 This patchset introduces a way to locklessly write-protect guest memory.

Nice improvements!

If I read the patch set correctly, this work contains the following changes:

Cleanups:
Patch 1 and patch 12.

Lazy large page dropping for dirty logging:
Patch 2-3.
Patch 2 is preparatory to patch 3.

This does not look like an RFC if you address Marcelo's comment.
Any reason to include this in an RFC patch set?

Making remote TLBs flushable outside of mmu_lock for dirty logging:
Patch 6.

This is nice.  I'm locally using a similar patch for my work, but yours
is much cleaner and better.  I hope this will get merged soon.

New Pte-list handling:
Patch 7-9.

Still reading the details.

RCU-based lockless write protection.
Patch 10-11.

If I understand RCU correctly, the current implementation has a problem:
read-side critical sections can become too long.

See the following LWN's article:
Sleepable RCU
https://lwn.net/Articles/202847/

Especially, kvm_mmu_slot_remove_write_access() can take hundreds of
milliseconds, or even a few seconds for guests using shadow paging.
Is it possible to break the read-side critical section after protecting
some pages? -- I guess so.
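The chunking suggested here can be sketched minimally, with the RCU primitives stubbed out. The chunk size and function names below are made up for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Stubs standing in for rcu_read_lock()/rcu_read_unlock()/cond_resched(). */
static int rcu_depth;
static int max_chunk_seen;
static void rcu_read_lock(void)   { rcu_depth++; }
static void rcu_read_unlock(void) { rcu_depth--; }
static void cond_resched(void)    { /* scheduling point between chunks */ }

#define CHUNK 64

/* Write-protect pages [0, nr) in bounded read-side sections
 * instead of one long section covering the whole slot. */
static size_t protect_all(size_t nr)
{
	size_t done = 0, i;
	for (i = 0; i < nr; i += CHUNK) {
		size_t end = (i + CHUNK < nr) ? i + CHUNK : nr;
		rcu_read_lock();
		/* ... write-protect sptes in [i, end) under RCU here ... */
		if ((int)(end - i) > max_chunk_seen)
			max_chunk_seen = (int)(end - i);
		done += end - i;
		rcu_read_unlock();
		cond_resched();   /* let grace periods make progress */
	}
	return done;
}
```

Each read-side critical section now covers at most CHUNK pages, so call_rcu() callbacks (such as shadow page frees) are not held up for the full duration of kvm_mmu_slot_remove_write_access().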

Anyway, I want to see the following non-RFC quality patches get merged first:
- Lazy large page dropping for dirty logging:
- Making remote TLBs flushable outside of mmu_lock for dirty logging

As you are doing in patch 11, the latter can eliminate the TLB flushes before
cond_resched_lock().  So this alone is an optimization, and since my work is
based on this TLB flush-less lock breaking, I would appreciate it if you make this
change first in your clean way.

The remaining patches, pte-list refactoring and lock-less ones, also look
interesting, but I need to read more to understand them.

Thanks for the nice work!
Takuya


 
 Idea
 ==
 There are the challenges we meet and the ideas to resolve them.
 
 1) How to locklessly walk the rmap?
 The first idea we had for preventing a desc from being freed while we are walking the
 rmap was to use RCU. But when a vcpu runs in shadow page mode or nested mmu mode,
 it updates the rmap really frequently.
 
 So we use SLAB_DESTROY_BY_RCU to manage descs instead; it allows the object
 to be reused more quickly. We also store a nulls value in the last desc
 (desc->more), which helps us detect whether the desc has been moved to another
 rmap, in which case we re-walk the rmap. I learned this idea from the
 nulls-list.
 
 Another issue is that when a spte is deleted from a desc, another spte in the
 last desc is moved to this position to replace the deleted one. If the
 deleted one has been accessed and we do not access the replacement, the
 replacement is missed when we do a lockless walk.
 To fix this case, we do not move the spte backward; instead, we move
 the entry forward: when a spte is deleted, we move the entry in the first desc to that
 position.
 
 2) How to locklessly access the shadow page table?
 It is easy if the handler is in vcpu context; in that case we can use
 walk_shadow_page_lockless_begin() and walk_shadow_page_lockless_end(), which
 disable interrupts to stop shadow pages from being freed. But we are in ioctl
 context, and the paths we are optimizing for have heavy workloads; disabling
 interrupts is not good for system performance.
 
 We add an indicator to the kvm struct (kvm->arch.rcu_free_shadow_page), then use
 call_rcu() to free the shadow page if that indicator is set. Setting/clearing the
 indicator is protected by slot->lock, so it need not be atomic and does not
 hurt performance or scalability.
 
 3) How to locklessly write-protect guest memory?
 Currently, there are two behaviors when we write-protect guest memory: one is
 clearing the Writable bit on the spte, and the other is dropping the spte when it
 points to a large page. The former is easy, we only need to atomically clear a bit,
 but the latter is hard since we need to remove the spte from the rmap. So we unify
 these two behaviors so that both only make the spte read-only. Making a large spte
 read-only instead of non-present is also good for reducing jitter.
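The "only clear the Writable bit" behavior can be sketched portably with a C11 compare-and-swap. The SPTE_WRITABLE bit position below is hypothetical; KVM's real code uses its own spte accessors and memory-ordering rules:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical spte bit -- the real mask depends on the paging mode. */
#define SPTE_WRITABLE (1ULL << 1)

/* Clear only the writable bit; the spte stays present, so no rmap
 * removal is needed and the walk can stay lockless. */
static uint64_t spte_make_readonly(_Atomic uint64_t *sptep)
{
	uint64_t old = atomic_load(sptep);
	while (!atomic_compare_exchange_weak(sptep, &old,
					     old & ~SPTE_WRITABLE))
		;   /* old is refreshed on failure; retry on the new value */
	return old; /* pre-update value, e.g. to see if it was writable */
}
```

Returning the old value lets the caller decide whether a TLB flush is still needed (only if the spte was actually writable before).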
 
 And we need to pay more attention to the order of making the spte writable, adding
 the spte to the rmap and setting the corresponding bit in the dirty bitmap, since
 kvm_vm_ioctl_get_dirty_log() write-protects the spte based on the dirty bitmap;
 we should ensure the writable 

Re: [PATCH 6/6 v2] kvm: powerpc: use caching attributes as per linux pte

2013-08-02 Thread “tiejun.chen”

On 08/01/2013 07:12 PM, Bharat Bhushan wrote:

KVM uses same WIM tlb attributes as the corresponding qemu pte.
For this we now search the linux pte for the requested page and
get these cache caching/coherency attributes from pte.

Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com
---
v1-v2
  - Use Linux pte for wimge rather than RAM/no-RAM mechanism

  arch/powerpc/include/asm/kvm_host.h |2 +-
  arch/powerpc/kvm/booke.c|2 +-
  arch/powerpc/kvm/e500.h |8 +---
  arch/powerpc/kvm/e500_mmu_host.c|   31 ++-
  4 files changed, 25 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 3328353..583d405 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -535,6 +535,7 @@ struct kvm_vcpu_arch {
  #endif
gpa_t paddr_accessed;
gva_t vaddr_accessed;
+   pgd_t *pgdir;

u8 io_gpr; /* GPR used as IO source/target */
u8 mmio_is_bigendian;
@@ -592,7 +593,6 @@ struct kvm_vcpu_arch {
struct list_head run_list;
struct task_struct *run_task;
struct kvm_run *kvm_run;
-   pgd_t *pgdir;

spinlock_t vpa_update_lock;
struct kvmppc_vpa vpa;
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 17722d8..eb2 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -697,7 +697,7 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct 
kvm_vcpu *vcpu)
  #endif

kvmppc_fix_ee_before_entry();
-
+   vcpu->arch.pgdir = current->mm->pgd;
ret = __kvmppc_vcpu_run(kvm_run, vcpu);

/* No need for kvm_guest_exit. It's done in handle_exit.
diff --git a/arch/powerpc/kvm/e500.h b/arch/powerpc/kvm/e500.h
index 4fd9650..fc4b2f6 100644
--- a/arch/powerpc/kvm/e500.h
+++ b/arch/powerpc/kvm/e500.h
@@ -31,11 +31,13 @@ enum vcpu_ftr {
  #define E500_TLB_NUM   2

  /* entry is mapped somewhere in host TLB */
-#define E500_TLB_VALID (1 << 0)
+#define E500_TLB_VALID (1 << 31)
  /* TLB1 entry is mapped by host TLB1, tracked by bitmaps */
-#define E500_TLB_BITMAP (1 << 1)
+#define E500_TLB_BITMAP (1 << 30)
  /* TLB1 entry is mapped by host TLB0 */
-#define E500_TLB_TLB0 (1 << 2)
+#define E500_TLB_TLB0 (1 << 29)
+/* Lower 5 bits have WIMGE value */
+#define E500_TLB_WIMGE_MASK (0x1f)

  struct tlbe_ref {
pfn_t pfn;  /* valid only for TLB0, except briefly */
diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c
index 1c6a9d7..9b10b0b 100644
--- a/arch/powerpc/kvm/e500_mmu_host.c
+++ b/arch/powerpc/kvm/e500_mmu_host.c
@@ -64,15 +64,6 @@ static inline u32 e500_shadow_mas3_attrib(u32 mas3, int 
usermode)
return mas3;
  }

-static inline u32 e500_shadow_mas2_attrib(u32 mas2, int usermode)
-{
-#ifdef CONFIG_SMP
-   return (mas2 & MAS2_ATTRIB_MASK) | MAS2_M;
-#else
-   return mas2 & MAS2_ATTRIB_MASK;
-#endif
-}
-
  /*
   * writing shadow tlb entry to host TLB
   */
@@ -248,10 +239,12 @@ static inline int tlbe_is_writable(struct 
kvm_book3e_206_tlb_entry *tlbe)

  static inline void kvmppc_e500_ref_setup(struct tlbe_ref *ref,
 struct kvm_book3e_206_tlb_entry *gtlbe,
-pfn_t pfn)
+pfn_t pfn, int wimg)
  {
	ref->pfn = pfn;
	ref->flags |= E500_TLB_VALID;
+	/* Use guest supplied MAS2_G and MAS2_E */
+	ref->flags |= (gtlbe->mas2 & MAS2_ATTRIB_MASK) | wimg;

if (tlbe_is_writable(gtlbe))
kvm_set_pfn_dirty(pfn);
@@ -312,8 +305,7 @@ static void kvmppc_e500_setup_stlbe(

/* Force IPROT=0 for all guest mappings. */
	stlbe->mas1 = MAS1_TSIZE(tsize) | get_tlb_sts(gtlbe) | MAS1_VALID;
-	stlbe->mas2 = (gvaddr & MAS2_EPN) |
-		      e500_shadow_mas2_attrib(gtlbe->mas2, pr);
+	stlbe->mas2 = (gvaddr & MAS2_EPN) | (ref->flags & E500_TLB_WIMGE_MASK);
	stlbe->mas7_3 = ((u64)pfn << PAGE_SHIFT) |
		e500_shadow_mas3_attrib(gtlbe->mas7_3, pr);

@@ -332,6 +324,8 @@ static inline int kvmppc_e500_shadow_map(struct 
kvmppc_vcpu_e500 *vcpu_e500,
unsigned long hva;
int pfnmap = 0;
int tsize = BOOK3E_PAGESZ_4K;
+   pte_t pte;
+   int wimg = 0;

/*
 * Translate guest physical to true physical, acquiring
@@ -437,6 +431,8 @@ static inline int kvmppc_e500_shadow_map(struct 
kvmppc_vcpu_e500 *vcpu_e500,

if (likely(!pfnmap)) {
		unsigned long tsize_pages = 1 << (tsize + 10 - PAGE_SHIFT);
+   pgd_t *pgdir;
+
pfn = gfn_to_pfn_memslot(slot, gfn);
if (is_error_noslot_pfn(pfn)) {
			printk(KERN_ERR "Couldn't get real page for gfn %lx!\n",
@@ -447,9 +443,18 @@ static inline int kvmppc_e500_shadow_map(struct 
kvmppc_vcpu_e500 

Re: [PATCH 5/6 v2] kvm: powerpc: booke: Add linux pte lookup like booke3s

2013-08-02 Thread “tiejun.chen”

On 08/01/2013 07:12 PM, Bharat Bhushan wrote:

KVM need to lookup linux pte for getting TLB attributes (WIMGE).
This is similar to how book3s does.
This will be used in follow-up patches.

Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com
---
v1-v2
  - This is a new change in this version

  arch/powerpc/include/asm/kvm_booke.h |   73 ++
  1 files changed, 73 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_booke.h 
b/arch/powerpc/include/asm/kvm_booke.h
index d3c1eb3..903624d 100644
--- a/arch/powerpc/include/asm/kvm_booke.h
+++ b/arch/powerpc/include/asm/kvm_booke.h
@@ -102,4 +102,77 @@ static inline ulong kvmppc_get_msr(struct kvm_vcpu *vcpu)
  {
	return vcpu->arch.shared->msr;
  }
+
+/*
+ * Lock and read a linux PTE.  If it's present and writable, atomically
+ * set dirty and referenced bits and return the PTE, otherwise return 0.
+ */
+static inline pte_t kvmppc_read_update_linux_pte(pte_t *p, int writing)
+{
+   pte_t pte;
+
+#ifdef PTE_ATOMIC_UPDATES
+	pte_t tmp;
+	/* wait until _PAGE_BUSY is clear then set it atomically */
+#ifdef CONFIG_PPC64
+	__asm__ __volatile__ (
+		"1:	ldarx	%0,0,%3\n"
+		"	andi.	%1,%0,%4\n"
+		"	bne-	1b\n"
+		"	ori	%1,%0,%4\n"
+		"	stdcx.	%1,0,%3\n"
+		"	bne-	1b"
+		: "=&r" (pte), "=&r" (tmp), "=m" (*p)
+		: "r" (p), "i" (_PAGE_BUSY)
+		: "cc");
+#else
+	__asm__ __volatile__ (
+		"1:	lwarx	%0,0,%3\n"
+		"	andi.	%1,%0,%4\n"
+		"	bne-	1b\n"
+		"	ori	%1,%0,%4\n"
+		"	stwcx.	%1,0,%3\n"
+		"	bne-	1b"
+		: "=&r" (pte), "=&r" (tmp), "=m" (*p)
+		: "r" (p), "i" (_PAGE_BUSY)
+		: "cc");
+#endif
+#else
+	pte = pte_val(*p);
+#endif
+
+	if (pte_present(pte)) {
+		pte = pte_mkyoung(pte);
+		if (writing && pte_write(pte))
+			pte = pte_mkdirty(pte);
+	}
+
+   *p = pte;   /* clears _PAGE_BUSY */
+
+   return pte;
+}
+
+static inline pte_t lookup_linux_pte(pgd_t *pgdir, unsigned long hva,
+ int writing, unsigned long *pte_sizep)


Looks like this function is the same as in book3s, so why not make it common? :)

Tiejun


+{
+   pte_t *ptep;
+   unsigned long ps = *pte_sizep;
+   unsigned int shift;
+
+	ptep = find_linux_pte_or_hugepte(pgdir, hva, &shift);
+	if (!ptep)
+		return __pte(0);
+	if (shift)
+		*pte_sizep = 1ul << shift;
+	else
+		*pte_sizep = PAGE_SIZE;
+
+	if (ps > *pte_sizep)
+   return __pte(0);
+   if (!pte_present(*ptep))
+   return __pte(0);
+
+   return kvmppc_read_update_linux_pte(ptep, writing);
+}
+
  #endif /* __ASM_KVM_BOOKE_H__ */
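For readers unfamiliar with the ldarx/stdcx. idiom above, here is a portable C11 sketch of the same _PAGE_BUSY protocol. The bit position is hypothetical, and the real kernel needs the inline asm so the update is atomic with respect to the rest of the PTE handling:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define _PAGE_BUSY (1ULL << 9)  /* hypothetical bit position */

/* C11 equivalent of the ldarx/stdcx. loop:
 * wait until _PAGE_BUSY is clear, then set it atomically. */
static uint64_t pte_lock(_Atomic uint64_t *p)
{
	uint64_t old = atomic_load(p);
	for (;;) {
		if (old & _PAGE_BUSY) {  /* bne- 1b: someone holds it */
			old = atomic_load(p);
			continue;
		}
		if (atomic_compare_exchange_weak(p, &old, old | _PAGE_BUSY))
			return old;      /* the pre-lock PTE value */
	}
}

/* Writing the updated PTE back clears _PAGE_BUSY, releasing the lock. */
static void pte_unlock(_Atomic uint64_t *p, uint64_t pte)
{
	atomic_store(p, pte & ~_PAGE_BUSY);
}
```

This mirrors why kvmppc_read_update_linux_pte's final `*p = pte;` is commented "clears _PAGE_BUSY": storing the updated PTE is what releases the lock.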





Re: [PATCH 3/6 v2] kvm: powerpc: allow guest control G attribute in mas2

2013-08-02 Thread “tiejun.chen”

On 08/01/2013 07:12 PM, Bharat Bhushan wrote:

G bit in MAS2 indicates whether the page is Guarded.
There is no reason to stop the guest from setting 'G', so allow it.


Could we merge patches 2 and 3 into one?

Tiejun



Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com
---
v1-v2
  - no change

  arch/powerpc/kvm/e500.h |2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/kvm/e500.h b/arch/powerpc/kvm/e500.h
index 277cb18..4fd9650 100644
--- a/arch/powerpc/kvm/e500.h
+++ b/arch/powerpc/kvm/e500.h
@@ -117,7 +117,7 @@ static inline struct kvmppc_vcpu_e500 *to_e500(struct 
kvm_vcpu *vcpu)
  #define E500_TLB_USER_PERM_MASK (MAS3_UX|MAS3_UR|MAS3_UW)
  #define E500_TLB_SUPER_PERM_MASK (MAS3_SX|MAS3_SR|MAS3_SW)
  #define MAS2_ATTRIB_MASK \
- (MAS2_X0 | MAS2_X1 | MAS2_E)
+ (MAS2_X0 | MAS2_X1 | MAS2_E | MAS2_G)
  #define MAS3_ATTRIB_MASK \
  (MAS3_U0 | MAS3_U1 | MAS3_U2 | MAS3_U3 \
   | E500_TLB_USER_PERM_MASK | E500_TLB_SUPER_PERM_MASK)





Re: [PATCH 5/6 v2] kvm: powerpc: booke: Add linux pte lookup like booke3s

2013-08-02 Thread Scott Wood
On Thu, 2013-08-01 at 16:42 +0530, Bharat Bhushan wrote:
 KVM need to lookup linux pte for getting TLB attributes (WIMGE).
 This is similar to how book3s does.
 This will be used in follow-up patches.
 
 Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com
 ---
 v1-v2
  - This is a new change in this version
 
  arch/powerpc/include/asm/kvm_booke.h |   73 
 ++
  1 files changed, 73 insertions(+), 0 deletions(-)
 
 diff --git a/arch/powerpc/include/asm/kvm_booke.h 
 b/arch/powerpc/include/asm/kvm_booke.h
 index d3c1eb3..903624d 100644
 --- a/arch/powerpc/include/asm/kvm_booke.h
 +++ b/arch/powerpc/include/asm/kvm_booke.h
 @@ -102,4 +102,77 @@ static inline ulong kvmppc_get_msr(struct kvm_vcpu *vcpu)
  {
   return vcpu->arch.shared->msr;
  }
 +
 +/*
 + * Lock and read a linux PTE.  If it's present and writable, atomically
 + * set dirty and referenced bits and return the PTE, otherwise return 0.
 + */
 +static inline pte_t kvmppc_read_update_linux_pte(pte_t *p, int writing)
 +{
 + pte_t pte;
 +
 +#ifdef PTE_ATOMIC_UPDATES
 + pte_t tmp;
 +/* wait until _PAGE_BUSY is clear then set it atomically */

_PAGE_BUSY is 0 on book3e.

 +#ifdef CONFIG_PPC64
 +	__asm__ __volatile__ (
 +		"1:	ldarx	%0,0,%3\n"
 +		"	andi.	%1,%0,%4\n"
 +		"	bne-	1b\n"
 +		"	ori	%1,%0,%4\n"
 +		"	stdcx.	%1,0,%3\n"
 +		"	bne-	1b"
 +		: "=&r" (pte), "=&r" (tmp), "=m" (*p)
 +		: "r" (p), "i" (_PAGE_BUSY)
 +		: "cc");
 +#else
 +	__asm__ __volatile__ (
 +		"1:	lwarx	%0,0,%3\n"
 +		"	andi.	%1,%0,%4\n"
 +		"	bne-	1b\n"
 +		"	ori	%1,%0,%4\n"
 +		"	stwcx.	%1,0,%3\n"
 +		"	bne-	1b"
 +		: "=&r" (pte), "=&r" (tmp), "=m" (*p)
 +		: "r" (p), "i" (_PAGE_BUSY)
 +		: "cc");
 +#endif

What about 64-bit PTEs on 32-bit kernels?

In any case, this code does not belong in KVM.  It should be in the main
PPC mm code, even if KVM is the only user.

-Scott





Re: [PATCH 6/6 v2] kvm: powerpc: use caching attributes as per linux pte

2013-08-02 Thread Scott Wood
On Thu, Aug 01, 2013 at 04:42:38PM +0530, Bharat Bhushan wrote:
 diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
 index 17722d8..eb2 100644
 --- a/arch/powerpc/kvm/booke.c
 +++ b/arch/powerpc/kvm/booke.c
 @@ -697,7 +697,7 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct 
 kvm_vcpu *vcpu)
  #endif
  
   kvmppc_fix_ee_before_entry();
 -
 + vcpu->arch.pgdir = current->mm->pgd;
   ret = __kvmppc_vcpu_run(kvm_run, vcpu);

kvmppc_fix_ee_before_entry() is supposed to be the last thing that
happens before __kvmppc_vcpu_run().

 @@ -332,6 +324,8 @@ static inline int kvmppc_e500_shadow_map(struct 
 kvmppc_vcpu_e500 *vcpu_e500,
   unsigned long hva;
   int pfnmap = 0;
   int tsize = BOOK3E_PAGESZ_4K;
 + pte_t pte;
 + int wimg = 0;
  
   /*
* Translate guest physical to true physical, acquiring
 @@ -437,6 +431,8 @@ static inline int kvmppc_e500_shadow_map(struct 
 kvmppc_vcpu_e500 *vcpu_e500,
  
   if (likely(!pfnmap)) {
 	unsigned long tsize_pages = 1 << (tsize + 10 - PAGE_SHIFT);
 + pgd_t *pgdir;
 +
   pfn = gfn_to_pfn_memslot(slot, gfn);
   if (is_error_noslot_pfn(pfn)) {
 		printk(KERN_ERR "Couldn't get real page for gfn %lx!\n",
 @@ -447,9 +443,18 @@ static inline int kvmppc_e500_shadow_map(struct 
 kvmppc_vcpu_e500 *vcpu_e500,
 	/* Align guest and physical address to page map boundaries */
 	pfn &= ~(tsize_pages - 1);
 	gvaddr &= ~((tsize_pages << PAGE_SHIFT) - 1);
 +	pgdir = vcpu_e500->vcpu.arch.pgdir;
 +	pte = lookup_linux_pte(pgdir, hva, 1, &tsize_pages);
 +	if (pte_present(pte)) {
 +		wimg = (pte >> PTE_WIMGE_SHIFT) & MAS2_WIMGE_MASK;
 +	} else {
 +		printk(KERN_ERR "pte not present: gfn %lx, pfn %lx\n",
 +			(long)gfn, pfn);
 +		return -EINVAL;
 +	}
 	}

How does wimg get set in the pfnmap case?

Could you explain why we need to set dirty/referenced on the PTE, when we
didn't need to do that before?  All we're getting from the PTE is wimg. 
We have MMU notifiers to take care of the page being unmapped, and we've
already marked the page itself as dirty if the TLB entry is writeable.

-Scott



Re: [PATCH 5/6 v2] kvm: powerpc: booke: Add linux pte lookup like booke3s

2013-08-02 Thread Benjamin Herrenschmidt
On Fri, 2013-08-02 at 17:58 -0500, Scott Wood wrote:
 
 What about 64-bit PTEs on 32-bit kernels?
 
 In any case, this code does not belong in KVM.  It should be in the
 main
 PPC mm code, even if KVM is the only user.

Also don't we do similar things in book3s KVM? At the very least that stuff
should become common. And yes, I agree, it should probably also
move to pgtable*.

Cheers,
Ben.




RE: [PATCH 5/6 v2] kvm: powerpc: booke: Add linux pte lookup like booke3s

2013-08-02 Thread Bhushan Bharat-R65777


 -Original Message-
 From: Benjamin Herrenschmidt [mailto:b...@kernel.crashing.org]
 Sent: Saturday, August 03, 2013 4:47 AM
 To: Wood Scott-B07421
 Cc: Bhushan Bharat-R65777; ag...@suse.de; kvm-ppc@vger.kernel.org;
 k...@vger.kernel.org; linuxppc-...@lists.ozlabs.org; Bhushan Bharat-R65777
 Subject: Re: [PATCH 5/6 v2] kvm: powerpc: booke: Add linux pte lookup like
 booke3s
 
 On Fri, 2013-08-02 at 17:58 -0500, Scott Wood wrote:
 
  What about 64-bit PTEs on 32-bit kernels?
 
  In any case, this code does not belong in KVM.  It should be in the
  main PPC mm code, even if KVM is the only user.
 
 Also don't we do similar things in book3s KVM? At the very least that stuff
 should become common. And yes, I agree, it should probably also move to 
 pgtable*

One of the problems I saw was that if I put this code in asm/pgtable-32.h and 
asm/pgtable-64.h then pte_present() and other friend functions (on which this 
code depends) are defined in pgtable.h. And pgtable.h includes asm/pgtable-32.h 
and asm/pgtable-64.h before it defines pte_present() and friend functions.

OK, I will move this into asm/pgtable*.h; initially I fought with myself to put 
this code in pgtable* but finally ended up doing it here (got biased by book3s :)).

Thanks
-Bharat

 
 Cheers,
 Ben.
 
 

