Re: [PATCH v2] KVM: Synthesize G bit for all segments.

2014-07-10 Thread Jan Kiszka
On 2014-07-08 06:17, Alok Kataria wrote:
 Thanks Jan and Paolo for looking at the change, I have added a comment
 in svm_get_segment. Joerg, please consider this for the next merge.
 
 --
 
 From: Jim Mattson jmatt...@vmware.com
 
 We have noticed that qemu-kvm hangs early in the BIOS when running nested
 under some versions of VMware ESXi.
 
 The problem, we believe, is that KVM assumes the platform preserves the
 'G' bit for any segment register. The SVM specification itemizes the
 segment attribute bits that are observed by the CPU, but the (G)ranularity bit
 is not one of the bits itemized, for any segment. Though current AMD CPUs keep
 track of the (G)ranularity bit for all segment registers other than CS, the
 specification does not require it. VMware's virtual CPU may not track the
 (G)ranularity bit for any segment register.
 
 Since KVM already synthesizes the (G)ranularity bit for the CS segment, it
 should do so for all segments. The patch below does that and helps get rid of
 the hangs. The patch applies on top of Linus' tree.
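The synthesis is possible because a descriptor's limit field is only 20 bits wide, so any limit above 0xfffff can only be expressed with 4KiB (page) granularity. A minimal illustrative sketch of that relationship (plain C, not kernel code; the function name is made up):

```c
/* Illustrative only: a segment limit above 0xfffff (20 bits) cannot be
 * expressed with byte granularity, so the G bit is fully recoverable
 * from the limit alone -- which is what the patch relies on. */
static int synthesize_g_bit(unsigned long limit)
{
    return limit > 0xfffff;
}
```
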
 
 Signed-off-by: Jim Mattson jmatt...@vmware.com
 Signed-off-by: Alok N Kataria akata...@vmware.com
 
 Index: linux-2.6/arch/x86/kvm/svm.c
 ===
 --- linux-2.6.orig/arch/x86/kvm/svm.c 2014-07-07 15:32:52.724368183 +0530
 +++ linux-2.6/arch/x86/kvm/svm.c  2014-07-08 09:30:29.124431069 +0530
 @@ -1415,7 +1415,13 @@
   var->avl = (s->attrib >> SVM_SELECTOR_AVL_SHIFT) & 1;
   var->l = (s->attrib >> SVM_SELECTOR_L_SHIFT) & 1;
   var->db = (s->attrib >> SVM_SELECTOR_DB_SHIFT) & 1;
 - var->g = (s->attrib >> SVM_SELECTOR_G_SHIFT) & 1;
 +
 + /*
 +  * SVM spec doesn't require the platform to track the G bit for all
 +  * segments, so similar to CS, let's synthesize this bit for all
 +  * segments.

Either I misunderstand the reference to CS, or it no longer applies
once the patch is in. I would suggest removing that part of the sentence.

Jan

 +  */
 + var->g = s->limit > 0xfffff;
  
   /*
* AMD's VMCB does not have an explicit unusable field, so emulate it
 @@ -1424,14 +1430,6 @@
   var->unusable = !var->present || (var->type == 0);
  
   switch (seg) {
 - case VCPU_SREG_CS:
 - /*
 -  * SVM always stores 0 for the 'G' bit in the CS selector in
 -  * the VMCB on a VMEXIT. This hurts cross-vendor migration:
 -  * Intel's VMENTRY has a check on the 'G' bit.
 -  */
 - var->g = s->limit > 0xfffff;
 - break;
   case VCPU_SREG_TR:
   /*
* Work around a bug where the busy flag in the tr selector
 
 
 

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V4 1/2] perf ignore LBR and extra_regs.

2014-07-10 Thread Peter Zijlstra
/me reminds you of 78 char text wrap.

On Wed, Jul 09, 2014 at 07:32:09PM +, Liang, Kan wrote:
  Sure; but what I meant was, check_msr() is broken when run on such a
  kernel. You need to fix check_msr() to return failure on these 'ignored'
  MSRs; after all, they don't function as expected, they're effectively broken.
 
 The function is designed to check if the MSRs can be safely accessed
 (no #GP). It cannot guarantee the correctness of the MSRs. If KVM
 applied patch 2 and the guest applied patch 1, from the guest's
 perspective the MSRs can be accessed (no #GP triggered). So returning
 true is expected. It should not be considered broken.

You're not understanding. I know you wrote that function to do that. I'm
saying that's wrong.

Look at check_hw_exists() it explicitly checks for fake MSRs and reports
them broken.

These fake MSRs _ARE_ broken, they do not behave as expected. Not
crashing is not the right consideration here, we're interested in higher
order correct behaviour.

 The only unexpected
 thing for the guest is that the counting/sampling result for LBR/extra
 regs is always 0. But the patch is a short-term fix to stop things from
 crashing. I think it should be acceptable.

Patch 2 is fine, patch 1, in particular your check_msr() routine is not.
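The distinction Peter is drawing can be shown with a toy model (all names below are illustrative, not kernel APIs): an MSR that can be read and written without faulting, but that silently drops writes, is still broken. A check that only tests "no #GP" cannot detect this; writing a magic value and reading it back can, which is the kind of test check_hw_exists() performs on real counter MSRs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for an MSR; "ignored" MSRs accept accesses but drop writes. */
struct toy_msr {
    uint64_t value;
    bool ignore_writes;
};

static void toy_wrmsr(struct toy_msr *m, uint64_t v)
{
    if (!m->ignore_writes)
        m->value = v;          /* an "ignored" MSR silently drops the write */
}

/* The stricter check: write a magic value, verify it reads back, restore. */
static bool msr_really_works(struct toy_msr *m)
{
    uint64_t old = m->value;
    bool ok;

    toy_wrmsr(m, 0xabcdULL);
    ok = (m->value == 0xabcdULL);
    toy_wrmsr(m, old);         /* restore the original contents */
    return ok;
}
```

A check_msr() shaped like this would report the fake MSRs as broken even though accessing them never faults.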


Re: [PATCH] KVM: svm: virtualize MSR reads from MC4_MISC1

2014-07-10 Thread Matthias Lange
On Wed, Jul 09, 2014 at 06:26:23PM +0200, Paolo Bonzini wrote:
 Il 26/06/2014 14:22, Matthias Lange ha scritto:
 Linux' AMD MCE code tries to read from the MC4_MISC1 (0xc408) MSR.
 Because this read is not virtualized within KVM, a #GP is injected into
 the guest. This patch handles guest reads from MC4_MISC1 and returns 0
 to the guest.
 
 Signed-off-by: Matthias Lange matthias.la...@kernkonzept.com
 ---
  arch/x86/kvm/x86.c | 3 +++
  1 file changed, 3 insertions(+)
 
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index 24d70d4..d5a305b 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -2530,6 +2530,9 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
  return 1;
  data = vcpu->arch.osvw.status;
  break;
  break;
 +case 0xc408: /* MC4_MISC1 */
 +data = 0;
 +break;
  default:
  if (kvm_pmu_msr(vcpu, msr))
  return kvm_pmu_get_msr(vcpu, msr, pdata);
 
 
 Why don't you have to do the same with MC4_MISC0?

MC4_MISC0 (0x413) is handled in kvm_get_msr_common by this statement

  case MSR_IA32_MC0_CTL ... MSR_IA32_MC0_CTL + 4 * KVM_MAX_MCE_BANKS - 1:
  return get_msr_mce(vcpu, msr, pdata);

and get_msr_mce returns a sane value from the vcpu mce_banks.

 Is your kernel compiled with or without CONFIG_PARAVIRT?  Paravirt does
 
 .read_msr = native_read_msr_safe,
 .write_msr = native_write_msr_safe,
 
 so you shouldn't get these #GP exceptions.

I used the default i386_defconfig to build the guest kernel. CONFIG_PARAVIRT
is disabled in this setup.

Matthias.


[PATCH] powerpc/kvm: Create proper names for the kvm_host_state PMU fields

2014-07-10 Thread Michael Ellerman
We have two arrays in kvm_host_state that contain register values for
the PMU. Currently we only create an asm-offsets symbol for the base of
the arrays, and do the array offset in the assembly code.

Creating an asm-offsets symbol for each field individually makes the
code much nicer to read, particularly for the MMCRx/SIxR/SDAR fields, and
might have helped us notice the recent double restore bug we had in this
code.
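The benefit can be sketched in plain C (the struct and symbol names below are simplified stand-ins, not the real kvmppc_host_state layout): with only a base symbol, the assembly hard-codes "HSTATE_MMCR + 8 * i" by hand; one symbol per element lets each slot be addressed by name, as asm-offsets generates per field.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the two PMU arrays in kvm_host_state. */
struct host_state_demo {
    uint64_t host_mmcr[7];   /* MMCR0, MMCR1, MMCRA, SIAR, SDAR, MMCR2, SIER */
    uint32_t host_pmc[8];    /* PMC1..PMC8 */
};

/* Roughly what asm-offsets would emit as per-field constants. */
enum {
    HSTATE_MMCR0 = offsetof(struct host_state_demo, host_mmcr[0]),
    HSTATE_MMCR1 = offsetof(struct host_state_demo, host_mmcr[1]),
    HSTATE_PMC1  = offsetof(struct host_state_demo, host_pmc[0]),
    HSTATE_PMC2  = offsetof(struct host_state_demo, host_pmc[1]),
};
```

Each named constant replaces a "base + hand-computed offset" expression in the assembly, which is exactly where an off-by-one slot (like the double-restore bug) can hide.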

Signed-off-by: Michael Ellerman m...@ellerman.id.au
---

This is on top of "powerpc/kvm: Remove redundant save of SIER AND MMCR2".

 arch/powerpc/kernel/asm-offsets.c   | 17 +++--
 arch/powerpc/kvm/book3s_hv_interrupts.S | 30 +++---
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 32 
 3 files changed, 46 insertions(+), 33 deletions(-)

diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index f5995a912213..9adadb1db9b8 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -645,8 +645,21 @@ int main(void)
HSTATE_FIELD(HSTATE_SAVED_XIRR, saved_xirr);
HSTATE_FIELD(HSTATE_HOST_IPI, host_ipi);
HSTATE_FIELD(HSTATE_PTID, ptid);
-   HSTATE_FIELD(HSTATE_MMCR, host_mmcr);
-   HSTATE_FIELD(HSTATE_PMC, host_pmc);
+   HSTATE_FIELD(HSTATE_MMCR0, host_mmcr[0]);
+   HSTATE_FIELD(HSTATE_MMCR1, host_mmcr[1]);
+   HSTATE_FIELD(HSTATE_MMCRA, host_mmcr[2]);
+   HSTATE_FIELD(HSTATE_SIAR, host_mmcr[3]);
+   HSTATE_FIELD(HSTATE_SDAR, host_mmcr[4]);
+   HSTATE_FIELD(HSTATE_MMCR2, host_mmcr[5]);
+   HSTATE_FIELD(HSTATE_SIER, host_mmcr[6]);
+   HSTATE_FIELD(HSTATE_PMC1, host_pmc[0]);
+   HSTATE_FIELD(HSTATE_PMC2, host_pmc[1]);
+   HSTATE_FIELD(HSTATE_PMC3, host_pmc[2]);
+   HSTATE_FIELD(HSTATE_PMC4, host_pmc[3]);
+   HSTATE_FIELD(HSTATE_PMC5, host_pmc[4]);
+   HSTATE_FIELD(HSTATE_PMC6, host_pmc[5]);
+   HSTATE_FIELD(HSTATE_PMC7, host_pmc[6]);
+   HSTATE_FIELD(HSTATE_PMC8, host_pmc[7]);
HSTATE_FIELD(HSTATE_PURR, host_purr);
HSTATE_FIELD(HSTATE_SPURR, host_spurr);
HSTATE_FIELD(HSTATE_DSCR, host_dscr);
diff --git a/arch/powerpc/kvm/book3s_hv_interrupts.S b/arch/powerpc/kvm/book3s_hv_interrupts.S
index 731be7478b27..16ef3163a8b6 100644
--- a/arch/powerpc/kvm/book3s_hv_interrupts.S
+++ b/arch/powerpc/kvm/book3s_hv_interrupts.S
@@ -97,15 +97,15 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_206)
mfspr   r5, SPRN_MMCR1
mfspr   r9, SPRN_SIAR
mfspr   r10, SPRN_SDAR
-   std r7, HSTATE_MMCR(r13)
-   std r5, HSTATE_MMCR + 8(r13)
-   std r6, HSTATE_MMCR + 16(r13)
-   std r9, HSTATE_MMCR + 24(r13)
-   std r10, HSTATE_MMCR + 32(r13)
+   std r7, HSTATE_MMCR0(r13)
+   std r5, HSTATE_MMCR1(r13)
+   std r6, HSTATE_MMCRA(r13)
+   std r9, HSTATE_SIAR(r13)
+   std r10, HSTATE_SDAR(r13)
 BEGIN_FTR_SECTION
mfspr   r9, SPRN_SIER
-   std r8, HSTATE_MMCR + 40(r13)
-   std r9, HSTATE_MMCR + 48(r13)
+   std r8, HSTATE_MMCR2(r13)
+   std r9, HSTATE_SIER(r13)
 END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
mfspr   r3, SPRN_PMC1
mfspr   r5, SPRN_PMC2
@@ -117,15 +117,15 @@ BEGIN_FTR_SECTION
mfspr   r10, SPRN_PMC7
mfspr   r11, SPRN_PMC8
 END_FTR_SECTION_IFSET(CPU_FTR_ARCH_201)
-   stw r3, HSTATE_PMC(r13)
-   stw r5, HSTATE_PMC + 4(r13)
-   stw r6, HSTATE_PMC + 8(r13)
-   stw r7, HSTATE_PMC + 12(r13)
-   stw r8, HSTATE_PMC + 16(r13)
-   stw r9, HSTATE_PMC + 20(r13)
+   stw r3, HSTATE_PMC1(r13)
+   stw r5, HSTATE_PMC2(r13)
+   stw r6, HSTATE_PMC3(r13)
+   stw r7, HSTATE_PMC4(r13)
+   stw r8, HSTATE_PMC5(r13)
+   stw r9, HSTATE_PMC6(r13)
 BEGIN_FTR_SECTION
-   stw r10, HSTATE_PMC + 24(r13)
-   stw r11, HSTATE_PMC + 28(r13)
+   stw r10, HSTATE_PMC7(r13)
+   stw r11, HSTATE_PMC8(r13)
 END_FTR_SECTION_IFSET(CPU_FTR_ARCH_201)
 31:
 
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 868347ef09fd..d68ecb33b52a 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -87,20 +87,20 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_207S)
cmpwi   r4, 0
beq 23f /* skip if not */
 BEGIN_FTR_SECTION
-   ld  r3, HSTATE_MMCR(r13)
+   ld  r3, HSTATE_MMCR0(r13)
andi.   r4, r3, MMCR0_PMAO_SYNC | MMCR0_PMAO
cmpwi   r4, MMCR0_PMAO
beqlkvmppc_fix_pmao
 END_FTR_SECTION_IFSET(CPU_FTR_PMAO_BUG)
-   lwz r3, HSTATE_PMC(r13)
-   lwz r4, HSTATE_PMC + 4(r13)
-   lwz r5, HSTATE_PMC + 8(r13)
-   lwz r6, HSTATE_PMC + 12(r13)
-   lwz r8, HSTATE_PMC + 16(r13)
-   lwz r9, HSTATE_PMC + 20(r13)
+   lwz r3, HSTATE_PMC1(r13)
+   lwz 

Re: [PATCH] powerpc/kvm: Create proper names for the kvm_host_state PMU fields

2014-07-10 Thread Alexander Graf


On 10.07.14 11:34, Michael Ellerman wrote:

We have two arrays in kvm_host_state that contain register values for
the PMU. Currently we only create an asm-offsets symbol for the base of
the arrays, and do the array offset in the assembly code.

Creating an asm-offsets symbol for each field individually makes the
code much nicer to read, particularly for the MMCRx/SIxR/SDAR fields, and
might have helped us notice the recent double restore bug we had in this
code.

Signed-off-by: Michael Ellerman m...@ellerman.id.au


Acked-by: Alexander Graf ag...@suse.de

I still think this whole code path should just be C though.


Alex



[3.11.y.z extended stable] Patch MIPS: KVM: Remove redundant NULL checks before kfree() has been added to staging queue

2014-07-10 Thread Luis Henriques
This is a note to let you know that I have just added a patch titled

MIPS: KVM: Remove redundant NULL checks before kfree()

to the linux-3.11.y-queue branch of the 3.11.y.z extended stable tree 
which can be found at:

 
http://kernel.ubuntu.com/git?p=ubuntu/linux.git;a=shortlog;h=refs/heads/linux-3.11.y-queue

If you, or anyone else, feels it should not be added to this tree, please 
reply to this email.

For more information about the 3.11.y.z tree, see
https://wiki.ubuntu.com/Kernel/Dev/ExtendedStable

Thanks.
-Luis

--

From 58d267632e7edae4b73fe577261713f68a0038d8 Mon Sep 17 00:00:00 2001
From: James Hogan james.ho...@imgtec.com
Date: Thu, 29 May 2014 10:16:44 +0100
Subject: MIPS: KVM: Remove redundant NULL checks before kfree()

commit c6c0a6637f9da54f9472144d44f71cf847f92e20 upstream.

The kfree() function already NULL checks the parameter so remove the
redundant NULL checks before kfree() calls in arch/mips/kvm/.

Signed-off-by: James Hogan james.ho...@imgtec.com
Cc: Paolo Bonzini pbonz...@redhat.com
Cc: Gleb Natapov g...@kernel.org
Cc: kvm@vger.kernel.org
Cc: Ralf Baechle r...@linux-mips.org
Cc: linux-m...@linux-mips.org
Cc: Sanjay Lal sanj...@kymasys.com
Signed-off-by: Paolo Bonzini pbonz...@redhat.com
Signed-off-by: Luis Henriques luis.henriq...@canonical.com
---
 arch/mips/kvm/kvm_mips.c | 12 +++-
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/arch/mips/kvm/kvm_mips.c b/arch/mips/kvm/kvm_mips.c
index 426345ac6f6e..7e78af0e57de 100644
--- a/arch/mips/kvm/kvm_mips.c
+++ b/arch/mips/kvm/kvm_mips.c
@@ -149,9 +149,7 @@ void kvm_mips_free_vcpus(struct kvm *kvm)
	if (kvm->arch.guest_pmap[i] != KVM_INVALID_PAGE)
		kvm_mips_release_pfn_clean(kvm->arch.guest_pmap[i]);
	}
-
-	if (kvm->arch.guest_pmap)
-		kfree(kvm->arch.guest_pmap);
+	kfree(kvm->arch.guest_pmap);

kvm_for_each_vcpu(i, vcpu, kvm) {
kvm_arch_vcpu_free(vcpu);
@@ -384,12 +382,8 @@ void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)

kvm_mips_dump_stats(vcpu);

-	if (vcpu->arch.guest_ebase)
-		kfree(vcpu->arch.guest_ebase);
-
-	if (vcpu->arch.kseg0_commpage)
-		kfree(vcpu->arch.kseg0_commpage);
-
+	kfree(vcpu->arch.guest_ebase);
+	kfree(vcpu->arch.kseg0_commpage);
 }

 void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
--
1.9.1



Re: [Qemu-devel] [PATCH v5 00/12] KVM Support for MIPS32 Processors

2014-07-10 Thread Peter Maydell
On 17 June 2014 23:10, James Hogan james.ho...@imgtec.com wrote:
 The patchset depends on v4 of "target-mips: implement UserLocal
 Register". I'm aiming for QEMU 2.1, hopefully it isn't too late to get
 some final review.

 Thanks to everybody who has already taken part in review.

 This patchset implements KVM support for MIPS32 processors, using Trap &
 Emulation.

I was looking at what happens for MMIO accesses to nonexistent
memory when we're running under KVM, and interestingly this
patchset means MIPS is now the only CPU which both (a) supports
KVM and (b) has an implementation of the do_unassigned_access()
CPU hook. Does the current code support a guest attempt to
access unassigned physical addresses? I had a look at the code
and it seems like mips_cpu_unassigned_access() will end up
calling cpu_restore_state() and cpu_loop_exit(), which I think
will probably crash if you call them when using KVM rather than
TCG...

More generally, there doesn't really seem to be provision in the
KVM KVM_EXIT_MMIO API for returning "this access failed".
I guess in theory userspace could do all the "figure out how
to adjust CPU state to do exception entry and then run VCPU",
but that seems like quite a lot of work which the kernel already
knows how to do; is there some way to provide a simpler API
that lets userspace just inform the kernel that it needs to
fault the access?

thanks
-- PMM


Re: [PATCH v2] KVM: Synthesize G bit for all segments.

2014-07-10 Thread Paolo Bonzini
Il 10/07/2014 10:55, Jan Kiszka ha scritto:
  +  /*
  +   * SVM spec doesn't require the platform to track the G bit for all
  +   * segments, so similar to CS, let's synthesize this bit for all
  +   * segments.
 Either I misunderstand the reference to CS or it does no longer apply
 once the patch is in. I would suggest to remove that part of the sentence.

Something like this:

/*
 * The SVM spec doesn't require the platform to track the 'G' bit for
 * all segments.  Current processors track it for all segments except
 * CS, but other hypervisors may not do so.  So let's synthesize this
 * bit always to help running KVM nested.  It also helps cross-vendor
 * migration, because Intel's vmentry has a check on the 'G' bit.
 */


Paolo


Re: [Qemu-devel] [PATCH v5 00/12] KVM Support for MIPS32 Processors

2014-07-10 Thread Paolo Bonzini

Il 10/07/2014 14:17, Peter Maydell ha scritto:


More generally, there doesn't really seem to be provision in the
KVM KVM_EXIT_MMIO API for returning "this access failed".
I guess in theory userspace could do all the "figure out how
to adjust CPU state to do exception entry and then run VCPU",
but that seems like quite a lot of work which the kernel already
knows how to do; is there some way to provide a simpler API
that lets userspace just inform the kernel that it needs to
fault the access?


There are 3 free padding bytes in struct kvm_run's mmio field.  It's 
possible to add a per-VM capability and have the kernel check one of 
these bytes when the capability is set.
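Those 3 bytes fall out of the layout of the KVM_EXIT_MMIO payload. Transcribed with stdint types for illustration (the kernel's own definition in `<linux/kvm.h>` uses `__u64`/`__u8`/`__u32`), its fields occupy 8 + 8 + 4 + 1 = 21 bytes, and 8-byte alignment pads the struct to 24 on common 64-bit ABIs:

```c
#include <stdint.h>

/* The mmio member of struct kvm_run's exit-reason union, with stdint
 * types substituted for the kernel's __u64/__u8/__u32. The 3 bytes of
 * tail padding after is_write are the free bytes Paolo refers to. */
struct kvm_mmio_exit {
    uint64_t phys_addr;
    uint8_t  data[8];
    uint32_t len;
    uint8_t  is_write;
    /* 3 bytes of tail padding here on 64-bit ABIs */
};
```

A new per-VM capability could let the kernel read one of those padding bytes after the exit, e.g. as an "access failed, deliver a fault" flag set by userspace.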


Paolo


[PATCH/RFC 2/5] KVM: s390: move finalization of SIGP STOP orders to kvm_s390_vcpu_stop

2014-07-10 Thread Christian Borntraeger
From: David Hildenbrand d...@linux.vnet.ibm.com

Let's move the finalization of SIGP STOP and SIGP STOP AND STORE STATUS
orders to the point where the VCPU is actually stopped.

This change is needed to prepare for a user space driven VCPU state change. The
action_bits may only be cleared when setting the cpu state to STOPPED while
holding the local irq lock.

Signed-off-by: David Hildenbrand d...@linux.vnet.ibm.com
Reviewed-by: Cornelia Huck cornelia.h...@de.ibm.com
Reviewed-by: Christian Borntraeger borntrae...@de.ibm.com
Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com
---
 arch/s390/kvm/intercept.c | 31 ---
 arch/s390/kvm/kvm-s390.c  |  8 
 2 files changed, 20 insertions(+), 19 deletions(-)

diff --git a/arch/s390/kvm/intercept.c b/arch/s390/kvm/intercept.c
index a0b586c..ac6b325 100644
--- a/arch/s390/kvm/intercept.c
+++ b/arch/s390/kvm/intercept.c
@@ -56,32 +56,25 @@ static int handle_noop(struct kvm_vcpu *vcpu)
 static int handle_stop(struct kvm_vcpu *vcpu)
 {
int rc = 0;
+   unsigned int action_bits;
 
	vcpu->stat.exit_stop_request++;
-	spin_lock_bh(&vcpu->arch.local_int.lock);
-
	trace_kvm_s390_stop_request(vcpu->arch.local_int.action_bits);
 
-	if (vcpu->arch.local_int.action_bits & ACTION_STOP_ON_STOP) {
-		kvm_s390_vcpu_stop(vcpu);
-		vcpu->arch.local_int.action_bits &= ~ACTION_STOP_ON_STOP;
-		VCPU_EVENT(vcpu, 3, "%s", "cpu stopped");
-		rc = -EOPNOTSUPP;
-	}
	action_bits = vcpu->arch.local_int.action_bits;
 
-	if (vcpu->arch.local_int.action_bits & ACTION_STORE_ON_STOP) {
-		vcpu->arch.local_int.action_bits &= ~ACTION_STORE_ON_STOP;
-		/* store status must be called unlocked. Since local_int.lock
-		 * only protects local_int.* and not guest memory we can give
-		 * up the lock here */
-		spin_unlock_bh(&vcpu->arch.local_int.lock);
+	if (!(action_bits & ACTION_STOP_ON_STOP))
+		return 0;
+
+	if (action_bits & ACTION_STORE_ON_STOP) {
rc = kvm_s390_vcpu_store_status(vcpu,
KVM_S390_STORE_STATUS_NOADDR);
-		if (rc >= 0)
-			rc = -EOPNOTSUPP;
-	} else
-		spin_unlock_bh(&vcpu->arch.local_int.lock);
-	return rc;
+   if (rc)
+   return rc;
+   }
+
+   kvm_s390_vcpu_stop(vcpu);
+   return -EOPNOTSUPP;
 }
 
 static int handle_validity(struct kvm_vcpu *vcpu)
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 2f3e14f..c507789 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -1494,7 +1494,15 @@ void kvm_s390_vcpu_stop(struct kvm_vcpu *vcpu)
	spin_lock_bh(&vcpu->kvm->arch.start_stop_lock);
	online_vcpus = atomic_read(&vcpu->kvm->online_vcpus);
 
+	/* Need to lock access to action_bits to avoid a SIGP race condition */
+	spin_lock_bh(&vcpu->arch.local_int.lock);
	atomic_set_mask(CPUSTAT_STOPPED, &vcpu->arch.sie_block->cpuflags);
+
+	/* SIGP STOP and SIGP STOP AND STORE STATUS has been fully processed */
+	vcpu->arch.local_int.action_bits &=
+		~(ACTION_STOP_ON_STOP | ACTION_STORE_ON_STOP);
+	spin_unlock_bh(&vcpu->arch.local_int.lock);
+
__disable_ibs_on_vcpu(vcpu);
 
	for (i = 0; i < online_vcpus; i++) {
-- 
1.8.4.2



[PATCH/RFC 0/5] KVM: s390: Let user space control the cpu states

2014-07-10 Thread Christian Borntraeger
Paolo,

here is a preview of a rework of CPU state on s390x. Since we extend 
MP_STATE I wanted to send an RFC upfront. I will wait for some feedback
and send a proper pull request in the next week:

-

This series enables the KVM_SET_MP_STATE ioctl on s390 to make the cpu
state settable by user space.

This helps us to avoid races in SIGP/Reset handling that happen because
some SIGPs are handled in QEMU, others in the KERNEL. 

As soon as the new interface is touched, user space takes complete control
of the cpu states. Otherwise the old way is used.
Therefore, the new kernel should continue to work fine with old QEMUs.


David Hildenbrand (5):
  KVM: s390: allow only one SIGP STOP (AND STORE STATUS) at a time
  KVM: s390: move finalization of SIGP STOP orders to kvm_s390_vcpu_stop
  KVM: s390: remove __cpu_is_stopped and expose is_vcpu_stopped
  KVM: prepare for KVM_(S|G)ET_MP_STATE on other architectures
  KVM: s390: implement KVM_(S|G)ET_MP_STATE for user space state control

 Documentation/virtual/kvm/api.txt | 31 ++-
 arch/s390/include/asm/kvm_host.h  |  1 +
 arch/s390/kvm/diag.c  |  3 ++-
 arch/s390/kvm/intercept.c | 32 ++--
 arch/s390/kvm/kvm-s390.c  | 52 +++
 arch/s390/kvm/kvm-s390.h  | 10 ++--
 arch/s390/kvm/sigp.c  |  7 +-
 include/uapi/linux/kvm.h  |  7 +-
 8 files changed, 98 insertions(+), 45 deletions(-)

-- 
1.8.4.2



[PATCH/RFC 3/5] KVM: s390: remove __cpu_is_stopped and expose is_vcpu_stopped

2014-07-10 Thread Christian Borntraeger
From: David Hildenbrand d...@linux.vnet.ibm.com

The function __cpu_is_stopped is not used any more. Let's remove it and
expose the function is_vcpu_stopped instead, which is actually what we want.

This patch also converts an open coded check for CPUSTAT_STOPPED to
is_vcpu_stopped().

Signed-off-by: David Hildenbrand d...@linux.vnet.ibm.com
Reviewed-by: Cornelia Huck cornelia.h...@de.ibm.com
Acked-by: Christian Borntraeger borntrae...@de.ibm.com
Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com
---
 arch/s390/kvm/kvm-s390.c | 7 +--
 arch/s390/kvm/kvm-s390.h | 4 ++--
 2 files changed, 3 insertions(+), 8 deletions(-)

diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index c507789..3428953 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -926,7 +926,7 @@ static int kvm_arch_vcpu_ioctl_set_initial_psw(struct kvm_vcpu *vcpu, psw_t psw)
 {
int rc = 0;
 
-	if (!(atomic_read(&vcpu->arch.sie_block->cpuflags) & CPUSTAT_STOPPED))
+	if (!is_vcpu_stopped(vcpu))
		rc = -EBUSY;
	else {
		vcpu->run->psw_mask = psw.mask;
@@ -1413,11 +1413,6 @@ int kvm_s390_vcpu_store_status(struct kvm_vcpu *vcpu, unsigned long addr)
return kvm_s390_store_status_unloaded(vcpu, addr);
 }
 
-static inline int is_vcpu_stopped(struct kvm_vcpu *vcpu)
-{
-	return atomic_read(&(vcpu)->arch.sie_block->cpuflags) & CPUSTAT_STOPPED;
-}
-
 static void __disable_ibs_on_vcpu(struct kvm_vcpu *vcpu)
 {
kvm_check_request(KVM_REQ_ENABLE_IBS, vcpu);
diff --git a/arch/s390/kvm/kvm-s390.h b/arch/s390/kvm/kvm-s390.h
index a8655ed..77ed846 100644
--- a/arch/s390/kvm/kvm-s390.h
+++ b/arch/s390/kvm/kvm-s390.h
@@ -45,9 +45,9 @@ do { \
  d_args); \
 } while (0)
 
-static inline int __cpu_is_stopped(struct kvm_vcpu *vcpu)
+static inline int is_vcpu_stopped(struct kvm_vcpu *vcpu)
 {
-	return atomic_read(&vcpu->arch.sie_block->cpuflags) & CPUSTAT_STOP_INT;
+	return atomic_read(&vcpu->arch.sie_block->cpuflags) & CPUSTAT_STOPPED;
 }
 
 static inline int kvm_is_ucontrol(struct kvm *kvm)
-- 
1.8.4.2



[PATCH/RFC 0/5] s390x/kvm: track the logical cpu state in QEMU and propagate it to kvm

2014-07-10 Thread Christian Borntraeger
This is the qemu part of kernel series Let user space control the
cpu states

Christian Borntraeger (1):
  update linux headers with cpustate changes

David Hildenbrand (4):
  s390x/kvm: introduce proper states for s390 cpus
  s390x/kvm: proper use of the cpu states OPERATING and STOPPED
  s390x/kvm: test whether a cpu is STOPPED when checking has_work
  s390x/kvm: propagate s390 cpu state to kvm

 hw/s390x/ipl.c|   2 +-
 hw/s390x/s390-virtio.c|  32 --
 linux-headers/linux/kvm.h |   7 ++-
 target-s390x/cpu.c| 106 +++---
 target-s390x/cpu.h|  33 +--
 target-s390x/helper.c |  11 ++---
 target-s390x/kvm.c|  49 ++---
 trace-events  |   6 +++
 8 files changed, 179 insertions(+), 67 deletions(-)

-- 
1.8.4.2



[PATCH/RFC 1/5] KVM: s390: allow only one SIGP STOP (AND STORE STATUS) at a time

2014-07-10 Thread Christian Borntraeger
From: David Hildenbrand d...@linux.vnet.ibm.com

A SIGP STOP (AND STORE STATUS) order is complete as soon as the VCPU has been
stopped. This patch makes sure that only one SIGP STOP (AND STORE STATUS) may
be pending at a time (as defined by the architecture). If the action_bits are
still set, a SIGP STOP has been issued but not completed yet. The VCPU is busy
for further SIGP STOP orders.

Also set the CPUSTAT_STOP_INT after the action_bits variable has been modified
(the same order that is used when injecting a KVM_S390_SIGP_STOP from
userspace).

Both changes are needed in preparation for a user space driven VCPU state change
(to avoid race conditions).
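The busy rule can be modeled in a few lines of plain C (constants and names are illustrative, loosely following arch/s390/kvm; SIGP condition code 2 means "busy"): while a STOP is still pending in action_bits, a second STOP order is rejected rather than queued, and the pending bit is only cleared once the VCPU has really stopped.

```c
/* Illustrative constants, loosely modeled on arch/s390/kvm. */
#define ACTION_STOP_ON_STOP    1u
#define SIGP_CC_ORDER_ACCEPTED 0
#define SIGP_CC_BUSY           2

/* A second SIGP STOP while one is pending gets condition code "busy". */
static int inject_sigp_stop(unsigned int *action_bits)
{
    if (*action_bits & ACTION_STOP_ON_STOP)
        return SIGP_CC_BUSY;            /* previous STOP not finalized yet */
    *action_bits |= ACTION_STOP_ON_STOP;
    return SIGP_CC_ORDER_ACCEPTED;
}

/* Finalization, which patch 2/5 moves to the point the VCPU really stops. */
static void vcpu_stopped(unsigned int *action_bits)
{
    *action_bits &= ~ACTION_STOP_ON_STOP;
}
```
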

Signed-off-by: David Hildenbrand d...@linux.vnet.ibm.com
Reviewed-by: Cornelia Huck cornelia.h...@de.ibm.com
Reviewed-by: Christian Borntraeger borntrae...@de.ibm.com
Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com
---
 arch/s390/kvm/sigp.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/s390/kvm/sigp.c b/arch/s390/kvm/sigp.c
index 43079a4..fd7fb5c 100644
--- a/arch/s390/kvm/sigp.c
+++ b/arch/s390/kvm/sigp.c
@@ -136,6 +136,11 @@ static int __inject_sigp_stop(struct kvm_s390_local_interrupt *li, int action)
	inti->type = KVM_S390_SIGP_STOP;
 
	spin_lock_bh(&li->lock);
+	if (li->action_bits & ACTION_STOP_ON_STOP) {
+   /* another SIGP STOP is pending */
+   rc = SIGP_CC_BUSY;
+   goto out;
+   }
	if ((atomic_read(li->cpuflags) & CPUSTAT_STOPPED)) {
		kfree(inti);
		if ((action & ACTION_STORE_ON_STOP) != 0)
@@ -144,8 +149,8 @@ static int __inject_sigp_stop(struct kvm_s390_local_interrupt *li, int action)
	}
	list_add_tail(&inti->list, &li->list);
	atomic_set(&li->active, 1);
-	atomic_set_mask(CPUSTAT_STOP_INT, li->cpuflags);
	li->action_bits |= action;
+	atomic_set_mask(CPUSTAT_STOP_INT, li->cpuflags);
	if (waitqueue_active(&li->wq))
		wake_up_interruptible(&li->wq);
 out:
-- 
1.8.4.2



[PATCH/RFC 4/5] KVM: prepare for KVM_(S|G)ET_MP_STATE on other architectures

2014-07-10 Thread Christian Borntraeger
From: David Hildenbrand d...@linux.vnet.ibm.com

Highlight the aspects of the ioctls that are actually specific to x86
and ia64. As defined restrictions (irqchip) and mp states may not apply
to other architectures, these parts are flagged to belong to x86 and ia64.

In preparation for the use of KVM_(S|G)ET_MP_STATE by s390.
Fix a spelling error (KVM_SET_MP_STATE vs. KVM_SET_MPSTATE) on the way.

Signed-off-by: David Hildenbrand d...@linux.vnet.ibm.com
Reviewed-by: Cornelia Huck cornelia.h...@de.ibm.com
Acked-by: Christian Borntraeger borntrae...@de.ibm.com
Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com
---
 Documentation/virtual/kvm/api.txt | 21 -
 include/uapi/linux/kvm.h  |  3 ++-
 2 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 0fe3649..904c61c 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -988,18 +988,20 @@ uniprocessor guests).
 
 Possible values are:
 
- - KVM_MP_STATE_RUNNABLE:the vcpu is currently running
+ - KVM_MP_STATE_RUNNABLE:the vcpu is currently running [x86, ia64]
  - KVM_MP_STATE_UNINITIALIZED:   the vcpu is an application processor (AP)
- which has not yet received an INIT signal
+ which has not yet received an INIT signal 
[x86,
+ ia64]
  - KVM_MP_STATE_INIT_RECEIVED:   the vcpu has received an INIT signal, and is
- now ready for a SIPI
+ now ready for a SIPI [x86, ia64]
  - KVM_MP_STATE_HALTED:  the vcpu has executed a HLT instruction and
- is waiting for an interrupt
+ is waiting for an interrupt [x86, ia64]
  - KVM_MP_STATE_SIPI_RECEIVED:   the vcpu has just received a SIPI (vector
- accessible via KVM_GET_VCPU_EVENTS)
+ accessible via KVM_GET_VCPU_EVENTS) [x86, 
ia64]
 
-This ioctl is only useful after KVM_CREATE_IRQCHIP.  Without an in-kernel
-irqchip, the multiprocessing state must be maintained by userspace.
+On x86 and ia64, this ioctl is only useful after KVM_CREATE_IRQCHIP. Without an
+in-kernel irqchip, the multiprocessing state must be maintained by userspace on
+these architectures.
 
 
 4.39 KVM_SET_MP_STATE
@@ -1013,8 +1015,9 @@ Returns: 0 on success; -1 on error
 Sets the vcpu's current multiprocessing state; see KVM_GET_MP_STATE for
 arguments.
 
-This ioctl is only useful after KVM_CREATE_IRQCHIP.  Without an in-kernel
-irqchip, the multiprocessing state must be maintained by userspace.
+On x86 and ia64, this ioctl is only useful after KVM_CREATE_IRQCHIP. Without an
+in-kernel irqchip, the multiprocessing state must be maintained by userspace on
+these architectures.
 
 
 4.40 KVM_SET_IDENTITY_MAP_ADDR
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index e11d8f1..37d4ec6 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -399,8 +399,9 @@ struct kvm_vapic_addr {
__u64 vapic_addr;
 };
 
-/* for KVM_SET_MPSTATE */
+/* for KVM_SET_MP_STATE */
 
+/* not all states are valid on all architectures */
 #define KVM_MP_STATE_RUNNABLE  0
 #define KVM_MP_STATE_UNINITIALIZED 1
 #define KVM_MP_STATE_INIT_RECEIVED 2
-- 
1.8.4.2



[PATCH/RFC 2/5] s390x/kvm: introduce proper states for s390 cpus

2014-07-10 Thread Christian Borntraeger
From: David Hildenbrand d...@linux.vnet.ibm.com

Until now, when a s390 cpu was stopped or halted, the number of running
CPUs was tracked in a global variable. This was problematic for migration,
so Jason came up with a per-cpu running state.
As it turns out, we want to track the full logical state of a target vcpu,
so we need real s390 cpu states.

This patch is based on an initial patch by Jason Herne, but was heavily
rewritten when adding the cpu states STOPPED and OPERATING. On the way we
move add_del_running to cpu.c (the declaration is already in cpu.h) and
modify the users where appropriate.

Please note that the cpu is still set to be stopped when it is
halted, which is wrong. This will be fixed in the next patch. The LOAD and
CHECK-STOP state will not be used in the first step.

Signed-off-by: Jason J. Herne jjhe...@us.ibm.com
Signed-off-by: David Hildenbrand d...@linux.vnet.ibm.com
Reviewed-by: Cornelia Huck cornelia.h...@de.ibm.com
Reviewed-by: Christian Borntraeger borntrae...@de.ibm.com
Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com
[folded Jason's patch into David's patch to avoid add/remove same lines]
---
 hw/s390x/s390-virtio.c | 32 
 target-s390x/cpu.c | 43 +++
 target-s390x/cpu.h | 14 ++
 3 files changed, 57 insertions(+), 32 deletions(-)

diff --git a/hw/s390x/s390-virtio.c b/hw/s390x/s390-virtio.c
index 93c7ace..91baa18 100644
--- a/hw/s390x/s390-virtio.c
+++ b/hw/s390x/s390-virtio.c
@@ -124,38 +124,6 @@ static void s390_virtio_register_hcalls(void)
s390_virtio_hcall_set_status);
 }
 
-/*
- * The number of running CPUs. On s390 a shutdown is the state of all CPUs
- * being either stopped or disabled (for interrupts) waiting. We have to
- * track this number to call the shutdown sequence accordingly. This
- * number is modified either on startup or while holding the big qemu lock.
- */
-static unsigned s390_running_cpus;
-
-void s390_add_running_cpu(S390CPU *cpu)
-{
-    CPUState *cs = CPU(cpu);
-
-    if (cs->halted) {
-        s390_running_cpus++;
-        cs->halted = 0;
-        cs->exception_index = -1;
-    }
-}
-
-unsigned s390_del_running_cpu(S390CPU *cpu)
-{
-    CPUState *cs = CPU(cpu);
-
-    if (cs->halted == 0) {
-        assert(s390_running_cpus >= 1);
-        s390_running_cpus--;
-        cs->halted = 1;
-        cs->exception_index = EXCP_HLT;
-    }
-    return s390_running_cpus;
-}
-
 void s390_init_ipl_dev(const char *kernel_filename,
const char *kernel_cmdline,
const char *initrd_filename,
diff --git a/target-s390x/cpu.c b/target-s390x/cpu.c
index c3082b7..a6edd07 100644
--- a/target-s390x/cpu.c
+++ b/target-s390x/cpu.c
@@ -224,6 +224,49 @@ static void s390_cpu_finalize(Object *obj)
 #endif
 }
 
+#if !defined(CONFIG_USER_ONLY)
+static unsigned s390_count_running_cpus(void)
+{
+    CPUState *cpu;
+    int nr_running = 0;
+
+    CPU_FOREACH(cpu) {
+        uint8_t state = S390_CPU(cpu)->env.cpu_state;
+        if (state == CPU_STATE_OPERATING ||
+            state == CPU_STATE_LOAD) {
+            nr_running++;
+        }
+    }
+
+    return nr_running;
+}
+
+void s390_add_running_cpu(S390CPU *cpu)
+{
+    CPUState *cs = CPU(cpu);
+
+    if (cs->halted) {
+        cpu->env.cpu_state = CPU_STATE_OPERATING;
+        cs->halted = 0;
+        cs->exception_index = -1;
+    }
+}
+
+unsigned s390_del_running_cpu(S390CPU *cpu)
+{
+    CPUState *cs = CPU(cpu);
+
+    if (cs->halted == 0) {
+        assert(s390_count_running_cpus() >= 1);
+        cpu->env.cpu_state = CPU_STATE_STOPPED;
+        cs->halted = 1;
+        cs->exception_index = EXCP_HLT;
+    }
+
+    return s390_count_running_cpus();
+}
+#endif
+
 static const VMStateDescription vmstate_s390_cpu = {
     .name = "cpu",
     .unmigratable = 1,
diff --git a/target-s390x/cpu.h b/target-s390x/cpu.h
index b13761d..af22ba6 100644
--- a/target-s390x/cpu.h
+++ b/target-s390x/cpu.h
@@ -141,6 +141,20 @@ typedef struct CPUS390XState {
 QEMUTimer *tod_timer;
 
 QEMUTimer *cpu_timer;
+
+/*
+ * The cpu state represents the logical state of a cpu. In contrast to other
+ * architectures, there is a difference between a halt and a stop on s390.
+ * If all cpus are either stopped (including check stop) or in the disabled
+ * wait state, the vm can be shut down.
+ */
+#define CPU_STATE_UNINITIALIZED0x00
+#define CPU_STATE_STOPPED  0x01
+#define CPU_STATE_CHECK_STOP   0x02
+#define CPU_STATE_OPERATING0x03
+#define CPU_STATE_LOAD 0x04
+uint8_t cpu_state;
+
 } CPUS390XState;
 
 #include "cpu-qom.h"
-- 
1.8.4.2



[PATCH/RFC 4/5] s390x/kvm: test whether a cpu is STOPPED when checking has_work

2014-07-10 Thread Christian Borntraeger
From: David Hildenbrand d...@linux.vnet.ibm.com

If a cpu is stopped, it must never be allowed to run and no interrupt may wake
it up. A cpu also has to be unhalted if it is halted and has work to do - this
scenario wasn't hit in the kvm case yet, as only the disabled wait is processed
within QEMU.

Signed-off-by: David Hildenbrand d...@linux.vnet.ibm.com
Reviewed-by: Cornelia Huck cornelia.h...@de.ibm.com
Reviewed-by: Christian Borntraeger borntrae...@de.ibm.com
Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com
---
 target-s390x/cpu.c | 6 ++
 target-s390x/kvm.c | 5 +
 2 files changed, 11 insertions(+)

diff --git a/target-s390x/cpu.c b/target-s390x/cpu.c
index c5ab98f..1d32f5a 100644
--- a/target-s390x/cpu.c
+++ b/target-s390x/cpu.c
@@ -72,6 +72,12 @@ static bool s390_cpu_has_work(CPUState *cs)
     S390CPU *cpu = S390_CPU(cs);
     CPUS390XState *env = &cpu->env;
 
+    /* stopped cpus can never run */
+    if (env->cpu_state == CPU_STATE_STOPPED ||
+        env->cpu_state == CPU_STATE_CHECK_STOP) {
+        return false;
+    }
+
     return (cs->interrupt_request & CPU_INTERRUPT_HARD) &&
            (env->psw.mask & PSW_MASK_EXT);
 }
diff --git a/target-s390x/kvm.c b/target-s390x/kvm.c
index db2e42c..00125f1 100644
--- a/target-s390x/kvm.c
+++ b/target-s390x/kvm.c
@@ -553,6 +553,11 @@ void kvm_arch_post_run(CPUState *cpu, struct kvm_run *run)
 
 int kvm_arch_process_async_events(CPUState *cs)
 {
+    if (cs->halted && CPU_GET_CLASS(cs)->has_work(cs)) {
+        /* has_work will take care of stopped cpus */
+        s390_cpu_unhalt(S390_CPU(cs));
+    }
+
     return cs->halted;
 }
 
-- 
1.8.4.2



[PATCH/RFC 3/5] s390x/kvm: proper use of the cpu states OPERATING and STOPPED

2014-07-10 Thread Christian Borntraeger
From: David Hildenbrand d...@linux.vnet.ibm.com

This patch makes sure that halting a cpu and stopping a cpu are two different
things. Stopping a cpu will also set the cpu halted - this is needed for common
infrastructure to work (note that the stop and stopped flag cannot be used for
our purpose because they are already used by other mechanisms).

A cpu can be halted (waiting) when it is operating. If interrupts are
disabled, this is called a disabled wait, as it can't be woken up anymore. A
stopped cpu is treated like a disabled wait cpu, but in order to prepare for a
proper cpu state synchronization with the kvm part, we need to track the real
logical state of a cpu.

Signed-off-by: David Hildenbrand d...@linux.vnet.ibm.com
Reviewed-by: Cornelia Huck cornelia.h...@de.ibm.com
Reviewed-by: Christian Borntraeger borntrae...@de.ibm.com
Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com
---
 hw/s390x/ipl.c|  2 +-
 target-s390x/cpu.c| 78 +--
 target-s390x/cpu.h| 14 ++---
 target-s390x/helper.c | 11 ++--
 target-s390x/kvm.c| 11 
 trace-events  |  5 
 6 files changed, 75 insertions(+), 46 deletions(-)

diff --git a/hw/s390x/ipl.c b/hw/s390x/ipl.c
index 4fa9cff..3b77c9a 100644
--- a/hw/s390x/ipl.c
+++ b/hw/s390x/ipl.c
@@ -176,7 +176,7 @@ static void s390_ipl_reset(DeviceState *dev)
 }
 }
 
-    s390_add_running_cpu(cpu);
+    s390_cpu_set_state(CPU_STATE_OPERATING, cpu);
 }
 
 static void s390_ipl_class_init(ObjectClass *klass, void *data)
diff --git a/target-s390x/cpu.c b/target-s390x/cpu.c
index a6edd07..c5ab98f 100644
--- a/target-s390x/cpu.c
+++ b/target-s390x/cpu.c
@@ -26,7 +26,9 @@
 #include "cpu.h"
 #include "qemu-common.h"
 #include "qemu/timer.h"
+#include "qemu/error-report.h"
 #include "hw/hw.h"
+#include "trace.h"
 #ifndef CONFIG_USER_ONLY
 #include "sysemu/arch_init.h"
 #endif
@@ -81,7 +83,7 @@ static void s390_cpu_load_normal(CPUState *s)
     S390CPU *cpu = S390_CPU(s);
     cpu->env.psw.addr = ldl_phys(s->as, 4) & PSW_MASK_ESA_ADDR;
     cpu->env.psw.mask = PSW_MASK_32 | PSW_MASK_64;
-    s390_add_running_cpu(cpu);
+    s390_cpu_set_state(CPU_STATE_OPERATING, cpu);
 }
 #endif
 
@@ -93,11 +95,8 @@ static void s390_cpu_reset(CPUState *s)
     CPUS390XState *env = &cpu->env;
 
     env->pfault_token = -1UL;
-    s390_del_running_cpu(cpu);
     scc->parent_reset(s);
-#if !defined(CONFIG_USER_ONLY)
-    s->halted = 1;
-#endif
+    s390_cpu_set_state(CPU_STATE_STOPPED, cpu);
 tlb_flush(s, 1);
 }
 
@@ -135,9 +134,8 @@ static void s390_cpu_full_reset(CPUState *s)
 S390CPUClass *scc = S390_CPU_GET_CLASS(cpu);
     CPUS390XState *env = &cpu->env;
 
-    s390_del_running_cpu(cpu);
-
     scc->parent_reset(s);
+    s390_cpu_set_state(CPU_STATE_STOPPED, cpu);
 
 memset(env, 0, offsetof(CPUS390XState, cpu_num));
 
@@ -147,12 +145,7 @@ static void s390_cpu_full_reset(CPUState *s)
 
     env->pfault_token = -1UL;
 
-    /* set halted to 1 to make sure we can add the cpu in
-     * s390_ipl_cpu code, where CPUState::halted is set back to 0
-     * after incrementing the cpu counter */
 #if !defined(CONFIG_USER_ONLY)
-    s->halted = 1;
-
     if (kvm_enabled()) {
         kvm_s390_reset_vcpu(cpu);
     }
@@ -201,10 +194,7 @@ static void s390_cpu_initfn(Object *obj)
     env->tod_basetime = 0;
     env->tod_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, s390x_tod_timer, cpu);
     env->cpu_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, s390x_cpu_timer, cpu);
-    /* set CPUState::halted state to 1 to avoid decrementing the running
-     * cpu counter in s390_cpu_reset to a negative number at
-     * initial ipl */
-    cs->halted = 1;
+    s390_cpu_set_state(CPU_STATE_STOPPED, cpu);
 #endif
     env->cpu_num = cpu_num++;
     env->ext_index = -1;
@@ -225,6 +215,12 @@ static void s390_cpu_finalize(Object *obj)
 }
 
 #if !defined(CONFIG_USER_ONLY)
+static bool disabled_wait(CPUState *cpu)
+{
+    return cpu->halted && !(S390_CPU(cpu)->env.psw.mask &
+                            (PSW_MASK_IO | PSW_MASK_EXT | PSW_MASK_MCHECK));
+}
+
 static unsigned s390_count_running_cpus(void)
 {
 CPUState *cpu;
@@ -234,34 +230,60 @@ static unsigned s390_count_running_cpus(void)
         uint8_t state = S390_CPU(cpu)->env.cpu_state;
         if (state == CPU_STATE_OPERATING ||
             state == CPU_STATE_LOAD) {
-            nr_running++;
+            if (!disabled_wait(cpu)) {
+                nr_running++;
+            }
 }
 }
 
 return nr_running;
 }
 
-void s390_add_running_cpu(S390CPU *cpu)
+unsigned int s390_cpu_halt(S390CPU *cpu)
 {
     CPUState *cs = CPU(cpu);
+    trace_cpu_halt(cs->cpu_index);
 
-    if (cs->halted) {
-        cpu->env.cpu_state = CPU_STATE_OPERATING;
-        cs->halted = 0;
-        cs->exception_index = -1;
+    if (!cs->halted) {
+        cs->halted = 1;
+        cs->exception_index = EXCP_HLT;
     }
+
+    return s390_count_running_cpus();
 }
 
-unsigned s390_del_running_cpu(S390CPU *cpu)
+void 

[PATCH/RFC 5/5] s390x/kvm: propagate s390 cpu state to kvm

2014-07-10 Thread Christian Borntraeger
From: David Hildenbrand d...@linux.vnet.ibm.com

Let QEMU propagate the cpu state to kvm. If kvm doesn't yet support it, it is
silently ignored as kvm will still handle the cpu state itself in that case.

The state is not synced back, thus kvm won't have a chance to actively modify
the cpu state. To do so, control has to be given back to QEMU (which is
already done in all relevant cases).

Setting of the cpu state can fail either because kvm doesn't support the
interface yet, or because the state is invalid/not supported. Failed attempts
will be traced.

Signed-off-by: David Hildenbrand d...@linux.vnet.ibm.com
Reviewed-by: Thomas Huth th...@linux.vnet.ibm.com
Reviewed-by: Cornelia Huck cornelia.h...@de.ibm.com
Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com
---
 target-s390x/cpu.c |  3 +++
 target-s390x/cpu.h |  5 +
 target-s390x/kvm.c | 33 +
 trace-events   |  1 +
 4 files changed, 42 insertions(+)

diff --git a/target-s390x/cpu.c b/target-s390x/cpu.c
index 1d32f5a..cf62ebd 100644
--- a/target-s390x/cpu.c
+++ b/target-s390x/cpu.c
@@ -289,6 +289,9 @@ unsigned int s390_cpu_set_state(uint8_t cpu_state, S390CPU *cpu)
                      cpu_state);
         exit(1);
     }
+    if (kvm_enabled() && cpu->env.cpu_state != cpu_state) {
+        kvm_s390_set_cpu_state(cpu, cpu_state);
+    }
     cpu->env.cpu_state = cpu_state;
 
 return s390_count_running_cpus();
diff --git a/target-s390x/cpu.h b/target-s390x/cpu.h
index 892726c..ce4a6ca 100644
--- a/target-s390x/cpu.h
+++ b/target-s390x/cpu.h
@@ -1088,6 +1088,7 @@ int kvm_s390_assign_subch_ioeventfd(EventNotifier *notifier, uint32_t sch,
 int vq, bool assign);
 int kvm_s390_cpu_restart(S390CPU *cpu);
 void kvm_s390_clear_cmma_callback(void *opaque);
+int kvm_s390_set_cpu_state(S390CPU *cpu, uint8_t cpu_state);
 #else
 static inline void kvm_s390_io_interrupt(uint16_t subchannel_id,
 uint16_t subchannel_nr,
@@ -1114,6 +1115,10 @@ static inline int kvm_s390_cpu_restart(S390CPU *cpu)
 static inline void kvm_s390_clear_cmma_callback(void *opaque)
 {
 }
+static inline int kvm_s390_set_cpu_state(S390CPU *cpu, uint8_t cpu_state)
+{
+    return -ENOSYS;
+}
 #endif
 
 static inline void cmma_reset(S390CPU *cpu)
diff --git a/target-s390x/kvm.c b/target-s390x/kvm.c
index 00125f1..d8ecd0b 100644
--- a/target-s390x/kvm.c
+++ b/target-s390x/kvm.c
@@ -1289,3 +1289,36 @@ int kvm_s390_assign_subch_ioeventfd(EventNotifier *notifier, uint32_t sch,
     }
     return kvm_vm_ioctl(kvm_state, KVM_IOEVENTFD, &kick);
 }
+
+int kvm_s390_set_cpu_state(S390CPU *cpu, uint8_t cpu_state)
+{
+    struct kvm_mp_state mp_state = {};
+    int ret;
+
+    switch (cpu_state) {
+    case CPU_STATE_STOPPED:
+        mp_state.mp_state = KVM_MP_STATE_STOPPED;
+        break;
+    case CPU_STATE_CHECK_STOP:
+        mp_state.mp_state = KVM_MP_STATE_CHECK_STOP;
+        break;
+    case CPU_STATE_OPERATING:
+        mp_state.mp_state = KVM_MP_STATE_OPERATING;
+        break;
+    case CPU_STATE_LOAD:
+        mp_state.mp_state = KVM_MP_STATE_LOAD;
+        break;
+    default:
+        error_report("Requested CPU state is not a valid S390 CPU state: %u",
+                     cpu_state);
+        exit(1);
+    }
+
+    ret = kvm_vcpu_ioctl(CPU(cpu), KVM_SET_MP_STATE, &mp_state);
+    if (ret) {
+        trace_kvm_failed_cpu_state_set(CPU(cpu)->cpu_index, cpu_state,
+                                       strerror(-ret));
+    }
+
+    return ret;
+}
diff --git a/trace-events b/trace-events
index c652606..56a7dab 100644
--- a/trace-events
+++ b/trace-events
@@ -1301,6 +1301,7 @@ mhp_pc_dimm_assigned_address(uint64_t addr) "0x%"PRIx64
 # target-s390x/kvm.c
 kvm_enable_cmma(int rc) "CMMA: enabling with result code %d"
 kvm_clear_cmma(int rc) "CMMA: clearing with result code %d"
+kvm_failed_cpu_state_set(int cpu_index, uint8_t state, const char *msg) "Warning: Unable to set cpu %d state %" PRIu8 " to KVM: %s"
 
 # target-s390x/cpu.c
 cpu_set_state(int cpu_index, uint8_t state) "setting cpu %d state to %" PRIu8
-- 
1.8.4.2



[PATCH/RFC 5/5] KVM: s390: implement KVM_(S|G)ET_MP_STATE for user space state control

2014-07-10 Thread Christian Borntraeger
From: David Hildenbrand d...@linux.vnet.ibm.com

This patch
- adds s390 specific MP states to linux headers and documents them
- implements the KVM_{SET,GET}_MP_STATE ioctls
- enables KVM_CAP_MP_STATE
- allows user space to control the VCPU state on s390.

If user space sets the VCPU state using the ioctl KVM_SET_MP_STATE, we can
disable manual changing of the VCPU state and trust user space to do the
right thing.

Signed-off-by: David Hildenbrand d...@linux.vnet.ibm.com
Reviewed-by: Cornelia Huck cornelia.h...@de.ibm.com
Acked-by: Christian Borntraeger borntrae...@de.ibm.com
Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com
---
 Documentation/virtual/kvm/api.txt | 10 --
 arch/s390/include/asm/kvm_host.h  |  1 +
 arch/s390/kvm/diag.c  |  3 ++-
 arch/s390/kvm/intercept.c |  3 ++-
 arch/s390/kvm/kvm-s390.c  | 37 +
 arch/s390/kvm/kvm-s390.h  |  6 ++
 include/uapi/linux/kvm.h  |  4 
 7 files changed, 56 insertions(+), 8 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 904c61c..a41465b 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -974,7 +974,7 @@ for vm-wide capabilities.
 4.38 KVM_GET_MP_STATE
 
 Capability: KVM_CAP_MP_STATE
-Architectures: x86, ia64
+Architectures: x86, ia64, s390
 Type: vcpu ioctl
 Parameters: struct kvm_mp_state (out)
 Returns: 0 on success; -1 on error
@@ -998,6 +998,12 @@ Possible values are:
  is waiting for an interrupt [x86, ia64]
  - KVM_MP_STATE_SIPI_RECEIVED:   the vcpu has just received a SIPI (vector
  accessible via KVM_GET_VCPU_EVENTS) [x86, ia64]
+ - KVM_MP_STATE_STOPPED: the vcpu is stopped [s390]
+ - KVM_MP_STATE_CHECK_STOP:  the vcpu is in a special error state [s390]
+ - KVM_MP_STATE_OPERATING:   the vcpu is operating (running or halted)
+ [s390]
+ - KVM_MP_STATE_LOAD:        the vcpu is in a special load/startup state
+ [s390]
 
 On x86 and ia64, this ioctl is only useful after KVM_CREATE_IRQCHIP. Without an
 in-kernel irqchip, the multiprocessing state must be maintained by userspace on
@@ -1007,7 +1013,7 @@ these architectures.
 4.39 KVM_SET_MP_STATE
 
 Capability: KVM_CAP_MP_STATE
-Architectures: x86, ia64
+Architectures: x86, ia64, s390
 Type: vcpu ioctl
 Parameters: struct kvm_mp_state (in)
 Returns: 0 on success; -1 on error
diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index 4181d7b..c2ba020 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -418,6 +418,7 @@ struct kvm_arch{
int css_support;
int use_irqchip;
int use_cmma;
+   int user_cpu_state_ctrl;
struct s390_io_adapter *adapters[MAX_S390_IO_ADAPTERS];
wait_queue_head_t ipte_wq;
spinlock_t start_stop_lock;
diff --git a/arch/s390/kvm/diag.c b/arch/s390/kvm/diag.c
index 0161675..59bd8f9 100644
--- a/arch/s390/kvm/diag.c
+++ b/arch/s390/kvm/diag.c
@@ -176,7 +176,8 @@ static int __diag_ipl_functions(struct kvm_vcpu *vcpu)
return -EOPNOTSUPP;
}
 
-	kvm_s390_vcpu_stop(vcpu);
+	if (!kvm_s390_user_cpu_state_ctrl(vcpu->kvm))
+		kvm_s390_vcpu_stop(vcpu);
 	vcpu->run->s390_reset_flags |= KVM_S390_RESET_SUBSYSTEM;
 	vcpu->run->s390_reset_flags |= KVM_S390_RESET_IPL;
 	vcpu->run->s390_reset_flags |= KVM_S390_RESET_CPU_INIT;
diff --git a/arch/s390/kvm/intercept.c b/arch/s390/kvm/intercept.c
index ac6b325..eaf4629 100644
--- a/arch/s390/kvm/intercept.c
+++ b/arch/s390/kvm/intercept.c
@@ -73,7 +73,8 @@ static int handle_stop(struct kvm_vcpu *vcpu)
return rc;
}
 
-	kvm_s390_vcpu_stop(vcpu);
+	if (!kvm_s390_user_cpu_state_ctrl(vcpu->kvm))
+		kvm_s390_vcpu_stop(vcpu);
 	return -EOPNOTSUPP;
 }
 
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 3428953..fdf88f7 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -167,6 +167,7 @@ int kvm_dev_ioctl_check_extension(long ext)
case KVM_CAP_DEVICE_CTRL:
case KVM_CAP_ENABLE_CAP_VM:
case KVM_CAP_VM_ATTRIBUTES:
+   case KVM_CAP_MP_STATE:
r = 1;
break;
case KVM_CAP_NR_VCPUS:
@@ -595,7 +596,8 @@ static void kvm_s390_vcpu_initial_reset(struct kvm_vcpu *vcpu)
 	vcpu->arch.sie_block->pp = 0;
 	vcpu->arch.pfault_token = KVM_S390_PFAULT_TOKEN_INVALID;
 	kvm_clear_async_pf_completion_queue(vcpu);
-	kvm_s390_vcpu_stop(vcpu);
+	if (!kvm_s390_user_cpu_state_ctrl(vcpu->kvm))
+		kvm_s390_vcpu_stop(vcpu);
 	kvm_s390_clear_local_irqs(vcpu);
 }
 
@@ -980,13 +982,34 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
 int 

[PATCH/RFC 1/5] update linux headers with cpustate changes

2014-07-10 Thread Christian Borntraeger
Will do a proper headers sync when ready

Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com
---
 linux-headers/linux/kvm.h | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
index f5d2c38..9b584da 100644
--- a/linux-headers/linux/kvm.h
+++ b/linux-headers/linux/kvm.h
@@ -399,13 +399,18 @@ struct kvm_vapic_addr {
__u64 vapic_addr;
 };
 
-/* for KVM_SET_MPSTATE */
+/* for KVM_SET_MP_STATE */
 
+/* not all states are valid on all architectures */
 #define KVM_MP_STATE_RUNNABLE  0
 #define KVM_MP_STATE_UNINITIALIZED 1
 #define KVM_MP_STATE_INIT_RECEIVED 2
 #define KVM_MP_STATE_HALTED3
 #define KVM_MP_STATE_SIPI_RECEIVED 4
+#define KVM_MP_STATE_STOPPED   5
+#define KVM_MP_STATE_CHECK_STOP6
+#define KVM_MP_STATE_OPERATING 7
+#define KVM_MP_STATE_LOAD  8
 
 struct kvm_mp_state {
__u32 mp_state;
-- 
1.8.4.2



Re: [PATCH/RFC 0/5] s390x/kvm: track the logical cpu state in QEMU and propagate it to kvm

2014-07-10 Thread David Hildenbrand
 This is the qemu part of kernel series Let user space control the
 cpu states
 
 Christian Borntraeger (1):
   update linux headers with with cpustate changes
 
 David Hildenbrand (4):
   s390x/kvm: introduce proper states for s390 cpus
   s390x/kvm: proper use of the cpu states OPERATING and STOPPED
   s390x/kvm: test whether a cpu is STOPPED when checking has_work
   s390x/kvm: propagate s390 cpu state to kvm
 
  hw/s390x/ipl.c|   2 +-
  hw/s390x/s390-virtio.c|  32 --
  linux-headers/linux/kvm.h |   7 ++-
  target-s390x/cpu.c| 106 
 +++---
  target-s390x/cpu.h|  33 +--
  target-s390x/helper.c |  11 ++---
  target-s390x/kvm.c|  49 ++---
  trace-events  |   6 +++
  8 files changed, 179 insertions(+), 67 deletions(-)
 

Looks good to me.




Re: [PATCH/RFC 0/5] s390x/kvm: track the logical cpu state in QEMU and propagate it to kvm

2014-07-10 Thread David Hildenbrand
  This is the qemu part of kernel series Let user space control the
  cpu states
  
  Christian Borntraeger (1):
update linux headers with with cpustate changes
  
  David Hildenbrand (4):
s390x/kvm: introduce proper states for s390 cpus
s390x/kvm: proper use of the cpu states OPERATING and STOPPED
s390x/kvm: test whether a cpu is STOPPED when checking has_work
s390x/kvm: propagate s390 cpu state to kvm
  
   hw/s390x/ipl.c|   2 +-
   hw/s390x/s390-virtio.c|  32 --
   linux-headers/linux/kvm.h |   7 ++-
   target-s390x/cpu.c| 106 
  +++---
   target-s390x/cpu.h|  33 +--
   target-s390x/helper.c |  11 ++---
   target-s390x/kvm.c|  49 ++---
   trace-events  |   6 +++
   8 files changed, 179 insertions(+), 67 deletions(-)
  
 
 Looks good to me
 
 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 

@all thought it was the final internal review :)



[PATCH] arm/arm64: KVM: Support KVM_CAP_READONLY_MEM

2014-07-10 Thread Christoffer Dall
When userspace loads code and data in a read-only memory regions, KVM
needs to be able to handle this on arm and arm64.  Specifically this is
used when running code directly from a read-only flash device; the
common scenario is a UEFI blob loaded with the -bios option in QEMU.

To avoid looking through the memslots twice and to reuse the hva error
checking of gfn_to_hva_prot(), add a new gfn_to_hva_memslot_prot()
function and refactor gfn_to_hva_prot() to use this function.

Signed-off-by: Christoffer Dall christoffer.d...@linaro.org
---
Note that if you want to test this with QEMU, you need to update the
uapi headers.  You can also grab the branch below from my qemu git tree
with the temporary update headers patch applied on top of Peter
Maydell's -bios in -M virt support patches:

git://git.linaro.org/people/christoffer.dall/qemu-arm.git virt-for-uefi

 arch/arm/include/uapi/asm/kvm.h   |  1 +
 arch/arm/kvm/arm.c|  1 +
 arch/arm/kvm/mmu.c| 15 ---
 arch/arm64/include/uapi/asm/kvm.h |  1 +
 include/linux/kvm_host.h  |  2 ++
 virt/kvm/kvm_main.c   | 11 +--
 6 files changed, 22 insertions(+), 9 deletions(-)

diff --git a/arch/arm/include/uapi/asm/kvm.h b/arch/arm/include/uapi/asm/kvm.h
index e6ebdd3..51257fd 100644
--- a/arch/arm/include/uapi/asm/kvm.h
+++ b/arch/arm/include/uapi/asm/kvm.h
@@ -25,6 +25,7 @@
 
 #define __KVM_HAVE_GUEST_DEBUG
 #define __KVM_HAVE_IRQ_LINE
+#define __KVM_HAVE_READONLY_MEM
 
 #define KVM_REG_SIZE(id)						\
 	(1U << (((id) & KVM_REG_SIZE_MASK) >> KVM_REG_SIZE_SHIFT))
diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index d7424ef..037adda 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -188,6 +188,7 @@ int kvm_dev_ioctl_check_extension(long ext)
case KVM_CAP_ONE_REG:
case KVM_CAP_ARM_PSCI:
case KVM_CAP_ARM_PSCI_0_2:
+   case KVM_CAP_READONLY_MEM:
r = 1;
break;
case KVM_CAP_COALESCED_MMIO:
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 0f6f642..d606d86 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -745,14 +745,13 @@ static bool transparent_hugepage_adjust(pfn_t *pfnp, phys_addr_t *ipap)
 }
 
 static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
- struct kvm_memory_slot *memslot,
+ struct kvm_memory_slot *memslot, unsigned long hva,
  unsigned long fault_status)
 {
 	int ret;
 	bool write_fault, writable, hugetlb = false, force_pte = false;
 	unsigned long mmu_seq;
 	gfn_t gfn = fault_ipa >> PAGE_SHIFT;
-	unsigned long hva = gfn_to_hva(vcpu->kvm, gfn);
 	struct kvm *kvm = vcpu->kvm;
 	struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
 	struct vm_area_struct *vma;
@@ -861,7 +860,8 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run)
unsigned long fault_status;
phys_addr_t fault_ipa;
struct kvm_memory_slot *memslot;
-   bool is_iabt;
+   unsigned long hva;
+   bool is_iabt, write_fault, writable;
gfn_t gfn;
int ret, idx;
 
@@ -882,7 +882,10 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run)
 	idx = srcu_read_lock(&vcpu->kvm->srcu);
 
 	gfn = fault_ipa >> PAGE_SHIFT;
-	if (!kvm_is_visible_gfn(vcpu->kvm, gfn)) {
+	memslot = gfn_to_memslot(vcpu->kvm, gfn);
+	hva = gfn_to_hva_memslot_prot(memslot, gfn, &writable);
+	write_fault = kvm_is_write_fault(kvm_vcpu_get_hsr(vcpu));
+	if (kvm_is_error_hva(hva) || (write_fault && !writable)) {
if (is_iabt) {
/* Prefetch Abort on I/O address */
kvm_inject_pabt(vcpu, kvm_vcpu_get_hfar(vcpu));
@@ -908,9 +911,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run)
goto out_unlock;
}
 
-	memslot = gfn_to_memslot(vcpu->kvm, gfn);
-
-	ret = user_mem_abort(vcpu, fault_ipa, memslot, fault_status);
+	ret = user_mem_abort(vcpu, fault_ipa, memslot, hva, fault_status);
if (ret == 0)
ret = 1;
 out_unlock:
diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
index e633ff8..f4ec5a6 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -37,6 +37,7 @@
 
 #define __KVM_HAVE_GUEST_DEBUG
 #define __KVM_HAVE_IRQ_LINE
+#define __KVM_HAVE_READONLY_MEM
 
 #define KVM_REG_SIZE(id)						\
 	(1U << (((id) & KVM_REG_SIZE_MASK) >> KVM_REG_SIZE_SHIFT))
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 970c681..8636b7f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -544,6 +544,8 @@ struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn);
 unsigned long gfn_to_hva(struct kvm *kvm, gfn_t 

[PATCH v2 09/10] target-arm/kvm.c: better error reporting

2014-07-10 Thread Alex Bennée
From: Alex Bennée a...@bennee.com

When we have a problem syncing CP registers between kvm and qemu it's a
lot more useful to have the names of the registers in the log than just
a random abort() and core dump.

Signed-off-by: Alex Bennée alex.ben...@linaro.org

---
v2
  - less verbose log message
  - fix checkpatch warnings

diff --git a/target-arm/kvm.c b/target-arm/kvm.c
index 319784d..72e242d 100644
--- a/target-arm/kvm.c
+++ b/target-arm/kvm.c
@@ -279,6 +279,16 @@ void kvm_arm_register_device(MemoryRegion *mr, uint64_t 
devid, uint64_t group,
     memory_region_ref(kd->mr);
 }
 
+static void failed_cpreg_operation(ARMCPU *cpu, uint64_t regidx, int ret,
+                                   const char *func)
+{
+    uint32_t cpreg_id = kvm_to_cpreg_id(regidx);
+    ARMCPRegInfo *cpreg = g_hash_table_lookup(cpu->cp_regs, &cpreg_id);
+    qemu_log_mask(LOG_UNIMP,
+                  "%s: failed (%d) KVM reg op %" PRIx64 " (%s)\n",
+                  func, ret, regidx, cpreg ? cpreg->name : "unknown");
+}
+
 bool write_kvmstate_to_list(ARMCPU *cpu)
 {
 CPUState *cs = CPU(cpu);
@@ -309,6 +319,7 @@ bool write_kvmstate_to_list(ARMCPU *cpu)
 abort();
 }
 if (ret) {
+failed_cpreg_operation(cpu, regidx, ret, __func__);
 ok = false;
 }
 }
@@ -345,6 +356,7 @@ bool write_list_to_kvmstate(ARMCPU *cpu)
  * you tried to set a register which is constant with
  * a different value from what it actually contains.
  */
+failed_cpreg_operation(cpu, regidx, ret, __func__);
 ok = false;
 }
 }
-- 
2.0.1



[PATCH v2 10/10] target-arm/kvm: make reg sync code common between kvm32/64

2014-07-10 Thread Alex Bennée
Before we launch a guest we query KVM for the list of co-processor
registers it knows about which is used later for save/restore of machine
state. The logic is identical for both 32-bit and 64-bit so I've moved
it all into the common code and simplified the exit paths (as failure means
exit).

This list may well have more registers than are known by the TCG
emulation which is not necessarily a problem but it does stop us from
migrating between KVM and TCG hosted guests. I've added some additional
checking to report those registers under -d unimp.

Signed-off-by: Alex Bennée alex.ben...@linaro.org

diff --git a/target-arm/kvm.c b/target-arm/kvm.c
index 72e242d..a2895dc 100644
--- a/target-arm/kvm.c
+++ b/target-arm/kvm.c
@@ -21,6 +21,7 @@
 #include "sysemu/kvm.h"
 #include "kvm_arm.h"
 #include "cpu.h"
+#include "internals.h"
 #include "hw/arm/arm.h"
 
 const KVMCapabilityInfo kvm_arch_required_capabilities[] = {
@@ -289,6 +290,130 @@ static void failed_cpreg_operation(ARMCPU *cpu, uint64_t regidx, int ret,
                   func, ret, regidx, cpreg ? cpreg->name : "unknown");
 }
 
+static int compare_u64(const void *a, const void *b)
+{
+    if (*(uint64_t *)a > *(uint64_t *)b) {
+        return 1;
+    }
+    if (*(uint64_t *)a < *(uint64_t *)b) {
+        return -1;
+    }
+    return 0;
+}
+
+static bool reg_syncs_via_tuple_list(uint64_t regidx)
+{
+    /* Return true if the regidx is a register we should synchronize
+     * via the cpreg_tuples array (ie is not a core reg we sync by
+     * hand in kvm_arch_get/put_registers())
+     */
+    switch (regidx & KVM_REG_ARM_COPROC_MASK) {
+    case KVM_REG_ARM_CORE:
+#ifdef KVM_REG_ARM_VFP
+    case KVM_REG_ARM_VFP:
+#endif
+        return false;
+    default:
+        return true;
+    }
+}
+
+/*
+ * Fetch a list of registers from KVM that we will need to be able to
+ * migrate the state. These registers may or may not map onto real
+ * hardware registers but either way QEMU uses the KVM_GET/SET_ONE_REG
+ * api to copy their state back and forth when required.
+ *
+ * For migration between KVM and TCG both models need to understand
+ * the same set of registers.
+ *
+ * If we exit due to failure we would leak memory but we'll be exiting
+ * anyway so the return path is kept simple.
+ */
+bool kvm_arm_sync_register_list(CPUState *cs)
+{
+    struct kvm_reg_list rl;
+    struct kvm_reg_list *rlp;
+    int i, j, ret, arraylen;
+    ARMCPU *cpu = ARM_CPU(cs);
+
+    /* Populate the cpreg list based on the kernel's idea
+     * of what registers exist (and throw away the TCG-created list).
+     */
+    rl.n = 0;
+    ret = kvm_vcpu_ioctl(cs, KVM_GET_REG_LIST, &rl);
+    if (ret != -E2BIG) {
+        return FALSE;
+    }
+
+    rlp = g_malloc(sizeof(struct kvm_reg_list) + (rl.n * sizeof(uint64_t)));
+    rlp->n = rl.n;
+    ret = kvm_vcpu_ioctl(cs, KVM_GET_REG_LIST, rlp);
+    if (ret) {
+        fprintf(stderr, "%s: failed to get register list\n", __func__);
+        return FALSE;
+    }
+    /* Sort the list we get back from the kernel, since cpreg_tuples
+     * must be in strictly ascending order.
+     */
+    qsort(&rlp->reg, rlp->n, sizeof(rlp->reg[0]), compare_u64);
+
+    /* Count how many of these registers we'll actually sync through
+     * the cpreg_indexes mechanism and overwrite the existing TCG
+     * built array of registers.
+     */
+    for (i = 0, arraylen = 0; i < rlp->n; i++) {
+        uint64_t regidx = rlp->reg[i];
+        if (reg_syncs_via_tuple_list(regidx)) {
+            gboolean found = FALSE;
+            arraylen++;
+            for (j = 0; j < cpu->cpreg_array_len; j++) {
+                if (regidx == cpu->cpreg_indexes[j]) {
+                    found = TRUE;
+                    break;
+                }
+            }
+            if (!found) {
+                qemu_log_mask(LOG_UNIMP,
+                              "%s: TCG missing definition of %" PRIx64 "\n",
+                              __func__, regidx);
+            }
+        }
+    }
+
+    cpu->cpreg_indexes = g_renew(uint64_t, cpu->cpreg_indexes, arraylen);
+    cpu->cpreg_values = g_renew(uint64_t, cpu->cpreg_values, arraylen);
+    cpu->cpreg_vmstate_indexes = g_renew(uint64_t, cpu->cpreg_vmstate_indexes,
+                                         arraylen);
+    cpu->cpreg_vmstate_values = g_renew(uint64_t, cpu->cpreg_vmstate_values,
+                                        arraylen);
+    cpu->cpreg_array_len = arraylen;
+    cpu->cpreg_vmstate_array_len = arraylen;
+
+    for (i = 0, arraylen = 0; i < rlp->n; i++) {
+        uint64_t regidx = rlp->reg[i];
+        if (!reg_syncs_via_tuple_list(regidx)) {
+            continue;
+        }
+        switch (regidx & KVM_REG_SIZE_MASK) {
+        case KVM_REG_SIZE_U32:
+        case KVM_REG_SIZE_U64:
+            break;
+        default:
+            fprintf(stderr,
+                    "%s: un-handled register size (%" PRIx64 ") in kernel list\n",
+                    __func__, regidx);
+            return FALSE;
+        }
+        cpu->cpreg_indexes[arraylen] =

[PATCH 0/2] KVM: PPC: Fix Mac OS X guests on non-Apple hosts

2014-07-10 Thread Alexander Graf
While trying to get Mac-on-Linux to work on a POWER7 box I stumbled over two
issues in our book3s_32 MMU emulation. With these issues fixed and a hack to
disable magic page mapping (very hard to get right with 64k pages in this case)
I can successfully run Mac OS X guests on a POWER7 host.

Alex

Alexander Graf (2):
  KVM: PPC: Deflect page write faults properly in kvmppc_st
  KVM: PPC: Book3S: Stop PTE lookup on write errors

 arch/powerpc/kvm/book3s.c| 6 --
 arch/powerpc/kvm/book3s_32_mmu.c | 2 +-
 2 files changed, 5 insertions(+), 3 deletions(-)

-- 
1.8.1.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] KVM: PPC: Deflect page write faults properly in kvmppc_st

2014-07-10 Thread Alexander Graf
When we have a page that we're not allowed to write to, xlate() will already
tell us -EPERM on lookup of that page. With the code as is, we change it into
a "page missing" error, which may confuse the guest. Instead, just
tell the caller about the -EPERM directly.

This fixes Mac OS X guests when run with DCBZ32 emulation.

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/kvm/book3s.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index bd75902..9624c56 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -418,11 +418,13 @@ int kvmppc_st(struct kvm_vcpu *vcpu, ulong *eaddr, int size, void *ptr,
 		 bool data)
 {
struct kvmppc_pte pte;
+   int r;
 
 	vcpu->stat.st++;
 
-	if (kvmppc_xlate(vcpu, *eaddr, data, true, &pte))
-		return -ENOENT;
+	r = kvmppc_xlate(vcpu, *eaddr, data, true, &pte);
+	if (r < 0)
+		return r;
 
*eaddr = pte.raddr;
 
-- 
1.8.1.4



[PATCH 2/2] KVM: PPC: Book3S: Stop PTE lookup on write errors

2014-07-10 Thread Alexander Graf
When a page lookup fails because we're not allowed to write to the page, we
should not overwrite that result with another lookup in the second PTEG, which
will return "page not found". Instead, we should just tell the caller that we
had a permission problem.

This fixes Mac OS X guests looping endlessly in page lookup code for me.

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/kvm/book3s_32_mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_32_mmu.c b/arch/powerpc/kvm/book3s_32_mmu.c
index 93503bb..cd0b073 100644
--- a/arch/powerpc/kvm/book3s_32_mmu.c
+++ b/arch/powerpc/kvm/book3s_32_mmu.c
@@ -335,7 +335,7 @@ static int kvmppc_mmu_book3s_32_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
 	if (r < 0)
 		r = kvmppc_mmu_book3s_32_xlate_pte(vcpu, eaddr, pte,
 						   data, iswrite, true);
-	if (r < 0)
+	if (r == -ENOENT)
 		r = kvmppc_mmu_book3s_32_xlate_pte(vcpu, eaddr, pte,
 						   data, iswrite, false);
 
-- 
1.8.1.4



[PATCH V5 1/2] perf ignore LBR and extra_regs

2014-07-10 Thread kan . liang
From: Kan Liang kan.li...@intel.com

x86, perf: Protect LBR and extra_regs against KVM lying

With -cpu host, KVM reports LBR and extra_regs support if the host has support.
When the guest perf driver tries to access an LBR or extra_regs MSR,
the access #GPs, since KVM doesn't handle LBR and extra_regs virtualization.
So check access to the related MSRs once at initialization time to avoid
the erroneous access at runtime.

For reproducing the issue, please build the kernel with CONFIG_KVM_INTEL = y 
(for host kernel).
And CONFIG_PARAVIRT = n and CONFIG_KVM_GUEST = n (for guest kernel).
Start the guest with -cpu host.
Run perf record with --branch-any or --branch-filter in guest to trigger LBR 
#GP.
Run perf stat offcore events (E.g. LLC-loads/LLC-load-misses ...) in guest to 
trigger offcore_rsp #GP

Signed-off-by: Kan Liang kan.li...@intel.com

V2: Move the check code to initialization time.
V3: Add flag for each extra register.
Check all LBR MSRs at initialization time.
V4: Remove lbr_msr_access. For LBR msr, simply set lbr_nr to 0 if check_msr 
failed.
Disable all extra msrs in creation places if check_msr failed.
V5: Fix check_msr breakage
Don't check any more MSRs after the first failure
Return an error when the check fails, to stop creating the event
Remove the checking code path which never get
---
 arch/x86/kernel/cpu/perf_event.c   |  3 +++
 arch/x86/kernel/cpu/perf_event.h   | 45 ++
 arch/x86/kernel/cpu/perf_event_intel.c | 38 +++-
 3 files changed, 85 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 2bdfbff..a7c5e4b 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -118,6 +118,9 @@ static int x86_pmu_extra_regs(u64 config, struct perf_event 
*event)
continue;
 	if (event->attr.config1 & ~er->valid_mask)
 		return -EINVAL;
+	/* Check if the extra msrs can be safely accessed */
+	if (!x86_pmu.extra_msr_access[er->idx])
+		return -EFAULT;
 
 	reg->idx = er->idx;
 	reg->config = event->attr.config1;
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 3b2f9bd..992c678 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -464,6 +464,12 @@ struct x86_pmu {
 */
struct extra_reg *extra_regs;
unsigned int er_flags;
+   /*
+* EXTRA REG MSR can be accessed
+* The extra registers are completely unrelated to each other.
+* So it needs a flag for each extra register.
+*/
+	bool	extra_msr_access[EXTRA_REG_MAX];
 
/*
 * Intel host/guest support (KVM)
@@ -525,6 +531,45 @@ extern u64 __read_mostly hw_cache_extra_regs
[PERF_COUNT_HW_CACHE_OP_MAX]
[PERF_COUNT_HW_CACHE_RESULT_MAX];
 
+/*
+ * Under certain circumstances, accessing certain MSRs may cause a #GP.
+ * This function tests whether the input MSR can be safely accessed.
+ */
+static inline bool check_msr(unsigned long msr)
+{
+	u64 val_old, val_new, val_tmp;
+
+	/*
+	 * Read the current value, change it and read it back to see if it
+	 * matches, this is needed to detect certain hardware emulators
+	 * (qemu/kvm) that don't trap on the MSR access and always return 0s.
+	 */
+	if (rdmsrl_safe(msr, &val_old))
+		goto msr_fail;
+	/*
+	 * Only change it slightly,
+	 * since the higher bits of some MSRs cannot be updated by wrmsrl.
+	 * E.g. MSR_LBR_TOS
+	 */
+	val_tmp = val_old ^ 0x3UL;
+	if (wrmsrl_safe(msr, val_tmp) ||
+	    rdmsrl_safe(msr, &val_new))
+		goto msr_fail;
+
+	if (val_new != val_tmp)
+		goto msr_fail;
+
+	/* Here it's sure that the MSR can be safely accessed.
+	 * Restore the old value and return.
+	 */
+	wrmsrl(msr, val_old);
+
+	return true;
+
+msr_fail:
+	return false;
+}
+
 u64 x86_perf_event_update(struct perf_event *event);
 
 static inline unsigned int x86_pmu_config_addr(int index)
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index adb02aa..953b2b2 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2262,7 +2262,8 @@ __init int intel_pmu_init(void)
union cpuid10_ebx ebx;
struct event_constraint *c;
unsigned int unused;
-   int version;
+   int version, i;
+   struct extra_reg *er;
 
 	if (!cpu_has(&boot_cpu_data, X86_FEATURE_ARCH_PERFMON)) {
switch (boot_cpu_data.x86) {
@@ -2565,6 +2566,41 @@ __init int intel_pmu_init(void)
}
}
 
+   /*
+* Access LBR MSR may cause #GP 

[PATCH V5 2/2] kvm: ignore LBR and extra_reg

2014-07-10 Thread kan . liang
From: Kan Liang kan.li...@intel.com

With -cpu host, KVM reports LBR and extra_regs support, so the perf driver may
access the LBR and extra_regs MSRs.
However, there is no LBR and extra_regs virtualization support yet. This can
cause the guest to crash.
As a workaround, KVM simply ignores the LBR and extra_regs MSRs, lying to the
guest that the accesses succeeded.

For reproducing the issue, please build the kernel with CONFIG_KVM_INTEL = y 
(for host kernel).
And CONFIG_PARAVIRT = n and CONFIG_KVM_GUEST = n (for guest kernel).
Start the guest with -cpu host.
Run perf record with --branch-any or --branch-filter in guest to trigger LBR 
#GP.
Run perf stat offcore events (E.g. LLC-loads/LLC-load-misses ...) in guest to 
trigger offcore_rsp #GP

Signed-off-by: Andi Kleen a...@linux.intel.com
Signed-off-by: Kan Liang kan.li...@intel.com

V3: add MSR_LBR_TOS
V4: add MSR_LBR_SELECT and MSR_PEBS_LD_LAT_THRESHOLD
V5: set_msr should return 0 to lie to the guest
---
 arch/x86/kvm/pmu.c | 38 ++
 1 file changed, 38 insertions(+)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index cbecaa9..5fd5b44 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -331,6 +331,18 @@ bool kvm_pmu_msr(struct kvm_vcpu *vcpu, u32 msr)
case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
 		ret = pmu->version > 1;
break;
+   case MSR_OFFCORE_RSP_0:
+   case MSR_OFFCORE_RSP_1:
+   case MSR_LBR_SELECT:
+   case MSR_PEBS_LD_LAT_THRESHOLD:
+   case MSR_LBR_TOS:
+   /* At most 8-deep LBR for core and atom */
+   case MSR_LBR_CORE_FROM ... MSR_LBR_CORE_FROM + 7:
+   case MSR_LBR_CORE_TO ... MSR_LBR_CORE_TO + 7:
+   /* 16-deep LBR for core i3/i5/i7 series processors */
+   case MSR_LBR_NHM_FROM ... MSR_LBR_NHM_FROM + 15:
+   case MSR_LBR_NHM_TO ... MSR_LBR_NHM_TO + 15:
+   return 1; /* to avoid crashes */
default:
ret = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)
|| get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0)
@@ -358,6 +370,19 @@ int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data)
case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
 		*data = pmu->global_ovf_ctrl;
return 0;
+   case MSR_OFFCORE_RSP_0:
+   case MSR_OFFCORE_RSP_1:
+   case MSR_LBR_SELECT:
+   case MSR_PEBS_LD_LAT_THRESHOLD:
+   case MSR_LBR_TOS:
+   /* At most 8-deep LBR for core and atom */
+   case MSR_LBR_CORE_FROM ... MSR_LBR_CORE_FROM + 7:
+   case MSR_LBR_CORE_TO ... MSR_LBR_CORE_TO + 7:
+   /* 16-deep LBR for core i3/i5/i7 series processors */
+   case MSR_LBR_NHM_FROM ... MSR_LBR_NHM_FROM + 15:
+   case MSR_LBR_NHM_TO ... MSR_LBR_NHM_TO + 15:
+   *data = 0;
+   return 0;
default:
if ((pmc = get_gp_pmc(pmu, index, MSR_IA32_PERFCTR0)) ||
(pmc = get_fixed_pmc(pmu, index))) {
@@ -409,6 +434,19 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return 0;
}
break;
+   case MSR_OFFCORE_RSP_0:
+   case MSR_OFFCORE_RSP_1:
+   case MSR_LBR_SELECT:
+   case MSR_PEBS_LD_LAT_THRESHOLD:
+   case MSR_LBR_TOS:
+   /* At most 8-deep LBR for core and atom */
+   case MSR_LBR_CORE_FROM ... MSR_LBR_CORE_FROM + 7:
+   case MSR_LBR_CORE_TO ... MSR_LBR_CORE_TO + 7:
+   /* 16-deep LBR for core i3/i5/i7 series processors */
+   case MSR_LBR_NHM_FROM ... MSR_LBR_NHM_FROM + 15:
+   case MSR_LBR_NHM_TO ... MSR_LBR_NHM_TO + 15:
+   /* dummy for now */
+   return 0;
default:
if ((pmc = get_gp_pmc(pmu, index, MSR_IA32_PERFCTR0)) ||
(pmc = get_fixed_pmc(pmu, index))) {
-- 
1.8.3.1



Re: [PATCH] arm64: fix VTTBR_BADDR_MASK

2014-07-10 Thread Christoffer Dall
On Wed, Jul 09, 2014 at 11:17:04AM -0500, Joel Schopp wrote:
 The current calculation for VTTBR_BADDR_MASK masks only 39 bits and not
 all 40 bits.  That last bit is important as some systems allocate
 from near the top of the available address space.
 
 This patch is necessary to run KVM on an aarch64 SOC I have been testing.
 
 Signed-off-by: Joel Schopp joel.sch...@amd.com
 ---
  arch/arm64/include/asm/kvm_arm.h |2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)
 
 diff --git a/arch/arm64/include/asm/kvm_arm.h 
 b/arch/arm64/include/asm/kvm_arm.h
 index 3d69030..b39e93f 100644
 --- a/arch/arm64/include/asm/kvm_arm.h
 +++ b/arch/arm64/include/asm/kvm_arm.h
 @@ -148,7 +148,7 @@
  #endif
  
  #define VTTBR_BADDR_SHIFT (VTTBR_X - 1)
 -#define VTTBR_BADDR_MASK  (((1LLU << (40 - VTTBR_X)) - 1) << VTTBR_BADDR_SHIFT)
 +#define VTTBR_BADDR_MASK  (0xffffffffffLLU)  /* bits 0-39 */
  #define VTTBR_VMID_SHIFT  (48LLU)
  #define VTTBR_VMID_MASK   (0xffLLU << VTTBR_VMID_SHIFT)
  
 

While this is obviously fixing a bug, it doesn't feel like the right
short-term fix.  I'll have to go back and read the definitions of x in
BADDR[47:x] for VTTBR_EL2 exactly again, but the intended use of
VTTBR_BADDR_MASK (and the only occurrence of it in C code) is to deal
with alignment of the allocated pgd.

So while all the MSBs in the physical address range of your system should
be used, we still need to check that the pgd we allocate is actually
aligned properly, because writing any bits in [x-1:0] as non-zero can
potentially cause unpredictable behavior and the translation walk results
can be corrupted.

So, you do need to use a proper shift, I'm fine with fixing it for the
assumed 40-bit physical address space for now, but you need to figure
out the right value of 'x', for the BADDR[47:x], and you should change
the code in update_vttbr() as part of this patch, something like
BUG_ON(pgd_phys  ~VTTBR_BADDR_MASK) and make sure the pgd is allocated
at the right alignment, or ensure that the PGD is large enough that we
can just mask off the lower bits and still contain the required number
of entries.

Thanks,
-Christoffer


Re: [PATCH] arm64: fix VTTBR_BADDR_MASK

2014-07-10 Thread Joel Schopp

On 07/10/2014 03:25 PM, Christoffer Dall wrote:
 On Wed, Jul 09, 2014 at 11:17:04AM -0500, Joel Schopp wrote:
 The current calculation for VTTBR_BADDR_MASK masks only 39 bits and not
 all 40 bits.  That last bit is important as some systems allocate
 from near the top of the available address space.

 This patch is necessary to run KVM on an aarch64 SOC I have been testing.

 Signed-off-by: Joel Schopp joel.sch...@amd.com
 ---
  arch/arm64/include/asm/kvm_arm.h |2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

 diff --git a/arch/arm64/include/asm/kvm_arm.h 
 b/arch/arm64/include/asm/kvm_arm.h
 index 3d69030..b39e93f 100644
 --- a/arch/arm64/include/asm/kvm_arm.h
 +++ b/arch/arm64/include/asm/kvm_arm.h
 @@ -148,7 +148,7 @@
  #endif
  
  #define VTTBR_BADDR_SHIFT (VTTBR_X - 1)
 -#define VTTBR_BADDR_MASK  (((1LLU << (40 - VTTBR_X)) - 1) << VTTBR_BADDR_SHIFT)
 +#define VTTBR_BADDR_MASK  (0xffffffffffLLU)  /* bits 0-39 */
  #define VTTBR_VMID_SHIFT  (48LLU)
  #define VTTBR_VMID_MASK   (0xffLLU << VTTBR_VMID_SHIFT)
  

 While this is obviously fixing a bug, it doesn't feel like the right
 short-term fix.  I'll have to go back and read the definitions of x in
 BADDR[47:x] for VTTBR_EL2 exactly again, but the intended use of
 VTTBR_BADDR_MASK (and the only occurence of it in C-code) is to deal
 with alignment of the allocated pgd.
I think there is some confusion.  Before VTTBR_BADDR_MASK always
evaluated to 0x7fffffffffLLU; after the change it always evaluates to
0xffffffffffLLU.

Neither before nor after the patch is it dealing with alignment.  Any
bits it throws away (bits 40-47) are most significant not least significant.

I could have rewritten the macro like:

#define VTTBR_BADDR_MASK  (((1LLU << (40 - VTTBR_X + 1)) - 1) << VTTBR_BADDR_SHIFT)

to correct the bug but it's my opinion that the existing code is quite
obfuscated which is how the bug happened in the first place.  It seemed
easier to just actually mask the bits in a straightforward and easy to
understand manner.  I even added a comment so nobody has to count the fs ;)



Re: [PATCH] arm64: fix VTTBR_BADDR_MASK

2014-07-10 Thread Joel Schopp

On 07/10/2014 04:02 PM, Joel Schopp wrote:
 On 07/10/2014 03:25 PM, Christoffer Dall wrote:
 On Wed, Jul 09, 2014 at 11:17:04AM -0500, Joel Schopp wrote:
 The current calculation for VTTBR_BADDR_MASK masks only 39 bits and not
 all 40 bits.  That last bit is important as some systems allocate
 from near the top of the available address space.

 This patch is necessary to run KVM on an aarch64 SOC I have been testing.

 Signed-off-by: Joel Schopp joel.sch...@amd.com
 ---
  arch/arm64/include/asm/kvm_arm.h |2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

 diff --git a/arch/arm64/include/asm/kvm_arm.h 
 b/arch/arm64/include/asm/kvm_arm.h
 index 3d69030..b39e93f 100644
 --- a/arch/arm64/include/asm/kvm_arm.h
 +++ b/arch/arm64/include/asm/kvm_arm.h
 @@ -148,7 +148,7 @@
  #endif
  
  #define VTTBR_BADDR_SHIFT (VTTBR_X - 1)
 -#define VTTBR_BADDR_MASK  (((1LLU << (40 - VTTBR_X)) - 1) << VTTBR_BADDR_SHIFT)
 +#define VTTBR_BADDR_MASK  (0xffffffffffLLU)  /* bits 0-39 */
  #define VTTBR_VMID_SHIFT  (48LLU)
  #define VTTBR_VMID_MASK   (0xffLLU << VTTBR_VMID_SHIFT)
  

 While this is obviously fixing a bug, it doesn't feel like the right
 short-term fix.  I'll have to go back and read the definitions of x in
 BADDR[47:x] for VTTBR_EL2 exactly again, but the intended use of
 VTTBR_BADDR_MASK (and the only occurence of it in C-code) is to deal
 with alignment of the allocated pgd.
 I think there is some confusion.  Before VTTBR_BADDR_MASK always
 evaluated to 0x7fffffffffLLU; after the change it always evaluates to
 0xffffffffffLLU

 Neither before nor after the patch is it dealing with alignment.  Any
 bits it throws away (bits 40-47) are most significant not least significant.

 I could have rewritten the macro like:

 #define VTTBR_BADDR_MASK  (((1LLU << (40 - VTTBR_X + 1)) - 1) << VTTBR_BADDR_SHIFT)

 to correct the bug but it's my opinion that the existing code is quite
 obfuscated which is how the bug happened in the first place.  It seemed
 easier to just actually mask the bits in a straightforward and easy to
 understand manner.  I even added a comment so nobody has to count the fs ;)

I hate to reply to my own email correcting myself.  But you were
correct.  I will fix and resend a v2.


[PATCH 1/4] PCI: introduce VFs reference counter and simplify pci_vfs_assigned() implementation

2014-07-10 Thread Ethan Zhao
The current implementation of the helper function pci_vfs_assigned() is a little
complex: to get the number of VFs assigned to a VM, it accesses a low-level
configuration space register and then loops while traversing the device tree.

This patch introduces an atomic reference counter in struct pci_sriov for the
VFs assigned to a VM, and adds two more helper functions,

pci_sriov_assign_device(),
pci_sriov_deassign_device(),

to replace direct manipulation of the device flag and, at the same time,
increment and decrement the counter.

Passed build testing on 3.15.5.

Signed-off-by: Ethan Zhao ethan.z...@oracle.com
---
 drivers/pci/iov.c   |   65 --
 drivers/pci/pci.h   |1 +
 include/linux/pci.h |4 +++
 3 files changed, 41 insertions(+), 29 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index de7a747..72e267f 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -382,6 +382,7 @@ found:
 	iov->nres = nres;
 	iov->ctrl = ctrl;
 	iov->total_VFs = total;
+	atomic_set(&iov->VFs_assigned_cnt, 0);
 	iov->offset = offset;
 	iov->stride = stride;
 	iov->pgsz = pgsz;
@@ -603,45 +604,51 @@ int pci_num_vf(struct pci_dev *dev)
 EXPORT_SYMBOL_GPL(pci_num_vf);
 
 /**
- * pci_vfs_assigned - returns number of VFs are assigned to a guest
- * @dev: the PCI device
+ * pci_vfs_assigned - returns number of VFs are assigned to VM
+ * @dev: the physical PCI device that contains the VFs.
  *
- * Returns number of VFs belonging to this device that are assigned to a guest.
+ * Returns number of VFs belonging to this device that are assigned to VM.
  * If device is not a physical function returns 0.
  */
 int pci_vfs_assigned(struct pci_dev *dev)
 {
-   struct pci_dev *vfdev;
-   unsigned int vfs_assigned = 0;
-   unsigned short dev_id;
-
-   /* only search if we are a PF */
 	if (!dev->is_physfn)
 		return 0;
+	if (dev->sriov)
+		return atomic_read(&dev->sriov->VFs_assigned_cnt);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(pci_vfs_assigned);
 
-   /*
-* determine the device ID for the VFs, the vendor ID will be the
-* same as the PF so there is no need to check for that one
-*/
-	pci_read_config_word(dev, dev->sriov->pos + PCI_SRIOV_VF_DID, &dev_id);
-
-   /* loop through all the VFs to see if we own any that are assigned */
-	vfdev = pci_get_device(dev->vendor, dev_id, NULL);
-   while (vfdev) {
-   /*
-* It is considered assigned if it is a virtual function with
-* our dev as the physical function and the assigned bit is set
-*/
-		if (vfdev->is_virtfn && (vfdev->physfn == dev) &&
-		    (vfdev->dev_flags & PCI_DEV_FLAGS_ASSIGNED))
-   vfs_assigned++;
-
-		vfdev = pci_get_device(dev->vendor, dev_id, vfdev);
-   }
+/**
+ * pci_sriov_assign_device - assign device to VM
+ * @pdev: the device to be assigned.
+ */
+void pci_sriov_assign_device(struct pci_dev *pdev)
+{
+	pdev->dev_flags |= PCI_DEV_FLAGS_ASSIGNED;
+	if (pdev->is_virtfn && !pdev->is_physfn)
+		if (pdev->physfn)
+			if (pdev->physfn->sriov)
+				atomic_inc(&pdev->physfn->sriov->
+						VFs_assigned_cnt);
+}
+EXPORT_SYMBOL_GPL(pci_sriov_assign_device);
 
-   return vfs_assigned;
+/**
+ * pci_sriov_deassign_device - deassign device from VM
+ * @pdev: the device to be deassigned.
+ */
+void pci_sriov_deassign_device(struct pci_dev *pdev)
+{
+	pdev->dev_flags &= ~PCI_DEV_FLAGS_ASSIGNED;
+	if (pdev->is_virtfn && !pdev->is_physfn)
+		if (pdev->physfn)
+			if (pdev->physfn->sriov)
+				atomic_dec(&pdev->physfn->sriov->
+						VFs_assigned_cnt);
 }
-EXPORT_SYMBOL_GPL(pci_vfs_assigned);
+EXPORT_SYMBOL_GPL(pci_sriov_deassign_device);
 
 /**
  * pci_sriov_set_totalvfs -- reduce the TotalVFs available
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 6bd0822..d17bda2 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -235,6 +235,7 @@ struct pci_sriov {
u32 pgsz;   /* page size for BAR alignment */
u8 link;/* Function Dependency Link */
u16 driver_max_VFs; /* max num VFs driver supports */
+   atomic_t VFs_assigned_cnt; /* counter of VFs assigned to VM */
struct pci_dev *dev;/* lowest numbered PF */
struct pci_dev *self;   /* this PF */
struct mutex lock;  /* lock for VF bus */
diff --git a/include/linux/pci.h b/include/linux/pci.h
index aab57b4..5cf6833 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1601,6 +1601,8 @@ int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
 void pci_disable_sriov(struct pci_dev *dev);
 int pci_num_vf(struct pci_dev *dev);
 int pci_vfs_assigned(struct pci_dev *dev);

[PATCH 2/4] xen-pciback: use PCI VFs assignment helper functions

2014-07-10 Thread Ethan Zhao
A new VF reference counter mechanism and VF assignment helper functions have
been introduced for PCI SR-IOV; use them instead of manipulating the device
flag directly.

Signed-off-by: Ethan Zhao ethan.z...@oracle.com
---
 drivers/xen/xen-pciback/pci_stub.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/xen/xen-pciback/pci_stub.c b/drivers/xen/xen-pciback/pci_stub.c
index 62fcd48..1d7b747 100644
--- a/drivers/xen/xen-pciback/pci_stub.c
+++ b/drivers/xen/xen-pciback/pci_stub.c
@@ -133,7 +133,7 @@ static void pcistub_device_release(struct kref *kref)
xen_pcibk_config_free_dyn_fields(dev);
xen_pcibk_config_free_dev(dev);
 
-	dev->dev_flags &= ~PCI_DEV_FLAGS_ASSIGNED;
+   pci_sriov_deassign_device(dev);
pci_dev_put(dev);
 
kfree(psdev);
@@ -404,7 +404,7 @@ static int pcistub_init_device(struct pci_dev *dev)
 	dev_dbg(&dev->dev, "reset device\n");
xen_pcibk_reset_device(dev);
 
-	dev->dev_flags |= PCI_DEV_FLAGS_ASSIGNED;
+   pci_sriov_assign_device(dev);
return 0;
 
 config_release:
-- 
1.7.1



[PATCH 4/4] i40e: use PCI VFs assignment helper function simplify i40e_vfs_are_assigned()

2014-07-10 Thread Ethan Zhao
A new VF reference counter mechanism and VF assignment helper functions have
been introduced for PCI SR-IOV; use them instead of manipulating the device
flag directly.

Signed-off-by: Ethan Zhao ethan.z...@oracle.com
---
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c |   17 ++---
 1 files changed, 2 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
index 02c11a7..781040e 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
@@ -693,22 +693,9 @@ complete_reset:
 static bool i40e_vfs_are_assigned(struct i40e_pf *pf)
 {
 	struct pci_dev *pdev = pf->pdev;
-	struct pci_dev *vfdev;
-
-	/* loop through all the VFs to see if we own any that are assigned */
-	vfdev = pci_get_device(PCI_VENDOR_ID_INTEL, I40E_DEV_ID_VF, NULL);
-	while (vfdev) {
-		/* if we don't own it we don't care */
-		if (vfdev->is_virtfn && pci_physfn(vfdev) == pdev) {
-			/* if it is assigned we cannot release it */
-			if (vfdev->dev_flags & PCI_DEV_FLAGS_ASSIGNED)
-				return true;
-		}
-
-		vfdev = pci_get_device(PCI_VENDOR_ID_INTEL,
-				       I40E_DEV_ID_VF,
-				       vfdev);
-	}
+	if (pci_vfs_assigned(pdev))
+		return true;
 
return false;
 }
-- 
1.7.1



[PATCH 3/4] KVM: use PCI VFs assignment helper functions

2014-07-10 Thread Ethan Zhao
A new VF reference counter mechanism and VF assignment helper functions have
been introduced for PCI SR-IOV; use them instead of manipulating the device
flag directly.

Signed-off-by: Ethan Zhao ethan.z...@oracle.com
---
 virt/kvm/assigned-dev.c |2 +-
 virt/kvm/iommu.c|4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
index bf06577..80b477f 100644
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -302,7 +302,7 @@ static void kvm_free_assigned_device(struct kvm *kvm,
else
 		pci_restore_state(assigned_dev->dev);
 
-	assigned_dev->dev->dev_flags &= ~PCI_DEV_FLAGS_ASSIGNED;
+	pci_sriov_deassign_device(assigned_dev->dev);
 
 	pci_release_regions(assigned_dev->dev);
 	pci_disable_device(assigned_dev->dev);
diff --git a/virt/kvm/iommu.c b/virt/kvm/iommu.c
index 0df7d4b..6d3f5ef 100644
--- a/virt/kvm/iommu.c
+++ b/virt/kvm/iommu.c
@@ -194,7 +194,7 @@ int kvm_assign_device(struct kvm *kvm,
goto out_unmap;
}
 
-	pdev->dev_flags |= PCI_DEV_FLAGS_ASSIGNED;
+   pci_sriov_assign_device(pdev);
 
 	dev_info(&pdev->dev, "kvm assign device\n");
 
@@ -220,7 +220,7 @@ int kvm_deassign_device(struct kvm *kvm,
 
 	iommu_detach_device(domain, &pdev->dev);
 
-	pdev->dev_flags &= ~PCI_DEV_FLAGS_ASSIGNED;
+   pci_sriov_deassign_device(pdev);
 
 	dev_info(&pdev->dev, "kvm deassign device\n");
 
-- 
1.7.1



[PATCH] KVM: PPC: Book3S: Add hack for split real mode

2014-07-10 Thread Alexander Graf
Today we handle split real mode by mapping both instruction and data faults
into a special virtual address space that only exists during the split mode
phase.

This is good enough to catch 32bit Linux guests that use split real mode for
copy_from/to_user. In this case we're always prefixed with 0xc000 for our
instruction pointer and can map the user space process freely below there.

However, that approach fails when we're running KVM inside of KVM. Here the 1st
level last_inst reader may well be in the same virtual page as a 2nd level
interrupt handler.

It also fails when running Mac OS X guests. Here we have a 4G/4G split, so a
kernel copy_from/to_user implementation can easily overlap with user space
addresses.

The architecturally correct way to fix this would be to implement an instruction
interpreter in KVM that kicks in whenever we go into split real mode. This
interpreter however would not receive a great amount of testing and be a lot of
bloat for a reasonably isolated corner case.

So I went back to the drawing board and tried to come up with a way to make
split real mode work with a single flat address space. And then I realized that
we could get away with the same trick that makes it work for Linux:

Whenever we see an instruction address during split real mode that may collide,
we just move it higher up the virtual address space to a place that hopefully
does not collide (keep your fingers crossed!).

That approach does work surprisingly well. I am able to successfully run
Mac OS X guests with KVM and QEMU (no split real mode hacks like MOL) when I
apply a tiny timing probe hack to QEMU. I'd say this is a win over even more
broken split real mode :).

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/include/asm/kvm_asm.h|  1 +
 arch/powerpc/include/asm/kvm_book3s.h |  3 +++
 arch/powerpc/kvm/book3s.c | 20 +++
 arch/powerpc/kvm/book3s_pr.c  | 48 +++
 4 files changed, 72 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_asm.h b/arch/powerpc/include/asm/kvm_asm.h
index 9601741..3f3e530 100644
--- a/arch/powerpc/include/asm/kvm_asm.h
+++ b/arch/powerpc/include/asm/kvm_asm.h
@@ -131,6 +131,7 @@
 #define BOOK3S_HFLAG_NATIVE_PS 0x8
 #define BOOK3S_HFLAG_MULTI_PGSIZE  0x10
 #define BOOK3S_HFLAG_NEW_TLBIE 0x20
+#define BOOK3S_HFLAG_SPLIT_HACK	0x40
 
 #define RESUME_FLAG_NV  (1<<0)  /* Reload guest nonvolatile state? */
 #define RESUME_FLAG_HOST(1<<1)  /* Resume host? */
diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index 8ac5392..cb7e661 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -324,4 +324,7 @@ static inline bool is_kvmppc_resume_guest(int r)
 /* LPIDs we support with this build -- runtime limit may be lower */
 #define KVMPPC_NR_LPIDS	(LPID_RSVD + 1)
 
+#define SPLIT_HACK_MASK	0xfff0
+#define SPLIT_HACK_OFFS	0xfe20
+
 #endif /* __ASM_KVM_BOOK3S_H__ */
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 9624c56..98fc18d 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -72,6 +72,17 @@ void kvmppc_core_load_guest_debugstate(struct kvm_vcpu *vcpu)
 {
 }
 
+void kvmppc_unfixup_split_real(struct kvm_vcpu *vcpu)
+{
+	if (vcpu->arch.hflags & BOOK3S_HFLAG_SPLIT_HACK) {
+		ulong pc = kvmppc_get_pc(vcpu);
+		if ((pc & SPLIT_HACK_MASK) == SPLIT_HACK_OFFS)
+			kvmppc_set_pc(vcpu, pc & ~SPLIT_HACK_MASK);
+		vcpu->arch.hflags &= ~BOOK3S_HFLAG_SPLIT_HACK;
+   }
+}
+EXPORT_SYMBOL_GPL(kvmppc_unfixup_split_real);
+
 static inline unsigned long kvmppc_interrupt_offset(struct kvm_vcpu *vcpu)
 {
	if (!is_kvmppc_hv_enabled(vcpu->kvm))
@@ -118,6 +129,7 @@ static inline bool kvmppc_critical_section(struct kvm_vcpu 
*vcpu)
 
 void kvmppc_inject_interrupt(struct kvm_vcpu *vcpu, int vec, u64 flags)
 {
+   kvmppc_unfixup_split_real(vcpu);
kvmppc_set_srr0(vcpu, kvmppc_get_pc(vcpu));
kvmppc_set_srr1(vcpu, kvmppc_get_msr(vcpu) | flags);
kvmppc_set_pc(vcpu, kvmppc_interrupt_offset(vcpu) + vec);
@@ -384,6 +396,14 @@ static int kvmppc_xlate(struct kvm_vcpu *vcpu, ulong 
eaddr, bool data,
		pte->may_write = true;
		pte->may_execute = true;
r = 0;
+
+		if ((kvmppc_get_msr(vcpu) & (MSR_IR | MSR_DR)) == MSR_DR &&
+		    !data) {
+			ulong pc = kvmppc_get_pc(vcpu);
+			if ((vcpu->arch.hflags & BOOK3S_HFLAG_SPLIT_HACK) &&
+			    ((eaddr & SPLIT_HACK_MASK) == SPLIT_HACK_OFFS))
+				pte->raddr &= ~SPLIT_HACK_MASK;
+		}
}
 
return r;
diff --git a/arch/powerpc/kvm/book3s_pr.c 

Re: [PATCH 1/4] PCI: introduce VFs reference counter and simplify pci_vfs_assigned() implementation

2014-07-10 Thread Alex Williamson

Since there's no 0th patch, I guess I'll comment here.  This series is
not bisectable, patch 1 breaks the existing implementation.  I'd
suggest:

patch 1 - fix i40e
patch 2 - create assign/deassign that uses dev_flags
patch 3 - convert users to new interface
patch 4 - convert interface to use atomic_t

IMHO, the assigned flag is a horrible hack and I don't know why drivers
like xenback need to use it.  KVM needs to use it because it doesn't
actually have a driver to bind to when a device is assigned, it's happy
to assign devices without any driver.  Thanks,

Alex


On Fri, 2014-07-11 at 08:47 +0800, Ethan Zhao wrote:
 The current implementation of the helper function pci_vfs_assigned() is a
 little complex: to get the number of VFs assigned to a VM, it accesses a
 low-level configuration space register and then loops over the device tree.
 
 This patch introduces an atomic reference counter in struct pci_sriov for
 VFs assigned to a VM, and adds two more helper functions,
 
 pci_sriov_assign_device(),
 pci_sriov_deassign_device()
 
 to replace the manipulation of the device flag while incrementing and
 decrementing the counter.
 
 Passed building on 3.15.5
 
 Signed-off-by: Ethan Zhao ethan.z...@oracle.com
 ---
  drivers/pci/iov.c   |   65 --
  drivers/pci/pci.h   |1 +
  include/linux/pci.h |4 +++
  3 files changed, 41 insertions(+), 29 deletions(-)
 
 diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
 index de7a747..72e267f 100644
 --- a/drivers/pci/iov.c
 +++ b/drivers/pci/iov.c
 @@ -382,6 +382,7 @@ found:
 	iov->nres = nres;
 	iov->ctrl = ctrl;
 	iov->total_VFs = total;
 +	atomic_set(&iov->VFs_assigned_cnt, 0);
 	iov->offset = offset;
 	iov->stride = stride;
 	iov->pgsz = pgsz;
 @@ -603,45 +604,51 @@ int pci_num_vf(struct pci_dev *dev)
  EXPORT_SYMBOL_GPL(pci_num_vf);
  
  /**
 - * pci_vfs_assigned - returns number of VFs are assigned to a guest
 - * @dev: the PCI device
 + * pci_vfs_assigned - returns number of VFs are assigned to VM
 + * @dev: the physical PCI device that contains the VFs.
   *
 - * Returns number of VFs belonging to this device that are assigned to a 
 guest.
 + * Returns number of VFs belonging to this device that are assigned to VM.
   * If device is not a physical function returns 0.
   */
  int pci_vfs_assigned(struct pci_dev *dev)
  {
 - struct pci_dev *vfdev;
 - unsigned int vfs_assigned = 0;
 - unsigned short dev_id;
 -
 - /* only search if we are a PF */
 	if (!dev->is_physfn)
 		return 0;
 +	if (dev->sriov)
 +		return atomic_read(&dev->sriov->VFs_assigned_cnt);
 + return 0;
 +}
 +EXPORT_SYMBOL_GPL(pci_vfs_assigned);
  
 - /*
 -  * determine the device ID for the VFs, the vendor ID will be the
 -  * same as the PF so there is no need to check for that one
 -  */
 -	pci_read_config_word(dev, dev->sriov->pos + PCI_SRIOV_VF_DID, &dev_id);
 -
 - /* loop through all the VFs to see if we own any that are assigned */
 -	vfdev = pci_get_device(dev->vendor, dev_id, NULL);
 - while (vfdev) {
 - /*
 -  * It is considered assigned if it is a virtual function with
 -  * our dev as the physical function and the assigned bit is set
 -  */
 -		if (vfdev->is_virtfn && (vfdev->physfn == dev) &&
 -		    (vfdev->dev_flags & PCI_DEV_FLAGS_ASSIGNED))
 -			vfs_assigned++;
 -
 -		vfdev = pci_get_device(dev->vendor, dev_id, vfdev);
 - }
 +/**
 + * pci_sriov_assign_device - assign device to VM
 + * @pdev: the device to be assigned.
 + */
 +void pci_sriov_assign_device(struct pci_dev *pdev)
 +{
 +	pdev->dev_flags |= PCI_DEV_FLAGS_ASSIGNED;
 +	if (pdev->is_virtfn && !pdev->is_physfn)
 +		if (pdev->physfn)
 +			if (pdev->physfn->sriov)
 +				atomic_inc(&pdev->physfn->sriov->
 +					VFs_assigned_cnt);
 +}
 +EXPORT_SYMBOL_GPL(pci_sriov_assign_device);
  
 - return vfs_assigned;
 +/**
 + * pci_sriov_deassign_device - deassign device from VM
 + * @pdev: the device to be deassigned.
 + */
 +void pci_sriov_deassign_device(struct pci_dev *pdev)
 +{
 +	pdev->dev_flags &= ~PCI_DEV_FLAGS_ASSIGNED;
 +	if (pdev->is_virtfn && !pdev->is_physfn)
 +		if (pdev->physfn)
 +			if (pdev->physfn->sriov)
 +				atomic_dec(&pdev->physfn->sriov->
 +					VFs_assigned_cnt);
  }
 -EXPORT_SYMBOL_GPL(pci_vfs_assigned);
 +EXPORT_SYMBOL_GPL(pci_sriov_deassign_device);
  
  /**
   * pci_sriov_set_totalvfs -- reduce the TotalVFs available
 diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
 index 6bd0822..d17bda2 100644
 --- a/drivers/pci/pci.h
 +++ b/drivers/pci/pci.h
 @@ -235,6 +235,7 @@ struct pci_sriov {
   u32 pgsz;   /* page size for BAR alignment */
   u8 link;/* Function 

Re: [PATCH 1/4] PCI: introduce VFs reference counter and simplify pci_vfs_assigned() implementation

2014-07-10 Thread ethan zhao

Alex,
   Thanks for your review. When I created the patch order, I thought about
the question you raised for quite a while, trying to make every patch as
independent of the others as I could, so we can bisect when we hit a
problem.

   I will take more time to figure out a better patch order.

   Thanks,
   Ethan

On 2014/7/11 9:48, Alex Williamson wrote:

Since there's no 0th patch, I guess I'll comment here.  This series is
not bisectable, patch 1 breaks the existing implementation.  I'd
suggest:

patch 1 - fix i40e
i40e only could be fixed with new interface, so it couldn't 
be the first one.

patch 2 - create assign/deassign that uses dev_flags

This will be the first ?

patch 3 - convert users to new interface

Have to be the later step.

patch 4 - convert interface to use atomic_t

Could it be standalone step ?

 Let me think about it.


IMHO, the assigned flag is a horrible hack and I don't know why drivers
like xenback need to use it.  KVM needs to use it because it doesn't
actually have a driver to bind to when a device is assigned, it's happy
to assign devices without any driver.  Thanks,

Alex



Re: [PATCH 1/4] PCI: introduce VFs reference counter and simplify pci_vfs_assigned() implementation

2014-07-10 Thread Alex Williamson
On Fri, 2014-07-11 at 10:10 +0800, ethan zhao wrote:
 Alex,
 Thanks for your reviewing, when I create the patch order, I thought 
 about the question you concerned for
 quit a while, make every patch be independent to each other as possible 
 as I could, so we can do bisect when hit
 problem.
 
 I manage to take more time to figure out better patch order.
 
 Thanks,
 Ethan
 
 On 2014/7/11 9:48, Alex Williamson wrote:
  Since there's no 0th patch, I guess I'll comment here.  This series is
  not bisectable, patch 1 breaks the existing implementation.  I'd
  suggest:
 
  patch 1 - fix i40e
  i40e only could be fixed with new interface, so it couldn't 
 be the first one.

It looks like i40e just has its own copy of pci_vfs_assigned(), so why
can't your current patch 4/4 be applied now?

  patch 2 - create assign/deassign that uses dev_flags
  This will be the first ?
  patch 3 - convert users to new interface
  Have to be the later step.
  patch 4 - convert interface to use atomic_t
  Could it be standalone step ?
 
   Let me think about it.
 
  IMHO, the assigned flag is a horrible hack and I don't know why drivers
  like xenback need to use it.  KVM needs to use it because it doesn't
  actually have a driver to bind to when a device is assigned, it's happy
  to assign devices without any driver.  Thanks,
 
  Alex
 
 

Re: [PATCH 1/4] PCI: introduce VFs reference counter and simplify pci_vfs_assigned() implementation

2014-07-10 Thread ethan zhao


On 2014/7/11 10:22, Alex Williamson wrote:

On Fri, 2014-07-11 at 10:10 +0800, ethan zhao wrote:

Alex,
 Thanks for your reviewing, when I create the patch order, I thought
about the question you concerned for
quit a while, make every patch be independent to each other as possible
as I could, so we can do bisect when hit
problem.

 I manage to take more time to figure out better patch order.

 Thanks,
 Ethan

On 2014/7/11 9:48, Alex Williamson wrote:

Since there's no 0th patch, I guess I'll comment here.  This series is
not bisectable, patch 1 breaks the existing implementation.  I'd
suggest:

patch 1 - fix i40e

  i40e only could be fixed with new interface, so it couldn't
be the first one.

It looks like i40e just has it's own copy of pci_vfs_assigned(), why
can't your current patch 4/4 be applied now?
 Yes, i40e has its local copy of pci_vfs_assigned(); it could be simplified
with the new interface. In other words, it is a user of the new interface.

 Thanks,
 Ethan



patch 2 - create assign/deassign that uses dev_flags

  This will be the first ?

patch 3 - convert users to new interface

  Have to be the later step.

patch 4 - convert interface to use atomic_t

  Could it be standalone step ?

   Let me think about it.


IMHO, the assigned flag is a horrible hack and I don't know why drivers
like xenback need to use it.  KVM needs to use it because it doesn't
actually have a driver to bind to when a device is assigned, it's happy
to assign devices without any driver.  Thanks,

Alex



Re: [PATCH 1/4] PCI: introduce VFs reference counter and simplify pci_vfs_assigned() implementation

2014-07-10 Thread Alex Williamson
On Fri, 2014-07-11 at 10:29 +0800, ethan zhao wrote:
 On 2014/7/11 10:22, Alex Williamson wrote:
  On Fri, 2014-07-11 at 10:10 +0800, ethan zhao wrote:
  Alex,
   Thanks for your reviewing, when I create the patch order, I thought
  about the question you concerned for
  quit a while, make every patch be independent to each other as possible
  as I could, so we can do bisect when hit
  problem.
 
   I manage to take more time to figure out better patch order.
 
   Thanks,
   Ethan
 
  On 2014/7/11 9:48, Alex Williamson wrote:
  Since there's no 0th patch, I guess I'll comment here.  This series is
  not bisectable, patch 1 breaks the existing implementation.  I'd
  suggest:
 
  patch 1 - fix i40e
i40e only could be fixed with new interface, so it couldn't
  be the first one.
  It looks like i40e just has it's own copy of pci_vfs_assigned(), why
  can't your current patch 4/4 be applied now?
   Yes, i40e has its local copy of pci_vfs_assigned(),it could be 
 simplified .
 with new interface,in another word, its a user of new interface.

It's a user of the existing interface.  You're missing the point: abstract
all the users of the assigned dev_flags first; then you're free to fix the
implementation of the interface in one shot without breaking anything.

  patch 2 - create assign/deassign that uses dev_flags
This will be the first ?
  patch 3 - convert users to new interface
Have to be the later step.
  patch 4 - convert interface to use atomic_t
Could it be standalone step ?
 
 Let me think about it.
 
  IMHO, the assigned flag is a horrible hack and I don't know why drivers
  like xenback need to use it.  KVM needs to use it because it doesn't
  actually have a driver to bind to when a device is assigned, it's happy
  to assign devices without any driver.  Thanks,
 
  Alex
 
 



Re: [PATCH 1/4] PCI: introduce VFs reference counter and simplify pci_vfs_assigned() implementation

2014-07-10 Thread ethan zhao


On 2014/7/11 10:33, Alex Williamson wrote:

On Fri, 2014-07-11 at 10:29 +0800, ethan zhao wrote:

On 2014/7/11 10:22, Alex Williamson wrote:

On Fri, 2014-07-11 at 10:10 +0800, ethan zhao wrote:

Alex,
  Thanks for your reviewing, when I create the patch order, I thought
about the question you concerned for
quit a while, make every patch be independent to each other as possible
as I could, so we can do bisect when hit
problem.

  I manage to take more time to figure out better patch order.

  Thanks,
  Ethan

On 2014/7/11 9:48, Alex Williamson wrote:

Since there's no 0th patch, I guess I'll comment here.  This series is
not bisectable, patch 1 breaks the existing implementation.  I'd
suggest:

patch 1 - fix i40e

   i40e only could be fixed with new interface, so it couldn't
be the first one.

It looks like i40e just has it's own copy of pci_vfs_assigned(), why
can't your current patch 4/4 be applied now?

   Yes, i40e has its local copy of pci_vfs_assigned(),it could be
simplified .
with new interface,in another word, its a user of new interface.

It's a user of the existing interface.  You're missing the point,
abstract all the users of the assigned dev_flags first, then you're free
to fix the implementation of the interface in one shot without breaking
anything.

  Great, you hit the mark this time, agreed!


[PATCH][RESEND] KVM: nVMX: Fix vmptrld fail and vmwrite error when L1 goes down

2014-07-10 Thread Wanpeng Li
This bug can be triggered when L1 goes down directly w/ enable_shadow_vmcs.

[ 6413.158950] kvm: vmptrld   (null)/7800 failed
[ 6413.158954] vmwrite error: reg 401e value 4 (err 1)
[ 6413.158957] CPU: 0 PID: 4840 Comm: qemu-system-x86 Tainted: G   OE 
3.16.0kvm+ #2
[ 6413.158958] Hardware name: Dell Inc. OptiPlex 9020/0DNKMN, BIOS A05 
12/05/2013
[ 6413.158959]  0003 880210c9fb58 81741de9 
8800d7433f80
[ 6413.158960]  880210c9fb68 a059fa08 880210c9fb78 
a05938bf
[ 6413.158962]  880210c9fba8 a059a97f 8800d7433f80 
0003
[ 6413.158963] Call Trace:
[ 6413.158968]  [81741de9] dump_stack+0x45/0x56
[ 6413.158972]  [a059fa08] vmwrite_error+0x2c/0x2e [kvm_intel]
[ 6413.158974]  [a05938bf] vmcs_writel+0x1f/0x30 [kvm_intel]
[ 6413.158976]  [a059a97f] free_nested.part.73+0x5f/0x170 [kvm_intel]
[ 6413.158978]  [a059ab13] vmx_free_vcpu+0x33/0x70 [kvm_intel]
[ 6413.158991]  [a0360324] kvm_arch_vcpu_free+0x44/0x50 [kvm]
[ 6413.158998]  [a0360f92] kvm_arch_destroy_vm+0xf2/0x1f0 [kvm]

Commit 26a865 (KVM: VMX: fix use after free of vmx->loaded_vmcs) fixed the use 
after free bug by moving free_loaded_vmcs() before free_nested(). However, this 
frees loaded_vmcs->vmcs prematurely, so vmptrld loads a NULL pointer during the 
sync of the shadow vmcs to vmcs12. In addition, the vmwrite used to disable the 
shadow vmcs and reset VMCS_LINK_POINTER fails since there is no valid current 
VMCS. This patch fixes it by skipping the kfree of the vmcs02_list item for the 
current vmcs in nested_free_all_saved_vmcss() and kfree'ing it after 
free_loaded_vmcs(). This also avoids the use after free bug.

Signed-off-by: Wanpeng Li wanpeng...@linux.intel.com
---
 arch/x86/kvm/vmx.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 021d84a..6e427ac 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5762,10 +5762,11 @@ static void nested_free_all_saved_vmcss(struct vcpu_vmx *vmx)
 {
struct vmcs02_list *item, *n;
	list_for_each_entry_safe(item, n, &vmx->nested.vmcs02_pool, list) {
-		if (vmx->loaded_vmcs != &item->vmcs02)
+		if (vmx->loaded_vmcs != &item->vmcs02) {
			free_loaded_vmcs(&item->vmcs02);
-		list_del(&item->list);
-		kfree(item);
+			list_del(&item->list);
+			kfree(item);
+		}
	}
	vmx->nested.vmcs02_num = 0;
 
@@ -7526,10 +7527,16 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 static void vmx_free_vcpu(struct kvm_vcpu *vcpu)
 {
struct vcpu_vmx *vmx = to_vmx(vcpu);
+   struct vmcs02_list *item;
 
free_vpid(vmx);
-	free_loaded_vmcs(vmx->loaded_vmcs);
	free_nested(vmx);
+	free_loaded_vmcs(vmx->loaded_vmcs);
+	if (vmx->nested.vmxon)
+		list_for_each_entry(item, &vmx->nested.vmcs02_pool, list) {
+			list_del(&item->list);
+			kfree(item);
+		}
	kfree(vmx->guest_msrs);
kvm_vcpu_uninit(vcpu);
kmem_cache_free(kvm_vcpu_cache, vmx);
-- 
1.9.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] powerpc/kvm: Create proper names for the kvm_host_state PMU fields

2014-07-10 Thread Michael Ellerman
On Thu, 2014-07-10 at 12:16 +0200, Alexander Graf wrote:
 On 10.07.14 11:34, Michael Ellerman wrote:
  We have two arrays in kvm_host_state that contain register values for
  the PMU. Currently we only create an asm-offsets symbol for the base of
  the arrays, and do the array offset in the assembly code.
 
  Creating an asm-offsets symbol for each field individually makes the
  code much nicer to read, particularly for the MMCRx/SIxR/SDAR fields, and
  might have helped us notice the recent double restore bug we had in this
  code.
 
  Signed-off-by: Michael Ellerman m...@ellerman.id.au
 
 Acked-by: Alexander Graf ag...@suse.de

Thanks.

 I still think this whole code path should just be C though.

Yeah it probably should.

cheers




[PATCH 2/2 V2] PCI: implement VFs assignment reference counter

2014-07-10 Thread Ethan Zhao
The current implementation of the helper function pci_vfs_assigned() is a
little complex: to count the VFs assigned to a VM, it reads a low-level
configuration space register and then loops over the device tree.

This patch introduces an atomic reference counter in struct pci_sriov for
the VFs assigned to a VM, and simplifies the code in pci_vfs_assigned().

v2: reorder the patchset according to the suggestion from
alex.william...@redhat.com

Signed-off-by: Ethan Zhao ethan.z...@oracle.com
---
 drivers/pci/iov.c |   45 +
 drivers/pci/pci.h |1 +
 2 files changed, 18 insertions(+), 28 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index c082523..5478a0c 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -382,6 +382,7 @@ found:
	iov->nres = nres;
	iov->ctrl = ctrl;
	iov->total_VFs = total;
+	atomic_set(&iov->VFs_assigned_cnt, 0);
	iov->offset = offset;
	iov->stride = stride;
	iov->pgsz = pgsz;
@@ -603,43 +604,21 @@ int pci_num_vf(struct pci_dev *dev)
 EXPORT_SYMBOL_GPL(pci_num_vf);
 
 /**
- * pci_vfs_assigned - returns number of VFs are assigned to a guest
- * @dev: the PCI device
+ * pci_vfs_assigned - returns the number of VFs assigned to a VM
+ * @dev: the physical PCI device that contains the VFs
  *
- * Returns number of VFs belonging to this device that are assigned to a guest.
+ * Returns the number of VFs belonging to this device that are assigned to a VM.
  * If device is not a physical function returns 0.
  */
 int pci_vfs_assigned(struct pci_dev *dev)
 {
-	struct pci_dev *vfdev;
-	unsigned int vfs_assigned = 0;
-	unsigned short dev_id;
-
	/* only search if we are a PF */
	if (!dev->is_physfn)
		return 0;

-	/*
-	 * determine the device ID for the VFs, the vendor ID will be the
-	 * same as the PF so there is no need to check for that one
-	 */
-	pci_read_config_word(dev, dev->sriov->pos + PCI_SRIOV_VF_DID, &dev_id);
-
-	/* loop through all the VFs to see if we own any that are assigned */
-	vfdev = pci_get_device(dev->vendor, dev_id, NULL);
-	while (vfdev) {
-		/*
-		 * It is considered assigned if it is a virtual function with
-		 * our dev as the physical function and the assigned bit is set
-		 */
-		if (vfdev->is_virtfn && (vfdev->physfn == dev) &&
-		    (vfdev->dev_flags & PCI_DEV_FLAGS_ASSIGNED))
-			vfs_assigned++;
-
-		vfdev = pci_get_device(dev->vendor, dev_id, vfdev);
-	}
-
-	return vfs_assigned;
+	if (dev->sriov)
+		return atomic_read(&dev->sriov->VFs_assigned_cnt);
+	return 0;
 }
 EXPORT_SYMBOL_GPL(pci_vfs_assigned);
 
@@ -650,6 +629,11 @@ EXPORT_SYMBOL_GPL(pci_vfs_assigned);
 void pci_sriov_assign_device(struct pci_dev *pdev)
 {
	pdev->dev_flags |= PCI_DEV_FLAGS_ASSIGNED;
+	if (pdev->is_virtfn && !pdev->is_physfn)
+		if (pdev->physfn)
+			if (pdev->physfn->sriov)
+				atomic_inc(&pdev->physfn->sriov->
+						VFs_assigned_cnt);
 }
 EXPORT_SYMBOL_GPL(pci_sriov_assign_device);
 
@@ -660,6 +644,11 @@ EXPORT_SYMBOL_GPL(pci_sriov_assign_device);
 void pci_sriov_deassign_device(struct pci_dev *pdev)
 {
	pdev->dev_flags &= ~PCI_DEV_FLAGS_ASSIGNED;
+	if (pdev->is_virtfn && !pdev->is_physfn)
+		if (pdev->physfn)
+			if (pdev->physfn->sriov)
+				atomic_dec(&pdev->physfn->sriov->
+						VFs_assigned_cnt);
 }
 EXPORT_SYMBOL_GPL(pci_sriov_deassign_device);  
 
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 6bd0822..d17bda2 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -235,6 +235,7 @@ struct pci_sriov {
u32 pgsz;   /* page size for BAR alignment */
u8 link;/* Function Dependency Link */
u16 driver_max_VFs; /* max num VFs driver supports */
+   atomic_t VFs_assigned_cnt; /* counter of VFs assigned to VM */
struct pci_dev *dev;/* lowest numbered PF */
struct pci_dev *self;   /* this PF */
struct mutex lock;  /* lock for VF bus */
-- 
1.7.1



[PATCH 1/2 V2] PCI: introduce device assignment interface and refactor related code

2014-07-10 Thread Ethan Zhao
This patch introduces two new device assignment functions

pci_sriov_assign_device(),
pci_sriov_deassign_device()

along with the existing one

pci_vfs_assigned()

Together, they form the VF assignment management interface, used to
assign/deassign a device to a VM and to query the VF reference counter,
instead of manipulating the device flag directly.

This patch refactors the related code and makes it atomic.

v2: reordered the patchset to make it bisectable and atomic, with clear
steps between interface definition and implementation, per the
suggestion from alex.william...@redhat.com

Signed-off-by: Ethan Zhao ethan.z...@oracle.com
---
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c |   17 ++---
 drivers/pci/iov.c  |   20 
 drivers/xen/xen-pciback/pci_stub.c |4 ++--
 include/linux/pci.h|4 
 virt/kvm/assigned-dev.c|2 +-
 virt/kvm/iommu.c   |4 ++--
 6 files changed, 31 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
index 02c11a7..781040e 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
@@ -693,22 +693,9 @@ complete_reset:
 static bool i40e_vfs_are_assigned(struct i40e_pf *pf)
 {
	struct pci_dev *pdev = pf->pdev;
-	struct pci_dev *vfdev;
-
-	/* loop through all the VFs to see if we own any that are assigned */
-	vfdev = pci_get_device(PCI_VENDOR_ID_INTEL, I40E_DEV_ID_VF, NULL);
-	while (vfdev) {
-		/* if we don't own it we don't care */
-		if (vfdev->is_virtfn && pci_physfn(vfdev) == pdev) {
-			/* if it is assigned we cannot release it */
-			if (vfdev->dev_flags & PCI_DEV_FLAGS_ASSIGNED)
-				return true;
-		}

-		vfdev = pci_get_device(PCI_VENDOR_ID_INTEL,
-				       I40E_DEV_ID_VF,
-				       vfdev);
-	}
+	if (pci_vfs_assigned(pdev))
+		return true;

	return false;
 }
diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index de7a747..c082523 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -644,6 +644,26 @@ int pci_vfs_assigned(struct pci_dev *dev)
 EXPORT_SYMBOL_GPL(pci_vfs_assigned);
 
 /**
+ * pci_sriov_assign_device - assign device to VM
+ * @pdev: the device to be assigned.
+ */
+void pci_sriov_assign_device(struct pci_dev *pdev)
+{
+	pdev->dev_flags |= PCI_DEV_FLAGS_ASSIGNED;
+}
+EXPORT_SYMBOL_GPL(pci_sriov_assign_device);
+
+/**
+ * pci_sriov_deassign_device - deassign device from VM
+ * @pdev: the device to be deassigned.
+ */
+void pci_sriov_deassign_device(struct pci_dev *pdev)
+{
+	pdev->dev_flags &= ~PCI_DEV_FLAGS_ASSIGNED;
+}
+EXPORT_SYMBOL_GPL(pci_sriov_deassign_device);
+
+/**
  * pci_sriov_set_totalvfs -- reduce the TotalVFs available
  * @dev: the PCI PF device
  * @numvfs: number that should be used for TotalVFs supported
diff --git a/drivers/xen/xen-pciback/pci_stub.c b/drivers/xen/xen-pciback/pci_stub.c
index 62fcd48..1d7b747 100644
--- a/drivers/xen/xen-pciback/pci_stub.c
+++ b/drivers/xen/xen-pciback/pci_stub.c
@@ -133,7 +133,7 @@ static void pcistub_device_release(struct kref *kref)
xen_pcibk_config_free_dyn_fields(dev);
xen_pcibk_config_free_dev(dev);
 
-	dev->dev_flags &= ~PCI_DEV_FLAGS_ASSIGNED;
+   pci_sriov_deassign_device(dev);
pci_dev_put(dev);
 
kfree(psdev);
@@ -404,7 +404,7 @@ static int pcistub_init_device(struct pci_dev *dev)
	dev_dbg(&dev->dev, "reset device\n");
xen_pcibk_reset_device(dev);
 
-	dev->dev_flags |= PCI_DEV_FLAGS_ASSIGNED;
+   pci_sriov_assign_device(dev);
return 0;
 
 config_release:
diff --git a/include/linux/pci.h b/include/linux/pci.h
index aab57b4..ddfcceb 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1603,6 +1603,8 @@ int pci_num_vf(struct pci_dev *dev);
 int pci_vfs_assigned(struct pci_dev *dev);
 int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs);
 int pci_sriov_get_totalvfs(struct pci_dev *dev);
+void pci_sriov_assign_device(struct pci_dev *dev);
+void pci_sriov_deassign_device(struct pci_dev *dev);
 #else
 static inline int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn)
 { return -ENODEV; }
@@ -1614,6 +1616,8 @@ static inline int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs)
 { return 0; }
 static inline int pci_sriov_get_totalvfs(struct pci_dev *dev)
 { return 0; }
+static inline void pci_sriov_assign_device(struct pci_dev *dev) { }
+static inline void pci_sriov_deassign_device(struct pci_dev *dev) { }
 #endif
 
 #if defined(CONFIG_HOTPLUG_PCI) || defined(CONFIG_HOTPLUG_PCI_MODULE)
diff --git a/virt/kvm/assigned-dev.c 

Re: [PATCH v2] Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8

2014-07-10 Thread Alexander Graf


On 09.07.14 00:59, Stewart Smith wrote:

Hi!

Thanks for review, much appreciated!

Alexander Graf ag...@suse.de writes:

On 08.07.14 07:06, Stewart Smith wrote:

@@ -1528,6 +1535,7 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
int i, need_vpa_update;
int srcu_idx;
struct kvm_vcpu *vcpus_to_update[threads_per_core];
+   phys_addr_t phy_addr, tmp;

Please put the variable declarations into the if () branch so that the
compiler can catch potential leaks :)

ack. will fix.


@@ -1590,9 +1598,48 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
 
 	srcu_idx = srcu_read_lock(&vc->kvm->srcu);
 
+	/* If we have a saved list of L2/L3, restore it */

+	if (cpu_has_feature(CPU_FTR_ARCH_207S) && vc->mpp_buffer) {
+		phy_addr = virt_to_phys((void *)vc->mpp_buffer);
+#if defined(CONFIG_PPC_4K_PAGES)
+		phy_addr = (phy_addr + 8*4096) & ~(8*4096);

get_free_pages() is automatically aligned to the order, no?

That's what Paul reckoned too, and then we've attempted to find anywhere
that documents that behaviour. Happen to be able to point to docs/source
that say this is part of API?


Phew - it's probably buried somewhere. I could only find this document 
saying that we always get order-aligned allocations:


http://www.thehackademy.net/madchat/ebooks/Mem_virtuelle/linux-mm/zonealloc.html

Mel, do you happen to have any pointer to something that explicitly (or 
even properly implicitly) says that get_free_pages() returns 
order-aligned memory?





+#endif
+		tmp = phy_addr & PPC_MPPE_ADDRESS_MASK;
+		tmp = tmp | PPC_MPPE_WHOLE_TABLE;
+
+		/* For sanity, abort any 'save' requests in progress */
+		asm volatile(PPC_LOGMPP(R1) : : "r" (tmp));
+
+		/* Initiate a cache-load request */
+		mtspr(SPRN_MPPR, tmp);
+   }

In fact, this whole block up here could be a function, no?

It could, perfectly happy for it to be one. Will fix.


+
+   /* Allocate memory before switching out of guest so we don't
+  trash L2/L3 with memory allocation stuff */
+	if (cpu_has_feature(CPU_FTR_ARCH_207S) && !vc->mpp_buffer) {
+		vc->mpp_buffer = __get_free_pages(GFP_KERNEL|__GFP_ZERO,
+						  MPP_BUFFER_ORDER);

get_order(64 * 1024)?

Also, why allocate it here and not on vcore creation?

There's also the possibility of saving/restorting part of the L3 cache
as well, and I was envisioning a future patch to this which checks a
flag in vcore (maybe exposed via sysfs or whatever mechanism is
applicable) if it should save/restore L2 or L2/L3, so thus it makes a
bit more sense allocating it there rather than elsewhere.

There's also no real reason to fail to create a vcore if we can't
allocate a buffer for L2/L3 cache contents - retrying later is perfectly
harmless.


If we failed during core creation just don't save/restore L2 cache 
contents at all. I really prefer to have allocation and dealloction all 
at init time - and such low order allocations will most likely succeed.


Let's leave the L3 cache bits for later when we know whether it actually 
has an impact. I personally doubt it :).



Alex



Re: [PATCH v2] Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8

2014-07-10 Thread Mel Gorman
On Thu, Jul 10, 2014 at 01:05:47PM +0200, Alexander Graf wrote:
 
 On 09.07.14 00:59, Stewart Smith wrote:
 Hi!
 
 Thanks for review, much appreciated!
 
 Alexander Graf ag...@suse.de writes:
 On 08.07.14 07:06, Stewart Smith wrote:
 @@ -1528,6 +1535,7 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
int i, need_vpa_update;
int srcu_idx;
struct kvm_vcpu *vcpus_to_update[threads_per_core];
 +  phys_addr_t phy_addr, tmp;
 Please put the variable declarations into the if () branch so that the
 compiler can catch potential leaks :)
 ack. will fix.
 
 @@ -1590,9 +1598,48 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
 	srcu_idx = srcu_read_lock(&vc->kvm->srcu);
 +  /* If we have a saved list of L2/L3, restore it */
 +  if (cpu_has_feature(CPU_FTR_ARCH_207S) && vc->mpp_buffer) {
 +  phy_addr = virt_to_phys((void *)vc->mpp_buffer);
 +#if defined(CONFIG_PPC_4K_PAGES)
 +  phy_addr = (phy_addr + 8*4096) & ~(8*4096);
 get_free_pages() is automatically aligned to the order, no?
 That's what Paul reckoned too, and then we've attempted to find anywhere
 that documents that behaviour. Happen to be able to point to docs/source
 that say this is part of API?
 
 Phew - it's probably buried somewhere. I could only find this
 document saying that we always get order-aligned allocations:
 
 http://www.thehackademy.net/madchat/ebooks/Mem_virtuelle/linux-mm/zonealloc.html
 
 Mel, do you happen to have any pointer to something that explicitly
 (or even properly implicitly) says that get_free_pages() returns
 order-aligned memory?
 

I did not read the whole thread so I lack context and will just answer
this part.

There is no guarantee that pages are returned in PFN order for multiple
requests to the page allocator. This is the relevant comment in
rmqueue_bulk

/*
 * Split buddy pages returned by expand() are received here
 * in physical page order. The page is added to the callers and
 * list and the list head then moves forward. From the callers
 * perspective, the linked list is ordered by page number in
 * some conditions. This is useful for IO devices that can
 * merge IO requests if the physical pages are ordered
 * properly.
 */

It will probably be true early in the lifetime of the system but the milage
will vary on systems with a lot of uptime. If you depend on this behaviour
for correctness then you will have a bad day.

High-order page requests to the page allocator are guaranteed to be in physical
order. However, this does not apply to vmalloc() where allocations are
only guaranteed to be virtually contiguous.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH v2] Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8

2014-07-10 Thread Alexander Graf


On 10.07.14 15:07, Mel Gorman wrote:

On Thu, Jul 10, 2014 at 01:05:47PM +0200, Alexander Graf wrote:

On 09.07.14 00:59, Stewart Smith wrote:

Hi!

Thanks for review, much appreciated!

Alexander Graf ag...@suse.de writes:

On 08.07.14 07:06, Stewart Smith wrote:

@@ -1528,6 +1535,7 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
int i, need_vpa_update;
int srcu_idx;
struct kvm_vcpu *vcpus_to_update[threads_per_core];
+   phys_addr_t phy_addr, tmp;

Please put the variable declarations into the if () branch so that the
compiler can catch potential leaks :)

ack. will fix.


@@ -1590,9 +1598,48 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
	srcu_idx = srcu_read_lock(&vc->kvm->srcu);
+	/* If we have a saved list of L2/L3, restore it */
+	if (cpu_has_feature(CPU_FTR_ARCH_207S) && vc->mpp_buffer) {
+		phy_addr = virt_to_phys((void *)vc->mpp_buffer);
+#if defined(CONFIG_PPC_4K_PAGES)
+		phy_addr = (phy_addr + 8*4096) & ~(8*4096);

get_free_pages() is automatically aligned to the order, no?

That's what Paul reckoned too, and then we've attempted to find anywhere
that documents that behaviour. Happen to be able to point to docs/source
that say this is part of API?

Phew - it's probably buried somewhere. I could only find this
document saying that we always get order-aligned allocations:

http://www.thehackademy.net/madchat/ebooks/Mem_virtuelle/linux-mm/zonealloc.html

Mel, do you happen to have any pointer to something that explicitly
(or even properly implicitly) says that get_free_pages() returns
order-aligned memory?


I did not read the whole thread so I lack context and will just answer
this part.

There is no guarantee that pages are returned in PFN order for multiple
requests to the page allocator. This is the relevant comment in
rmqueue_bulk

 /*
  * Split buddy pages returned by expand() are received here
  * in physical page order. The page is added to the callers and
  * list and the list head then moves forward. From the callers
  * perspective, the linked list is ordered by page number in
  * some conditions. This is useful for IO devices that can
  * merge IO requests if the physical pages are ordered
  * properly.
  */

It will probably be true early in the lifetime of the system but the milage
will vary on systems with a lot of uptime. If you depend on this behaviour
for correctness then you will have a bad day.

High-order page requests to the page allocator are guaranteed to be in physical
order. However, this does not apply to vmalloc() where allocations are
only guaranteed to be virtually contiguous.


Hrm, ok to be very concrete:

  Does __get_free_pages(..., 4); on a 4k page size system give me a 64k 
aligned pointer? :)



Alex



Re: [PATCH v2] Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8

2014-07-10 Thread Mel Gorman
On Thu, Jul 10, 2014 at 03:17:16PM +0200, Alexander Graf wrote:
 
 On 10.07.14 15:07, Mel Gorman wrote:
 On Thu, Jul 10, 2014 at 01:05:47PM +0200, Alexander Graf wrote:
 On 09.07.14 00:59, Stewart Smith wrote:
 Hi!
 
 Thanks for review, much appreciated!
 
 Alexander Graf ag...@suse.de writes:
 On 08.07.14 07:06, Stewart Smith wrote:
 @@ -1528,6 +1535,7 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
  int i, need_vpa_update;
  int srcu_idx;
  struct kvm_vcpu *vcpus_to_update[threads_per_core];
 +phys_addr_t phy_addr, tmp;
 Please put the variable declarations into the if () branch so that the
 compiler can catch potential leaks :)
 ack. will fix.
 
 @@ -1590,9 +1598,48 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
  	srcu_idx = srcu_read_lock(&vc->kvm->srcu);
 +	/* If we have a saved list of L2/L3, restore it */
 +	if (cpu_has_feature(CPU_FTR_ARCH_207S) && vc->mpp_buffer) {
 +	phy_addr = virt_to_phys((void *)vc->mpp_buffer);
 +#if defined(CONFIG_PPC_4K_PAGES)
 +	phy_addr = (phy_addr + 8*4096) & ~(8*4096);
 get_free_pages() is automatically aligned to the order, no?
 That's what Paul reckoned too, and then we've attempted to find anywhere
 that documents that behaviour. Happen to be able to point to docs/source
 that say this is part of API?
 Phew - it's probably buried somewhere. I could only find this
 document saying that we always get order-aligned allocations:
 
 http://www.thehackademy.net/madchat/ebooks/Mem_virtuelle/linux-mm/zonealloc.html
 
 Mel, do you happen to have any pointer to something that explicitly
 (or even properly implicitly) says that get_free_pages() returns
 order-aligned memory?
 
 I did not read the whole thread so I lack context and will just answer
 this part.
 
 There is no guarantee that pages are returned in PFN order for multiple
 requests to the page allocator. This is the relevant comment in
 rmqueue_bulk
 
  /*
   * Split buddy pages returned by expand() are received here
   * in physical page order. The page is added to the callers 
  and
   * list and the list head then moves forward. From the 
  callers
   * perspective, the linked list is ordered by page number in
   * some conditions. This is useful for IO devices that can
   * merge IO requests if the physical pages are ordered
   * properly.
   */
 
 It will probably be true early in the lifetime of the system but the milage
 will vary on systems with a lot of uptime. If you depend on this behaviour
 for correctness then you will have a bad day.
 
 High-order page requests to the page allocator are guaranteed to be in 
 physical
 order. However, this does not apply to vmalloc() where allocations are
 only guaranteed to be virtually contiguous.
 
 Hrm, ok to be very concrete:
 
   Does __get_free_pages(..., 4); on a 4k page size system give me a
 64k aligned pointer? :)
 

Yes.

-- 
Mel Gorman
SUSE Labs


[PATCH 0/2] KVM: PPC: Fix Mac OS X guests on non-Apple hosts

2014-07-10 Thread Alexander Graf
While trying to get Mac-on-Linux to work on a POWER7 box I stumbled over two
issues in our book3s_32 MMU emulation. With these issues fixed and a hack to
disable magic page mapping (very hard to get right with 64k pages in this case)
I can successfully run Mac OS X guests on a POWER7 host.

Alex

Alexander Graf (2):
  KVM: PPC: Deflect page write faults properly in kvmppc_st
  KVM: PPC: Book3S: Stop PTE lookup on write errors

 arch/powerpc/kvm/book3s.c| 6 --
 arch/powerpc/kvm/book3s_32_mmu.c | 2 +-
 2 files changed, 5 insertions(+), 3 deletions(-)

-- 
1.8.1.4



[PATCH 1/2] KVM: PPC: Deflect page write faults properly in kvmppc_st

2014-07-10 Thread Alexander Graf
When we have a page that we're not allowed to write to, xlate() already
returns -EPERM on lookup of that page. With the code as is, we turn that
into a page-missing error, which a guest may get confused about. Instead,
just pass the -EPERM back to the caller directly.

This fixes Mac OS X guests when run with DCBZ32 emulation.

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/kvm/book3s.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index bd75902..9624c56 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -418,11 +418,13 @@ int kvmppc_st(struct kvm_vcpu *vcpu, ulong *eaddr, int size, void *ptr,
  bool data)
 {
struct kvmppc_pte pte;
+   int r;
 
	vcpu->stat.st++;
 
-	if (kvmppc_xlate(vcpu, *eaddr, data, true, &pte))
-		return -ENOENT;
+	r = kvmppc_xlate(vcpu, *eaddr, data, true, &pte);
+	if (r < 0)
+		return r;
 
*eaddr = pte.raddr;
 
-- 
1.8.1.4



[PATCH 2/2] KVM: PPC: Book3S: Stop PTE lookup on write errors

2014-07-10 Thread Alexander Graf
When a page lookup fails because we're not allowed to write to the page, we
should not overwrite that error with another lookup on the second PTEG, which
will return "page not found". Instead, we should just tell the caller that we
had a permission problem.

This fixes Mac OS X guests looping endlessly in page lookup code for me.

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/kvm/book3s_32_mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_32_mmu.c b/arch/powerpc/kvm/book3s_32_mmu.c
index 93503bb..cd0b073 100644
--- a/arch/powerpc/kvm/book3s_32_mmu.c
+++ b/arch/powerpc/kvm/book3s_32_mmu.c
@@ -335,7 +335,7 @@ static int kvmppc_mmu_book3s_32_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
	if (r < 0)
		r = kvmppc_mmu_book3s_32_xlate_pte(vcpu, eaddr, pte,
						   data, iswrite, true);
-	if (r < 0)
+	if (r == -ENOENT)
		r = kvmppc_mmu_book3s_32_xlate_pte(vcpu, eaddr, pte,
						   data, iswrite, false);
 
-- 
1.8.1.4



[PATCH] KVM: PPC: Book3S: Add hack for split real mode

2014-07-10 Thread Alexander Graf
Today we handle split real mode by mapping both instruction and data faults
into a special virtual address space that only exists during the split mode
phase.

This is good enough to catch 32bit Linux guests that use split real mode for
copy_from/to_user. In this case we're always prefixed with 0xc000 for our
instruction pointer and can map the user space process freely below there.

However, that approach fails when we're running KVM inside of KVM. Here the 1st
level last_inst reader may well be in the same virtual page as a 2nd level
interrupt handler.

It also fails when running Mac OS X guests. Here we have a 4G/4G split, so a
kernel copy_from/to_user implementation can easily overlap with user space
addresses.

The architecturally correct way to fix this would be to implement an instruction
interpreter in KVM that kicks in whenever we go into split real mode. This
interpreter however would not receive a great amount of testing and be a lot of
bloat for a reasonably isolated corner case.

So I went back to the drawing board and tried to come up with a way to make
split real mode work with a single flat address space. And then I realized that
we could get away with the same trick that makes it work for Linux:

Whenever we see an instruction address during split real mode that may collide,
we just move it higher up the virtual address space to a place that hopefully
does not collide (keep your fingers crossed!).

That approach does work surprisingly well. I am able to successfully run
Mac OS X guests with KVM and QEMU (no split real mode hacks like MOL) when I
apply a tiny timing probe hack to QEMU. I'd say this is a win over even more
broken split real mode :).

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/include/asm/kvm_asm.h|  1 +
 arch/powerpc/include/asm/kvm_book3s.h |  3 +++
 arch/powerpc/kvm/book3s.c | 20 +++
 arch/powerpc/kvm/book3s_pr.c  | 48 +++
 4 files changed, 72 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_asm.h b/arch/powerpc/include/asm/kvm_asm.h
index 9601741..3f3e530 100644
--- a/arch/powerpc/include/asm/kvm_asm.h
+++ b/arch/powerpc/include/asm/kvm_asm.h
@@ -131,6 +131,7 @@
 #define BOOK3S_HFLAG_NATIVE_PS 0x8
 #define BOOK3S_HFLAG_MULTI_PGSIZE  0x10
 #define BOOK3S_HFLAG_NEW_TLBIE 0x20
+#define BOOK3S_HFLAG_SPLIT_HACK0x40
 
 #define RESUME_FLAG_NV  (1<<0)  /* Reload guest nonvolatile state? */
 #define RESUME_FLAG_HOST(1<<1)  /* Resume host? */
diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
index 8ac5392..cb7e661 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -324,4 +324,7 @@ static inline bool is_kvmppc_resume_guest(int r)
 /* LPIDs we support with this build -- runtime limit may be lower */
 #define KVMPPC_NR_LPIDS(LPID_RSVD + 1)
 
+#define SPLIT_HACK_MASK0xfff0
+#define SPLIT_HACK_OFFS0xfe20
+
 #endif /* __ASM_KVM_BOOK3S_H__ */
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 9624c56..98fc18d 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -72,6 +72,17 @@ void kvmppc_core_load_guest_debugstate(struct kvm_vcpu *vcpu)
 {
 }
 
+void kvmppc_unfixup_split_real(struct kvm_vcpu *vcpu)
+{
+	if (vcpu->arch.hflags & BOOK3S_HFLAG_SPLIT_HACK) {
+		ulong pc = kvmppc_get_pc(vcpu);
+		if ((pc & SPLIT_HACK_MASK) == SPLIT_HACK_OFFS)
+			kvmppc_set_pc(vcpu, pc & ~SPLIT_HACK_MASK);
+		vcpu->arch.hflags &= ~BOOK3S_HFLAG_SPLIT_HACK;
+   }
+}
+EXPORT_SYMBOL_GPL(kvmppc_unfixup_split_real);
+
 static inline unsigned long kvmppc_interrupt_offset(struct kvm_vcpu *vcpu)
 {
	if (!is_kvmppc_hv_enabled(vcpu->kvm))
@@ -118,6 +129,7 @@ static inline bool kvmppc_critical_section(struct kvm_vcpu *vcpu)
 
 void kvmppc_inject_interrupt(struct kvm_vcpu *vcpu, int vec, u64 flags)
 {
+   kvmppc_unfixup_split_real(vcpu);
kvmppc_set_srr0(vcpu, kvmppc_get_pc(vcpu));
kvmppc_set_srr1(vcpu, kvmppc_get_msr(vcpu) | flags);
kvmppc_set_pc(vcpu, kvmppc_interrupt_offset(vcpu) + vec);
@@ -384,6 +396,14 @@ static int kvmppc_xlate(struct kvm_vcpu *vcpu, ulong eaddr, bool data,
		pte->may_write = true;
		pte->may_execute = true;
		r = 0;
+
+		if ((kvmppc_get_msr(vcpu) & (MSR_IR | MSR_DR)) == MSR_DR &&
+		    !data) {
+			ulong pc = kvmppc_get_pc(vcpu);
+			if ((vcpu->arch.hflags & BOOK3S_HFLAG_SPLIT_HACK) &&
+			    ((eaddr & SPLIT_HACK_MASK) == SPLIT_HACK_OFFS))
+				pte->raddr &= ~SPLIT_HACK_MASK;
+		}
}
 
return r;
diff --git a/arch/powerpc/kvm/book3s_pr.c