Re: [PATCH 4/4 v3] KVM: PPC: Bookehv: Get vcpu's last instruction for emulation

Scott Wood Fri, 13 Jun 2014 14:57:28 -0700

On Thu, 2014-06-12 at 18:04 +0200, Alexander Graf wrote:
> On 06/02/2014 05:50 PM, Mihai Caraman wrote:
> > On book3e, KVM uses load external pid (lwepx) dedicated instruction to read
> > guest last instruction on the exit path. lwepx exceptions (DTLB_MISS, DSI
> > and LRAT), generated by loading a guest address, needs to be handled by KVM.
> > These exceptions are generated in a substituted guest translation context
> > (EPLC[EGS] = 1) from host context (MSR[GS] = 0).
> >
> > Currently, KVM hooks only interrupts generated from guest context (MSR[GS] 
> > = 1),
> > doing minimal checks on the fast path to avoid host performance degradation.
> > lwepx exceptions originate from host state (MSR[GS] = 0) which implies
> > additional checks in DO_KVM macro (beside the current MSR[GS] = 1) by 
> > looking
> > at the Exception Syndrome Register (ESR[EPID]) and the External PID Load 
> > Context
> > Register (EPLC[EGS]). Doing this on each Data TLB miss exception is obvious
> > too intrusive for the host.
> >
> > Read guest last instruction from kvmppc_load_last_inst() by searching for 
> > the
> > physical address and kmap it. This address the TODO for TLB eviction and
> > execute-but-not-read entries, and allow us to get rid of lwepx until we are
> > able to handle failures.
> >
> > A simple stress benchmark shows a 1% sys performance degradation compared 
> > with
> > previous approach (lwepx without failure handling):
> >
> > time for i in `seq 1 10000`; do /bin/echo > /dev/null; done
> >
> > real        0m 8.85s
> > user        0m 4.34s
> > sys 0m 4.48s
> >
> > vs
> >
> > real        0m 8.84s
> > user        0m 4.36s
> > sys 0m 4.44s
> >
> > An alternative solution, to handle lwepx exceptions in KVM, is to temporary
> > highjack the interrupt vector from host. Some cores share host IVOR 
> > registers
> > between hardware threads, which is the case of FSL e6500, which impose 
> > additional
> > synchronization logic for this solution to work. This optimized solution can
> > be developed later on top of this patch.
> >
> > Signed-off-by: Mihai Caraman <mihai.cara...@freescale.com>
> > ---
> > v3:
> >   - reworked patch description
> >   - use unaltered kmap addr for kunmap
> >   - get last instruction before beeing preempted
> >
> > v2:
> >   - reworked patch description
> >   - used pr_* functions
> >   - addressed cosmetic feedback
> >
> >   arch/powerpc/kvm/booke.c              | 32 ++++++++++++
> >   arch/powerpc/kvm/bookehv_interrupts.S | 37 ++++----------
> >   arch/powerpc/kvm/e500_mmu_host.c      | 93 
> > +++++++++++++++++++++++++++++++++++
> >   3 files changed, 134 insertions(+), 28 deletions(-)
> >
> > diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
> > index 34a42b9..4ef52a8 100644
> > --- a/arch/powerpc/kvm/booke.c
> > +++ b/arch/powerpc/kvm/booke.c
> > @@ -880,6 +880,8 @@ int kvmppc_handle_exit(struct kvm_run *run, struct 
> > kvm_vcpu *vcpu,
> >     int r = RESUME_HOST;
> >     int s;
> >     int idx;
> > +   u32 last_inst = KVM_INST_FETCH_FAILED;
> > +   enum emulation_result emulated = EMULATE_DONE;
> >   
> >     /* update before a new last_exit_type is rewritten */
> >     kvmppc_update_timing_stats(vcpu);
> > @@ -887,6 +889,15 @@ int kvmppc_handle_exit(struct kvm_run *run, struct 
> > kvm_vcpu *vcpu,
> >     /* restart interrupts if they were meant for the host */
> >     kvmppc_restart_interrupt(vcpu, exit_nr);
> >   
> > +   /*
> > +    * get last instruction before beeing preempted
> > +    * TODO: for e6500 check also BOOKE_INTERRUPT_LRAT_ERROR & ESR_DATA
> > +    */
> > +   if (exit_nr == BOOKE_INTERRUPT_DATA_STORAGE ||
> > +       exit_nr == BOOKE_INTERRUPT_DTLB_MISS ||
> > +       exit_nr == BOOKE_INTERRUPT_HV_PRIV)
> 
> Please make this a switch() - that's easier to read.
> 
> > +           emulated = kvmppc_get_last_inst(vcpu, false, &last_inst);
> > +
> >     local_irq_enable();
> >   
> >     trace_kvm_exit(exit_nr, vcpu);
> > @@ -895,6 +906,26 @@ int kvmppc_handle_exit(struct kvm_run *run, struct 
> > kvm_vcpu *vcpu,
> >     run->exit_reason = KVM_EXIT_UNKNOWN;
> >     run->ready_for_interrupt_injection = 1;
> >   
> > +   switch (emulated) {
> > +   case EMULATE_AGAIN:
> > +           r = RESUME_GUEST;
> > +           goto out;
> > +
> > +   case EMULATE_FAIL:
> > +           pr_debug("%s: emulation at %lx failed (%08x)\n",
> > +                  __func__, vcpu->arch.pc, last_inst);
> > +           /* For debugging, encode the failing instruction and
> > +            * report it to userspace. */
> > +           run->hw.hardware_exit_reason = ~0ULL << 32;
> > +           run->hw.hardware_exit_reason |= last_inst;
> > +           kvmppc_core_queue_program(vcpu, ESR_PIL);
> > +           r = RESUME_HOST;
> > +           goto out;
> > +
> > +   default:
> > +           break;
> > +   }
> 
> I think you can just put this into a function.
> 
> Scott, I think the patch overall looks quite good. Can you please check 
> as well and if you agree give it your reviewed-by? Mike, when Scott 
> gives you a reviewed-by, please include it for the next version.
> 
> 
> Alex
> 
> > +
> >     switch (exit_nr) {
> >     case BOOKE_INTERRUPT_MACHINE_CHECK:
> >             printk("MACHINE CHECK: %lx\n", mfspr(SPRN_MCSR));
> > @@ -1184,6 +1215,7 @@ int kvmppc_handle_exit(struct kvm_run *run, struct 
> > kvm_vcpu *vcpu,
> >             BUG();
> >     }
> >   
> > +out:
> >     /*
> >      * To avoid clobbering exit_reason, only check for signals if we
> >      * aren't already exiting to userspace for some other reason.
> > diff --git a/arch/powerpc/kvm/bookehv_interrupts.S 
> > b/arch/powerpc/kvm/bookehv_interrupts.S
> > index 6ff4480..e000b39 100644
> > --- a/arch/powerpc/kvm/bookehv_interrupts.S
> > +++ b/arch/powerpc/kvm/bookehv_interrupts.S
> > @@ -121,38 +121,14 @@
> >   1:
> >   
> >     .if     \flags & NEED_EMU
> > -   /*
> > -    * This assumes you have external PID support.
> > -    * To support a bookehv CPU without external PID, you'll
> > -    * need to look up the TLB entry and create a temporary mapping.
> > -    *
> > -    * FIXME: we don't currently handle if the lwepx faults.  PR-mode
> > -    * booke doesn't handle it either.  Since Linux doesn't use
> > -    * broadcast tlbivax anymore, the only way this should happen is
> > -    * if the guest maps its memory execute-but-not-read, or if we
> > -    * somehow take a TLB miss in the middle of this entry code and
> > -    * evict the relevant entry.  On e500mc, all kernel lowmem is
> > -    * bolted into TLB1 large page mappings, and we don't use
> > -    * broadcast invalidates, so we should not take a TLB miss here.
> > -    *
> > -    * Later we'll need to deal with faults here.  Disallowing guest
> > -    * mappings that are execute-but-not-read could be an option on
> > -    * e500mc, but not on chips with an LRAT if it is used.
> > -    */
> > -
> > -   mfspr   r3, SPRN_EPLC   /* will already have correct ELPID and EGS */
> >     PPC_STL r15, VCPU_GPR(R15)(r4)
> >     PPC_STL r16, VCPU_GPR(R16)(r4)
> >     PPC_STL r17, VCPU_GPR(R17)(r4)
> >     PPC_STL r18, VCPU_GPR(R18)(r4)
> >     PPC_STL r19, VCPU_GPR(R19)(r4)
> > -   mr      r8, r3
> >     PPC_STL r20, VCPU_GPR(R20)(r4)
> > -   rlwimi  r8, r6, EPC_EAS_SHIFT - MSR_IR_LG, EPC_EAS
> >     PPC_STL r21, VCPU_GPR(R21)(r4)
> > -   rlwimi  r8, r6, EPC_EPR_SHIFT - MSR_PR_LG, EPC_EPR
> >     PPC_STL r22, VCPU_GPR(R22)(r4)
> > -   rlwimi  r8, r10, EPC_EPID_SHIFT, EPC_EPID
> >     PPC_STL r23, VCPU_GPR(R23)(r4)
> >     PPC_STL r24, VCPU_GPR(R24)(r4)
> >     PPC_STL r25, VCPU_GPR(R25)(r4)
> > @@ -162,10 +138,15 @@
> >     PPC_STL r29, VCPU_GPR(R29)(r4)
> >     PPC_STL r30, VCPU_GPR(R30)(r4)
> >     PPC_STL r31, VCPU_GPR(R31)(r4)
> > -   mtspr   SPRN_EPLC, r8
> > -   isync
> > -   lwepx   r9, 0, r5
> > -   mtspr   SPRN_EPLC, r3
> > +
> > +   /*
> > +    * We don't use external PID support. lwepx faults would need to be
> > +    * handled by KVM and this implies aditional code in DO_KVM (for
> > +    * DTB_MISS, DSI and LRAT) to check ESR[EPID] and EPLC[EGS] which
> > +    * is too intrusive for the host. Get last instuction in
> > +    * kvmppc_get_last_inst().
> > +    */
> > +   li      r9, KVM_INST_FETCH_FAILED
> >     stw     r9, VCPU_LAST_INST(r4)
> >     .endif
> >   
> > diff --git a/arch/powerpc/kvm/e500_mmu_host.c 
> > b/arch/powerpc/kvm/e500_mmu_host.c
> > index f692c12..0528fe5 100644
> > --- a/arch/powerpc/kvm/e500_mmu_host.c
> > +++ b/arch/powerpc/kvm/e500_mmu_host.c
> > @@ -606,10 +606,103 @@ void kvmppc_mmu_map(struct kvm_vcpu *vcpu, u64 
> > eaddr, gpa_t gpaddr,
> >     }
> >   }
> >   
> > +#ifdef CONFIG_KVM_BOOKE_HV
> >   int kvmppc_load_last_inst(struct kvm_vcpu *vcpu, bool prev, u32 *instr)
> >   {
> > +   gva_t geaddr;
> > +   hpa_t addr;
> > +   hfn_t pfn;
> > +   hva_t eaddr;
> > +   u32 mas0, mas1, mas2, mas3;
> > +   u64 mas7_mas3;
> > +   struct page *page;
> > +   unsigned int addr_space, psize_shift;
> > +   bool pr;
> > +   unsigned long flags;
> > +
> > +   WARN_ON_ONCE(prev);


What are the semantics of "prev"?

> > +   /* Search TLB for guest pc to get the real address */
> > +   geaddr = kvmppc_get_pc(vcpu);
> > +
> > +   addr_space = (vcpu->arch.shared->msr & MSR_IS) >> MSR_IR_LG;
> > +
> > +   local_irq_save(flags);
> > +   mtspr(SPRN_MAS6, (vcpu->arch.pid << MAS6_SPID_SHIFT) | addr_space);
> > +   mtspr(SPRN_MAS5, MAS5_SGS | vcpu->kvm->arch.lpid);
> > +   asm volatile("tlbsx 0, %[geaddr]\n" : :
> > +                [geaddr] "r" (geaddr));
> > +   mtspr(SPRN_MAS5, 0);
> > +   mtspr(SPRN_MAS8, 0);
> > +   mas0 = mfspr(SPRN_MAS0);
> > +   mas1 = mfspr(SPRN_MAS1);
> > +   mas2 = mfspr(SPRN_MAS2);
> > +   mas3 = mfspr(SPRN_MAS3);
> > +   mas7_mas3 = (((u64) mfspr(SPRN_MAS7)) << 32) | mas3;
> > +   local_irq_restore(flags);

Where is mas0 used?

On 64-bit you could read directly from SPRN_MAS7_MAS3.

> > +   /*
> > +    * If the TLB entry for guest pc was evicted, return to the guest.
> > +    * There are high chances to find a valid TLB entry next time.
> > +    */
> > +   if (!(mas1 & MAS1_VALID))
> > +           return EMULATE_AGAIN;
> > +
> > +   /*
> > +    * Another thread may rewrite the TLB entry in parallel, don't
> > +    * execute from the address if the execute permission is not set
> > +    */
> > +   pr = vcpu->arch.shared->msr & MSR_PR;
> > +   if ((pr && !(mas3 & MAS3_UX)) || (!pr && !(mas3 & MAS3_SX))) {
> > +           pr_debug("Instuction emulation from a guest page\n"
> > +                           "withot execute permission\n");

Don't split user-visible strings.

s/withot/without/

Give more context in the message, such as __func__ and the address.

> > +           return EMULATE_FAIL;
> > +   }
> > +
> > +   /*
> > +    * We will map the real address through a cacheable page, so we will
> > +    * not support cache-inhibited guest pages. Fortunately emulated
> > +    * instructions should not live there.
> > +    */
> > +   if (mas2 & MAS2_I) {
> > +           pr_debug("Instuction emulation from cache-inhibited\n"
> > +                           "guest pages is not supported\n");
> > +           return EMULATE_FAIL;
> > +   }

Mismatches in M or W are bad as well, but shouldn't the page_is_ram()
protect us?  Except when LRAT is used, but there we can't prevent
mismatches anyway.

> > +   /* Get page size */
> > +   psize_shift = MAS1_GET_TSIZE(mas1) + 10;
> > +
> > +   /* Map a page and get guest's instruction */
> > +   addr = (mas7_mas3 & (~0ULL << psize_shift)) |
> > +          (geaddr & ((1ULL << psize_shift) - 1ULL));
> > +   pfn = addr >> PAGE_SHIFT;
> > +
> > +   /* Guard us against emulation from devices area */
> > +   if (unlikely(!page_is_ram(pfn))) {
> > +           pr_debug("Instruction emulation from non-RAM host\n"
> > +                           "pages is not supported\n");
> > +           return EMULATE_FAIL;
> > +   }
> > +
> > +   if (unlikely(!pfn_valid(pfn))) {
> > +           pr_debug("Invalid frame number\n");
> > +           return EMULATE_FAIL;
> > +   }

Can we ever have page_is_ram(pfn) && !pfn_valid(pfn)?

-Scott


--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/4 v3] KVM: PPC: Bookehv: Get vcpu's last instruction for emulation

Reply via email to