Re: [RFC PATCH 14/16] KVM: PPC: booke: category E.HV (GS-mode) support

2012-02-15 Thread Alexander Graf

On 15.02.2012, at 20:40, Scott Wood wrote:

> On 02/15/2012 01:36 PM, Alexander Graf wrote:
>> 
>> On 10.01.2012, at 01:51, Scott Wood wrote:
>>> I was thinking we'd check ESR[EPID] or SRR1[IS] as appropriate, and
>>> treat it as a kernel fault (search exception table) -- but this works
>>> too and is a bit cleaner (could be other uses of external pid), at the
>>> expense of a couple extra instructions in the emulation path (but
>>> probably a slightly faster host TLB handler).
>>> 
>>> The check wouldn't go in DO_KVM, though, since on bookehv that only
>>> deals with diverting flow when xSRR1[GS] is set, which wouldn't be the
>>> case here.
>> 
>> Thinking about it a bit more, how is this different from a failed 
>> get_user()? We can just use the same fixup mechanism as there, right?
> 
> The fixup mechanism can be the same (we'd like to know whether it failed
> due to TLB miss or DSI, so we know which to reflect

No, we only want to know "fast path failed". The reason is a different pair of 
shoes and should be evaluated in the slow path. We shouldn't ever fault here 
during normal operation btw. We already executed a guest instruction, so 
there's almost no reason it can't be read.

> -- but if necessary
> I think we can figure that out with a tlbsx).  What's different is that
> the page fault handler needs to know that any external pid (or AS1)
> fault is bad, same as if the address were in the kernel area, and it
> should go directly to searching the exception tables instead of trying
> to page something in.

Yes and no. We need to force it to search the exception tables. We don't care 
if the page fault handlers knows anything about external pids.

Either way, we discussed the further stuff on IRC and came to a working 
solution :). Stay tuned.


Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 14/16] KVM: PPC: booke: category E.HV (GS-mode) support

2012-02-15 Thread Scott Wood
On 02/15/2012 01:36 PM, Alexander Graf wrote:
> 
> On 10.01.2012, at 01:51, Scott Wood wrote:
>> I was thinking we'd check ESR[EPID] or SRR1[IS] as appropriate, and
>> treat it as a kernel fault (search exception table) -- but this works
>> too and is a bit cleaner (could be other uses of external pid), at the
>> expense of a couple extra instructions in the emulation path (but
>> probably a slightly faster host TLB handler).
>>
>> The check wouldn't go in DO_KVM, though, since on bookehv that only
>> deals with diverting flow when xSRR1[GS] is set, which wouldn't be the
>> case here.
> 
> Thinking about it a bit more, how is this different from a failed get_user()? 
> We can just use the same fixup mechanism as there, right?

The fixup mechanism can be the same (we'd like to know whether it failed
due to TLB miss or DSI, so we know which to reflect -- but if necessary
I think we can figure that out with a tlbsx).  What's different is that
the page fault handler needs to know that any external pid (or AS1)
fault is bad, same as if the address were in the kernel area, and it
should go directly to searching the exception tables instead of trying
to page something in.

-Scott

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 14/16] KVM: PPC: booke: category E.HV (GS-mode) support

2012-02-15 Thread Alexander Graf

On 10.01.2012, at 01:51, Scott Wood wrote:

> On 01/09/2012 11:46 AM, Alexander Graf wrote:
>> 
>> On 21.12.2011, at 02:34, Scott Wood wrote:
> 

[...]

>>> Current issues include:
>>> - Machine checks from guest state are not routed to the host handler.
>>> - The guest can cause a host oops by executing an emulated instruction
>>>  in a page that lacks read permission.  Existing e500/4xx support has
>>>  the same problem.
>> 
>> We solve that in book3s pr by doing
>> 
>>  LAST_INST = ;
>>  PACA->kvm_mode = ;
>>  lwz(guest pc);
>>  do_more_stuff();
>> 
>> That way when an exception occurs at lwz() the DO_KVM handler checks that 
>> we're in kvm mode "recover" which does basically srr0+=4; rfi;.
> 
> I was thinking we'd check ESR[EPID] or SRR1[IS] as appropriate, and
> treat it as a kernel fault (search exception table) -- but this works
> too and is a bit cleaner (could be other uses of external pid), at the
> expense of a couple extra instructions in the emulation path (but
> probably a slightly faster host TLB handler).
> 
> The check wouldn't go in DO_KVM, though, since on bookehv that only
> deals with diverting flow when xSRR1[GS] is set, which wouldn't be the
> case here.

Thinking about it a bit more, how is this different from a failed get_user()? 
We can just use the same fixup mechanism as there, right?

Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 14/16] KVM: PPC: booke: category E.HV (GS-mode) support

2012-01-12 Thread Scott Wood
On Thu, Jan 12, 2012 at 05:44:26PM +1100, Benjamin Herrenschmidt wrote:
> On Tue, 2012-01-10 at 04:11 +0100, Alexander Graf wrote:
> > This is what book3s does:
> > 
> > case EMULATE_FAIL:
> > printk(KERN_CRIT "%s: emulation at %lx failed
> > (%08x)\n",
> >__func__, kvmppc_get_pc(vcpu),
> > kvmppc_get_last_inst(vcpu));
> > kvmppc_core_queue_program(vcpu, flags);
> > r = RESUME_GUEST;
> > 
> > which also doesn't throttle the printk, but I think injecting a
> > program fault into the guest is the most sensible thing to do if we
> > don't know what the instruction is supposed to do. Best case we get an
> > oops inside the guest telling us what broke :).
> 
> You can also fallback to a slow path that reads the guest TLB,
> translates then reads the instruction. Of course you have to be careful
> as such a manual translate + read + execute needs to be somewhat
> synchronized with a possible TLB invalidation :-)

That's how we should deal with a failure to read the instruction due to
it being execute-only (once we add the ability to fix up a fault on a
booke KVM instruction fetch) -- but the above code is dealing with the
case where we read the instruction successfully, but don't have an
emulation handler for it.

-Scott

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 14/16] KVM: PPC: booke: category E.HV (GS-mode) support

2012-01-11 Thread Alexander Graf


On 12.01.2012, at 07:44, Benjamin Herrenschmidt  
wrote:

> On Tue, 2012-01-10 at 04:11 +0100, Alexander Graf wrote:
>> This is what book3s does:
>> 
>>case EMULATE_FAIL:
>>printk(KERN_CRIT "%s: emulation at %lx failed
>> (%08x)\n",
>>   __func__, kvmppc_get_pc(vcpu),
>> kvmppc_get_last_inst(vcpu));
>>kvmppc_core_queue_program(vcpu, flags);
>>r = RESUME_GUEST;
>> 
>> which also doesn't throttle the printk, but I think injecting a
>> program fault into the guest is the most sensible thing to do if we
>> don't know what the instruction is supposed to do. Best case we get an
>> oops inside the guest telling us what broke :).
> 
> You can also fallback to a slow path that reads the guest TLB,
> translates then reads the instruction. Of course you have to be careful
> as such a manual translate + read + execute needs to be somewhat
> synchronized with a possible TLB invalidation :-)

Well we do want to be fast on the default path though. So yes, what you're 
saying is what book3s does, but as a fallback in case the fast path didn't work.

The problem here however is that we don't know if the fast path failed; we oops.


> 
> (MMIO emulation is broken in this regard too btw)

Huh?

Alex

> 
> Cheers,
> Ben.
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 14/16] KVM: PPC: booke: category E.HV (GS-mode) support

2012-01-11 Thread Benjamin Herrenschmidt
On Tue, 2012-01-10 at 04:11 +0100, Alexander Graf wrote:
> This is what book3s does:
> 
> case EMULATE_FAIL:
> printk(KERN_CRIT "%s: emulation at %lx failed
> (%08x)\n",
>__func__, kvmppc_get_pc(vcpu),
> kvmppc_get_last_inst(vcpu));
> kvmppc_core_queue_program(vcpu, flags);
> r = RESUME_GUEST;
> 
> which also doesn't throttle the printk, but I think injecting a
> program fault into the guest is the most sensible thing to do if we
> don't know what the instruction is supposed to do. Best case we get an
> oops inside the guest telling us what broke :).

You can also fallback to a slow path that reads the guest TLB,
translates then reads the instruction. Of course you have to be careful
as such a manual translate + read + execute needs to be somewhat
synchronized with a possible TLB invalidation :-)

(MMIO emulation is broken in this regard too btw)

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 14/16] KVM: PPC: booke: category E.HV (GS-mode) support

2012-01-10 Thread Alexander Graf

On 10.01.2012, at 23:03, Scott Wood wrote:

> On 01/09/2012 09:11 PM, Alexander Graf wrote:
>> On 10.01.2012, at 01:51, Scott Wood wrote:
>>> On 01/09/2012 11:46 AM, Alexander Graf wrote:
 On 21.12.2011, at 02:34, Scott Wood wrote:
> + /* For debugging, encode the failing instruction and
> +  * report it to userspace. */
> + run->hw.hardware_exit_reason = ~0ULL << 32;
> + run->hw.hardware_exit_reason |= vcpu->arch.last_inst;
 
 
 I'm fairly sure you want to fix this :)
>>> 
>>> Likewise, that's what booke.c already does.  What should it do instead?
>> 
>> This is what book3s does:
>> 
>>case EMULATE_FAIL:
>>printk(KERN_CRIT "%s: emulation at %lx failed 
>> (%08x)\n",
>>   __func__, kvmppc_get_pc(vcpu), 
>> kvmppc_get_last_inst(vcpu));
>>kvmppc_core_queue_program(vcpu, flags);
>>r = RESUME_GUEST;
>> 
>> which also doesn't throttle the printk, but I think injecting a
>> program fault into the guest is the most sensible thing to do if we
>> don't know what the instruction is supposed to do. Best case we get
>> an oops inside the guest telling us what broke :).
> 
> Ah, yes, it should send a program check.
> 
 Ah, so that's what you want to use regs for. So is having a pt_regs
 struct that only contains useful register values in half its fields
 any useful here? Or could we keep control of the registers ourselves,
 enabling us to maybe one day optimize things more.
>>> 
>>> I think it contains enough to be useful for debugging code such as sysrq
>>> and tracers, and as noted in the comment we could copy the rest if we
>>> care enough.  MSR might be worth copying.
>>> 
>>> It will eventually be used for machine checks as well, which I'd like to
>>> hand reasonable register state to, at least for GPRs, LR, and PC.
>>> 
>>> If there's a good enough performance reason, we could just copy
>>> everything over for machine checks and pass NULL to do_IRQ (I think it
>>> can take this -- a dummy regs struct if not), but it seems premature at
>>> the moment unless the switch already causes measured performance loss
>>> (cache utilization?).
>> 
>> I'm definitely not concerned about performance, but complexity and 
>> uniqueness.
>> 
>> With the pt_regs struct, we have a bunch of fields in the vcpu that are 
>> there, but unused. I find that situation pretty confusing.
> 
> I removed the registers from the vcpu, that are to be used in regs instead.
> 
> There are a few fields in regs that are not valid, though it is
> explicitly pointed out via a comment.

Yes, and if there was real technical reason to do it this way I'd agree. But 
there isn't.

> 
>> So yes, I would definitely prefer to copy registers during MC and keep the 
>> registers where they are today - unless there are SPRs for them of course.
>> 
>> Imagine we'd one day want to share GPRs with user space through the
>> kvm_run structure (see the s390 patches on the ML for this). I really
>> wouldn't want to make pt_regs part of our userspace ABI.
> 
> Neither would I.  If that's something that's reasonably likely to
> happen, I guess that's a good enough reason to avoid this.  We could
> always add later a debug option to copy regs even on normal interrupts,
> if needed.

Yup. I don't want to walk in the wrong direction basically. The overhead of 
copying
a couple fields to the stack on machine checks doesn't sound too bad compared
to the flexibility we maintain by keeping fields under our control.

Another imaginary case. I experimented with putting the GPRs into the PACA
back in the day. I don't remember why anymore, but it was for some speedup
of something.

That wouldn't be possible if we mandate everyone to use pt_regs.

> 
>>> We probably should defer the check until after we've disabled
>>> interrupts, similar to signals -- even if we didn't exit for an
>>> interrupt, we could have received one after enabling them.
>> 
>> Yup. I just don't think you can call resched() with interrupts disabled, so 
>> a bit cleverness is probably required here.
> 
> I think it is actually allowed, but interrupts will be enabled on
> return.  We'll need to repeat prepare_to_enter if we do schedule.  Since
> we already need special handling for that, we might as well add a
> local_irq_enable() once we know we are going to schedule, just in case.

Yup :). And then check again.

> 
> diff --git a/arch/powerpc/kvm/booke.h b/arch/powerpc/kvm/booke.h
> index 05d1d99..d53bcf2 100644
> --- a/arch/powerpc/kvm/booke.h
> +++ b/arch/powerpc/kvm/booke.h
> @@ -48,7 +48,20 @@
> #define BOOKE_IRQPRIO_PERFORMANCE_MONITOR 19
> /* Internal pseudo-irqprio for level triggered externals */
> #define BOOKE_IRQPRIO_EXTERNAL_LEVEL 20
> -#define BOOKE_IRQPRIO_MAX 20
> +#define BOOKE_IRQPRIO_DBELL 21
> +#define BOOKE_IRQPRIO_DBELL_CRIT 22
> +#define BOOKE_IRQ

Re: [RFC PATCH 14/16] KVM: PPC: booke: category E.HV (GS-mode) support

2012-01-10 Thread Scott Wood
On 01/09/2012 09:11 PM, Alexander Graf wrote:
> On 10.01.2012, at 01:51, Scott Wood wrote:
>> On 01/09/2012 11:46 AM, Alexander Graf wrote:
>>> On 21.12.2011, at 02:34, Scott Wood wrote:
 +  /* For debugging, encode the failing instruction and
 +   * report it to userspace. */
 +  run->hw.hardware_exit_reason = ~0ULL << 32;
 +  run->hw.hardware_exit_reason |= vcpu->arch.last_inst;
>>>
>>>
>>> I'm fairly sure you want to fix this :)
>>
>> Likewise, that's what booke.c already does.  What should it do instead?
> 
> This is what book3s does:
> 
> case EMULATE_FAIL:
> printk(KERN_CRIT "%s: emulation at %lx failed 
> (%08x)\n",
>__func__, kvmppc_get_pc(vcpu), 
> kvmppc_get_last_inst(vcpu));
> kvmppc_core_queue_program(vcpu, flags);
> r = RESUME_GUEST;
> 
> which also doesn't throttle the printk, but I think injecting a
> program fault into the guest is the most sensible thing to do if we
> don't know what the instruction is supposed to do. Best case we get
> an oops inside the guest telling us what broke :).

Ah, yes, it should send a program check.

>>> Ah, so that's what you want to use regs for. So is having a pt_regs
>>> struct that only contains useful register values in half its fields
>>> any useful here? Or could we keep control of the registers ourselves,
>>> enabling us to maybe one day optimize things more.
>>
>> I think it contains enough to be useful for debugging code such as sysrq
>> and tracers, and as noted in the comment we could copy the rest if we
>> care enough.  MSR might be worth copying.
>>
>> It will eventually be used for machine checks as well, which I'd like to
>> hand reasonable register state to, at least for GPRs, LR, and PC.
>>
>> If there's a good enough performance reason, we could just copy
>> everything over for machine checks and pass NULL to do_IRQ (I think it
>> can take this -- a dummy regs struct if not), but it seems premature at
>> the moment unless the switch already causes measured performance loss
>> (cache utilization?).
> 
> I'm definitely not concerned about performance, but complexity and uniqueness.
>
> With the pt_regs struct, we have a bunch of fields in the vcpu that are 
> there, but unused. I find that situation pretty confusing.

I removed the registers from the vcpu, that are to be used in regs instead.

There are a few fields in regs that are not valid, though it is
explicitly pointed out via a comment.

> So yes, I would definitely prefer to copy registers during MC and keep the 
> registers where they are today - unless there are SPRs for them of course.
>
> Imagine we'd one day want to share GPRs with user space through the
> kvm_run structure (see the s390 patches on the ML for this). I really
> wouldn't want to make pt_regs part of our userspace ABI.

Neither would I.  If that's something that's reasonably likely to
happen, I guess that's a good enough reason to avoid this.  We could
always add later a debug option to copy regs even on normal interrupts,
if needed.

>> We probably should defer the check until after we've disabled
>> interrupts, similar to signals -- even if we didn't exit for an
>> interrupt, we could have received one after enabling them.
> 
> Yup. I just don't think you can call resched() with interrupts disabled, so a 
> bit cleverness is probably required here.

I think it is actually allowed, but interrupts will be enabled on
return.  We'll need to repeat prepare_to_enter if we do schedule.  Since
we already need special handling for that, we might as well add a
local_irq_enable() once we know we are going to schedule, just in case.

 diff --git a/arch/powerpc/kvm/booke.h b/arch/powerpc/kvm/booke.h
 index 05d1d99..d53bcf2 100644
 --- a/arch/powerpc/kvm/booke.h
 +++ b/arch/powerpc/kvm/booke.h
 @@ -48,7 +48,20 @@
 #define BOOKE_IRQPRIO_PERFORMANCE_MONITOR 19
 /* Internal pseudo-irqprio for level triggered externals */
 #define BOOKE_IRQPRIO_EXTERNAL_LEVEL 20
 -#define BOOKE_IRQPRIO_MAX 20
 +#define BOOKE_IRQPRIO_DBELL 21
 +#define BOOKE_IRQPRIO_DBELL_CRIT 22
 +#define BOOKE_IRQPRIO_MAX 23
>>>
>>> So was MAX wrong before or is it too big now?
>>
>> MAX is just a marker for how many IRQPRIOs we have, not any sort of
>> external limit.  This patch adds new IRQPRIOs, so MAX goes up.
>>
>> The actual limit is the number of bits in a long.
> 

> Yes, and before the highest value was 20 with MAX being 20, now the
> highest value is 22 with MAX being 23. Either MAX == highest number
> or MAX == highest number + 1, but you're changing the semantics of
> MAX here. Maybe it was wrong before, I don't know, hence I'm asking
> :).

Oh, didn't notice that.

Actually, it looks like the two places that reference BOOKE_IRQPRIO_MAX
don't agree on what they're expecting.  book3s uses "one greater than
the highest irqprio", so I guess we

Re: [RFC PATCH 14/16] KVM: PPC: booke: category E.HV (GS-mode) support

2012-01-09 Thread Alexander Graf

On 10.01.2012, at 01:51, Scott Wood wrote:

> On 01/09/2012 11:46 AM, Alexander Graf wrote:
>> 
>> On 21.12.2011, at 02:34, Scott Wood wrote:
>> 
>>> Chips such as e500mc that implement category E.HV in Power ISA 2.06
>>> provide hardware virtualization features, including a new MSR mode for
>>> guest state.  The guest OS can perform many operations without trapping
>>> into the hypervisor, including transitions to and from guest userspace.
>>> 
>>> Since we can use SRR1[GS] to reliably tell whether an exception came from
>>> guest state, instead of messing around with IVPR, we use DO_KVM similarly
>>> to book3s.
>> 
>> Is there any benefit of using DO_KVM? I would assume that messing with IVPR 
>> is faster.
> 
> Using the GS bit to decide which handler to run means we won't get
> confused if a machine check or critical interrupt happens between
> entering/exiting the guest and updating IVPR (we could use the IS bit
> similarly in PR-mode).
> 
> This could be supplemented with IVPR (though that will add a few cycles
> to guest entry/exit) or some sort of runtime patching (would be more
> coarse-grained, active when any KVM guest exists) to avoid adding
> overhead to traps when KVM is not used, but I'd like to quantify that
> overhead first.  It should be much lower than what happens on book3s.

Hrm. Yeah, given that your DO_KVM handler is so much simpler, it might make 
sense to stick with that method. Benchmarks would be useful in the long run 
though.

> 
>>> Current issues include:
>>> - Machine checks from guest state are not routed to the host handler.
>>> - The guest can cause a host oops by executing an emulated instruction
>>>  in a page that lacks read permission.  Existing e500/4xx support has
>>>  the same problem.
>> 
>> We solve that in book3s pr by doing
>> 
>>  LAST_INST = ;
>>  PACA->kvm_mode = ;
>>  lwz(guest pc);
>>  do_more_stuff();
>> 
>> That way when an exception occurs at lwz() the DO_KVM handler checks that 
>> we're in kvm mode "recover" which does basically srr0+=4; rfi;.
> 
> I was thinking we'd check ESR[EPID] or SRR1[IS] as appropriate, and
> treat it as a kernel fault (search exception table) -- but this works
> too and is a bit cleaner (could be other uses of external pid), at the
> expense of a couple extra instructions in the emulation path (but
> probably a slightly faster host TLB handler).
> 
> The check wouldn't go in DO_KVM, though, since on bookehv that only
> deals with diverting flow when xSRR1[GS] is set, which wouldn't be the
> case here.

Yup, not sure where you'd put the check, as it'd slow down normal operation 
too. Hrm.

> 
>>> @@ -243,16 +324,20 @@ static int kvmppc_booke_irqprio_deliver(struct 
>>> kvm_vcpu *vcpu,
>>> case BOOKE_IRQPRIO_AP_UNAVAIL:
>>> case BOOKE_IRQPRIO_ALIGNMENT:
>>> allowed = 1;
>>> -   msr_mask = MSR_CE|MSR_ME|MSR_DE;
>>> +   msr_mask = MSR_GS | MSR_CE | MSR_ME | MSR_DE;
>> 
>> No need to do this. You already force MSR_GS in set_msr();
> 
> OK.  This was here since before set_msr() started doing that. :-)
> 
>>> +   if (!current->thread.kvm_vcpu) {
>>> +   WARN(1, "no vcpu\n");
>>> +   return -EPERM;
>>> +   }
>> 
>> Huh?
> 
> Oops, leftover debugging.
> 
>>> +static int emulation_exit(struct kvm_run *run, struct kvm_vcpu *vcpu)
>>> +{
>>> +   enum emulation_result er;
>>> +
>>> +   er = kvmppc_emulate_instruction(run, vcpu);
>>> +   switch (er) {
>>> +   case EMULATE_DONE:
>>> +   /* don't overwrite subtypes, just account kvm_stats */
>>> +   kvmppc_account_exit_stat(vcpu, EMULATED_INST_EXITS);
>>> +   /* Future optimization: only reload non-volatiles if
>>> +* they were actually modified by emulation. */
>>> +   return RESUME_GUEST_NV;
>>> +
>>> +   case EMULATE_DO_DCR:
>>> +   run->exit_reason = KVM_EXIT_DCR;
>>> +   return RESUME_HOST;
>>> +
>>> +   case EMULATE_FAIL:
>>> +   /* XXX Deliver Program interrupt to guest. */
>>> +   printk(KERN_CRIT "%s: emulation at %lx failed (%08x)\n",
>>> +  __func__, vcpu->arch.regs.nip, vcpu->arch.last_inst);
>> 
>> This should be throttled, otherwise the guest can spam our logs.
> 
> Yes it should, but I'm just moving the code here.

Yeah, only realized this later. Maybe next time (not for this patch set, next 
time you're sending something) just extract these mechanical parts, so it's 
easier to review the pieces where code actually changes :).

> 
>>> +   /* For debugging, encode the failing instruction and
>>> +* report it to userspace. */
>>> +   run->hw.hardware_exit_reason = ~0ULL << 32;
>>> +   run->hw.hardware_exit_reason |= vcpu->arch.last_inst;
>> 
>> 
>> I'm fairly sure you want to fix this :)
> 
> Likewise, that's what booke.c already does.  What should it do instead?

This is what book3s does:

case EMULATE_FAIL:
printk(KERN_CRIT "%s: emulation at %lx failed (%08x)\

Re: [RFC PATCH 14/16] KVM: PPC: booke: category E.HV (GS-mode) support

2012-01-09 Thread Scott Wood
On 01/09/2012 11:46 AM, Alexander Graf wrote:
> 
> On 21.12.2011, at 02:34, Scott Wood wrote:
> 
>> Chips such as e500mc that implement category E.HV in Power ISA 2.06
>> provide hardware virtualization features, including a new MSR mode for
>> guest state.  The guest OS can perform many operations without trapping
>> into the hypervisor, including transitions to and from guest userspace.
>>
>> Since we can use SRR1[GS] to reliably tell whether an exception came from
>> guest state, instead of messing around with IVPR, we use DO_KVM similarly
>> to book3s.
> 
> Is there any benefit of using DO_KVM? I would assume that messing with IVPR 
> is faster.

Using the GS bit to decide which handler to run means we won't get
confused if a machine check or critical interrupt happens between
entering/exiting the guest and updating IVPR (we could use the IS bit
similarly in PR-mode).

This could be supplemented with IVPR (though that will add a few cycles
to guest entry/exit) or some sort of runtime patching (would be more
coarse-grained, active when any KVM guest exists) to avoid adding
overhead to traps when KVM is not used, but I'd like to quantify that
overhead first.  It should be much lower than what happens on book3s.

>> Current issues include:
>> - Machine checks from guest state are not routed to the host handler.
>> - The guest can cause a host oops by executing an emulated instruction
>>   in a page that lacks read permission.  Existing e500/4xx support has
>>   the same problem.
> 
> We solve that in book3s pr by doing
> 
>   LAST_INST = ;
>   PACA->kvm_mode = ;
>   lwz(guest pc);
>   do_more_stuff();
> 
> That way when an exception occurs at lwz() the DO_KVM handler checks that 
> we're in kvm mode "recover" which does basically srr0+=4; rfi;.

I was thinking we'd check ESR[EPID] or SRR1[IS] as appropriate, and
treat it as a kernel fault (search exception table) -- but this works
too and is a bit cleaner (could be other uses of external pid), at the
expense of a couple extra instructions in the emulation path (but
probably a slightly faster host TLB handler).

The check wouldn't go in DO_KVM, though, since on bookehv that only
deals with diverting flow when xSRR1[GS] is set, which wouldn't be the
case here.

>> @@ -243,16 +324,20 @@ static int kvmppc_booke_irqprio_deliver(struct 
>> kvm_vcpu *vcpu,
>>  case BOOKE_IRQPRIO_AP_UNAVAIL:
>>  case BOOKE_IRQPRIO_ALIGNMENT:
>>  allowed = 1;
>> -msr_mask = MSR_CE|MSR_ME|MSR_DE;
>> +msr_mask = MSR_GS | MSR_CE | MSR_ME | MSR_DE;
> 
> No need to do this. You already force MSR_GS in set_msr();

OK.  This was here since before set_msr() started doing that. :-)

>> +if (!current->thread.kvm_vcpu) {
>> +WARN(1, "no vcpu\n");
>> +return -EPERM;
>> +}
> 
> Huh?

Oops, leftover debugging.

>> +static int emulation_exit(struct kvm_run *run, struct kvm_vcpu *vcpu)
>> +{
>> +enum emulation_result er;
>> +
>> +er = kvmppc_emulate_instruction(run, vcpu);
>> +switch (er) {
>> +case EMULATE_DONE:
>> +/* don't overwrite subtypes, just account kvm_stats */
>> +kvmppc_account_exit_stat(vcpu, EMULATED_INST_EXITS);
>> +/* Future optimization: only reload non-volatiles if
>> + * they were actually modified by emulation. */
>> +return RESUME_GUEST_NV;
>> +
>> +case EMULATE_DO_DCR:
>> +run->exit_reason = KVM_EXIT_DCR;
>> +return RESUME_HOST;
>> +
>> +case EMULATE_FAIL:
>> +/* XXX Deliver Program interrupt to guest. */
>> +printk(KERN_CRIT "%s: emulation at %lx failed (%08x)\n",
>> +   __func__, vcpu->arch.regs.nip, vcpu->arch.last_inst);
> 
> This should be throttled, otherwise the guest can spam our logs.

Yes it should, but I'm just moving the code here.

>> +/* For debugging, encode the failing instruction and
>> + * report it to userspace. */
>> +run->hw.hardware_exit_reason = ~0ULL << 32;
>> +run->hw.hardware_exit_reason |= vcpu->arch.last_inst;
> 
> 
> I'm fairly sure you want to fix this :)

Likewise, that's what booke.c already does.  What should it do instead?

> /**
>>  * kvmppc_handle_exit
>>  *
>> @@ -374,12 +530,39 @@ out:
>> int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu,
>>unsigned int exit_nr)
>> {
>> -enum emulation_result er;
>>  int r = RESUME_HOST;
>>
>>  /* update before a new last_exit_type is rewritten */
>>  kvmppc_update_timing_stats(vcpu);
>>
>> +/*
>> + * If we actually care, we could copy MSR, DEAR, and ESR to regs,
>> + * insert an appropriate trap number, etc.
>> + *
>> + * Seems like a waste of cycles for something that should only matter
>> + * to someone using sysrq-t/p or similar host kernel debug facility.
>> + * We have other debug facilities to get that information from a
>> + * guest through us

[RFC PATCH 14/16] KVM: PPC: booke: category E.HV (GS-mode) support

2011-12-20 Thread Scott Wood
Chips such as e500mc that implement category E.HV in Power ISA 2.06
provide hardware virtualization features, including a new MSR mode for
guest state.  The guest OS can perform many operations without trapping
into the hypervisor, including transitions to and from guest userspace.

Since we can use SRR1[GS] to reliably tell whether an exception came from
guest state, instead of messing around with IVPR, we use DO_KVM similarly
to book3s.

Current issues include:
 - Machine checks from guest state are not routed to the host handler.
 - The guest can cause a host oops by executing an emulated instruction
   in a page that lacks read permission.  Existing e500/4xx support has
   the same problem.

Includes work by Ashish Kalra ,
Varun Sethi , and
Liu Yu .

Signed-off-by: Scott Wood 
---
 arch/powerpc/include/asm/dbell.h|1 +
 arch/powerpc/include/asm/kvm_asm.h  |8 +
 arch/powerpc/include/asm/kvm_booke_hv_asm.h |   49 +++
 arch/powerpc/include/asm/kvm_host.h |   19 +-
 arch/powerpc/include/asm/kvm_ppc.h  |3 +
 arch/powerpc/include/asm/mmu-book3e.h   |6 +
 arch/powerpc/include/asm/processor.h|3 +
 arch/powerpc/include/asm/reg.h  |2 +
 arch/powerpc/include/asm/reg_booke.h|   34 ++
 arch/powerpc/kernel/asm-offsets.c   |   15 +-
 arch/powerpc/kernel/head_booke.h|   28 ++-
 arch/powerpc/kvm/Kconfig|3 +
 arch/powerpc/kvm/booke.c|  398 ++-
 arch/powerpc/kvm/booke.h|   24 +-
 arch/powerpc/kvm/booke_emulate.c|   23 +-
 arch/powerpc/kvm/bookehv_interrupts.S   |  587 +++
 arch/powerpc/kvm/powerpc.c  |5 +
 arch/powerpc/kvm/timing.h   |6 +
 18 files changed, 1107 insertions(+), 107 deletions(-)
 create mode 100644 arch/powerpc/include/asm/kvm_booke_hv_asm.h
 create mode 100644 arch/powerpc/kvm/bookehv_interrupts.S

diff --git a/arch/powerpc/include/asm/dbell.h b/arch/powerpc/include/asm/dbell.h
index efa74ac..d7365b0 100644
--- a/arch/powerpc/include/asm/dbell.h
+++ b/arch/powerpc/include/asm/dbell.h
@@ -19,6 +19,7 @@
 
 #define PPC_DBELL_MSG_BRDCAST  (0x0400)
 #define PPC_DBELL_TYPE(x)  (((x) & 0xf) << (63-36))
+#define PPC_DBELL_LPID(x)  ((x) << (63 - 49))
 enum ppc_dbell {
PPC_DBELL = 0,  /* doorbell */
PPC_DBELL_CRIT = 1, /* critical doorbell */
diff --git a/arch/powerpc/include/asm/kvm_asm.h 
b/arch/powerpc/include/asm/kvm_asm.h
index 7b1f0e0..0978152 100644
--- a/arch/powerpc/include/asm/kvm_asm.h
+++ b/arch/powerpc/include/asm/kvm_asm.h
@@ -48,6 +48,14 @@
 #define BOOKE_INTERRUPT_SPE_FP_DATA 33
 #define BOOKE_INTERRUPT_SPE_FP_ROUND 34
 #define BOOKE_INTERRUPT_PERFORMANCE_MONITOR 35
+#define BOOKE_INTERRUPT_DOORBELL 36
+#define BOOKE_INTERRUPT_DOORBELL_CRITICAL 37
+
+/* booke_hv */
+#define BOOKE_INTERRUPT_GUEST_DBELL 38
+#define BOOKE_INTERRUPT_GUEST_DBELL_CRIT 39
+#define BOOKE_INTERRUPT_HV_SYSCALL 40
+#define BOOKE_INTERRUPT_HV_PRIV 41
 
 /* book3s */
 
diff --git a/arch/powerpc/include/asm/kvm_booke_hv_asm.h 
b/arch/powerpc/include/asm/kvm_booke_hv_asm.h
new file mode 100644
index 000..30a600f
--- /dev/null
+++ b/arch/powerpc/include/asm/kvm_booke_hv_asm.h
@@ -0,0 +1,49 @@
+/*
+ * Copyright 2010-2011 Freescale Semiconductor, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License, version 2, as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef ASM_KVM_BOOKE_HV_ASM_H
+#define ASM_KVM_BOOKE_HV_ASM_H
+
+#ifdef __ASSEMBLY__
+
+/*
+ * All exceptions from guest state must go through KVM
+ * (except for those which are delivered directly to the guest) --
+ * there are no exceptions for which we fall through directly to
+ * the normal host handler.
+ *
+ * Expected inputs (normal exceptions):
+ *   SCRATCH0 = saved r10
+ *   r10 = thread struct
+ *   r11 = appropriate SRR1 variant (currently used as scratch)
+ *   r13 = saved CR
+ *   *(r10 + THREAD_NORMSAVE(0)) = saved r11
+ *   *(r10 + THREAD_NORMSAVE(2)) = saved r13
+ *
+ * Expected inputs (crit/mcheck/debug exceptions):
+ *   appropriate SCRATCH = saved r8
+ *   r8 = exception level stack frame
+ *   r9 = *(r8 + _CCR) = saved CR
+ *   r11 = appropriate SRR1 variant (currently used as scratch)
+ *   *(r8 + GPR9) = saved r9
+ *   *(r8 + GPR10) = saved r10 (r10 not yet clobbered)
+ *   *(r8 + GPR11) = saved r11
+ */
+.macro DO_KVM intno srr1
+#ifdef CONFIG_KVM_BOOKE_HV
+BEGIN_FTR_SECTION
+   mtocrf  0x80, r11   /* check MSR[GS] without clobbering reg */
+   bf  3, kvmppc_resume_\intno\()_\srr1
+   b   kvmppc_handler_\intno\()_\srr1
+kvmppc_resume_\intno\()_\srr1:
+END_FTR_SECTION_IFSET(CPU_FTR_EMB_HV)
+#endif
+.endm
+
+#endif /*__ASSEMBLY__ */
+#endif /* ASM_KVM_BOOKE_HV_ASM_H */
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerp