Re: [PATCH 6/8] KVM: PPC: Add support for multiple-TCE hcalls

2013-07-10 Thread Alexey Kardashevskiy
On 07/10/2013 08:05 PM, Alexander Graf wrote:
> 
> On 10.07.2013, at 07:00, Alexey Kardashevskiy wrote:
> 
>> On 07/10/2013 03:02 AM, Alexander Graf wrote:
>>> On 07/06/2013 05:07 PM, Alexey Kardashevskiy wrote:
 This adds real mode handlers for the H_PUT_TCE_INDIRECT and
 H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO
 devices or emulated PCI.  These calls allow adding multiple entries
 (up to 512) into the TCE table in one call which saves time on
 transition to/from real mode.
>>>
>>> We don't mention QEMU explicitly in KVM code usually.
>>>
 This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
 (copied from user and verified) before writing the whole list into
 the TCE table. This cache will be utilized more in the upcoming
 VFIO/IOMMU support to continue TCE list processing in the virtual
 mode in the case if the real mode handler failed for some reason.

 This adds a guest physical to host real address converter
 and calls the existing H_PUT_TCE handler. The converting function
 is going to be fully utilized by upcoming VFIO supporting patches.

 This also implements the KVM_CAP_PPC_MULTITCE capability,
 so in order to support the functionality of this patch, QEMU
 needs to query for this capability and set the "hcall-multi-tce"
 hypertas property only if the capability is present, otherwise
 there will be serious performance degradation.
>>>
>>> Same as above. But really you're only giving recommendations here. What's
>>> the point? Please describe what the benefit of this patch is, not what some
>>> other random subsystem might do with the benefits it brings.
>>>

 Signed-off-by: Paul Mackerras
 Signed-off-by: Alexey Kardashevskiy

 ---
 Changelog:
 2013/07/06:
 * fixed number of wrong get_page()/put_page() calls

 2013/06/27:
 * fixed clear of BUSY bit in kvmppc_lookup_pte()
 * H_PUT_TCE_INDIRECT does realmode_get_page() now
 * KVM_CAP_SPAPR_MULTITCE now depends on CONFIG_PPC_BOOK3S_64
 * updated doc

 2013/06/05:
 * fixed mistype about IBMVIO in the commit message
 * updated doc and moved it to another section
 * changed capability number

 2013/05/21:
 * added kvm_vcpu_arch::tce_tmp
 * removed cleanup if put_indirect failed, instead we do not even start
 writing to TCE table if we cannot get TCEs from the user and they are
 invalid
 * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
 and kvmppc_emulated_validate_tce (for the previous item)
 * fixed bug with failthrough for H_IPI
 * removed all get_user() from real mode handlers
 * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)

 Signed-off-by: Alexey Kardashevskiy
 ---
  Documentation/virtual/kvm/api.txt   |  25 +++
  arch/powerpc/include/asm/kvm_host.h |   9 ++
  arch/powerpc/include/asm/kvm_ppc.h  |  16 +-
  arch/powerpc/kvm/book3s_64_vio.c| 154 ++-
 arch/powerpc/kvm/book3s_64_vio_hv.c | 260 
  arch/powerpc/kvm/book3s_hv.c|  41 -
  arch/powerpc/kvm/book3s_hv_rmhandlers.S |   6 +
  arch/powerpc/kvm/book3s_pr_papr.c   |  37 -
  arch/powerpc/kvm/powerpc.c  |   3 +
  9 files changed, 517 insertions(+), 34 deletions(-)

 diff --git a/Documentation/virtual/kvm/api.txt
 b/Documentation/virtual/kvm/api.txt
 index 6365fef..762c703 100644
 --- a/Documentation/virtual/kvm/api.txt
 +++ b/Documentation/virtual/kvm/api.txt
 @@ -2362,6 +2362,31 @@ calls by the guest for that service will be passed
 to userspace to be
  handled.


 +4.86 KVM_CAP_PPC_MULTITCE
 +
 +Capability: KVM_CAP_PPC_MULTITCE
 +Architectures: ppc
 +Type: vm
 +
 +This capability means the kernel is capable of handling hypercalls
 +H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user
 +space. This significanly accelerates DMA operations for PPC KVM guests.
>>>
>>> significanly? Please run this through a spell checker.
>>>
 +The user space should expect that its handlers for these hypercalls
>>>
>>> s/The//
>>>
 +are not going to be called.
>>>
>>> Is user space guaranteed they will not be called? Or can it still happen?
>>
>> ... if user space previously registered LIOBN in KVM (via
>> KVM_CREATE_SPAPR_TCE or similar calls).
>>
>> ok?
> 
> How about this?
> 
> The hypercalls mentioned above may or may not be processed successfully in 
> the kernel based fast path. If they can not be handled by the kernel, they 
> will get passed on to user space. So user space still has to have an 
> implementation for these despite the in kernel acceleration.
> 
> ---
> 
> The target audience for this documentation is user space KVM API users. 
> Someone developing kvm tool for examp
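[Editor's note] The tce_tmp design from the commit message at the top of this thread — copy the whole TCE list from the guest and validate every entry before writing anything into the table — can be sketched as a user-space model. Everything here (names, the permission-bit layout, the error value) is illustrative, not the kernel's actual H_PUT_TCE_INDIRECT implementation; it only shows the all-or-nothing validate-then-commit pattern:

```c
#include <stdint.h>

#define TCE_LIST_MAX 512            /* H_PUT_TCE_INDIRECT per-call limit */
#define TCE_VALID    0x3ULL        /* illustrative read/write permission bits */

/* Stand-in for the per-vcpu tce_tmp scratch buffer from the patch. */
static uint64_t tce_tmp[TCE_LIST_MAX];

static int tce_is_valid(uint64_t tce)
{
    /* Illustrative check: nothing set besides page address and perms. */
    return (tce & ~((~0ULL << 12) | TCE_VALID)) == 0;
}

/* Validate the whole list into tce_tmp first; only if every entry is
 * acceptable do we touch the table.  A bad entry therefore leaves the
 * table completely unmodified, mirroring the commit-message design of
 * not even starting to write if the TCEs are invalid. */
static long h_put_tce_indirect(uint64_t *table, uint64_t ioba_idx,
                               const uint64_t *list, unsigned long npages)
{
    unsigned long i;

    if (npages > TCE_LIST_MAX)
        return -1;                  /* H_PARAMETER in the real handler */

    for (i = 0; i < npages; i++) {
        if (!tce_is_valid(list[i]))
            return -1;              /* table untouched so far */
        tce_tmp[i] = list[i];
    }
    for (i = 0; i < npages; i++)
        table[ioba_idx + i] = tce_tmp[i];
    return 0;                       /* H_SUCCESS */
}
```

This is also why a single real-mode failure can fall back to virtual mode cheaply: the already-validated entries sit in tce_tmp and processing can resume from there.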

Re: [v1][PATCH 1/1] KVM: PPC: disable preemption when using hard_irq_disable()

2013-07-10 Thread tiejun.chen

On 07/11/2013 03:15 AM, Scott Wood wrote:

On 07/10/2013 01:02:19 AM, Tiejun Chen wrote:

We should ensure that preemption cannot occur while calling get_paca()
inside hard_irq_disable(), otherwise the paca_struct may be the
wrong one just after. And btw, we may update timing stats in this case.


The soft-ee mechanism depends on accessing the PACA directly via r13 to avoid
this.  We probably should be using inline asm to read was_enabled rather than


Yes.


hoping the compiler doesn't do anything silly.


Do you recommend I should directly replace get_paca() with local_paca inside 
hard_irq_disable()?


#define hard_irq_disable()  do {\
u8 _was_enabled = get_paca()->soft_enabled; \

->   u8 _was_enabled = local_paca->soft_enabled;

But is this safe for all scenarios?

Tiejun
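[Editor's note] The substitution being discussed — reading `soft_enabled` through `local_paca` instead of `get_paca()` — can be modelled in user space. The struct fields and flag below mirror the kernel's paca bookkeeping only loosely; the real macro lives in the powerpc hw_irq headers and does more (actual MSR[EE] manipulation, tracing). The point of the sketch is just that `local_paca` is a plain pointer dereference with no `debug_smp_processor_id()` check attached:

```c
#include <stdint.h>

/* Loose user-space model of the paca bookkeeping; illustrative only. */
struct paca_struct {
    uint8_t soft_enabled;
    uint8_t irq_happened;
};

#define PACA_IRQ_HARD_DIS 0x01

static struct paca_struct paca_cpu0;
static struct paca_struct *local_paca = &paca_cpu0;

/* Sketch of the proposed change: read soft_enabled through local_paca
 * (the r13-based per-CPU pointer in the real kernel) rather than
 * get_paca(), whose debug_smp_processor_id() check fires when called
 * from preemptible context. */
static void model_hard_irq_disable(void)
{
    uint8_t was_enabled = local_paca->soft_enabled;

    local_paca->soft_enabled = 0;
    local_paca->irq_happened |= PACA_IRQ_HARD_DIS;
    if (was_enabled) {
        /* the real kernel calls trace_hardirqs_off() here */
    }
}
```

Whether this is safe in all scenarios is exactly Scott's point about r13: since r13 always points at the current CPU's paca, a direct `local_paca` access never goes through the preemption debug check, though the compiler must not be allowed to cache the value across a preemption point.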

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [v1][PATCH 1/1] KVM: PPC: disable preemption when using hard_irq_disable()

2013-07-10 Thread tiejun.chen

On 07/10/2013 05:49 PM, Alexander Graf wrote:


On 10.07.2013, at 08:02, Tiejun Chen wrote:


We should ensure that preemption cannot occur while calling get_paca()
inside hard_irq_disable(), otherwise the paca_struct may be the
wrong one just after. And btw, we may update timing stats in this case.

Signed-off-by: Tiejun Chen 
---
arch/powerpc/kvm/booke.c |2 ++
1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index dcc94f0..9dae25d 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -839,6 +839,7 @@ int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu 
*vcpu,
WARN_ON(local_paca->irq_happened != 0);
#endif

+   preempt_disable();
/*
 * We enter with interrupts disabled in hardware, but
 * we need to call hard_irq_disable anyway to ensure that
@@ -848,6 +849,7 @@ int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu 
*vcpu,

/* update before a new last_exit_type is rewritten */
kvmppc_update_timing_stats(vcpu);
+   preempt_enable();


All of the code here is already called with interrupts disabled. I don't see 
how we could preempt then?


But the kernel still checks this in the preempt case,

#define get_paca()  ((void) debug_smp_processor_id(), local_paca)

which then triggers that known call trace, as I observed :)

BUG: using smp_processor_id() in preemptible [] code: 
qemu-system-ppc/2065
caller is .kvmppc_handle_exit+0x48/0x810
CPU: 0 PID: 2065 Comm: qemu-system-ppc Not tainted 3.10.0-172036-ge2daa28-dirty 
#116
Call Trace:
[c001fc637570] [c000835c] .show_stack+0x7c/0x1f0 (unreliable)
[c001fc637640] [c0673a0c] .dump_stack+0x28/0x3c
[c001fc6376b0] [c02f04d8] .debug_smp_processor_id+0x108/0x120
[c001fc637740] [c00444e8] .kvmppc_handle_exit+0x48/0x810
[c001fc6377f0] [c0047c80] .kvmppc_resume_host+0xa4/0xf8

Tiejun



Re: [PATCH 2/2] KVM: PPC: Book3E: Get vcpu's last instruction for emulation

2013-07-10 Thread Alexander Graf

On 11.07.2013, at 02:15, Scott Wood wrote:

> On 07/10/2013 05:50:01 PM, Alexander Graf wrote:
>> On 10.07.2013, at 20:42, Scott Wood wrote:
>> > On 07/10/2013 05:15:09 AM, Alexander Graf wrote:
>> >> On 10.07.2013, at 02:06, Scott Wood wrote:
>> >> > On 07/09/2013 04:44:24 PM, Alexander Graf wrote:
>> >> >> On 09.07.2013, at 20:46, Scott Wood wrote:
>> >> >> > I suspect that tlbsx is faster, or at worst similar.  And unlike 
>> >> >> > comparing tlbsx to lwepx (not counting a fix for the threading 
>> >> >> > problem), we don't already have code to search the guest TLB, so 
>> >> >> > testing would be more work.
>> >> >> We have code to walk the guest TLB for TLB misses. This really is just 
>> >> >> the TLB miss search without host TLB injection.
>> >> >> So let's say we're using the shadow TLB. The guest always has its say 
>> >> >> 64 TLB entries that it can count on - we never evict anything by 
>> >> >> accident, because we store all of the 64 entries in our guest TLB 
>> >> >> cache. When the guest faults at an address, the first thing we do is 
>> >> >> we check the cache whether we have that page already mapped.
>> >> >> However, with this method we now have 2 enumeration methods for guest 
>> >> >> TLB searches. We have the tlbsx one which searches the host TLB and we 
>> >> >> have our guest TLB cache. The guest TLB cache might still contain an 
>> >> >> entry for an address that we already invalidated on the host. Would 
>> >> >> that impose a problem?
>> >> >> I guess not because we're swizzling the exit code around to instead be 
>> >> >> an instruction miss which means we restore the TLB entry into our 
>> >> >> host's TLB so that when we resume, we land here and the tlbsx hits. 
>> >> >> But it feels backwards.
>> >> >
>> >> > Any better way?  Searching the guest TLB won't work for the LRAT case, 
>> >> > so we'd need to have this logic around anyway.  We shouldn't add a 
>> >> > second codepath unless it's a clear performance gain -- and again, I 
>> >> > suspect it would be the opposite, especially if the entry is not in 
>> >> > TLB0 or in one of the first few entries searched in TLB1.  The tlbsx 
>> >> > miss case is not what we should optimize for.
>> >> Hrm.
>> >> So let's redesign this thing theoretically. We would have an exit that 
>> >> requires an instruction fetch. We would override kvmppc_get_last_inst() 
>> >> to always do kvmppc_ld_inst(). That one can fail because it can't find 
>> >> the TLB entry in the host TLB. When it fails, we have to abort the 
>> >> emulation and resume the guest at the same IP.
>> >> Now the guest gets the TLB miss, we populate, go back into the guest. The 
>> >> guest hits the emulation failure again. We go back to kvmppc_ld_inst() 
>> >> which succeeds this time and we can emulate the instruction.
>> >
>> > That's pretty much what this patch does, except that it goes immediately 
>> > to the TLB miss code rather than having the extra round-trip back to the 
>> > guest.  Is there any benefit from adding that extra round-trip?  Rewriting 
>> > the exit type instead doesn't seem that bad...
>> It's pretty bad. I want to have code that is easy to follow - and I don't 
>> care whether the very rare case of a TLB entry getting evicted by a random 
>> other thread right when we execute the exit path is slower by a few percent 
>> if we get cleaner code for that.
> 
> I guess I just don't see how this is so much harder to follow than returning 
> to guest.  I find it harder to follow the flow when there are more round 
> trips to the guest involved.  "Treat this as an ITLB miss" is simpler than, 
> "Let this fail, and make sure we retry the trapping instruction on failure.  
> Then, an ITLB miss will happen."
> 
> Also note that making kvmppc_get_last_inst() able to fail means updating 
> several existing callsites, both for the change in function signature and to 
> actually handle failures.
> 
> I don't care that deeply either way, it just doesn't seem obviously better.
> 
>> >> I think this works. Just make sure that the gateway to the instruction 
>> >> fetch is kvmppc_get_last_inst() and make that failable. Then the 
>> >> difference between looking for the TLB entry in the host's TLB or in the 
>> >> guest's TLB cache is hopefully negligible.
>> >
>> > I don't follow here.  What does this have to do with looking in the guest 
>> > TLB?
>> I want to hide the fact that we're cheating as much as possible, that's it.
> 
> How are we cheating, and what specifically are you proposing to do to hide 
> that?  How is the guest TLB involved at all in the change you're asking for?

It's not involved, but it's basically what we do on book3s pr kvm. There 
kvmppc_ld reads the guest htab, not the host htab. I think it's fine to expose 
both cases as the same thing to the rest of kvm.


Alex
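[Editor's note] The redesign sketched in this thread — make kvmppc_get_last_inst() failable, abort emulation on a fetch miss, resume the guest at the same IP so its ITLB miss refills the mapping, then retry — can be modelled as a toy state machine. The names echo the discussion (EMULATE_AGAIN, ITLB miss) but are stand-ins, not the kernel API:

```c
#include <stdint.h>

enum emulation_result {
    EMULATE_DONE,
    EMULATE_AGAIN,              /* fetch failed; re-enter guest at same IP */
};

static int tlb_mapped;          /* does the host TLB map the guest IP? */
static int emulated;

static enum emulation_result try_get_last_inst(uint32_t *inst)
{
    if (!tlb_mapped)
        return EMULATE_AGAIN;
    *inst = 0x7c0004acu;        /* some instruction word to "emulate" */
    return EMULATE_DONE;
}

static void guest_itlb_miss(void)
{
    tlb_mapped = 1;             /* guest fault path refills the mapping */
}

/* One emulation exit: abort-and-retry until the fetch succeeds.
 * Returns the number of extra guest round-trips taken. */
static int handle_emulation_exit(void)
{
    uint32_t inst;
    int round_trips = 0;

    while (try_get_last_inst(&inst) == EMULATE_AGAIN) {
        guest_itlb_miss();      /* re-enter guest; it faults and refills */
        round_trips++;
    }
    emulated = 1;               /* now we can actually emulate inst */
    return round_trips;
}
```

Scott's counter-argument maps directly onto the loop: the patch under review short-circuits `guest_itlb_miss()` by jumping straight to the TLB-miss code in the host, saving the extra round-trip that this model makes explicit.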


Re: [PATCH 2/2] KVM: PPC: Book3E: Get vcpu's last instruction for emulation

2013-07-10 Thread Scott Wood

On 07/10/2013 05:50:01 PM, Alexander Graf wrote:


On 10.07.2013, at 20:42, Scott Wood wrote:

> On 07/10/2013 05:15:09 AM, Alexander Graf wrote:
>> On 10.07.2013, at 02:06, Scott Wood wrote:
>> > On 07/09/2013 04:44:24 PM, Alexander Graf wrote:
>> >> On 09.07.2013, at 20:46, Scott Wood wrote:
>> >> > I suspect that tlbsx is faster, or at worst similar.  And  
unlike comparing tlbsx to lwepx (not counting a fix for the threading  
problem), we don't already have code to search the guest TLB, so  
testing would be more work.
>> >> We have code to walk the guest TLB for TLB misses. This really  
is just the TLB miss search without host TLB injection.
>> >> So let's say we're using the shadow TLB. The guest always has  
its say 64 TLB entries that it can count on - we never evict anything  
by accident, because we store all of the 64 entries in our guest TLB  
cache. When the guest faults at an address, the first thing we do is  
we check the cache whether we have that page already mapped.
>> >> However, with this method we now have 2 enumeration methods for  
guest TLB searches. We have the tlbsx one which searches the host TLB  
and we have our guest TLB cache. The guest TLB cache might still  
contain an entry for an address that we already invalidated on the  
host. Would that impose a problem?
>> >> I guess not because we're swizzling the exit code around to  
instead be an instruction miss which means we restore the TLB entry  
into our host's TLB so that when we resume, we land here and the  
tlbsx hits. But it feels backwards.

>> >
>> > Any better way?  Searching the guest TLB won't work for the LRAT  
case, so we'd need to have this logic around anyway.  We shouldn't  
add a second codepath unless it's a clear performance gain -- and  
again, I suspect it would be the opposite, especially if the entry is  
not in TLB0 or in one of the first few entries searched in TLB1.  The  
tlbsx miss case is not what we should optimize for.

>> Hrm.
>> So let's redesign this thing theoretically. We would have an exit  
that requires an instruction fetch. We would override  
kvmppc_get_last_inst() to always do kvmppc_ld_inst(). That one can  
fail because it can't find the TLB entry in the host TLB. When it  
fails, we have to abort the emulation and resume the guest at the  
same IP.
>> Now the guest gets the TLB miss, we populate, go back into the  
guest. The guest hits the emulation failure again. We go back to  
kvmppc_ld_inst() which succeeds this time and we can emulate the  
instruction.

>
> That's pretty much what this patch does, except that it goes  
immediately to the TLB miss code rather than having the extra  
round-trip back to the guest.  Is there any benefit from adding that  
extra round-trip?  Rewriting the exit type instead doesn't seem that  
bad...


It's pretty bad. I want to have code that is easy to follow - and I  
don't care whether the very rare case of a TLB entry getting evicted  
by a random other thread right when we execute the exit path is  
slower by a few percent if we get cleaner code for that.


I guess I just don't see how this is so much harder to follow than  
returning to guest.  I find it harder to follow the flow when there are  
more round trips to the guest involved.  "Treat this as an ITLB miss"  
is simpler than, "Let this fail, and make sure we retry the trapping  
instruction on failure.  Then, an ITLB miss will happen."


Also note that making kvmppc_get_last_inst() able to fail means  
updating several existing callsites, both for the change in function  
signature and to actually handle failures.


I don't care that deeply either way, it just doesn't seem obviously  
better.


>> I think this works. Just make sure that the gateway to the  
instruction fetch is kvmppc_get_last_inst() and make that failable.  
Then the difference between looking for the TLB entry in the host's  
TLB or in the guest's TLB cache is hopefully negligible.

>
> I don't follow here.  What does this have to do with looking in the  
guest TLB?


I want to hide the fact that we're cheating as much as possible,  
that's it.


How are we cheating, and what specifically are you proposing to do to  
hide that?  How is the guest TLB involved at all in the change you're  
asking for?


-Scott


Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-07-10 Thread Konrad Rzeszutek Wilk
Gleb Natapov  wrote:
>On Wed, Jul 10, 2013 at 11:03:15AM -0400, Konrad Rzeszutek Wilk wrote:
>> On Wed, Jul 10, 2013 at 01:47:17PM +0300, Gleb Natapov wrote:
>> > On Wed, Jul 10, 2013 at 12:40:47PM +0200, Peter Zijlstra wrote:
>> > > On Wed, Jul 10, 2013 at 01:33:25PM +0300, Gleb Natapov wrote:
>> > > 
>> > > Here's an idea, trim the damn email ;-) -- not only directed at gleb.
>> > > 
>> > Good idea.
>> > 
>> > > > > Ingo, Gleb,
>> > > > > 
>> > > > > From the results perspective, Andrew Theurer's and Vinod's test
>> > > > > results are pro-pvspinlock.
>> > > > > Could you please help me to know what will make it a mergeable
>> > > > > candidate?
>> > > > > 
>> > > > I need to spend more time reviewing it :) The problem with PV
>> > > > interfaces is that they are easy to add but hard to get rid of if a
>> > > > better solution (HW or otherwise) appears.
>> > > 
>> > > How so? Just make sure the registration for the PV interface is
>> > > optional; that is, allow it to fail. A guest that fails the PV setup
>> > > will either have to try another PV interface or fall back to 'native'.
>> > > 
>> > We have to carry PV around for live migration purposes. A PV interface
>> > cannot disappear under a running guest.
>> 
>> Why can't it? This is the same as handling, say, XSAVE operations. Some
>> hosts might have it - some might not. It is the job of the toolstack to
>> make sure not to migrate to the hosts which don't have it. Or bind the
>> guest to the lowest interface (so don't enable the PV interface if the
>> other hosts in the cluster can't support this flag)?
>XSAVE is a HW feature and it is not going to disappear under you after a
>software upgrade. Upgrading the kernel on part of your hosts and no longer
>being able to migrate to them is not something people who use live migration
>expect. In practice it means that updating all hosts in a datacenter to a
>newer kernel is no longer possible without rebooting VMs.
>
>--
>   Gleb.

I see. Perhaps, then, if the hardware becomes much better at this, another PV 
interface can be provided which will use the static_key to turn off the PV spin 
lock and use the bare-metal version (or perhaps some form of super elision 
locks). That does mean the host has to do something when this PV interface is 
invoked by the older guests. 

Anyhow, that said, I think the benefits are pretty neat right now and help a 
great deal, and worrying about whether the hardware vendors will provide 
something new is not benefiting users. What perhaps needs to be addressed, 
then, is how to retire this mechanism if the hardware becomes superb? 
-- 
Sent from my Android phone. Please excuse my brevity.
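[Editor's note] The static_key idea mentioned above — a branch patched once at boot that selects the PV or native slow path — can be modelled with a plain flag. The real kernel uses jump labels (static_key_false() and friends), so the flag, counters, and function names below are all illustrative stand-ins:

```c
/* "Static key": in the real kernel this is a patched jump label,
 * flipped once during boot after negotiating with the hypervisor. */
static int pv_spinlock_enabled;

static int native_slowpath_calls;
static int pv_slowpath_calls;

static void spin_lock_slowpath(void)
{
    if (pv_spinlock_enabled)
        pv_slowpath_calls++;      /* e.g. hypercall to halt the vcpu */
    else
        native_slowpath_calls++;  /* plain spinning on bare metal */
}
```

Konrad's point is that the same switch offers a retirement path: a future host could simply decline the PV negotiation, leaving the key off and the guest on the native slow path.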


Re: [PATCH 2/3] kvm/ppc: IRQ disabling cleanup

2013-07-10 Thread Alexander Graf

On 11.07.2013, at 01:08, Scott Wood wrote:

> On 07/10/2013 06:04:53 PM, Alexander Graf wrote:
>> On 11.07.2013, at 01:01, Benjamin Herrenschmidt wrote:
>> > On Thu, 2013-07-11 at 00:57 +0200, Alexander Graf wrote:
>> >>> #ifdef CONFIG_PPC64
>> >>> + /*
>> >>> +  * To avoid races, the caller must have gone directly from having
>> >>> +  * interrupts fully-enabled to hard-disabled.
>> >>> +  */
>> >>> + WARN_ON(local_paca->irq_happened != PACA_IRQ_HARD_DIS);
>> >>
>> >> WARN_ON(lazy_irq_pending()); ?
>> >
>> > Different semantics. What you propose will not catch irq_happened == 0 :-)
>> Right, but we only ever reach here after hard_irq_disable() I think.
> 
> And the WARN_ON helps us ensure that it stays that way.

Heh - ok :). Works for me.


Alex



Re: [PATCH 2/3] kvm/ppc: IRQ disabling cleanup

2013-07-10 Thread Scott Wood

On 07/10/2013 06:04:53 PM, Alexander Graf wrote:


On 11.07.2013, at 01:01, Benjamin Herrenschmidt wrote:

> On Thu, 2013-07-11 at 00:57 +0200, Alexander Graf wrote:
>>> #ifdef CONFIG_PPC64
>>> + /*
>>> +  * To avoid races, the caller must have gone directly from  
having

>>> +  * interrupts fully-enabled to hard-disabled.
>>> +  */
>>> + WARN_ON(local_paca->irq_happened != PACA_IRQ_HARD_DIS);
>>
>> WARN_ON(lazy_irq_pending()); ?
>
> Different semantics. What you propose will not catch irq_happened  
== 0 :-)


Right, but we only ever reach here after hard_irq_disable() I think.


And the WARN_ON helps us ensure that it stays that way.

-Scott


Re: [PATCH 2/3] kvm/ppc: IRQ disabling cleanup

2013-07-10 Thread Alexander Graf

On 11.07.2013, at 01:01, Benjamin Herrenschmidt wrote:

> On Thu, 2013-07-11 at 00:57 +0200, Alexander Graf wrote:
>>> #ifdef CONFIG_PPC64
>>> + /*
>>> +  * To avoid races, the caller must have gone directly from having
>>> +  * interrupts fully-enabled to hard-disabled.
>>> +  */
>>> + WARN_ON(local_paca->irq_happened != PACA_IRQ_HARD_DIS);
>> 
>> WARN_ON(lazy_irq_pending()); ?
> 
> Different semantics. What you propose will not catch irq_happened == 0 :-)

Right, but we only ever reach here after hard_irq_disable() I think.


Alex



Re: [PATCH 2/3] kvm/ppc: IRQ disabling cleanup

2013-07-10 Thread Benjamin Herrenschmidt
On Thu, 2013-07-11 at 00:57 +0200, Alexander Graf wrote:
> > #ifdef CONFIG_PPC64
> > + /*
> > +  * To avoid races, the caller must have gone directly from having
> > +  * interrupts fully-enabled to hard-disabled.
> > +  */
> > + WARN_ON(local_paca->irq_happened != PACA_IRQ_HARD_DIS);
> 
> WARN_ON(lazy_irq_pending()); ?

Different semantics. What you propose will not catch irq_happened == 0 :-)

Cheers,
Ben.
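[Editor's note] Ben's distinction can be made concrete: lazy_irq_pending() masks out the hard-disable marker before testing, so it cannot tell "hard-disabled with nothing pending" apart from "never hard-disabled at all", while the patch's exact-match WARN_ON catches both bad states. Flag values below are illustrative; the real definitions live in the powerpc hw_irq headers:

```c
#include <stdint.h>

#define PACA_IRQ_HARD_DIS 0x01   /* illustrative values */
#define PACA_IRQ_DEC      0x08

/* Model of lazy_irq_pending(): "is anything pending besides the
 * hard-disable marker itself?" */
static int model_lazy_irq_pending(uint8_t irq_happened)
{
    return (irq_happened & ~PACA_IRQ_HARD_DIS) != 0;
}

/* The patch's stricter check: the caller must have gone straight from
 * fully-enabled to hard-disabled, i.e. exactly PACA_IRQ_HARD_DIS. */
static int model_entry_state_bad(uint8_t irq_happened)
{
    return irq_happened != PACA_IRQ_HARD_DIS;
}
```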




Re: [PATCH 0/3] kvm/ppc: fixes/cleanup

2013-07-10 Thread Alexander Graf

On 11.07.2013, at 00:47, Scott Wood wrote:

> These are the remaining patches from the set of 8 that was
> originally posted for 3.10, but which weren't deemed urgent enough
> for that.

Thanks, applied 1/3 and 3/3 to kvm-ppc-queue.


Alex

> 
> Scott Wood (3):
>  kvm/ppc: Call trace_hardirqs_on before entry
>  kvm/ppc: IRQ disabling cleanup
>  kvm/ppc/booke: Don't call kvm_guest_enter twice
> 
> arch/powerpc/include/asm/kvm_ppc.h | 17 ++---
> arch/powerpc/kvm/book3s_pr.c   | 16 +---
> arch/powerpc/kvm/booke.c   | 17 +
> arch/powerpc/kvm/powerpc.c | 25 ++---
> 4 files changed, 34 insertions(+), 41 deletions(-)
> 
> -- 
> 1.8.1.2
> 
> 



Re: [PATCH 2/3] kvm/ppc: IRQ disabling cleanup

2013-07-10 Thread Alexander Graf

On 11.07.2013, at 00:47, Scott Wood wrote:

> Simplify the handling of lazy EE by going directly from fully-enabled
> to hard-disabled.  This replaces the lazy_irq_pending() check
> (including its misplaced kvm_guest_exit() call).
> 
> As suggested by Tiejun Chen, move the interrupt disabling into
> kvmppc_prepare_to_enter() rather than have each caller do it.  Also
> move the IRQ enabling on heavyweight exit into
> kvmppc_prepare_to_enter().
> 
> Signed-off-by: Scott Wood 

Ben, could you please review this too? :) Thanks a bunch!

> ---
> arch/powerpc/include/asm/kvm_ppc.h |  6 ++
> arch/powerpc/kvm/book3s_pr.c   | 12 +++-
> arch/powerpc/kvm/booke.c   | 11 +++
> arch/powerpc/kvm/powerpc.c | 23 ++-
> 4 files changed, 22 insertions(+), 30 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
> b/arch/powerpc/include/asm/kvm_ppc.h
> index 6885846..e4474f8 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -404,6 +404,12 @@ static inline void kvmppc_fix_ee_before_entry(void)
>   trace_hardirqs_on();
> 
> #ifdef CONFIG_PPC64
> + /*
> +  * To avoid races, the caller must have gone directly from having
> +  * interrupts fully-enabled to hard-disabled.
> +  */
> + WARN_ON(local_paca->irq_happened != PACA_IRQ_HARD_DIS);

WARN_ON(lazy_irq_pending()); ?


Alex

> +
>   /* Only need to enable IRQs by hard enabling them after this */
>   local_paca->irq_happened = 0;
>   local_paca->soft_enabled = 1;
> diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
> index ddfaf56..c13caa0 100644
> --- a/arch/powerpc/kvm/book3s_pr.c
> +++ b/arch/powerpc/kvm/book3s_pr.c
> @@ -884,14 +884,11 @@ program_interrupt:
>* and if we really did time things so badly, then we just exit
>* again due to a host external interrupt.
>*/
> - local_irq_disable();
>   s = kvmppc_prepare_to_enter(vcpu);
> - if (s <= 0) {
> - local_irq_enable();
> + if (s <= 0)
>   r = s;
> - } else {
> + else
>   kvmppc_fix_ee_before_entry();
> - }
>   }
> 
>   trace_kvm_book3s_reenter(r, vcpu);
> @@ -1121,12 +1118,9 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct 
> kvm_vcpu *vcpu)
>* really did time things so badly, then we just exit again due to
>* a host external interrupt.
>*/
> - local_irq_disable();
>   ret = kvmppc_prepare_to_enter(vcpu);
> - if (ret <= 0) {
> - local_irq_enable();
> + if (ret <= 0)
>   goto out;
> - }
> 
>   /* Save FPU state in stack */
>   if (current->thread.regs->msr & MSR_FP)
> diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
> index 6e35351..aa3bc36 100644
> --- a/arch/powerpc/kvm/booke.c
> +++ b/arch/powerpc/kvm/booke.c
> @@ -617,7 +617,7 @@ int kvmppc_core_prepare_to_enter(struct kvm_vcpu *vcpu)
>   local_irq_enable();
>   kvm_vcpu_block(vcpu);
>   clear_bit(KVM_REQ_UNHALT, &vcpu->requests);
> - local_irq_disable();
> + hard_irq_disable();
> 
>   kvmppc_set_exit_type(vcpu, EMULATED_MTMSRWE_EXITS);
>   r = 1;
> @@ -666,10 +666,8 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct 
> kvm_vcpu *vcpu)
>   return -EINVAL;
>   }
> 
> - local_irq_disable();
>   s = kvmppc_prepare_to_enter(vcpu);
>   if (s <= 0) {
> - local_irq_enable();
>   ret = s;
>   goto out;
>   }
> @@ -1162,14 +1160,11 @@ int kvmppc_handle_exit(struct kvm_run *run, struct 
> kvm_vcpu *vcpu,
>* aren't already exiting to userspace for some other reason.
>*/
>   if (!(r & RESUME_HOST)) {
> - local_irq_disable();
>   s = kvmppc_prepare_to_enter(vcpu);
> - if (s <= 0) {
> - local_irq_enable();
> + if (s <= 0)
>   r = (s << 2) | RESUME_HOST | (r & RESUME_FLAG_NV);
> - } else {
> + else
>   kvmppc_fix_ee_before_entry();
> - }
>   }
> 
>   return r;
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index 4e05f8c..2f7a221 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -64,12 +64,14 @@ int kvmppc_prepare_to_enter(struct kvm_vcpu *vcpu)
> {
>   int r = 1;
> 
> - WARN_ON_ONCE(!irqs_disabled());
> + WARN_ON(irqs_disabled());
> + hard_irq_disable();
> +
>   while (true) {
>   if (need_resched()) {
>   local_irq_enable();
>   cond_resched();
> - local_irq_disable();
> + hard_irq_disable();
>   continue;
>  
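[Editor's note] The contract change in the hunk above is easy to miss: before the patch, kvmppc_prepare_to_enter() warned if interrupts were *enabled* on entry; after it, callers arrive with interrupts still enabled and the function owns the disable itself, re-enabling only around cond_resched(). A toy model of the new contract (the irq flag and helpers are stand-ins, not the kernel's soft-EE machinery):

```c
#include <assert.h>

/* Toy interrupt state standing in for the real soft-EE machinery. */
static int irqs_enabled = 1;

static void toy_hard_irq_disable(void) { irqs_enabled = 0; }
static void toy_local_irq_enable(void) { irqs_enabled = 1; }
static int need_resched_once;      /* pretend the scheduler wants in once */

/* After the patch: enter with interrupts enabled, leave hard-disabled. */
static int toy_prepare_to_enter(void)
{
    assert(irqs_enabled);          /* WARN_ON(irqs_disabled()) analogue */
    toy_hard_irq_disable();

    while (need_resched_once) {
        toy_local_irq_enable();    /* window for cond_resched() */
        need_resched_once = 0;
        toy_hard_irq_disable();
    }
    return 1;                      /* proceed to guest entry */
}
```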

Re: [PATCH 2/2] KVM: PPC: Book3E: Get vcpu's last instruction for emulation

2013-07-10 Thread Alexander Graf

On 10.07.2013, at 20:42, Scott Wood wrote:

> On 07/10/2013 05:15:09 AM, Alexander Graf wrote:
>> On 10.07.2013, at 02:06, Scott Wood wrote:
>> > On 07/09/2013 04:44:24 PM, Alexander Graf wrote:
>> >> On 09.07.2013, at 20:46, Scott Wood wrote:
>> >> > I suspect that tlbsx is faster, or at worst similar.  And unlike 
>> >> > comparing tlbsx to lwepx (not counting a fix for the threading 
>> >> > problem), we don't already have code to search the guest TLB, so 
>> >> > testing would be more work.
>> >> We have code to walk the guest TLB for TLB misses. This really is just 
>> >> the TLB miss search without host TLB injection.
>> >> So let's say we're using the shadow TLB. The guest always has its say 64 
>> >> TLB entries that it can count on - we never evict anything by accident, 
>> >> because we store all of the 64 entries in our guest TLB cache. When the 
>> >> guest faults at an address, the first thing we do is we check the cache 
>> >> whether we have that page already mapped.
>> >> However, with this method we now have 2 enumeration methods for guest TLB 
>> >> searches. We have the tlbsx one which searches the host TLB and we have 
>> >> our guest TLB cache. The guest TLB cache might still contain an entry for 
>> >> an address that we already invalidated on the host. Would that impose a 
>> >> problem?
>> >> I guess not because we're swizzling the exit code around to instead be an 
>> >> instruction miss which means we restore the TLB entry into our host's TLB 
>> >> so that when we resume, we land here and the tlbsx hits. But it feels 
>> >> backwards.
>> >
>> > Any better way?  Searching the guest TLB won't work for the LRAT case, so 
>> > we'd need to have this logic around anyway.  We shouldn't add a second 
>> > codepath unless it's a clear performance gain -- and again, I suspect it 
>> > would be the opposite, especially if the entry is not in TLB0 or in one of 
>> > the first few entries searched in TLB1.  The tlbsx miss case is not what 
>> > we should optimize for.
>> Hrm.
>> So let's redesign this thing theoretically. We would have an exit that 
>> requires an instruction fetch. We would override kvmppc_get_last_inst() to 
>> always do kvmppc_ld_inst(). That one can fail because it can't find the TLB 
>> entry in the host TLB. When it fails, we have to abort the emulation and 
>> resume the guest at the same IP.
>> Now the guest gets the TLB miss, we populate, go back into the guest. The 
>> guest hits the emulation failure again. We go back to kvmppc_ld_inst() which 
>> succeeds this time and we can emulate the instruction.
> 
> That's pretty much what this patch does, except that it goes immediately to 
> the TLB miss code rather than having the extra round-trip back to the guest.  
> Is there any benefit from adding that extra round-trip?  Rewriting the exit 
> type instead doesn't seem that bad...

It's pretty bad. I want to have code that is easy to follow - and I don't care 
whether the very rare case of a TLB entry getting evicted by a random other 
thread right when we execute the exit path is slower by a few percent if we get 
cleaner code for that.

> 
>> I think this works. Just make sure that the gateway to the instruction fetch 
>> is kvmppc_get_last_inst() and make that failable. Then the difference 
>> between looking for the TLB entry in the host's TLB or in the guest's TLB 
>> cache is hopefully negligible.
> 
> I don't follow here.  What does this have to do with looking in the guest TLB?

I want to hide the fact that we're cheating as much as possible, that's it.


Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] KVM: PPC: Book3E: Get vcpu's last instruction for emulation

2013-07-10 Thread Alexander Graf

On 10.07.2013, at 20:37, Scott Wood wrote:

> On 07/10/2013 05:18:10 AM, Alexander Graf wrote:
>> On 10.07.2013, at 02:12, Scott Wood wrote:
>> > On 07/09/2013 04:45:10 PM, Alexander Graf wrote:
>> >> On 28.06.2013, at 11:20, Mihai Caraman wrote:
>> >> > +   /* Get page size */
>> >> > +   if (MAS0_GET_TLBSEL(mfspr(SPRN_MAS0)) == 0)
>> >> > +   psize_shift = PAGE_SHIFT;
>> >> > +   else
>> >> > +   psize_shift = MAS1_GET_TSIZE(mas1) + 10;
>> >> > +
>> >> > +   mas7_mas3 = (((u64) mfspr(SPRN_MAS7)) << 32) |
>> >> > +   mfspr(SPRN_MAS3);
>> >> > +   addr = (mas7_mas3 & (~0ULL << psize_shift)) |
>> >> > +  (geaddr & ((1ULL << psize_shift) - 1ULL));
>> >> > +
>> >> > +   /* Map a page and get guest's instruction */
>> >> > +   page = pfn_to_page(addr >> PAGE_SHIFT);
>> >> While looking at this I just realized that you're missing a check here. 
>> >> What if our IP is in some PCI BAR? Or can't we execute from those?
>> >
>> > We at least need to check pfn_valid() first.  That'll just keep us from 
>> > accessing a bad pointer in the host kernel, though -- it won't make the 
>> > emulation actually work.  If we need that, we'll probably need to create a 
>> > temporary TLB entry manually.
>> ioremap()?
> 
> That's a bit heavy... also we'd need to deal with cacheability.  This code is 
> already engaged in directly creating TLB entries, so it doesn't seem like 
> much of a stretch to create one for this.  It should be faster than ioremap() 
> or kmap_atomic().

Do we have any guarantees that the TLB entry we create doesn't get evicted by 
another thread by the time we want to use it?


Alex

> The one complication is allocating the virtual address space, but maybe we 
> could just use the page that kmap_atomic would have used?  Of course, if we 
> want to handle execution from other than normal kernel memory, we'll need to 
> make sure that the virtual address space is allocated even when highmem is 
> not present (e.g. 64-bit).
> 
>> However, if we were walking the guest TLB cache instead we would get a guest 
>> physical address which we can always resolve to a host virtual address.
>> I'm not sure how important that whole use case is though. Maybe we should 
>> just error out to the guest for now.
> 
> It's not that important, now that we are using hugetlb rather than directly 
> mapping a large hunk of reserved memory.  It would be nice to handle it 
> though, if we can do without too much hassle.  And I think manually creating 
> a TLB entry could be faster than kmap_atomic(), or searching the guest TLB 
> and then doing a reverse memslot lookup.
> 
> -Scott
> --
> To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



[PATCH 2/3] kvm/ppc: IRQ disabling cleanup

2013-07-10 Thread Scott Wood
Simplify the handling of lazy EE by going directly from fully-enabled
to hard-disabled.  This replaces the lazy_irq_pending() check
(including its misplaced kvm_guest_exit() call).

As suggested by Tiejun Chen, move the interrupt disabling into
kvmppc_prepare_to_enter() rather than have each caller do it.  Also
move the IRQ enabling on heavyweight exit into
kvmppc_prepare_to_enter().

Signed-off-by: Scott Wood 
---
 arch/powerpc/include/asm/kvm_ppc.h |  6 ++
 arch/powerpc/kvm/book3s_pr.c   | 12 +++-
 arch/powerpc/kvm/booke.c   | 11 +++
 arch/powerpc/kvm/powerpc.c | 23 ++-
 4 files changed, 22 insertions(+), 30 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 6885846..e4474f8 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -404,6 +404,12 @@ static inline void kvmppc_fix_ee_before_entry(void)
trace_hardirqs_on();
 
 #ifdef CONFIG_PPC64
+   /*
+* To avoid races, the caller must have gone directly from having
+* interrupts fully-enabled to hard-disabled.
+*/
+   WARN_ON(local_paca->irq_happened != PACA_IRQ_HARD_DIS);
+
/* Only need to enable IRQs by hard enabling them after this */
local_paca->irq_happened = 0;
local_paca->soft_enabled = 1;
diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
index ddfaf56..c13caa0 100644
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -884,14 +884,11 @@ program_interrupt:
 * and if we really did time things so badly, then we just exit
 * again due to a host external interrupt.
 */
-   local_irq_disable();
s = kvmppc_prepare_to_enter(vcpu);
-   if (s <= 0) {
-   local_irq_enable();
+   if (s <= 0)
r = s;
-   } else {
+   else
kvmppc_fix_ee_before_entry();
-   }
}
 
trace_kvm_book3s_reenter(r, vcpu);
@@ -1121,12 +1118,9 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct 
kvm_vcpu *vcpu)
 * really did time things so badly, then we just exit again due to
 * a host external interrupt.
 */
-   local_irq_disable();
ret = kvmppc_prepare_to_enter(vcpu);
-   if (ret <= 0) {
-   local_irq_enable();
+   if (ret <= 0)
goto out;
-   }
 
/* Save FPU state in stack */
if (current->thread.regs->msr & MSR_FP)
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 6e35351..aa3bc36 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -617,7 +617,7 @@ int kvmppc_core_prepare_to_enter(struct kvm_vcpu *vcpu)
local_irq_enable();
kvm_vcpu_block(vcpu);
clear_bit(KVM_REQ_UNHALT, &vcpu->requests);
-   local_irq_disable();
+   hard_irq_disable();
 
kvmppc_set_exit_type(vcpu, EMULATED_MTMSRWE_EXITS);
r = 1;
@@ -666,10 +666,8 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct 
kvm_vcpu *vcpu)
return -EINVAL;
}
 
-   local_irq_disable();
s = kvmppc_prepare_to_enter(vcpu);
if (s <= 0) {
-   local_irq_enable();
ret = s;
goto out;
}
@@ -1162,14 +1160,11 @@ int kvmppc_handle_exit(struct kvm_run *run, struct 
kvm_vcpu *vcpu,
 * aren't already exiting to userspace for some other reason.
 */
if (!(r & RESUME_HOST)) {
-   local_irq_disable();
s = kvmppc_prepare_to_enter(vcpu);
-   if (s <= 0) {
-   local_irq_enable();
+   if (s <= 0)
r = (s << 2) | RESUME_HOST | (r & RESUME_FLAG_NV);
-   } else {
+   else
kvmppc_fix_ee_before_entry();
-   }
}
 
return r;
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 4e05f8c..2f7a221 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -64,12 +64,14 @@ int kvmppc_prepare_to_enter(struct kvm_vcpu *vcpu)
 {
int r = 1;
 
-   WARN_ON_ONCE(!irqs_disabled());
+   WARN_ON(irqs_disabled());
+   hard_irq_disable();
+
while (true) {
if (need_resched()) {
local_irq_enable();
cond_resched();
-   local_irq_disable();
+   hard_irq_disable();
continue;
}
 
@@ -95,7 +97,7 @@ int kvmppc_prepare_to_enter(struct kvm_vcpu *vcpu)
local_irq_enable();
trace_kvm_check_requests(vcpu);
r = kvmppc_core_check_requests(

[PATCH 1/3] kvm/ppc: Call trace_hardirqs_on before entry

2013-07-10 Thread Scott Wood
Currently this is only being done on 64-bit.  Rather than just move it
out of the 64-bit ifdef, move it to kvm_lazy_ee_enable() so that it is
consistent with lazy ee state, and so that we don't track more host
code as interrupts-enabled than necessary.

Rename kvm_lazy_ee_enable() to kvm_fix_ee_before_entry() to reflect
that this function now has a role on 32-bit as well.

Signed-off-by: Scott Wood 
---
 arch/powerpc/include/asm/kvm_ppc.h | 11 ---
 arch/powerpc/kvm/book3s_pr.c   |  4 ++--
 arch/powerpc/kvm/booke.c   |  4 ++--
 arch/powerpc/kvm/powerpc.c |  2 --
 4 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index a5287fe..6885846 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -394,10 +394,15 @@ static inline void kvmppc_mmu_flush_icache(pfn_t pfn)
}
 }
 
-/* Please call after prepare_to_enter. This function puts the lazy ee state
-   back to normal mode, without actually enabling interrupts. */
-static inline void kvmppc_lazy_ee_enable(void)
+/*
+ * Please call after prepare_to_enter. This function puts the lazy ee and irq
+ * disabled tracking state back to normal mode, without actually enabling
+ * interrupts.
+ */
+static inline void kvmppc_fix_ee_before_entry(void)
 {
+   trace_hardirqs_on();
+
 #ifdef CONFIG_PPC64
/* Only need to enable IRQs by hard enabling them after this */
local_paca->irq_happened = 0;
diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
index 19498a5..ddfaf56 100644
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -890,7 +890,7 @@ program_interrupt:
local_irq_enable();
r = s;
} else {
-   kvmppc_lazy_ee_enable();
+   kvmppc_fix_ee_before_entry();
}
}
 
@@ -1161,7 +1161,7 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct 
kvm_vcpu *vcpu)
if (vcpu->arch.shared->msr & MSR_FP)
kvmppc_handle_ext(vcpu, BOOK3S_INTERRUPT_FP_UNAVAIL, MSR_FP);
 
-   kvmppc_lazy_ee_enable();
+   kvmppc_fix_ee_before_entry();
 
ret = __kvmppc_vcpu_run(kvm_run, vcpu);
 
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index dcc94f0..6e35351 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -698,7 +698,7 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct 
kvm_vcpu *vcpu)
kvmppc_load_guest_fp(vcpu);
 #endif
 
-   kvmppc_lazy_ee_enable();
+   kvmppc_fix_ee_before_entry();
 
ret = __kvmppc_vcpu_run(kvm_run, vcpu);
 
@@ -1168,7 +1168,7 @@ int kvmppc_handle_exit(struct kvm_run *run, struct 
kvm_vcpu *vcpu,
local_irq_enable();
r = (s << 2) | RESUME_HOST | (r & RESUME_FLAG_NV);
} else {
-   kvmppc_lazy_ee_enable();
+   kvmppc_fix_ee_before_entry();
}
}
 
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 6316ee3..4e05f8c 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -117,8 +117,6 @@ int kvmppc_prepare_to_enter(struct kvm_vcpu *vcpu)
kvm_guest_exit();
continue;
}
-
-   trace_hardirqs_on();
 #endif
 
kvm_guest_enter();
-- 
1.8.1.2




[PATCH 3/3] kvm/ppc/booke: Don't call kvm_guest_enter twice

2013-07-10 Thread Scott Wood
kvm_guest_enter() was already called by kvmppc_prepare_to_enter().
Don't call it again.

Signed-off-by: Scott Wood 
---
 arch/powerpc/kvm/booke.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index aa3bc36..57c1462 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -672,8 +672,6 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct 
kvm_vcpu *vcpu)
goto out;
}
 
-   kvm_guest_enter();
-
 #ifdef CONFIG_PPC_FPU
/* Save userspace FPU state in stack */
enable_kernel_fp();
-- 
1.8.1.2




[PATCH 0/3] kvm/ppc: fixes/cleanup

2013-07-10 Thread Scott Wood
These are the remaining patches from the set of 8 that was
originally posted for 3.10, but which weren't deemed urgent enough
for that.

Scott Wood (3):
  kvm/ppc: Call trace_hardirqs_on before entry
  kvm/ppc: IRQ disabling cleanup
  kvm/ppc/booke: Don't call kvm_guest_enter twice

 arch/powerpc/include/asm/kvm_ppc.h | 17 ++---
 arch/powerpc/kvm/book3s_pr.c   | 16 +---
 arch/powerpc/kvm/booke.c   | 17 +
 arch/powerpc/kvm/powerpc.c | 25 ++---
 4 files changed, 34 insertions(+), 41 deletions(-)

-- 
1.8.1.2




Re: [PATCH 2/2] KVM: PPC: Book3E: Emulate MCSRR0/1 SPR and rfmci instruction

2013-07-10 Thread Alexander Graf

On 10.07.2013, at 20:24, Scott Wood wrote:

> On 07/10/2013 05:23:36 AM, Alexander Graf wrote:
>> On 10.07.2013, at 00:26, Scott Wood wrote:
>> > On 07/09/2013 05:00:26 PM, Alexander Graf wrote:
>> >> It'll also be more flexible at the same time. You could take the logs and 
>> >> actually check what's going on to debug issues that you're encountering 
>> >> for example.
>> >> We could even go as far as sharing the same tool with other 
>> >> architectures, so that we only have to learn how to debug things once.
>> >
>> > Have you encountered an actual need for this flexibility, or is it 
>> > theoretical?
>> Yeah, first thing I did back then to actually debug kvm failures was to add 
>> trace points.
> 
> I meant specifically for handling exit timings this way.

No, but I did encounter the need for debugging exits. And the only thing we 
would need to get exit timing stats once we already have trace points for exit 
types would be to have trace points for guest entry and maybe type specific 
events to indicate what the exit is about as well.

> 
>> > Is there common infrastructure for dealing with measuring intervals and 
>> > tracking statistics thereof, rather than just tracking points and letting 
>> > userspace connect the dots (though it could still do that as an option)?  
>> > Even if it must be done in userspace, it doesn't seem like something that 
>> > should be KVM-specific.
>> Would you like to have different ways of measuring mm subsystem overhead? I 
>> don't :). The same goes for KVM really. If we could converge towards a 
>> single user space interface to get exit timings, it'd make debugging a lot 
>> easier.
> 
> I agree -- that's why I said it doesn't seem like something that should be 
> KVM-specific.  But that's orthogonal to whether it's done in kernel space or 
> user space.  The ability to get begin/end events from userspace would be nice 
> when it is specifically requested, but it would also be nice if the kernel 
> could track some basic statistics so we wouldn't have to ship so much data 
> around to arrive at the same result.
> 
> At the very least, I'd like such a tool/infrastructure to exist before we 
> start complaining about doing minor maintenance of the current mechanism.

I admit that I don't fully understand qemu/scripts/kvm/kvm_stat, but it seems 
to me as if it already does pretty much what we want. It sets up a filter to 
only get events and their time stamps through.

It does use normal exit trace points on x86 to replace the old debugfs based 
stat counters. And it seems to work reasonably well for that.

> 
>> We already have this for the debugfs counters btw. And the timing framework 
>> does break kvm_stat today already, as it emits textual stats rather than 
>> numbers which all of the other debugfs stats do. But at least I can take the 
>> x86 kvm_stat tool and run it on ppc just fine to see exit stats.
> 
> We already have what?  The last two sentences seem contradictory -- can you 
> or can't you use kvm_stat as is?  I'm not familiar with kvm_stat.

Kvm_stat back in the day used debugfs to give you an idea on what exit event 
happens most often. That mechanism got replaced by trace points later which the 
current kvm_stat uses.

I still have a copy of the old kvm_stat that I always use to get a first 
feeling for what goes wrong if something goes wrong. The original code couldn't 
deal with the fact that we have a debugfs file that contains text though. I 
patched it locally. It also works just fine if you simply disable timing stats, 
since then you won't have the text file.

> What does x86 KVM expose in debugfs?

The same thing it always exposed - exit stats. I am fairly sure Avi wanted to 
completely deprecate that interface in favor of the trace point based approach, 
but I don't think he ever got around to it.


Alex



Re: [v1][PATCH 1/1] KVM: PPC: disable preemption when using hard_irq_disable()

2013-07-10 Thread Scott Wood

On 07/10/2013 01:02:19 AM, Tiejun Chen wrote:
> We should ensure the preemption cannot occur while calling get_paca()
> inside hard_irq_disable(), otherwise the paca_struct may be the
> wrong one just after. And btw, we may update timing stats in this case.

The soft-ee mechanism depends on accessing the PACA directly via r13 to
avoid this.  We probably should be using inline asm to read was_enabled
rather than hoping the compiler doesn't do anything silly.


Plus what Alex said, regarding this patch specifically.

-Scott


Re: [PATCH 2/2] KVM: PPC: Book3E: Get vcpu's last instruction for emulation

2013-07-10 Thread Scott Wood

On 07/10/2013 05:15:09 AM, Alexander Graf wrote:
> On 10.07.2013, at 02:06, Scott Wood wrote:
>> On 07/09/2013 04:44:24 PM, Alexander Graf wrote:
>>> On 09.07.2013, at 20:46, Scott Wood wrote:
>>>> I suspect that tlbsx is faster, or at worst similar.  And unlike
>>>> comparing tlbsx to lwepx (not counting a fix for the threading
>>>> problem), we don't already have code to search the guest TLB, so
>>>> testing would be more work.
>>> We have code to walk the guest TLB for TLB misses. This really is
>>> just the TLB miss search without host TLB injection.
>>> So let's say we're using the shadow TLB. The guest always has its
>>> say 64 TLB entries that it can count on - we never evict anything by
>>> accident, because we store all of the 64 entries in our guest TLB
>>> cache. When the guest faults at an address, the first thing we do is
>>> we check the cache whether we have that page already mapped.
>>> However, with this method we now have 2 enumeration methods for
>>> guest TLB searches. We have the tlbsx one which searches the host TLB
>>> and we have our guest TLB cache. The guest TLB cache might still
>>> contain an entry for an address that we already invalidated on the
>>> host. Would that impose a problem?
>>> I guess not because we're swizzling the exit code around to
>>> instead be an instruction miss which means we restore the TLB entry
>>> into our host's TLB so that when we resume, we land here and the
>>> tlbsx hits. But it feels backwards.
>> Any better way?  Searching the guest TLB won't work for the LRAT
>> case, so we'd need to have this logic around anyway.  We shouldn't
>> add a second codepath unless it's a clear performance gain -- and
>> again, I suspect it would be the opposite, especially if the entry is
>> not in TLB0 or in one of the first few entries searched in TLB1.  The
>> tlbsx miss case is not what we should optimize for.
> Hrm.
> So let's redesign this thing theoretically. We would have an exit
> that requires an instruction fetch. We would override
> kvmppc_get_last_inst() to always do kvmppc_ld_inst(). That one can
> fail because it can't find the TLB entry in the host TLB. When it
> fails, we have to abort the emulation and resume the guest at the
> same IP.
> Now the guest gets the TLB miss, we populate, go back into the guest.
> The guest hits the emulation failure again. We go back to
> kvmppc_ld_inst() which succeeds this time and we can emulate the
> instruction.

That's pretty much what this patch does, except that it goes
immediately to the TLB miss code rather than having the extra
round-trip back to the guest.  Is there any benefit from adding that
extra round-trip?  Rewriting the exit type instead doesn't seem that
bad...

> I think this works. Just make sure that the gateway to the
> instruction fetch is kvmppc_get_last_inst() and make that failable.
> Then the difference between looking for the TLB entry in the host's
> TLB or in the guest's TLB cache is hopefully negligible.

I don't follow here.  What does this have to do with looking in the
guest TLB?

-Scott


Re: [PATCH 2/2] KVM: PPC: Book3E: Get vcpu's last instruction for emulation

2013-07-10 Thread Scott Wood

On 07/10/2013 05:18:10 AM, Alexander Graf wrote:
> On 10.07.2013, at 02:12, Scott Wood wrote:
>> On 07/09/2013 04:45:10 PM, Alexander Graf wrote:
>>> On 28.06.2013, at 11:20, Mihai Caraman wrote:
>>>> +   /* Get page size */
>>>> +   if (MAS0_GET_TLBSEL(mfspr(SPRN_MAS0)) == 0)
>>>> +   psize_shift = PAGE_SHIFT;
>>>> +   else
>>>> +   psize_shift = MAS1_GET_TSIZE(mas1) + 10;
>>>> +
>>>> +   mas7_mas3 = (((u64) mfspr(SPRN_MAS7)) << 32) |
>>>> +   mfspr(SPRN_MAS3);
>>>> +   addr = (mas7_mas3 & (~0ULL << psize_shift)) |
>>>> +  (geaddr & ((1ULL << psize_shift) - 1ULL));
>>>> +
>>>> +   /* Map a page and get guest's instruction */
>>>> +   page = pfn_to_page(addr >> PAGE_SHIFT);
>>> While looking at this I just realized that you're missing a check
>>> here. What if our IP is in some PCI BAR? Or can't we execute from
>>> those?
>> We at least need to check pfn_valid() first.  That'll just keep us
>> from accessing a bad pointer in the host kernel, though -- it won't
>> make the emulation actually work.  If we need that, we'll probably
>> need to create a temporary TLB entry manually.
> ioremap()?

That's a bit heavy... also we'd need to deal with cacheability.  This
code is already engaged in directly creating TLB entries, so it doesn't
seem like much of a stretch to create one for this.  It should be
faster than ioremap() or kmap_atomic().

The one complication is allocating the virtual address space, but maybe
we could just use the page that kmap_atomic would have used?  Of
course, if we want to handle execution from other than normal kernel
memory, we'll need to make sure that the virtual address space is
allocated even when highmem is not present (e.g. 64-bit).

> However, if we were walking the guest TLB cache instead we would get
> a guest physical address which we can always resolve to a host
> virtual address.
> I'm not sure how important that whole use case is though. Maybe we
> should just error out to the guest for now.

It's not that important, now that we are using hugetlb rather than
directly mapping a large hunk of reserved memory.  It would be nice to
handle it though, if we can do without too much hassle.  And I think
manually creating a TLB entry could be faster than kmap_atomic(), or
searching the guest TLB and then doing a reverse memslot lookup.

-Scott


Re: [PATCH 2/2] KVM: PPC: Book3E: Emulate MCSRR0/1 SPR and rfmci instruction

2013-07-10 Thread Scott Wood

On 07/10/2013 05:23:36 AM, Alexander Graf wrote:
> On 10.07.2013, at 00:26, Scott Wood wrote:
>> On 07/09/2013 05:00:26 PM, Alexander Graf wrote:
>>> It'll also be more flexible at the same time. You could take the
>>> logs and actually check what's going on to debug issues that you're
>>> encountering for example.
>>> We could even go as far as sharing the same tool with other
>>> architectures, so that we only have to learn how to debug things once.
>> Have you encountered an actual need for this flexibility, or is it
>> theoretical?
> Yeah, first thing I did back then to actually debug kvm failures was
> to add trace points.

I meant specifically for handling exit timings this way.

>> Is there common infrastructure for dealing with measuring intervals
>> and tracking statistics thereof, rather than just tracking points and
>> letting userspace connect the dots (though it could still do that as
>> an option)?  Even if it must be done in userspace, it doesn't seem
>> like something that should be KVM-specific.
> Would you like to have different ways of measuring mm subsystem
> overhead? I don't :). The same goes for KVM really. If we could
> converge towards a single user space interface to get exit timings,
> it'd make debugging a lot easier.

I agree -- that's why I said it doesn't seem like something that should
be KVM-specific.  But that's orthogonal to whether it's done in kernel
space or user space.  The ability to get begin/end events from
userspace would be nice when it is specifically requested, but it would
also be nice if the kernel could track some basic statistics so we
wouldn't have to ship so much data around to arrive at the same result.

At the very least, I'd like such a tool/infrastructure to exist before
we start complaining about doing minor maintenance of the current
mechanism.

> We already have this for the debugfs counters btw. And the timing
> framework does break kvm_stat today already, as it emits textual
> stats rather than numbers which all of the other debugfs stats do.
> But at least I can take the x86 kvm_stat tool and run it on ppc just
> fine to see exit stats.

We already have what?  The last two sentences seem contradictory -- can
you or can't you use kvm_stat as is?  I'm not familiar with kvm_stat.

What does x86 KVM expose in debugfs?

>>>> Lots of debug options are enabled at build time; why must this
>>>> be different?
>>> Because I think it's valuable as debug tool for cases where
>>> compile time switches are not the best way of debugging things. It's
>>> not a high profile thing to tackle for me tbh, but I don't really
>>> think working heavily on the timing stat thing is the correct path to
>>> walk along.
>> Adding new exit types isn't "working heavily" on it.
> No, but the fact that the first patch is a fix to add exit stats for
> exits that we missed out before doesn't give me a lot of confidence
> that lots of people use timing stats. And I am always very weary of
> #ifdef'ed code, as it blows up the test matrix heavily.

I used it quite a lot when I was doing KVM performance work.  It's just
been a while since I last did that.

-Scott


Re: 3.9 / 3.10: Reliable host crashes

2013-07-10 Thread Gleb Natapov
On Wed, Jul 10, 2013 at 06:23:26PM +0200, Bernd Schubert wrote:
> On 07/10/2013 06:16 PM, Gleb Natapov wrote:
> >On Wed, Jul 10, 2013 at 06:06:27PM +0200, Bernd Schubert wrote:
> >>On 07/10/2013 06:02 PM, Gleb Natapov wrote:
> >>>On Wed, Jul 10, 2013 at 04:16:46PM +0200, Bernd Schubert wrote:
> Hi all,
> 
> I found a way to reliably crash my host system:
> 
> 1) Boot guest VM with init=/bin/bash
> 
> 2) In guest VM: echo b >/proc/sysrq-trigger
> 
> 3) Try to reboot the guest -> crashes the host during kernel 
> initialization
> 
> >>>What 3 means? 2 already reboots it.
> >>
> >>Sorry, not a good wording. The guest reboots, goes through grub,
> >>starts the kernel and the host then crashes during early
> >>initialization of the guest kernel.
> >And if you boot it without init= first and just reboot it with "reboot"
> >does the same happens?
> 
> I currently need the host and the VMs, so I can't try that right now.
> 
> >
> >Can you attach your .config here? Also can you compile KSM out and
> >retry?
> 
> The host config ist attached.
> 
> I probably will only manage to test that on Friday afternoon, I
> guess only the host KSM is important?
> 
Yes.

--
Gleb.


Re: 3.9 / 3.10: Reliable host crashes

2013-07-10 Thread Gleb Natapov
On Wed, Jul 10, 2013 at 06:06:27PM +0200, Bernd Schubert wrote:
> On 07/10/2013 06:02 PM, Gleb Natapov wrote:
> >On Wed, Jul 10, 2013 at 04:16:46PM +0200, Bernd Schubert wrote:
> >>Hi all,
> >>
> >>I found a way to reliably crash my host system:
> >>
> >>1) Boot guest VM with init=/bin/bash
> >>
> >>2) In guest VM: echo b >/proc/sysrq-trigger
> >>
> >>3) Try to reboot the guest -> crashes the host during kernel initialization
> >>
> >What 3 means? 2 already reboots it.
> 
> Sorry, not a good wording. The guest reboots, goes through grub,
> starts the kernel and the host then crashes during early
> initialization of the guest kernel.
And if you boot it without init= first and just reboot it with "reboot"
does the same happens?

Can you attach your .config here? Also can you compile KSM out and
retry?

--
Gleb.


Re: 3.9 / 3.10: Reliable host crashes

2013-07-10 Thread Bernd Schubert

On 07/10/2013 06:02 PM, Gleb Natapov wrote:
> On Wed, Jul 10, 2013 at 04:16:46PM +0200, Bernd Schubert wrote:
>> Hi all,
>>
>> I found a way to reliably crash my host system:
>>
>> 1) Boot guest VM with init=/bin/bash
>>
>> 2) In guest VM: echo b >/proc/sysrq-trigger
>>
>> 3) Try to reboot the guest -> crashes the host during kernel initialization
>>
> What 3 means? 2 already reboots it.

Sorry, not a good wording. The guest reboots, goes through grub, starts
the kernel and the host then crashes during early initialization of the
guest kernel.

>> [...]
>>
> What is your host cpu? cat /proc/cpuinfo.

processor   : 3
vendor_id   : GenuineIntel
cpu family  : 6
model   : 42
model name  : Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz
stepping: 7
microcode   : 0x28
cpu MHz : 3292.517
cache size  : 6144 KB
physical id : 0
siblings: 4
core id : 3
cpu cores   : 4
apicid  : 6
initial apicid  : 6
fpu : yes
fpu_exception   : yes
cpuid level : 13
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm
constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc
aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3
cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave
avx lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority
ept vpid
bogomips: 6585.03
clflush size: 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:






Re: 3.9 / 3.10: Reliable host crashes

2013-07-10 Thread Gleb Natapov
On Wed, Jul 10, 2013 at 04:16:46PM +0200, Bernd Schubert wrote:
> Hi all,
> 
> I found a way to reliably crash my host system:
> 
> 1) Boot guest VM with init=/bin/bash
> 
> 2) In guest VM: echo b >/proc/sysrq-trigger
> 
> 3) Try to reboot the guest -> crashes the host during kernel initialization
> 
> What does 3 mean? 2 already reboots it.

> 
> When I checked the logs I first thought it would be a KSM issue:
> 
> >Jul 10 15:33:37 fsdevel7 kernel: [  507.995602] br0: port 3(tap2) entered 
> >disabled state
> >Jul 10 15:33:37 fsdevel7 kernel: [  508.043454] BUG: unable to handle kernel 
> >NULL pointer dereference at 0020
> >Jul 10 15:33:37 fsdevel7 kernel: [  508.043487] IP: [] 
> >get_ksm_page+0x39/0x130
> 
> and
> 
> 
> >Jul 10 15:34:49 fsdevel7 kernel: [  580.274709] RIP: 
> >0010:[]  [] anon_vma_clone+0x94/0x1a0
> >Jul 10 15:34:49 fsdevel7 kernel: [  580.274710] RSP: 0018:880406495cf0  
> >EFLAGS: 00010286
> 
> >Jul 10 15:34:49 fsdevel7 kernel: [  580.274721] Call Trace:
> >Jul 10 15:34:49 fsdevel7 kernel: [  580.274724]  [] 
> >anon_vma_fork+0x38/0x130
> >Jul 10 15:34:49 fsdevel7 kernel: [  580.274728]  [] 
> >dup_mmap+0x1bf/0x420
> 
> 
> But then I simply disabled KSM in rc.local
> (echo 0 > /sys/kernel/mm/ksm/run) before starting the VMs. Now the
> system still crashes, but without any log messages.
> At least sysrq-b still works, but I don't see any
> "SysRq : Emergency Sync" messages in log files. So log messages are
> not written properly anymore. Unfortunately I don't have a serial
> console on my desktop system.
> 
> I don't have the time now, but the next step is probably to get a
> crash-dump and to see if that has something useful. And I'm also
> going to run the VM from a tty without X, maybe it prints something
> to the console.
> Anything else you might want? Or already any idea?
> Crashes do not happen with 3.8-ubuntu, but very reliably with 3.9.9
> or 3.10.1 (host).
> 
What is your host cpu? cat /proc/cpuinfo.

> Here's my kvm command line:
> 
> >qemu-system-x86_64  \
> >-m 8192 \
> >-machine type=pc,accel=kvm,kernel_irqchip=on\
> >-netdev 
> > type=tap,script=${kvm_ifup},downscript=${kvm_ifdown},ifname=$iface,vhost=on,id=guest0
> >\
> >-device virtio-net-pci,netdev=guest0,mac=52:54:00:12:34:11  \
> >-boot c \
> >-drive file=${FILE},if=${DISKIF},cache=unsafe   \
> >-drive file=${META},if=${DISKIF},cache=writeback\
> >-drive file=${STORAGE},if=${DISKIF},cache=writeback \
> >-enable-kvm \
> >-vga vmware \
> >-cpu host   \
> >-smp 4  \
> >"$@"
> 
> 
> 
> Thanks,
> Bernd
> 

--
Gleb.


Re: PCI assignement with nvidia K1: RmInitAdapter failed

2013-07-10 Thread Guillaume Thouvenin

Guillaume Thouvenin wrote:


 NVIDIA: could not open the device file /dev/nvidia0 (Input/output error).
 Unable to determine the device handle for GPU :00:03.0: Unknown Error


And I looked with strace and I can see that:

stat("/dev/nvidiactl", {st_mode=S_IFCHR|0666, st_rdev=makedev(195, 
255), ...}) = 0

open("/dev/nvidiactl", O_RDWR)  = 3
fcntl(3, F_SETFD, FD_CLOEXEC)   = 0
ioctl(3, 0xc04846d2, 0x7fff95a7f380)= 0
ioctl(3, 0xc00446ca, 0x7f7692482000)= 0
ioctl(3, 0xc70046c8, 0x7f7692482060)= 0
ioctl(3, 0xc020462b, 0x7fff95a7f3d0)= 0
ioctl(3, 0xc020462a, 0x7fff95a7f3b0)= 0
ioctl(3, 0xc020462a, 0x7fff95a7f3b0)= 0
ioctl(3, 0xc020462a, 0x7fff95a7f410)= 0
open("/proc/driver/nvidia/params", O_RDONLY) = 4
fstat(4, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 
0) = 0x7f7692e8b000

read(4, "Mobile: 4294967295\nResmanDebugLe"..., 1024) = 417
close(4)= 0
munmap(0x7f7692e8b000, 4096)= 0
stat("/dev/nvidia0", {st_mode=S_IFCHR|0666, st_rdev=makedev(195, 0), ...}) = 0
open("/dev/nvidia0", O_RDWR)= -1 EIO (Input/output error)


I dig... I dig...



Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-07-10 Thread Gleb Natapov
On Wed, Jul 10, 2013 at 11:03:15AM -0400, Konrad Rzeszutek Wilk wrote:
> On Wed, Jul 10, 2013 at 01:47:17PM +0300, Gleb Natapov wrote:
> > On Wed, Jul 10, 2013 at 12:40:47PM +0200, Peter Zijlstra wrote:
> > > On Wed, Jul 10, 2013 at 01:33:25PM +0300, Gleb Natapov wrote:
> > > 
> > > Here's an idea, trim the damn email ;-) -- not only directed at gleb.
> > > 
> > Good idea.
> > 
> > > > > Ingo, Gleb,
> > > > > 
> > > > > From the results perspective, Andrew Theurer, Vinod's test results are
> > > > > pro-pvspinlock.
> > > > > Could you please help me to know what will make it a mergeable
> > > > > candidate?.
> > > > > 
> > > > I need to spend more time reviewing it :) The problem with PV interfaces
> > > > is that they are easy to add but hard to get rid of if better solution
> > > > (HW or otherwise) appears.
> > > 
> > > How so? Just make sure the registration for the PV interface is optional; 
> > > that
> > > is, allow it to fail. A guest that fails the PV setup will either have to 
> > > try
> > > another PV interface or fall back to 'native'.
> > > 
> > We have to carry PV around for live migration purposes. PV interface
> > cannot disappear under a running guest.
> 
> Why can't it? This is the same as handling say XSAVE operations. Some hosts
> might have it - some might not. It is the job of the toolstack to make sure
> to not migrate to the hosts which don't have it. Or bound the guest to the
> lowest interface (so don't enable the PV interface if the other hosts in the
> cluster can't support this flag)?
XSAVE is a HW feature and it is not going to disappear under you after a
software upgrade. Upgrading the kernel on part of your hosts and no longer
being able to migrate to them is not something people who use live migration
expect. In practice it means that updating all hosts in a datacenter to a
newer kernel is no longer possible without rebooting VMs.

--
Gleb.


Re: PCI assignement with nvidia K1: RmInitAdapter failed

2013-07-10 Thread Alex Williamson
On Wed, 2013-07-10 at 14:54 +0200, Guillaume Thouvenin wrote:
> Alex Williamson wrote:
> 
> > Are you sure that nvidia-smi is relevant to the K1/K2 devices?  Does it
> > work on the host?
> 
> Yes it works on the host and I have some information like power 
> consumption, temperature of the GPU, etc...
> 
> > Are
> > there other tests you can do to check whether the device is otherwise
> > available?
> 
> I also tried to start an X server but I got the same error reported in 
> the syslog:  [  435.745673] NVRM: RmInitAdapter failed! (0x26:0x38:1170)
>   [  435.745695] NVRM: rm_init_adapter(0) failed
> 
> And in the xorg.log I have:
> 
> 
>   [   423.624] (==) NVIDIA(0): Depth 24, (==) framebuffer bpp 32
>   [   423.624] (==) NVIDIA(0): RGB weight 888
>   [   423.624] (==) NVIDIA(0): Default visual is TrueColor
>   [   423.624] (==) NVIDIA(0): Using gamma correction (1.0, 1.0, 1.0)
>   [   423.624] (**) NVIDIA(0): Option "NoLogo" "true"
>   [   423.624] (**) NVIDIA(0): Option "UseDisplayDevice" "none"
>   [   423.624] (**) NVIDIA(0): Enabling 2D acceleration
>   [   423.624] (**) NVIDIA(0): Option "UseDisplayDevice" set to "none"; 
> enabling NoScanout
>   [   423.624] (**) NVIDIA(0): mode
>   [   435.746] (EE) NVIDIA(0): Failed to initialize the NVIDIA GPU at 
> PCI:0:3:0.  Please
>   [   435.746] (EE) NVIDIA(0): check your system's kernel log for 
> additional error
>   [   435.746] (EE) NVIDIA(0): messages and refer to Chapter 8: 
> Common Problems in the
>   [   435.746] (EE) NVIDIA(0): README for additional information.
>   [   435.746] (EE) NVIDIA(0): Failed to initialize the NVIDIA graphics 
> device!
>   [   435.746] (EE) NVIDIA(0): Failing initialization of X screen 0

AFAICT, rm_init_adapter is in the binary blob part of the nvidia driver,
so we can't simply check the code to see what's wrong.  Google finds
this:

http://forums.gentoo.org/viewtopic-t-961110.html?sid=0a60d40ba001f7bbe799bb16b2921cac

That was of course on bare metal, but note that the IOMMU prevented some
accesses and the solution was to disable the IOMMU.  Of course we can't
disable the IOMMU in this case.  Do you see any IOMMU faults in dmesg on
the host around this error?  Does nouveau work?  Thanks,

Alex





Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-07-10 Thread Konrad Rzeszutek Wilk
On Wed, Jul 10, 2013 at 01:47:17PM +0300, Gleb Natapov wrote:
> On Wed, Jul 10, 2013 at 12:40:47PM +0200, Peter Zijlstra wrote:
> > On Wed, Jul 10, 2013 at 01:33:25PM +0300, Gleb Natapov wrote:
> > 
> > Here's an idea, trim the damn email ;-) -- not only directed at gleb.
> > 
> Good idea.
> 
> > > > Ingo, Gleb,
> > > > 
> > > > From the results perspective, Andrew Theurer, Vinod's test results are
> > > > pro-pvspinlock.
> > > > Could you please help me to know what will make it a mergeable
> > > > candidate?.
> > > > 
> > > I need to spend more time reviewing it :) The problem with PV interfaces
> > > is that they are easy to add but hard to get rid of if better solution
> > > (HW or otherwise) appears.
> > 
> > How so? Just make sure the registration for the PV interface is optional; 
> > that
> > is, allow it to fail. A guest that fails the PV setup will either have to 
> > try
> > another PV interface or fall back to 'native'.
> > 
> We have to carry PV around for live migration purposes. PV interface
> cannot disappear under a running guest.

Why can't it? This is the same as handling say XSAVE operations. Some hosts
might have it - some might not. It is the job of the toolstack to make sure
to not migrate to the hosts which don't have it. Or bind the guest to the
lowest interface (so don't enable the PV interface if the other hosts in the
cluster can't support this flag)?

> 
> > > > I agree that Jiannan's Preemptable Lock idea is promising and we could
> > > > evaluate that  approach, and make the best one get into kernel and also
> > > > will carry on discussion with Jiannan to improve that patch.
> > > That would be great. The work is stalled from what I can tell.
> > 
> > I absolutely hated that stuff because it wrecked the native code.
> Yes, the idea was to hide it from native code behind PV hooks.
> 
> --
>   Gleb.


Re: [PATCH 2/8] KVM: PPC: reserve a capability and ioctl numbers for realmode VFIO

2013-07-10 Thread Alexander Graf

On 10.07.2013, at 16:17, Alexey Kardashevskiy wrote:

> On 07/10/2013 08:27 PM, Alexander Graf wrote:
>> 
>> On 10.07.2013, at 01:35, Alexey Kardashevskiy wrote:
>> 
>>> On 07/10/2013 01:35 AM, Alexander Graf wrote:
 On 06/27/2013 07:02 AM, Alexey Kardashevskiy wrote:
> Signed-off-by: Alexey Kardashevskiy
> ---
> include/uapi/linux/kvm.h |2 ++
> 1 file changed, 2 insertions(+)
> 
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 970b1f5..0865c01 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info {
> #define KVM_CAP_PPC_RTAS 91
> #define KVM_CAP_IRQ_XICS 92
> #define KVM_CAP_SPAPR_MULTITCE 93
> +#define KVM_CAP_SPAPR_TCE_IOMMU 94
> 
> #ifdef KVM_CAP_IRQ_ROUTING
> 
> @@ -923,6 +924,7 @@ struct kvm_s390_ucas_mapping {
> /* Available with KVM_CAP_PPC_ALLOC_HTAB */
> #define KVM_PPC_ALLOCATE_HTAB  _IOWR(KVMIO, 0xa7, __u32)
> #define KVM_CREATE_SPAPR_TCE  _IOW(KVMIO,  0xa8, struct
> kvm_create_spapr_tce)
> +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xaf, struct
> kvm_create_spapr_tce_iommu)
 
 Please order them by number.
>>> 
>>> Oh. Again :( We have had this discussion with Scott Wood here already.
>>> Where _exactly_ do you want me to put it?
>> 
>> 8 lines further down. With a comment saying when it's available. Also why is 
>> it af, not ad?
> 
> 
> 0xad and 0xae are taken.
> Where should I have commented this? In the commit message? Or in the patch
> itself?

Yeah, with a comment right in between 0xad and your 0xaf entry explaining the 
gap.


Alex

> 
> 
>>> Many sections, not really ordered. Thank you.
>> 
>> They should all be ordered inside of their own categories.
> 
> 
> -- 
> Alexey



[PULL 16/43] target-s390x: Don't overuse ENV_GET_CPU()

2013-07-10 Thread Andreas Färber
Commit 3474b679486caa8f6448bae974e131370f360c13 (Utilize selective
runtime reg sync for hot code paths) introduced two uses of
ENV_GET_CPU() inside target-s390x/ KVM code. In one case we can use a
direct CPU() cast instead.

Cc: Jason J. Herne 
Signed-off-by: Andreas Färber 
---
 target-s390x/kvm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/target-s390x/kvm.c b/target-s390x/kvm.c
index 42f758f..af499cf 100644
--- a/target-s390x/kvm.c
+++ b/target-s390x/kvm.c
@@ -469,7 +469,7 @@ static int kvm_handle_css_inst(S390CPU *cpu, struct kvm_run 
*run,
 int r = 0;
 int no_cc = 0;
 CPUS390XState *env = &cpu->env;
-CPUState *cs = ENV_GET_CPU(env);
+CPUState *cs = CPU(cpu);
 
 if (ipa0 != 0xb2) {
 /* Not handled for now. */
-- 
1.8.1.4



[PULL 07/43] kvm: Change kvm_remove_all_breakpoints() argument to CPUState

2013-07-10 Thread Andreas Färber
Acked-by: Paolo Bonzini 
Reviewed-by: Richard Henderson 
Signed-off-by: Andreas Färber 
---
 gdbstub.c            | 2 +-
 include/sysemu/kvm.h | 2 +-
 kvm-all.c            | 6 +++---
 kvm-stub.c           | 2 +-
 4 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/gdbstub.c b/gdbstub.c
index 9ae6576..f7d9f13 100644
--- a/gdbstub.c
+++ b/gdbstub.c
@@ -2019,7 +2019,7 @@ static void gdb_breakpoint_remove_all(void)
 CPUArchState *env;
 
 if (kvm_enabled()) {
-kvm_remove_all_breakpoints(gdbserver_state->c_cpu);
+kvm_remove_all_breakpoints(ENV_GET_CPU(gdbserver_state->c_cpu));
 return;
 }
 
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index 7596aca..1e08a85 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -173,7 +173,7 @@ int kvm_insert_breakpoint(CPUArchState *env, target_ulong 
addr,
   target_ulong len, int type);
 int kvm_remove_breakpoint(CPUArchState *env, target_ulong addr,
   target_ulong len, int type);
-void kvm_remove_all_breakpoints(CPUArchState *env);
+void kvm_remove_all_breakpoints(CPUState *cpu);
 int kvm_update_guest_debug(CPUArchState *env, unsigned long reinject_trap);
 #ifndef _WIN32
 int kvm_set_signal_mask(CPUState *cpu, const sigset_t *sigset);
diff --git a/kvm-all.c b/kvm-all.c
index ed13d57..2c14ef3 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -1988,11 +1988,11 @@ int kvm_remove_breakpoint(CPUArchState *env, 
target_ulong addr,
 return 0;
 }
 
-void kvm_remove_all_breakpoints(CPUArchState *env)
+void kvm_remove_all_breakpoints(CPUState *cpu)
 {
-CPUState *cpu = ENV_GET_CPU(env);
 struct kvm_sw_breakpoint *bp, *next;
 KVMState *s = cpu->kvm_state;
+CPUArchState *env;
 
 QTAILQ_FOREACH_SAFE(bp, &s->kvm_sw_breakpoints, entry, next) {
 if (kvm_arch_remove_sw_breakpoint(cpu, bp) != 0) {
@@ -2033,7 +2033,7 @@ int kvm_remove_breakpoint(CPUArchState *env, target_ulong 
addr,
 return -EINVAL;
 }
 
-void kvm_remove_all_breakpoints(CPUArchState *env)
+void kvm_remove_all_breakpoints(CPUState *cpu)
 {
 }
 #endif /* !KVM_CAP_SET_GUEST_DEBUG */
diff --git a/kvm-stub.c b/kvm-stub.c
index 583c636..370c837 100644
--- a/kvm-stub.c
+++ b/kvm-stub.c
@@ -95,7 +95,7 @@ int kvm_remove_breakpoint(CPUArchState *env, target_ulong 
addr,
 return -EINVAL;
 }
 
-void kvm_remove_all_breakpoints(CPUArchState *env)
+void kvm_remove_all_breakpoints(CPUState *cpu)
 {
 }
 
-- 
1.8.1.4



[PULL 05/43] kvm: Free current_cpu identifier

2013-07-10 Thread Andreas Färber
Since CPU loops are done as last step in kvm_{insert,remove}_breakpoint()
and kvm_remove_all_breakpoints(), we do not need to distinguish between
invoking CPU and iterated CPUs and can thereby free the identifier for
use as a global variable.

Acked-by: Paolo Bonzini 
Signed-off-by: Andreas Färber 
---
 include/sysemu/kvm.h | 10 +-
 kvm-all.c            | 39 +--
 kvm-stub.c           |  6 +++---
 3 files changed, 25 insertions(+), 30 deletions(-)

diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index a14cfe9..7596aca 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -169,11 +169,11 @@ void *kvm_arch_ram_alloc(ram_addr_t size);
 void kvm_setup_guest_memory(void *start, size_t size);
 void kvm_flush_coalesced_mmio_buffer(void);
 
-int kvm_insert_breakpoint(CPUArchState *current_env, target_ulong addr,
+int kvm_insert_breakpoint(CPUArchState *env, target_ulong addr,
   target_ulong len, int type);
-int kvm_remove_breakpoint(CPUArchState *current_env, target_ulong addr,
+int kvm_remove_breakpoint(CPUArchState *env, target_ulong addr,
   target_ulong len, int type);
-void kvm_remove_all_breakpoints(CPUArchState *current_env);
+void kvm_remove_all_breakpoints(CPUArchState *env);
 int kvm_update_guest_debug(CPUArchState *env, unsigned long reinject_trap);
 #ifndef _WIN32
 int kvm_set_signal_mask(CPUState *cpu, const sigset_t *sigset);
@@ -252,9 +252,9 @@ struct kvm_sw_breakpoint *kvm_find_sw_breakpoint(CPUState 
*cpu,
 
 int kvm_sw_breakpoints_active(CPUState *cpu);
 
-int kvm_arch_insert_sw_breakpoint(CPUState *current_cpu,
+int kvm_arch_insert_sw_breakpoint(CPUState *cpu,
   struct kvm_sw_breakpoint *bp);
-int kvm_arch_remove_sw_breakpoint(CPUState *current_cpu,
+int kvm_arch_remove_sw_breakpoint(CPUState *cpu,
   struct kvm_sw_breakpoint *bp);
 int kvm_arch_insert_hw_breakpoint(target_ulong addr,
   target_ulong len, int type);
diff --git a/kvm-all.c b/kvm-all.c
index de658de..ed13d57 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -1903,16 +1903,15 @@ int kvm_update_guest_debug(CPUArchState *env, unsigned 
long reinject_trap)
 return data.err;
 }
 
-int kvm_insert_breakpoint(CPUArchState *current_env, target_ulong addr,
+int kvm_insert_breakpoint(CPUArchState *env, target_ulong addr,
   target_ulong len, int type)
 {
-CPUState *current_cpu = ENV_GET_CPU(current_env);
+CPUState *cpu = ENV_GET_CPU(env);
 struct kvm_sw_breakpoint *bp;
-CPUArchState *env;
 int err;
 
 if (type == GDB_BREAKPOINT_SW) {
-bp = kvm_find_sw_breakpoint(current_cpu, addr);
+bp = kvm_find_sw_breakpoint(cpu, addr);
 if (bp) {
 bp->use_count++;
 return 0;
@@ -1925,14 +1924,13 @@ int kvm_insert_breakpoint(CPUArchState *current_env, 
target_ulong addr,
 
 bp->pc = addr;
 bp->use_count = 1;
-err = kvm_arch_insert_sw_breakpoint(current_cpu, bp);
+err = kvm_arch_insert_sw_breakpoint(cpu, bp);
 if (err) {
 g_free(bp);
 return err;
 }
 
-QTAILQ_INSERT_HEAD(&current_cpu->kvm_state->kvm_sw_breakpoints,
-  bp, entry);
+QTAILQ_INSERT_HEAD(&cpu->kvm_state->kvm_sw_breakpoints, bp, entry);
 } else {
 err = kvm_arch_insert_hw_breakpoint(addr, len, type);
 if (err) {
@@ -1949,16 +1947,15 @@ int kvm_insert_breakpoint(CPUArchState *current_env, 
target_ulong addr,
 return 0;
 }
 
-int kvm_remove_breakpoint(CPUArchState *current_env, target_ulong addr,
+int kvm_remove_breakpoint(CPUArchState *env, target_ulong addr,
   target_ulong len, int type)
 {
-CPUState *current_cpu = ENV_GET_CPU(current_env);
+CPUState *cpu = ENV_GET_CPU(env);
 struct kvm_sw_breakpoint *bp;
-CPUArchState *env;
 int err;
 
 if (type == GDB_BREAKPOINT_SW) {
-bp = kvm_find_sw_breakpoint(current_cpu, addr);
+bp = kvm_find_sw_breakpoint(cpu, addr);
 if (!bp) {
 return -ENOENT;
 }
@@ -1968,12 +1965,12 @@ int kvm_remove_breakpoint(CPUArchState *current_env, 
target_ulong addr,
 return 0;
 }
 
-err = kvm_arch_remove_sw_breakpoint(current_cpu, bp);
+err = kvm_arch_remove_sw_breakpoint(cpu, bp);
 if (err) {
 return err;
 }
 
-QTAILQ_REMOVE(&current_cpu->kvm_state->kvm_sw_breakpoints, bp, entry);
+QTAILQ_REMOVE(&cpu->kvm_state->kvm_sw_breakpoints, bp, entry);
 g_free(bp);
 } else {
 err = kvm_arch_remove_hw_breakpoint(addr, len, type);
@@ -1991,16 +1988,14 @@ int kvm_remove_breakpoint(CPUArchState *current_env, 
target_ulong addr,
 return 0;
 }
 
-void kvm_remove_all_breakpoints(CPUArchState *current_env)
+void kvm_remove_all_breakpoints(CPUArchState *env)
 {
-CPUState *current_c

[PULL 17/43] target-s390x: Change handle_{hypercall,diag}() argument to S390CPU

2013-07-10 Thread Andreas Färber
This allows to get rid of the last remaining ENV_GET_CPU() in
target-s390x/ by using CPU() cast directly on the argument.

Cc: Jason J. Herne 
Signed-off-by: Andreas Färber 
---
 target-s390x/kvm.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/target-s390x/kvm.c b/target-s390x/kvm.c
index af499cf..60e94f8 100644
--- a/target-s390x/kvm.c
+++ b/target-s390x/kvm.c
@@ -607,9 +607,10 @@ static int handle_priv(S390CPU *cpu, struct kvm_run *run,
 return r;
 }
 
-static int handle_hypercall(CPUS390XState *env, struct kvm_run *run)
+static int handle_hypercall(S390CPU *cpu, struct kvm_run *run)
 {
-CPUState *cs = ENV_GET_CPU(env);
+CPUState *cs = CPU(cpu);
+CPUS390XState *env = &cpu->env;
 
 kvm_s390_get_registers_partial(cs);
 cs->kvm_vcpu_dirty = true;
@@ -618,13 +619,13 @@ static int handle_hypercall(CPUS390XState *env, struct 
kvm_run *run)
 return 0;
 }
 
-static int handle_diag(CPUS390XState *env, struct kvm_run *run, int ipb_code)
+static int handle_diag(S390CPU *cpu, struct kvm_run *run, int ipb_code)
 {
 int r = 0;
 
 switch (ipb_code) {
 case DIAG_KVM_HYPERCALL:
-r = handle_hypercall(env, run);
+r = handle_hypercall(cpu, run);
 break;
 case DIAG_KVM_BREAKPOINT:
 sleep(10);
@@ -735,7 +736,6 @@ out:
 
 static int handle_instruction(S390CPU *cpu, struct kvm_run *run)
 {
-CPUS390XState *env = &cpu->env;
 unsigned int ipa0 = (run->s390_sieic.ipa & 0xff00);
 uint8_t ipa1 = run->s390_sieic.ipa & 0x00ff;
 int ipb_code = (run->s390_sieic.ipb & 0x0fff0000) >> 16;
@@ -749,7 +749,7 @@ static int handle_instruction(S390CPU *cpu, struct kvm_run 
*run)
 r = handle_priv(cpu, run, ipa0 >> 8, ipa1);
 break;
 case IPA0_DIAG:
-r = handle_diag(env, run, ipb_code);
+r = handle_diag(cpu, run, ipb_code);
 break;
 case IPA0_SIGP:
 r = handle_sigp(cpu, run, ipa1);
-- 
1.8.1.4



3.9 / 3.10: Reliable host crashes

2013-07-10 Thread Bernd Schubert

Hi all,

I found a way to reliably crash my host system:

1) Boot guest VM with init=/bin/bash

2) In guest VM: echo b >/proc/sysrq-trigger

3) Try to reboot the guest -> crashes the host during kernel initialization


When I checked the logs I first thought it would be a KSM issue:


Jul 10 15:33:37 fsdevel7 kernel: [  507.995602] br0: port 3(tap2) entered 
disabled state
Jul 10 15:33:37 fsdevel7 kernel: [  508.043454] BUG: unable to handle kernel 
NULL pointer dereference at 0020
Jul 10 15:33:37 fsdevel7 kernel: [  508.043487] IP: [] 
get_ksm_page+0x39/0x130


and



Jul 10 15:34:49 fsdevel7 kernel: [  580.274709] RIP: 0010:[]  
[] anon_vma_clone+0x94/0x1a0
Jul 10 15:34:49 fsdevel7 kernel: [  580.274710] RSP: 0018:880406495cf0  
EFLAGS: 00010286



Jul 10 15:34:49 fsdevel7 kernel: [  580.274721] Call Trace:
Jul 10 15:34:49 fsdevel7 kernel: [  580.274724]  [] 
anon_vma_fork+0x38/0x130
Jul 10 15:34:49 fsdevel7 kernel: [  580.274728]  [] 
dup_mmap+0x1bf/0x420



But then I simply disabled KSM in rc.local
(echo 0 > /sys/kernel/mm/ksm/run) before starting the VMs. Now the 
system still crashes, but without any log messages.

At least sysrq-b still works, but I don't see any
"SysRq : Emergency Sync" messages in log files. So log messages are not 
written properly anymore. Unfortunately I don't have a serial console on 
my desktop system.


I don't have the time now, but the next step is probably to get a 
crash-dump and to see if that has something useful. And I'm also going 
to run the VM from a tty without X, maybe it prints something to the 
console.

Anything else you might want? Or already any idea?
Crashes do not happen with 3.8-ubuntu, but very reliably with 3.9.9 or 
3.10.1 (host).


Here's my kvm command line:


qemu-system-x86_64  \
-m 8192 \
-machine type=pc,accel=kvm,kernel_irqchip=on\
-netdev 
type=tap,script=${kvm_ifup},downscript=${kvm_ifdown},ifname=$iface,vhost=on,id=guest0
   \
-device virtio-net-pci,netdev=guest0,mac=52:54:00:12:34:11  \
-boot c \
-drive file=${FILE},if=${DISKIF},cache=unsafe   \
-drive file=${META},if=${DISKIF},cache=writeback\
-drive file=${STORAGE},if=${DISKIF},cache=writeback \
-enable-kvm \
-vga vmware \
-cpu host   \
-smp 4  \
"$@"




Thanks,
Bernd



Re: [PATCH 2/8] KVM: PPC: reserve a capability and ioctl numbers for realmode VFIO

2013-07-10 Thread Alexey Kardashevskiy
On 07/10/2013 08:27 PM, Alexander Graf wrote:
> 
> On 10.07.2013, at 01:35, Alexey Kardashevskiy wrote:
> 
>> On 07/10/2013 01:35 AM, Alexander Graf wrote:
>>> On 06/27/2013 07:02 AM, Alexey Kardashevskiy wrote:
 Signed-off-by: Alexey Kardashevskiy
 ---
  include/uapi/linux/kvm.h |2 ++
  1 file changed, 2 insertions(+)

 diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
 index 970b1f5..0865c01 100644
 --- a/include/uapi/linux/kvm.h
 +++ b/include/uapi/linux/kvm.h
 @@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info {
  #define KVM_CAP_PPC_RTAS 91
  #define KVM_CAP_IRQ_XICS 92
  #define KVM_CAP_SPAPR_MULTITCE 93
 +#define KVM_CAP_SPAPR_TCE_IOMMU 94

  #ifdef KVM_CAP_IRQ_ROUTING

 @@ -923,6 +924,7 @@ struct kvm_s390_ucas_mapping {
  /* Available with KVM_CAP_PPC_ALLOC_HTAB */
  #define KVM_PPC_ALLOCATE_HTAB  _IOWR(KVMIO, 0xa7, __u32)
  #define KVM_CREATE_SPAPR_TCE  _IOW(KVMIO,  0xa8, struct
 kvm_create_spapr_tce)
 +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xaf, struct
 kvm_create_spapr_tce_iommu)
>>>
>>> Please order them by number.
>>
>> Oh. Again :( We have had this discussion with Scott Wood here already.
>> Where _exactly_ do you want me to put it?
> 
> 8 lines further down. With a comment saying when it's available. Also why is 
> it af, not ad?


0xad and 0xae are taken.
Where should I have commented this? In the commit message? Or in the patch
itself?


>> Many sections, not really ordered. Thank you.
> 
> They should all be ordered inside of their own categories.


-- 
Alexey


KVM call agenda for 2013-07-16

2013-07-10 Thread Juan Quintela

Hi

Please, send any topic that you are interested in covering.

Thanks, Juan.

PD.  If you want to attend and you don't have the call details,
  contact me.


Re: [PATCH] kvm: reset arch memslot info on memslot creation

2013-07-10 Thread Takuya Yoshikawa
On Wed, 10 Jul 2013 11:24:39 +0300
"Michael S. Tsirkin"  wrote:

> On x86, kvm_arch_create_memslot assumes that rmap/lpage_info for the
> slot are zeroed out: if they weren't, error handling code after out_free
> label will free memory which wasn't allocated here.
> This always happens to be the case because on KVM_MR_DELETE we clear the
> whole arch structure.  So there's no bug, but it's cleaner not to rely
> on this here.

Yes, the assumption is that the function is called only with zero-sized slots.
Since changing the size is not allowed, DELETE-CREATE is the only case we
care about.

But isn't it possible to make it explicit that zero-sized slots always have
zero-cleared contents instead?  Otherwise, there would be many problems.

Takuya

> 
> Make the code more robust by clearing the rmap/lpage_info explicitly.
> 
> Signed-off-by: Michael S. Tsirkin 
> ---
>  arch/x86/kvm/x86.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index e8ba99c..96e6eb4 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6922,6 +6922,10 @@ int kvm_arch_create_memslot(struct kvm_memory_slot 
> *slot, unsigned long npages)
>  {
>   int i;
>  
> + /* Reset in case slot had some rmap/lpage_info. */
> + memset(&slot->arch.rmap, 0, sizeof slot->arch.rmap);
> + memset(&slot->arch.lpage_info, 0, sizeof slot->arch.lpage_info);
> +
>   for (i = 0; i < KVM_NR_PAGE_SIZES; ++i) {
>   unsigned long ugfn;
>   int lpages;
> -- 
> MST


-- 
Takuya Yoshikawa 


Re: [libvirt] Bugs filed in the week for Upstream Qemu and Libvirt

2013-07-10 Thread Daniel P. Berrange
On Wed, Jul 10, 2013 at 06:45:08PM +0530, chandrashekar shastri wrote:
> Hi,
> 
> Below are bugs filed in this week for Upstream qemu and libvirt:
> 
> Qemu in Launchpad:
> 
> https://bugs.launchpad.net/opensuse/+bug/1199416
> Hot-add qcow2 [virtio-scsi] devices doesn't work in SlLES-11-SP2guest
> 
> Libvirt Bugs:
> 
> Bug 982224 - Attaching of the Virtio-scsi [qcow2] drives fails with
> "error: internal error No more available PCI addresses"
> Bug 982455 - RHEL Guest fails to boot after attaching 200+ scsi
> devices [virtio-scsi qcow2]
> Bug 980954 - Virtio-scsi drives in Windows7 shows yellow bang in
> device manager though virtio scsi pass through driver is installed
> Bug 982630 - Documentation : virsh attach-disk --help shouldbe
> updated with proper examples for --type and --driver

We really don't need lists of bugs emailed to the libvirt-list mailing
list. People already monitor bugzilla & have email alerts from bugzilla
as they desire.

Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|


Bugs filed in the week for Upstream Qemu and Libvirt

2013-07-10 Thread chandrashekar shastri

Hi,

Below are bugs filed in this week for Upstream qemu and libvirt:

Qemu in Launchpad:

https://bugs.launchpad.net/opensuse/+bug/1199416
Hot-add qcow2 [virtio-scsi] devices doesn't work in SlLES-11-SP2guest

Libvirt Bugs:

Bug 982224 - Attaching of the Virtio-scsi [qcow2] drives fails with 
"error: internal error No more available PCI addresses"
Bug 982455 - RHEL Guest fails to boot after attaching 200+ scsi devices 
[virtio-scsi qcow2]
Bug 980954 - Virtio-scsi drives in Windows7 shows yellow bang in device 
manager though virtio scsi pass through driver is installed
Bug 982630 - Documentation : virsh attach-disk --help should be updated 
with proper examples for --type and --driver


Thanks,
Shastri



[PATCH 4/4] PF: Async page fault support on s390

2013-07-10 Thread Dominik Dingel
This patch enables async page faults for s390 kvm guests.
It provides a userspace API to enable, disable, or query the status of the
feature, and it includes the diagnose code the guest calls to enable
async page faults.

The async page faults will use an already existing guest interface for this
purpose, as described in "CP Programming Services (SC24-6084)".

Signed-off-by: Dominik Dingel 
---
 Documentation/s390/kvm.txt   |  24 +
 arch/s390/include/asm/kvm_host.h |  22 
 arch/s390/include/uapi/asm/kvm.h |  10 
 arch/s390/kvm/Kconfig|   2 +
 arch/s390/kvm/Makefile   |   2 +-
 arch/s390/kvm/diag.c |  63 +++
 arch/s390/kvm/interrupt.c|  43 +---
 arch/s390/kvm/kvm-s390.c | 107 ++-
 arch/s390/kvm/kvm-s390.h |   4 ++
 arch/s390/kvm/sigp.c |   6 +++
 include/uapi/linux/kvm.h |   2 +
 11 files changed, 276 insertions(+), 9 deletions(-)

diff --git a/Documentation/s390/kvm.txt b/Documentation/s390/kvm.txt
index 85f3280..707b7e9 100644
--- a/Documentation/s390/kvm.txt
+++ b/Documentation/s390/kvm.txt
@@ -70,6 +70,30 @@ floating interrupts are:
 KVM_S390_INT_VIRTIO
 KVM_S390_INT_SERVICE
 
+ioctl:  KVM_S390_APF_ENABLE:
+args:   none
+This ioctl enables the async page fault interface, so that on a
+host page fault the host can submit pfault tokens to the guest.
+
+ioctl:  KVM_S390_APF_DISABLE:
+args:   none
+This ioctl disables the async page fault interface. From this point
+on no new pfault tokens will be issued to the guest. Already existing async
+page faults are not covered by this and will be handled normally.
+
+ioctl:  KVM_S390_APF_STATUS:
+args:   none
+This ioctl allows userspace to get the current status of the APF feature.
+Its main purpose is to ensure that no pfault tokens are lost
+during live migration or similar management operations.
+The possible return values are:
+KVM_S390_APF_DISABLED_NON_PENDING
+KVM_S390_APF_DISABLED_PENDING
+KVM_S390_APF_ENABLED_NON_PENDING
+KVM_S390_APF_ENABLED_PENDING
+Caution: if KVM_S390_APF is enabled, the PENDING status may already have
+changed by the time the ioctl returns to userspace.
+
 3. ioctl calls to the kvm-vcpu file descriptor
 KVM does support the following ioctls on s390 that are common with other
 architectures and do behave the same:
diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index cd30c3d..e8012fc 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -257,6 +257,10 @@ struct kvm_vcpu_arch {
u64 stidp_data;
};
struct gmap *gmap;
+#define KVM_S390_PFAULT_TOKEN_INVALID  (-1UL)
+   unsigned long pfault_token;
+   unsigned long pfault_select;
+   unsigned long pfault_compare;
 };
 
 struct kvm_vm_stat {
@@ -282,6 +286,24 @@ static inline bool kvm_is_error_hva(unsigned long addr)
return addr == KVM_HVA_ERR_BAD;
 }
 
+#define ASYNC_PF_PER_VCPU  64
+struct kvm_vcpu;
+struct kvm_async_pf;
+struct kvm_arch_async_pf {
+   unsigned long pfault_token;
+};
+
+bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu);
+
+void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu,
+  struct kvm_async_pf *work);
+
+void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
+struct kvm_async_pf *work);
+
+void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
+struct kvm_async_pf *work);
+
 extern int sie64a(struct kvm_s390_sie_block *, u64 *);
 extern char sie_exit;
 #endif
diff --git a/arch/s390/include/uapi/asm/kvm.h b/arch/s390/include/uapi/asm/kvm.h
index d25da59..b6c83e0 100644
--- a/arch/s390/include/uapi/asm/kvm.h
+++ b/arch/s390/include/uapi/asm/kvm.h
@@ -57,4 +57,14 @@ struct kvm_sync_regs {
 #define KVM_REG_S390_EPOCHDIFF (KVM_REG_S390 | KVM_REG_SIZE_U64 | 0x2)
 #define KVM_REG_S390_CPU_TIMER  (KVM_REG_S390 | KVM_REG_SIZE_U64 | 0x3)
 #define KVM_REG_S390_CLOCK_COMP (KVM_REG_S390 | KVM_REG_SIZE_U64 | 0x4)
+
+/* ioctls used for setting/getting status of APF on s390x */
+#define KVM_S390_APF_ENABLE1
+#define KVM_S390_APF_DISABLE   2
+#define KVM_S390_APF_STATUS3
+#define KVM_S390_APF_DISABLED_NON_PENDING  0
+#define KVM_S390_APF_DISABLED_PENDING  1
+#define KVM_S390_APF_ENABLED_NON_PENDING   2
+#define KVM_S390_APF_ENABLED_PENDING   3
+
 #endif
diff --git a/arch/s390/kvm/Kconfig b/arch/s390/kvm/Kconfig
index 70b46ea..4993eed 100644
--- a/arch/s390/kvm/Kconfig
+++ b/arch/s390/kvm/Kconfig
@@ -23,6 +23,8 @@ config KVM
select ANON_INODES
select HAVE_KVM_CPU_RELAX_INTERCEPT
select HAVE_KVM_EVENTFD
+   select KVM_ASYNC_PF
+   select KVM_ASYNC_PF_DIRECT
---help---
  Support hosting paravirtualized guest machines using 

[PATCH 3/4] PF: Provide additional direct page notification

2013-07-10 Thread Dominik Dingel
A Kconfig option now lets each architecture control how guest notifications
are presented by the apf backend. The default batch mechanism works as
before: the vcpu thread pulls in the information. The new direct mechanism
instead pushes the information straight to the guest, which lets s390 reuse
an already existing architecture interface.

The vcpu thread should still call check_completion to clean up leftovers,
which leaves most of the common code untouched.

Signed-off-by: Dominik Dingel 
Acked-by: Christian Borntraeger 
---
 arch/x86/kvm/mmu.c   |  2 +-
 include/linux/kvm_host.h |  2 +-
 virt/kvm/Kconfig |  4 
 virt/kvm/async_pf.c  | 22 +++---
 4 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 0d094da..b8632e9 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3343,7 +3343,7 @@ static int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, 
gva_t gva, gfn_t gfn)
arch.direct_map = vcpu->arch.mmu.direct_map;
arch.cr3 = vcpu->arch.mmu.get_cr3(vcpu);
 
-   return kvm_setup_async_pf(vcpu, gva, gfn, &arch);
+   return kvm_setup_async_pf(vcpu, gva, gfn_to_hva(vcpu->kvm, gfn), &arch);
 }
 
 static bool can_do_async_pf(struct kvm_vcpu *vcpu)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 92e8f64..fe87e46 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -191,7 +191,7 @@ struct kvm_async_pf {
 
 void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu);
 void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu);
-int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn,
+int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, unsigned long hva,
   struct kvm_arch_async_pf *arch);
 int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
 #endif
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 779262f..0774495 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -22,6 +22,10 @@ config KVM_MMIO
 config KVM_ASYNC_PF
bool
 
+# Toggle to switch between direct notification and batch job
+config KVM_ASYNC_PF_SYNC
+   bool
+
 config HAVE_KVM_MSI
bool
 
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index ea475cd..cfa9366 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -28,6 +28,21 @@
 #include "async_pf.h"
 #include 
 
+static inline void kvm_async_page_present_sync(struct kvm_vcpu *vcpu,
+  struct kvm_async_pf *work)
+{
+#ifdef CONFIG_KVM_ASYNC_PF_SYNC
+   kvm_arch_async_page_present(vcpu, work);
+#endif
+}
+static inline void kvm_async_page_present_async(struct kvm_vcpu *vcpu,
+   struct kvm_async_pf *work)
+{
+#ifndef CONFIG_KVM_ASYNC_PF_SYNC
+   kvm_arch_async_page_present(vcpu, work);
+#endif
+}
+
 static struct kmem_cache *async_pf_cache;
 
 int kvm_async_pf_init(void)
@@ -70,6 +85,7 @@ static void async_pf_execute(struct work_struct *work)
down_read(&mm->mmap_sem);
get_user_pages(current, mm, addr, 1, 1, 0, &page, NULL);
up_read(&mm->mmap_sem);
+   kvm_async_page_present_sync(vcpu, apf);
unuse_mm(mm);
 
spin_lock(&vcpu->async_pf.lock);
@@ -134,7 +150,7 @@ void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
 
if (work->page)
kvm_arch_async_page_ready(vcpu, work);
-   kvm_arch_async_page_present(vcpu, work);
+   kvm_async_page_present_async(vcpu, work);
 
list_del(&work->queue);
vcpu->async_pf.queued--;
@@ -144,7 +160,7 @@ void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
}
 }
 
-int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn,
+int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, unsigned long hva,
   struct kvm_arch_async_pf *arch)
 {
struct kvm_async_pf *work;
@@ -166,7 +182,7 @@ int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, 
gfn_t gfn,
work->done = false;
work->vcpu = vcpu;
work->gva = gva;
-   work->addr = gfn_to_hva(vcpu->kvm, gfn);
+   work->addr = hva;
work->arch = *arch;
work->mm = current->mm;
atomic_inc(&work->mm->mm_count);
-- 
1.8.2.2



[PATCH v4 0/4] Enable async page faults on s390

2013-07-10 Thread Dominik Dingel
Gleb, Paolo, 

based on the work from Martin and Carsten, this implementation enables async 
page faults.
To the guest it will provide the pfault interface, but internally it uses the
async page fault common code.

The initial submission and its discussion can be followed on
http://www.mail-archive.com/kvm@vger.kernel.org/msg63359.html .

There is a slight modification to common code to move from a pull-based to a
push-based approach on s390, since on s390 we don't want to wait until we
leave guest state to queue the notification interrupts.

To use this feature the controlling userspace has to enable the capability.
With that knob we can later disable this feature for live migration.

v3 -> v4
 - Change "done" interrupts from local to floating
 - Add a comment for clarification
 - Change KVM_HVA_ERR_BAD handling to move the s390 implementation to the s390 backend

v2 -> v3
 - Reworked the architecture specific parts, to only provide one additional
   implementation
 - Renamed function to kvm_async_page_present_(sync|async)
 - Fixing KVM_HVA_ERR_BAD handling

v1 -> v2:
 - Adding other architecture backends
 - Adding documentation for the ioctl
 - Improving the overall error handling
 - Reducing the needed modifications on the common code

Dominik Dingel (4):
  PF: Add FAULT_FLAG_RETRY_NOWAIT for guest fault
  PF: Make KVM_HVA_ERR_BAD usable on s390
  PF: Provide additional direct page notification
  PF: Async page fault support on s390

 Documentation/s390/kvm.txt|  24 
 arch/s390/include/asm/kvm_host.h  |  30 ++
 arch/s390/include/asm/pgtable.h   |   2 +
 arch/s390/include/asm/processor.h |   1 +
 arch/s390/include/uapi/asm/kvm.h  |  10 
 arch/s390/kvm/Kconfig |   2 +
 arch/s390/kvm/Makefile|   2 +-
 arch/s390/kvm/diag.c  |  63 
 arch/s390/kvm/interrupt.c |  43 +++---
 arch/s390/kvm/kvm-s390.c  | 118 ++
 arch/s390/kvm/kvm-s390.h  |   4 ++
 arch/s390/kvm/sigp.c  |   6 ++
 arch/s390/mm/fault.c  |  26 +++--
 arch/x86/kvm/mmu.c|   2 +-
 include/linux/kvm_host.h  |  10 +++-
 include/uapi/linux/kvm.h  |   2 +
 virt/kvm/Kconfig  |   4 ++
 virt/kvm/async_pf.c   |  22 ++-
 18 files changed, 354 insertions(+), 17 deletions(-)

-- 
1.8.2.2



[PATCH 1/4] PF: Add FAULT_FLAG_RETRY_NOWAIT for guest fault

2013-07-10 Thread Dominik Dingel
In case of a fault retry, exit sie64() with the gmap_fault indication set
for the running thread. This makes it possible to handle async page faults
without the need for mm notifiers.

Based on a patch from Martin Schwidefsky.

Signed-off-by: Dominik Dingel 
Acked-by: Christian Borntraeger 
---
 arch/s390/include/asm/pgtable.h   |  2 ++
 arch/s390/include/asm/processor.h |  1 +
 arch/s390/kvm/kvm-s390.c  | 13 +
 arch/s390/mm/fault.c  | 26 ++
 4 files changed, 38 insertions(+), 4 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 0ea4e59..4a4cc64 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -740,6 +740,7 @@ static inline void pgste_set_pte(pte_t *ptep, pte_t entry)
  * @table: pointer to the page directory
  * @asce: address space control element for gmap page table
  * @crst_list: list of all crst tables used in the guest address space
+ * @pfault_enabled: defines if pfaults are applicable for the guest
  */
 struct gmap {
struct list_head list;
@@ -748,6 +749,7 @@ struct gmap {
unsigned long asce;
void *private;
struct list_head crst_list;
+   unsigned long pfault_enabled;
 };
 
 /**
diff --git a/arch/s390/include/asm/processor.h 
b/arch/s390/include/asm/processor.h
index 6b49987..4fa96ca 100644
--- a/arch/s390/include/asm/processor.h
+++ b/arch/s390/include/asm/processor.h
@@ -77,6 +77,7 @@ struct thread_struct {
 unsigned long ksp;  /* kernel stack pointer */
mm_segment_t mm_segment;
unsigned long gmap_addr;/* address of last gmap fault. */
+   unsigned int gmap_pfault;   /* signal of a pending guest pfault */
struct per_regs per_user;   /* User specified PER registers */
struct per_event per_event; /* Cause of the last PER trap */
unsigned long per_flags;/* Flags to control debug behavior */
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index ba694d2..702daca 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -682,6 +682,15 @@ static int kvm_s390_handle_requests(struct kvm_vcpu *vcpu)
return 0;
 }
 
+static void kvm_arch_fault_in_sync(struct kvm_vcpu *vcpu)
+{
+   hva_t fault = gmap_fault(current->thread.gmap_addr, vcpu->arch.gmap);
+   struct mm_struct *mm = current->mm;
+   down_read(&mm->mmap_sem);
+   get_user_pages(current, mm, fault, 1, 1, 0, NULL, NULL);
+   up_read(&mm->mmap_sem);
+}
+
 static int __vcpu_run(struct kvm_vcpu *vcpu)
 {
int rc;
@@ -715,6 +724,10 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
if (rc < 0) {
if (kvm_is_ucontrol(vcpu->kvm)) {
rc = SIE_INTERCEPT_UCONTROL;
+   } else if (current->thread.gmap_pfault) {
+   kvm_arch_fault_in_sync(vcpu);
+   current->thread.gmap_pfault = 0;
+   rc = 0;
} else {
VCPU_EVENT(vcpu, 3, "%s", "fault in sie instruction");
trace_kvm_s390_sie_fault(vcpu);
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index 047c3e4..7d4c4b1 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -50,6 +50,7 @@
 #define VM_FAULT_BADMAP0x02
 #define VM_FAULT_BADACCESS 0x04
 #define VM_FAULT_SIGNAL0x08
+#define VM_FAULT_PFAULT0x10
 
 static unsigned long store_indication __read_mostly;
 
@@ -232,6 +233,7 @@ static noinline void do_fault_error(struct pt_regs *regs, 
int fault)
return;
}
case VM_FAULT_BADCONTEXT:
+   case VM_FAULT_PFAULT:
do_no_context(regs);
break;
case VM_FAULT_SIGNAL:
@@ -269,6 +271,9 @@ static noinline void do_fault_error(struct pt_regs *regs, 
int fault)
  */
 static inline int do_exception(struct pt_regs *regs, int access)
 {
+#ifdef CONFIG_PGSTE
+   struct gmap *gmap;
+#endif
struct task_struct *tsk;
struct mm_struct *mm;
struct vm_area_struct *vma;
@@ -307,9 +312,10 @@ static inline int do_exception(struct pt_regs *regs, int 
access)
down_read(&mm->mmap_sem);
 
 #ifdef CONFIG_PGSTE
-   if ((current->flags & PF_VCPU) && S390_lowcore.gmap) {
-   address = __gmap_fault(address,
-(struct gmap *) S390_lowcore.gmap);
+   gmap = (struct gmap *)
+   ((current->flags & PF_VCPU) ? S390_lowcore.gmap : 0);
+   if (gmap) {
+   address = __gmap_fault(address, gmap);
if (address == -EFAULT) {
fault = VM_FAULT_BADMAP;
goto out_up;
@@ -318,6 +324,8 @@ static inline int do_exception(struct pt_regs *regs, int 
access)
fault = VM_FAULT_OOM;
   

[PATCH 2/4] PF: Make KVM_HVA_ERR_BAD usable on s390

2013-07-10 Thread Dominik Dingel
Current common code uses PAGE_OFFSET to indicate a bad host virtual address.
As this check won't work on architectures that don't map kernel and user memory
into the same address space (e.g. s390), such architectures can now provide
their own KVM_HVA_ERR_BAD defines.

Signed-off-by: Dominik Dingel 
---
 arch/s390/include/asm/kvm_host.h | 8 
 include/linux/kvm_host.h | 8 
 2 files changed, 16 insertions(+)

diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index 3238d40..cd30c3d 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -274,6 +274,14 @@ struct kvm_arch{
int css_support;
 };
 
+#define KVM_HVA_ERR_BAD(-1UL)
+#define KVM_HVA_ERR_RO_BAD (-1UL)
+
+static inline bool kvm_is_error_hva(unsigned long addr)
+{
+   return addr == KVM_HVA_ERR_BAD;
+}
+
 extern int sie64a(struct kvm_s390_sie_block *, u64 *);
 extern char sie_exit;
 #endif
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a63d83e..92e8f64 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -85,6 +85,12 @@ static inline bool is_noslot_pfn(pfn_t pfn)
return pfn == KVM_PFN_NOSLOT;
 }
 
+/*
+ * architectures with KVM_HVA_ERR_BAD other than PAGE_OFFSET (e.g. s390)
+ * provide own defines and kvm_is_error_hva
+ */
+#ifndef KVM_HVA_ERR_BAD
+
 #define KVM_HVA_ERR_BAD(PAGE_OFFSET)
 #define KVM_HVA_ERR_RO_BAD (PAGE_OFFSET + PAGE_SIZE)
 
@@ -93,6 +99,8 @@ static inline bool kvm_is_error_hva(unsigned long addr)
return addr >= PAGE_OFFSET;
 }
 
+#endif
+
 #define KVM_ERR_PTR_BAD_PAGE   (ERR_PTR(-ENOENT))
 
 static inline bool is_error_page(struct page *page)
-- 
1.8.2.2



Re: PCI assignement with nvidia K1: RmInitAdapter failed

2013-07-10 Thread Guillaume Thouvenin

Alex Williamson  a écrit :


Are you sure that nvidia-smi is relevant to the K1/K2 devices?  Does it
work on the host?


Yes it works on the host and I have some information like power 
consumption, temperature of the GPU, etc...



Are
there other tests you can do to check whether the device is otherwise
available?


I also tried to start an X server but I got the same error reported in 
the syslog:  [  435.745673] NVRM: RmInitAdapter failed! (0x26:0x38:1170)

 [  435.745695] NVRM: rm_init_adapter(0) failed

And in the xorg.log I have:


 [   423.624] (==) NVIDIA(0): Depth 24, (==) framebuffer bpp 32
 [   423.624] (==) NVIDIA(0): RGB weight 888
 [   423.624] (==) NVIDIA(0): Default visual is TrueColor
 [   423.624] (==) NVIDIA(0): Using gamma correction (1.0, 1.0, 1.0)
 [   423.624] (**) NVIDIA(0): Option "NoLogo" "true"
 [   423.624] (**) NVIDIA(0): Option "UseDisplayDevice" "none"
 [   423.624] (**) NVIDIA(0): Enabling 2D acceleration
 [   423.624] (**) NVIDIA(0): Option "UseDisplayDevice" set to "none"; 
enabling NoScanout

 [   423.624] (**) NVIDIA(0): mode
 [   435.746] (EE) NVIDIA(0): Failed to initialize the NVIDIA GPU at 
PCI:0:3:0.  Please
 [   435.746] (EE) NVIDIA(0): check your system's kernel log for 
additional error
 [   435.746] (EE) NVIDIA(0): messages and refer to Chapter 8: 
Common Problems in the

 [   435.746] (EE) NVIDIA(0): README for additional information.
 [   435.746] (EE) NVIDIA(0): Failed to initialize the NVIDIA graphics device!
 [   435.746] (EE) NVIDIA(0): Failing initialization of X screen 0


Regards,
Guillaume






Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-07-10 Thread Raghavendra K T

On 07/10/2013 05:11 PM, Gleb Natapov wrote:

On Wed, Jul 10, 2013 at 04:54:12PM +0530, Raghavendra K T wrote:

Ingo, Gleb,

 From the results perspective, Andrew Theurer, Vinod's test results are
pro-pvspinlock.
Could you please help me to know what will make it a mergeable
candidate?.


I need to spend more time reviewing it :) The problem with PV interfaces
is that they are easy to add but hard to get rid of if better solution
(HW or otherwise) appears.


Infact Avi had acked the whole V8 series, but delayed for seeing how
PLE improvement would affect it.


I see that Ingo was happy with it too.


The only addition from that series has been
1. tuning the SPIN_THRESHOLD to 32k (from 2k)
and
2. the halt handler now calls vcpu_on_spin to take the advantage of
PLE improvements. (this can also go as an independent patch into
kvm)

The rationale for making SPIN_THRESHOLD 32k needs a big explanation.
Before the PLE improvements, as you know, the kvm undercommit scenario
was much worse in ple-enabled cases (compared to ple-disabled cases).
The pvspinlock patches behaved equally badly in undercommit. Both had a
similar cause, so in the end there was no degradation w.r.t. base.

The reason for the bad performance in the PLE case was unneeded vcpu
iteration in the ple handler, resulting in many yield_to calls and double
run queue locks.
With pvspinlock applied, the same villain role was played by excessive
halt exits.

But after the ple handler improved, we needed to throttle unnecessary halts
in undercommit for pvspinlock to be on par with the 1x result.


Make sense. I will review it ASAP. BTW the latest version is V10 right?



Yes. Thank you.



Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-07-10 Thread Gleb Natapov
On Wed, Jul 10, 2013 at 04:54:12PM +0530, Raghavendra K T wrote:
> >>Ingo, Gleb,
> >>
> >> From the results perspective, Andrew Theurer, Vinod's test results are
> >>pro-pvspinlock.
> >>Could you please help me to know what will make it a mergeable
> >>candidate?.
> >>
> >I need to spend more time reviewing it :) The problem with PV interfaces
> >is that they are easy to add but hard to get rid of if better solution
> >(HW or otherwise) appears.
> 
> Infact Avi had acked the whole V8 series, but delayed for seeing how
> PLE improvement would affect it.
> 
I see that Ingo was happy with it too.

> The only addition from that series has been
> 1. tuning the SPIN_THRESHOLD to 32k (from 2k)
> and
> 2. the halt handler now calls vcpu_on_spin to take the advantage of
> PLE improvements. (this can also go as an independent patch into
> kvm)
> 
> The rationale for making SPIN_THRESHOLD 32k needs big explanation.
> Before PLE improvements, as you know,
> kvm undercommit scenario was very worse in ple enabled cases.
> (compared to ple disabled cases).
> pvspinlock patches behaved equally bad in undercommit. Both had
> similar reason so at the end there was no degradation w.r.t base.
> 
> The reason for bad performance in PLE case was unneeded vcpu
> iteration in ple handler resulting in high yield_to calls and double
> run queue locks.
> With pvspinlock applied, same villain role was played by excessive
> halt exits.
> 
> But after ple handler improved, we needed to throttle unnecessary halts
> in undercommit for pvspinlock to be on par with 1x result.
> 
Make sense. I will review it ASAP. BTW the latest version is V10 right?

--
Gleb.


Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-07-10 Thread Raghavendra K T

dropping stephen because of bounce

On 07/10/2013 04:58 PM, Raghavendra K T wrote:

On 07/10/2013 04:17 PM, Gleb Natapov wrote:

On Wed, Jul 10, 2013 at 12:40:47PM +0200, Peter Zijlstra wrote:

On Wed, Jul 10, 2013 at 01:33:25PM +0300, Gleb Natapov wrote:

Here's an idea, trim the damn email ;-) -- not only directed at gleb.


Good idea.


Ingo, Gleb,

 From the results perspective, Andrew Theurer, Vinod's test results
are
pro-pvspinlock.
Could you please help me to know what will make it a mergeable
candidate?.


I need to spend more time reviewing it :) The problem with PV
interfaces
is that they are easy to add but hard to get rid of if better solution
(HW or otherwise) appears.


How so? Just make sure the registration for the PV interface is
optional; that
is, allow it to fail. A guest that fails the PV setup will either
have to try
another PV interface or fall back to 'native'.



Forgot to add. Yes currently pvspinlocks are not enabled by default and
also, we have jump_label mechanism to enable it.
[...]



Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-07-10 Thread Gleb Natapov
On Wed, Jul 10, 2013 at 04:58:29PM +0530, Raghavendra K T wrote:
> On 07/10/2013 04:17 PM, Gleb Natapov wrote:
> >On Wed, Jul 10, 2013 at 12:40:47PM +0200, Peter Zijlstra wrote:
> >>On Wed, Jul 10, 2013 at 01:33:25PM +0300, Gleb Natapov wrote:
> >>
> >>Here's an idea, trim the damn email ;-) -- not only directed at gleb.
> >>
> >Good idea.
> >
> Ingo, Gleb,
> 
>  From the results perspective, Andrew Theurer, Vinod's test results are
> pro-pvspinlock.
> Could you please help me to know what will make it a mergeable
> candidate?.
> 
> >>>I need to spend more time reviewing it :) The problem with PV interfaces
> >>>is that they are easy to add but hard to get rid of if better solution
> >>>(HW or otherwise) appears.
> >>
> >>How so? Just make sure the registration for the PV interface is optional; 
> >>that
> >>is, allow it to fail. A guest that fails the PV setup will either have to 
> >>try
> >>another PV interface or fall back to 'native'.
> >>
> >We have to carry PV around for live migration purposes. PV interface
> >cannot disappear under a running guest.
> >
> 
> IIRC, The only requirement was running state of the vcpu to be retained.
> This was addressed by
> [PATCH RFC V10 13/18] kvm : Fold pv_unhalt flag into GET_MP_STATE
> ioctl to aid migration.
> 
> I would have to know more if I missed something here.
> 
I was not talking about the state that has to be migrated, but
HV<->guest interface that has to be preserved after migration.

--
Gleb.


Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-07-10 Thread Raghavendra K T

On 07/10/2013 04:17 PM, Gleb Natapov wrote:

On Wed, Jul 10, 2013 at 12:40:47PM +0200, Peter Zijlstra wrote:

On Wed, Jul 10, 2013 at 01:33:25PM +0300, Gleb Natapov wrote:

Here's an idea, trim the damn email ;-) -- not only directed at gleb.


Good idea.


Ingo, Gleb,

 From the results perspective, Andrew Theurer, Vinod's test results are
pro-pvspinlock.
Could you please help me to know what will make it a mergeable
candidate?.


I need to spend more time reviewing it :) The problem with PV interfaces
is that they are easy to add but hard to get rid of if better solution
(HW or otherwise) appears.


How so? Just make sure the registration for the PV interface is optional; that
is, allow it to fail. A guest that fails the PV setup will either have to try
another PV interface or fall back to 'native'.


We have to carry PV around for live migration purposes. PV interface
cannot disappear under a running guest.



IIRC, the only requirement was that the running state of the vcpu be
retained. This was addressed by
[PATCH RFC V10 13/18] kvm : Fold pv_unhalt flag into GET_MP_STATE ioctl 
to aid migration.


I would have to know more if I missed something here.


I agree that Jiannan's Preemptable Lock idea is promising; we could
evaluate that approach, let the best one get into the kernel, and also
carry on the discussion with Jiannan to improve that patch.

That would be great. The work is stalled from what I can tell.


I absolutely hated that stuff because it wrecked the native code.

Yes, the idea was to hide it from native code behind PV hooks.





Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-07-10 Thread Raghavendra K T

On 07/10/2013 04:03 PM, Gleb Natapov wrote:

On Tue, Jul 09, 2013 at 02:41:30PM +0530, Raghavendra K T wrote:

On 06/26/2013 11:24 PM, Raghavendra K T wrote:

On 06/26/2013 09:41 PM, Gleb Natapov wrote:

On Wed, Jun 26, 2013 at 07:10:21PM +0530, Raghavendra K T wrote:

On 06/26/2013 06:22 PM, Gleb Natapov wrote:

On Wed, Jun 26, 2013 at 01:37:45PM +0200, Andrew Jones wrote:

On Wed, Jun 26, 2013 at 02:15:26PM +0530, Raghavendra K T wrote:

On 06/25/2013 08:20 PM, Andrew Theurer wrote:

On Sun, 2013-06-02 at 00:51 +0530, Raghavendra K T wrote:

This series replaces the existing paravirtualized spinlock
mechanism
with a paravirtualized ticketlock mechanism. The series provides
implementation for both Xen and KVM.

Changes in V9:
- Changed spin_threshold to 32k to avoid excess halt exits that are
causing undercommit degradation (after PLE handler
improvement).
- Added  kvm_irq_delivery_to_apic (suggested by Gleb)
- Optimized halt exit path to use PLE handler

V8 of PVspinlock was posted last year. After Avi's suggestions
to look
at PLE handler's improvements, various optimizations in PLE
handling
have been tried.


Sorry for not posting this sooner.  I have tested the v9
pv-ticketlock
patches in 1x and 2x over-commit with 10-vcpu and 20-vcpu VMs.  I
have
tested these patches with and without PLE, as PLE is still not
scalable
with large VMs.



Hi Andrew,

Thanks for testing.


System: x3850X5, 40 cores, 80 threads


1x over-commit with 10-vCPU VMs (8 VMs) all running dbench:
-----------------------------------------------------------
Configuration            Total Throughput (MB/s)  Notes
3.10-default-ple_on      22945                    5% CPU in host kernel, 2% spin_lock in guests
3.10-default-ple_off     23184                    5% CPU in host kernel, 2% spin_lock in guests
3.10-pvticket-ple_on     22895                    5% CPU in host kernel, 2% spin_lock in guests
3.10-pvticket-ple_off    23051                    5% CPU in host kernel, 2% spin_lock in guests
[all 1x results look good here]


Yes, the 1x results look very close.




2x over-commit with 10-vCPU VMs (16 VMs) all running dbench:
------------------------------------------------------------
Configuration            Total Throughput (MB/s)  Notes
3.10-default-ple_on      6287                     55% CPU in host kernel, 17% spin_lock in guests
3.10-default-ple_off     1849                     2% CPU in host kernel, 95% spin_lock in guests
3.10-pvticket-ple_on     6691                     50% CPU in host kernel, 15% spin_lock in guests
3.10-pvticket-ple_off    16464                    8% CPU in host kernel, 33% spin_lock in guests


I see a 6.426% improvement with ple_on
and a 161.87% improvement with ple_off. I think this is a very good
sign for the patches.


[PLE hinders pv-ticket improvements, but even with PLE off,
  we are still off from ideal throughput (somewhere >2)]



Okay, the ideal throughput you are referring to is at least around
80% of the 1x throughput for over-commit. Yes, we are still far away
from there.



1x over-commit with 20-vCPU VMs (4 VMs) all running dbench:
-----------------------------------------------------------
Configuration            Total Throughput (MB/s)  Notes
3.10-default-ple_on      22736                    6% CPU in host kernel, 3% spin_lock in guests
3.10-default-ple_off     23377                    5% CPU in host kernel, 3% spin_lock in guests
3.10-pvticket-ple_on     22471                    6% CPU in host kernel, 3% spin_lock in guests
3.10-pvticket-ple_off    23445                    5% CPU in host kernel, 3% spin_lock in guests
[1x looking fine here]



I see ple_off is a little better here.



2x over-commit with 20-vCPU VMs (8 VMs) all running dbench:
-----------------------------------------------------------
Configuration            Total Throughput (MB/s)  Notes
3.10-default-ple_on      1965                     70% CPU in host kernel, 34% spin_lock in guests
3.10-default-ple_off     226                      2% CPU in host kernel, 94% spin_lock in guests
3.10-pvticket-ple_on     1942                     70% CPU in host kernel, 35% spin_lock in guests
3.10-pvticket-ple_off    8003                     11% CPU in host kernel, 70% spin_lock in guests
[quite bad all around, but pv-tickets with PLE off the best so far.
  Still quite a bit off from ideal throughput]


This is again a remarkable improvement (307%).
This motivates me to add a patch to disable PLE when pvspinlock is on.
Probably we can add a hypercall that disables PLE in the KVM init patch.
The only problem I see is what happens if the guests are mixed
(i.e. one guest has pvspinlock support but another does not, while the
host supports pv).


How about reintroducing the idea to create per-KVM ple_gap/ple_window
state? We were headed down that road when considering a dynamic window at
one point. Then you can just set a single guest's ple_gap to zero, which
would lead to

Re: [PATCH 1/2] KVM: PPC: Book3S HV: Correct tlbie usage

2013-07-10 Thread Alexander Graf

On 08.07.2013, at 12:08, Paul Mackerras wrote:

> This corrects the usage of the tlbie (TLB invalidate entry) instruction
> in HV KVM.  The tlbie instruction changed between PPC970 and POWER7.
> On the PPC970, the bit to select large vs. small page is in the instruction,
> not in the RB register value.  This changes the code to use the correct
> form on PPC970.
> 
> On POWER7 we were calculating the AVAL (Abbreviated Virtual Address, Lower)
> field of the RB value incorrectly for 64k pages.  This fixes it.
> 
> Since we now have several cases to handle for the tlbie instruction, this
> factors out the code to do a sequence of tlbies into a new function,
> do_tlbies(), and calls that from the various places where the code was
> doing tlbie instructions inline.  It also makes kvmppc_h_bulk_remove()
> use the same global_invalidates() function for determining whether to do
> local or global TLB invalidations as is used in other places, for
> consistency, and also to make sure that kvm->arch.need_tlb_flush gets
> updated properly.
> 
> Signed-off-by: Paul Mackerras 
> Cc: sta...@vger.kernel.org

Thanks, applied both to kvm-ppc-queue.


Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] PF: Provide additional direct page notification

2013-07-10 Thread Alexander Graf

On 10.07.2013, at 12:48, Gleb Natapov wrote:

> On Wed, Jul 10, 2013 at 12:45:59PM +0200, Alexander Graf wrote:
>> 
>> On 10.07.2013, at 12:42, Gleb Natapov wrote:
>> 
>>> On Wed, Jul 10, 2013 at 12:39:01PM +0200, Alexander Graf wrote:
 
 On 09.07.2013, at 18:01, Christian Borntraeger wrote:
 
> On 09/07/13 15:56, Dominik Dingel wrote:
>> By setting a Kconfig option, the architecture can control when
>> guest notifications will be presented by the apf backend.
>> So there is the default batch mechanism, working as before, where the 
>> vcpu thread
>> should pull in this information. On the other hand there is now the 
>> direct
>> mechanism, this will directly push the information to the guest.
>> 
>> Still the vcpu thread should call check_completion to cleanup leftovers,
>> that leaves most of the common code untouched.
>> 
>> Signed-off-by: Dominik Dingel 
> 
> Acked-by: Christian Borntraeger  
> for the "why". We want to use the existing architectured interface.
 
 Shouldn't this be a runtime option?
 
>>> Why? What is the advantage of using sync delivery when HW can do it
>>> async?
>> 
>> What's the advantage of having an option at all then? Who selects it?
>> 
> x86 is stupid and cannot deliver the event asynchronously. Platforms
> that can do it select the option.

We're in generic code. S390x enables it. X86 does not. That was the missing 
link!

Thanks a lot and sorry for the fuss :).


Alex



Re: [PATCH 3/4] PF: Provide additional direct page notification

2013-07-10 Thread Alexander Graf

On 10.07.2013, at 12:49, Christian Borntraeger wrote:

> On 10/07/13 12:39, Alexander Graf wrote:
>> 
>> On 09.07.2013, at 18:01, Christian Borntraeger wrote:
>> 
>>> On 09/07/13 15:56, Dominik Dingel wrote:
 By setting a Kconfig option, the architecture can control when
 guest notifications will be presented by the apf backend.
 So there is the default batch mechanism, working as before, where the vcpu 
 thread
 should pull in this information. On the other hand there is now the direct
 mechanism, this will directly push the information to the guest.
 
 Still the vcpu thread should call check_completion to cleanup leftovers,
 that leaves most of the common code untouched.
 
 Signed-off-by: Dominik Dingel 
>>> 
>>> Acked-by: Christian Borntraeger  
>>> for the "why". We want to use the existing architectured interface.
>> 
>> Shouldn't this be a runtime option?
> 
> This is an a) or b) depending on the architecture. So making this a kconfig
> option is the most sane approach no?

I guess I'm just missing the patch that actually selects it. Last thing I 
remember you can have a kernel configured for s390x that runs on any 64bit 
capable system out there. What would you select? If that kernel runs on newer 
hardware, it would be able to do async pf, no?

There's a good chance I simply miss a critical component here :).


Alex



Re: [PATCH 3/4] PF: Provide additional direct page notification

2013-07-10 Thread Gleb Natapov
On Wed, Jul 10, 2013 at 12:45:59PM +0200, Alexander Graf wrote:
> 
> On 10.07.2013, at 12:42, Gleb Natapov wrote:
> 
> > On Wed, Jul 10, 2013 at 12:39:01PM +0200, Alexander Graf wrote:
> >> 
> >> On 09.07.2013, at 18:01, Christian Borntraeger wrote:
> >> 
> >>> On 09/07/13 15:56, Dominik Dingel wrote:
>  By setting a Kconfig option, the architecture can control when
>  guest notifications will be presented by the apf backend.
>  So there is the default batch mechanism, working as before, where the 
>  vcpu thread
>  should pull in this information. On the other hand there is now the 
>  direct
>  mechanism, this will directly push the information to the guest.
>  
>  Still the vcpu thread should call check_completion to cleanup leftovers,
>  that leaves most of the common code untouched.
>  
>  Signed-off-by: Dominik Dingel 
> >>> 
> >>> Acked-by: Christian Borntraeger  
> >>> for the "why". We want to use the existing architectured interface.
> >> 
> >> Shouldn't this be a runtime option?
> >> 
> > Why? What is the advantage of using sync delivery when HW can do it
> > async?
> 
> What's the advantage of having an option at all then? Who selects it?
> 
x86 is stupid and cannot deliver the event asynchronously. Platforms that
can do it select the option.

--
Gleb.


Re: [PATCH 3/4] PF: Provide additional direct page notification

2013-07-10 Thread Christian Borntraeger
On 10/07/13 12:39, Alexander Graf wrote:
> 
> On 09.07.2013, at 18:01, Christian Borntraeger wrote:
> 
>> On 09/07/13 15:56, Dominik Dingel wrote:
>>> By setting a Kconfig option, the architecture can control when
>>> guest notifications will be presented by the apf backend.
>>> So there is the default batch mechanism, working as before, where the vcpu 
>>> thread
>>> should pull in this information. On the other hand there is now the direct
>>> mechanism, this will directly push the information to the guest.
>>>
>>> Still the vcpu thread should call check_completion to cleanup leftovers,
>>> that leaves most of the common code untouched.
>>>
>>> Signed-off-by: Dominik Dingel 
>>
>> Acked-by: Christian Borntraeger  
>> for the "why". We want to use the existing architectured interface.
> 
> Shouldn't this be a runtime option?

This is an a) or b) depending on the architecture. So making this a kconfig
option is the most sane approach no?

Christian



Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-07-10 Thread Gleb Natapov
On Wed, Jul 10, 2013 at 12:40:47PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 10, 2013 at 01:33:25PM +0300, Gleb Natapov wrote:
> 
> Here's an idea, trim the damn email ;-) -- not only directed at gleb.
> 
Good idea.

> > > Ingo, Gleb,
> > > 
> > > From the results perspective, Andrew Theurer, Vinod's test results are
> > > pro-pvspinlock.
> > > Could you please help me to know what will make it a mergeable
> > > candidate?.
> > > 
> > I need to spend more time reviewing it :) The problem with PV interfaces
> > is that they are easy to add but hard to get rid of if better solution
> > (HW or otherwise) appears.
> 
> How so? Just make sure the registration for the PV interface is optional; that
> is, allow it to fail. A guest that fails the PV setup will either have to try
> another PV interface or fall back to 'native'.
> 
We have to carry PV around for live migration purposes. PV interface
cannot disappear under a running guest.

> > > I agree that Jiannan's Preemptable Lock idea is promising and we could
> > > evaluate that  approach, and make the best one get into kernel and also
> > > will carry on discussion with Jiannan to improve that patch.
> > That would be great. The work is stalled from what I can tell.
> 
> I absolutely hated that stuff because it wrecked the native code.
Yes, the idea was to hide it from native code behind PV hooks.

--
Gleb.


Re: [PATCH 3/4] PF: Provide additional direct page notification

2013-07-10 Thread Alexander Graf

On 10.07.2013, at 12:42, Gleb Natapov wrote:

> On Wed, Jul 10, 2013 at 12:39:01PM +0200, Alexander Graf wrote:
>> 
>> On 09.07.2013, at 18:01, Christian Borntraeger wrote:
>> 
>>> On 09/07/13 15:56, Dominik Dingel wrote:
 By setting a Kconfig option, the architecture can control when
 guest notifications will be presented by the apf backend.
 So there is the default batch mechanism, working as before, where the vcpu 
 thread
 should pull in this information. On the other hand there is now the direct
 mechanism, this will directly push the information to the guest.
 
 Still the vcpu thread should call check_completion to cleanup leftovers,
 that leaves most of the common code untouched.
 
 Signed-off-by: Dominik Dingel 
>>> 
>>> Acked-by: Christian Borntraeger  
>>> for the "why". We want to use the existing architectured interface.
>> 
>> Shouldn't this be a runtime option?
>> 
> Why? What is the advantage of using sync delivery when HW can do it
> async?

What's the advantage of having an option at all then? Who selects it?


Alex



Re: [PATCH 3/4] PF: Provide additional direct page notification

2013-07-10 Thread Gleb Natapov
On Wed, Jul 10, 2013 at 12:39:01PM +0200, Alexander Graf wrote:
> 
> On 09.07.2013, at 18:01, Christian Borntraeger wrote:
> 
> > On 09/07/13 15:56, Dominik Dingel wrote:
> >> By setting a Kconfig option, the architecture can control when
> >> guest notifications will be presented by the apf backend.
> >> So there is the default batch mechanism, working as before, where the vcpu 
> >> thread
> >> should pull in this information. On the other hand there is now the direct
> >> mechanism, this will directly push the information to the guest.
> >> 
> >> Still the vcpu thread should call check_completion to cleanup leftovers,
> >> that leaves most of the common code untouched.
> >> 
> >> Signed-off-by: Dominik Dingel 
> > 
> > Acked-by: Christian Borntraeger  
> > for the "why". We want to use the existing architectured interface.
> 
> Shouldn't this be a runtime option?
> 
Why? What is the advantage of using sync delivery when HW can do it
async?

--
Gleb.


Re: [PATCH 8/8] KVM: PPC: Add hugepage support for IOMMU in-kernel handling

2013-07-10 Thread Alexander Graf

On 10.07.2013, at 12:40, Alexander Graf wrote:

> 
> On 10.07.2013, at 12:39, Benjamin Herrenschmidt wrote:
> 
>> On Wed, 2013-07-10 at 12:33 +0200, Alexander Graf wrote:
>>> 
>>> It's not exactly obvious that you're calling it with writing == 1 :).
>>> Can you create a new local variable "is_write" in the calling
>>> function, set that to 1 before the call to get_user_pages_fast and
>>> pass it in instead of the 1? The compiler should easily optimize all
>>> of that away, but it makes the code by far easier to read.
>> 
>> Ugh ?
>> 
>> Nobody else does that  (look at futex :-)
> 
> Yeah, that's fortunately code that I don't have to read :).

The "proper" alternative would be to pass an enum for read/write into the 
function rather than an int. But that'd be a pretty controversial, big change 
that I'd rather not put on Alexey. With a local variable we're nicely 
self-contained and readable ;)
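A minimal sketch of the call-site change being suggested in this thread (every name below is a stand-in for illustration, not the actual KVM code):

```c
struct page;                        /* opaque stand-in for the kernel type */

/* Stub standing in for the kernel's get_user_pages_fast(); it just
 * echoes back how many pages a successful call would have pinned. */
static int get_user_pages_fast_stub(unsigned long start, int nr_pages,
                                    int write, struct page **pages)
{
    (void)start; (void)pages;
    return write ? nr_pages : 0;
}

/* Before: get_user_pages_fast(ua, 1, 1, &pg) -- the second "1" is
 * opaque at the call site.  After: a named local documents intent. */
static int pin_tce_page(unsigned long ua, struct page **pg)
{
    const int is_write = 1;         /* we always map for writing here */

    return get_user_pages_fast_stub(ua, 1, is_write, pg);
}
```

The compiler folds the constant local away, so the generated code is identical; only the readability at the call site changes.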


Alex



Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-07-10 Thread Peter Zijlstra
On Wed, Jul 10, 2013 at 01:33:25PM +0300, Gleb Natapov wrote:

Here's an idea, trim the damn email ;-) -- not only directed at gleb.

> > Ingo, Gleb,
> > 
> > From the results perspective, Andrew Theurer, Vinod's test results are
> > pro-pvspinlock.
> > Could you please help me to know what will make it a mergeable
> > candidate?.
> > 
> I need to spend more time reviewing it :) The problem with PV interfaces
> is that they are easy to add but hard to get rid of if better solution
> (HW or otherwise) appears.

How so? Just make sure the registration for the PV interface is optional; that
is, allow it to fail. A guest that fails the PV setup will either have to try
another PV interface or fall back to 'native'.

> > I agree that Jiannan's Preemptable Lock idea is promising and we could
> > evaluate that  approach, and make the best one get into kernel and also
> > will carry on discussion with Jiannan to improve that patch.
> That would be great. The work is stalled from what I can tell.

I absolutely hated that stuff because it wrecked the native code.


Re: [PATCH 8/8] KVM: PPC: Add hugepage support for IOMMU in-kernel handling

2013-07-10 Thread Alexander Graf

On 10.07.2013, at 12:39, Benjamin Herrenschmidt wrote:

> On Wed, 2013-07-10 at 12:33 +0200, Alexander Graf wrote:
>> 
>> It's not exactly obvious that you're calling it with writing == 1 :).
>> Can you create a new local variable "is_write" in the calling
>> function, set that to 1 before the call to get_user_pages_fast and
>> pass it in instead of the 1? The compiler should easily optimize all
>> of that away, but it makes the code by far easier to read.
> 
> Ugh ?
> 
> Nobody else does that  (look at futex :-)

Yeah, that's fortunately code that I don't have to read :).


Alex



Re: [PATCH 8/8] KVM: PPC: Add hugepage support for IOMMU in-kernel handling

2013-07-10 Thread Benjamin Herrenschmidt
On Wed, 2013-07-10 at 12:33 +0200, Alexander Graf wrote:
> 
> It's not exactly obvious that you're calling it with writing == 1 :).
> Can you create a new local variable "is_write" in the calling
> function, set that to 1 before the call to get_user_pages_fast and
> pass it in instead of the 1? The compiler should easily optimize all
> of that away, but it makes the code by far easier to read.

Ugh ?

Nobody else does that  (look at futex :-)

Ben.




Re: [PATCH 3/4] PF: Provide additional direct page notification

2013-07-10 Thread Alexander Graf

On 09.07.2013, at 18:01, Christian Borntraeger wrote:

> On 09/07/13 15:56, Dominik Dingel wrote:
>> By setting a Kconfig option, the architecture can control when
>> guest notifications will be presented by the apf backend.
>> So there is the default batch mechanism, working as before, where the vcpu 
>> thread
>> should pull in this information. On the other hand there is now the direct
>> mechanism, this will directly push the information to the guest.
>> 
>> Still the vcpu thread should call check_completion to cleanup leftovers,
>> that leaves most of the common code untouched.
>> 
>> Signed-off-by: Dominik Dingel 
> 
> Acked-by: Christian Borntraeger  
> for the "why". We want to use the existing architectured interface.

Shouldn't this be a runtime option?


Alex



Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-07-10 Thread Gleb Natapov
On Tue, Jul 09, 2013 at 02:41:30PM +0530, Raghavendra K T wrote:
> On 06/26/2013 11:24 PM, Raghavendra K T wrote:
> >On 06/26/2013 09:41 PM, Gleb Natapov wrote:
> >>On Wed, Jun 26, 2013 at 07:10:21PM +0530, Raghavendra K T wrote:
> >>>On 06/26/2013 06:22 PM, Gleb Natapov wrote:
> On Wed, Jun 26, 2013 at 01:37:45PM +0200, Andrew Jones wrote:
> >On Wed, Jun 26, 2013 at 02:15:26PM +0530, Raghavendra K T wrote:
> >>On 06/25/2013 08:20 PM, Andrew Theurer wrote:
> >>>On Sun, 2013-06-02 at 00:51 +0530, Raghavendra K T wrote:
> This series replaces the existing paravirtualized spinlock
> mechanism
> with a paravirtualized ticketlock mechanism. The series provides
> implementation for both Xen and KVM.
> 
> Changes in V9:
> - Changed spin_threshold to 32k to avoid excess halt exits that are
> causing undercommit degradation (after PLE handler
> improvement).
> - Added  kvm_irq_delivery_to_apic (suggested by Gleb)
> - Optimized halt exit path to use PLE handler
> 
> V8 of PVspinlock was posted last year. After Avi's suggestions
> to look
> at PLE handler's improvements, various optimizations in PLE
> handling
> have been tried.
> >>>
> >>>Sorry for not posting this sooner.  I have tested the v9
> >>>pv-ticketlock
> >>>patches in 1x and 2x over-commit with 10-vcpu and 20-vcpu VMs.  I
> >>>have
> >>>tested these patches with and without PLE, as PLE is still not
> >>>scalable
> >>>with large VMs.
> >>>
> >>
> >>Hi Andrew,
> >>
> >>Thanks for testing.
> >>
> >>>System: x3850X5, 40 cores, 80 threads
> >>>
> >>>
> >>>1x over-commit with 10-vCPU VMs (8 VMs) all running dbench:
> >>>--
> >>>Configuration           Total Throughput (MB/s)   Notes
> >>>
> >>>3.10-default-ple_on     22945    5% CPU in host kernel, 2% spin_lock in guests
> >>>3.10-default-ple_off    23184    5% CPU in host kernel, 2% spin_lock in guests
> >>>3.10-pvticket-ple_on    22895    5% CPU in host kernel, 2% spin_lock in guests
> >>>3.10-pvticket-ple_off   23051    5% CPU in host kernel, 2% spin_lock in guests
> >>>[all 1x results look good here]
> >>
> >>Yes. The 1x results look too close
> >>
> >>>
> >>>
> >>>2x over-commit with 10-vCPU VMs (16 VMs) all running dbench:
> >>>---
> >>>Configuration           Total Throughput (MB/s)   Notes
> >>>
> >>>3.10-default-ple_on      6287   55% CPU in host kernel, 17% spin_lock in guests
> >>>3.10-default-ple_off     1849    2% CPU in host kernel, 95% spin_lock in guests
> >>>3.10-pvticket-ple_on     6691   50% CPU in host kernel, 15% spin_lock in guests
> >>>3.10-pvticket-ple_off   16464    8% CPU in host kernel, 33% spin_lock in guests
> >>
> >>I see 6.426% improvement with ple_on
> >>and 161.87% improvement with ple_off. I think this is a very good
> >>sign
> >>  for the patches
> >>
> >>>[PLE hinders pv-ticket improvements, but even with PLE off,
> >>>  we still off from ideal throughput (somewhere >2)]
> >>>
> >>
> >>Okay, The ideal throughput you are referring is getting around
> >>atleast
> >>80% of 1x throughput for over-commit. Yes we are still far away from
> >>there.
> >>
> >>>
> >>>1x over-commit with 20-vCPU VMs (4 VMs) all running dbench:
> >>>--
> >>>Configuration           Total Throughput (MB/s)   Notes
> >>>
> >>>3.10-default-ple_on     22736    6% CPU in host kernel, 3% spin_lock in guests
> >>>3.10-default-ple_off    23377    5% CPU in host kernel, 3% spin_lock in guests
> >>>3.10-pvticket-ple_on    22471    6% CPU in host kernel, 3% spin_lock in guests
> >>>3.10-pvticket-ple_off   23445    5% CPU in host kernel, 3% spin_lock in guests
> >>>[1x looking fine here]
> >>>
> >>
> >>I see ple_off is little better here.
> >>
> >>>
> >>>2x over-commit with 20-vCPU VMs (8 VMs) all running dbench:
> >>>--
> >>>Configuration           Total Throughput (MB/s)   Notes
> >>>
> >>>3.10-default-ple_on      1965   70% CPU in host kernel, 34% spin_lock in guests
> >>>3.10-defa

Re: [PATCH 8/8] KVM: PPC: Add hugepage support for IOMMU in-kernel handling

2013-07-10 Thread Alexander Graf

On 10.07.2013, at 01:29, Alexey Kardashevskiy wrote:

> On 07/10/2013 03:32 AM, Alexander Graf wrote:
>> On 07/06/2013 05:07 PM, Alexey Kardashevskiy wrote:
>>> This adds special support for huge pages (16MB).  The reference
>>> counting cannot be easily done for such pages in real mode (when
>>> MMU is off) so we added a list of huge pages.  It is populated in
>>> virtual mode and get_page is called just once per huge page.
>>> Real mode handlers check if the requested page is huge and in the list,
>>> then no reference counting is done, otherwise an exit to virtual mode
>>> happens.  The list is released at KVM exit.  At the moment the fastest
>>> card available for tests uses up to 9 huge pages so walking through this
>>> list is not very expensive.  However this can change and we may want
>>> to optimize this.
>>> 
>>> Signed-off-by: Paul Mackerras
>>> Signed-off-by: Alexey Kardashevskiy
>>> 
>>> ---
>>> 
>>> Changes:
>>> 2013/06/27:
>>> * list of huge pages replaced with a hashtable for better performance
>> 
>> So the only thing your patch description really talks about is not true
>> anymore?
>> 
>>> * spinlock removed from real mode and only protects insertion of new
>>> huge pages descriptors into the hashtable
>>> 
>>> 2013/06/05:
>>> * fixed compile error when CONFIG_IOMMU_API=n
>>> 
>>> 2013/05/20:
>>> * the real mode handler now searches for a huge page by gpa (used to be pte)
>>> * the virtual mode handler prints warning if it is called twice for the same
>>> huge page as the real mode handler is expected to fail just once - when a
>>> huge
>>> page is not in the list yet.
>>> * the huge page is refcounted twice - when added to the hugepage list and
>>> when used in the virtual mode hcall handler (can be optimized but it will
>>> make the patch less nice).
>>> 
>>> Signed-off-by: Alexey Kardashevskiy
>>> ---
>>>  arch/powerpc/include/asm/kvm_host.h |  25 +
>>>  arch/powerpc/kernel/iommu.c |   6 ++-
>>>  arch/powerpc/kvm/book3s_64_vio.c| 104
>>> +---
>>>  arch/powerpc/kvm/book3s_64_vio_hv.c |  21 ++--
>>>  4 files changed, 146 insertions(+), 10 deletions(-)
>>> 
>>> diff --git a/arch/powerpc/include/asm/kvm_host.h
>>> b/arch/powerpc/include/asm/kvm_host.h
>>> index 53e61b2..a7508cf 100644
>>> --- a/arch/powerpc/include/asm/kvm_host.h
>>> +++ b/arch/powerpc/include/asm/kvm_host.h
>>> @@ -30,6 +30,7 @@
>>>  #include
>>>  #include
>>>  #include
>>> +#include
>>>  #include
>>>  #include
>>>  #include
>>> @@ -182,10 +183,34 @@ struct kvmppc_spapr_tce_table {
>>>  u32 window_size;
>>>  struct iommu_group *grp;/* used for IOMMU groups */
>>>  struct vfio_group *vfio_grp;/* used for IOMMU groups */
>>> +DECLARE_HASHTABLE(hash_tab, ilog2(64));/* used for IOMMU groups */
>>> +spinlock_t hugepages_write_lock;/* used for IOMMU groups */
>>>  struct { struct { unsigned long put, indir, stuff; } rm, vm; } stat;
>>>  struct page *pages[0];
>>>  };
>>> 
>>> +/*
>>> + * The KVM guest can be backed with 16MB pages.
>>> + * In this case, we cannot do page counting from the real mode
>>> + * as the compound pages are used - they are linked in a list
>>> + * with pointers as virtual addresses which are inaccessible
>>> + * in real mode.
>>> + *
>>> + * The code below keeps a 16MB pages list and uses page struct
>>> + * in real mode if it is already locked in RAM and inserted into
>>> + * the list or switches to the virtual mode where it can be
>>> + * handled in a usual manner.
>>> + */
>>> +#define KVMPPC_SPAPR_HUGEPAGE_HASH(gpa)hash_32(gpa>>  24, 32)
>>> +
>>> +struct kvmppc_spapr_iommu_hugepage {
>>> +struct hlist_node hash_node;
>>> +unsigned long gpa;/* Guest physical address */
>>> +unsigned long hpa;/* Host physical address */
>>> +struct page *page;/* page struct of the very first subpage */
>>> +unsigned long size;/* Huge page size (always 16MB at the moment) */
>>> +};
>>> +
>>>  struct kvmppc_linear_info {
>>>  void*base_virt;
>>>  unsigned long base_pfn;
>>> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
>>> index 51678ec..e0b6eca 100644
>>> --- a/arch/powerpc/kernel/iommu.c
>>> +++ b/arch/powerpc/kernel/iommu.c
>>> @@ -999,7 +999,8 @@ int iommu_free_tces(struct iommu_table *tbl, unsigned
>>> long entry,
>>>  if (!pg) {
>>>  ret = -EAGAIN;
>>>  } else if (PageCompound(pg)) {
>>> -ret = -EAGAIN;
>>> +/* Hugepages will be released at KVM exit */
>>> +ret = 0;
>>>  } else {
>>>  if (oldtce&  TCE_PCI_WRITE)
>>>  SetPageDirty(pg);
>>> @@ -1009,6 +1010,9 @@ int iommu_free_tces(struct iommu_table *tbl,
>>> unsigned long entry,
>>>  struct page *pg = pfn_to_page(oldtce>>  PAGE_SHIFT);
>>>  if (!pg) {
>>>  ret = -EAGAIN;
>>> +} else if (

Re: [PATCH 2/8] KVM: PPC: reserve a capability and ioctl numbers for realmode VFIO

2013-07-10 Thread Alexander Graf

On 10.07.2013, at 01:35, Alexey Kardashevskiy wrote:

> On 07/10/2013 01:35 AM, Alexander Graf wrote:
>> On 06/27/2013 07:02 AM, Alexey Kardashevskiy wrote:
>>> Signed-off-by: Alexey Kardashevskiy
>>> ---
>>>  include/uapi/linux/kvm.h |2 ++
>>>  1 file changed, 2 insertions(+)
>>> 
>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>>> index 970b1f5..0865c01 100644
>>> --- a/include/uapi/linux/kvm.h
>>> +++ b/include/uapi/linux/kvm.h
>>> @@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info {
>>>  #define KVM_CAP_PPC_RTAS 91
>>>  #define KVM_CAP_IRQ_XICS 92
>>>  #define KVM_CAP_SPAPR_MULTITCE 93
>>> +#define KVM_CAP_SPAPR_TCE_IOMMU 94
>>> 
>>>  #ifdef KVM_CAP_IRQ_ROUTING
>>> 
>>> @@ -923,6 +924,7 @@ struct kvm_s390_ucas_mapping {
>>>  /* Available with KVM_CAP_PPC_ALLOC_HTAB */
>>>  #define KVM_PPC_ALLOCATE_HTAB  _IOWR(KVMIO, 0xa7, __u32)
>>>  #define KVM_CREATE_SPAPR_TCE  _IOW(KVMIO,  0xa8, struct
>>> kvm_create_spapr_tce)
>>> +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xaf, struct
>>> kvm_create_spapr_tce_iommu)
>> 
>> Please order them by number.
> 
> Oh. Again :( We have had this discussion with Scott Wood here already.
> Where _exactly_ do you want me to put it?

8 lines further down. With a comment saying when it's available. Also why is it 
af, not ad?

> Many sections, not really ordered. Thank you.

They should all be ordered inside of their own categories.


Alex



Re: [PATCH 2/2] KVM: PPC: Book3E: Emulate MCSRR0/1 SPR and rfmci instruction

2013-07-10 Thread Alexander Graf

On 10.07.2013, at 00:26, Scott Wood wrote:

> On 07/09/2013 05:00:26 PM, Alexander Graf wrote:
>> On 09.07.2013, at 23:54, Scott Wood wrote:
>> > On 07/09/2013 04:49:32 PM, Alexander Graf wrote:
>> >> Not sure I understand. What the timing stats do is that they measure the 
>> >> time between [exit ... entry], right? We'd do the same thing, just all in 
>> >> C code. That means we would become slightly less accurate, but gain 
>> >> dynamic enabling of the traces and get rid of all the timing stat asm 
>> >> code.
>> >
>> > Compile-time enabling bothers me less than a loss of accuracy (not just a 
>> > small loss by moving into C code, but a potential for a large loss if we 
>> > overflow the buffer)
>> Then don't overflow the buffer. Make it large enough.
> 
> How large is that?  Does the tool recognize and report when overflow happens?
> 
> How much will the overhead of running some python script on the host, 
> consuming a large volume of data, affect the results?
> 
>> IIRC ftrace improved recently to dynamically increase the buffer size too.
>> Steven, do I remember correctly here?
> 
> Yay more complexity.
> 
> So now we get to worry about possible memory allocations happening when we 
> try to log something?  Or if there is a way to do an "atomic" log, we're back 
> to the "buffer might be full" situation.
> 
>> > and a dependency on a userspace tool
>> We already have that for kvm_stat. It's a simple python script - and you 
>> surely have python on your rootfs, no?
>> > (both in terms of the tool needing to be written, and in the hassle of 
>> > ensuring that it's present in the root filesystem of whatever system I'm 
>> > testing).  And the whole mechanism will be more complicated.
>> It'll also be more flexible at the same time. You could take the logs and 
>> actually check what's going on to debug issues that you're encountering for 
>> example.
>> We could even go as far as sharing the same tool with other architectures, 
>> so that we only have to learn how to debug things once.
> 
> Have you encountered an actual need for this flexibility, or is it 
> theoretical?

Yeah, first thing I did back then to actually debug kvm failures was to add 
trace points.

> Is there common infrastructure for dealing with measuring intervals and 
> tracking statistics thereof, rather than just tracking points and letting 
> userspace connect the dots (though it could still do that as an option)?  
> Even if it must be done in userspace, it doesn't seem like something that 
> should be KVM-specific.

Would you like to have different ways of measuring mm subsystem overhead? I 
don't :). The same goes for KVM really. If we could converge towards a single 
user space interface to get exit timings, it'd make debugging a lot easier.

We already have this for the debugfs counters btw. And the timing framework 
does break kvm_stat today already, as it emits textual stats rather than 
numbers which all of the other debugfs stats do. But at least I can take the 
x86 kvm_stat tool and run it on ppc just fine to see exit stats.

> 
>> > Lots of debug options are enabled at build time; why must this be 
>> > different?
>> Because I think it's valuable as debug tool for cases where compile time 
>> switches are not the best way of debugging things. It's not a high profile 
>> thing to tackle for me tbh, but I don't really think working heavily on the 
>> timing stat thing is the correct path to walk along.
> 
> Adding new exit types isn't "working heavily" on it.

No, but the fact that the first patch is a fix to add exit stats for exits that 
we missed out before doesn't give me a lot of confidence that lots of people 
use timing stats. And I am always very wary of #ifdef'ed code, as it blows up 
the test matrix heavily.


Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] KVM: PPC: Book3E: Get vcpu's last instruction for emulation

2013-07-10 Thread Alexander Graf

On 10.07.2013, at 02:12, Scott Wood wrote:

> On 07/09/2013 04:45:10 PM, Alexander Graf wrote:
>> On 28.06.2013, at 11:20, Mihai Caraman wrote:
>> > +  /* Get page size */
>> > +  if (MAS0_GET_TLBSEL(mfspr(SPRN_MAS0)) == 0)
>> > +  psize_shift = PAGE_SHIFT;
>> > +  else
>> > +  psize_shift = MAS1_GET_TSIZE(mas1) + 10;
>> > +
>> > +  mas7_mas3 = (((u64) mfspr(SPRN_MAS7)) << 32) |
>> > +  mfspr(SPRN_MAS3);
>> > +  addr = (mas7_mas3 & (~0ULL << psize_shift)) |
>> > + (geaddr & ((1ULL << psize_shift) - 1ULL));
>> > +
>> > +  /* Map a page and get guest's instruction */
>> > +  page = pfn_to_page(addr >> PAGE_SHIFT);
>> While looking at this I just realized that you're missing a check here. What 
>> if our IP is in some PCI BAR? Or can't we execute from those?
> 
> We at least need to check pfn_valid() first.  That'll just keep us from 
> accessing a bad pointer in the host kernel, though -- it won't make the 
> emulation actually work.  If we need that, we'll probably need to create a 
> temporary TLB entry manually.

ioremap()?

However, if we were walking the guest TLB cache instead we would get a guest 
physical address which we can always resolve to a host virtual address.

I'm not sure how important that whole use case is though. Maybe we should just 
error out to the guest for now.


Alex



Re: [PATCH 2/2] KVM: PPC: Book3E: Get vcpu's last instruction for emulation

2013-07-10 Thread Alexander Graf

On 10.07.2013, at 02:06, Scott Wood wrote:

> On 07/09/2013 04:44:24 PM, Alexander Graf wrote:
>> On 09.07.2013, at 20:46, Scott Wood wrote:
>> > I suspect that tlbsx is faster, or at worst similar.  And unlike comparing 
>> > tlbsx to lwepx (not counting a fix for the threading problem), we don't 
>> > already have code to search the guest TLB, so testing would be more work.
>> We have code to walk the guest TLB for TLB misses. This really is just the 
>> TLB miss search without host TLB injection.
>> So let's say we're using the shadow TLB. The guest always has its say 64 TLB 
>> entries that it can count on - we never evict anything by accident, because 
>> we store all of the 64 entries in our guest TLB cache. When the guest faults 
>> at an address, the first thing we do is we check the cache whether we have 
>> that page already mapped.
>> However, with this method we now have 2 enumeration methods for guest TLB 
>> searches. We have the tlbsx one which searches the host TLB and we have our 
>> guest TLB cache. The guest TLB cache might still contain an entry for an 
>> address that we already invalidated on the host. Would that pose a problem?
>> I guess not because we're swizzling the exit code around to instead be an 
>> instruction miss which means we restore the TLB entry into our host's TLB so 
>> that when we resume, we land here and the tlbsx hits. But it feels backwards.
> 
> Any better way?  Searching the guest TLB won't work for the LRAT case, so 
> we'd need to have this logic around anyway.  We shouldn't add a second 
> codepath unless it's a clear performance gain -- and again, I suspect it 
> would be the opposite, especially if the entry is not in TLB0 or in one of 
> the first few entries searched in TLB1.  The tlbsx miss case is not what we 
> should optimize for.

Hrm.

So let's redesign this thing theoretically. We would have an exit that requires 
an instruction fetch. We would override kvmppc_get_last_inst() to always do 
kvmppc_ld_inst(). That one can fail because it can't find the TLB entry in the 
host TLB. When it fails, we have to abort the emulation and resume the guest at 
the same IP.

Now the guest gets the TLB miss, we populate, go back into the guest. The guest 
hits the emulation failure again. We go back to kvmppc_ld_inst() which succeeds 
this time and we can emulate the instruction.

I think this works. Just make sure that the gateway to the instruction fetch is 
kvmppc_get_last_inst() and make that failable. Then the difference between 
looking for the TLB entry in the host's TLB or in the guest's TLB cache is 
hopefully negligible.

> 
>> At least this code has to become something more generic, such as 
>> kvmppc_read_guest(vcpu, addr, TYPE_INSN) and move into the host mmu 
>> implementation, as it's 100% host mmu specific.
> 
> I agree that e500_mmu_host.c is a better place for it (with an ifdef for 
> BOOKEHV), but supporting anything other than instruction fetches could wait 
> until we have a user for it (it means extra code to figure out if permissions 
> are correct).

Works for me, as long as it's either documented through BUG_ON/WARN_ON's or an 
explicit naming convention.


Alex



Re: [PATCH 6/8] KVM: PPC: Add support for multiple-TCE hcalls

2013-07-10 Thread Alexander Graf

On 10.07.2013, at 07:00, Alexey Kardashevskiy wrote:

> On 07/10/2013 03:02 AM, Alexander Graf wrote:
>> On 07/06/2013 05:07 PM, Alexey Kardashevskiy wrote:
>>> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
>>> H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO
>>> devices or emulated PCI.  These calls allow adding multiple entries
>>> (up to 512) into the TCE table in one call which saves time on
>>> transition to/from real mode.
>> 
>> We don't mention QEMU explicitly in KVM code usually.
>> 
>>> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
>>> (copied from user and verified) before writing the whole list into
>>> the TCE table. This cache will be utilized more in the upcoming
>>> VFIO/IOMMU support to continue TCE list processing in the virtual
>>> mode in the case if the real mode handler failed for some reason.
>>> 
>>> This adds a guest physical to host real address converter
>>> and calls the existing H_PUT_TCE handler. The converting function
>>> is going to be fully utilized by upcoming VFIO supporting patches.
>>> 
>>> This also implements the KVM_CAP_PPC_MULTITCE capability,
>>> so in order to support the functionality of this patch, QEMU
>>> needs to query for this capability and set the "hcall-multi-tce"
>>> hypertas property only if the capability is present, otherwise
>>> there will be serious performance degradation.
>> 
>> Same as above. But really you're only giving recommendations here. What's
>> the point? Please describe what the benefit of this patch is, not what some
>> other random subsystem might do with the benefits it brings.
>> 
>>> 
>>> Signed-off-by: Paul Mackerras
>>> Signed-off-by: Alexey Kardashevskiy
>>> 
>>> ---
>>> Changelog:
>>> 2013/07/06:
>>> * fixed number of wrong get_page()/put_page() calls
>>> 
>>> 2013/06/27:
>>> * fixed clear of BUSY bit in kvmppc_lookup_pte()
>>> * H_PUT_TCE_INDIRECT does realmode_get_page() now
>>> * KVM_CAP_SPAPR_MULTITCE now depends on CONFIG_PPC_BOOK3S_64
>>> * updated doc
>>> 
>>> 2013/06/05:
>>> * fixed mistype about IBMVIO in the commit message
>>> * updated doc and moved it to another section
>>> * changed capability number
>>> 
>>> 2013/05/21:
>>> * added kvm_vcpu_arch::tce_tmp
>>> * removed cleanup if put_indirect failed, instead we do not even start
>>> writing to TCE table if we cannot get TCEs from the user and they are
>>> invalid
>>> * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
>>> and kvmppc_emulated_validate_tce (for the previous item)
>>> * fixed bug with failthrough for H_IPI
>>> * removed all get_user() from real mode handlers
>>> * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)
>>> 
>>> Signed-off-by: Alexey Kardashevskiy
>>> ---
>>>  Documentation/virtual/kvm/api.txt   |  25 +++
>>>  arch/powerpc/include/asm/kvm_host.h |   9 ++
>>>  arch/powerpc/include/asm/kvm_ppc.h  |  16 +-
>>>  arch/powerpc/kvm/book3s_64_vio.c| 154 ++-
>>>  arch/powerpc/kvm/book3s_64_vio_hv.c | 260
>>> 
>>>  arch/powerpc/kvm/book3s_hv.c|  41 -
>>>  arch/powerpc/kvm/book3s_hv_rmhandlers.S |   6 +
>>>  arch/powerpc/kvm/book3s_pr_papr.c   |  37 -
>>>  arch/powerpc/kvm/powerpc.c  |   3 +
>>>  9 files changed, 517 insertions(+), 34 deletions(-)
>>> 
>>> diff --git a/Documentation/virtual/kvm/api.txt
>>> b/Documentation/virtual/kvm/api.txt
>>> index 6365fef..762c703 100644
>>> --- a/Documentation/virtual/kvm/api.txt
>>> +++ b/Documentation/virtual/kvm/api.txt
>>> @@ -2362,6 +2362,31 @@ calls by the guest for that service will be passed
>>> to userspace to be
>>>  handled.
>>> 
>>> 
>>> +4.86 KVM_CAP_PPC_MULTITCE
>>> +
>>> +Capability: KVM_CAP_PPC_MULTITCE
>>> +Architectures: ppc
>>> +Type: vm
>>> +
>>> +This capability means the kernel is capable of handling hypercalls
>>> +H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user
>>> +space. This significanly accelerates DMA operations for PPC KVM guests.
>> 
>> significanly? Please run this through a spell checker.
>> 
>>> +The user space should expect that its handlers for these hypercalls
>> 
>> s/The//
>> 
>>> +are not going to be called.
>> 
>> Is user space guaranteed they will not be called? Or can it still happen?
> 
> ... if user space previously registered LIOBN in KVM (via
> KVM_CREATE_SPAPR_TCE or similar calls).
> 
> ok?

How about this?

The hypercalls mentioned above may or may not be processed successfully in the 
kernel-based fast path. If they cannot be handled by the kernel, they will get 
passed on to user space. So user space still has to have an implementation for 
these despite the in-kernel acceleration.

---

The target audience for this documentation is user space KVM API users. Someone 
developing kvm tool, for example. They want to know what implications specific CAPs 
have.

> 
> There is also KVM_CREATE_SPAPR_TCE_IOMMU but it is not in the kernel yet
> and may never get there.
>

Re: [v1][PATCH 1/1] KVM: PPC: disable preemption when using hard_irq_disable()

2013-07-10 Thread Alexander Graf

On 10.07.2013, at 08:02, Tiejun Chen wrote:

> We should ensure that preemption cannot occur while calling get_paca()
> inside hard_irq_disable(), otherwise the paca_struct may be the
> wrong one just after. And btw, we may update timing stats in this case.
> 
> Signed-off-by: Tiejun Chen 
> ---
> arch/powerpc/kvm/booke.c |2 ++
> 1 file changed, 2 insertions(+)
> 
> diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
> index dcc94f0..9dae25d 100644
> --- a/arch/powerpc/kvm/booke.c
> +++ b/arch/powerpc/kvm/booke.c
> @@ -839,6 +839,7 @@ int kvmppc_handle_exit(struct kvm_run *run, struct 
> kvm_vcpu *vcpu,
>   WARN_ON(local_paca->irq_happened != 0);
> #endif
> 
> + preempt_disable();
>   /*
>* We enter with interrupts disabled in hardware, but
>* we need to call hard_irq_disable anyway to ensure that
> @@ -848,6 +849,7 @@ int kvmppc_handle_exit(struct kvm_run *run, struct 
> kvm_vcpu *vcpu,
> 
>   /* update before a new last_exit_type is rewritten */
>   kvmppc_update_timing_stats(vcpu);
> + preempt_enable();

All of the code here is already called with interrupts disabled. I don't see 
how we could be preempted then?


Alex

> 
>   /* restart interrupts if they were meant for the host */
>   kvmppc_restart_interrupt(vcpu, exit_nr);
> -- 
> 1.7.9.5
> 



Re: [RFC PATCH v2 1/2] KVM: s390: virtio-ccw: Handle command rejects.

2013-07-10 Thread Christian Borntraeger
On 09/07/13 13:34, Cornelia Huck wrote:
> A command reject for a ccw may happen if we run on a host not supporting
> a certain feature. We want to be able to handle this as special case of
> command failure, so let's split this off from the generic -EIO error code.
> 
> Signed-off-by: Cornelia Huck 
Acked-by: Christian Borntraeger 



> ---
>  drivers/s390/kvm/virtio_ccw.c | 11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/s390/kvm/virtio_ccw.c b/drivers/s390/kvm/virtio_ccw.c
> index 779dc51..d6c7aba 100644
> --- a/drivers/s390/kvm/virtio_ccw.c
> +++ b/drivers/s390/kvm/virtio_ccw.c
> @@ -639,8 +639,15 @@ static void virtio_ccw_int_handler(struct ccw_device 
> *cdev,
>(SCSW_STCTL_ALERT_STATUS | SCSW_STCTL_STATUS_PEND))) {
>   /* OK */
>   }
> - if (irb_is_error(irb))
> - vcdev->err = -EIO; /* XXX - use real error */
> + if (irb_is_error(irb)) {
> + /* Command reject? */
> + if ((scsw_dstat(&irb->scsw) & DEV_STAT_UNIT_CHECK) &&
> + (irb->ecw[0] & SNS0_CMD_REJECT))
> + vcdev->err = -EOPNOTSUPP;
> + else
> + /* Map everything else to -EIO. */
> + vcdev->err = -EIO;
> + }
>   if (vcdev->curr_io & activity) {
>   switch (activity) {
>   case VIRTIO_CCW_DOING_READ_FEAT:
> 



[PATCH] kvm: reset arch memslot info on memslot creation

2013-07-10 Thread Michael S. Tsirkin
On x86, kvm_arch_create_memslot assumes that rmap/lpage_info for the
slot are zeroed out: if they weren't, error handling code after out_free
label will free memory which wasn't allocated here.
This always happens to be the case because on KVM_MR_DELETE we clear the
whole arch structure.  So there's no bug, but it's cleaner not to rely
on this here.

Make the code more robust by clearing the rmap/lpage_info explicitly.

Signed-off-by: Michael S. Tsirkin 
---
 arch/x86/kvm/x86.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e8ba99c..96e6eb4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6922,6 +6922,10 @@ int kvm_arch_create_memslot(struct kvm_memory_slot 
*slot, unsigned long npages)
 {
int i;
 
+   /* Reset in case slot had some rmap/lpage_info. */
+   memset(&slot->arch.rmap, 0, sizeof slot->arch.rmap);
+   memset(&slot->arch.lpage_info, 0, sizeof slot->arch.lpage_info);
+
for (i = 0; i < KVM_NR_PAGE_SIZES; ++i) {
unsigned long ugfn;
int lpages;
-- 
MST


Re: [PATCH 4/4] PF: Async page fault support on s390

2013-07-10 Thread Christian Borntraeger
On 09/07/13 15:56, Dominik Dingel wrote:
> This patch enables async page faults for s390 kvm guests.
> It provides the userspace API to enable, disable or get the status of this
> feature. Also it includes the diagnose code, called by the guest to enable
> async page faults.
> 
> The async page faults will use an already existing guest interface for this
> purpose, as described in "CP Programming Services (SC24-6084)".
> 
> Signed-off-by: Dominik Dingel 

Re-reading this patch again, I found some things that you should take 
care of (nothing major, just small details). Sorry for not seeing them
earlier.

[...]
> + case 1: /* CANCEL */
> + if (vcpu->run->s.regs.gprs[rx] & 7)
> + return kvm_s390_inject_program_int(vcpu, 
> PGM_ADDRESSING);
> +
> + vcpu->run->s.regs.gprs[ry] = 0;
> +
> + if (vcpu->arch.pfault_token == KVM_S390_PFAULT_TOKEN_INVALID)
> + vcpu->run->s.regs.gprs[ry] = 1;
> +
> + vcpu->arch.pfault_token = KVM_S390_PFAULT_TOKEN_INVALID;
> + rc = 0;
> + break;

Don't we need a kvm_clear_async_pf_completion_queue(vcpu) or similar here as 
well?
The cancel function is supposed to purge all outstanding requests (those where 
no 
completion signal has been made pending yet)

[...]
> @@ -264,6 +292,7 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
>  {
>   VCPU_EVENT(vcpu, 3, "%s", "free cpu");
>   trace_kvm_s390_destroy_vcpu(vcpu->vcpu_id);
> + kvm_clear_async_pf_completion_queue(vcpu);
>   if (!kvm_is_ucontrol(vcpu->kvm)) {
>   clear_bit(63 - vcpu->vcpu_id,
> (unsigned long *) &vcpu->kvm->arch.sca->mcn);
> @@ -313,6 +342,9 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
>  /* Section: vcpu related */
>  int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
>  {
> + vcpu->arch.pfault_token = KVM_S390_PFAULT_TOKEN_INVALID;
> + kvm_clear_async_pf_completion_queue(vcpu);
> + kvm_async_pf_wakeup_all(vcpu);
>   if (kvm_is_ucontrol(vcpu->kvm)) {
>   vcpu->arch.gmap = gmap_alloc(current->mm);
>   if (!vcpu->arch.gmap)

We should also reset pfault handling for CPU reset, no?

> @@ -691,10 +723,75 @@ static void kvm_arch_fault_in_sync(struct kvm_vcpu 
> *vcpu)
>   up_read(&mm->mmap_sem);
>  }
> 
> +static void __kvm_inject_pfault_token(struct kvm_vcpu *vcpu, bool 
> start_token,
> +   unsigned long token)
> +{
> + struct kvm_s390_interrupt inti;
> + inti.type = start_token ? KVM_S390_INT_PFAULT_INIT :
> +   KVM_S390_INT_PFAULT_DONE;
> + inti.parm64 = token;
> + if (kvm_s390_inject_vcpu(vcpu, &inti))
> + WARN(1, "pfault interrupt injection failed");
> +}

The PFAULT_DONE is architected as a floating interrupt (it can happen
on other CPUs).

[...]
> --- a/arch/s390/kvm/sigp.c
> +++ b/arch/s390/kvm/sigp.c
> @@ -187,6 +187,8 @@ static int __sigp_set_arch(struct kvm_vcpu *vcpu, u32 
> parameter)
>  {
>   int rc;
> 
> + vcpu->arch.pfault_token = KVM_S390_PFAULT_TOKEN_INVALID;
> +

sigp set architecture affects all cpus, so we must reset pfault via
kvm_for_each_vcpu, I guess.

Otherwise patch looks good. I guess only one more iteration
Christian


