Re: [RFC PATCH 00/12] KVM: MMU: locklessly wirte-protect

2013-08-02 Thread Takuya Yoshikawa
On Tue, 30 Jul 2013 21:01:58 +0800
Xiao Guangrong  wrote:

> Background
> ==
> Currently, when mark memslot dirty logged or get dirty page, we need to
> write-protect large guest memory, it is the heavy work, especially, we need to
> hold mmu-lock which is also required by vcpu to fix its page table fault and
> mmu-notifier when host page is being changed. In the extreme cpu / memory used
> guest, it becomes a scalability issue.
> 
> This patchset introduces a way to locklessly write-protect guest memory.

Nice improvements!

If I read the patch set correctly, this work contains the following changes:

Cleanups:
Patch 1 and patch 12.

Lazy large page dropping for dirty logging:
Patch 2-3.
Patch 2 is preparatory to patch 3.

This does not look like an RFC if you address Marcelo's comment.
Any reason to include this in an RFC patch set?

Making remote TLBs flushable outside of mmu_lock for dirty logging:
Patch 6.

This is nice.  I'm locally using a similar patch for my work, but yours
is much cleaner and better.  I hope this will get merged soon.

New Pte-list handling:
Patch 7-9.

Still reading the details.

RCU-based lockless write protection.
Patch 10-11.

If I understand RCU correctly, the current implementation has a problem:
read-side critical sections can become too long.

See the following LWN's article:
"Sleepable RCU"
https://lwn.net/Articles/202847/

Especially, kvm_mmu_slot_remove_write_access() can take hundreds of
milliseconds, or even a few seconds for guests using shadow paging.
Is it possible to break the read-side critical section after protecting
some pages? -- I guess so.

Anyway, I want to see the following non-RFC quality patches get merged first:
- Lazy large page dropping for dirty logging:
- Making remote TLBs flushable outside of mmu_lock for dirty logging

As you are doing in patch 11, the latter can eliminate the TLB flushes before
cond_resched_lock().  So this alone is an optimization, and since my work is
based on this TLB flush-less lock breaking, I would appriciate if you make this
change first in your clean way.

The remaining patches, pte-list refactoring and lock-less ones, also look
interesting, but I need to read more to understand them.

Thanks for the nice work!
Takuya


> 
> Idea
> ==
> There are the challenges we meet and the ideas to resolve them.
> 
> 1) How to locklessly walk rmap?
> The first idea we got to prevent "desc" being freed when we are walking the
> rmap is using RCU. But when vcpu runs on shadow page mode or nested mmu mode,
> it updates the rmap really frequently.
> 
> So we uses SLAB_DESTROY_BY_RCU to manage "desc" instead, it allows the object
> to be reused more quickly. We also store a "nulls" in the last "desc"
> (desc->more) which can help us to detect whether the "desc" is moved to anther
> rmap then we can re-walk the rmap if that happened. I learned this idea from
> nulls-list.
> 
> Another issue is, when a spte is deleted from the "desc", another spte in the
> last "desc" will be moved to this position to replace the deleted one. If the
> deleted one has been accessed and we do not access the replaced one, the
> replaced one is missed when we do lockless walk.
> To fix this case, we do not backward move the spte, instead, we forward move
> the entry: when a spte is deleted, we move the entry in the first desc to that
> position.
> 
> 2) How to locklessly access shadow page table?
> It is easy if the handler is in the vcpu context, in that case we can use
> walk_shadow_page_lockless_begin() and walk_shadow_page_lockless_end() that
> disable interrupt to stop shadow page be freed. But we are on the ioctl 
> context
> and the paths we are optimizing for have heavy workload, disabling interrupt 
> is
> not good for the system performance.
> 
> We add a indicator into kvm struct (kvm->arch.rcu_free_shadow_page), then use
> call_rcu() to free the shadow page if that indicator is set. Set/Clear the
> indicator are protected by slot-lock, so it need not be atomic and does not
> hurt the performance and the scalability.
> 
> 3) How to locklessly write-protect guest memory?
> Currently, there are two behaviors when we write-protect guest memory, one is
> clearing the Writable bit on spte and the another one is dropping spte when it
> points to large page. The former is easy we only need to atomicly clear a bit
> but the latter is hard since we need to remove the spte from rmap. so we unify
> these two behaviors that only make the spte readonly. Making large spte
> readonly instead of nonpresent is also good for reducing jitter.
> 
> And we need to pay more attention on the order of making spte writable, adding
> spte into rmap and setting the corresponding bit on dirty bitmap since
> kvm_vm_ioctl_get_dirty_log() write-protects the spte based on the dirty 

Re: [PATCH 6/6 v2] kvm: powerpc: use caching attributes as per linux pte

2013-08-02 Thread Benjamin Herrenschmidt
On Sat, 2013-08-03 at 03:11 +, Bhushan Bharat-R65777 wrote:
> 
> > 
> > Could you explain why we need to set dirty/referenced on the PTE, when we 
> > didn't
> > need to do that before? All we're getting from the PTE is wimg.
> > We have MMU notifiers to take care of the page being unmapped, and we've 
> > already
> > marked the page itself as dirty if the TLB entry is writeable.
> 
> I pulled this code from book3s.
> 
> Ben, can you describe why we need this on book3s ?

If you let the guest write to the page you must set the dirty bit on the PTE
(or the struct page, at least one of them), similar with accessed on any access.

If you don't, the VM might swap the page out without writing it back to disk
for example, assuming it contains no modified data.

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/6 v2] kvm: powerpc: booke: Add linux pte lookup like booke3s

2013-08-02 Thread Benjamin Herrenschmidt
On Sat, 2013-08-03 at 02:58 +, Bhushan Bharat-R65777 wrote:
> One of the problem I saw was that if I put this code in
> asm/pgtable-32.h and asm/pgtable-64.h then pte_persent() and other
> friend function (on which this code depends) are defined in pgtable.h.
> And pgtable.h includes asm/pgtable-32.h and asm/pgtable-64.h before it
> defines pte_present() and friends functions.
> 
> Ok I move wove this in asm/pgtable*.h, initially I fought with myself
> to take this code in pgtable* but finally end up doing here (got
> biased by book3s :)).

Is there a reason why these routines can not be completely generic
in pgtable.h ?

Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 6/6 v2] kvm: powerpc: use caching attributes as per linux pte

2013-08-02 Thread Bhushan Bharat-R65777


> -Original Message-
> From: Wood Scott-B07421
> Sent: Saturday, August 03, 2013 5:05 AM
> To: Bhushan Bharat-R65777
> Cc: b...@kernel.crashing.org; ag...@suse.de; kvm-...@vger.kernel.org;
> kvm@vger.kernel.org; linuxppc-...@lists.ozlabs.org; Bhushan Bharat-R65777
> Subject: Re: [PATCH 6/6 v2] kvm: powerpc: use caching attributes as per linux
> pte
> 
> On Thu, Aug 01, 2013 at 04:42:38PM +0530, Bharat Bhushan wrote:
> > diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c index
> > 17722d8..eb2 100644
> > --- a/arch/powerpc/kvm/booke.c
> > +++ b/arch/powerpc/kvm/booke.c
> > @@ -697,7 +697,7 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run,
> > struct kvm_vcpu *vcpu)  #endif
> >
> > kvmppc_fix_ee_before_entry();
> > -
> > +   vcpu->arch.pgdir = current->mm->pgd;
> > ret = __kvmppc_vcpu_run(kvm_run, vcpu);
> 
> kvmppc_fix_ee_before_entry() is supposed to be the last thing that happens
> before __kvmppc_vcpu_run().
> 
> > @@ -332,6 +324,8 @@ static inline int kvmppc_e500_shadow_map(struct
> kvmppc_vcpu_e500 *vcpu_e500,
> > unsigned long hva;
> > int pfnmap = 0;
> > int tsize = BOOK3E_PAGESZ_4K;
> > +   pte_t pte;
> > +   int wimg = 0;
> >
> > /*
> >  * Translate guest physical to true physical, acquiring @@ -437,6
> > +431,8 @@ static inline int kvmppc_e500_shadow_map(struct
> > kvmppc_vcpu_e500 *vcpu_e500,
> >
> > if (likely(!pfnmap)) {
> > unsigned long tsize_pages = 1 << (tsize + 10 - PAGE_SHIFT);
> > +   pgd_t *pgdir;
> > +
> > pfn = gfn_to_pfn_memslot(slot, gfn);
> > if (is_error_noslot_pfn(pfn)) {
> > printk(KERN_ERR "Couldn't get real page for gfn 
> > %lx!\n", @@
> -447,9
> > +443,18 @@ static inline int kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500
> *vcpu_e500,
> > /* Align guest and physical address to page map boundaries */
> > pfn &= ~(tsize_pages - 1);
> > gvaddr &= ~((tsize_pages << PAGE_SHIFT) - 1);
> > +   pgdir = vcpu_e500->vcpu.arch.pgdir;
> > +   pte = lookup_linux_pte(pgdir, hva, 1, &tsize_pages);
> > +   if (pte_present(pte)) {
> > +   wimg = (pte >> PTE_WIMGE_SHIFT) & MAS2_WIMGE_MASK;
> > +   } else {
> > +   printk(KERN_ERR "pte not present: gfn %lx, pfn %lx\n",
> > +   (long)gfn, pfn);
> > +   return -EINVAL;
> > +   }
> > }
> 
> How does wimg get set in the pfnmap case?

Pfnmap is not kernel managed pages, right? So should we set I+G there ?

> 
> Could you explain why we need to set dirty/referenced on the PTE, when we 
> didn't
> need to do that before? All we're getting from the PTE is wimg.
> We have MMU notifiers to take care of the page being unmapped, and we've 
> already
> marked the page itself as dirty if the TLB entry is writeable.

I pulled this code from book3s.

Ben, can you describe why we need this on book3s ?

Thanks
-Bharat
> 
> -Scott

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 5/6 v2] kvm: powerpc: booke: Add linux pte lookup like booke3s

2013-08-02 Thread Bhushan Bharat-R65777


> -Original Message-
> From: Benjamin Herrenschmidt [mailto:b...@kernel.crashing.org]
> Sent: Saturday, August 03, 2013 4:47 AM
> To: Wood Scott-B07421
> Cc: Bhushan Bharat-R65777; ag...@suse.de; kvm-...@vger.kernel.org;
> kvm@vger.kernel.org; linuxppc-...@lists.ozlabs.org; Bhushan Bharat-R65777
> Subject: Re: [PATCH 5/6 v2] kvm: powerpc: booke: Add linux pte lookup like
> booke3s
> 
> On Fri, 2013-08-02 at 17:58 -0500, Scott Wood wrote:
> >
> > What about 64-bit PTEs on 32-bit kernels?
> >
> > In any case, this code does not belong in KVM.  It should be in the
> > main PPC mm code, even if KVM is the only user.
> 
> Also don't we do similar things in BookS KVM ? At the very least that sutff
> should become common. And yes, I agree, it should probably also move to 
> pgtable*

One of the problem I saw was that if I put this code in asm/pgtable-32.h and 
asm/pgtable-64.h then pte_persent() and other friend function (on which this 
code depends) are defined in pgtable.h. And pgtable.h includes asm/pgtable-32.h 
and asm/pgtable-64.h before it defines pte_present() and friends functions.

Ok I move wove this in asm/pgtable*.h, initially I fought with myself to take 
this code in pgtable* but finally end up doing here (got biased by book3s :)).

Thanks
-Bharat

> 
> Cheers,
> Ben.
> 
> 

N�r��yb�X��ǧv�^�)޺{.n�+h����ܨ}���Ơz�&j:+v���zZ+��+zf���h���~i���z��w���?�&�)ߢf

Re: RFC: vfio-pci API for PCI bus/slot (hot) resets

2013-08-02 Thread Benjamin Herrenschmidt
On Fri, 2013-08-02 at 17:37 -0600, Alex Williamson wrote:

> > Yes.
> > 
> > We have that similar issue with error handling, when the driver doesn't
> > have the right hooks, we simulate an unplug, reset, then replug.
> > 
> > Maybe we could provide generic helpers to do that...
> 
> Devices going away and coming back is pretty difficult for vfio to
> handle.  Perhaps helpers to rescan a device in-place would be easier.

Well, in the error handling case (and *maybe* in the "FW update" case)
we need to unbind and rebind the driver as well. The whole point is that
we have to do that because the driver doesn't have the right hooks.

In the VFIO case, we will have to implement something here so that the
VFIO driver stub doesn't get handled that treatment :-) We'll probably
need some arch specific stuff in vfio_pci unfortunately, so that the
errors get forwarded to the guest via our EEH interfaces, and let the
guest handle it's error handling.

Of course that leaves an interesting problem as to what state the device
is in when it comes back to the host ...

This is made a LOT more complicated in the VFIO model than it is in the
"pHyp" model (our proprietary hypervisor).

Under pHyp, the PE doesn't have a concept of being used outside of a
guest, and it always reset before being assigned/unassigned. The guest
can mess around all it wants (bus numbers, BARs, etc...) it can only
hurt itself. The hypervisor doesn't keep track of any of that.

> On the QEMU side we'd need to rescan the device after each reset, which
> would be rather tedious for the typical case where it doesn't change.

This is a direct consequence of your model :-) It makes things more
complex for us as well, but we have to bite that bullet now. Maybe we
can consider an alternate/simpler model in the future, more akin to what
pHyp does, where we "unplug" the device from the host when assigning it
to a guest (and the whole hierarchy below it if it's a bridge) and let
the guest do what it wants with it. 

Doing so means we no longer have to emulate/filter config space
(provided your HW handles MSI virtualization properly), care about bus
numbers, BARs, etc... nor do we need to keep track of a lot of this in
qemu. All we need is reset the whole & lot and re-probe the bus leg when
returning the devices to the host.
 
Cheers,
Ben.

> Thanks,
> 
> Alex
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/6 v2] kvm: powerpc: booke: Add linux pte lookup like booke3s

2013-08-02 Thread Benjamin Herrenschmidt
On Fri, 2013-08-02 at 17:58 -0500, Scott Wood wrote:
> 
> What about 64-bit PTEs on 32-bit kernels?
> 
> In any case, this code does not belong in KVM.  It should be in the
> main
> PPC mm code, even if KVM is the only user.

Also don't we do similar things in BookS KVM ? At the very least that
sutff should become common. And yes, I agree, it should probably also
move to pgtable*

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC: vfio-pci API for PCI bus/slot (hot) resets

2013-08-02 Thread Alex Williamson
On Sat, 2013-08-03 at 09:15 +1000, Benjamin Herrenschmidt wrote:
> On Fri, 2013-08-02 at 16:49 -0600, Bjorn Helgaas wrote:
> > [+cc linux-pci]
> > 
> > On Fri, Aug 2, 2013 at 3:28 PM, Benjamin Herrenschmidt
> >  wrote:
> > 
> > > Right. Another use case is, I know of devices that need a fundamental
> > > reset (PERST) after applying a FW update.
> > 
> > This is a tangent from the real discussion here, but the question of
> > resetting a device after a firmware update concerns me.  Many if not
> > all of our current reset interfaces save & restore the architected
> > parts of config space around the reset.  But a reset after a firmware
> > update may change things like the number and type of BARs or even the
> > functionality of the device, so I don't think the restore is safe in
> > general.
> 
> Right.
> 
> > I doubt this is a big problem in general, but I have found reports of
> > people having to do a system reset or reboot after updating, e.g.,
> > FPGA images.  I suppose at least some of these could be worked around
> > with the right hotplug incantations.
> 
> Yes.
> 
> We have that similar issue with error handling, when the driver doesn't
> have the right hooks, we simulate an unplug, reset, then replug.
> 
> Maybe we could provide generic helpers to do that...

Devices going away and coming back is pretty difficult for vfio to
handle.  Perhaps helpers to rescan a device in-place would be easier.
On the QEMU side we'd need to rescan the device after each reset, which
would be rather tedious for the typical case where it doesn't change.
Thanks,

Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 6/6 v2] kvm: powerpc: use caching attributes as per linux pte

2013-08-02 Thread Scott Wood
On Thu, Aug 01, 2013 at 04:42:38PM +0530, Bharat Bhushan wrote:
> diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
> index 17722d8..eb2 100644
> --- a/arch/powerpc/kvm/booke.c
> +++ b/arch/powerpc/kvm/booke.c
> @@ -697,7 +697,7 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct 
> kvm_vcpu *vcpu)
>  #endif
>  
>   kvmppc_fix_ee_before_entry();
> -
> + vcpu->arch.pgdir = current->mm->pgd;
>   ret = __kvmppc_vcpu_run(kvm_run, vcpu);

kvmppc_fix_ee_before_entry() is supposed to be the last thing that
happens before __kvmppc_vcpu_run().

> @@ -332,6 +324,8 @@ static inline int kvmppc_e500_shadow_map(struct 
> kvmppc_vcpu_e500 *vcpu_e500,
>   unsigned long hva;
>   int pfnmap = 0;
>   int tsize = BOOK3E_PAGESZ_4K;
> + pte_t pte;
> + int wimg = 0;
>  
>   /*
>* Translate guest physical to true physical, acquiring
> @@ -437,6 +431,8 @@ static inline int kvmppc_e500_shadow_map(struct 
> kvmppc_vcpu_e500 *vcpu_e500,
>  
>   if (likely(!pfnmap)) {
>   unsigned long tsize_pages = 1 << (tsize + 10 - PAGE_SHIFT);
> + pgd_t *pgdir;
> +
>   pfn = gfn_to_pfn_memslot(slot, gfn);
>   if (is_error_noslot_pfn(pfn)) {
>   printk(KERN_ERR "Couldn't get real page for gfn %lx!\n",
> @@ -447,9 +443,18 @@ static inline int kvmppc_e500_shadow_map(struct 
> kvmppc_vcpu_e500 *vcpu_e500,
>   /* Align guest and physical address to page map boundaries */
>   pfn &= ~(tsize_pages - 1);
>   gvaddr &= ~((tsize_pages << PAGE_SHIFT) - 1);
> + pgdir = vcpu_e500->vcpu.arch.pgdir;
> + pte = lookup_linux_pte(pgdir, hva, 1, &tsize_pages);
> + if (pte_present(pte)) {
> + wimg = (pte >> PTE_WIMGE_SHIFT) & MAS2_WIMGE_MASK;
> + } else {
> + printk(KERN_ERR "pte not present: gfn %lx, pfn %lx\n",
> + (long)gfn, pfn);
> + return -EINVAL;
> + }
>   }

How does wimg get set in the pfnmap case?

Could you explain why we need to set dirty/referenced on the PTE, when we
didn't need to do that before?  All we're getting from the PTE is wimg. 
We have MMU notifiers to take care of the page being unmapped, and we've
already marked the page itself as dirty if the TLB entry is writeable.

-Scott

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC: vfio-pci API for PCI bus/slot (hot) resets

2013-08-02 Thread Benjamin Herrenschmidt
On Fri, 2013-08-02 at 16:49 -0600, Bjorn Helgaas wrote:
> [+cc linux-pci]
> 
> On Fri, Aug 2, 2013 at 3:28 PM, Benjamin Herrenschmidt
>  wrote:
> 
> > Right. Another use case is, I know of devices that need a fundamental
> > reset (PERST) after applying a FW update.
> 
> This is a tangent from the real discussion here, but the question of
> resetting a device after a firmware update concerns me.  Many if not
> all of our current reset interfaces save & restore the architected
> parts of config space around the reset.  But a reset after a firmware
> update may change things like the number and type of BARs or even the
> functionality of the device, so I don't think the restore is safe in
> general.

Right.

> I doubt this is a big problem in general, but I have found reports of
> people having to do a system reset or reboot after updating, e.g.,
> FPGA images.  I suppose at least some of these could be worked around
> with the right hotplug incantations.

Yes.

We have that similar issue with error handling, when the driver doesn't
have the right hooks, we simulate an unplug, reset, then replug.

Maybe we could provide generic helpers to do that...

Cheers,
Ben.

> http://forums.xilinx.com/t5/PCI-Express/PC-is-hung-if-run-the-second-program-FPGA-by-JTAG/td-p/20871
> http://www.alteraforum.com/forum/archive/index.php/t-28477.html
> http://www.alteraforum.com/forum/showthread.php?t=35091
> https://github.com/NetFPGA/NetFPGA-public/wiki/Restore-PCIE-Configuration
> http://lkml.indiana.edu/hypermail/linux/kernel/1104.3/02544.html
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/6 v2] kvm: powerpc: booke: Add linux pte lookup like booke3s

2013-08-02 Thread Scott Wood
On Thu, 2013-08-01 at 16:42 +0530, Bharat Bhushan wrote:
> KVM need to lookup linux pte for getting TLB attributes (WIMGE).
> This is similar to how book3s does.
> This will be used in follow-up patches.
> 
> Signed-off-by: Bharat Bhushan 
> ---
> v1->v2
>  - This is a new change in this version
> 
>  arch/powerpc/include/asm/kvm_booke.h |   73 
> ++
>  1 files changed, 73 insertions(+), 0 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/kvm_booke.h 
> b/arch/powerpc/include/asm/kvm_booke.h
> index d3c1eb3..903624d 100644
> --- a/arch/powerpc/include/asm/kvm_booke.h
> +++ b/arch/powerpc/include/asm/kvm_booke.h
> @@ -102,4 +102,77 @@ static inline ulong kvmppc_get_msr(struct kvm_vcpu *vcpu)
>  {
>   return vcpu->arch.shared->msr;
>  }
> +
> +/*
> + * Lock and read a linux PTE.  If it's present and writable, atomically
> + * set dirty and referenced bits and return the PTE, otherwise return 0.
> + */
> +static inline pte_t kvmppc_read_update_linux_pte(pte_t *p, int writing)
> +{
> + pte_t pte;
> +
> +#ifdef PTE_ATOMIC_UPDATES
> + pte_t tmp;
> +/* wait until _PAGE_BUSY is clear then set it atomically */

_PAGE_BUSY is 0 on book3e.

> +#ifdef CONFIG_PPC64
> + __asm__ __volatile__ (
> + "1: ldarx   %0,0,%3\n"
> + "   andi.   %1,%0,%4\n"
> + "   bne-1b\n"
> + "   ori %1,%0,%4\n"
> + "   stdcx.  %1,0,%3\n"
> + "   bne-1b"
> + : "=&r" (pte), "=&r" (tmp), "=m" (*p)
> + : "r" (p), "i" (_PAGE_BUSY)
> + : "cc");
> +#else
> +__asm__ __volatile__ (
> +"1: lwarx   %0,0,%3\n"
> +"   andi.   %1,%0,%4\n"
> +"   bne-1b\n"
> +"   ori %1,%0,%4\n"
> +"   stwcx.  %1,0,%3\n"
> +"   bne-1b"
> +: "=&r" (pte), "=&r" (tmp), "=m" (*p)
> +: "r" (p), "i" (_PAGE_BUSY)
> +: "cc");
> +#endif

What about 64-bit PTEs on 32-bit kernels?

In any case, this code does not belong in KVM.  It should be in the main
PPC mm code, even if KVM is the only user.

-Scott



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 03/12] KVM: MMU: lazily drop large spte

2013-08-02 Thread Xiao Guangrong

On Aug 3, 2013, at 4:27 AM, Marcelo Tosatti  wrote:

> On Fri, Aug 02, 2013 at 11:42:19PM +0800, Xiao Guangrong wrote:
>> 
>> On Aug 2, 2013, at 10:55 PM, Marcelo Tosatti  wrote:
>> 
>>> On Tue, Jul 30, 2013 at 09:02:01PM +0800, Xiao Guangrong wrote:
 Currently, kvm zaps the large spte if write-protected is needed, the later
 read can fault on that spte. Actually, we can make the large spte readonly
 instead of making them un-present, the page fault caused by read access can
 be avoided
 
 The idea is from Avi:
 | As I mentioned before, write-protecting a large spte is a good idea,
 | since it moves some work from protect-time to fault-time, so it reduces
 | jitter.  This removes the need for the return value.
 
 [
 It has fixed the issue reported in 6b73a9606 by stopping fast page fault
 marking the large spte to writable
 ]
>>> 
>>> Xiao,
>>> 
>>> Can you please write a comment explaining why are the problems 
>>> with shadow vs large read-only sptes (can't recall anymore),
>>> and then why it is now safe to do it.
>> 
>> Hi Marcelo,
>> 
>> Thanks for your review.  Yes. The bug reported in  6b73a9606 is, in this 
>> patch,
>> we mark the large spte as readonly when the pages are dirt logged and the
>> readonly spte can be set to writable by fast page fault, but on that path, 
>> it failed
>> to check dirty logging, so it will set the large spte to writable but only 
>> set the first
>> page to the dirty bitmap.
>> 
>> For example:
>> 
>> 1): KVM maps 0 ~ 2M memory to guest which is pointed by SPTE and SPTE
>> is writable.
>> 
>> 2): KVM dirty log 0 ~ 2M,  then set SPTE to readonly
>> 
>> 3): fast page fault set SPTE to writable and set page 0 to the dirty bitmap.
>> 
>> Then 4K ~ 2M memory is not dirty logged.
> 
> Ok can you write a self contained summary of read-only large sptes (when
> they are created, when destroyed, from which point they can't be created,
> etc), and the interaction with shadow write protection and creation of
> writeable sptes?
> Its easy to get lost.

Okay, will do.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 03/12] KVM: MMU: lazily drop large spte

2013-08-02 Thread Marcelo Tosatti
On Fri, Aug 02, 2013 at 11:42:19PM +0800, Xiao Guangrong wrote:
> 
> On Aug 2, 2013, at 10:55 PM, Marcelo Tosatti  wrote:
> 
> > On Tue, Jul 30, 2013 at 09:02:01PM +0800, Xiao Guangrong wrote:
> >> Currently, kvm zaps the large spte if write-protected is needed, the later
> >> read can fault on that spte. Actually, we can make the large spte readonly
> >> instead of making them un-present, the page fault caused by read access can
> >> be avoided
> >> 
> >> The idea is from Avi:
> >> | As I mentioned before, write-protecting a large spte is a good idea,
> >> | since it moves some work from protect-time to fault-time, so it reduces
> >> | jitter.  This removes the need for the return value.
> >> 
> >> [
> >>  It has fixed the issue reported in 6b73a9606 by stopping fast page fault
> >>  marking the large spte to writable
> >> ]
> > 
> > Xiao,
> > 
> > Can you please write a comment explaining why are the problems 
> > with shadow vs large read-only sptes (can't recall anymore),
> > and then why it is now safe to do it.
> 
> Hi Marcelo,
> 
> Thanks for your review.  Yes. The bug reported in  6b73a9606 is, in this 
> patch,
> we mark the large spte as readonly when the pages are dirt logged and the
> readonly spte can be set to writable by fast page fault, but on that path, it 
> failed
> to check dirty logging, so it will set the large spte to writable but only 
> set the first
> page to the dirty bitmap.
> 
> For example:
> 
> 1): KVM maps 0 ~ 2M memory to guest which is pointed by SPTE and SPTE
>  is writable.
> 
> 2): KVM dirty log 0 ~ 2M,  then set SPTE to readonly
> 
> 3): fast page fault set SPTE to writable and set page 0 to the dirty bitmap.
> 
> Then 4K ~ 2M memory is not dirty logged.

Ok can you write a self contained summary of read-only large sptes (when
they are created, when destroyed, from which point they can't be created,
etc), and the interaction with shadow write protection and creation of
writeable sptes?
Its easy to get lost.

> In this version, we let fast page fault do not mark large spte to writable if
> its page are dirty logged.  But it is still not safe as you pointed out.
> 
> >> 
> >> 
> >>/*
> >> +   * Can not map the large spte to writable if the page is dirty
> >> +   * logged.
> >> +   */
> >> +  if (sp->role.level > PT_PAGE_TABLE_LEVEL && force_pt_level)
> >> +  goto exit;
> >> +
> > 
> > It is not safe to derive slot->dirty_bitmap like this: 
> > since dirty log is enabled via RCU update, "is dirty bitmap enabled"
> > info could be stale by the time you check it here via the parameter,
> > so you can instantiate a large spte (because force_pt_level == false),
> > while you should not.
> 
> Good catch! This is true even if we enable dirty log under the protection
> of mmu lock.
> 
> How about let the fault page fault only fix the small spte, that is changing
> the code to:
>   if (sp->role.level > PT_PAGE_TABLE_LEVEL)
>   goto exit;
> ?

Sure.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC: vfio-pci API for PCI bus/slot (hot) resets

2013-08-02 Thread Bjorn Helgaas
[+cc linux-pci]

On Fri, Aug 2, 2013 at 3:28 PM, Benjamin Herrenschmidt
 wrote:

> Right. Another use case is, I know of devices that need a fundamental
> reset (PERST) after applying a FW update.

This is a tangent from the real discussion here, but the question of
resetting a device after a firmware update concerns me.  Many if not
all of our current reset interfaces save & restore the architected
parts of config space around the reset.  But a reset after a firmware
update may change things like the number and type of BARs or even the
functionality of the device, so I don't think the restore is safe in
general.

I doubt this is a big problem in general, but I have found reports of
people having to do a system reset or reboot after updating, e.g.,
FPGA images.  I suppose at least some of these could be worked around
with the right hotplug incantations.

http://forums.xilinx.com/t5/PCI-Express/PC-is-hung-if-run-the-second-program-FPGA-by-JTAG/td-p/20871
http://www.alteraforum.com/forum/archive/index.php/t-28477.html
http://www.alteraforum.com/forum/showthread.php?t=35091
https://github.com/NetFPGA/NetFPGA-public/wiki/Restore-PCIE-Configuration
http://lkml.indiana.edu/hypermail/linux/kernel/1104.3/02544.html
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC: vfio-pci API for PCI bus/slot (hot) resets

2013-08-02 Thread Benjamin Herrenschmidt
On Fri, 2013-08-02 at 10:36 -0600, Alex Williamson wrote:

> > Also in some cases, at least for us, we have a problem where the scope
> > of the reset can be larger than the group ... in this case I think we
> > need to forbid the reset.
> 
> Interesting, I would have ventured to guess that resets are always
> contained within a single group given the PE isolation domains rooted at
> a PHB. 

Depends how I configure them. For example, if I assign two functions to
two different PEs (SR-IOV typically), a hot reset will kill both.

For fundamental reset, depending on how the mobo is wired up and other
platform things, I might for example be limited on doing a PERST at the
PHB level, thus accross multiple PEs.

>  Why disallow it though?  What I propose below would still allow
> it if the user can prove that they own all of the affected groups.

Right, I mean disallow it if the PEs are not all owned by the same user.
What you propose would probably work.

>   Even
> on x86, I think it's too restrictive to only allow reset if the affect
> is contained within a group.  It's just too easy to need it for a
> multi-function device.  We could hope that a device that implements ACS
> also implements some sort of FLR, but that may just be wishful thinking.

Right. Another use case is, I know of devices that need a fundamental
reset (PERST) after applying a FW update.

> > For us, it's also tied into our "EEH" error handling. IE. The way the
> > guest will request some of these things is going to be via firmware APIs
> > that we have yet to implement, but that we were also thinking of
> > implementing entire in the kernel rather than qemu for various
> > reasons... IE. There is more to error handling & recovery in our case at
> > least than AER and reset :-)
> 
> I've been focused on reset for device re-initialization and
> repeatability, but that's a good point, we need to consider error
> recovery as a use of this interface as well.

Unfortunately, with EEH we have very specific FW interface (including
low level diagnostic data "blobs" etc...) that we need to implement so I
am not sure we can "fit" it in a generic interface.

We might want to attempt to list our interfaces and see. CC'ing Gavin
who is our EEH expert.

A rough cut is

 - We have the concept of "frozen" PE (frozen MMIO and frozen DMA, two
separate attributes). So we can query and change the freeze state on a
PE.

 - We have various level of error severity (from informational
(corrected) to full PHB need a reset with some variations in the middle
of PEs being frozen etc...)

 - We have diag blobs coming out of FW for the guest to log (and
possibly partially decode).

 - We have interfaces for various types of resets

> > I need to spend more time reading your proposal and see how it can work
> > for us (or not...)... but we might end up just doing our own thing that
> > carries the whole EEH API instead.
> 
> Obviously it would be great if it could work for you, but regardless of
> whether you intend to use it I'd appreciate feedback.  Thanks,

Ben.

> Alex
> 
> > > ---
> > > Mechanism to do PCI hot resets through VFIO:
> > > 
> > > VFIO is fundamentally an IOMMU group and device level interface.
> > > There's no concept of buses, slots, or hierarchies of devices.  There
> > > are only IOMMU group and devices.  A bus (or slot) may contain exactly
> > > one IOMMU group, multiple IOMMU groups, or a portion of an IOMMU group.
> > > An IOMMU group may contain one or more devices.
> > > 
> > > The first question is perhaps where should we create the interface to do
> > > a PCI hot reset.  Assuming an ioctl interface, our choices are the
> > > group, the container, or the device file descriptors.  Groups and
> > > containers are not PCI specific, so an extension on either of those
> > > doesn't make much sense.  They also don't have much granularity if your
> > > goal is to do a hot reset on the smallest subset of devices you can.
> > > Therefore the only choice seems to be a VFIO device level interface.
> > > 
> > > The fact that a hot reset affects multiple devices also raises concerns.
> > > How do we make sure a user has sufficient access/privilege to perform
> > > this operation?  If all of the affected devices are within the same
> > > group, then we know the user already "owns" all those devices.  Using
> > > groups as the boundary excludes a number of use cases though.  The user
> > > would need to prove that they also own the other groups or devices that
> > > are affected by the reset.  This might be multiple groups, so the ioctl
> > > quickly grows to requiring a list of file descriptors be passed for
> > > validation.
> > > 
> > > We already use the group file descriptor as a unit of ownership for
> > > enabling the container, so it seems like it would make sense to use it
> > > here too.  The alternative is a device file descriptor, but groups may
> > > encompass devices the user doesn't care to use and we don't want to
> > > require that they op

Re: FAQ on linux-kvm.org has broken link

2013-08-02 Thread folkert
Hi Stefan,

> > I pinged, I sniffed, I updated the bug report (it also happens with
> > other guests now!).
> >
> > And the bring down interfaces / rmmod / modprobe / ifup works!
> > So I think something is wrong with virtio_net!
> >
> > What shall I do now?
> 
> I wrote a reply earlier today but it was rejected because I not have a
> kernel.org bugzilla account.  If you don't mind let's continue
> discussing on this mailing list - we don't know whether this is a
> kernel bug yet anyway.

no problem

> A couple of questions:
> Please post the QEMU command-line from the host (ps aux | grep qemu).

I'll post them all:
- UMTS-clone: this one works fine since it was created a weak ago
- belle: this one was fine but suddenly also showed the problem
- mauer: the problem one

112   4819 1  4 Jul30 ?03:29:39 /usr/bin/kvm -S -M pc-1.1 
-enable-kvm -m 1024 -smp 1,sockets=1,cores=1,threads=1 -name UMTS-clone -uuid 
e49502f1-0c74-2a60-99dc-7602da5ee640 -no-user-config -nodefaults -chardev 
socket,id=charmonitor,path=/var/lib/libvirt/qemu/UMTS-clone.monitor,server,nowait
 -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown 
-device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive 
file=/dev/VGNEO/LV_V_UMTS-clone,if=none,id=drive-virtio-disk0,format=raw,cache=writeback
 -device 
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
 -drive 
file=/home/folkert/ISOs/wheezy.iso,if=none,id=drive-ide0-1-0,readonly=on,format=raw
 -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev 
tap,fd=20,id=hostnet0,vhost=on,vhostfd=21 -device 
virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:09:3b:b6,bus=pci.0,addr=0x3 
-chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 
-vnc 127.0.0.1:0,password -vga cirrus -device 
usb-host,hostbus=6,hostaddr=5,id=hostdev0 -device 
virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6
112  10065 1 11 Jul30 ?07:46:16 /usr/bin/kvm -S -M pc-1.1 
-enable-kvm -m 8192 -smp 12,sockets=12,cores=1,threads=1 -name belle -uuid 
16b704d7-5fbd-d67b-71e6-0d6b43f1bc0a -no-user-config -nodefaults -chardev 
socket,id=charmonitor,path=/var/lib/libvirt/qemu/belle.monitor,server,nowait 
-mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime 
-no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive 
file=/dev/VGNEO/LV_V_BELLE,if=none,id=drive-virtio-disk0,format=raw -device 
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
 -drive 
file=/dev/VGNEO/LV_V_BELLE_OS,if=none,id=drive-virtio-disk1,format=raw,cache=writeback
 -device 
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x8,drive=drive-virtio-disk1,id=virtio-disk1
 -drive 
file=/dev/VGJOURNAL/LV_J_BELLE,if=none,id=drive-ide0-0-0,format=raw,cache=writeback
 -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -drive 
if=none,id=drive-ide0-1-0,readonly=on,format=raw,cache=none -device 
ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev 
tap,fd=26,id=hostnet0,vhost=on,vhostfd=27 -device 
virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:75:4a:6f,bus=pci.0,addr=0x3 
-netdev tap,fd=28,id=hostnet1,vhost=on,vhostfd=29 -device 
virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:0a:6e:de,bus=pci.0,addr=0x7 
-chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 
-device usb-tablet,id=input0 -vnc 127.0.0.1:1,password -vga cirrus -device 
intel-hda,id=sound0,bus=pci.0,addr=0x4 -device 
hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -device 
virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6
root 13116 12830  0 19:54 pts/800:00:00 grep qemu
112  23453 1 57 13:16 ?03:46:51 /usr/bin/kvm -S -M pc-1.1 
-enable-kvm -m 8192 -smp 8,maxcpus=12,sockets=12,cores=1,threads=1 -name mauer 
-uuid 3a8452e6-81af-b185-63b6-2b32be17ed87 -no-user-config -nodefaults -chardev 
socket,id=charmonitor,path=/var/lib/libvirt/qemu/mauer.monitor,server,nowait 
-mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown 
-device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive 
file=/dev/VGNEO/LV_V_MAUER,if=none,id=drive-virtio-disk0,format=raw,cache=writeback
 -device 
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x8,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
 -drive 
file=/dev/VGJOURNAL/LV_J_MAUER,if=none,id=drive-virtio-disk1,format=raw,cache=writethrough
 -device 
virtio-blk-pci,scsi=off,bus=pci.0,addr=0xa,drive=drive-virtio-disk1,id=virtio-disk1
 -drive if=none,id=drive-ide0-1-0,readonly=on,format=raw -device 
ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev 
tap,fd=26,id=hostnet0,vhost=on,vhostfd=27 -device 
virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:86:d9:1f,bus=pci.0,addr=0x3 
-netdev tap,fd=28,id=hostnet1,vhost=on,vhostfd=29 -device 
virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:a3:12:8a,bus=pci.0,addr=0x4 
-netdev tap,fd=30,id=hostnet2,vhost=on,vhostfd=31 -

Re: RFC: vfio-pci API for PCI bus/slot (hot) resets

2013-08-02 Thread Alex Williamson
On Fri, 2013-08-02 at 15:10 +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2013-08-01 at 16:18 -0600, Alex Williamson wrote:
> > vfio-pci needs to support an interface to do hot resets (PCI parent
> > bridge secondary bus reset).   We need this to support reset of
> > co-assigned devices where one or more of the devices does not support
> > function level reset.  In particular, discrete graphics cards typically
> > have no reset options other than doing a link reset.  
> 
> Link reset != hot reset but yeah I see your point. There is more too,
> there is fundamental reset which may or may not be something we can
> control for individual slots (ie, switch legs) depending on platform
> specific stuff...
> 
> > What I have below
> > is a bit awkward, so I welcome other ideas to accomplish this goal.
> > I've been using a "blind" interface based on all affected devices
> > belonging to the same VFIO container for current VGA testing.  This is
> > ok when all you want to do is VGA, but I'd really like to make use of
> > this any time a device doesn't support a function level reset.  I've
> > posted a series to the PCI list to add bus and slot reset interfaces to
> > PCI-core, this API is how we expose that through VFIO to a user.  Please
> > comment.  Thanks,
> 
> Also in some cases, at least for us, we have a problem where the scope
> of the reset can be larger than the group ... in this case I think we
> need to forbid the reset.

Interesting, I would have ventured to guess that resets are always
contained within a single group given the PE isolation domains rooted at
a PHB.  Why disallow it though?  What I propose below would still allow
it if the user can prove that they own all of the affected groups.  Even
on x86, I think it's too restrictive to only allow reset if the affect
is contained within a group.  It's just too easy to need it for a
multi-function device.  We could hope that a device that implements ACS
also implements some sort of FLR, but that may just be wishful thinking.

> For us, it's also tied into our "EEH" error handling. IE. The way the
> guest will request some of these things is going to be via firmware APIs
> that we have yet to implement, but that we were also thinking of
> implementing entire in the kernel rather than qemu for various
> reasons... IE. There is more to error handling & recovery in our case at
> least than AER and reset :-)

I've been focused on reset for device re-initialization and
repeatability, but that's a good point, we need to consider error
recovery as a use of this interface as well.

> I need to spend more time reading your proposal and see how it can work
> for us (or not...)... but we might end up just doing our own thing that
> carries the whole EEH API instead.

Obviously it would be great if it could work for you, but regardless of
whether you intend to use it I'd appreciate feedback.  Thanks,

Alex

> > ---
> > Mechanism to do PCI hot resets through VFIO:
> > 
> > VFIO is fundamentally an IOMMU group and device level interface.
> > There's no concept of buses, slots, or hierarchies of devices.  There
> > are only IOMMU group and devices.  A bus (or slot) may contain exactly
> > one IOMMU group, multiple IOMMU groups, or a portion of an IOMMU group.
> > An IOMMU group may contain one or more devices.
> > 
> > The first question is perhaps where should we create the interface to do
> > a PCI hot reset.  Assuming an ioctl interface, our choices are the
> > group, the container, or the device file descriptors.  Groups and
> > containers are not PCI specific, so an extension on either of those
> > doesn't make much sense.  They also don't have much granularity if your
> > goal is to do a hot reset on the smallest subset of devices you can.
> > Therefore the only choice seems to be a VFIO device level interface.
> > 
> > The fact that a hot reset affects multiple devices also raises concerns.
> > How do we make sure a user has sufficient access/privilege to perform
> > this operation?  If all of the affected devices are within the same
> > group, then we know the user already "owns" all those devices.  Using
> > groups as the boundary excludes a number of use cases though.  The user
> > would need to prove that they also own the other groups or devices that
> > are affected by the reset.  This might be multiple groups, so the ioctl
> > quickly grows to requiring a list of file descriptors be passed for
> > validation.
> > 
> > We already use the group file descriptor as a unit of ownership for
> > enabling the container, so it seems like it would make sense to use it
> > here too.  The alternative is a device file descriptor, but groups may
> > encompass devices the user doesn't care to use and we don't want to
> > require that they open a file descriptor simply to perform a hot reset.
> > Groups can also contain devices that the user cannot open, for instance
> > those owned by VFIO "compatible" drivers like pci-stub or pcieport.
> > 
> > The user also needs

Re: [PATCH 03/12] KVM: MMU: lazily drop large spte

2013-08-02 Thread Xiao Guangrong

On Aug 2, 2013, at 10:55 PM, Marcelo Tosatti  wrote:

> On Tue, Jul 30, 2013 at 09:02:01PM +0800, Xiao Guangrong wrote:
>> Currently, kvm zaps the large spte if write-protected is needed, the later
>> read can fault on that spte. Actually, we can make the large spte readonly
>> instead of making them un-present, the page fault caused by read access can
>> be avoided
>> 
>> The idea is from Avi:
>> | As I mentioned before, write-protecting a large spte is a good idea,
>> | since it moves some work from protect-time to fault-time, so it reduces
>> | jitter.  This removes the need for the return value.
>> 
>> [
>>  It has fixed the issue reported in 6b73a9606 by stopping fast page fault
>>  marking the large spte to writable
>> ]
> 
> Xiao,
> 
> Can you please write a comment explaining why are the problems 
> with shadow vs large read-only sptes (can't recall anymore),
> and then why it is now safe to do it.

Hi Marcelo,

Thanks for your review.  Yes. The bug reported in  6b73a9606 is, in this patch,
we mark the large spte as readonly when the pages are dirt logged and the
readonly spte can be set to writable by fast page fault, but on that path, it 
failed
to check dirty logging, so it will set the large spte to writable but only set 
the first
page to the dirty bitmap.

For example:

1): KVM maps 0 ~ 2M memory to guest which is pointed by SPTE and SPTE
 is writable.

2): KVM dirty log 0 ~ 2M,  then set SPTE to readonly

3): fast page fault set SPTE to writable and set page 0 to the dirty bitmap.

Then 4K ~ 2M memory is not dirty logged.

In this version, we let fast page fault do not mark large spte to writable if
its page are dirty logged.  But it is still not safe as you pointed out.

>> 
>> 
>>  /*
>> + * Can not map the large spte to writable if the page is dirty
>> + * logged.
>> + */
>> +if (sp->role.level > PT_PAGE_TABLE_LEVEL && force_pt_level)
>> +goto exit;
>> +
> 
> It is not safe to derive slot->dirty_bitmap like this: 
> since dirty log is enabled via RCU update, "is dirty bitmap enabled"
> info could be stale by the time you check it here via the parameter,
> so you can instantiate a large spte (because force_pt_level == false),
> while you should not.

Good catch! This is true even if we enable dirty log under the protection
of mmu lock.

How about let the fault page fault only fix the small spte, that is changing
the code to:
if (sp->role.level > PT_PAGE_TABLE_LEVEL)
goto exit;
?


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: FAQ on linux-kvm.org has broken link

2013-08-02 Thread Stefan Hajnoczi
On Fri, Aug 2, 2013 at 1:37 PM, folkert  wrote:
>> If the result is #2, check firewalls on host and guest.  Also try the
>> following inside the guest: disable the network interface, rmmod
>> virtio_net, modprobe virtio_net again, and bring the network up.
>
> I pinged, I sniffed, I updated the bug report (it also happens with
> other guests now!).
>
> And the bring down interfaces / rmmod / modprobe / ifup works!
> So I think something is wrong with virtio_net!
>
> What shall I do now?

Hi Folkert,
I wrote a reply earlier today but it was rejected because I not have a
kernel.org bugzilla account.  If you don't mind let's continue
discussing on this mailing list - we don't know whether this is a
kernel bug yet anyway.

A couple of questions:

Please post the QEMU command-line from the host (ps aux | grep qemu).

Please confirm that vhost_net is being used on the host (lsmod | grep
vhost_net).

Please double-check both guest and host dmesg for any suspicious
messages.  It could be about networking, out-of-memory, or kernel
backtraces.

Stefan
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 03/12] KVM: MMU: lazily drop large spte

2013-08-02 Thread Marcelo Tosatti
On Tue, Jul 30, 2013 at 09:02:01PM +0800, Xiao Guangrong wrote:
> Currently, kvm zaps the large spte if write-protected is needed, the later
> read can fault on that spte. Actually, we can make the large spte readonly
> instead of making them un-present, the page fault caused by read access can
> be avoided
> 
> The idea is from Avi:
> | As I mentioned before, write-protecting a large spte is a good idea,
> | since it moves some work from protect-time to fault-time, so it reduces
> | jitter.  This removes the need for the return value.
> 
> [
>   It has fixed the issue reported in 6b73a9606 by stopping fast page fault
>   marking the large spte to writable
> ]

Xiao,

Can you please write a comment explaining why are the problems 
with shadow vs large read-only sptes (can't recall anymore),
and then why it is now safe to do it.

Comments below.

> Signed-off-by: Xiao Guangrong 
> ---
>  arch/x86/kvm/mmu.c | 36 +---
>  1 file changed, 17 insertions(+), 19 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index cf163ca..35d4b50 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1181,8 +1181,7 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 
> *sptep)
>  
>  /*
>   * Write-protect on the specified @sptep, @pt_protect indicates whether
> - * spte writ-protection is caused by protecting shadow page table.
> - * @flush indicates whether tlb need be flushed.
> + * spte write-protection is caused by protecting shadow page table.
>   *
>   * Note: write protection is difference between drity logging and spte
>   * protection:
> @@ -1191,10 +1190,9 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 
> *sptep)
>   * - for spte protection, the spte can be writable only after unsync-ing
>   *   shadow page.
>   *
> - * Return true if the spte is dropped.
> + * Return true if tlb need be flushed.
>   */
> -static bool
> -spte_write_protect(struct kvm *kvm, u64 *sptep, bool *flush, bool pt_protect)
> +static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool pt_protect)
>  {
>   u64 spte = *sptep;
>  
> @@ -1204,17 +1202,11 @@ spte_write_protect(struct kvm *kvm, u64 *sptep, bool 
> *flush, bool pt_protect)
>  
>   rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, *sptep);
>  
> - if (__drop_large_spte(kvm, sptep)) {
> - *flush |= true;
> - return true;
> - }
> -
>   if (pt_protect)
>   spte &= ~SPTE_MMU_WRITEABLE;
>   spte = spte & ~PT_WRITABLE_MASK;
>  
> - *flush |= mmu_spte_update(sptep, spte);
> - return false;
> + return mmu_spte_update(sptep, spte);
>  }
>  
>  static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
> @@ -1226,11 +1218,8 @@ static bool __rmap_write_protect(struct kvm *kvm, 
> unsigned long *rmapp,
>  
>   for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
>   BUG_ON(!(*sptep & PT_PRESENT_MASK));
> - if (spte_write_protect(kvm, sptep, &flush, pt_protect)) {
> - sptep = rmap_get_first(*rmapp, &iter);
> - continue;
> - }
>  
> + flush |= spte_write_protect(kvm, sptep, pt_protect);
>   sptep = rmap_get_next(&iter);
>   }
>  
> @@ -2701,6 +2690,8 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, 
> int write,
>   break;
>   }
>  
> + drop_large_spte(vcpu, iterator.sptep);
> +
>   if (!is_shadow_present_pte(*iterator.sptep)) {
>   u64 base_addr = iterator.addr;
>  
> @@ -2855,7 +2846,7 @@ fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct 
> kvm_mmu_page *sp,
>   * - false: let the real page fault path to fix it.
>   */
>  static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
> - u32 error_code)
> + u32 error_code, bool force_pt_level)
>  {
>   struct kvm_shadow_walk_iterator iterator;
>   struct kvm_mmu_page *sp;
> @@ -2884,6 +2875,13 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, 
> gva_t gva, int level,
>   goto exit;
>  
>   /*
> +  * Can not map the large spte to writable if the page is dirty
> +  * logged.
> +  */
> + if (sp->role.level > PT_PAGE_TABLE_LEVEL && force_pt_level)
> + goto exit;
> +

It is not safe to derive slot->dirty_bitmap like this: 
since dirty log is enabled via RCU update, "is dirty bitmap enabled"
info could be stale by the time you check it here via the parameter,
so you can instantiate a large spte (because force_pt_level == false),
while you should not.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] KVM call agenda for 2013-08-06

2013-08-02 Thread Eduardo Habkost
On Tue, Jul 23, 2013 at 06:31:45PM +0200, Juan Quintela wrote:
> 
> Hi
> 
> Please, send any topic that you are interested in covering.

* libvirt requirements for:
   * Checking which CPU features are exposed/required by each CPU model
 + machine-type
   * Checking which CPU features are available on a given host
 (considering QEMU + kernel + host CPU capabilities)

I hope Jiri and Daniel can join the call.

-- 
Eduardo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug 60620] guest loses frequently (multiple times per day!) connectivity to network device

2013-08-02 Thread Stefan Hajnoczi
On Fri, Aug 02, 2013 at 11:28:45AM +, bugzilla-dae...@bugzilla.kernel.org 
wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=60620
> 
> --- Comment #9 from Folkert van Heusden  ---
> Good news!
> If I
> 
> - bring down all interfaces in the guest (ifdown eth0...)
> - rmmod virtio_net
> - modprobe virtio_net
> - bring up the interfaces again
> + it all works again!
> 
> So hopefully this helps the bug hunt?

Hi Folkert,
Please post the QEMU command-line on the host (ps aux | grep qemu) and
the output of lsmod | grep vhost_net.

Since reinitializing the guest driver "fixes" the issue, we now need to
find out whether the guest or the host side got stuck.

I think I asked before, but please also post any relevant lines from
dmesg on the host and from the guest.  Examples would include networking
error messages, kernel backtraces, or out-of-memory errors.

Thanks,
Stefan
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM Test report, kernel bf640876... qemu 0779caeb...

2013-08-02 Thread Gleb Natapov
On Fri, Aug 02, 2013 at 07:19:28AM +, Ren, Yongjie wrote:
> Hi All,
> 
> This is KVM upstream test result against kvm.git next branch and qemu-kvm.git 
> uq/master branch.
>  kvm.git next branch: bf640876e21fe603f7f52b0c27d66b7716da0384  based 
> on kernel 3.11.0-rc1
>  qemu-kvm.git uq/master branch: 
> 0779caeb1a17f4d3ed14e2925b36ba09b084fb7b
> 
> We found two new bugs and no fixed bug in the past two weeks. 
> As all of the two new bugs are introduced by Arthur, I also CCed him. 
> Arthur, can you have look at the new bugs?
> 
> New issues (2):
> 1. 64bit RHEL6.4 guest crashes and reboots continuously
>   https://bugs.launchpad.net/qemu-kvm/+bug/1207623
Weird. Can you ftrace it?

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: FAQ on linux-kvm.org has broken link

2013-08-02 Thread folkert
Hi,

> If the result is #2, check firewalls on host and guest.  Also try the
> following inside the guest: disable the network interface, rmmod
> virtio_net, modprobe virtio_net again, and bring the network up.

I pinged, I sniffed, I updated the bug report (it also happens with
other guests now!).

And the bring down interfaces / rmmod / modprobe / ifup works!
So I think something is wrong with virtio_net!

What shall I do now?


Folkert van Heusden

-- 
Nagios user? Check out CoffeeSaint - the versatile Nagios status
viewer! http://www.vanheusden.com/java/CoffeeSaint/
--
Phone: +31-6-41278122, PGP-key: 1F28D8AE, www.vanheusden.com
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Bug 60620] guest loses frequently (multiple times per day!) connectivity to network device

2013-08-02 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=60620

--- Comment #9 from Folkert van Heusden  ---
Good news!
If I

- bring down all interfaces in the guest (ifdown eth0...)
- rmmod virtio_net
- modprobe virtio_net
- bring up the interfaces again
+ it all works again!

So hopefully this helps the bug hunt?

-- 
You are receiving this mail because:
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Bug 60620] guest loses frequently (multiple times per day!) connectivity to network device

2013-08-02 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=60620

--- Comment #8 from Folkert van Heusden  ---
And now the firewall lost connectivity again.
- I removed all iptables rules and set the default to ACCEPT
- from within the guest I started tcpdump
- from the host I started ping
- from the guest I see the pings coming in but no ping reply goes out! the same
happened on the other guest that had this problem 10 minutes ago
- after a while the host starts doing arp requests and also does not get any
replies from it

-- 
You are receiving this mail because:
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC V11 15/18] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor

2013-08-02 Thread Raghavendra K T

On 08/02/2013 03:24 PM, Gleb Natapov wrote:

On Fri, Aug 02, 2013 at 11:25:39AM +0200, Ingo Molnar wrote:

Ingo,

Do you have any concerns reg this series? please let me know if this
looks good now to you.


I'm inclined to NAK it for excessive quotation - who knows how many people
left the discussion in disgust? Was it done to drive away as many
reviewers as possible?


Ingo, Peter,
Sorry for the confusion caused because of nesting. I should have trimmed 
it as Peter also pointed in other thread.

will ensure that is future mails.



Anyway, see my other reply, the measurement results seem hard to interpret
and inconclusive at the moment.


As Gleb already pointed, patch1-17 as a whole giving good performance 
improvement. It was only the patch 18, Gleb had concerns.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Bug 60620] guest loses frequently (multiple times per day!) connectivity to network device

2013-08-02 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=60620

--- Comment #7 from Folkert van Heusden  ---
Moments ago an other(!) guest on this system lost its connectivity.
I did a tcpdump on its network interface from within the host/hypervisor/dom0
and saw that traffic goes in but nothing comes out - e.g. no traffic from the
client comes out of the 'vnet1' interface. While I ran the tcpdump I also ran a
ping from within that guest which indeed showed only timeouts.
As there's _no_ firewall on this guest, we can rule out the possibility that it
is a firewall issue.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 10/15] nEPT: Nested INVEPT

2013-08-02 Thread Gleb Natapov
On Fri, Aug 02, 2013 at 04:06:00PM +0800, Xiao Guangrong wrote:
> On 08/01/2013 10:08 PM, Gleb Natapov wrote:
> 
> > +/* Emulate the INVEPT instruction */
> > +static int handle_invept(struct kvm_vcpu *vcpu)
> > +{
> > +   u32 vmx_instruction_info;
> > +   bool ok;
> > +   unsigned long type;
> > +   gva_t gva;
> > +   struct x86_exception e;
> > +   struct {
> > +   u64 eptp, gpa;
> > +   } operand;
> > +
> > +   if (!(nested_vmx_secondary_ctls_high & SECONDARY_EXEC_ENABLE_EPT) ||
> > +   !(nested_vmx_ept_caps & VMX_EPT_INVEPT_BIT)) {
> > +   kvm_queue_exception(vcpu, UD_VECTOR);
> > +   return 1;
> > +   }
> > +
> > +   if (!nested_vmx_check_permission(vcpu))
> > +   return 1;
> > +
> > +   if (!kvm_read_cr0_bits(vcpu, X86_CR0_PE)) {
> > +   kvm_queue_exception(vcpu, UD_VECTOR);
> > +   return 1;
> > +   }
> > +
> > +   /* According to the Intel VMX instruction reference, the memory
> > +* operand is read even if it isn't needed (e.g., for type==global)
> > +*/
> > +   vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
> > +   if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION),
> > +   vmx_instruction_info, &gva))
> > +   return 1;
> > +   if (kvm_read_guest_virt(&vcpu->arch.emulate_ctxt, gva, &operand,
> > +   sizeof(operand), &e)) {
> > +   kvm_inject_page_fault(vcpu, &e);
> > +   return 1;
> > +   }
> > +
> > +   type = kvm_register_read(vcpu, (vmx_instruction_info >> 28) & 0xf);
> > +
> > +   switch (type) {
> > +   case VMX_EPT_EXTENT_GLOBAL:
> > +   case VMX_EPT_EXTENT_CONTEXT:
> > +   ok = !!(nested_vmx_ept_caps &
> > +   (1UL << (type + VMX_EPT_EXTENT_SHIFT)));
> > +   break;
> > +   default:
> > +   ok = false;
> > +   }
> > +
> > +   if (ok) {
> > +   kvm_mmu_sync_roots(vcpu);
> > +   kvm_mmu_flush_tlb(vcpu);
> > +   nested_vmx_succeed(vcpu);
> > +   }
> > +   else
> > +   nested_vmx_failValid(vcpu,
> > +   VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
> > +
> > +   skip_emulated_instruction(vcpu);
> > +   return 1;
> 
> Can this code be changed to:
> 
>   switch (type) {
>   case VMX_EPT_EXTENT_GLOBAL:
>   case VMX_EPT_EXTENT_CONTEXT:
>   if (nested_vmx_ept_caps &
>   (1UL << (type + VMX_EPT_EXTENT_SHIFT) {
>   kvm_mmu_sync_roots(vcpu);
>   kvm_mmu_flush_tlb(vcpu);
>   nested_vmx_succeed(vcpu);
>   break;
>   }
>   default:
>   nested_vmx_failValid(vcpu,
>   VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
>   }
> ?
> That's clearer i think.
> 
> Also, we can sync the shadow page table only if the current eptp is the 
> required
> eptp, that means:
> 
> if (type == GLOBAL || operand.eptp == nested_ept_get_cr3(vcpu)) {
>   kvm_mmu_sync_roots(vcpu);
>   ..
> }
Good point. Will do.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC V11 15/18] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor

2013-08-02 Thread Gleb Natapov
On Fri, Aug 02, 2013 at 11:25:39AM +0200, Ingo Molnar wrote:
> > Ingo,
> > 
> > Do you have any concerns reg this series? please let me know if this 
> > looks good now to you.
> 
> I'm inclined to NAK it for excessive quotation - who knows how many people 
> left the discussion in disgust? Was it done to drive away as many 
> reviewers as possible?
> 
> Anyway, see my other reply, the measurement results seem hard to interpret 
> and inconclusive at the moment.
> 
That result was only for patch 18 of the series, not pvspinlock in
general.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC V11 15/18] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor

2013-08-02 Thread Raghavendra K T

On 08/02/2013 02:53 PM, Ingo Molnar wrote:


* Raghavendra K T  wrote:


On 08/01/2013 02:34 PM, Raghavendra K T wrote:

On 08/01/2013 01:15 PM, Gleb Natapov wrote:

Shall I consider this as an ack for kvm part?


For everything except 18/18. For that I still want to see numbers. But
18/18 is pretty independent from the reset of the series so it should
not stop the reset from going in.


Yes. agreed.
I am going to evaluate patch 18 separately and come with results for
that. Now we can consider only 1-17 patches.



Gleb,

32 core machine with HT off 32 vcpu guests.
base = 3.11-rc + patch 1 -17 pvspinlock_v11
patched = base + patch 18

+---+---+---++---+
   dbench  (Throughput in MB/sec higher is better)
+---+---+---++---+
   base  stdev   patchedstdev   %improvement
+---+---+---++---+
1x 14584.3800   146.9074   14705.1000   163.1060 0.82773
2x  1713.730032.87501717.320045.5979 0.20948
3x   967.821242.0257 971.885518.8532 0.41994
4x   685.276425.7150 694.5881 8.3907 1.35882
+---+---+---++---+


Please list stddev in percentage as well ...


Sure. will do this from next time.



a blind stab gave me these figures:


   base  stdev   patchedstdev   %improvement
3x   967.82124.3% 971.8855  1.8% 0.4


That makes the improvement an order of magnitude smaller than the noise of
the measurement ... i.e. totally inconclusive.


Okay. agreed.

I always had seen the positive effect of the patch since it uses ple
handler heuristics, and thus avoiding the directed yield to vcpu's in
halt handler. But the current results clearly does not conclude
anything favoring that. :(

So please drop patch 18 for now.



Also please cut the excessive decimal points: with 2-4% noise what point
is there in 5 decimal point results??


Yes.

Ingo, do you think now the patch series (patch 1 to 17) are in good
shape? or please let me know if you have any concerns to be
addressed.



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC V11 15/18] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor

2013-08-02 Thread Ingo Molnar

* Raghavendra K T  wrote:

> On 07/31/2013 11:54 AM, Gleb Natapov wrote:
> >On Tue, Jul 30, 2013 at 10:13:12PM +0530, Raghavendra K T wrote:
> >>On 07/25/2013 03:08 PM, Raghavendra K T wrote:
> >>>On 07/25/2013 02:45 PM, Gleb Natapov wrote:
> On Thu, Jul 25, 2013 at 02:47:37PM +0530, Raghavendra K T wrote:
> >On 07/24/2013 06:06 PM, Raghavendra K T wrote:
> >>On 07/24/2013 05:36 PM, Gleb Natapov wrote:
> >>>On Wed, Jul 24, 2013 at 05:30:20PM +0530, Raghavendra K T wrote:
> On 07/24/2013 04:09 PM, Gleb Natapov wrote:
> >On Wed, Jul 24, 2013 at 03:15:50PM +0530, Raghavendra K T wrote:
> >>On 07/23/2013 08:37 PM, Gleb Natapov wrote:
> >>>On Mon, Jul 22, 2013 at 11:50:16AM +0530, Raghavendra K T wrote:
> +static void kvm_lock_spinning(struct arch_spinlock *lock,
> __ticket_t want)
> >>[...]

[ a few hundred lines of unnecessary quotation snipped. ]

> Ingo,
> 
> Do you have any concerns reg this series? please let me know if this 
> looks good now to you.

I'm inclined to NAK it for excessive quotation - who knows how many people 
left the discussion in disgust? Was it done to drive away as many 
reviewers as possible?

Anyway, see my other reply, the measurement results seem hard to interpret 
and inconclusive at the moment.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 02/15] nEPT: Fix cr3 handling in nested exit and entry

2013-08-02 Thread Xiao Guangrong
On 08/01/2013 10:08 PM, Gleb Natapov wrote:
> From: Nadav Har'El 
> 
> The existing code for handling cr3 and related VMCS fields during nested
> exit and entry wasn't correct in all cases:
> 
> If L2 is allowed to control cr3 (and this is indeed the case in nested EPT),
> during nested exit we must copy the modified cr3 from vmcs02 to vmcs12, and
> we forgot to do so. This patch adds this copy.
> 
> If L0 isn't controlling cr3 when running L2 (i.e., L0 is using EPT), and
> whoever does control cr3 (L1 or L2) is using PAE, the processor might have
> saved PDPTEs and we should also save them in vmcs12 (and restore later).

Reviewed-by: Xiao Guangrong 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC V11 15/18] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor

2013-08-02 Thread Ingo Molnar

* Raghavendra K T  wrote:

> On 08/01/2013 02:34 PM, Raghavendra K T wrote:
> >On 08/01/2013 01:15 PM, Gleb Natapov wrote:
> >>>Shall I consider this as an ack for kvm part?
> >>>
> >>For everything except 18/18. For that I still want to see numbers. But
> >>18/18 is pretty independent from the reset of the series so it should
> >>not stop the reset from going in.
> >
> >Yes. agreed.
> >I am going to evaluate patch 18 separately and come with results for
> >that. Now we can consider only 1-17 patches.
> >
> 
> Gleb,
> 
> 32 core machine with HT off 32 vcpu guests.
> base = 3.11-rc + patch 1 -17 pvspinlock_v11
> patched = base + patch 18
> 
> +---+---+---++---+
>   dbench  (Throughput in MB/sec higher is better)
> +---+---+---++---+
>   base  stdev   patchedstdev   %improvement
> +---+---+---++---+
> 1x 14584.3800   146.9074   14705.1000   163.1060 0.82773
> 2x  1713.730032.87501717.320045.5979 0.20948
> 3x   967.821242.0257 971.885518.8532 0.41994
> 4x   685.276425.7150 694.5881 8.3907 1.35882
> +---+---+---++---+

Please list stddev in percentage as well ...

a blind stab gave me these figures:

>   base  stdev   patchedstdev   %improvement
> 3x   967.82124.3% 971.8855  1.8% 0.4

That makes the improvement an order of magnitude smaller than the noise of 
the measurement ... i.e. totally inconclusive.

Also please cut the excessive decimal points: with 2-4% noise what point 
is there in 5 decimal point results??

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 13/15] nEPT: Advertise EPT to L1

2013-08-02 Thread Xiao Guangrong
On 08/01/2013 10:08 PM, Gleb Natapov wrote:
> From: Nadav Har'El 
> 
> Advertise the support of EPT to the L1 guest, through the appropriate MSR.
> 
> This is the last patch of the basic Nested EPT feature, so as to allow
> bisection through this patch series: The guest will not see EPT support until
> this last patch, and will not attempt to use the half-applied feature.

Reviewed-by: Xiao Guangrong 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 10/15] nEPT: Nested INVEPT

2013-08-02 Thread Xiao Guangrong
On 08/01/2013 10:08 PM, Gleb Natapov wrote:

> +/* Emulate the INVEPT instruction */
> +static int handle_invept(struct kvm_vcpu *vcpu)
> +{
> + u32 vmx_instruction_info;
> + bool ok;
> + unsigned long type;
> + gva_t gva;
> + struct x86_exception e;
> + struct {
> + u64 eptp, gpa;
> + } operand;
> +
> + if (!(nested_vmx_secondary_ctls_high & SECONDARY_EXEC_ENABLE_EPT) ||
> + !(nested_vmx_ept_caps & VMX_EPT_INVEPT_BIT)) {
> + kvm_queue_exception(vcpu, UD_VECTOR);
> + return 1;
> + }
> +
> + if (!nested_vmx_check_permission(vcpu))
> + return 1;
> +
> + if (!kvm_read_cr0_bits(vcpu, X86_CR0_PE)) {
> + kvm_queue_exception(vcpu, UD_VECTOR);
> + return 1;
> + }
> +
> + /* According to the Intel VMX instruction reference, the memory
> +  * operand is read even if it isn't needed (e.g., for type==global)
> +  */
> + vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
> + if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION),
> + vmx_instruction_info, &gva))
> + return 1;
> + if (kvm_read_guest_virt(&vcpu->arch.emulate_ctxt, gva, &operand,
> + sizeof(operand), &e)) {
> + kvm_inject_page_fault(vcpu, &e);
> + return 1;
> + }
> +
> + type = kvm_register_read(vcpu, (vmx_instruction_info >> 28) & 0xf);
> +
> + switch (type) {
> + case VMX_EPT_EXTENT_GLOBAL:
> + case VMX_EPT_EXTENT_CONTEXT:
> + ok = !!(nested_vmx_ept_caps &
> + (1UL << (type + VMX_EPT_EXTENT_SHIFT)));
> + break;
> + default:
> + ok = false;
> + }
> +
> + if (ok) {
> + kvm_mmu_sync_roots(vcpu);
> + kvm_mmu_flush_tlb(vcpu);
> + nested_vmx_succeed(vcpu);
> + }
> + else
> + nested_vmx_failValid(vcpu,
> + VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
> +
> + skip_emulated_instruction(vcpu);
> + return 1;

Can this code be changed to:

switch (type) {
case VMX_EPT_EXTENT_GLOBAL:
case VMX_EPT_EXTENT_CONTEXT:
if (nested_vmx_ept_caps &
(1UL << (type + VMX_EPT_EXTENT_SHIFT) {
kvm_mmu_sync_roots(vcpu);
kvm_mmu_flush_tlb(vcpu);
nested_vmx_succeed(vcpu);
break;
}
default:
nested_vmx_failValid(vcpu,
VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
}
?
That's clearer i think.

Also, we can sync the shadow page table only if the current eptp is the required
eptp, that means:

if (type == GLOBAL || operand.eptp == nested_ept_get_cr3(vcpu)) {
kvm_mmu_sync_roots(vcpu);
..
}

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 01/15] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1

2013-08-02 Thread Jan Kiszka
On 2013-08-02 09:27, Zhang, Yang Z wrote:
> Jan Kiszka wrote on 2013-08-02:
>> On 2013-08-02 05:04, Zhang, Yang Z wrote:
>>> Gleb Natapov wrote on 2013-08-01:
 From: Nadav Har'El 

 Recent KVM, since
 http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577
 switch the EFER MSR when EPT is used and the host and guest have 
 different NX bits. So if we add support for nested EPT (L1 guest 
 using EPT to run L2) and want to be able to run recent KVM as L1, we 
 need to allow L1 to use this EFER switching feature.

 To do this EFER switching, KVM uses VM_ENTRY/EXIT_LOAD_IA32_EFER if 
 available, and if it isn't, it uses the generic 
 VM_ENTRY/EXIT_MSR_LOAD. This patch adds support for the former (the 
 latter is still unsupported).

 Nested entry and exit emulation (prepare_vmcs_02 and 
 load_vmcs12_host_state,
 respectively) already handled VM_ENTRY/EXIT_LOAD_IA32_EFER correctly.
 So all that's left to do in this patch is to properly advertise this 
 feature to L1.

 Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are emulated by L0, 
 by using vmx_set_efer (which itself sets one of several vmcs02 
 fields), so we always support this feature, regardless of whether 
 the host supports it.

 Reviewed-by: Orit Wasserman 
 Signed-off-by: Nadav Har'El 
 Signed-off-by: Jun Nakajima 
 Signed-off-by: Xinhao Xu 
 Signed-off-by: Yang Zhang 
 Signed-off-by: Gleb Natapov 
 ---
  arch/x86/kvm/vmx.c |   23 ---
  1 file changed, 16 insertions(+), 7 deletions(-) diff --git 
 a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index e999dc7..27efa6a 
 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -2198,7 +2198,8 @@ static __init void
 nested_vmx_setup_ctls_msrs(void)
  #else
nested_vmx_exit_ctls_high = 0;
  #endif
 -  nested_vmx_exit_ctls_high |=
 VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR;
 +  nested_vmx_exit_ctls_high |=
 (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
 +VM_EXIT_LOAD_IA32_EFER);

/* entry controls */
rdmsr(MSR_IA32_VMX_ENTRY_CTLS,
 @@ -2207,8 +2208,8 @@ static __init void
 nested_vmx_setup_ctls_msrs(void)
nested_vmx_entry_ctls_low = VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
nested_vmx_entry_ctls_high &=   VM_ENTRY_LOAD_IA32_PAT |
  VM_ENTRY_IA32E_MODE;
 -  nested_vmx_entry_ctls_high |=
 VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
 -
 +  nested_vmx_entry_ctls_high |=
 (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR |
 + VM_ENTRY_LOAD_IA32_EFER);
>>> Just saw it, we didn't expose bit22 save VMX-preemption timer in 
>>> vm-exit
>> control but we already allowed guest to set active VMX-preemption 
>> timer in pin based vm-execution conrols. This is wrong.
>>
>> Does the presence of preemption timer support imply that saving its 
>> value is also supported? Then we could demand this combination (ie. do 
>> not expose preemption timer support at all to L1 if value saving is
>> missing) and build our preemption timer emulation on top.
>>
> I don't see we saved the preemption timer value to vmcs12 in 
> prepare_vmcs12(). Will it be saved automatically?

No. As I said, there is more broken with our preemption timer emulation.

Jan

> 
>> There is more broken /wrt VMX preemption timer, patches are welcome.
>> Arthur will also try to develop test cases for it. But that topic is 
>> unrelated to this series.
>>
>> Jan
>>
>>>
/* cpu-based controls */
rdmsr(MSR_IA32_VMX_PROCBASED_CTLS,
nested_vmx_procbased_ctls_low,
 nested_vmx_procbased_ctls_high);
 @@ -7529,10 +7530,18 @@ static void prepare_vmcs02(struct kvm_vcpu 
 *vcpu, struct vmcs12 *vmcs12)
vcpu->arch.cr0_guest_owned_bits &= ~vmcs12->cr0_guest_host_mask;
vmcs_writel(CR0_GUEST_HOST_MASK,
 ~vcpu->arch.cr0_guest_owned_bits);

 -  /* Note: IA32_MODE, LOAD_IA32_EFER are modified by vmx_set_efer
 below */
 -  vmcs_write32(VM_EXIT_CONTROLS,
 -  vmcs12->vm_exit_controls | vmcs_config.vmexit_ctrl);
 -  vmcs_write32(VM_ENTRY_CONTROLS, vmcs12->vm_entry_controls |
 +  /* L2->L1 exit controls are emulated - the hardware exit is to L0 so
 +   * we should use its exit controls. Note that VM_EXIT_LOAD_IA32_EFER
 +   * bits are further modified by vmx_set_efer() below.
 +   */
 +  vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl);
>>> Should we mentioned that save vmx preemption bit must use host|guest, 
>>> not just host?
>>>
 +
 +  /* vmcs12's VM_ENTRY_LOAD_IA32_EFER and VM_ENTRY_IA32E_MODE
 are
 +   * emulated by vmx_set_efer(), below.
 +   */
 +  vmcs_write32(VM_ENTRY_CONTROLS,
 +  (vmcs12->vm_entry_controls & ~VM_ENTRY_LOAD_IA32_EFER &
 +  ~VM_ENTRY_IA32E_MODE) |
(v

RE: [PATCH v6 01/15] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1

2013-08-02 Thread Zhang, Yang Z
Jan Kiszka wrote on 2013-08-02:
> On 2013-08-02 05:04, Zhang, Yang Z wrote:
>> Gleb Natapov wrote on 2013-08-01:
>>> From: Nadav Har'El 
>>> 
>>> Recent KVM, since
>>> http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577
>>> switch the EFER MSR when EPT is used and the host and guest have 
>>> different NX bits. So if we add support for nested EPT (L1 guest 
>>> using EPT to run L2) and want to be able to run recent KVM as L1, we 
>>> need to allow L1 to use this EFER switching feature.
>>> 
>>> To do this EFER switching, KVM uses VM_ENTRY/EXIT_LOAD_IA32_EFER if 
>>> available, and if it isn't, it uses the generic 
>>> VM_ENTRY/EXIT_MSR_LOAD. This patch adds support for the former (the 
>>> latter is still unsupported).
>>> 
>>> Nested entry and exit emulation (prepare_vmcs_02 and 
>>> load_vmcs12_host_state,
>>> respectively) already handled VM_ENTRY/EXIT_LOAD_IA32_EFER correctly.
>>> So all that's left to do in this patch is to properly advertise this 
>>> feature to L1.
>>> 
>>> Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are emulated by L0, 
>>> by using vmx_set_efer (which itself sets one of several vmcs02 
>>> fields), so we always support this feature, regardless of whether 
>>> the host supports it.
>>> 
>>> Reviewed-by: Orit Wasserman 
>>> Signed-off-by: Nadav Har'El 
>>> Signed-off-by: Jun Nakajima 
>>> Signed-off-by: Xinhao Xu 
>>> Signed-off-by: Yang Zhang 
>>> Signed-off-by: Gleb Natapov 
>>> ---
>>>  arch/x86/kvm/vmx.c |   23 ---
>>>  1 file changed, 16 insertions(+), 7 deletions(-) diff --git 
>>> a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index e999dc7..27efa6a 
>>> 100644
>>> --- a/arch/x86/kvm/vmx.c
>>> +++ b/arch/x86/kvm/vmx.c
>>> @@ -2198,7 +2198,8 @@ static __init void
>>> nested_vmx_setup_ctls_msrs(void)
>>>  #else
>>> nested_vmx_exit_ctls_high = 0;
>>>  #endif
>>> -   nested_vmx_exit_ctls_high |=
>>> VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR;
>>> +   nested_vmx_exit_ctls_high |=
>>> (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
>>> + VM_EXIT_LOAD_IA32_EFER);
>>> 
>>> /* entry controls */
>>> rdmsr(MSR_IA32_VMX_ENTRY_CTLS,
>>> @@ -2207,8 +2208,8 @@ static __init void
>>> nested_vmx_setup_ctls_msrs(void)
>>> nested_vmx_entry_ctls_low = VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
>>> nested_vmx_entry_ctls_high &=   VM_ENTRY_LOAD_IA32_PAT |
>>>  VM_ENTRY_IA32E_MODE;
>>> -   nested_vmx_entry_ctls_high |=
>>> VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
>>> -
>>> +   nested_vmx_entry_ctls_high |=
>>> (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR |
>>> +  VM_ENTRY_LOAD_IA32_EFER);
>> Just saw it, we didn't expose bit22 save VMX-preemption timer in 
>> vm-exit
> control but we already allowed guest to set active VMX-preemption 
> timer in pin based vm-execution conrols. This is wrong.
> 
> Does the presence of preemption timer support imply that saving its 
> value is also supported? Then we could demand this combination (ie. do 
> not expose preemption timer support at all to L1 if value saving is
> missing) and build our preemption timer emulation on top.
> 
I don't see we saved the preemption timer value to vmcs12 in prepare_vmcs12(). 
Will it be saved automatically?

> There is more broken /wrt VMX preemption timer, patches are welcome.
> Arthur will also try to develop test cases for it. But that topic is 
> unrelated to this series.
> 
> Jan
> 
>> 
>>> /* cpu-based controls */
>>> rdmsr(MSR_IA32_VMX_PROCBASED_CTLS,
>>> nested_vmx_procbased_ctls_low,
>>> nested_vmx_procbased_ctls_high);
>>> @@ -7529,10 +7530,18 @@ static void prepare_vmcs02(struct kvm_vcpu 
>>> *vcpu, struct vmcs12 *vmcs12)
>>> vcpu->arch.cr0_guest_owned_bits &= ~vmcs12->cr0_guest_host_mask;
>>> vmcs_writel(CR0_GUEST_HOST_MASK,
>>> ~vcpu->arch.cr0_guest_owned_bits);
>>> 
>>> -   /* Note: IA32_MODE, LOAD_IA32_EFER are modified by vmx_set_efer
>>> below */
>>> -   vmcs_write32(VM_EXIT_CONTROLS,
>>> -   vmcs12->vm_exit_controls | vmcs_config.vmexit_ctrl);
>>> -   vmcs_write32(VM_ENTRY_CONTROLS, vmcs12->vm_entry_controls |
>>> +   /* L2->L1 exit controls are emulated - the hardware exit is to L0 so
>>> +* we should use its exit controls. Note that VM_EXIT_LOAD_IA32_EFER
>>> +* bits are further modified by vmx_set_efer() below.
>>> +*/
>>> +   vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl);
>> Should we mentioned that save vmx preemption bit must use host|guest, 
>> not just host?
>> 
>>> +
>>> +   /* vmcs12's VM_ENTRY_LOAD_IA32_EFER and VM_ENTRY_IA32E_MODE
>>> are
>>> +* emulated by vmx_set_efer(), below.
>>> +*/
>>> +   vmcs_write32(VM_ENTRY_CONTROLS,
>>> +   (vmcs12->vm_entry_controls & ~VM_ENTRY_LOAD_IA32_EFER &
>>> +   ~VM_ENTRY_IA32E_MODE) |
>>> (vmcs_config.vmentry_ctrl & ~VM_ENTRY_IA32E_MODE));
>>>  
>>> if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PAT)
>>> --
>>> 1.7.10.4
>> 
>> Best regards,
>> Yang
>


Best regards,
Yang




KVM Test report, kernel bf640876... qemu 0779caeb...

2013-08-02 Thread Ren, Yongjie
Hi All,

This is KVM upstream test result against kvm.git next branch and qemu-kvm.git 
uq/master branch.
 kvm.git next branch: bf640876e21fe603f7f52b0c27d66b7716da0384  based 
on kernel 3.11.0-rc1
 qemu-kvm.git uq/master branch: 0779caeb1a17f4d3ed14e2925b36ba09b084fb7b

We found two new bugs and no fixed bug in the past two weeks. 
As all of the two new bugs are introduced by Arthur, I also CCed him. 
Arthur, can you have look at the new bugs?

New issues (2):
1. 64bit RHEL6.4 guest crashes and reboots continuously
  https://bugs.launchpad.net/qemu-kvm/+bug/1207623
2. L2 can't boot up when creating L1 with '-cpu host' qemu option
  https://bugzilla.kernel.org/show_bug.cgi?id=60679

Fixed issue (0):

Old issues (9):
--
1. guest panic with parameter "-cpu host" in qemu command line (about vPMU 
issue).
  https://bugs.launchpad.net/qemu/+bug/994378
2. Can't install or boot up 32bit win8 guest.
  https://bugs.launchpad.net/qemu/+bug/1007269
3. vCPU hot-add makes the guest abort. 
  https://bugs.launchpad.net/qemu/+bug/1019179
4. Nested Virt: VMX can't be initialized in L1 Xen ("Xen on KVM")
  https://bugzilla.kernel.org/show_bug.cgi?id=45931
5. Guest has no "xsave" feature with parameter "-cpu qemu64,+xsave" in qemu 
command line.
  https://bugs.launchpad.net/qemu/+bug/1042561
6. Guest hang when doing kernel build and writing data in guest.
  https://bugs.launchpad.net/qemu/+bug/1096814
7. with 'monitor pty', it needs to flush pts device after sending command to it 
  https://bugs.launchpad.net/qemu/+bug/1185228
8. [nested virt] L2 Windows guest can't boot up ('-cpu host' to start L1)
  https://bugzilla.kernel.org/show_bug.cgi?id=58921
9. [nested virt] L2 has NMI error when creating L1 with "-cpu host" parameter
  https://bugzilla.kernel.org/show_bug.cgi?id=58941


Test environment:
==
  Platform   IvyBridge-EP    Sandybridge-EP
  CPU Cores   32    32
  Memory size 64GB  32GB


Regards
   Yongjie Ren  (Jay)