Re: [kvm-devel] [patch 1/6] mmu_notifier: Core code

2008-02-16 Thread Avi Kivity
Andrew Morton wrote:
> How important is this feature to KVM?
>   

Very.  kvm pins pages that are referenced by the guest; a 64-bit guest 
will easily pin its entire memory with the kernel map.  So this is 
critical for guest swapping to actually work.

Other nice features like page migration are also enabled by this patch.

-- 
Any sufficiently difficult bug is indistinguishable from a feature.




Re: [kvm-devel] [kvm-ppc-devel] upstream PowerPC qemu breakage

2008-02-16 Thread Avi Kivity
Hollis Blanchard wrote:
> On Wed, 2008-02-13 at 08:58 +0200, Avi Kivity wrote:
>   
>> It'll need to be built against your kernel tree; please provide a URL.
>> 
>
> curl http://penguinppc.org/~hollisb/kvm/kvm-powerpc.mbox | git-am
>
>   

Unfortunately I wasn't able to get an F8 ppc rescue cd ISO to boot with 
qemu 0.9.0.  Can you point me to a working combination?

-- 
Any sufficiently difficult bug is indistinguishable from a feature.




Re: [kvm-devel] [patch 1/6] mmu_notifier: Core code

2008-02-16 Thread Andrew Morton
On Sat, 16 Feb 2008 10:45:50 +0200 Avi Kivity <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> > How important is this feature to KVM?
> >   
> 
> Very.  kvm pins pages that are referenced by the guest;

hm.  Why does it do that?

> a 64-bit guest 
> will easily pin its entire memory with the kernel map.

>  So this is 
> critical for guest swapping to actually work.

Curious.  If KVM can release guest pages at the request of this notifier so
that they can be swapped out, why can't it release them by default, and
allow swapping to proceed?

> 
> Other nice features like page migration are also enabled by this patch.
> 

We already have page migration.  Do you mean page-migration-when-using-kvm?



Re: [kvm-devel] [PATCH] enable gfxboot on VMX

2008-02-16 Thread Avi Kivity
Alexander Graf wrote:
>>
>> While enabling gfxboot over vmx is very desirable, I'd like to avoid
>> guest-specific hacks.  IMO the correct fix is to set a  
>> "non_vt_friendly"
>> flag when switching from real mode to protected mode, then continue
>> emulation, re-computing the flag after every instruction.  After a few
>> instruction, the condition disappears and we can enter guest mode  
>> again.
>> 
>
> So when would the flag disappear?

Whenever the register state becomes consistent with VT again.  
vmx_set_segment() looks like the right point for turning it off.
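
For illustration only, a minimal sketch of the loop being proposed here (the 
non_vt_friendly flag and emulate_one_instruction() are hypothetical names, not 
existing KVM code):

	/* Sketch only -- hypothetical flag, not the current KVM implementation. */
	static int handle_vt_unfriendly_state(struct kvm_vcpu *vcpu)
	{
		while (vcpu->arch.non_vt_friendly) {
			/*
			 * Keep emulating until the segment state satisfies the
			 * VMENTER checks; vmx_set_segment() would clear the flag
			 * once the reloaded segments are consistent with VT again.
			 */
			if (emulate_one_instruction(vcpu) != EMULATE_DONE)
				return -EINVAL;
		}
		return 1;	/* safe to re-enter guest mode */
	}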

>  Basically an SS read can happen any  
> time after the switch, even after 100 instructions if it has to.

That's okay.  Most (all) code will reload all segments shortly after the 
protected mode switch.

I am concerned about switches to real-mode not reloading fs and gs, so 
we'd stay in vt-unfriendly state and emulate even though the real mode 
code doesn't use fs and gs at all.  I believe we can live with it; newer 
Xen for example emulates 100% of real mode code.

>  While  
> this should fix more problems, the one thing I am concerned about is  
> that I have not encountered any other code that does have this problem.
>
>   

I think some Ubuntus use big real mode, which can use the same fix.

>> The downside is that we have to implement more instructions in the
>> emulator for this, but these instructions will be generally useful,  
>> not
>> just for gfxboot.
>> 
>
> I am not trying to talk you into anything - I would very much prefer a  
> rather clean solution as well. Nevertheless I do not see full  
> protected mode emulation code coming in the very near future and on a  
> user perspective would prefer to have something that works, even if  
> it's ugly.
> So while KVM is able to run most (if not all?) current major Operating  
> Systems unmodified, it fails to install them (at least on the Linux  
> side).
>   

I'd like to keep ugliness out of the kernel side.

I don't think there's much work to get protected mode emulation 
working.  There aren't that many instructions before we get to a 
vt-friendly state (a couple dozen?) and some of them are already 
implemented.

An alternative is to work around it in userspace.  If we recognise the 
exit reason, we can read the instructions around rip and attempt to fix 
things up.

> Even though I would greatly appreciate any effort made to get things  
> cleaned up, the gfxboot issue has been standing for months now without  
> even a hacky workaround (except for disabling gfxboot in all) or any  
> visible progress, so I believe a hack like this is at least worth  
> something to distributions that want to enable their users to work  
> with KVM.

On the other hand, merging the hacks discourages the right fix from 
being developed.  I do agree that the current situation is disgraceful.

-- 
Any sufficiently difficult bug is indistinguishable from a feature.




Re: [kvm-devel] [patch 1/6] mmu_notifier: Core code

2008-02-16 Thread Avi Kivity
Andrew Morton wrote:

  

>> Very.  kvm pins pages that are referenced by the guest;
>> 
>
> hm.  Why does it do that?
>
>   

It was deemed best not to allow the guest to write to a page that has 
been swapped out and assigned to an unrelated host process.

One way to view the kvm shadow page tables is as hardware dma 
descriptors. kvm pins pages for the same reason that drivers pin pages 
that are being dma'ed. It's also the reason why mmu notifiers are useful 
for such a wide range of dma capable hardware.
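
In code terms the analogy is roughly this: a driver doing DMA to user memory 
pins pages along the lines below, and kvm's sptes play the role of the 
device's descriptors (sketch only, not taken from any particular driver):

	/* Sketch: the usual pattern a driver uses to pin user pages for DMA. */
	static int pin_user_buffer(unsigned long uaddr, int npages, struct page **pages)
	{
		int ret;

		down_read(&current->mm->mmap_sem);
		ret = get_user_pages(current, current->mm, uaddr, npages,
				     1 /* write */, 0 /* force */, pages, NULL);
		up_read(&current->mm->mmap_sem);
		/*
		 * The hardware descriptors are then programmed with the pages'
		 * physical addresses; each page is released with put_page() only
		 * when the DMA completes -- just as kvm keeps pages pinned until
		 * the sptes pointing at them are zapped.
		 */
		return ret;
	}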

>> a 64-bit guest 
>> will easily pin its entire memory with the kernel map.
>> 
>
>   
>>  So this is 
>> critical for guest swapping to actually work.
>> 
>
> Curious.  If KVM can release guest pages at the request of this notifier so
> that they can be swapped out, why can't it release them by default, and
> allow swapping to proceed?
>
>   

If kvm releases a page, it must also zap any shadow ptes pointing at the 
page and flush the tlb. If you do that for all of memory you can't 
reference any of it.

Releasing a page has costs, both at the time of the release and when the 
guest eventually refers to the page again.
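
To make the cost concrete, here is a rough sketch of what releasing a single 
guest page involves (compare kvm_unmap_spte() in the MMU-notifier swapping 
patch later in this digest; the helper name here is illustrative):

	/* Illustrative only; compare kvm_unmap_spte() in the swapping patch. */
	static void release_guest_page(struct kvm *kvm, u64 *spte, struct page *page)
	{
		rmap_remove(kvm, spte);                           /* drop the reverse mapping    */
		set_shadow_pte(spte, shadow_trap_nonpresent_pte); /* zap the shadow pte          */
		kvm_flush_remote_tlbs(kvm);                       /* flush the guest tlbs        */
		put_page(page);                                   /* finally unpin the host page */
	}

	/* The guest's next access to that page then takes a fresh kvm page fault,
	 * which is the second cost mentioned above. */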

>> Other nice features like page migration are also enabled by this patch.
>>
>> 
>
> We already have page migration.  Do you mean page-migration-when-using-kvm?
>   

Yes, I'm obviously writing from a kvm-centric point of view. This is an 
important feature, as the virtualization future seems to be NUMA hosts 
(2- or 4- way, 4 cores per socket) running moderately sized guests. The 
ability to load-balance guests among the NUMA nodes is important for 
performance.

(btw, I'm also looking forward to memory defragmentation. large pages 
are important for virtualization workloads and mmu notifiers are again 
critical to getting it to work while running kvm).

-- 
Any sufficiently difficult bug is indistinguishable from a feature.




Re: [kvm-devel] [patch 3/6] mmu_notifier: invalidate_page callbacks

2008-02-16 Thread Andrea Arcangeli
On Fri, Feb 15, 2008 at 07:37:36PM -0800, Andrew Morton wrote:
> The "|" is obviously deliberate.  But no explanation is provided telling us
> why we still call the callback if ptep_clear_flush_young() said the page
> was recently referenced.  People who read your code will want to understand
> this.

This is to clear the young bit in every pte and spte pointing to the physical
page before backing off because some young bit was set. That way, if any
young bit is set on the next scan, we're guaranteed the page has been touched
recently and not ages before (otherwise it would take, worst case, N rounds
of the lru before the page could be freed, where N is the number of ptes or
sptes pointing to the page).

> I just don't see how this can be done if the callee has another thread in
> the middle of establishing IO against this region of memory. 
> ->invalidate_page() _has_ to be able to block.  Confused.

invalidate_page, which marks the spte invalid and flushes the asid/tlb,
doesn't need to block, in the same way ptep_clear_flush doesn't need to block
for the main linux pte. In fact, before invalidate_page and ptep_clear_flush
can touch anything at all, they have to take their own spinlocks (mmu_lock
for the former, the PT lock for the latter).

The only sleeping trouble is for network-driven message passing, where the
driver wants to schedule while it waits for the message to arrive; spinning
that long would hang the whole cpu.

sptes are cpu-clocked entities like ptes, so scheduling there isn't necessary
at all: there's zero delay in invalidating them and flushing their tlbs. GRU
is similar. Because we boost the reference count of the pages for every spte
mapping, implementing only invalidate_range_end is enough, but I still need
to close the get_user_pages->rmap_add window. get_user_pages can schedule,
so a critical section around it (to avoid calling get_user_pages twice
during the kvm page fault) could only be a mutex, not a spinlock; but a mutex
can't be taken by invalidate_page to stop it. That leaves me with the idea of
adding a get_user_pages variant that returns the page locked. Then, instead
of calling get_user_pages a second time after rmap_add returns, I only need
to call unlock_page, which should be faster than a follow_page. And setting
PG_locked before dropping the PT lock in follow_page should be fast enough
too.
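
(For illustration, a sketch of the fault path this would give; 
get_user_page_locked() is a made-up name for the proposed variant, not an 
existing API:)

	/*
	 * Hypothetical sketch of the kvm fault path with a get_user_pages
	 * variant that returns the page locked; the variant does not exist yet.
	 */
	static struct page *map_guest_page(struct kvm_vcpu *vcpu, unsigned long hva,
					   u64 *spte, gfn_t gfn)
	{
		struct page *page;

		page = get_user_page_locked(current->mm, hva); /* PG_locked held on return */
		if (!page)
			return NULL;

		spin_lock(&vcpu->kvm->mmu_lock);
		rmap_add(vcpu, spte, gfn);	/* spte now tracked by rmap */
		spin_unlock(&vcpu->kvm->mmu_lock);

		unlock_page(page);		/* cheaper than a second gup/follow_page */
		return page;
	}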



Re: [kvm-devel] [PATCH] KVM swapping with MMU Notifiers V7

2008-02-16 Thread Andrew Morton
On Sat, 16 Feb 2008 11:48:27 +0100 Andrea Arcangeli <[EMAIL PROTECTED]> wrote:

> +void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> +struct mm_struct *mm,
> +unsigned long start, unsigned long end,
> +int lock)
> +{
> + for (; start < end; start += PAGE_SIZE)
> + kvm_mmu_notifier_invalidate_page(mn, mm, start);
> +}
> +
> +static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
> + .invalidate_page= kvm_mmu_notifier_invalidate_page,
> + .age_page   = kvm_mmu_notifier_age_page,
> + .invalidate_range_end   = kvm_mmu_notifier_invalidate_range_end,
> +};

So this doesn't implement ->invalidate_range_start().

By what means does it prevent new mappings from being established in the
range after core mm has tried to call ->invalidate_range_start()?
mmap_sem, I assume?


> + /* set userspace_addr atomically for kvm_hva_to_rmapp */
> + spin_lock(&kvm->mmu_lock);
> + memslot->userspace_addr = userspace_addr;
> + spin_unlock(&kvm->mmu_lock);

are you sure?  kvm_unmap_hva() and kvm_age_hva() read ->userspace_addr a
single time and it doesn't immediately look like there's a need to take the
lock here?





[kvm-devel] [PATCH] KVM swapping with MMU Notifiers V7

2008-02-16 Thread Andrea Arcangeli
The two patches below enable KVM to swap the guest physical memory through
Christoph's V7 patchset.

There's one last _purely_theoretical_ race condition that I figured out and
am wondering how best to fix. The worst case of the race is that a few guest
physical pages could remain pinned by sptes. The race can materialize if the
linux pte is zapped after get_user_pages returns but before the page is
mapped by the spte and tracked by rmap. The invalidate_ calls can also likely
be optimized further, but that's not a fast path so it's not urgent.

Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]>

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 41962e7..e1287ab 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -21,6 +21,7 @@ config KVM
tristate "Kernel-based Virtual Machine (KVM) support"
depends on HAVE_KVM && EXPERIMENTAL
select PREEMPT_NOTIFIERS
+   select MMU_NOTIFIER
select ANON_INODES
---help---
  Support hosting fully virtualized guest machines using hardware
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index fd39cd1..b56e388 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -533,6 +533,110 @@ static void rmap_write_protect(struct kvm *kvm, u64 gfn)
kvm_flush_remote_tlbs(kvm);
 }
 
+static void kvm_unmap_spte(struct kvm *kvm, u64 *spte)
+{
+   struct page *page = pfn_to_page((*spte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT);
+   get_page(page);
+   rmap_remove(kvm, spte);
+   set_shadow_pte(spte, shadow_trap_nonpresent_pte);
+   kvm_flush_remote_tlbs(kvm);
+   __free_page(page);
+}
+
+static void kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp)
+{
+   u64 *spte, *curr_spte;
+
+   spte = rmap_next(kvm, rmapp, NULL);
+   while (spte) {
+   BUG_ON(!(*spte & PT_PRESENT_MASK));
+   rmap_printk("kvm_rmap_unmap_hva: spte %p %llx\n", spte, *spte);
+   curr_spte = spte;
+   spte = rmap_next(kvm, rmapp, spte);
+   kvm_unmap_spte(kvm, curr_spte);
+   }
+}
+
+void kvm_unmap_hva(struct kvm *kvm, unsigned long hva)
+{
+   int i;
+
+   /*
+* If mmap_sem isn't taken, we can look at the memslots with only
+* the mmu_lock by skipping over the slots with userspace_addr == 0.
+*/
+   spin_lock(&kvm->mmu_lock);
+   for (i = 0; i < kvm->nmemslots; i++) {
+   struct kvm_memory_slot *memslot = &kvm->memslots[i];
+   unsigned long start = memslot->userspace_addr;
+   unsigned long end;
+
+   /* mmu_lock protects userspace_addr */
+   if (!start)
+   continue;
+
+   end = start + (memslot->npages << PAGE_SHIFT);
+   if (hva >= start && hva < end) {
+   gfn_t gfn_offset = (hva - start) >> PAGE_SHIFT;
+   kvm_unmap_rmapp(kvm, &memslot->rmap[gfn_offset]);
+   }
+   }
+   spin_unlock(&kvm->mmu_lock);
+}
+
+static int kvm_age_rmapp(struct kvm *kvm, unsigned long *rmapp)
+{
+   u64 *spte;
+   int young = 0;
+
+   spte = rmap_next(kvm, rmapp, NULL);
+   while (spte) {
+   int _young;
+   u64 _spte = *spte;
+   BUG_ON(!(_spte & PT_PRESENT_MASK));
+   _young = _spte & PT_ACCESSED_MASK;
+   if (_young) {
+   young = !!_young;
+   set_shadow_pte(spte, _spte & ~PT_ACCESSED_MASK);
+   }
+   spte = rmap_next(kvm, rmapp, spte);
+   }
+   return young;
+}
+
+int kvm_age_hva(struct kvm *kvm, unsigned long hva)
+{
+   int i;
+   int young = 0;
+
+   /*
+* If mmap_sem isn't taken, we can look at the memslots with only
+* the mmu_lock by skipping over the slots with userspace_addr == 0.
+*/
+   spin_lock(&kvm->mmu_lock);
+   for (i = 0; i < kvm->nmemslots; i++) {
+   struct kvm_memory_slot *memslot = &kvm->memslots[i];
+   unsigned long start = memslot->userspace_addr;
+   unsigned long end;
+
+   /* mmu_lock protects userspace_addr */
+   if (!start)
+   continue;
+
+   end = start + (memslot->npages << PAGE_SHIFT);
+   if (hva >= start && hva < end) {
+   gfn_t gfn_offset = (hva - start) >> PAGE_SHIFT;
+   young |= kvm_age_rmapp(kvm, &memslot->rmap[gfn_offset]);
+   }
+   }
+   spin_unlock(&kvm->mmu_lock);
+
+   if (young)
+   kvm_flush_remote_tlbs(kvm);
+
+   return young;
+}
+
 #ifdef MMU_DEBUG
 static int is_empty_shadow_page(u64 *spt)
 {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0c910c7..2b2398f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3185,6 +3185,46 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
free_page((unsigned lon

Re: [kvm-devel] [PATCH] KVM swapping with MMU Notifiers V7

2008-02-16 Thread Robin Holt
On Sat, Feb 16, 2008 at 11:48:27AM +0100, Andrea Arcangeli wrote:
> Those below two patches enable KVM to swap the guest physical memory
> through Christoph's V7.
> 
> There's one last _purely_theoretical_ race condition I figured out and
> that I'm wondering how to best fix. The race condition worst case is
> that a few guest physical pages could remain pinned by sptes. The race
> can materialize if the linux pte is zapped after get_user_pages
> returns but before the page is mapped by the spte and tracked by
> rmap. The invalidate_ calls can also likely be optimized further but
> it's not a fast path so it's not urgent.

I am doing this in xpmem with a stack-based structure in the function calling
get_user_pages.  That structure describes the start and end address of the
range we are doing the get_user_pages on.  If an invalidate_range_begin comes
in while we are off in the kernel doing the get_user_pages, the
invalidate_range_begin marks that structure to indicate an invalidate came
in.  When get_user_pages returns and the structure is relocked, we check that
flag (really a generation counter) and, if it is set, retry the
get_user_pages.  After 3 retries, we return -EAGAIN and the fault is started
over from the remote side.

Thanks,
Robin
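
A condensed sketch of the retry pattern described above; the structure and
field names are illustrative, not the actual xpmem code:

	/* Illustrative sketch of the retry pattern -- not the actual xpmem code. */
	struct gup_range {
		unsigned long start;
		unsigned long end;
		unsigned int invalidate_seq;	/* bumped by invalidate_range_begin() */
	};

	static int fault_user_range(struct gup_range *r, struct mm_struct *mm,
				    struct page **pages, int npages)
	{
		int retries;

		for (retries = 0; retries < 3; retries++) {
			unsigned int seq = r->invalidate_seq;
			int got;

			/* may sleep; an invalidate_range_begin can race with us here */
			got = get_user_pages(current, mm, r->start, npages, 1, 0,
					     pages, NULL);
			if (got < 0)
				return got;

			/* structures relocked by the caller; did an invalidate hit? */
			if (seq == r->invalidate_seq)
				return 0;	/* no race: mappings are still good */
		}
		return -EAGAIN;			/* restart the fault from the remote side */
	}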



Re: [kvm-devel] [PATCH] enable gfxboot on VMX

2008-02-16 Thread Alexander Graf

On Feb 16, 2008, at 10:06 AM, Avi Kivity wrote:

> Alexander Graf wrote:
>>>
>>> While enabling gfxboot over vmx is very desirable, I'd like to avoid
>>> guest-specific hacks.  IMO the correct fix is to set a   
>>> "non_vt_friendly"
>>> flag when switching from real mode to protected mode, then continue
>>> emulation, re-computing the flag after every instruction.  After a  
>>> few
>>> instruction, the condition disappears and we can enter guest mode   
>>> again.
>>>
>>
>> So when would the flag disappear?
>
> Whenever the register state becomes consistent with VT again.   
> vmx_set_segment() looks like the right point for turning it off.

Sounds good. Since basically the only problem we have is the sanity checks
done on VMENTER, this should work.

>> Basically an SS read can happen any  time after the switch, even  
>> after 100 instructions if it has to.
>
> That's okay.  Most (all) code will reload all segments shortly after  
> the protected mode switch.
>
> I am concerned about switches to real-mode not reloading fs and gs,  
> so we'd stay in vt-unfriendly state and emulate even though the real  
> mode code doesn't use fs and gs at all.  I believe we can live with  
> it; newer Xen for example emulates 100% of real mode code.

Emulating all of the real mode shouldn't be too much of a problem on  
the performance side. I wouldn't be surprised if the vmenter/exits  
take about as much time as the emulation overhead.

>> While  this should fix more problems, the one thing I am concerned  
>> about is  that I have not encountered any other code that does have  
>> this problem.
>
> I think some Ubuntus use big real mode, which can use the same fix.

Do you have any file / pointer to where I could get one? I did try the  
feisty server iso which worked just fine.

>>> The downside is that we have to implement more instructions in the
>>> emulator for this, but these instructions will be generally  
>>> useful,  not
>>> just for gfxboot.
>>
>> I am not trying to talk you into anything - I would very much  
>> prefer a  rather clean solution as well. Nevertheless I do not see  
>> full  protected mode emulation code coming in the very near future  
>> and on a  user perspective would prefer to have something that  
>> works, even if  it's ugly.
>> So while KVM is able to run most (if not all?) current major  
>> Operating  Systems unmodified, it fails to install them (at least  
>> on the Linux  side).
>
> I'd like to keep ugliness out of the kernel side.
>
> I don't think there's much work to get protected mode emulation  
> working.  There aren't that many instructions before we get to a vt- 
> friendly state (a couple dozen?) and some of them are already  
> implemented.

The hardest one is ljmp; you then need to do the whole pm transition in the
emulator. I believe there is a reason this hasn't been done yet?

> An alternative is to work around it in userspace.  If we recognise  
> the exit reason, we can read the instructions around rip and attempt  
> to fix things up.

So just forward the CR0 write and the UD exception as events to userspace?
I'd really love that approach. The "invalid opcode" hack, as I implemented
it, is actually quite extensible: you could simply put the rip and the
operation that is supposed to occur into a list and emulate whatever comes up
when the UD occurs. This might be the easiest way to fix things.

We could also have something more extensible, say a "generic binary  
patching" framework, so we know that if memory page 0x1234000 contains  
specific content, just patch it and apply a "what happens in case of  
invalid opcodes" script. This could all be in userspace and should  
enable us to circumvent most problems in a generic way.

>> Even though I would greatly appreciate any effort made to get  
>> things  cleaned up, the gfxboot issue has been standing for months  
>> now without  even a hacky workaround (except for disabling gfxboot  
>> in all) or any  visible progress, so I believe a hack like this is  
>> at least worth  something to distributions that want to enable  
>> their users to work  with KVM.
>
> On the other hand, merging the hacks discourages the right fix from  
> being developed.  I do agree that the current situation is  
> disgraceful.

Don't get me wrong on this - I really want to see something "right". I just
don't see anyone working on it, as there are a lot of areas where KVM is
improving right now that are far more important than real mode fixes.
Usually real mode is completely unused as soon as you're done with
bootstrapping, so why care about it that much?

I'm also perfectly fine with this not being merged. I built this hack  
for me, because I was rather unhappy with the situation as is and  
wanted to see gfxboot working, as I couldn't just "plug in" a current  
iso and install from that. If anyone benefits from it, I'm fine with  
it. If not, that's ok with me too. I just couldn't stand the situation  
that no fix was avail

[kvm-devel] [RFC] Performance monitoring units and KVM

2008-02-16 Thread Balaji Rao
Hi all!

Earlier it was suggested that we go ahead with emulating perf mon events when 
exposing them to the guest. The serious limitation of this approach is that we 
end up exposing only a small number of events to the guest, even though the 
host hardware is capable of much more. The only benefit it offers is that it 
doesn't break live migration.

The other option is to pass the real PMU through to the guest. I believe this 
approach is far better, in the sense that:

1. All the events available in the host hardware can be passed on to the guest, 
where oprofile can use them to profile the guest and track down slowdowns 
introduced by virtualization.

2. It's much cleaner and easier to pass through the PMU.

Yes, this approach breaks live migration. But migration would be blocked *only* 
while the PMU is actually being used by oprofile; we can mark the guest as 
unmigratable in that situation. Once the PMU is no longer in use, migration can 
be performed normally.

Note that this requires a small change to the oprofile source. Upon migration, 
oprofile should re-identify the CPU and use the perf mon events appropriate to 
that CPU. I think this could be done with a migrate_notifier, or something like 
that.

Please provide comments on this.

-- 
regards,
balaji rao
NITK



Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-02-16 Thread Christoph Lameter
On Fri, 15 Feb 2008, Andrew Morton wrote:

> On Thu, 14 Feb 2008 22:49:01 -0800 Christoph Lameter <[EMAIL PROTECTED]> 
> wrote:
> 
> > The invalidation of address ranges in a mm_struct needs to be
> > performed when pages are removed or permissions etc change.
> 
> hm.  Do they?  Why?  If I'm in the process of zero-copy writing a hunk of
> memory out to hardware then do I care if someone write-protects the ptes?
> 
> Spose so, but some fleshing-out of the various scenarios here would clarify
> things.

You care f.e. if the VM needs to write-protect a memory range and a write
occurs. In that case the VM needs to do proper write processing, and a write
through an external pte would cause memory corruption.

> > If invalidate_range_begin() is called with locks held then we
> > pass a flag into invalidate_range() to indicate that no sleeping is
> > possible. Locks are only held for truncate and huge pages.
> 
> This is so bad.

Ok, so I can twiddle around with the inode_mmap_lock to drop it while this is
called?

> > In two cases we use invalidate_range_begin/end to invalidate
> > single pages because the pair allows holding off new references
> > (idea by Robin Holt).
> 
> Assuming that there is a missing "within the range" in this description, I
> assume that all clients will just throw up their hands in horror and will
> disallow all references to all parts of the mm.

Right. Missing within the range. We only need to disallow creating new 
ptes right? Why disallow references?
 

> > xip_unmap: We are not taking the PageLock so we cannot
> > use the invalidate_page mmu_rmap_notifier. invalidate_range_begin/end
> > stands in.
> 
> What does "stands in" mean?

Use a range begin / end to invalidate a page.

> > +   mmu_notifier(invalidate_range_begin, mm, start, start + size, 0);
> > err = populate_range(mm, vma, start, size, pgoff);
> > +   mmu_notifier(invalidate_range_end, mm, start, start + size, 0);
> 
> To avoid off-by-one confusion the changelogs, documentation and comments
> should be very careful to tell the reader whether the range includes the
> byte at start+size.  I don't think that was done?

No, it was not. I assumed that the convention is always start to (end - 1),
i.e. the byte at end is not affected by the operation.
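
For instance, with that convention a range invalidate over [start, end)
touches the pages at start, start + PAGE_SIZE, ... and stops before end, as
in the loop from the KVM swapping patch earlier in this digest:

	/* end-exclusive convention: the byte at "end" is never touched */
	for (addr = start; addr < end; addr += PAGE_SIZE)
		mmu_notifier(invalidate_page, mm, addr);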




Re: [kvm-devel] [patch 5/6] mmu_notifier: Support for drivers with revers maps (f.e. for XPmem)

2008-02-16 Thread Christoph Lameter
On Fri, 15 Feb 2008, Andrew Morton wrote:

> > +#define mmu_rmap_notifier(function, args...)   
> > \
> > +   do {\
> > +   struct mmu_rmap_notifier *__mrn;\
> > +   struct hlist_node *__n; \
> > +   \
> > +   rcu_read_lock();\
> > +   hlist_for_each_entry_rcu(__mrn, __n,\
> > +   &mmu_rmap_notifier_list, hlist) \
> > +   if (__mrn->ops->function)   \
> > +   __mrn->ops->function(__mrn, args);  \
> > +   rcu_read_unlock();  \
> > +   } while (0);
> > +
> 
> buggy macro: use locals.

Ok. Same as the non rmap version.

> > +EXPORT_SYMBOL(mmu_rmap_export_page);
> 
> The other patch used EXPORT_SYMBOL_GPL.

Ok will make that consistent.





Re: [kvm-devel] [patch 3/6] mmu_notifier: invalidate_page callbacks

2008-02-16 Thread Christoph Lameter
On Fri, 15 Feb 2008, Andrew Morton wrote:

> > @@ -287,7 +288,8 @@ static int page_referenced_one(struct pa
> > if (vma->vm_flags & VM_LOCKED) {
> > referenced++;
> > *mapcount = 1;  /* break early from loop */
> > -   } else if (ptep_clear_flush_young(vma, address, pte))
> > +   } else if (ptep_clear_flush_young(vma, address, pte) |
> > +  mmu_notifier_age_page(mm, address))
> > referenced++;
> 
> The "|" is obviously deliberate.  But no explanation is provided telling us
> why we still call the callback if ptep_clear_flush_young() said the page
> was recently referenced.  People who read your code will want to understand
> this.

Andrea?

> > flush_cache_page(vma, address, pte_pfn(*pte));
> > entry = ptep_clear_flush(vma, address, pte);
> > +   mmu_notifier(invalidate_page, mm, address);
> 
> I just don't see how this can be done if the callee has another thread in
> the middle of establishing IO against this region of memory. 
> ->invalidate_page() _has_ to be able to block.  Confused.

The page lock is held and that holds off I/O?




Re: [kvm-devel] [patch 1/6] mmu_notifier: Core code

2008-02-16 Thread Christoph Lameter
On Fri, 15 Feb 2008, Andrew Morton wrote:

> What is the status of getting infiniband to use this facility?

Well we are talking about this it seems.
> 
> How important is this feature to KVM?

Andrea can answer this.

> To xpmem?

Without this feature we are stuck with page pinning via increased refcounts,
which leads to endless lru scanning and other misbehavior. Also, applications
that use XPmem would not be able to swap or to use things like remap.
 
> Which other potential clients have been identified and how important it it
> to those?

It is likely important to various DMA engines, framebuffer devices, etc. It
seems to be a generally useful feature.


> > +The notifier chains provide two callback mechanisms. The
> > +first one is required for any device that establishes external mappings.
> > +The second (rmap) mechanism is required if a device needs to be
> > +able to sleep when invalidating references. Sleeping may be necessary
> > +if we are mapping across a network or to different Linux instances
> > +in the same address space.
> 
> I'd have thought that a major reason for sleeping would be to wait for IO
> to complete.  Worth mentioning here?

Right.

> Why is that "easy"?  I's have thought that it would only be easy if the
> driver happened to be using those same locks for its own purposes. 
> Otherwise it is "awkward"?

It's relatively easy because it is tied directly to a process and can use
external tlb shootdown / external page table clearing directly. The other
method requires an rmap in the device driver so it can look up the processes
that are mapping the page.
 
> > +The invalidation mechanism for a range (*invalidate_range_begin/end*) is
> > +called most of the time without any locks held. It is only called with
> > +locks held for file backed mappings that are truncated. A flag indicates
> > +in which mode we are. A driver can use that mechanism to f.e.
> > +delay the freeing of the pages during truncate until no locks are held.
> 
> That sucks big time.  What do we need to do to make get the callback
> functions called in non-atomic context?

We would have to drop the inode_mmap_lock. Could be done with some minor 
work.

> > +Pages must be marked dirty if dirty bits are found to be set in
> > +the external ptes during unmap.
> 
> That sentence is too vague.  Define "marked dirty"?

Call set_page_dirty().

> > +The *release* method is called when a Linux process exits. It is run before
> 
> We'd conventionally use a notation such as "->release()" here, rather than
> the asterisks.

Ok.

> 
> > +the pages and mappings of a process are torn down and gives the device 
> > driver
> > +a chance to zap all the external mappings in one go.
> 
> I assume what you mean here is that ->release() is called during exit()
> when the final reference to an mm is being dropped.

Right.

> > +An example for a code that can be used to build a notifier mechanism into
> > +a device driver can be found in the file
> > +Documentation/mmu_notifier/skeleton.c
> 
> Should that be in samples/?

Oh. We have that?

> > +The mmu_rmap_notifier adds another invalidate_page() callout that is called
> > +*before* the Linux rmaps are walked. At that point only the page lock is
> > +held. The invalidate_page() function must walk the driver rmaps and evict
> > +all the references to the page.
> 
> What happens if it cannot do so?

The page is not reclaimed if we were called from try_to_unmap(). From 
page_mkclean() we must always evict the page to switch off the write 
protect bit.

> > +There is no process information available before the rmaps are consulted.
> 
> Not sure what that sentence means.  I guess "available to the core VM"?

At that point we only have the page. We do not know which processes map 
the page. In order to find out we need to take a spinlock.


> > +The notifier mechanism can therefore not be attached to an mm_struct. 
> > Instead
> > +it is a global callback list. Having to perform a callback for each and 
> > every
> > +page that is reclaimed would be inefficient. Therefore we add an additional
> > +page flag: PageRmapExternal().
> 
> How many page flags are left?

30 or so. It's only available on 64-bit.

> Is this feature important enough to justify consumption of another one?
> 
> > Only pages that are marked with this bit can
> > +be exported and the rmap callbacks will only be performed for pages marked
> > +that way.
> 
> "exported": new term, unclear what it means.

Something external to the kernel references the page.

> > +The required additional Page flag is only availabe in 64 bit mode and
> > +therefore the mmu_rmap_notifier portion is not available on 32 bit 
> > platforms.
> 
> whoa.  Is that good?  You just made your feature unavailable on the great
> majority of Linux systems.

rmaps are usually used by complex drivers that are typically used in large 
systems.

> > + * Notifier functions for hardware and software that establishes external
> > + * references to pages 

Re: [kvm-devel] [RFC] Performance monitoring units and KVM

2008-02-16 Thread Anthony Liguori
Balaji Rao wrote:
> Hi all!
>
> Earlier it was suggested that we go ahead with emulating Perf Mon Events in 
> exposing it to the guest. The serious limitation in this approach is that we 
> end up exposing only a small number of events to the guest, even though the 
> host hardware is capable of much more. The only benefit this approach offers 
> is 
> that, it doesn't break live migration.
>   

I think performance monitors are no different from anything else in KVM.  We
should virtualize as much as possible and, by default, provide the guest only
with the common subset supported by the majority of hardware.

Then we can use mechanisms like QEMU's CPU support to enable additional 
features that may be available and unique to the underlying hardware.  
It's then up to the management tools to deal with migratability since 
they've explicitly enabled the feature.

Regards,

Anthony Liguori

> The other option is to pass through the real PMU to the guest. I believe this 
> approach is far better in the sense that,
>
> 1. All the available events in the host hardware can be passed on to the 
> guest, 
> which can be used by oprofile to profile the guest and trackdown slowdowns 
> introduced due to virtualization.
>
> 2. Its much cleaner and easier to pass through the PMU.
>
> Yes, this approach breaks live migration. Migration should not be possible 
> *only* when the PMU is being used by oprofile. We can mark the guest as 
> unmigratable in such situations. Once the PMU is not being used, migration 
> can 
> be performed normally.
>
> Note, this requires a small change to oprofile source. Upon migration, 
> oprofile 
> should be made to re-identify the CPU and use the perf mon events appropriate 
> to that CPU. I think this could be done by having a migrate_notifier, or 
> something like that..
>
> Please provide comments on this.
>
>   




[kvm-devel] [patch 0/5] KVM paravirt MMU updates and cr3 caching

2008-02-16 Thread Marcelo Tosatti
The following patchset, based on earlier work by Anthony and Ingo, adds
paravirt_ops support for KVM guests enabling hypercall based pte updates,
hypercall batching and cr3 caching.

-- 




[kvm-devel] [patch 1/5] KVM: add basic paravirt support

2008-02-16 Thread Marcelo Tosatti
Add basic KVM paravirt support. Avoid vm-exits on IO delays.

Add KVM_GET_PARA_FEATURES ioctl so paravirt features can be reported via
cpuid.

Signed-off-by: Marcelo Tosatti <[EMAIL PROTECTED]>
Cc: Anthony Liguori <[EMAIL PROTECTED]>

Index: kvm.paravirt/arch/x86/Kconfig
===
--- kvm.paravirt.orig/arch/x86/Kconfig
+++ kvm.paravirt/arch/x86/Kconfig
@@ -372,6 +372,14 @@ config VMI
  at the moment), by linking the kernel to a GPL-ed ROM module
  provided by the hypervisor.
 
+config KVM_GUEST
+   bool "KVM Guest support"
+   select PARAVIRT
+   depends on !(X86_VISWS || X86_VOYAGER)
+   help
+This option enables various optimizations for running under the KVM
+hypervisor.
+
 source "arch/x86/lguest/Kconfig"
 
 config PARAVIRT
Index: kvm.paravirt/arch/x86/kernel/Makefile
===
--- kvm.paravirt.orig/arch/x86/kernel/Makefile
+++ kvm.paravirt/arch/x86/kernel/Makefile
@@ -69,6 +69,7 @@ obj-$(CONFIG_DEBUG_RODATA_TEST)   += test_
 obj-$(CONFIG_DEBUG_NX_TEST)+= test_nx.o
 
 obj-$(CONFIG_VMI)  += vmi_32.o vmiclock_32.o
+obj-$(CONFIG_KVM_GUEST)+= kvm.o
 obj-$(CONFIG_PARAVIRT) += paravirt.o paravirt_patch_$(BITS).o
 
 ifdef CONFIG_INPUT_PCSPKR
Index: kvm.paravirt/arch/x86/kernel/kvm.c
===
--- /dev/null
+++ kvm.paravirt/arch/x86/kernel/kvm.c
@@ -0,0 +1,52 @@
+/*
+ * KVM paravirt_ops implementation
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, USA.
+ *
+ * Copyright (C) 2007, Red Hat, Inc., Ingo Molnar <[EMAIL PROTECTED]>
+ * Copyright IBM Corporation, 2007
+ *   Authors: Anthony Liguori <[EMAIL PROTECTED]>
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/*
+ * No need for any "IO delay" on KVM
+ */
+static void kvm_io_delay(void)
+{
+}
+
+static void paravirt_ops_setup(void)
+{
+   pv_info.name = "KVM";
+   pv_info.paravirt_enabled = 1;
+
+   if (kvm_para_has_feature(KVM_FEATURE_NOP_IO_DELAY))
+   pv_cpu_ops.io_delay = kvm_io_delay;
+
+}
+
+void __init kvm_guest_init(void)
+{
+   if (!kvm_para_available())
+   return;
+
+   paravirt_ops_setup();
+}
Index: kvm.paravirt/arch/x86/kernel/setup_32.c
===
--- kvm.paravirt.orig/arch/x86/kernel/setup_32.c
+++ kvm.paravirt/arch/x86/kernel/setup_32.c
@@ -46,6 +46,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -779,6 +780,7 @@ void __init setup_arch(char **cmdline_p)
 */
vmi_init();
 #endif
+   kvm_guest_init();
 
/*
 * NOTE: before this point _nobody_ is allowed to allocate
Index: kvm.paravirt/arch/x86/kernel/setup_64.c
===
--- kvm.paravirt.orig/arch/x86/kernel/setup_64.c
+++ kvm.paravirt/arch/x86/kernel/setup_64.c
@@ -41,6 +41,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -447,6 +448,8 @@ void __init setup_arch(char **cmdline_p)
init_apic_mappings();
ioapic_init_mappings();
 
+   kvm_guest_init();
+
/*
 * We trust e820 completely. No explicit ROM probing in memory.
 */
Index: kvm.paravirt/arch/x86/kvm/x86.c
===
--- kvm.paravirt.orig/arch/x86/kvm/x86.c
+++ kvm.paravirt/arch/x86/kvm/x86.c
@@ -696,6 +696,7 @@ int kvm_dev_ioctl_check_extension(long e
case KVM_CAP_USER_MEMORY:
case KVM_CAP_SET_TSS_ADDR:
case KVM_CAP_EXT_CPUID:
+   case KVM_CAP_PARA_FEATURES:
r = 1;
break;
case KVM_CAP_VAPIC:
@@ -761,6 +762,15 @@ long kvm_arch_dev_ioctl(struct file *fil
r = 0;
break;
}
+   case KVM_GET_PARA_FEATURES: {
+   __u32 para_features = KVM_PARA_FEATURES;
+
+   r = -EFAULT;
+   if (copy_to_user(argp, &para_features, sizeof para_features))
+   goto out;
+   r = 0;
+   break;
+   }
default:
r = -EINVAL;
}
Index: kvm.paravirt/include/asm-x86/kvm_para.h
=

[kvm-devel] [patch 2/5] KVM: hypercall based pte updates and TLB flushes

2008-02-16 Thread Marcelo Tosatti
Hypercall based pte updates are faster than faults, and also allow use
of the lazy MMU mode to batch operations.

Don't report the feature if two dimensional paging is enabled.

Signed-off-by: Marcelo Tosatti <[EMAIL PROTECTED]>
Cc: Anthony Liguori <[EMAIL PROTECTED]>


Index: kvm.paravirt/arch/x86/kernel/kvm.c
===
--- kvm.paravirt.orig/arch/x86/kernel/kvm.c
+++ kvm.paravirt/arch/x86/kernel/kvm.c
@@ -33,6 +33,91 @@ static void kvm_io_delay(void)
 {
 }
 
+static void kvm_mmu_write(void *dest, const void *src, size_t size)
+{
+   const uint8_t *p = src;
+   unsigned long a0 = *(unsigned long *)p;
+   unsigned long a1 = 0;
+
+   size >>= 2;
+#ifdef CONFIG_X86_32
+   if (size == 2)
+   a1 = *(u32 *)&p[4];
+#endif
+   kvm_hypercall4(KVM_HYPERCALL_MMU_WRITE, (unsigned long)dest, size, a0,
+  a1);
+}
+
+/*
+ * We only need to hook operations that are MMU writes.  We hook these so that
+ * we can use lazy MMU mode to batch these operations.  We could probably
+ * improve the performance of the host code if we used some of the information
+ * here to simplify processing of batched writes.
+ */
+static void kvm_set_pte(pte_t *ptep, pte_t pte)
+{
+   kvm_mmu_write(ptep, &pte, sizeof(pte));
+}
+
+static void kvm_set_pte_at(struct mm_struct *mm, unsigned long addr,
+  pte_t *ptep, pte_t pte)
+{
+   kvm_mmu_write(ptep, &pte, sizeof(pte));
+}
+
+static void kvm_set_pmd(pmd_t *pmdp, pmd_t pmd)
+{
+   kvm_mmu_write(pmdp, &pmd, sizeof(pmd));
+}
+
+#if PAGETABLE_LEVELS >= 3
+#ifdef CONFIG_X86_PAE
+static void kvm_set_pte_atomic(pte_t *ptep, pte_t pte)
+{
+   kvm_mmu_write(ptep, &pte, sizeof(pte));
+}
+
+static void kvm_set_pte_present(struct mm_struct *mm, unsigned long addr,
+   pte_t *ptep, pte_t pte)
+{
+   kvm_mmu_write(ptep, &pte, sizeof(pte));
+}
+
+static void kvm_pte_clear(struct mm_struct *mm,
+ unsigned long addr, pte_t *ptep)
+{
+   pte_t pte = __pte(0);
+   kvm_mmu_write(ptep, &pte, sizeof(pte));
+}
+
+static void kvm_pmd_clear(pmd_t *pmdp)
+{
+   pmd_t pmd = __pmd(0);
+   kvm_mmu_write(pmdp, &pmd, sizeof(pmd));
+}
+#endif
+
+static void kvm_set_pgd(pgd_t *pgdp, pgd_t pgd)
+{
+   kvm_mmu_write(pgdp, &pgd, sizeof(pgd));
+}
+
+static void kvm_set_pud(pud_t *pudp, pud_t pud)
+{
+   kvm_mmu_write(pudp, &pud, sizeof(pud));
+}
+#endif /* PAGETABLE_LEVELS >= 3 */
+
+static void kvm_flush_tlb(void)
+{
+   kvm_hypercall0(KVM_HYPERCALL_FLUSH_TLB);
+}
+
+static void kvm_release_pt(u32 pfn)
+{
+   kvm_hypercall1(KVM_HYPERCALL_RELEASE_PT, pfn << PAGE_SHIFT);
+}
+
 static void paravirt_ops_setup(void)
 {
pv_info.name = "KVM";
@@ -41,6 +126,24 @@ static void paravirt_ops_setup(void)
if (kvm_para_has_feature(KVM_FEATURE_NOP_IO_DELAY))
pv_cpu_ops.io_delay = kvm_io_delay;
 
+   if (kvm_para_has_feature(KVM_FEATURE_MMU_WRITE)) {
+   pv_mmu_ops.set_pte = kvm_set_pte;
+   pv_mmu_ops.set_pte_at = kvm_set_pte_at;
+   pv_mmu_ops.set_pmd = kvm_set_pmd;
+#if PAGETABLE_LEVELS >= 3
+#ifdef CONFIG_X86_PAE
+   pv_mmu_ops.set_pte_atomic = kvm_set_pte_atomic;
+   pv_mmu_ops.set_pte_present = kvm_set_pte_present;
+   pv_mmu_ops.pte_clear = kvm_pte_clear;
+   pv_mmu_ops.pmd_clear = kvm_pmd_clear;
+#endif
+   pv_mmu_ops.set_pud = kvm_set_pud;
+   pv_mmu_ops.set_pgd = kvm_set_pgd;
+#endif
+   pv_mmu_ops.flush_tlb_user = kvm_flush_tlb;
+   pv_mmu_ops.release_pt = kvm_release_pt;
+   pv_mmu_ops.release_pd = kvm_release_pt;
+   }
 }
 
 void __init kvm_guest_init(void)
Index: kvm.paravirt/arch/x86/kvm/mmu.c
===
--- kvm.paravirt.orig/arch/x86/kvm/mmu.c
+++ kvm.paravirt/arch/x86/kvm/mmu.c
@@ -39,7 +39,7 @@
  * 2. while doing 1. it walks guest-physical to host-physical
  * If the hardware supports that we don't need to do shadow paging.
  */
-static bool tdp_enabled = false;
+bool tdp_enabled = false;
 
 #undef MMU_DEBUG
 
@@ -288,7 +288,7 @@ static void mmu_free_memory_cache_page(s
free_page((unsigned long)mc->objects[--mc->nobjs]);
 }
 
-static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu)
+int mmu_topup_memory_caches(struct kvm_vcpu *vcpu)
 {
int r;
 
@@ -857,7 +857,7 @@ static int kvm_mmu_unprotect_page(struct
return r;
 }
 
-static void mmu_unshadow(struct kvm *kvm, gfn_t gfn)
+void mmu_unshadow(struct kvm *kvm, gfn_t gfn)
 {
struct kvm_mmu_page *sp;
 
Index: kvm.paravirt/arch/x86/kvm/mmu.h
===
--- kvm.paravirt.orig/arch/x86/kvm/mmu.h
+++ kvm.paravirt/arch/x86/kvm/mmu.h
@@ -47,4 +47,7 @@ static inline int is_paging(struct kvm_v
return vcpu->arch.cr0

[kvm-devel] [patch 3/5] KVM: hypercall batching

2008-02-16 Thread Marcelo Tosatti
Batch pte updates and tlb flushes in lazy MMU mode.

Signed-off-by: Marcelo Tosatti <[EMAIL PROTECTED]>
Cc: Anthony Liguori <[EMAIL PROTECTED]>

Index: kvm.paravirt/arch/x86/kernel/kvm.c
===
--- kvm.paravirt.orig/arch/x86/kernel/kvm.c
+++ kvm.paravirt/arch/x86/kernel/kvm.c
@@ -25,6 +25,74 @@
 #include 
 #include 
 #include 
+#include 
+
+#define MAX_MULTICALL_NR (PAGE_SIZE / sizeof(struct kvm_multicall_entry))
+
+struct kvm_para_state {
+   struct kvm_multicall_entry queue[MAX_MULTICALL_NR];
+   int queue_index;
+   enum paravirt_lazy_mode mode;
+};
+
+static DEFINE_PER_CPU(struct kvm_para_state, para_state);
+
+static int can_defer_hypercall(struct kvm_para_state *state, unsigned int nr)
+{
+   if (state->mode == PARAVIRT_LAZY_MMU) {
+   switch (nr) {
+   case KVM_HYPERCALL_MMU_WRITE:
+   case KVM_HYPERCALL_FLUSH_TLB:
+   return 1;
+   }
+   }
+   return 0;
+}
+
+static void hypercall_queue_flush(struct kvm_para_state *state)
+{
+   if (state->queue_index) {
+   kvm_hypercall2(KVM_HYPERCALL_MULTICALL, __pa(&state->queue),
+ state->queue_index);
+   state->queue_index = 0;
+   }
+}
+
+static void kvm_hypercall_defer(struct kvm_para_state *state,
+   unsigned int nr,
+   unsigned long a0, unsigned long a1,
+   unsigned long a2, unsigned long a3)
+{
+   struct kvm_multicall_entry *entry;
+
+   BUG_ON(preemptible());
+
+   if (state->queue_index == MAX_MULTICALL_NR)
+   hypercall_queue_flush(state);
+
+   entry = &state->queue[state->queue_index++];
+   entry->nr = nr;
+   entry->a0 = a0;
+   entry->a1 = a1;
+   entry->a2 = a2;
+   entry->a3 = a3;
+}
+
+static long kvm_hypercall(unsigned int nr, unsigned long a0,
+ unsigned long a1, unsigned long a2,
+ unsigned long a3)
+{
+   struct kvm_para_state *state = &get_cpu_var(para_state);
+   long ret = 0;
+
+   if (can_defer_hypercall(state, nr))
+   kvm_hypercall_defer(state, nr, a0, a1, a2, a3);
+   else
+   ret = kvm_hypercall4(nr, a0, a1, a2, a3);
+
+   put_cpu_var(para_state);
+   return ret;
+}
 
 /*
  * No need for any "IO delay" on KVM
@@ -44,7 +112,7 @@ static void kvm_mmu_write(void *dest, co
if (size == 2)
a1 = *(u32 *)&p[4];
 #endif
-   kvm_hypercall4(KVM_HYPERCALL_MMU_WRITE, (unsigned long)dest, size, a0,
+   kvm_hypercall(KVM_HYPERCALL_MMU_WRITE, (unsigned long)dest, size, a0,
   a1);
 }
 
@@ -110,12 +178,31 @@ static void kvm_set_pud(pud_t *pudp, pud
 
 static void kvm_flush_tlb(void)
 {
-   kvm_hypercall0(KVM_HYPERCALL_FLUSH_TLB);
+   kvm_hypercall(KVM_HYPERCALL_FLUSH_TLB, 0, 0, 0, 0);
 }
 
 static void kvm_release_pt(u32 pfn)
 {
-   kvm_hypercall1(KVM_HYPERCALL_RELEASE_PT, pfn << PAGE_SHIFT);
+   kvm_hypercall(KVM_HYPERCALL_RELEASE_PT, pfn << PAGE_SHIFT, 0, 0, 0);
+}
+
+static void kvm_enter_lazy_mmu(void)
+{
+   struct kvm_para_state *state
+   = &per_cpu(para_state, smp_processor_id());
+
+   paravirt_enter_lazy_mmu();
+   state->mode = paravirt_get_lazy_mode();
+}
+
+static void kvm_leave_lazy_mmu(void)
+{
+   struct kvm_para_state *state
+   = &per_cpu(para_state, smp_processor_id());
+
+   hypercall_queue_flush(state);
+   paravirt_leave_lazy(paravirt_get_lazy_mode());
+   state->mode = paravirt_get_lazy_mode();
 }
 
 static void paravirt_ops_setup(void)
@@ -144,6 +231,11 @@ static void paravirt_ops_setup(void)
pv_mmu_ops.release_pt = kvm_release_pt;
pv_mmu_ops.release_pd = kvm_release_pt;
}
+
+   if (kvm_para_has_feature(KVM_FEATURE_MULTICALL)) {
+   pv_mmu_ops.lazy_mode.enter = kvm_enter_lazy_mmu;
+   pv_mmu_ops.lazy_mode.leave = kvm_leave_lazy_mmu;
+   }
 }
 
 void __init kvm_guest_init(void)
Index: kvm.paravirt/arch/x86/kvm/x86.c
===
--- kvm.paravirt.orig/arch/x86/kvm/x86.c
+++ kvm.paravirt/arch/x86/kvm/x86.c
@@ -78,6 +78,8 @@ struct kvm_stats_debugfs_item debugfs_en
{ "fpu_reload", VCPU_STAT(fpu_reload) },
{ "insn_emulation", VCPU_STAT(insn_emulation) },
{ "insn_emulation_fail", VCPU_STAT(insn_emulation_fail) },
+   { "multicall", VCPU_STAT(multicall) },
+   { "multicall_nr", VCPU_STAT(multicall_nr) },
{ "mmu_shadow_zapped", VM_STAT(mmu_shadow_zapped) },
{ "mmu_pte_write", VM_STAT(mmu_pte_write) },
{ "mmu_pte_updated", VM_STAT(mmu_pte_updated) },
@@ -764,8 +766,10 @@ long kvm_arch_dev_ioctl(struct file *fil
}
case KVM_GET_PARA_FEATURES: {
__u32 para_features = KVM_

[kvm-devel] [patch 4/5] KVM: ignore zapped root pagetables

2008-02-16 Thread Marcelo Tosatti
Mark zapped root pagetables as invalid and ignore such pages during lookup.

This is a problem with the cr3-target feature, where a zapped root table fools
the faulting code into creating a read-only mapping. The result is a lockup
if the instruction can't be emulated.

Signed-off-by: Marcelo Tosatti <[EMAIL PROTECTED]>
Cc: Anthony Liguori <[EMAIL PROTECTED]>

Index: kvm.paravirt/arch/x86/kvm/mmu.c
===
--- kvm.paravirt.orig/arch/x86/kvm/mmu.c
+++ kvm.paravirt/arch/x86/kvm/mmu.c
@@ -668,7 +668,8 @@ static struct kvm_mmu_page *kvm_mmu_look
index = kvm_page_table_hashfn(gfn);
bucket = &kvm->arch.mmu_page_hash[index];
hlist_for_each_entry(sp, node, bucket, hash_link)
-   if (sp->gfn == gfn && !sp->role.metaphysical) {
+   if (sp->gfn == gfn && !sp->role.metaphysical
+   && !sp->role.invalid) {
pgprintk("%s: found role %x\n",
 __FUNCTION__, sp->role.word);
return sp;
@@ -796,8 +797,10 @@ static void kvm_mmu_zap_page(struct kvm 
if (!sp->root_count) {
hlist_del(&sp->hash_link);
kvm_mmu_free_page(kvm, sp);
-   } else
+   } else {
list_move(&sp->link, &kvm->arch.active_mmu_pages);
+   sp->role.invalid = 1;
+   }
kvm_mmu_reset_last_pte_updated(kvm);
 }
 
Index: kvm.paravirt/include/asm-x86/kvm_host.h
===
--- kvm.paravirt.orig/include/asm-x86/kvm_host.h
+++ kvm.paravirt/include/asm-x86/kvm_host.h
@@ -140,6 +140,7 @@ union kvm_mmu_page_role {
unsigned pad_for_nice_hex_output : 6;
unsigned metaphysical : 1;
unsigned access : 3;
+   unsigned invalid : 1;
};
 };
 

-- 




[kvm-devel] [patch 5/5] KVM: VMX cr3 cache support

2008-02-16 Thread Marcelo Tosatti
Add support for the cr3 cache feature on Intel VMX CPU's. This avoids
vmexits on context switch if the cr3 value is cached in one of the 
entries (currently 4 are present).

This is especially important for Xenner, where each guest syscall involves a
cr3 switch.

Signed-off-by: Marcelo Tosatti <[EMAIL PROTECTED]>
Cc: Anthony Liguori <[EMAIL PROTECTED]>


Index: kvm.paravirt/arch/x86/kernel/kvm.c
===
--- kvm.paravirt.orig/arch/x86/kernel/kvm.c
+++ kvm.paravirt/arch/x86/kernel/kvm.c
@@ -26,14 +26,16 @@
 #include 
 #include 
 #include 
+#include 
 
 #define MAX_MULTICALL_NR (PAGE_SIZE / sizeof(struct kvm_multicall_entry))
 
 struct kvm_para_state {
+   struct kvm_cr3_cache cr3_cache;
struct kvm_multicall_entry queue[MAX_MULTICALL_NR];
int queue_index;
enum paravirt_lazy_mode mode;
-};
+} __attribute__ ((aligned(PAGE_SIZE)));
 
 static DEFINE_PER_CPU(struct kvm_para_state, para_state);
 
@@ -101,6 +103,98 @@ static void kvm_io_delay(void)
 {
 }
 
+static void kvm_new_cr3(unsigned long cr3)
+{
+   kvm_hypercall1(KVM_HYPERCALL_SET_CR3, cr3);
+}
+
+/*
+ * Special, register-to-cr3 instruction based hypercall API
+ * variant to the KVM host. This utilizes the cr3 filter capability
+ * of the hardware - if this works out then no VM exit happens,
+ * if a VM exit happens then KVM will get the virtual address too.
+ */
+static void kvm_write_cr3(unsigned long guest_cr3)
+{
+   struct kvm_para_state *para_state = &get_cpu_var(para_state);
+   struct kvm_cr3_cache *cache = ¶_state->cr3_cache;
+   int idx;
+
+   /*
+* Check the cache (maintained by the host) for a matching
+* guest_cr3 => host_cr3 mapping. Use it if found:
+*/
+   for (idx = 0; idx < cache->max_idx; idx++) {
+   if (cache->entry[idx].guest_cr3 == guest_cr3) {
+   /*
+* Cache-hit: we load the cached host-CR3 value.
+* This never causes any VM exit. (if it does then the
+* hypervisor could do nothing with this instruction
+* and the guest OS would be aborted)
+*/
+   native_write_cr3(cache->entry[idx].host_cr3);
+   goto out;
+   }
+   }
+
+   /*
+* Cache-miss. Tell the host the new cr3 via hypercall (to avoid
+* aliasing problems with a cached host_cr3 == guest_cr3).
+*/
+   kvm_new_cr3(guest_cr3);
+out:
+   put_cpu_var(para_state);
+}
+
+/*
+ * Avoid the VM exit upon cr3 load by using the cached
+ * ->active_mm->pgd value:
+ */
+static void kvm_flush_tlb_user(void)
+{
+   kvm_write_cr3(__pa(current->active_mm->pgd));
+}
+
+/*
+ * Disable global pages, do a flush, then enable global pages:
+ */
+static fastcall void kvm_flush_tlb_kernel(void)
+{
+   unsigned long orig_cr4 = read_cr4();
+
+   write_cr4(orig_cr4 & ~X86_CR4_PGE);
+   kvm_flush_tlb_user();
+   write_cr4(orig_cr4);
+}
+
+static void register_cr3_cache(void *cache)
+{
+   struct kvm_para_state *state;
+
+   state = &per_cpu(para_state, raw_smp_processor_id());
+   wrmsrl(KVM_MSR_SET_CR3_CACHE, __pa(&state->cr3_cache));
+}
+
+static unsigned __init kvm_patch(u8 type, u16 clobbers, void *ibuf,
+unsigned long addr, unsigned len)
+{
+   switch (type) {
+   case PARAVIRT_PATCH(pv_mmu_ops.write_cr3):
+   return paravirt_patch_default(type, clobbers, ibuf, addr, len);
+   default:
+   return native_patch(type, clobbers, ibuf, addr, len);
+   }
+}
+
+static void __init setup_guest_cr3_cache(void)
+{
+   on_each_cpu(register_cr3_cache, NULL, 0, 1);
+
+   pv_mmu_ops.write_cr3 = kvm_write_cr3;
+   pv_mmu_ops.flush_tlb_user = kvm_flush_tlb_user;
+   pv_mmu_ops.flush_tlb_kernel = kvm_flush_tlb_kernel;
+}
+
 static void kvm_mmu_write(void *dest, const void *src, size_t size)
 {
const uint8_t *p = src;
@@ -117,6 +211,28 @@ static void kvm_mmu_write(void *dest, co
 }
 
 /*
+ * CR3 cache initialization uses on_each_cpu(), so it can't
+ * happen at kvm_guest_init time.
+ */
+int __init kvm_cr3_cache_init(void)
+{
+   unsigned long flags;
+
+   if (!kvm_para_available())
+   return -ENOSYS;
+
+   if (kvm_para_has_feature(KVM_FEATURE_CR3_CACHE)) {
+   setup_guest_cr3_cache();
+   local_irq_save(flags);
+   apply_paravirt(__parainstructions, __parainstructions_end);
+   local_irq_restore(flags);
+   }
+
+   return 0;
+}
+module_init(kvm_cr3_cache_init);
+
+/*
  * We only need to hook operations that are MMU writes.  We hook these so that
  * we can use lazy MMU mode to batch these operations.  We could probably
  * improve the performance of the host code if we used some of the information
@@ -236,6 +352,9 @@ static void paravirt_op
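
For reference, the guest-side code above implies a cr3 cache layout
along these lines (a sketch reconstructed from the accessors used in
kvm_write_cr3() and register_cr3_cache(); the real header shared with
the host may differ, and KVM_CR3_CACHE_SIZE is a placeholder):

	struct kvm_cr3_cache_entry {
		unsigned long guest_cr3;	/* cr3 the guest asked for */
		unsigned long host_cr3;		/* shadow cr3 the host installed */
	};

	struct kvm_cr3_cache {
		struct kvm_cr3_cache_entry entry[KVM_CR3_CACHE_SIZE];
		unsigned int max_idx;		/* valid entries, host-maintained */
	};

The cache is registered per-cpu with the host via the
KVM_MSR_SET_CR3_CACHE wrmsr, so the host can fill the entries and a
cache hit can reload cr3 without any exit.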

Re: [kvm-devel] [patch 0/5] KVM paravirt MMU updates and cr3 caching

2008-02-16 Thread Anthony Liguori
Marcelo Tosatti wrote:
> The following patchset, based on earlier work by Anthony and Ingo, adds
> paravirt_ops support for KVM guests enabling hypercall based pte updates,
> hypercall batching and cr3 caching.
>   

Could you post performance results for each optimization?  I'm 
particularly curious if the hypercall batching is very useful.

Regards,

Anthony Liguori





Re: [kvm-devel] [patch 0/5] KVM paravirt MMU updates and cr3 caching

2008-02-16 Thread Marcelo Tosatti
On Sat, Feb 16, 2008 at 05:37:00PM -0600, Anthony Liguori wrote:
> Marcelo Tosatti wrote:
> >The following patchset, based on earlier work by Anthony and Ingo, adds
> >paravirt_ops support for KVM guests enabling hypercall based pte updates,
> >hypercall batching and cr3 caching.
> >  
> 
> Could you post performance results for each optimization?  I'm 
> particularly curious if the hypercall batching is very useful.

Batched hypercall pte updates give 8.5% performance improvement on
kernel compile:

http://www.mail-archive.com/kvm-devel@lists.sourceforge.net/msg12395.html

I can get separate results tomorrow or Monday, but I'm sure batching
plays a significant role. For the kernel compile test, there is an
average of 5 pte updates per batched hypercall.
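
To make "batched" concrete, the guest side of the batching amounts to
something like the sketch below (illustrative only: the queue and
queue_index fields match the kvm_para_state visible in the cr3 cache
patch earlier in this series, but the flush hypercall number, the mmu
op code and the entry layout are placeholders, not the real ABI):

	/* Drain the queued mmu ops with a single hypercall. */
	static void kvm_mmu_op_flush(struct kvm_para_state *state)
	{
		if (!state->queue_index)
			return;
		kvm_hypercall2(KVM_HYPERCALL_MULTICALL,	/* placeholder nr */
			       __pa(state->queue), state->queue_index);
		state->queue_index = 0;
	}

	/* Instead of one exit per pte write, append the update to the
	 * per-cpu queue; it gets flushed when the queue fills up or when
	 * the guest leaves lazy MMU mode. */
	static void kvm_queue_mmu_write(struct kvm_para_state *state,
					unsigned long dest, unsigned long val)
	{
		struct kvm_multicall_entry *e;

		if (state->queue_index == MAX_MULTICALL_NR)
			kvm_mmu_op_flush(state);

		e = &state->queue[state->queue_index++];
		e->op   = KVM_MMU_OP_WRITE_PTE;		/* placeholder op */
		e->dest = dest;
		e->val  = val;
	}

With ~5 pte updates per hypercall on that workload, roughly 5 exits
collapse into 1 on the pte-heavy paths of a kernel compile.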



Re: [kvm-devel] [patch 1/6] mmu_notifier: Core code

2008-02-16 Thread Andrea Arcangeli
On Sat, Feb 16, 2008 at 11:21:07AM -0800, Christoph Lameter wrote:
> On Fri, 15 Feb 2008, Andrew Morton wrote:
> 
> > What is the status of getting infiniband to use this facility?
> 
> Well we are talking about this it seems.

It seems the IB folks think allowing RDMA over virtual memory is not
interesting; their argument seems to be that RDMA is only interesting
on RAM (and they seem not interested in allowing RDMA over a ram+swap
backed _virtual_ memory allocation). They still have to decide whether
a ram+swap backed allocation for RDMA is useful or not.

> > How important is this feature to KVM?
> 
> Andrea can answer this.

I think I already did in a separate email.

> > That sucks big time.  What do we need to do to make get the callback
> > functions called in non-atomic context?

I certainly agree, given I also asked to drop the lock param and to
enforce that invalidate_range_* is always called in non-atomic context.

> We would have to drop the inode_mmap_lock. Could be done with some minor 
> work.

The invalidate may be deferred until after the lock is released; the
lock may not have to be dropped to clean up the API (and to make
xpmem's life easier).

> That is one implementation (XPmem does that). The other is to simply stop 
> all references when any invalidate_range is in progress (KVM and GRU do 
> that).

KVM doesn't stop new references. It doesn't need to, because it holds
a reference on the page (GRU doesn't). KVM can invalidate the spte and
flush the tlb only after the linux pte has been cleared and after the
page has been released by the VM (the page doesn't go into the
freelist and remains pinned for a little while, until the spte is
dropped too inside invalidate_range_end). GRU has to invalidate
_before_ the linux pte is cleared, so it has to stop new references
from being established inside the invalidate_range_start/end critical
section.
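
To make the difference concrete, here is a rough sketch of the two
models (illustrative only: the hook signatures follow the API proposed
in this thread, but the container layouts and helper names below are
made up):

	/* KVM-style: spte teardown can wait until _end, after the linux
	 * pte is gone, because the elevated page refcount keeps the page
	 * from being freed and reused in between. */
	static void kvm_range_end(struct mmu_notifier *mn,
				  struct mm_struct *mm,
				  unsigned long start, unsigned long end)
	{
		struct kvm *kvm = container_of(mn, struct kvm, mmu_notifier);

		spin_lock(&kvm->mmu_lock);
		kvm_unmap_sptes(kvm, start, end);	/* zap sptes + flush guest tlb */
		spin_unlock(&kvm->mmu_lock);
	}

	/* GRU-style: no page reference is held, so the external tlb must
	 * be flushed in _start, before the linux pte is cleared, and new
	 * references must be blocked until _end. */
	static void gru_range_start(struct mmu_notifier *mn,
				    struct mm_struct *mm,
				    unsigned long start, unsigned long end)
	{
		struct gru_state *gru = container_of(mn, struct gru_state, mn);

		gru_block_new_references(gru);
		gru_flush_external_tlb(gru, start, end);
	}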

> Andrea put this in to check the reference status of a page. It functions 
> like the accessed bit.

In short, each pte can have some spte associated with it. So whenever
we do a ptep_clear_flush protected by the PT lock, we also have to run
invalidate_page, which internally invokes a sort-of sptep_clear_flush
protected by kvm->mmu_lock (the equivalent of the page_table_lock/PT
lock). sptes, just like ptes, map virtual addresses to physical
addresses, so you can read/write RAM either through a pte or through a
spte.
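
In code terms the per-pte case looks about like this (again only a
sketch; the zap helper is a made-up name standing in for KVM's
rmap-based spte removal):

	static void kvm_notifier_invalidate_page(struct mmu_notifier *mn,
						 struct mm_struct *mm,
						 unsigned long address)
	{
		struct kvm *kvm = container_of(mn, struct kvm, mmu_notifier);

		spin_lock(&kvm->mmu_lock);	/* plays the role of the PT lock */
		kvm_zap_spte_for_hva(kvm, address);	/* drop spte(s), flush guest tlb */
		spin_unlock(&kvm->mmu_lock);
	}

It mirrors, one level down, the ptep_clear_flush the VM just did.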

Just like it would be insane to have any requirement that
ptep_clear_flush has to run in non-atomic context (forcing a
conversion of the PT lock to a mutex), it's also weird to require that
invalidate_page/age_page be able to run in non-atomic context.

All the trouble starts with the xpmem requirement of having to
schedule in its equivalent of the sptep_clear_flush, because it's not
a gigahertz-in-cpu thing but a gigabit thing, where the network stack
is involved with its own software, linux-driven skb memory
allocations, schedules waiting for network I/O, etc... Imagine ptes
allocated on a remote node; no surprise it brings a new set of
problems (assuming it can work reliably during oom at all, given its
memory requirements in the try_to_unmap path: no page can ever be
freed until the skbs have been allocated and sent, and allocated again
to receive the ack).

Furthermore, xpmem doesn't associate any pte with a spte; it
associates a page_t with certain remote references, or it would be in
trouble with invalidate_page, which corresponds to a ptep_clear_flush
on a virtual address that exists thanks to the anon_vma/i_mmap lock
being held (and not thanks to the mmap_sem as in all invalidate_range
calls).

Christoph's patch is a mix of two entirely separate features. KVM can
live with V7 just fine, but it's a lot more than what KVM needs.

I don't think invalidate_page/age_page must be allowed to sleep just
because invalidate_range also can sleep. You just have to ask yourself
whether the VM locks shall remain spinlocks, for the VM's own good
(not for the mmu notifiers' good). It'd be bad to make the VM
underperform with mutexes protecting tiny critical sections just to
please some mmu notifier user. But if they're spinlocks, then clearly
invalidate_page/age_page based on virtual addresses can't sleep, or
the virtual address wouldn't make sense anymore by the time the
spinlock is released.

> > This function looks like it was tossed in at the last minute.  It's
> > mysterious, undocumented, poorly commented, poorly named.  A better name
> > would be one which has some correlation with the return value.
> > 
> > Because anyone who looks at some code which does
> > 
> > if (mmu_notifier_age_page(mm, address))
> > ...
> > 
> > has to go and reverse-engineer the implementation of
> > mmu_notifier_age_page() to work out under which circumstances the "..."
> > will be executed.  But this should be apparent just from reading the callee
> > implementation.
> > 
> > This function *really* does need some documentation.  What does it *mean*
> > when the ->age_page() from some of the notifiers returned "1" and the
> > ->age_pag

Re: [kvm-devel] [RFC] Performance monitoring units and KVM

2008-02-16 Thread Balaji Rao
On Sunday 17 February 2008 03:34:43 am Anthony Liguori wrote:
> Balaji Rao wrote:
> > Hi all!
> >
> > Earlier it was suggested that we go ahead with emulating Perf Mon Events
> > in exposing it to the guest. The serious limitation in this approach is
> > that we end up exposing only a small number of events to the guest, even
> > though the host hardware is capable of much more. The only benefit this
> > approach offers is that, it doesn't break live migration.
>
> I think performance monitors are no different than anything else in
> KVM.  We should virtualize as much as possible and by default provide
> only the common subset to the guest supported by the majority of hardware.
>
> Then we can use mechanisms like QEMU's CPU support to enable additional
> features that may be available and unique to the underlying hardware.
> It's then up to the management tools to deal with migratability since
> they've explicitly enabled the feature.

Sorry, I don't understand how it can be done through QEMU; from what I 
understand, it makes migration very difficult or impossible. So why should we 
go for this approach at all? It's the very reason direct access to the PMU was 
thought of as a bad idea.

Do you see any other problem in directly exposing the PMU ?

-- 
regards,

balaji rao
NITK



Re: [kvm-devel] [patch 1/6] mmu_notifier: Core code

2008-02-16 Thread Doug Maxey

On Fri, 15 Feb 2008 19:37:19 PST, Andrew Morton wrote:
> Which other potential clients have been identified and how important it it
> to those?

The powerpc ehea utilizes its own MMU.  Not sure how important this is 
to the driver. (But will investigate :)

++doug

