[kvm-devel] [PATCH 0 of 9] mmu notifier #v12

2008-04-08 Thread Andrea Arcangeli
The difference from #v11 is a different implementation of mm_lock that
guarantees signals are handled in O(N). It's also more low-latency friendly.

Note that mmu_notifier_unregister may also fail with -EINTR if there are
signals pending or the system runs out of vmalloc space or physical memory;
only exit_mmap guarantees that any kernel module can be unloaded in the
presence of an OOM condition.

Either #v11 or the first three #v12 patches (1, 2, 3) are suitable for
inclusion in -mm; pick whichever you prefer after looking at the difference
between the mmu_notifier_register retval and the mm_lock retval. I implemented
and slightly tested both. GRU and KVM only need 1, 2 and 3; XPMEM needs the
rest of the patchset too (4, ...), but all patches from 4 onward can be
deferred to a second merge window.

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Register now and save $200. Hurry, offer ends at 11:59 p.m., 
Monday, April 7! Use priority code J8TLD2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 0 of 9] mmu notifier #v12

2008-04-08 Thread Avi Kivity
Andrea Arcangeli wrote:
> Note that mmu_notifier_unregister may also fail with -EINTR if there are
> signals pending or the system runs out of vmalloc space or physical memory;
> only exit_mmap guarantees that any kernel module can be unloaded in the
> presence of an OOM condition.
>
>   

That's unusual.  What happens to the notifier?  Suppose I destroy a vm 
without exiting the process, what happens if it fires?

-- 
Any sufficiently difficult bug is indistinguishable from a feature.




Re: [kvm-devel] [PATCH 0 of 9] mmu notifier #v12

2008-04-08 Thread Andrea Arcangeli
On Wed, Apr 09, 2008 at 12:46:49AM +0300, Avi Kivity wrote:
> That's unusual.  What happens to the notifier?  Suppose I destroy a vm 

Yes it's quite unusual.

> without exiting the process, what happens if it fires?

The mmu notifier ops should stop doing anything (with no memslots left
they become no-ops), or the ops can be replaced atomically with null
pointers. The important thing is that the module can't go away until
->release is invoked or until mmu_notifier_unregister has returned 0.

Previously there was no mmu_notifier_unregister, so adding it can't be a
regression compared to #v11, even if it can fail and you may have to
retry later after returning to userland. Retrying from userland is
always safe in OOM-kill terms; only looping inside the kernel isn't
safe, as do_exit would have no chance to run.



Re: [kvm-devel] [PATCH 0 of 9] mmu notifier #v12

2008-04-09 Thread Robin Holt
I applied this patch set together with the xpmem version I am working up
for submission, and the basic level-1 and level-2 tests passed.  The full
MPI regression test still tends to hang, but that appears to be a common
failure affecting both the emm and mmu notifier versions, so I am certain
it is a problem in my code.

Please note this is not an endorsement of one method over the other,
merely that under conditions where we would expect xpmem to pass the
regression tests, it does pass those tests.

Thanks,
Robin

On Tue, Apr 08, 2008 at 05:44:03PM +0200, Andrea Arcangeli wrote:
> The difference with #v11 is a different implementation of mm_lock that
> guarantees handling signals in O(N). It's also more low-latency friendly.
> 
> Note that mmu_notifier_unregister may also fail with -EINTR if there are
> signals pending or the system runs out of vmalloc space or physical memory;
> only exit_mmap guarantees that any kernel module can be unloaded in the
> presence of an OOM condition.
> 
> Either #v11 or the first three #v12 1,2,3 patches are suitable for inclusion
> in -mm, pick what you prefer looking at the mmu_notifier_register retval and
> mm_lock retval difference, I implemented and slightly tested both. GRU and KVM
> only need 1,2,3, XPMEM needs the rest of the patchset too (4, ...) but all
> patches from 4 to the end can be deferred to a second merge window.



Re: [kvm-devel] [PATCH 0 of 9] mmu notifier #v12

2008-04-09 Thread Andrea Arcangeli
On Wed, Apr 09, 2008 at 08:17:09AM -0500, Robin Holt wrote:
> I applied this patch set with the xpmem version I am working up for
> submission and the basic level-1 and level-2 tests passed.  The full MPI
> regression test still tends to hang, but that appears to be a common
> failure affecting either emm or mmu notifiers and therefore I
> am certain it is a problem in my code.
> 
> Please note this is not an endorsement of one method over the other,
> merely that under conditions where we would expect xpmem to pass the
> regression tests, it does pass those tests.

Thanks a lot for testing! #v12 works great with KVM too. (I'm now in
the process of changing the KVM patch to drop the page pinning.)

BTW, how did you implement invalidate_page? As this?

invalidate_page() {
	invalidate_range_begin()
	invalidate_range_end()
}

If yes, let me remind you that normally invalidate_range_begin is
always called before zapping the pte. In the invalidate_page case
instead, invalidate_range_begin is called _after_ the pte has already
been zapped.

Now there's no problem if the pte is established and the spte isn't
established. But it must never happen that the spte is established and
the pte isn't established (with page-pinning that means unswappable
memlock leak, without page-pinning it would mean memory corruption).

So range_begin must serialize against the secondary mmu page fault, so
that the fault can't establish the spte on a pte that was zapped by the
rmap code after get_user_pages/follow_page returned. I think your
range_begin already does that, so you should be ok, but I wanted to
point out this slight difference in implementing invalidate_page as
suggested above in the previous email, just to be sure ;).

This is the race you must guard against in invalidate_page:


 CPU0                               CPU1
 try_to_unmap on page
                                    secondary mmu page fault
                                    get_user_pages()/follow_page found a page
 ptep_clear_flush
 invalidate_page()
   invalidate_range_begin()
   invalidate_range_end()
 return from invalidate_page
                                    establish spte on page
                                    return from secondary mmu page fault

If your range_begin already serializes in a hard way against the
secondary mmu page fault, my previously suggested "trivial"
implementation of invalidate_page should work just fine, and this
saves one branch for each try_to_unmap_one compared to the emm
implementation. The branch check is inlined and it checks against the
mmu_notifier_head, which is the hot cacheline; no new cacheline is
checked, just one branch is saved, so it's worth it IMHO even if it
doesn't provide any other advantage when you implement it the way above.



Re: [kvm-devel] [PATCH 0 of 9] mmu notifier #v12

2008-04-09 Thread Robin Holt
On Wed, Apr 09, 2008 at 04:44:01PM +0200, Andrea Arcangeli wrote:
> BTW, how did you implement invalidate_page? As this?
> 
>   invalidate_page() {
>   invalidate_range_begin()
>   invalidate_range_end()
>   }

Essentially, I did the work of each step without releasing and
reacquiring locks.

> If yes, I prefer to remind you that normally invalidate_range_begin is
> always called before zapping the pte. In the invalidate_page case
> instead, invalidate_range_begin is called _after_ the pte has been
> zapped already.
> 
> Now there's no problem if the pte is established and the spte isn't
> established. But it must never happen that the spte is established and
> the pte isn't established (with page-pinning that means unswappable
> memlock leak, without page-pinning it would mean memory corruption).

I am not sure I follow what you are saying.  Here is a very terse
breakdown of how PFNs flow through xpmem's structures.

We have a PFN table associated with our structure describing a grant.
We use get_user_pages() to acquire information for that table and we
fill the table in under a mutex.  Remote hosts (on the same NUMA network,
so they have direct access to the user's memory) have a PROXY version of
that structure.  It is filled out in a similar fashion to the local
table.  PTEs are created for the other processes while holding the mutex
for this table (either local or remote).  During the process of
faulting, we have a simple linked list of ongoing faults that is
maintained whenever the mutex is going to be released.

Our version of zap_page_range is called recall_PFNs.  The recall
process grabs the mutex, scans the faulters list for any that cover the
range, and marks them as needing a retry.  It then calls zap_page_range
for any processes that have attached the granted memory, to clear out
their page tables.  Finally, we release the mutex and proceed.

The locking is more complex than this, but that is the essential idea.


What that means for mmu_notifiers is that we have a single reference on
the page for all the remote processes using it.  When the callout to
invalidate_page() is made, we will still have processes with that PTE in
their page tables and potentially TLB entries.  When we return from the
invalidate_page() callout, we will have removed all those page table
entries, we will have no in-progress page table or tlb insertions that
will complete, and we will have released all our references to the page.

Does that meet your expectations?

Thanks,
Robin
> 
> So the range_begin must serialize against the secondary mmu page fault
> so that it can't establish the spte on a pte that was zapped by the
> rmap code after get_user_pages/follow_page returned. I think your
> range_begin already does that so you should be ok but I wanted to
> remind about this slight difference in implementing invalidate_page as
> I suggested above in previous email just to be sure ;).
> 
> This is the race you must guard against in invalidate_page:
> 
> 
>  CPU0                               CPU1
>  try_to_unmap on page
>                                     secondary mmu page fault
>                                     get_user_pages()/follow_page found a page
>  ptep_clear_flush
>  invalidate_page()
>    invalidate_range_begin()
>    invalidate_range_end()
>  return from invalidate_page
>                                     establish spte on page
>                                     return from secondary mmu page fault
> 
> If your range_begin already serializes in a hard way against the
> secondary mmu page fault, my previously "trivial" suggested
> implementation for invalidate_page should work just fine and this
> saves 1 branch for each try_to_unmap_one if compared to the emm
> implementation. The branch check is inlined and it checks against the
> mmu_notifier_head that is the hot cacheline, no new cacheline is
> checked just one branch is saved and so it's worth it IMHO even if it
> doesn't provide any other advantage if you implement it the way above.



Re: [kvm-devel] [PATCH 0 of 9] mmu notifier #v12

2008-04-14 Thread Christoph Lameter
On Tue, 8 Apr 2008, Andrea Arcangeli wrote:

> The difference with #v11 is a different implementation of mm_lock that
> guarantees handling signals in O(N). It's also more low-latency friendly.

Ok. So the rest of the issues remain unaddressed? I am glad that we
finally settled on the locking. But now I will have to clean this up,
address the remaining issues, sequence the patches right, provide docs,
handle the merging issues, etc.? I have seen no detailed review of my
patches that you include here.

Are we going down the same road as with the OOM patches, where
David Rientjes and I had to deal with the issues you raised?



Re: [kvm-devel] [PATCH 0 of 9] mmu notifier #v12

2008-04-22 Thread Andrea Arcangeli
This is a followup on the locking of the mmu-notifier methods against
the secondary-mmu page fault. Each driver can implement it differently,
but this shows an example of what I planned for KVM; others may follow
closely if they find it useful. I post this as pseudocode to hide 99%
of KVM's internal complexity and to focus only on the locking. The KVM
locking scheme should be something along these lines:

invalidate_range_start {
	spin_lock(&kvm->mmu_lock);

	kvm->invalidate_range_count++;
	rmap-invalidate of sptes in range

	spin_unlock(&kvm->mmu_lock)
}

invalidate_range_end {
	spin_lock(&kvm->mmu_lock);

	kvm->invalidate_range_count--;

	spin_unlock(&kvm->mmu_lock)
}

invalidate_page {
	spin_lock(&kvm->mmu_lock);

	write_seqlock()
	rmap-invalidate of sptes of page
	write_sequnlock()

	spin_unlock(&kvm->mmu_lock)
}

kvm_page_fault {
	seq = read_seqbegin()
	get_user_pages() (aka gfn_to_pfn() in kvm terms)

	spin_lock(&kvm->mmu_lock)
	if (read_seqretry(seq) || kvm->invalidate_range_count)
		goto out; /* retry page fault */
	map sptes and build rmap
out:
	spin_unlock(&kvm->mmu_lock)
}

This will allow the page pinning to be removed from KVM. I'd appreciate
it if you, Robin and Christoph, could have a second look and pinpoint
any potential issue in my plan.

As you can notice, invalidate_page makes it possible to decrease the
fixed-cost overhead in all VM code that works on a single page, where
freeing the page _after_ calling invalidate_page has zero runtime/tlb
cost. We need invalidate_range_begin/end because when we work on
multiple pages, we can reduce cpu utilization and avoid many tlb
flushes by holding off the kvm page fault while we work on the range.

invalidate_page also decreases the window in which the kvm page
fault might need to be retried (the ptep_clear_flush <->
invalidate_page window is shorter than an
invalidate_range_begin(PAGE_SIZE) <->
invalidate_range_end(PAGE_SIZE) window).

So even if only as a micro-optimization, it's worth it to decrease the
impact on the common VM code. The cost of having both a seqlock and a
range_count is irrelevant in kvm terms, as they'll be in the same
cacheline and checked at the same time by the page fault, and it won't
require any additional blocking (or writing) lock.

Note that the kvm page fault can't happen unless the cpu switches to
guest mode, and it can't switch to guest mode if we're in the
begin/end critical section, so in theory I could loop inside the page
fault too without risking a deadlock, but retrying it by restarting
guest mode sounds nicer in sigkill/scheduling terms.

Soon I'll release a new mmu notifier patchset, with patch 1 being the
self-contained mmu-notifier-core, ready to go in -mm and mainline in
time for 2.6.26. Then I'll be glad to help merge any further patches
in the patchset that allow the methods to sleep, so XPMEM can run on
mainline 2.6.27 the same way GRU/KVM/Quadrics will run fine on 2.6.26,
in a fully backwards compatible way with 2.6.26 (and of course it
doesn't really need to be backwards compatible, because this is a
kernel-internal API only, ask Greg etc... ;). But that will likely
require a new config option, to avoid hurting AIM fork performance,
because the anon_vma critical sections are so short in the fast path.



Re: [kvm-devel] [PATCH 0 of 9] mmu notifier #v12

2008-04-22 Thread Andrea Arcangeli
On Tue, Apr 22, 2008 at 09:20:26AM +0200, Andrea Arcangeli wrote:
> invalidate_range_start {
>   spin_lock(&kvm->mmu_lock);
> 
>   kvm->invalidate_range_count++;
>   rmap-invalidate of sptes in range
> 

write_seqlock; write_sequnlock;

>   spin_unlock(&kvm->mmu_lock)
> }
> 
> invalidate_range_end {
>   spin_lock(&kvm->mmu_lock);
> 
>   kvm->invalidate_range_count--;


write_seqlock; write_sequnlock;

> 
>   spin_unlock(&kvm->mmu_lock)
> }

Robin correctly pointed out by PM that there should be a seqlock in
range_begin/end too, as corrected above.

I guess it's better to use an explicit sequence counter, so we avoid a
useless spinlock in the write_seqlock (mmu_lock is already enough in
all places) and we can increase the counter with a single op, += 2, in
range_begin/end. The above is a lower-performance version of the final
locking, but simpler for reading purposes.



Re: [kvm-devel] [PATCH 0 of 9] mmu notifier #v12

2008-04-22 Thread Robin Holt
On Tue, Apr 22, 2008 at 02:00:56PM +0200, Andrea Arcangeli wrote:
> On Tue, Apr 22, 2008 at 09:20:26AM +0200, Andrea Arcangeli wrote:
> > invalidate_range_start {
> > spin_lock(&kvm->mmu_lock);
> > 
> > kvm->invalidate_range_count++;
> > rmap-invalidate of sptes in range
> > 
> 
>   write_seqlock; write_sequnlock;

I don't think you need it here, since invalidate_range_count is already
elevated, which accomplishes the same effect.

Thanks,
Robin



Re: [kvm-devel] [PATCH 0 of 9] mmu notifier #v12

2008-04-22 Thread Andrea Arcangeli
On Tue, Apr 22, 2008 at 08:01:20AM -0500, Robin Holt wrote:
> On Tue, Apr 22, 2008 at 02:00:56PM +0200, Andrea Arcangeli wrote:
> > On Tue, Apr 22, 2008 at 09:20:26AM +0200, Andrea Arcangeli wrote:
> > > invalidate_range_start {
> > >   spin_lock(&kvm->mmu_lock);
> > > 
> > >   kvm->invalidate_range_count++;
> > >   rmap-invalidate of sptes in range
> > > 
> > 
> > write_seqlock; write_sequnlock;
> 
> I don't think you need it here since invalidate_range_count is already
> elevated which will accomplish the same effect.

Agreed, a seqlock only in range_end should be enough. BTW, the fact
that the seqlock is needed regardless of whether invalidate_page
exists really makes invalidate_page a no-brainer, not just from the
core VM point of view but from the driver point of view too. The
kvm_page_fault logic would be the same even if I removed
invalidate_page from the mmu notifier patch, but it'd run slower both
when armed and disarmed.



Re: [kvm-devel] [PATCH 0 of 9] mmu notifier #v12

2008-04-22 Thread Robin Holt
On Tue, Apr 22, 2008 at 03:21:43PM +0200, Andrea Arcangeli wrote:
> On Tue, Apr 22, 2008 at 08:01:20AM -0500, Robin Holt wrote:
> > On Tue, Apr 22, 2008 at 02:00:56PM +0200, Andrea Arcangeli wrote:
> > > On Tue, Apr 22, 2008 at 09:20:26AM +0200, Andrea Arcangeli wrote:
> > > > invalidate_range_start {
> > > > spin_lock(&kvm->mmu_lock);
> > > > 
> > > > kvm->invalidate_range_count++;
> > > > rmap-invalidate of sptes in range
> > > > 
> > > 
> > >   write_seqlock; write_sequnlock;
> > 
> > I don't think you need it here since invalidate_range_count is already
> > elevated which will accomplish the same effect.
> 
> Agreed, seqlock only in range_end should be enough. BTW, the fact

I am a little confused about the value of the seq_lock versus a simple
atomic, but I assumed there is a reason and left it at that.

> seqlock is needed regardless of invalidate_page existing or not,
> really makes invalidate_page a no brainer not just from the core VM
> point of view, but from the driver point of view too. The
> kvm_page_fault logic would be the same even if I remove
> invalidate_page from the mmu notifier patch but it'd run slower both
> when armed and disarmed.

I don't know what you mean by "it'd" run slower and what you mean by
"armed and disarmed".

For the sake of this discussion, I will assume "it'd" means the kernel in
general and not KVM.  With the two call sites for range_begin/range_end,
I would agree we have more call sites, but the second is extremely likely
to be cache hot.

By disarmed, I will assume you mean no notifiers registered for a
particular mm.  In that case, the cache will make the second call
effectively free.  So, for the disarmed case, I see no measurable
difference.

For the case where there is a notifier registered, I certainly can see
a difference.  I am not certain how to quantify the difference as it
depends on the callee.  In the case of xpmem, our callout is always very
expensive for the _start case.  Our _end case is very light, but it is
essentially the exact same steps we would perform for the _page callout.

When I was discussing this difference with Jack, he reminded me that
the GRU, due to its hardware, does not have any race issues with the
invalidate_page callout simply doing the tlb shootdown and not modifying
any of its internal structures.  He then put a caveat on the discussion
that _either_ method was acceptable as far as he was concerned.  The real
issue is getting a patch in that satisfies all needs, not whether
there is a separate invalidate_page callout.

Thanks,
Robin



Re: [kvm-devel] [PATCH 0 of 9] mmu notifier #v12

2008-04-22 Thread Andrea Arcangeli
On Tue, Apr 22, 2008 at 08:36:04AM -0500, Robin Holt wrote:
> I am a little confused about the value of the seq_lock versus a simple
> atomic, but I assumed there is a reason and left it at that.

There's no value for anything but get_user_pages (though get_user_pages
takes its own lock internally). I preferred to explain it as a seqlock
because it was simpler for reading, but I totally agree that in the
final implementation it shouldn't be a seqlock. My code was meant to
be pseudocode only. It doesn't even need to be atomic ;).

> I don't know what you mean by "it'd" run slower and what you mean by
> "armed and disarmed".

1) when armed, the time window in which the kvm page fault is
blocked would be a bit larger without invalidate_page, for no good
reason

2) if you were to remove invalidate_page, when disarmed the VM
would need two branches instead of one in various places

I don't want to waste cycles if not wasting them improves performance
both when armed and disarmed.

> For the sake of this discussion, I will assume "it'd" means the kernel in
> general and not KVM.  With the two call sites for range_begin/range_end,

I actually meant for both.

> By disarmed, I will assume you mean no notifiers registered for a
> particular mm.  In that case, the cache will make the second call
> effectively free.  So, for the disarmed case, I see no measurable
> difference.

For rmap it is effectively free for sure; for do_wp_page it costs one
branch for no good reason.

> For the case where there is a notifier registered, I certainly can see
> a difference.  I am not certain how to quantify the difference as it

Agreed.

> When I was discussing this difference with Jack, he reminded me that
> the GRU, due to its hardware, does not have any race issues with the
> invalidate_page callout simply doing the tlb shootdown and not modifying
> any of its internal structures.  He then put a caveat on the discussion
> that _either_ method was acceptable as far as he was concerned.  The real
> issue is getting a patch in that satisfies all needs and not whether
> there is a separate invalidate_page callout.

Sure, we have that patch now; I'll send it out in a minute. I was just
trying to explain why it makes sense to have an invalidate_page too
(which remains the only difference by now); removing it would be a
regression on all sides, even if a minor one.



Re: [kvm-devel] [PATCH 0 of 9] mmu notifier #v12

2008-04-22 Thread Robin Holt
Andrew, could we get direction/guidance from you regarding
the invalidate_page() callout of Andrea's patch set versus the
invalidate_range_start/invalidate_range_end callout pairs of Christoph's
patchset?  This is only in the context of the __xip_unmap, do_wp_page,
page_mkclean_one, and try_to_unmap_one call sites.

On Tue, Apr 22, 2008 at 03:48:47PM +0200, Andrea Arcangeli wrote:
> On Tue, Apr 22, 2008 at 08:36:04AM -0500, Robin Holt wrote:
> > I am a little confused about the value of the seq_lock versus a simple
> > atomic, but I assumed there is a reason and left it at that.
> 
> There's no value for anything but get_user_pages (get_user_pages takes
> its own lock internally though). I preferred to explain it as a
> seqlock because it was simpler for reading, but I totally agree in the
> final implementation it shouldn't be a seqlock. My code was meant to
> be pseudo-code only. It doesn't even need to be atomic ;).

Unless there is additional locking in your fault path, I think it does
need to be atomic.

> > I don't know what you mean by "it'd" run slower and what you mean by
> > "armed and disarmed".
> 
> 1) when armed the time-window where the kvm-page-fault would be
> blocked would be a bit larger without invalidate_page for no good
> reason

But that is a distinction without a difference.  In the _start/_end
case, kvm's fault handler will not have any _DIRECT_ blocking, but
get_user_pages() had certainly better block waiting for some other lock
to prevent the process's pages being refaulted.

I am no VM expert, but that seems critical to having a consistent
virtual address space.  Effectively, you have a delay on the kvm fault
handler beginning when either invalidate_page() is entered or
invalidate_range_start() is entered, and lasting until the _CALLER_ of
the invalidate* method has unlocked.  That time will remain essentially
identical in either case.  I would argue you would be hard pressed to
even measure the difference.

> 2) if you were to remove invalidate_page when disarmed the VM could
> would need two branches instead of one in various places

Those branches are conditional upon there being list entries.  That check
should be extremely cheap.  The vast majority of cases will have no
registered notifiers.  The second check for the _end callout will be
from cpu cache.

> I don't want to waste cycles if not wasting them improves performance
> both when armed and disarmed.

In summary, I think we have narrowed the case of no registered
notifiers down to being infinitesimal, and the case of registered
notifiers to being a distinction without a difference.

> > When I was discussing this difference with Jack, he reminded me that
> > the GRU, due to its hardware, does not have any race issues with the
> > invalidate_page callout simply doing the tlb shootdown and not modifying
> > any of its internal structures.  He then put a caveat on the discussion
> > that _either_ method was acceptable as far as he was concerned.  The real
> > issue is getting a patch in that satisfies all needs and not whether
> there is a separate invalidate_page callout.
> 
> Sure, we have that patch now, I'll send it out in a minute, I was just
> trying to explain why it makes sense to have an invalidate_page too
> (which remains the only difference by now), removing it would be a
> regression on all sides, even if a minor one.

I think GRU is the only compelling case I have heard for having the
invalidate_page callout separate.  In the case of the GRU, the hardware
enforces a lifetime of the invalidate which covers all in-progress
faults, including ones where the hardware is informed after the flush
of a PTE.  In all cases, once the GRU invalidate instruction is issued,
all active requests are invalidated.  Future faults will be blocked in
get_user_pages().  Without that special feature of the hardware, I
don't think any code simplification exists.  I, of course, reserve the
right to be wrong.

I believe the argument against a separate invalidate_page() callout was
Christoph's interpretation of Andrew's comments.  I am not certain Andrew
was aware of these special aspects of the GRU hardware, or whether they
had been factored into the discussion at that point in time.


Thanks,
Robin
