Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-03-04 Thread Christoph Lameter
On Tue, 4 Mar 2008, Nick Piggin wrote:

  Then put it into the arch code for TLB invalidation. Paravirt ops gives
  good examples on how to do that.
 
 Put what into arch code?

The mmu notifier code.

   What about a completely different approach... XPmem runs over NUMAlink,
   right? Why not provide some non-sleeping way to basically IPI remote
   nodes over the NUMAlink where they can process the invalidation? If your
   intra-node cache coherency has to run over this link anyway, then
   presumably it is capable.
 
  There is another Linux instance at the remote end that first has to
  remove its own ptes.
 
 Yeah, what's the problem?

The remote end has to invalidate the page which involves locking etc.

  Also would not work for Infiniband and other 
  solutions.
 
 infiniband doesn't want it. Other solutions is just handwaving,
 because if we don't know what the other solutions are, then we can't
 make any sort of informed choices.

We need a solution in general to avoid the pinning problems. Infiniband 
has those too.

  All the approaches that require evictions in an atomic context 
  are limiting the approach and do not allow the generic functionality that
  we want in order to not add alternate APIs for this.
 
 The only generic way to do this that I have seen (and the only proposed
 way that doesn't add alternate APIs for that matter) is turning VM locks
 into sleeping locks. In which case, Andrea's notifiers will work just
 fine (except for relatively minor details like rcu list scanning).

No they won't. As you pointed out, the callbacks need RCU locking.

  The good enough solution right now is to pin pages by elevating
  refcounts.
 
 Which kind of leads to the question of why do you need any further
 kernel patches if that is good enough?

Well, it's "good enough" apart from severe problems during reclaim, livelocks etc. 
One could improve on that scheme through Rik's work trying to add a new 
page flag that marks pinned pages and then keeping them off the LRUs and 
limiting their number. Having pinned pages would limit the VM's ability to 
reclaim and make page migration, memory unplug etc. impossible. 
It is better to have a notifier scheme that allows telling a device driver 
to free up the memory it has mapped.
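To make that contrast concrete, here is a rough sketch of what such a
driver-side callback could look like (this is not from any of the posted
patches; all mydev_* names are invented for illustration). Instead of
pinning, the driver tears down its own mappings when the VM asks:

/* Hypothetical driver context: mirrors part of a user address space. */
struct mydev_ctx {
	struct mm_struct *mm;		/* address space being mirrored */
	/* ... device page tables, DMA state, etc. ... */
};

/* Called by the VM before ptes in [start, end) are invalidated. */
static void mydev_invalidate_range(struct mydev_ctx *ctx,
				   unsigned long start, unsigned long end)
{
	/* Remove the device's own translations for the range. */
	mydev_unmap_device_range(ctx, start, end);
	/* Wait for the hardware ack so no stale DMA can hit the pages. */
	mydev_wait_for_tlb_ack(ctx);
	/* No page pins are held: reclaim, migration and unplug can proceed. */
}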




Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-03-04 Thread Nick Piggin
On Wednesday 05 March 2008 05:58, Christoph Lameter wrote:
 On Tue, 4 Mar 2008, Nick Piggin wrote:
   Then put it into the arch code for TLB invalidation. Paravirt ops gives
   good examples on how to do that.
 
  Put what into arch code?

 The mmu notifier code.

It isn't arch specific.


What about a completely different approach... XPmem runs over
NUMAlink, right? Why not provide some non-sleeping way to basically
IPI remote nodes over the NUMAlink where they can process the
invalidation? If your intra-node cache coherency has to run over this
link anyway, then presumably it is capable.
  
   There is another Linux instance at the remote end that first has to
   remove its own ptes.
 
  Yeah, what's the problem?

 The remote end has to invalidate the page which involves locking etc.

I don't see what the problem is.


   Also would not work for Infiniband and other
   solutions.
 
  infiniband doesn't want it. Other solutions is just handwaving,
  because if we don't know what the other solutions are, then we can't
  make any sort of informed choices.

 We need a solution in general to avoid the pinning problems. Infiniband
 has those too.

   All the approaches that require evictions in an atomic context
   are limiting the approach and do not allow the generic functionality
   that we want in order to not add alternate APIs for this.
 
  The only generic way to do this that I have seen (and the only proposed
  way that doesn't add alternate APIs for that matter) is turning VM locks
  into sleeping locks. In which case, Andrea's notifiers will work just
  fine (except for relatively minor details like rcu list scanning).

 No they won't. As you pointed out, the callbacks need RCU locking.

That can be fixed easily.


   The good enough solution right now is to pin pages by elevating
   refcounts.
 
  Which kind of leads to the question of why do you need any further
  kernel patches if that is good enough?

 Well, it's "good enough" apart from severe problems during reclaim, livelocks etc.
 One could improve on that scheme through Rik's work trying to add a new
 page flag that marks pinned pages and then keeping them off the LRUs and
 limiting their number. Having pinned pages would limit the VM's ability to
 reclaim and make page migration, memory unplug etc. impossible.

Well not impossible. You could have a callback to invalidate the remote
TLB and drop the pin on a given page.
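A minimal sketch of that per-page callback (hypothetical names, not from
the posted patches): the driver keeps the refcount elevated while the page
is mapped, but the VM can call back to release it on demand:

/* Hypothetical per-page release hook for a pinning driver. */
static void mydrv_release_page(struct mydrv_mapping *m, struct page *page,
			       unsigned long uaddr)
{
	/* Remove the remote/device translation for this user address. */
	mydrv_invalidate_remote_tlb(m, uaddr);
	/* Drop the refcount taken with get_page() when the page was mapped. */
	put_page(page);
	/* The page can now be reclaimed, migrated or offlined. */
}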


 It is better to have a notifier scheme that allows telling a device driver
 to free up the memory it has mapped.

Yeah, it would be nice for those people with clusters of Altixes. Doesn't
mean it has to go upstream, though.




Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-03-03 Thread Christoph Lameter
On Mon, 3 Mar 2008, Nick Piggin wrote:

 Your skeleton is just registering notifiers and saying
 
 /* you fill the hard part in */
 
 If somebody needs a skeleton in order just to register the notifiers,
 then almost by definition they are unqualified to write the hard
 part ;)

It's also providing a locking scheme.

 OK, there are ways to solve it or hack around it. But this is exactly
 why I think the implementations should be kept separate. Andrea's
 notifiers are coherent, work on all types of mappings, and will
 hopefully match closely the regular TLB invalidation sequence in the
 Linux VM (at the moment it is quite close, but I hope to make it a
 bit closer) so that it requires almost no changes to the mm.

Then put it into the arch code for TLB invalidation. Paravirt ops gives 
good examples on how to do that.

 What about a completely different approach... XPmem runs over NUMAlink,
 right? Why not provide some non-sleeping way to basically IPI remote
 nodes over the NUMAlink where they can process the invalidation? If your
 intra-node cache coherency has to run over this link anyway, then
 presumably it is capable.

There is another Linux instance at the remote end that first has to 
remove its own ptes. Also would not work for Infiniband and other 
solutions. All the approaches that require evictions in an atomic context 
are limiting the approach and do not allow the generic functionality that 
we want in order to not add alternate APIs for this.

 Or another idea, why don't you LD_PRELOAD in the MPT library to also
 intercept munmap, mprotect, mremap etc as well as just fork()? That
 would give you similarly good enough coherency as the mmu notifier
 patches except that you can't swap (which Robin said was not a big
 problem).

The good enough solution right now is to pin pages by elevating 
refcounts.




Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-03-02 Thread Nick Piggin
On Thursday 28 February 2008 09:35, Christoph Lameter wrote:
 On Wed, 20 Feb 2008, Nick Piggin wrote:
  On Friday 15 February 2008 17:49, Christoph Lameter wrote:

  Also, what we are going to need here are not skeleton drivers
  that just do all the *easy* bits (of registering their callbacks),
  but actual fully working examples that do everything that any
  real driver will need to do. If not for the sanity of the driver
  writer, then for the sanity of the VM developers (I don't want
  to have to understand xpmem or infiniband in order to understand
  how the VM works).

 There are 3 different drivers that can already use it but the code is
 complex and not easy to review. Skeletons make it easy for people to get
 started with it.

Your skeleton is just registering notifiers and saying

/* you fill the hard part in */

If somebody needs a skeleton in order just to register the notifiers,
then almost by definition they are unqualified to write the hard
part ;)


 lru_add_drain();
 tlb = tlb_gather_mmu(mm, 0);
 update_hiwater_rss(mm);
   + mmu_notifier(invalidate_range_begin, mm, address, end, atomic);
 end = unmap_vmas(tlb, vma, address, end, nr_accounted, details);
 if (tlb)
 tlb_finish_mmu(tlb, address, end);
   + mmu_notifier(invalidate_range_end, mm, address, end, atomic);
 return end;
}
 
  Where do you invalidate for munmap()?

 zap_page_range() called from unmap_vmas().

But it is not allowed to sleep. Where do you call the sleepable one
from?


  Also, how to you resolve the case where you are not allowed to sleep?
  I would have thought either you have to handle it, in which case nobody
  needs to sleep; or you can't handle it, in which case the code is
  broken.

 That can be done in a variety of ways:

 1. Change VM locking

 2. Not handle file backed mappings (XPmem could work mostly in such a
 config)

 3. Keep the refcount elevated until pages are freed in another execution
 context.

OK, there are ways to solve it or hack around it. But this is exactly
why I think the implementations should be kept separate. Andrea's
notifiers are coherent, work on all types of mappings, and will
hopefully match closely the regular TLB invalidation sequence in the
Linux VM (at the moment it is quite close, but I hope to make it a
bit closer) so that it requires almost no changes to the mm.

All the other attempts to make it sleep end up hacking holes
in it (eg by removing coherency). So I don't think it is reasonable to
require that any patch handle all cases. I actually think Andrea's
patch is quite nice and simple itself, whereas I am against the patches
that you posted.

What about a completely different approach... XPmem runs over NUMAlink,
right? Why not provide some non-sleeping way to basically IPI remote
nodes over the NUMAlink where they can process the invalidation? If your
intra-node cache coherency has to run over this link anyway, then
presumably it is capable.

Or another idea, why don't you LD_PRELOAD in the MPT library to also
intercept munmap, mprotect, mremap etc as well as just fork()? That
would give you similarly good enough coherency as the mmu notifier
patches except that you can't swap (which Robin said was not a big
problem).
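For what it's worth, that LD_PRELOAD idea would amount to an interposer
along these lines (illustrative only; mpt_flush_remote_mappings() is a
made-up stand-in for whatever coherency call the MPT library would need):

/* Build as a shared object, e.g.: gcc -shared -fPIC -o shim.so shim.c -ldl */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <sys/mman.h>

extern void mpt_flush_remote_mappings(void *addr, size_t length);

static int (*real_munmap)(void *, size_t);

int munmap(void *addr, size_t length)
{
	if (!real_munmap)
		real_munmap = (int (*)(void *, size_t))
					dlsym(RTLD_NEXT, "munmap");

	/* Tell the remote side to drop its translations first ... */
	mpt_flush_remote_mappings(addr, length);

	/* ... then really unmap the range. */
	return real_munmap(addr, length);
}

mprotect, mremap and the fork hook would get the same treatment.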




Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-02-29 Thread Andrea Arcangeli
On Thu, Feb 28, 2008 at 04:59:59PM -0800, Christoph Lameter wrote:
 And thus the device driver may stop receiving data on a UP system? It will 
 never get the ack.

Not sure to follow, sorry.

My idea was:

   post the invalidate in the mmio region of the device
   smp_call_function()
   while (mmio device wait-bitflag is on);

Instead of the current:

   smp_call_function()
   post the invalidate in the mmio region of the device
   while (mmio device wait-bitflag is on);

To decrease the wait loop time.

 invalidate_page_before/end could be realized as an 
 invalidate_range_begin/end on a page sized range?

If we go this route, once you add support to xpmem, you'll have to
make the anon_vma lock a mutex too, that would be fine with me
though. The main reason invalidate_page exists, is to allow you to
leave it as non-sleep-capable even after you make invalidate_range
sleep capable, and to implement the mmu_rmap_notifiers sleep capable
in all the paths that invalidate_page would be called. That was the
strategy you had in your patch. I'll try to drop invalidate_page. I
wonder if then you won't need the mmu_rmap_notifiers anymore.



Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-02-29 Thread Christoph Lameter
On Fri, 29 Feb 2008, Andrea Arcangeli wrote:

 On Thu, Feb 28, 2008 at 04:59:59PM -0800, Christoph Lameter wrote:
  And thus the device driver may stop receiving data on a UP system? It will 
  never get the ack.
 
 Not sure to follow, sorry.
 
 My idea was:
 
post the invalidate in the mmio region of the device
smp_call_function()
while (mmio device wait-bitflag is on);

So the device driver on UP can only operate through interrupts? If you are 
hogging the only cpu then driver operations may not be possible.

  invalidate_page_before/end could be realized as an 
  invalidate_range_begin/end on a page sized range?
 
 If we go this route, once you add support to xpmem, you'll have to
 make the anon_vma lock a mutex too, that would be fine with me
 though. The main reason invalidate_page exists, is to allow you to
 leave it as non-sleep-capable even after you make invalidate_range
 sleep capable, and to implement the mmu_rmap_notifiers sleep capable
 in all the paths that invalidate_page would be called. That was the
 strategy you had in your patch. I'll try to drop invalidate_page. I
 wonder if then you won't need the mmu_rmap_notifiers anymore.

I am mainly concerned with making the mmu notifier a generally useful 
feature for multiple users. Xpmem is one example of a different user. It 
should be considered as one example of a different type of callback user. 
It is not the gold standard that you make it out to be. RDMA is another and 
there are likely scores of others (DMA engines etc) once it becomes clear 
that such a feature is available. In general the mmu notifier will allow 
us to fix the problems caused by memory pinning and mlock by various 
devices and other mechanisms that need to directly access memory. 

And yes I would like to get rid of the mmu_rmap_notifiers altogether. It 
would be much cleaner with just one mmu_notifier that can sleep in all 
functions.




Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-02-29 Thread Andrea Arcangeli
On Fri, Feb 29, 2008 at 11:55:17AM -0800, Christoph Lameter wrote:
 post the invalidate in the mmio region of the device
 smp_call_function()
 while (mmio device wait-bitflag is on);
 
 So the device driver on UP can only operate through interrupts? If you are 
 hogging the only cpu then driver operations may not be possible.

There was no irq involved in the above pseudocode; the irq, if
anything, would run on the remote system. Still, irqs can run fine
during the while loop, just like they run fine on top of
smp_call_function. The send-irq and the following spin-on-a-bitflag
work exactly like smp_call_function, except that it isn't a NUMA CPU
being invalidated.

 And yes I would like to get rid of the mmu_rmap_notifiers altogether. It 
 would be much cleaner with just one mmu_notifier that can sleep in all 
 functions.

Agreed. I just thought xpmem needed an invalidate-by-page, but
I'm glad if xpmem can go in sync with the KVM/GRU/DRI model in this
regard.



Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-02-29 Thread Christoph Lameter
On Fri, 29 Feb 2008, Andrea Arcangeli wrote:

 Agreed. I just thought xpmem needed an invalidate-by-page, but
 I'm glad if xpmem can go in sync with the KVM/GRU/DRI model in this
 regard.

That means we need both the anon_vma locks and the i_mmap_lock to become 
semaphores. I think semaphores are better than mutexes. Rik and Lee saw 
some performance improvements because the list can be traversed in parallel 
when the anon_vma lock is switched to a rw lock.

Sounds like we get to a conceptually clean version here?
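For reference, the locking shape being discussed is just the rw-semaphore
pattern below (a simplified sketch; the struct is not the real anon_vma
layout, only the read/write split is the point). Rmap walkers take the
lock shared, and may then sleep; the page-fault path takes it exclusive
for a short list_add:

struct demo_anon_vma {
	struct rw_semaphore sem;	/* was a spinlock / rw spinlock */
	struct list_head head;		/* vmas sharing this anon_vma */
};

static void demo_rmap_walk(struct demo_anon_vma *av)
{
	struct vm_area_struct *vma;

	down_read(&av->sem);		/* many walkers can hold this at once */
	list_for_each_entry(vma, &av->head, anon_vma_node) {
		/* ... rmap work; sleeping callouts become legal here ... */
	}
	up_read(&av->sem);
}

static void demo_anon_vma_link(struct demo_anon_vma *av,
			       struct vm_area_struct *vma)
{
	down_write(&av->sem);		/* page-fault path: short hold */
	list_add_tail(&vma->anon_vma_node, &av->head);
	up_write(&av->sem);
}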




Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-02-29 Thread Christoph Lameter
On Fri, 29 Feb 2008, Andrea Arcangeli wrote:

 On Fri, Feb 29, 2008 at 01:03:16PM -0800, Christoph Lameter wrote:
  That means we need both the anon_vma locks and the i_mmap_lock to become 
  semaphores. I think semaphores are better than mutexes. Rik and Lee saw 
  some performance improvements because list can be traversed in parallel 
  when the anon_vma lock is switched to be a rw lock.
 
 The improvement was with a rw spinlock IIRC, so I don't see how it's
 related to this.

AFAICT The rw semaphore fastpath is similar in performance to a rw 
spinlock. 
 
 Perhaps the rwlock spinlock can be changed to a rw semaphore without
 measurable overscheduling in the fast path. However theoretically

Overscheduling? You mean overhead?

 speaking the rw_lock spinlock is more efficient than a rw semaphore in
 case of a little contention during the page fault fast path because
 the critical section is just a list_add so it'd be overkill to
 schedule while waiting. That's why currently it's a spinlock (or rw
 spinlock).

On the other hand a semaphore puts the process to sleep and may actually 
improve performance because there is less time spent in a busy loop. 
Other processes may do something useful and we stay off the contended 
cacheline reducing traffic on the interconnect.
 
 preempt-rt runs quite a bit slower, or we could rip spinlocks out of
 the kernel in the first place ;)

The question is why that is the case, and it seems that there are issues 
with interrupt on/off that are important here and particularly significant 
with the SLAB allocator (significant hacks there to deal with that issue). 
The fastpath that we have in the works for SLUB may address a large 
part of that issue because it no longer relies on disabling interrupts.



Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-02-29 Thread Andrea Arcangeli
On Fri, Feb 29, 2008 at 01:34:34PM -0800, Christoph Lameter wrote:
 On Fri, 29 Feb 2008, Andrea Arcangeli wrote:
 
  On Fri, Feb 29, 2008 at 01:03:16PM -0800, Christoph Lameter wrote:
   That means we need both the anon_vma locks and the i_mmap_lock to become 
   semaphores. I think semaphores are better than mutexes. Rik and Lee saw 
   some performance improvements because list can be traversed in parallel 
   when the anon_vma lock is switched to be a rw lock.
  
  The improvement was with a rw spinlock IIRC, so I don't see how it's
  related to this.
 
 AFAICT The rw semaphore fastpath is similar in performance to a rw 
 spinlock. 

read side is taken in the slow path.

write side is taken in the fast path.

pagefault is fast path, VM during swapping is slow path.

  Perhaps the rwlock spinlock can be changed to a rw semaphore without
  measurable overscheduling in the fast path. However theoretically
 
 Overscheduling? You mean overhead?

The only possible overhead that a rw semaphore could ever generate vs
a rw lock is overscheduling.

  speaking the rw_lock spinlock is more efficient than a rw semaphore in
  case of a little contention during the page fault fast path because
  the critical section is just a list_add so it'd be overkill to
  schedule while waiting. That's why currently it's a spinlock (or rw
  spinlock).
 
 On the other hand a semaphore puts the process to sleep and may actually 
 improve performance because there is less time spent in a busy loop. 
 Other processes may do something useful and we stay off the contended 
 cacheline reducing traffic on the interconnect.

Yes, that's the positive side, the negative side is that you'll put
the task in uninterruptible sleep and call schedule() and require a
wakeup, because a list_add taking 1usec is running on the
other cpu. No other downside. But that's the only reason it's a
spinlock right now, in fact there can't be any other reason.



Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-02-29 Thread Andrea Arcangeli
On Fri, Feb 29, 2008 at 01:03:16PM -0800, Christoph Lameter wrote:
 That means we need both the anon_vma locks and the i_mmap_lock to become 
 semaphores. I think semaphores are better than mutexes. Rik and Lee saw 
 some performance improvements because list can be traversed in parallel 
 when the anon_vma lock is switched to be a rw lock.

The improvement was with a rw spinlock IIRC, so I don't see how it's
related to this.

Perhaps the rwlock spinlock can be changed to a rw semaphore without
measurable overscheduling in the fast path. However theoretically
speaking the rw_lock spinlock is more efficient than a rw semaphore in
case of a little contention during the page fault fast path because
the critical section is just a list_add so it'd be overkill to
schedule while waiting. That's why currently it's a spinlock (or rw
spinlock).

 Sounds like we get to a conceptually clean version here?

I don't have a strong opinion if it should become a semaphore
unconditionally or only with a CONFIG_XPMEM=y. But keep in mind
preempt-rt runs quite a bit slower, or we could rip spinlocks out of
the kernel in the first place ;)



Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-02-29 Thread Andrea Arcangeli
On Fri, Feb 29, 2008 at 02:12:57PM -0800, Christoph Lameter wrote:
 On Fri, 29 Feb 2008, Andrea Arcangeli wrote:
 
   AFAICT The rw semaphore fastpath is similar in performance to a rw 
   spinlock. 
  
  read side is taken in the slow path.
 
 Slowpath meaning VM slowpath or lock slow path? It seems that the rwsem 

With slow path I meant the VM. Sorry if that was confusing given locks
also have fast paths (no contention) and slow paths (contention).

 read side path is pretty efficient:

Yes. The assembly doesn't worry me at all.

  pagefault is fast path, VM during swapping is slow path.
 
 Not sure what you are saying here. A pagefault should be considered as a 
 fast path and swapping is not performance critical?

Yes, swapping is I/O bound and it rarely becomes CPU hog in the common
case.

There are corner case workloads (including OOM) where swapping can
become cpu bound (that's also where rwlock helps). But certainly the
speed of fork() and a page fault, is critical for _everyone_, not just
a few workloads and setups.

 Ok too many calls to schedule() because the slow path (of the semaphore) 
 is taken?

Yes, that's the only possible worry when converting a spinlock to
mutex.

 But that is only happening for the contended case. Certainly a spinlock is 
 better for a 2p system, but the more processors contend for the lock (and 
 the longer the hold off is, typical for the processors with 4p or 8p or 
 more) the better a semaphore will work.

Sure. That's also why the PT lock switches for 4way compiles. Config
option helps to keep the VM optimal for everyone. Here it is possible
it won't be necessary but I can't be sure given both i_mmap_lock and
anon-vma lock are used in so many places. Some TPC comparison would
be nice before making a default switch IMHO.



Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-02-28 Thread Robin Holt
On Thu, Feb 28, 2008 at 01:52:50AM +0100, Andrea Arcangeli wrote:
 On Wed, Feb 27, 2008 at 04:14:08PM -0800, Christoph Lameter wrote:
  Erm. This would also be needed by RDMA etc.
 
 The only RDMA I know is Quadrics, and Quadrics apparently doesn't need
 to schedule inside the invalidate methods AFAIK, so I doubt the above
 is true. It'd be interesting to know if IB is like Quadrics and it
 also doesn't require blocking to invalidate certain remote mappings.

We got an answer from the IB guys already.  They do not track which of
their handles are being used by remote processes so neither approach
will work for their purposes with the exception of straight unmaps.  In
that case, they could use the callout to remove TLB information and rely
on the lack of page table information to kill the user's process.
Without changes to their library spec, I don't believe anything further
is possible.  If they did change their library spec, I believe they
could get things to work the same way that XPMEM has gotten things to
work, where a message is sent to the remote side for TLB clearing and
that will require sleeping.

Thanks,
Robin



Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-02-28 Thread Christoph Lameter
On Thu, 28 Feb 2008, Andrea Arcangeli wrote:

 On Wed, Feb 27, 2008 at 05:03:21PM -0800, Christoph Lameter wrote:
  RDMA works across a network and I would assume that it needs confirmation 
  that a connection has been torn down before pages can be unmapped.
 
 Depends on the latency of the network, for example with page pinning
 it can even try to reduce the wait time, by tearing down the mapping
 in range_begin and spin waiting the ack only later in range_end.

What about invalidate_page()?
 



Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-02-28 Thread Andrea Arcangeli
On Thu, Feb 28, 2008 at 10:43:54AM -0800, Christoph Lameter wrote:
 What about invalidate_page()?

That would just spin waiting an ack (just like the smp-tlb-flushing
invalidates in numa already does).

Thinking more about this, we could also parallelize it with an
invalidate_page_before/end. If it takes 1usec to flush remotely,
scheduling would be overkill, but spending 1usec in a while loop isn't
nice if we can parallelize that 1usec with the ipi-tlb-flush. Not sure
if it makes sense... it certainly would be quick to add it (especially
thanks to _notify ;).
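A sketch of that parallelization (invalidate_page_before/end are only
proposed hook names, they are not in the posted patch; the mmu_notifier()
macro call style follows the patch):

static void example_clear_one_pte(struct vm_area_struct *vma,
				  unsigned long address, pte_t *ptep)
{
	struct mm_struct *mm = vma->vm_mm;

	/* 1. Kick off the remote invalidate, do not wait yet. */
	mmu_notifier(invalidate_page_before, mm, address);

	/* 2. The local pte clear + IPI TLB flush overlaps with the
	 *    remote side processing its invalidate. */
	ptep_clear_flush(vma, address, ptep);

	/* 3. Only now spin for the remote ack; with luck it already arrived. */
	mmu_notifier(invalidate_page_end, mm, address);
}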



Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-02-28 Thread Christoph Lameter
On Fri, 29 Feb 2008, Andrea Arcangeli wrote:

 On Thu, Feb 28, 2008 at 10:43:54AM -0800, Christoph Lameter wrote:
  What about invalidate_page()?
 
 That would just spin waiting an ack (just like the smp-tlb-flushing
 invalidates in numa already does).

And thus the device driver may stop receiving data on a UP system? It will 
never get the ack.
 
 Thinking more about this, we could also parallelize it with an
 invalidate_page_before/end. If it takes 1usec to flush remotely,
 scheduling would be overkill, but spending 1usec in a while loop isn't
 nice if we can parallelize that 1usec with the ipi-tlb-flush. Not sure
 if it makes sense... it certainly would be quick to add it (especially
 thanks to _notify ;).

invalidate_page_before/end could be realized as an 
invalidate_range_begin/end on a page sized range?



Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-02-27 Thread Christoph Lameter
On Tue, 19 Feb 2008, Andrea Arcangeli wrote:

 Yes, that's why I kept maintaining my patch and I posted the last
 revision to Andrew. I use pte/tlb locking of the core VM, it's
 unintrusive and obviously safe. Furthermore it can be extended with
 Christoph's stuff in a 100% backwards compatible fashion later if needed.

How would that work? You rely on the pte locking. Thus calls are all in an 
atomic context. I think we need a general scheme that allows sleeping when 
references are invalidated. Even the GRU has performance issues when using 
the KVM patch.


 



Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-02-27 Thread Christoph Lameter
On Wed, 20 Feb 2008, Nick Piggin wrote:

 On Friday 15 February 2008 17:49, Christoph Lameter wrote:
  The invalidation of address ranges in a mm_struct needs to be
  performed when pages are removed or permissions etc change.
 
  If invalidate_range_begin() is called with locks held then we
  pass a flag into invalidate_range() to indicate that no sleeping is
  possible. Locks are only held for truncate and huge pages.
 
 You can't sleep inside rcu_read_lock()!

Could you be specific? This refers to page migration? Hmmm... Guess we 
would need to inc the refcount there instead?

 I must say that for a patch that is up to v8 or whatever and is
 posted twice a week to such a big cc list, it is kind of slack to
 not even test it and expect other people to review it.

It was tested with the GRU and XPmem. Andrea also reported success.
 
 Also, what we are going to need here are not skeleton drivers
 that just do all the *easy* bits (of registering their callbacks),
 but actual fully working examples that do everything that any
 real driver will need to do. If not for the sanity of the driver
 writer, then for the sanity of the VM developers (I don't want
 to have to understand xpmem or infiniband in order to understand
 how the VM works).

There are 3 different drivers that can already use it but the code is 
complex and not easy to review. Skeletons make it easy for people to get 
started with it.

  lru_add_drain();
  tlb = tlb_gather_mmu(mm, 0);
  update_hiwater_rss(mm);
  +   mmu_notifier(invalidate_range_begin, mm, address, end, atomic);
  end = unmap_vmas(tlb, vma, address, end, nr_accounted, details);
  if (tlb)
  tlb_finish_mmu(tlb, address, end);
  +   mmu_notifier(invalidate_range_end, mm, address, end, atomic);
  return end;
   }
 
 
 Where do you invalidate for munmap()?

zap_page_range() called from unmap_vmas().

 Also, how to you resolve the case where you are not allowed to sleep?
 I would have thought either you have to handle it, in which case nobody
 needs to sleep; or you can't handle it, in which case the code is
 broken.

That can be done in a variety of ways:

1. Change VM locking

2. Not handle file backed mappings (XPmem could work mostly in such a 
config)

3. Keep the refcount elevated until pages are freed in another execution 
context.





Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-02-27 Thread Christoph Lameter
On Wed, 20 Feb 2008, Andrea Arcangeli wrote:

 Well, xpmem requirements are complex. As a side effect of the
 simplicity of my approach, my patch is 100% safe since #v1. Now it
 also works for GRU and it cluster invalidates.

The patch has to satisfy RDMA, XPMEM, GRU and KVM. I keep hearing that we 
have a KVM only solution that works 100% (which makes me just 
ignore the rest of the argument because 100% solutions usually do not 
exist).


 rcu_read_lock), no atomic parameters, and it doesn't open a window
 where sptes have a view on older pages and linux pte has view on newer
 pages (this can happen with remap_file_pages with my KVM swapping
 patch to use V8 Christoph's patch).

Ok so you are now getting away from keeping the refcount elevated? That 
was your design decision


  Also, how to you resolve the case where you are not allowed to sleep?
  I would have thought either you have to handle it, in which case nobody
  needs to sleep; or you can't handle it, in which case the code is
  broken.
 
 I also asked exactly this, glad you reasked this too.

It would have helped if you would have repeated my answers that you had 
already gotten before. You knew I was on vacation




Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-02-27 Thread Jack Steiner
  
  Also, what we are going to need here are not skeleton drivers
  that just do all the *easy* bits (of registering their callbacks),
  but actual fully working examples that do everything that any
  real driver will need to do. If not for the sanity of the driver
  writer, then for the sanity of the VM developers (I don't want
  to have to understand xpmem or infiniband in order to understand
  how the VM works).
 
 There are 3 different drivers that can already use it but the code is 
 complex and not easy to review. Skeletons are easy to allow people to get 
 started with it.


I posted the full GRU driver late last week. It is a lot of
code and somewhat difficult to understand w/o access to full chip
specs (sorry). The code is fairly well commented and the
parts related to TLB management should be understandable.





Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-02-27 Thread Andrea Arcangeli
On Wed, Feb 27, 2008 at 02:23:29PM -0800, Christoph Lameter wrote:
 How would that work? You rely on the pte locking. Thus calls are all in an 

I don't rely on the pte locking in #v7, exactly to satisfy GRU
(so far purely theoretical) performance complaints.

 atomic context. I think we need a general scheme that allows sleeping when 

Calls are still in atomic context until we change the i_mmap_lock to a
mutex under a CONFIG_XPMEM, or unless we boost mm_users, drop the lock
and restart the loop at every different mm. In any case those changes
should be under CONFIG_XPMEM IMHO given desktop users definitely don't
need this (regular non-blocking mmu notifiers in my patch are all that
a desktop user needs as far as I can tell).

 references are invalidated. Even the GRU has performance issues when using 
 the KVM patch.

GRU will perform the same with #v7 or V8.



Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-02-27 Thread Christoph Lameter
On Wed, 27 Feb 2008, Christoph Lameter wrote:

 Could you be specific? This refers to page migration? Hmmm... Guess we 
 would need to inc the refcount there instead?

Argh. No, it's the callback list scanning. Yuck. No one noticed.




Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-02-27 Thread Andrea Arcangeli
On Wed, Feb 27, 2008 at 02:35:59PM -0800, Christoph Lameter wrote:
 Could you be specific? This refers to page migration? Hmmm... Guess we 

If the reader schedules, synchronize_rcu will return on the other
cpu and the objects in the list will be freed and overwritten, and
when the task is scheduled back in, it'll follow dangling pointers...
You can't use RCU if you want any of your invalidate methods to
schedule. Otherwise it's like having zero locking.
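In other words, the notifier call the patch generates is roughly the walk
below (field and op names approximate the posted patch), and the comment
marks exactly where a sleeping callback breaks it:

static void example_notify_range(struct mm_struct *mm,
				 unsigned long start, unsigned long end)
{
	struct mmu_notifier *mn;
	struct hlist_node *pos;

	rcu_read_lock();
	hlist_for_each_entry_rcu(mn, pos, &mm->mmu_notifier_list, hlist) {
		/*
		 * If this callback blocks, this CPU schedules away; a
		 * concurrent unregister plus synchronize_rcu() completes on
		 * another CPU and frees the entry, so resuming the walk
		 * follows a dangling ->next pointer.
		 */
		mn->ops->invalidate_range(mn, mm, start, end);
	}
	rcu_read_unlock();
}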

 2. Not handle file backed mappings (XPmem could work mostly in such a 
 config)

IMHO that fits under your definition of hacking something in now and
then having to modify it later.

 3. Keep the refcount elevated until pages are freed in another execution 
 context.

Page refcount is not enough (the mmu_notifier_release will run on
another cpu the moment after i_mmap_lock is unlocked), but mm_users may
save us from having to change the i_mmap_lock to a mutex; it'll slow down
truncate though, as it'll have to drop the lock and restart the radix tree
walk every time, so a change like this better fits as a separate
CONFIG_XPMEM IMHO.



Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-02-27 Thread Christoph Lameter
On Thu, 28 Feb 2008, Andrea Arcangeli wrote:

  3. Keep the refcount elevated until pages are freed in another execution 
  context.
 
 Page refcount is not enough (the mmu_notifier_release will run in
 another cpu the moment after i_mmap_lock is unlocked) but mm_users may
 prevent us to change the i_mmap_lock to a mutex, but it'll slowdown
 truncate as it'll have to drop the lock and restart the radix tree
 walk every time so a change like this better fits as a separate
 CONFIG_XPMEM IMHO.

Erm. This would also be needed by RDMA etc.

 



Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-02-27 Thread Andrea Arcangeli
On Wed, Feb 27, 2008 at 02:39:46PM -0800, Christoph Lameter wrote:
 On Wed, 20 Feb 2008, Andrea Arcangeli wrote:
 
  Well, xpmem requirements are complex. As a side effect of the
  simplicity of my approach, my patch is 100% safe since #v1. Now it
  also works for GRU and it cluster invalidates.
 
 The patch has to satisfy RDMA, XPMEM, GRU and KVM. I keep hearing that we 
 have a KVM only solution that works 100% (which makes me just 
 ignore the rest of the argument because 100% solutions usually do not 
 exist).

I only said 100% safe, I didn't imply anything other than it won't
crash the kernel ;).

#v6 and #v7 only leave XPMEM out AFAIK, and that can be supported
later with a CONFIG_XPMEM that purely changes some VM locking. #v7
also provides maximum performance to GRU.

  rcu_read_lock), no atomic parameters, and it doesn't open a window
  where sptes have a view on older pages and linux pte has view on newer
  pages (this can happen with remap_file_pages with my KVM swapping
  patch to use V8 Christoph's patch).
 
 Ok so you are now getting away from keeping the refcount elevated? That 
 was your design decision

No, I'm not getting away from it. If I would get away from it, I would
be forced to implement invalidate_range_begin. However even if I don't
get away from it, the fact I only implement invalidate_range_end, and
that's called after the PT lock is dropped, opens a little window with
lost coherency (which may not be detectable by userland anyway). But this
little window is fine for KVM and it doesn't impose any security
risk. But clearly proving the locking safe becomes a bit more complex
in #v7 than in #v6.

 It would have helped if you would have repeated my answers that you had 
 already gotten before. You knew I was on vacation

I didn't remember the BUG_ON crystal clear, sorry, but I'm not sure why you
think it was your call; this was a lowlevel XPMEM question and Robin
promptly answered/reminded about it, in fact.



Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-02-27 Thread Andrea Arcangeli
On Wed, Feb 27, 2008 at 04:14:08PM -0800, Christoph Lameter wrote:
 Erm. This would also be needed by RDMA etc.

The only RDMA I know is Quadrics, and Quadrics apparently doesn't need
to schedule inside the invalidate methods AFAIK, so I doubt the above
is true. It'd be interesting to know if IB is like Quadrics and it
also doesn't require blocking to invalidate certain remote mappings.



Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-02-27 Thread Christoph Lameter
On Thu, 28 Feb 2008, Andrea Arcangeli wrote:

 On Wed, Feb 27, 2008 at 04:14:08PM -0800, Christoph Lameter wrote:
  Erm. This would also be needed by RDMA etc.
 
 The only RDMA I know is Quadrics, and Quadrics apparently doesn't need
 to schedule inside the invalidate methods AFAIK, so I doubt the above
 is true. It'd be interesting to know if IB is like Quadrics and it
 also doesn't require blocking to invalidate certain remote mappings.

RDMA works across a network and I would assume that it needs confirmation 
that a connection has been torn down before pages can be unmapped.
 



Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-02-27 Thread Andrea Arcangeli
On Wed, Feb 27, 2008 at 05:03:21PM -0800, Christoph Lameter wrote:
 RDMA works across a network and I would assume that it needs confirmation 
 that a connection has been torn down before pages can be unmapped.

Depends on the latency of the network, for example with page pinning
it can even try to reduce the wait time, by tearing down the mapping
in range_begin and spin waiting the ack only later in range_end.



Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-02-19 Thread Nick Piggin
On Friday 15 February 2008 17:49, Christoph Lameter wrote:
 The invalidation of address ranges in a mm_struct needs to be
 performed when pages are removed or permissions etc change.

 If invalidate_range_begin() is called with locks held then we
 pass a flag into invalidate_range() to indicate that no sleeping is
 possible. Locks are only held for truncate and huge pages.

 In two cases we use invalidate_range_begin/end to invalidate
 single pages because the pair allows holding off new references
 (idea by Robin Holt).

 do_wp_page(): We hold off new references while we update the pte.

 xip_unmap: We are not taking the PageLock so we cannot
 use the invalidate_page mmu_rmap_notifier. invalidate_range_begin/end
 stands in.

This whole thing would be much better if you didn't rely on the page
lock at all, but either a) used the same locking as Linux does for its
ptes/tlbs, or b) had some locking that is private to the mmu notifier
code. Then there is not all this new stuff that has to be understood in
the core VM.

Also, why do you have to invalidate ranges when switching to a
_more_ permissive state? This stuff should basically be the same as
(a subset of) the TLB flushing API AFAIKS. Anything more is a pretty
big burden to put in the core VM.

See my alternative patch I posted -- I can't see why it won't work
just like a TLB.

As far as sleeping inside callbacks goes... I think there are big
problems with the patch (the sleeping patch and the external rmap
patch). I don't think it is workable in its current state. Either
we have to make some big changes to the core VM, or we have to turn
some locks into sleeping locks to do it properly AFAIKS. Neither
one is good.

But anyway, I don't really think the two approaches (Andrea's
notifiers vs sleeping/xrmap) should be tangled up too much. I
think Andrea's can possibly be quite unintrusive and useful very
soon.




Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-02-19 Thread Andrea Arcangeli
On Tue, Feb 19, 2008 at 07:54:14PM +1100, Nick Piggin wrote:
 As far as sleeping inside callbacks goes... I think there are big
 problems with the patch (the sleeping patch and the external rmap
 patch). I don't think it is workable in its current state. Either
 we have to make some big changes to the core VM, or we have to turn
 some locks into sleeping locks to do it properly AFAIKS. Neither
 one is good.

Agreed.

The thing is quite simple: the moment we support xpmem, the complexity
in the mmu notifier patch starts and there are hacks, duplicated
functionality through the same xpmem callbacks etc... GRU can already
be 100% supported (in fact simpler and safer) with my patch.

 But anyway, I don't really think the two approaches (Andrea's
 notifiers vs sleeping/xrmap) should be tangled up too much. I
 think Andrea's can possibly be quite unintrusive and useful very
 soon.

Yes, that's why I kept maintaining my patch and I posted the last
revision to Andrew. I use pte/tlb locking of the core VM, it's
unintrusive and obviously safe. Furthermore it can be extended with
Christoph's stuff in a 100% backwards compatible fashion later if needed.



Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-02-19 Thread Nick Piggin
On Friday 15 February 2008 17:49, Christoph Lameter wrote:
 The invalidation of address ranges in a mm_struct needs to be
 performed when pages are removed or permissions etc change.

 If invalidate_range_begin() is called with locks held then we
 pass a flag into invalidate_range() to indicate that no sleeping is
 possible. Locks are only held for truncate and huge pages.

You can't sleep inside rcu_read_lock()!

I must say that for a patch that is up to v8 or whatever and is
posted twice a week to such a big cc list, it is kind of slack to
not even test it and expect other people to review it.

Also, what we are going to need here are not skeleton drivers
that just do all the *easy* bits (of registering their callbacks),
but actual fully working examples that do everything that any
real driver will need to do. If not for the sanity of the driver
writer, then for the sanity of the VM developers (I don't want
to have to understand xpmem or infiniband in order to understand
how the VM works).



 In two cases we use invalidate_range_begin/end to invalidate
 single pages because the pair allows holding off new references
 (idea by Robin Holt).

 do_wp_page(): We hold off new references while we update the pte.

 xip_unmap: We are not taking the PageLock so we cannot
 use the invalidate_page mmu_rmap_notifier. invalidate_range_begin/end
 stands in.

 Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]
 Signed-off-by: Robin Holt [EMAIL PROTECTED]
 Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

 ---
  mm/filemap_xip.c |5 +
  mm/fremap.c  |3 +++
  mm/hugetlb.c |3 +++
  mm/memory.c  |   35 +--
  mm/mmap.c|2 ++
  mm/mprotect.c|3 +++
  mm/mremap.c  |7 ++-
  7 files changed, 51 insertions(+), 7 deletions(-)

 Index: linux-2.6/mm/fremap.c
 ===
 --- linux-2.6.orig/mm/fremap.c 2008-02-14 18:43:31.0 -0800
 +++ linux-2.6/mm/fremap.c 2008-02-14 18:45:07.0 -0800
 @@ -15,6 +15,7 @@
  #include <linux/rmap.h>
  #include <linux/module.h>
  #include <linux/syscalls.h>
 +#include <linux/mmu_notifier.h>

  #include <asm/mmu_context.h>
  #include <asm/cacheflush.h>
 @@ -214,7 +215,9 @@ asmlinkage long sys_remap_file_pages(uns
   spin_unlock(&mapping->i_mmap_lock);
   }

 + mmu_notifier(invalidate_range_begin, mm, start, start + size, 0);
   err = populate_range(mm, vma, start, size, pgoff);
 + mmu_notifier(invalidate_range_end, mm, start, start + size, 0);
   if (!err && !(flags & MAP_NONBLOCK)) {
   if (unlikely(has_write_lock)) {
   downgrade_write(&mm->mmap_sem);
 Index: linux-2.6/mm/memory.c
 ===
 --- linux-2.6.orig/mm/memory.c 2008-02-14 18:43:31.0 -0800
 +++ linux-2.6/mm/memory.c 2008-02-14 18:45:07.0 -0800
 @@ -51,6 +51,7 @@
  #include <linux/init.h>
  #include <linux/writeback.h>
  #include <linux/memcontrol.h>
 +#include <linux/mmu_notifier.h>

  #include <asm/pgalloc.h>
  #include <asm/uaccess.h>
 @@ -611,6 +612,9 @@ int copy_page_range(struct mm_struct *ds
   if (is_vm_hugetlb_page(vma))
   return copy_hugetlb_page_range(dst_mm, src_mm, vma);

 + if (is_cow_mapping(vma->vm_flags))
 + mmu_notifier(invalidate_range_begin, src_mm, addr, end, 0);
 +
   dst_pgd = pgd_offset(dst_mm, addr);
   src_pgd = pgd_offset(src_mm, addr);
   do {
 @@ -621,6 +625,11 @@ int copy_page_range(struct mm_struct *ds
   vma, addr, next))
   return -ENOMEM;
   } while (dst_pgd++, src_pgd++, addr = next, addr != end);
 +
 + if (is_cow_mapping(vma->vm_flags))
 + mmu_notifier(invalidate_range_end, src_mm,
 + vma->vm_start, end, 0);
 +
   return 0;
  }

 @@ -893,13 +902,16 @@ unsigned long zap_page_range(struct vm_a
   struct mmu_gather *tlb;
   unsigned long end = address + size;
   unsigned long nr_accounted = 0;
 + int atomic = details ? (details->i_mmap_lock != 0) : 0;

   lru_add_drain();
   tlb = tlb_gather_mmu(mm, 0);
   update_hiwater_rss(mm);
 + mmu_notifier(invalidate_range_begin, mm, address, end, atomic);
   end = unmap_vmas(tlb, vma, address, end, nr_accounted, details);
   if (tlb)
   tlb_finish_mmu(tlb, address, end);
 + mmu_notifier(invalidate_range_end, mm, address, end, atomic);
   return end;
  }


Where do you invalidate for munmap()?

Also, how to you resolve the case where you are not allowed to sleep?
I would have thought either you have to handle it, in which case nobody
needs to sleep; or you can't handle it, in which case the code is
broken.



Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-02-19 Thread Andrea Arcangeli
On Wed, Feb 20, 2008 at 10:08:49AM +1100, Nick Piggin wrote:
 You can't sleep inside rcu_read_lock()!
 
 I must say that for a patch that is up to v8 or whatever and is
 posted twice a week to such a big cc list, it is kind of slack to
 not even test it and expect other people to review it.

Well, xpmem requirements are complex. As a side effect of the
simplicity of my approach, my patch is 100% safe since #v1. Now it
also works for GRU and it cluster invalidates.

 Also, what we are going to need here are not skeleton drivers
 that just do all the *easy* bits (of registering their callbacks),
 but actual fully working examples that do everything that any
 real driver will need to do. If not for the sanity of the driver

I've a fully working scenario for my patch; in fact I didn't post the
mmu notifier patch until I got KVM to swap 100% reliably to be sure I
would post something that works well. mmu notifiers are already used
in KVM for:

1) 100% reliable and efficient swapping of guest physical memory
2) copy-on-writes of writeprotect faults after ksm page sharing of guest
   physical memory
3) ballooning using madvise to give the guest memory back to the host

My implementation is the most handy because it requires zero changes
to the ksm code too (no explicit mmu notifier calls after
ptep_clear_flush) and it's also 100% safe (no mess with schedules over
rcu_read_lock), no atomic parameters, and it doesn't open a window
where sptes have a view on older pages and linux pte has view on newer
pages (this can happen with remap_file_pages with my KVM swapping
patch to use V8 Christoph's patch).

 Also, how to you resolve the case where you are not allowed to sleep?
 I would have thought either you have to handle it, in which case nobody
 needs to sleep; or you can't handle it, in which case the code is
 broken.

I also asked exactly this, glad you reasked this too.



Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-02-19 Thread Robin Holt
On Wed, Feb 20, 2008 at 02:11:41PM +1100, Nick Piggin wrote:
 On Wednesday 20 February 2008 14:00, Robin Holt wrote:
  On Wed, Feb 20, 2008 at 02:00:38AM +0100, Andrea Arcangeli wrote:
   On Wed, Feb 20, 2008 at 10:08:49AM +1100, Nick Piggin wrote:
 
Also, how to you resolve the case where you are not allowed to sleep?
I would have thought either you have to handle it, in which case nobody
needs to sleep; or you can't handle it, in which case the code is
broken.
  
   I also asked exactly this, glad you reasked this too.
 
  Currently, we BUG_ON having a PFN in our tables and not being able
  to sleep.  These are mappings which MPT has never supported in the past
  and XPMEM was already not allowing page faults for VMAs which are not
  anonymous so it should never happen.  If the file-backed operations can
  ever get changed to allow for sleeping and a customer has a need for it,
  we would need to change XPMEM to allow those types of faults to succeed.
 
 Do you really want to be able to swap, or are you just interested
 in keeping track of unmaps / prot changes?

I would rather not swap, but we do have one customer that would like
swapout to work for certain circumstances.  Additionally, we have
many customers that would rather that their system not die under I/O
termination.

Thanks,
Robin



Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-02-16 Thread Christoph Lameter
On Fri, 15 Feb 2008, Andrew Morton wrote:

 On Thu, 14 Feb 2008 22:49:01 -0800 Christoph Lameter [EMAIL PROTECTED] 
 wrote:
 
  The invalidation of address ranges in a mm_struct needs to be
  performed when pages are removed or permissions etc change.
 
 hm.  Do they?  Why?  If I'm in the process of zero-copy writing a hunk of
 memory out to hardware then do I care if someone write-protects the ptes?
 
 Spose so, but some fleshing-out of the various scenarios here would clarify
 things.

You care f.e. if the VM needs to write-protect a memory range and a write 
occurs. In that case the VM needs to do proper write processing, and a write 
through an external pte would cause memory corruption.

  If invalidate_range_begin() is called with locks held then we
  pass a flag into invalidate_range() to indicate that no sleeping is
  possible. Locks are only held for truncate and huge pages.
 
 This is so bad.

Ok so I can twiddle around with the inode_mmap_lock to drop it while this 
is called?

  In two cases we use invalidate_range_begin/end to invalidate
  single pages because the pair allows holding off new references
  (idea by Robin Holt).
 
 Assuming that there is a missing "within the range" in this description, I
 assume that all clients will just throw up their hands in horror and will
 disallow all references to all parts of the mm.

Right. Missing "within the range". We only need to disallow creating new 
ptes, right? Why disallow references?
 

  xip_unmap: We are not taking the PageLock so we cannot
  use the invalidate_page mmu_rmap_notifier. invalidate_range_begin/end
  stands in.
 
 What does stands in mean?

Use a range begin / end to invalidate a page.

  +   mmu_notifier(invalidate_range_begin, mm, start, start + size, 0);
  err = populate_range(mm, vma, start, size, pgoff);
  +   mmu_notifier(invalidate_range_end, mm, start, start + size, 0);
 
 To avoid off-by-one confusion the changelogs, documentation and comments
 should be very careful to tell the reader whether the range includes the
 byte at start+size.  I don't think that was done?

No it was not. I assumed that the convention is always start to (end - 1), 
and the byte at end is not affected by the operation.
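
Stated as code, the convention assumed here (illustrative only, using the
notifier macro from the posted patches):

	/* the callout covers [start, end): the byte at end itself is untouched */
	mmu_notifier(invalidate_range_begin, mm, addr, addr + PAGE_SIZE, 0);
	/* ... clear or replace the pte for the single page at addr ... */
	mmu_notifier(invalidate_range_end, mm, addr, addr + PAGE_SIZE, 0);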




Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-02-15 Thread Andrew Morton
On Thu, 14 Feb 2008 22:49:01 -0800 Christoph Lameter [EMAIL PROTECTED] wrote:

 The invalidation of address ranges in a mm_struct needs to be
 performed when pages are removed or permissions etc change.

hm.  Do they?  Why?  If I'm in the process of zero-copy writing a hunk of
memory out to hardware then do I care if someone write-protects the ptes?

Spose so, but some fleshing-out of the various scenarios here would clarify
things.

 If invalidate_range_begin() is called with locks held then we
 pass a flag into invalidate_range() to indicate that no sleeping is
 possible. Locks are only held for truncate and huge pages.

This is so bad.

I supposed in the restricted couple of cases which you're focussed on it
works OK.  But is it generally suitable?  What if IO is in progress?  What
if other cluster nodes need to be talked to?  Does it suit RDMA?

 In two cases we use invalidate_range_begin/end to invalidate
 single pages because the pair allows holding off new references
 (idea by Robin Holt).

Assuming that there is a missing "within the range" in this description, I
assume that all clients will just throw up their hands in horror and will
disallow all references to all parts of the mm.

Of course, to do that they will need to take a sleeping lock to prevent
other threads from establishing new references.  whoops.

 do_wp_page(): We hold off new references while we update the pte.
 
 xip_unmap: We are not taking the PageLock so we cannot
 use the invalidate_page mmu_rmap_notifier. invalidate_range_begin/end
 stands in.

What does "stands in" mean?

 Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]
 Signed-off-by: Robin Holt [EMAIL PROTECTED]
 Signed-off-by: Christoph Lameter [EMAIL PROTECTED]
 
 ---
  mm/filemap_xip.c |5 +
  mm/fremap.c  |3 +++
  mm/hugetlb.c |3 +++
  mm/memory.c  |   35 +--
  mm/mmap.c|2 ++
  mm/mprotect.c|3 +++
  mm/mremap.c  |7 ++-
  7 files changed, 51 insertions(+), 7 deletions(-)
 
 Index: linux-2.6/mm/fremap.c
 ===
 --- linux-2.6.orig/mm/fremap.c2008-02-14 18:43:31.0 -0800
 +++ linux-2.6/mm/fremap.c 2008-02-14 18:45:07.0 -0800
 @@ -15,6 +15,7 @@
  #include linux/rmap.h
  #include linux/module.h
  #include linux/syscalls.h
 +#include linux/mmu_notifier.h
  
  #include asm/mmu_context.h
  #include asm/cacheflush.h
 @@ -214,7 +215,9 @@ asmlinkage long sys_remap_file_pages(uns
   	spin_unlock(&mapping->i_mmap_lock);
   }
  
 + mmu_notifier(invalidate_range_begin, mm, start, start + size, 0);
   err = populate_range(mm, vma, start, size, pgoff);
 + mmu_notifier(invalidate_range_end, mm, start, start + size, 0);

To avoid off-by-one confusion the changelogs, documentation and comments
should be very careful to tell the reader whether the range includes the
byte at start+size.  I don't think that was done?





[kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-02-14 Thread Christoph Lameter
The invalidation of address ranges in a mm_struct needs to be
performed when pages are removed or permissions etc change.

If invalidate_range_begin() is called with locks held then we
pass a flag into invalidate_range() to indicate that no sleeping is
possible. Locks are only held for truncate and huge pages.

In two cases we use invalidate_range_begin/end to invalidate
single pages because the pair allows holding off new references
(idea by Robin Holt).

do_wp_page(): We hold off new references while we update the pte.

xip_unmap: We are not taking the PageLock so we cannot
use the invalidate_page mmu_rmap_notifier. invalidate_range_begin/end
stands in.

Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]
Signed-off-by: Robin Holt [EMAIL PROTECTED]
Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 mm/filemap_xip.c |5 +
 mm/fremap.c  |3 +++
 mm/hugetlb.c |3 +++
 mm/memory.c  |   35 +--
 mm/mmap.c|2 ++
 mm/mprotect.c|3 +++
 mm/mremap.c  |7 ++-
 7 files changed, 51 insertions(+), 7 deletions(-)

Index: linux-2.6/mm/fremap.c
===
--- linux-2.6.orig/mm/fremap.c  2008-02-14 18:43:31.0 -0800
+++ linux-2.6/mm/fremap.c   2008-02-14 18:45:07.0 -0800
@@ -15,6 +15,7 @@
 #include linux/rmap.h
 #include linux/module.h
 #include linux/syscalls.h
+#include linux/mmu_notifier.h
 
 #include asm/mmu_context.h
 #include asm/cacheflush.h
@@ -214,7 +215,9 @@ asmlinkage long sys_remap_file_pages(uns
	spin_unlock(&mapping->i_mmap_lock);
}
 
+   mmu_notifier(invalidate_range_begin, mm, start, start + size, 0);
err = populate_range(mm, vma, start, size, pgoff);
+   mmu_notifier(invalidate_range_end, mm, start, start + size, 0);
	if (!err && !(flags & MAP_NONBLOCK)) {
if (unlikely(has_write_lock)) {
			downgrade_write(&mm->mmap_sem);
Index: linux-2.6/mm/memory.c
===
--- linux-2.6.orig/mm/memory.c  2008-02-14 18:43:31.0 -0800
+++ linux-2.6/mm/memory.c   2008-02-14 18:45:07.0 -0800
@@ -51,6 +51,7 @@
 #include linux/init.h
 #include linux/writeback.h
 #include linux/memcontrol.h
+#include linux/mmu_notifier.h
 
 #include asm/pgalloc.h
 #include asm/uaccess.h
@@ -611,6 +612,9 @@ int copy_page_range(struct mm_struct *ds
if (is_vm_hugetlb_page(vma))
return copy_hugetlb_page_range(dst_mm, src_mm, vma);
 
+	if (is_cow_mapping(vma->vm_flags))
+   mmu_notifier(invalidate_range_begin, src_mm, addr, end, 0);
+
dst_pgd = pgd_offset(dst_mm, addr);
src_pgd = pgd_offset(src_mm, addr);
do {
@@ -621,6 +625,11 @@ int copy_page_range(struct mm_struct *ds
vma, addr, next))
return -ENOMEM;
} while (dst_pgd++, src_pgd++, addr = next, addr != end);
+
+	if (is_cow_mapping(vma->vm_flags))
+		mmu_notifier(invalidate_range_end, src_mm,
+				vma->vm_start, end, 0);
+
return 0;
 }
 
@@ -893,13 +902,16 @@ unsigned long zap_page_range(struct vm_a
struct mmu_gather *tlb;
unsigned long end = address + size;
unsigned long nr_accounted = 0;
+	int atomic = details ? (details->i_mmap_lock != 0) : 0;
 
lru_add_drain();
tlb = tlb_gather_mmu(mm, 0);
update_hiwater_rss(mm);
+   mmu_notifier(invalidate_range_begin, mm, address, end, atomic);
end = unmap_vmas(tlb, vma, address, end, nr_accounted, details);
if (tlb)
tlb_finish_mmu(tlb, address, end);
+   mmu_notifier(invalidate_range_end, mm, address, end, atomic);
return end;
 }
 
@@ -1339,7 +1351,7 @@ int remap_pfn_range(struct vm_area_struc
 {
pgd_t *pgd;
unsigned long next;
-   unsigned long end = addr + PAGE_ALIGN(size);
+   unsigned long start = addr, end = addr + PAGE_ALIGN(size);
	struct mm_struct *mm = vma->vm_mm;
int err;
 
@@ -1373,6 +1385,7 @@ int remap_pfn_range(struct vm_area_struc
	pfn -= addr >> PAGE_SHIFT;
pgd = pgd_offset(mm, addr);
flush_cache_range(vma, addr, end);
+   mmu_notifier(invalidate_range_begin, mm, start, end, 0);
do {
next = pgd_addr_end(addr, end);
err = remap_pud_range(mm, pgd, addr, next,
@@ -1380,6 +1393,7 @@ int remap_pfn_range(struct vm_area_struc
if (err)
break;
} while (pgd++, addr = next, addr != end);
+   mmu_notifier(invalidate_range_end, mm, start, end, 0);
return err;
 }
 EXPORT_SYMBOL(remap_pfn_range);
@@ -1463,10 +1477,11 @@ int apply_to_page_range(struct mm_struct
 {
pgd_t *pgd;
unsigned long next;
-   unsigned long end = addr + size;
+   unsigned long 

[kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-02-08 Thread Christoph Lameter
The invalidation of address ranges in a mm_struct needs to be
performed when pages are removed or permissions etc change.

If invalidate_range_begin() is called with locks held then we
pass a flag into invalidate_range() to indicate that no sleeping is
possible. Locks are only held for truncate and huge pages.

In two cases we use invalidate_range_begin/end to invalidate
single pages because the pair allows holding off new references
(idea by Robin Holt).

do_wp_page(): We hold off new references while we update the pte.

xip_unmap: We are not taking the PageLock so we cannot
use the invalidate_page mmu_rmap_notifier. invalidate_range_begin/end
stands in.

Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]
Signed-off-by: Robin Holt [EMAIL PROTECTED]
Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 mm/filemap_xip.c |5 +
 mm/fremap.c  |3 +++
 mm/hugetlb.c |3 +++
 mm/memory.c  |   35 +--
 mm/mmap.c|2 ++
 mm/mprotect.c|3 +++
 mm/mremap.c  |7 ++-
 7 files changed, 51 insertions(+), 7 deletions(-)

Index: linux-2.6/mm/fremap.c
===
--- linux-2.6.orig/mm/fremap.c  2008-02-08 13:18:58.0 -0800
+++ linux-2.6/mm/fremap.c   2008-02-08 13:25:22.0 -0800
@@ -15,6 +15,7 @@
 #include linux/rmap.h
 #include linux/module.h
 #include linux/syscalls.h
+#include linux/mmu_notifier.h
 
 #include asm/mmu_context.h
 #include asm/cacheflush.h
@@ -214,7 +215,9 @@ asmlinkage long sys_remap_file_pages(uns
	spin_unlock(&mapping->i_mmap_lock);
}
 
+   mmu_notifier(invalidate_range_begin, mm, start, start + size, 0);
err = populate_range(mm, vma, start, size, pgoff);
+   mmu_notifier(invalidate_range_end, mm, start, start + size, 0);
	if (!err && !(flags & MAP_NONBLOCK)) {
if (unlikely(has_write_lock)) {
			downgrade_write(&mm->mmap_sem);
Index: linux-2.6/mm/memory.c
===
--- linux-2.6.orig/mm/memory.c  2008-02-08 13:22:14.0 -0800
+++ linux-2.6/mm/memory.c   2008-02-08 13:25:22.0 -0800
@@ -51,6 +51,7 @@
 #include linux/init.h
 #include linux/writeback.h
 #include linux/memcontrol.h
+#include linux/mmu_notifier.h
 
 #include asm/pgalloc.h
 #include asm/uaccess.h
@@ -611,6 +612,9 @@ int copy_page_range(struct mm_struct *ds
if (is_vm_hugetlb_page(vma))
return copy_hugetlb_page_range(dst_mm, src_mm, vma);
 
+	if (is_cow_mapping(vma->vm_flags))
+   mmu_notifier(invalidate_range_begin, src_mm, addr, end, 0);
+
dst_pgd = pgd_offset(dst_mm, addr);
src_pgd = pgd_offset(src_mm, addr);
do {
@@ -621,6 +625,11 @@ int copy_page_range(struct mm_struct *ds
vma, addr, next))
return -ENOMEM;
} while (dst_pgd++, src_pgd++, addr = next, addr != end);
+
+	if (is_cow_mapping(vma->vm_flags))
+		mmu_notifier(invalidate_range_end, src_mm,
+				vma->vm_start, end, 0);
+
return 0;
 }
 
@@ -893,13 +902,16 @@ unsigned long zap_page_range(struct vm_a
struct mmu_gather *tlb;
unsigned long end = address + size;
unsigned long nr_accounted = 0;
+	int atomic = details ? (details->i_mmap_lock != 0) : 0;
 
lru_add_drain();
tlb = tlb_gather_mmu(mm, 0);
update_hiwater_rss(mm);
+   mmu_notifier(invalidate_range_begin, mm, address, end, atomic);
end = unmap_vmas(tlb, vma, address, end, nr_accounted, details);
if (tlb)
tlb_finish_mmu(tlb, address, end);
+   mmu_notifier(invalidate_range_end, mm, address, end, atomic);
return end;
 }
 
@@ -1337,7 +1349,7 @@ int remap_pfn_range(struct vm_area_struc
 {
pgd_t *pgd;
unsigned long next;
-   unsigned long end = addr + PAGE_ALIGN(size);
+   unsigned long start = addr, end = addr + PAGE_ALIGN(size);
	struct mm_struct *mm = vma->vm_mm;
int err;
 
@@ -1371,6 +1383,7 @@ int remap_pfn_range(struct vm_area_struc
	pfn -= addr >> PAGE_SHIFT;
pgd = pgd_offset(mm, addr);
flush_cache_range(vma, addr, end);
+   mmu_notifier(invalidate_range_begin, mm, start, end, 0);
do {
next = pgd_addr_end(addr, end);
err = remap_pud_range(mm, pgd, addr, next,
@@ -1378,6 +1391,7 @@ int remap_pfn_range(struct vm_area_struc
if (err)
break;
} while (pgd++, addr = next, addr != end);
+   mmu_notifier(invalidate_range_end, mm, start, end, 0);
return err;
 }
 EXPORT_SYMBOL(remap_pfn_range);
@@ -1461,10 +1475,11 @@ int apply_to_page_range(struct mm_struct
 {
pgd_t *pgd;
unsigned long next;
-   unsigned long end = addr + size;
+   unsigned long 

Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-01-31 Thread Andrea Arcangeli
On Wed, Jan 30, 2008 at 05:46:21PM -0800, Christoph Lameter wrote:
 Well the GRU uses follow_page() instead of get_user_pages. Performance is 
 a major issue for the GRU. 

GRU is an external TLB, we have to allocate RAM instead but we do it
through the regular userland paging mechanism. Performance is a major
issue for kvm too, but the result of get_user_pages is used to fill a
spte, so then the cpu will use the spte in hardware to fill its
tlb, we won't have to keep calling follow_page in software to fill the
tlb like GRU has to do, so you can imagine the difference in cpu
utilization spent in those paths (plus our requirement to allocate
memory).

 H.. Could we go to a scheme where we do not have to increase the page 
 count? Modifications of the page struct require dirtying a cache line and 

I doubt the atomic_inc is measurable given the rest of overhead like
building the rmap for each new spte.

There's no technical reason for not wanting proper reference counting
other than microoptimization. What will work for GRU will work for KVM
too regardless of whatever reference counting. Each mmu-notifier user
should be free to do what it think it's better/safer or more
convenient (and for anybody calling get_user_pages having the
refcounting on external references is natural and zero additional
cost).

 it seems that we do not need an increased page count if we have an
 invalidate_range_start() that clears all the external references 
 and stops the establishment of new ones and invalidate_range_end() that 
 reenables new external references?
 
 Then we do not need the frequent invalidate_page() calls.

The increased page count is _mandatory_ to safely use range_start/end
called outside the locks with _end called after releasing the old
page. sptes will build themselves the whole time until the pte_clear is
called on the main linux pte. We don't want to clutter the VM fast
paths with additional locks to stop the kvm pagefault while the VM is
in the _range_start/end critical section like xpmem has to do to be
safe. So you're contradicting yourself by suggesting not to use
invalidate_page and not to use an increased page count at the same
time. And I need invalidate_page anyway for rmap.c which can't be
provided as an invalidate_range and it can't sleep either.
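
A rough sketch of the pin-based scheme described above, with invented helper
names (only get_user_pages() and put_page() are real kernel calls); it is
meant to show where the reference is taken and where it can safely be
dropped, not to be a complete driver:

	/* secondary/external fault path: take the pin while building the spte */
	ret = get_user_pages(current, mm, address, 1, write, 0, &page, NULL);
	if (ret == 1)
		make_external_pte(ctx, address, page);	/* keeps the reference */

	/* invalidate_range (or range_end) callback: the external ptes are
	 * gone, so the pins taken above can finally be released */
	for_each_external_pte(ctx, start, end, address, page) {
		clear_external_pte(ctx, address);
		put_page(page);
	}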



Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-01-30 Thread Andrea Arcangeli
On Tue, Jan 29, 2008 at 06:28:05PM -0600, Jack Steiner wrote:
 On Tue, Jan 29, 2008 at 04:20:50PM -0800, Christoph Lameter wrote:
  On Wed, 30 Jan 2008, Andrea Arcangeli wrote:
  
invalidate_range after populate allows access to memory for which ptes 
were zapped and the refcount was released.
   
   The last refcount is released by the invalidate_range itself.
  
  That is true for your implementation and to address Robin's issues. Jack: 
  Is that true for the GRU?
 
 I'm not sure I understand the question. The GRU never (currently) takes
 a reference on a page. It has no mechanism for tracking pages that
 were exported to the external TLBs.

If you don't have a pin, then things like invalidate_range in
remap_file_pages can't be safe as writes through the external TLBs can
keep going on pages in the freelist. For you to be safe w/o a
page-pin, you need to return in the direction of invalidate_page
inside ptep_clear_flush (or anyway before
page_cache_release/__free_page/put_page...). You're generally not safe
with any invalidate_range that may run after the page pointed by the
pte has been freed (or can be freed by the VM anytime because of being
unpinned cache).



Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-01-30 Thread Robin Holt
 Robin, if you don't mind, could you please post or upload somewhere
 your GPLv2 code that registers itself in Christoph's V2 notifiers? Or
 is it top secret? I wouldn't mind to have a look so I can better
 understand what's the exact reason you're sleeping besides attempting
 GFP_KERNEL allocations. Thanks!

Dean is still actively working on updating the xpmem patch posted
here a few months ago reworked for the mmu_notifiers.  I am sure
we can give you a early look, but it is in a really rough state.

http://marc.info/?l=linux-mm&w=2&r=1&s=xpmem&q=t

The need to sleep comes from the fact that these PFNs are sent to other
hosts on the same NUMA fabric which have direct access to the pages
and then placed into remote process's page tables and then filled into
their TLBs.  Our only means of communicating the recall is async.

I think I need to straighten this discussion out in my head a little bit.
Am I correct in assuming Andrea's original patch set did not have any SMP
race conditions for KVM?  If so, then we need to start looking at how to
implement Christoph's and my changes in a safe fashion.  Andrea, I agree
completely that our introduction of the range callouts has introduced
SMP races.

The three issues we need to simultaneously solve is revoking the remote
page table/tlb information while still in a sleepable context and not
having the remote faulters become out of sync with the granting process.
Currently, I don't see a way to do that cleanly with a single callout.

Could we consider doing a range-based recall and lock callout before
clearing the process's page tables/TLBs, then use the _page or _range
callouts from Andrea's patch to clear the mappings, and finally make a
range-based unlock callout.  The mmu_notifier user would usually use ops
for either the recall+lock/unlock family of callouts or the _page/_range
family of callouts.
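
A sketch of the two callout families being described here, with names
invented for the example (elsewhere in the thread Robin suggests
invalidate_and_lock_range()/unlock_range()); nothing like this exists in the
posted patches:

struct mmu_notifier_ops {
	/* sleepable family: recall remote ptes and hold off new faults ... */
	void (*invalidate_and_lock_range)(struct mmu_notifier *mn,
					  struct mm_struct *mm,
					  unsigned long start,
					  unsigned long end);
	/* ... then let remote faulting resume */
	void (*unlock_range)(struct mmu_notifier *mn, struct mm_struct *mm,
			     unsigned long start, unsigned long end);

	/* per-pte family from Andrea's patch, callable in atomic context */
	void (*invalidate_page)(struct mmu_notifier *mn, struct mm_struct *mm,
				unsigned long address);
	void (*invalidate_range)(struct mmu_notifier *mn, struct mm_struct *mm,
				 unsigned long start, unsigned long end);
};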

Thanks,
Robin



Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-01-30 Thread Andrea Arcangeli
On Wed, Jan 30, 2008 at 10:11:24AM -0600, Robin Holt wrote:
  Robin, if you don't mind, could you please post or upload somewhere
  your GPLv2 code that registers itself in Christoph's V2 notifiers? Or
  is it top secret? I wouldn't mind to have a look so I can better
  understand what's the exact reason you're sleeping besides attempting
  GFP_KERNEL allocations. Thanks!
 
 Dean is still actively working on updating the xpmem patch posted
 here a few months ago reworked for the mmu_notifiers.  I am sure
 we can give you a early look, but it is in a really rough state.
 
 http://marc.info/?l=linux-mmw=2r=1s=xpmemq=t
 
 The need to sleep comes from the fact that these PFNs are sent to other
 hosts on the same NUMA fabric which have direct access to the pages
 and then placed into remote process's page tables and then filled into
 their TLBs.  Our only means of communicating the recall is async.
 
 I think I need to straighten this discussion out in my head a little bit.
 Am I correct in assuming Andrea's original patch set did not have any SMP
 race conditions for KVM?  If so, then we need to start looking at how to

Yes my last patch was SMP safe, stable and feature complete for KVM. I
tested it for 1 week on my smp workstation with real desktop load and
everything loaded, with 3G non-linux guest running on 2G of ram.

Now for whatever reason I adapted the KVM side to Christoph's V2/V3
and it hangs the moment it hits swap. However in the meantime I
changed test hardware, upgraded host to 2.6.24-hg, and upgraded kvm
kernel and userland. all patches applied cleanly (with a minor nit in
a .h include in V2 on top of current git). Swapping of regular tasks
on the test system is 100% solid or I wouldn't even wasting time
mentioning this. By code inspection I didn't expect a stability
regression or I wouldn't have chanced all variables at the same time
(taking the opportunity to move everything to bleeding edge while
moving to V2 turned out to be a bad idea). I already audited the mmu
notifiers a few times, infact I already went back to call
invalidate_page and age_page inside ptep_clear_flush/young in case the
page-pin wasn't enough to prevent the page to change under the sptes,
as I thought yesterday.

Christoph's V3 notably still misses the needed range flushes in mremap
for example, but that's not my problem.  (Jack instead will certainly
kernel crash due to the missing invalidate_page after ptep_clear_flush
in mremap, such an invalidate_page wasn't missing with my last patch)

I'm now going to run the same binaries that still are stable on my
workstation on the test system too, to rule out timings and hardware
differences.

 implement Christoph's and my changes in a safe fashion.  Andrea, I agree
 complete that our introduction of the range callouts have introduced
 SMP races.

I think for KVM basic swapping both V2 and V3 should be safe. V2 had
race conditions that would later break KSM yes, I fixed it and V3
should be already ok and I'm not testing KSM. This is all thanks to the
pin of the page in get_user_page that KVM does for every page mapped
in any spte.

 The three issues we need to simultaneously solve is revoking the remote
 page table/tlb information while still in a sleepable context and not
 having the remote faulters become out of sync with the granting process.
 Currently, I don't see a way to do that cleanly with a single callout.

Agreed.

 Could we consider doing a range-based recall and lock callout before
 clearing the processes page tables/TLBs, then use the _page or _range
 callouts from Andrea's patch to clear the mappings,  finally make a
 range-based unlock callout.  The mmu_notifier user would usually use ops
 for either the recall+lock/unlock family of callouts or the _page/_range
 family of callouts.

invalidate_page/age_page can return inside ptep_clear_flush/young and
Jack will need that too. Infact Jack will need an invalidate_page also
inside ptep_get_and_clear. And the range callout will be done always
in a sleeping context and it'll rely on the page-pin to be safe (when
details->i_mmap_lock != NULL, invalidate_range shouldn't be called
inside zap_page_range but before returning from
unmap_mapping_range_vma, before the cond_resched). This will make
everything a bit simpler and less prone to breakage IMHO, plus it'll
have a chance to work for Jack w/o page-pin without additional
cluttering of mm/*.c.
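
The placement being argued for here, sketched as a wrapper; in Andrea's
patches the callout is hooked from ptep_clear_flush itself, so the wrapper
form and the name below are illustrative only:

static inline pte_t ptep_clear_flush_notify(struct vm_area_struct *vma,
					    unsigned long address,
					    pte_t *ptep)
{
	pte_t pte = ptep_clear_flush(vma, address, ptep);

	/* the linux pte is gone and its TLB flushed; zap the external pte
	 * now, while the caller still holds its reference on the page */
	mmu_notifier(invalidate_page, vma->vm_mm, address);
	return pte;
}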



Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-01-30 Thread Robin Holt
On Wed, Jan 30, 2008 at 06:04:52PM +0100, Andrea Arcangeli wrote:
 On Wed, Jan 30, 2008 at 10:11:24AM -0600, Robin Holt wrote:
...
  The three issues we need to simultaneously solve is revoking the remote
  page table/tlb information while still in a sleepable context and not
  having the remote faulters become out of sync with the granting process.
...
  Could we consider doing a range-based recall and lock callout before
  clearing the processes page tables/TLBs, then use the _page or _range
  callouts from Andrea's patch to clear the mappings,  finally make a
  range-based unlock callout.  The mmu_notifier user would usually use ops
  for either the recall+lock/unlock family of callouts or the _page/_range
  family of callouts.
 
 invalidate_page/age_page can return inside ptep_clear_flush/young and
 Jack will need that too. Infact Jack will need an invalidate_page also
 inside ptep_get_and_clear. And the range callout will be done always
 in a sleeping context and it'll rely on the page-pin to be safe (when
 details->i_mmap_lock != NULL, invalidate_range shouldn't be called
 inside zap_page_range but before returning from
 unmap_mapping_range_vma, before the cond_resched). This will make
 everything a bit simpler and less prone to breakage IMHO, plus it'll
 have a chance to work for Jack w/o page-pin without additional
 cluttering of mm/*.c.

I don't think I saw the answer to my original question.  I assume your
original patch, extended in a way similar to what Christoph has done,
can be made to work to cover both the KVM and GRU (Jack's) case.

XPMEM, however, does not look to be solvable due to the three simultaneous
issues above.  To address that, I think I am coming to the conclusion
that we need an accompanying but seperate pair of callouts.  The first
will ensure the remote page tables and TLBs are cleared and all page
information is returned back to the process that is granting access to
its address space.  That will include an implicit block on the address
range so no further faults will be satisfied by the remote accessor
(forgot the KVM name for this, sorry).  Any faults will be held off
and only the processes page tables/TLBs are in play.  Once the normal
processing of the kernel is complete, an unlock callout would be made
for the range and then faulting may occur on behalf of the process again.

Currently, this is the only direct solution that I can see as a
possibility.  My question is two fold.  Does this seem like a reasonable
means to solve the three simultaneous issues above and if so, does it
seem like the most reasonable means?

Thanks,
Robin



Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-01-30 Thread Christoph Lameter
On Wed, 30 Jan 2008, Robin Holt wrote:

 I think I need to straighten this discussion out in my head a little bit.
 Am I correct in assuming Andrea's original patch set did not have any SMP
 race conditions for KVM?  If so, then we need to start looking at how to
 implement Christoph's and my changes in a safe fashion.  Andrea, I agree
 complete that our introduction of the range callouts have introduced
 SMP races.

The original patch drew the clearing of the sptes into ptep_clear_flush(). 
So the invalidate_page was called for each page regardless of whether we had been 
doing an invalidate range before or not. It seems that the 
invalidate_range() was just there for optimization.
 
 The three issues we need to simultaneously solve is revoking the remote
 page table/tlb information while still in a sleepable context and not
 having the remote faulters become out of sync with the granting process.
 Currently, I don't see a way to do that cleanly with a single callout.

You could use the invalidate_page callouts to set a flag that no 
additional rmap entries may be added until the invalidate_range has 
occurred? We could add back all the original invalidate_pages() and pass
a flag that specifies that an invalidate range will follow. The notifier 
can then decide what to do with that information. If it's okay to defer 
then do nothing and wait for the range_invalidate. XPmem could stop 
allowing external references to be established until the invalidate_range 
was successful.
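
One possible shape of that deferral, sketched with invented names; it only
illustrates the "set a flag, hold off new references until the range callout
runs" idea, not any posted code:

static void xpm_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
				unsigned long address, int range_follows)
{
	struct xpm_ctx *ctx = container_of(mn, struct xpm_ctx, notifier);

	if (range_follows) {
		/* defer the real work; just stop new external references */
		set_bit(XPM_RANGE_PENDING, &ctx->flags);
		return;
	}
	xpm_recall_remote_page(ctx, address);
}

static void xpm_invalidate_range(struct mmu_notifier *mn, struct mm_struct *mm,
				 unsigned long start, unsigned long end)
{
	struct xpm_ctx *ctx = container_of(mn, struct xpm_ctx, notifier);

	xpm_recall_remote_ptes(ctx, start, end);	/* may sleep */
	clear_bit(XPM_RANGE_PENDING, &ctx->flags);
	wake_up(&ctx->fault_wq);
}

	/* remote fault path: no new references while a range invalidate is pending */
	wait_event(ctx->fault_wq, !test_bit(XPM_RANGE_PENDING, &ctx->flags));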

Jack had a concern that multiple callouts for the same pte could cause 
problems.




Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-01-30 Thread Jack Steiner
On Wed, Jan 30, 2008 at 11:41:29AM -0800, Christoph Lameter wrote:
 On Wed, 30 Jan 2008, Jack Steiner wrote:
 
  I see what you mean. I need to review to mail to see why this changed
  but in the original discussions with Christoph, the invalidate_range
  callouts were suppose to be made BEFORE the pages were put on the freelist.
 
 Seems that we cannot rely on the invalidate_ranges for correctness at all?
 We need to have invalidate_page() always. invalidate_range() is only an 
 optimization.
 

I don't understand your point "an optimization". How would invalidate_range
as currently defined be correctly used?

It _looks_ like it would work only if xpmem/gru/etc takes a refcnt on
the page & drops it when invalidate_range is called. That may work (not sure)
for xpmem but not for the GRU.

--- jack




Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-01-30 Thread Christoph Lameter
On Wed, 30 Jan 2008, Jack Steiner wrote:

  Seems that we cannot rely on the invalidate_ranges for correctness at all?
  We need to have invalidate_page() always. invalidate_range() is only an 
  optimization.
  
 
 I don't understand your point an optimization. How would invalidate_range
 as currently defined be correctly used?

We are changing definitions. The original patch by Andrea calls 
invalidate_page for each pte that is cleared. So strictly you would not 
need an invalidate_range.

 It _looks_ like it would work only if xpmem/gru/etc takes a refcnt on
 the page  drops it when invalidate_range is called. That may work (not sure)
 for xpmem but not for the GRU.

The refcount is not necessary if we adopt Andrea's approach of a callback 
on the clearing of each pte. At that point the page is still guaranteed to 
exist. If we do the range_invalidate later (as in V3) then the page may 
have been released (see sys_remap_file_pages() f.e.) before we zap the GRU 
ptes. So there will be a time when the GRU may write to a page that has 
been freed and used for another purpose.

Taking a refcount on the page defers the free until the range_invalidate 
runs.

I would prefer a solution that does not require taking refcounts (pins) 
for establishing an external pte and for release (like what the GRU does).

If we could effectively determine that there are no external ptes in a 
range then the invalidate_page() call may return immediately. Maybe it is 
then effective to do these gazillions of invalidate_page() calls when a 
process terminates or a remap is performed.




Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-01-30 Thread Robin Holt
On Wed, Jan 30, 2008 at 11:50:26AM -0800, Christoph Lameter wrote:
 On Wed, 30 Jan 2008, Andrea Arcangeli wrote:
 
  XPMEM requires with invalidate_range (sleepy) +
  before_invalidate_range (sleepy). invalidate_all should also be called
  before_release (both sleepy).
  
  It sounds we need full overlap of information provided by
  invalidate_page and invalidate_range to fit all three models (the
  opposite of the zero objective that current V3 is taking). And the
  swap will be handled only by invalidate_page either through linux rmap
  or external rmap (with the latter that can sleep so it's ok for you,
  the former not). GRU can safely use the either the linux rmap notifier
  or the external rmap notifier equally well, because when try_to_unmap
  is called the page is locked and obviously pinned by the VM itself.
 
 So put the invalidate_page() callbacks in everywhere.

The way I am envisioning it, we essentially drop back to Andrea's original
patch.  We then introduce an invalidate_range_begin (I was really thinking
of it as invalidate_and_lock_range()) and an invalidate_range_end (again
I was thinking of unlock_range).

Thanks,
Robin



Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-01-30 Thread Andrea Arcangeli
On Wed, Jan 30, 2008 at 11:50:26AM -0800, Christoph Lameter wrote:
 Then we have 
 
 invalidate_range_start(mm)
 
 and
 
 invalidate_range_finish(mm, start, end)
 
 in addition to the invalidate rmap_notifier?
 
 ---
  include/linux/mmu_notifier.h |7 +--
  1 file changed, 5 insertions(+), 2 deletions(-)
 
 Index: linux-2.6/include/linux/mmu_notifier.h
 ===
 --- linux-2.6.orig/include/linux/mmu_notifier.h   2008-01-30 
 11:49:02.0 -0800
 +++ linux-2.6/include/linux/mmu_notifier.h2008-01-30 11:49:57.0 
 -0800
 @@ -69,10 +69,13 @@ struct mmu_notifier_ops {
   /*
* lock indicates that the function is called under spinlock.
*/
 - void (*invalidate_range)(struct mmu_notifier *mn,
 + void (*invalidate_range_begin)(struct mmu_notifier *mn,
struct mm_struct *mm,
 -  unsigned long start, unsigned long end,
int lock);
 +
 + void (*invalidate_range_end)(struct mmu_notifier *mn,
 +  struct mm_struct *mm,
 +  unsigned long start, unsigned long end);
  };

start/finish/begin/end/before/after? ;)

I'd drop the 'int lock', you should skip the before/after if
i_mmap_lock isn't null and offload it to the caller before taking the
lock. At least for the after call that looks a few liner change,
didn't figure out the before yet.

Given the amount of changes that are going on in design terms to cover
both XPMEM and GRU, can we split the minimal invalidate_page that
provides an obviously safe and feature complete mmu notifier code for
KVM, and merge that first patch that will cover KVM 100%, it will
cover GRU 90%, and then we add invalidate_range_before/after in a
separate patch and we close the remaining 10% for GRU covering
ptep_get_and_clear or whatever else ptep_*?  The mmu notifiers are
made so that are extendible in backwards compatible way. I think
invalidate_page inside ptep_clear_flush is the first fundamental block
of the mmu notifiers. Then once the fundamental is in and obviously
safe and feature complete for KVM, the rest can be added very easily
with incremental patches as far as I can tell. That would be my
preferred route ;)



Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-01-30 Thread Christoph Lameter
On Thu, 31 Jan 2008, Andrea Arcangeli wrote:

  -   void (*invalidate_range)(struct mmu_notifier *mn,
  +   void (*invalidate_range_begin)(struct mmu_notifier *mn,
   struct mm_struct *mm,
  -unsigned long start, unsigned long end,
   int lock);
  +
  +   void (*invalidate_range_end)(struct mmu_notifier *mn,
  +struct mm_struct *mm,
  +unsigned long start, unsigned long end);
   };
 
 start/finish/begin/end/before/after? ;)

Well lets pick one and then stick to it.

 I'd drop the 'int lock', you should skip the before/after if
 i_mmap_lock isn't null and offload it to the caller before taking the
 lock. At least for the after call that looks a few liner change,
 didn't figure out the before yet.

How do we offload that? Before the scan of the rmaps we do not have the 
mm_struct. So we'd need another notifier_rmap_callback.

 Given the amount of changes that are going on in design terms to cover
 both XPMEM and GRE, can we split the minimal invalidate_page that
 provides an obviously safe and feature complete mmu notifier code for
 KVM, and merge that first patch that will cover KVM 100%, it will

The obvious solution does not scale. You will have a callback for every 
page and there may be a million of those if you have a 4GB process.
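
The arithmetic behind that figure, assuming 4kB base pages:

	4GB / 4kB per pte = 2^32 / 2^12 = 2^20 = 1048576

i.e. roughly a million invalidate_page() callouts just to tear down one
fully populated 4GB mapping.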

 made so that are extendible in backwards compatible way. I think
 invalidate_page inside ptep_clear_flush is the first fundamental block
 of the mmu notifiers. Then once the fundamental is in and obviously
 safe and feature complete for KVM, the rest can be added very easily
 with incremental patches as far as I can tell. That would be my
 preferred route ;)

We need to have a coherent notifier solution that works for multiple 
scenarios. I think a working invalidate_range would also be required for 
KVM. KVM and GRU are very similar so they should be able to use the same 
mechanisms and we need to properly document how that mechanism is safe. 
Either both take a page refcount or none.






Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-01-30 Thread Christoph Lameter
On Thu, 31 Jan 2008, Andrea Arcangeli wrote:

 On Wed, Jan 30, 2008 at 04:01:31PM -0800, Christoph Lameter wrote:
  How we offload that? Before the scan of the rmaps we do not have the 
  mmstruct. So we'd need another notifier_rmap_callback.
 
 My assumption is that that int lock exists just because
 unmap_mapping_range_vma exists. If I'm right then my suggestion was to
 move the invalidate_range after dropping the i_mmap_lock and not to
 invoke it inside zap_page_range.

There is still no pointer to the mm_struct available there because pages 
of a mapping may belong to multiple processes. So we need to add another 
rmap method?

The same issue is also occurring for unmap_hugepages().
 
 There's no reason why KVM should take any risk of corrupting memory
 due to a single missing mmu notifier, with not taking the
 refcount. get_user_pages will take it for us, so we have to pay the
 atomic-op anyway. It sure worth doing the atomic_dec inside the mmu
 notifier, and not immediately like this:

Well the GRU uses follow_page() instead of get_user_pages. Performance is 
a major issue for the GRU. 


 get_user_pages(pages)
 __free_page(pages[0])
 
 The idea is that what works for GRU, works for KVM too. So we do a
 single invalidate_page and clustered invalidate_pages, we add that,
 and then we make sure all places are covered so GRU will not
 kernel-crash, and KVM won't risk to run oom or to generate _userland_
 corruption.

Hmmm.. Could we go to a scheme where we do not have to increase the page 
count? Modifications of the page struct require dirtying a cache line and 
it seems that we do not need an increased page count if we have an
invalidate_range_start() that clears all the external references 
and stops the establishment of new ones and invalidate_range_end() that 
reenables new external references?

Then we do not need the frequent invalidate_page() calls.

The typical case would be anyways that invalidate_all() is called 
before anything else on exit. Invalidate_all() would remove all pages 
and disable creation of new references to the memory in the mm_struct.






Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-01-30 Thread Christoph Lameter

Patch to


1. Remove sync on notifier_release. Must be called when only a 
   single process remains.

2. Add invalidate_range_start/end. This should allow safe removal
   of ranges of external ptes without having to resort to a callback
   for every individual page.

This must be able to nest so the driver needs to keep a refcount of range 
invalidates and wait if the refcount != 0.
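
What that nesting rule would look like on the driver side, as a sketch with
invented names (the callback signatures follow the patch below):

static void drv_invalidate_range_begin(struct mmu_notifier *mn,
				       struct mm_struct *mm,
				       unsigned long start, unsigned long end,
				       int lock)
{
	struct drv_mm *d = container_of(mn, struct drv_mm, notifier);

	atomic_inc(&d->range_invalidates);
	drv_zap_external_ptes(d, start, end);
}

static void drv_invalidate_range_end(struct mmu_notifier *mn,
				     struct mm_struct *mm)
{
	struct drv_mm *d = container_of(mn, struct drv_mm, notifier);

	if (atomic_dec_and_test(&d->range_invalidates))
		wake_up(&d->fault_wq);
}

	/* driver fault path: only establish new external ptes while no
	 * range invalidate is in flight */
	wait_event(d->fault_wq, atomic_read(&d->range_invalidates) == 0);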


---
 include/linux/mmu_notifier.h |   11 +--
 mm/fremap.c  |3 ++-
 mm/hugetlb.c |3 ++-
 mm/memory.c  |   16 ++--
 mm/mmu_notifier.c|9 -
 5 files changed, 27 insertions(+), 15 deletions(-)

Index: linux-2.6/mm/mmu_notifier.c
===
--- linux-2.6.orig/mm/mmu_notifier.c2008-01-30 17:58:48.0 -0800
+++ linux-2.6/mm/mmu_notifier.c 2008-01-30 18:00:26.0 -0800
@@ -13,23 +13,22 @@
 #include linux/mm.h
 #include linux/mmu_notifier.h
 
+/*
+ * No synchronization. This function can only be called when only a single
+ * process remains that performs teardown.
+ */
 void mmu_notifier_release(struct mm_struct *mm)
 {
struct mmu_notifier *mn;
struct hlist_node *n, *t;
 
	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
-		down_write(&mm->mmap_sem);
-		rcu_read_lock();
		hlist_for_each_entry_safe_rcu(mn, n, t,
					  &mm->mmu_notifier.head, hlist) {
			hlist_del_rcu(&mn->hlist);
			if (mn->ops->release)
				mn->ops->release(mn, mm);
		}
-		rcu_read_unlock();
-		up_write(&mm->mmap_sem);
-   synchronize_rcu();
}
 }
 
Index: linux-2.6/include/linux/mmu_notifier.h
===
--- linux-2.6.orig/include/linux/mmu_notifier.h 2008-01-30 17:58:48.0 
-0800
+++ linux-2.6/include/linux/mmu_notifier.h  2008-01-30 18:00:26.0 
-0800
@@ -67,15 +67,22 @@ struct mmu_notifier_ops {
int dummy);
 
/*
+* invalidate_range_begin() and invalidate_range_end() are paired.
+*
+* invalidate_range_begin must clear all references in the range
+* and stop the establishment of new references.
+*
+* invalidate_range_end() reenables the establishment of references.
+*
 * lock indicates that the function is called under spinlock.
 */
void (*invalidate_range_begin)(struct mmu_notifier *mn,
 struct mm_struct *mm,
+unsigned long start, unsigned long end,
 int lock);
 
void (*invalidate_range_end)(struct mmu_notifier *mn,
-struct mm_struct *mm,
-unsigned long start, unsigned long end);
+struct mm_struct *mm);
 };
 
 struct mmu_rmap_notifier_ops;
Index: linux-2.6/mm/fremap.c
===
--- linux-2.6.orig/mm/fremap.c  2008-01-30 17:58:48.0 -0800
+++ linux-2.6/mm/fremap.c   2008-01-30 18:00:26.0 -0800
@@ -212,8 +212,9 @@ asmlinkage long sys_remap_file_pages(uns
	spin_unlock(&mapping->i_mmap_lock);
}
 
+   mmu_notifier(invalidate_range_start, mm, start, start + size, 0);
err = populate_range(mm, vma, start, size, pgoff);
-   mmu_notifier(invalidate_range, mm, start, start + size, 0);
+   mmu_notifier(invalidate_range_end, mm);
	if (!err && !(flags & MAP_NONBLOCK)) {
		if (unlikely(has_write_lock)) {
			downgrade_write(&mm->mmap_sem);
Index: linux-2.6/mm/hugetlb.c
===
--- linux-2.6.orig/mm/hugetlb.c 2008-01-30 17:58:48.0 -0800
+++ linux-2.6/mm/hugetlb.c  2008-01-30 18:00:26.0 -0800
@@ -744,6 +744,7 @@ void __unmap_hugepage_range(struct vm_ar
	BUG_ON(start & ~HPAGE_MASK);
	BUG_ON(end & ~HPAGE_MASK);
 
+   mmu_notifier(invalidate_range_start, mm, start, end, 1);
	spin_lock(&mm->page_table_lock);
	for (address = start; address < end; address += HPAGE_SIZE) {
ptep = huge_pte_offset(mm, address);
@@ -764,7 +765,7 @@ void __unmap_hugepage_range(struct vm_ar
}
	spin_unlock(&mm->page_table_lock);
flush_tlb_range(vma, start, end);
-   mmu_notifier(invalidate_range, mm, start, end, 1);
+   mmu_notifier(invalidate_range_end, mm);
	list_for_each_entry_safe(page, tmp, &page_list, lru) {
		list_del(&page->lru);
put_page(page);
Index: linux-2.6/mm/memory.c
===
--- linux-2.6.orig/mm/memory.c  2008-01-30 

Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-01-30 Thread Christoph Lameter
On Wed, 30 Jan 2008, Robin Holt wrote:

  Well the GRU uses follow_page() instead of get_user_pages. Performance is 
  a major issue for the GRU. 
 
 Worse, the GRU takes its TLB faults from within an interrupt so we
 use follow_page to prevent going to sleep.  That said, I think we
 could probably use follow_page() with FOLL_GET set to accomplish the
 requirements of mmu_notifier invalidate_range call.  Doesn't look too
 promising for hugetlb pages.

There may be no need to with the range_start/end scheme. The driver can 
have its own lock to make follow page secure. The lock needs to serialize 
the follow_page handler and the range_start/end calls as well as the 
invalidate_page callouts. I think that avoids the need for 
get_user_pages().
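
A sketch of that driver-lock scheme with invented names; only follow_page()
and the spinlock calls are real kernel interfaces, the rest is illustrative:

	/* fault side (GRU-style, no sleeping allowed) */
	spin_lock(&ctx->tlb_lock);
	if (!ctx->range_active) {
		page = follow_page(vma, address, FOLL_WRITE);
		if (page)
			install_external_tlb_entry(ctx, address, page_to_pfn(page));
	}		/* else: drop the lock and retry the fault later */
	spin_unlock(&ctx->tlb_lock);

	/* invalidate_range_start side */
	spin_lock(&ctx->tlb_lock);
	ctx->range_active++;
	flush_external_tlb(ctx, start, end);
	spin_unlock(&ctx->tlb_lock);

Because the same lock is held while looking up the page and while flushing,
the external TLB can never retain an entry for a page once the range callout
has completed, so no page reference needs to be held.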





Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-01-30 Thread Andrea Arcangeli
On Wed, Jan 30, 2008 at 06:08:14PM -0800, Christoph Lameter wrote:
   	hlist_for_each_entry_safe_rcu(mn, n, t,
 				  &mm->mmu_notifier.head, hlist) {
   		hlist_del_rcu(&mn->hlist);

_rcu can go away from both, if hlist_del_rcu can be called w/o locks.



Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-01-30 Thread Christoph Lameter
On Thu, 31 Jan 2008, Andrea Arcangeli wrote:

 On Wed, Jan 30, 2008 at 06:08:14PM -0800, Christoph Lameter wrote:
  	hlist_for_each_entry_safe_rcu(mn, n, t,
 				  &mm->mmu_notifier.head, hlist) {
  		hlist_del_rcu(&mn->hlist);
 
 _rcu can go away from both, if hlist_del_rcu can be called w/o locks.

True. hlist_del_init ok? That would allow the driver to check that the 
mmu_notifier is already linked in using !hlist_unhashed(). The driver then 
needs to properly initialize the mmu_notifier list with INIT_HLIST_NODE().
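
Driver-side usage of that, as a sketch assuming the register/unregister
calls from the posted series; the drv structure is hypothetical:

	/* setup, before registering */
	INIT_HLIST_NODE(&drv->notifier.hlist);
	drv->notifier.ops = &drv_notifier_ops;
	mmu_notifier_register(&drv->notifier, mm);

	/* teardown: only unregister if release() hasn't already unhashed us */
	if (!hlist_unhashed(&drv->notifier.hlist))
		mmu_notifier_unregister(&drv->notifier, mm);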





Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-01-30 Thread Robin Holt
 Well the GRU uses follow_page() instead of get_user_pages. Performance is 
 a major issue for the GRU. 

Worse, the GRU takes its TLB faults from within an interrupt so we
use follow_page to prevent going to sleep.  That said, I think we
could probably use follow_page() with FOLL_GET set to accomplish the
requirements of mmu_notifier invalidate_range call.  Doesn't look too
promising for hugetlb pages.

Thanks,
Robin



Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-01-29 Thread Andrea Arcangeli
On Mon, Jan 28, 2008 at 12:28:42PM -0800, Christoph Lameter wrote:
 Index: linux-2.6/mm/fremap.c
 ===
 --- linux-2.6.orig/mm/fremap.c2008-01-25 19:31:05.0 -0800
 +++ linux-2.6/mm/fremap.c 2008-01-25 19:32:49.0 -0800
 @@ -15,6 +15,7 @@
  #include linux/rmap.h
  #include linux/module.h
  #include linux/syscalls.h
 +#include linux/mmu_notifier.h
  
  #include asm/mmu_context.h
  #include asm/cacheflush.h
 @@ -211,6 +212,7 @@ asmlinkage long sys_remap_file_pages(uns
   spin_unlock(mapping-i_mmap_lock);
   }
  
 + mmu_notifier(invalidate_range, mm, start, start + size, 0);
   err = populate_range(mm, vma, start, size, pgoff);

How can it be right to invalidate_range _before_ ptep_clear_flush?

 @@ -1634,6 +1639,8 @@ gotten:
   /*
* Re-check the pte - we dropped the lock
*/
 + mmu_notifier(invalidate_range, mm, address,
 + address + PAGE_SIZE - 1, 0);
   page_table = pte_offset_map_lock(mm, pmd, address, ptl);
   if (likely(pte_same(*page_table, orig_pte))) {
   if (old_page) {

What's the point of invalidate_range when the size is PAGE_SIZE? And
how can it be right to invalidate_range _before_ ptep_clear_flush?



Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-01-29 Thread Andrea Arcangeli
Christoph, the below patch should fix the current leak of the pinned
pages. I hope the page-pin that should be dropped by the
invalidate_range op, is enough to prevent the physical page mapped
on that mm+address to change before invalidate_range returns. If
that would ever happen, there would be a coherency loss between the
guest VM writes and the writes coming from userland on the same
mm+address from a different thread (qemu, whatever). invalidate_page
before PT lock was obviously safe. Now we entirely relay on the pin to
prevent the page to change before invalidate_range returns. If the pte
is unmapped and the page is mapped back in with a minor fault that's
ok, as long as the physical page remains the same for that mm+address,
until all sptes are gone.

Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]

diff --git a/mm/fremap.c b/mm/fremap.c
--- a/mm/fremap.c
+++ b/mm/fremap.c
@@ -212,8 +212,8 @@ asmlinkage long sys_remap_file_pages(uns
	spin_unlock(&mapping->i_mmap_lock);
}
 
+   err = populate_range(mm, vma, start, size, pgoff);
mmu_notifier(invalidate_range, mm, start, start + size, 0);
-   err = populate_range(mm, vma, start, size, pgoff);
	if (!err && !(flags & MAP_NONBLOCK)) {
		if (unlikely(has_write_lock)) {
			downgrade_write(&mm->mmap_sem);
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1639,8 +1639,6 @@ gotten:
/*
 * Re-check the pte - we dropped the lock
 */
-   mmu_notifier(invalidate_range, mm, address,
-   address + PAGE_SIZE - 1, 0);
page_table = pte_offset_map_lock(mm, pmd, address, ptl);
if (likely(pte_same(*page_table, orig_pte))) {
if (old_page) {
@@ -1676,6 +1674,8 @@ gotten:
page_cache_release(old_page);
 unlock:
pte_unmap_unlock(page_table, ptl);
+   mmu_notifier(invalidate_range, mm, address,
+   address + PAGE_SIZE - 1, 0);
if (dirty_page) {
		if (vma->vm_file)
			file_update_time(vma->vm_file);



Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-01-29 Thread Andrea Arcangeli
On Tue, Jan 29, 2008 at 11:55:10AM -0800, Christoph Lameter wrote:
 I am not sure. AFAICT you wrote that code.

Actually I didn't need to change a single line in do_wp_page because
ptep_clear_flush was already doing everything transparently for
me. This was the memory.c part of my last patch I posted, it only
touches zap_page_range, remap_pfn_range and apply_to_page_range.

diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -889,6 +889,7 @@ unsigned long zap_page_range(struct vm_a
end = unmap_vmas(tlb, vma, address, end, nr_accounted, details);
if (tlb)
tlb_finish_mmu(tlb, address, end);
+   mmu_notifier(invalidate_range, mm, address, end);
return end;
 }
 
@@ -1317,7 +1318,7 @@ int remap_pfn_range(struct vm_area_struc
 {
pgd_t *pgd;
unsigned long next;
-   unsigned long end = addr + PAGE_ALIGN(size);
+   unsigned long start = addr, end = addr + PAGE_ALIGN(size);
	struct mm_struct *mm = vma->vm_mm;
int err;
 
@@ -1358,6 +1359,7 @@ int remap_pfn_range(struct vm_area_struc
if (err)
break;
} while (pgd++, addr = next, addr != end);
+   mmu_notifier(invalidate_range, mm, start, end);
return err;
 }
 EXPORT_SYMBOL(remap_pfn_range);
@@ -1441,7 +1443,7 @@ int apply_to_page_range(struct mm_struct
 {
pgd_t *pgd;
unsigned long next;
-   unsigned long end = addr + size;
+   unsigned long start = addr, end = addr + size;
int err;
 
	BUG_ON(addr >= end);
@@ -1452,6 +1454,7 @@ int apply_to_page_range(struct mm_struct
if (err)
break;
} while (pgd++, addr = next, addr != end);
+   mmu_notifier(invalidate_range, mm, start, end);
return err;
 }
 EXPORT_SYMBOL_GPL(apply_to_page_range);

 It seems to be okay to invalidate range if you hold mmap_sem writably. In 
 that case no additional faults can happen that would create new ptes.

In that place the mmap_sem is taken but in readonly mode. I never rely
on the mmap_sem in the mmu notifier methods. Not invoking the notifier
before releasing the PT lock adds quite some uncertainty on the smp
safety of the spte invalidates, because the pte may be unmapped and
remapped by a minor fault before invalidate_range is invoked, but I
didn't figure out a kernel crashing race yet thanks to the pin we take
through get_user_pages (and only thanks to it). The requirement is
that invalidate_range is invoked after the last ptep_clear_flush or it
leaks pins that's why I had to move it at the end.
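A minimal sketch of the ordering being insisted on here; the mmu_notifier(op, ...) macro form is taken from the patches in this thread, everything around it is abbreviated and illustrative:

    /* hedged sketch: teardown order a secondary MMU (spte, external TLB) needs */
    pte = ptep_clear_flush(vma, address, ptep);  /* clear linux pte + flush TLB */
    /* ... install the replacement pte, drop the PT lock ... */
    mmu_notifier(invalidate_range, mm, address,
                 address + PAGE_SIZE - 1, 0);
    /* sptes built after this point come from a fresh get_user_pages() against
     * the new pte; sptes built before it have just been torn down, so no pin
     * leaks. Invoking the notifier before the last ptep_clear_flush leaves a
     * window where a stale spte (and its page pin) is re-created and never
     * freed. */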



Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-01-29 Thread Andrea Arcangeli
On Tue, Jan 29, 2008 at 12:30:06PM -0800, Christoph Lameter wrote:
 On Tue, 29 Jan 2008, Andrea Arcangeli wrote:
 
  diff --git a/mm/fremap.c b/mm/fremap.c
  --- a/mm/fremap.c
  +++ b/mm/fremap.c
  @@ -212,8 +212,8 @@ asmlinkage long sys_remap_file_pages(uns
  spin_unlock(&mapping->i_mmap_lock);
  }
   
  +   err = populate_range(mm, vma, start, size, pgoff);
  mmu_notifier(invalidate_range, mm, start, start + size, 0);
  -   err = populate_range(mm, vma, start, size, pgoff);
  if (!err && !(flags & MAP_NONBLOCK)) {
  if (unlikely(has_write_lock)) {
  downgrade_write(&mm->mmap_sem);
 
 We invalidate the range *after* populating it? Isn't it okay to establish 
 references while populate_range() runs?

It's not ok because that function can very well overwrite existing and
present ptes (it's actually the nonlinear common case fast path for
db). With your code the sptes created between invalidate_range and
populate_range, will keep pointing forever to the old physical page
instead of the newly populated one.

I'm also asking myself if it's a smp race not to call
mmu_notifier(invalidate_page) between ptep_clear_flush and set_pte_at
in install_file_pte. Probably not because the guest VM running in a
different thread would need to serialize outside the install_file_pte
code with the task running install_file_pte, if it wants to be sure to
write either all its data to the old or the new page. Certainly doing
the invalidate_page inside the PT lock was obviously safe but I hope
this is safe and this can accommodate your needs too.

  diff --git a/mm/memory.c b/mm/memory.c
  --- a/mm/memory.c
  +++ b/mm/memory.c
  @@ -1639,8 +1639,6 @@ gotten:
  /*
   * Re-check the pte - we dropped the lock
   */
  -   mmu_notifier(invalidate_range, mm, address,
  -   address + PAGE_SIZE - 1, 0);
  page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
  if (likely(pte_same(*page_table, orig_pte))) {
  if (old_page) {
 
 What we did is to invalidate the page (?!) before taking the pte lock. In 
 the lock we replace the pte to point to another page. This means that we 
 need to clear stale information. So we zap it before. If another reference 
 is established after taking the spinlock then the pte contents have 
 changed and the critical section fails.
 
 Before the critical section starts we have gotten an extra refcount on the 
 original page so the page cannot vanish from under us.

The problem is the missing invalidate_page/range _after_
ptep_clear_flush. If a spte is built between invalidate_range and
pte_offset_map_lock, it will remain pointing to the old page
forever. Nothing will be called to invalidate that stale spte built
between invalidate_page/range and ptep_clear_flush. This is why for
the last few days I kept saying the mmu notifiers have to be invoked
_after_ ptep_clear_flush and never before (remember the export
notifier?). No idea how you can deal with this in your code, certainly
for KVM sptes that's backwards and unworkable ordering of operation
(exactly as backwards are doing the tlb flush before pte_clear in
ptep_clear_flush, think spte as a tlb, you can't flush the tlb before
clearing/updating the pte or it's smp unsafe).
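For reference, the shape of the approach being described, with the notifier hooked inside ptep_clear_flush itself; a simplified sketch built from the 2.6.24-era generic definition plus the hook, not the literal macro body:

    /* hedged sketch: treat the spte like a TLB entry and flush it last */
    pte = ptep_get_and_clear(mm, address, ptep); /* 1. clear the linux pte      */
    flush_tlb_page(vma, address);                /* 2. flush the hardware TLB   */
    mmu_notifier(invalidate_page, mm, address);  /* 3. flush the "software TLB" */
    /* doing step 3 (or step 2) before step 1 is the smp-unsafe ordering the
     * text complains about: the stale translation can be re-fetched after the
     * flush because the pte is still present */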

  @@ -1676,6 +1674,8 @@ gotten:
  page_cache_release(old_page);
   unlock:
  pte_unmap_unlock(page_table, ptl);
  +   mmu_notifier(invalidate_range, mm, address,
  +   address + PAGE_SIZE - 1, 0);
  if (dirty_page) {
  if (vma->vm_file)
  file_update_time(vma->vm_file);
 
 Now we invalidate the page after the transaction is complete. This means 
 external pte can persist while we change the pte? Possibly even dirty the 
 page?

Yes, and the only reason this can be safe is for the reason explained
at the top of the email, if the other cpu wants to serialize to be
sure to write in the new page, it has to serialize with the
page-fault but to serialize it has to wait the page fault to return
(example: we're not going to call futex code until the page fault
returns).



Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-01-29 Thread Christoph Lameter
On Tue, 29 Jan 2008, Andrea Arcangeli wrote:

 diff --git a/mm/fremap.c b/mm/fremap.c
 --- a/mm/fremap.c
 +++ b/mm/fremap.c
 @@ -212,8 +212,8 @@ asmlinkage long sys_remap_file_pages(uns
   spin_unlock(&mapping->i_mmap_lock);
   }
  
 + err = populate_range(mm, vma, start, size, pgoff);
   mmu_notifier(invalidate_range, mm, start, start + size, 0);
 - err = populate_range(mm, vma, start, size, pgoff);
   if (!err && !(flags & MAP_NONBLOCK)) {
   if (unlikely(has_write_lock)) {
   downgrade_write(&mm->mmap_sem);

We invalidate the range *after* populating it? Isn't it okay to establish 
references while populate_range() runs?

 diff --git a/mm/memory.c b/mm/memory.c
 --- a/mm/memory.c
 +++ b/mm/memory.c
 @@ -1639,8 +1639,6 @@ gotten:
   /*
* Re-check the pte - we dropped the lock
*/
 - mmu_notifier(invalidate_range, mm, address,
 - address + PAGE_SIZE - 1, 0);
   page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
   if (likely(pte_same(*page_table, orig_pte))) {
   if (old_page) {

What we did is to invalidate the page (?!) before taking the pte lock. In 
the lock we replace the pte to point to another page. This means that we 
need to clear stale information. So we zap it before. If another reference 
is established after taking the spinlock then the pte contents have 
changed and the critical section fails.

Before the critical section starts we have gotten an extra refcount on the 
original page so the page cannot vanish from under us.

 @@ -1676,6 +1674,8 @@ gotten:
   page_cache_release(old_page);
  unlock:
   pte_unmap_unlock(page_table, ptl);
 + mmu_notifier(invalidate_range, mm, address,
 + address + PAGE_SIZE - 1, 0);
   if (dirty_page) {
   if (vma->vm_file)
   file_update_time(vma->vm_file);

Now we invalidate the page after the transaction is complete. This means 
external pte can persist while we change the pte? Possibly even dirty the 
page?






Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-01-29 Thread Christoph Lameter
On Tue, 29 Jan 2008, Andrea Arcangeli wrote:

  +   mmu_notifier(invalidate_range, mm, address,
  +   address + PAGE_SIZE - 1, 0);
  page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
  if (likely(pte_same(*page_table, orig_pte))) {
  if (old_page) {
 
 What's the point of invalidate_range when the size is PAGE_SIZE? And
 how can it be right to invalidate_range _before_ ptep_clear_flush?

I am not sure. AFAICT you wrote that code.

It seems to be okay to invalidate range if you hold mmap_sem writably. In 
that case no additional faults can happen that would create new ptes.






Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-01-29 Thread Christoph Lameter
On Tue, 29 Jan 2008, Andrea Arcangeli wrote:

  It seems to be okay to invalidate range if you hold mmap_sem writably. In 
  that case no additional faults can happen that would create new ptes.
 
 In that place the mmap_sem is taken but in readonly mode. I never rely
 on the mmap_sem in the mmu notifier methods. Not invoking the notifier

Well it seems that we have to rely on mmap_sem otherwise concurrent faults 
can occur. The mmap_sem seems to be acquired for write there.

  if (!has_write_lock) {
up_read(&mm->mmap_sem);
down_write(&mm->mmap_sem);
has_write_lock = 1;
goto retry;
}


 before releasing the PT lock adds quite some uncertainty on the smp
 safety of the spte invalidates, because the pte may be unmapped and
 remapped by a minor fault before invalidate_range is invoked, but I
 didn't figure out a kernel crashing race yet thanks to the pin we take
 through get_user_pages (and only thanks to it). The requirement is
 that invalidate_range is invoked after the last ptep_clear_flush or it
 leaks pins that's why I had to move it at the end.
 
So pins means a reference count right? I still do not get why you 
have refcount problems. You take a refcount when you export the page 
through KVM and then drop the refcount in invalidate page right?

So you walk through the KVM ptes and drop the refcount for each spte you 
encounter?
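A hedged sketch of what such a walk looks like on the secondary-MMU side; every helper name here is hypothetical, not the real kvm mmu code:

    /* hedged sketch, hypothetical helpers */
    static void example_drop_spte(struct kvm *kvm, u64 *sptep)
    {
        u64 spte = *sptep;
        struct page *page = pfn_to_page(example_spte_to_pfn(spte));

        example_zap_spte(kvm, sptep);        /* clear the spte, flush guest TLBs */
        if (example_spte_writable(spte))
            set_page_dirty(page);            /* the guest may have dirtied it    */
        put_page(page);                      /* drop the get_user_pages() pin    */
    }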





Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-01-29 Thread Andrea Arcangeli
On Tue, Jan 29, 2008 at 01:53:05PM -0800, Christoph Lameter wrote:
 On Tue, 29 Jan 2008, Andrea Arcangeli wrote:
 
  We invalidate the range *after* populating it? Isn't it okay to establish 
   references while populate_range() runs?
  
  It's not ok because that function can very well overwrite existing and
  present ptes (it's actually the nonlinear common case fast path for
  db). With your code the sptes created between invalidate_range and
  populate_range, will keep pointing forever to the old physical page
  instead of the newly populated one.
 
 Seems though that the mmap_sem is taken for regular vmas writably and will 
 hold off new mappings.

It's taken writable due to the code being inefficient the first time;
all later times remap_populate_range overwrites ptes with the mmap_sem
in readonly mode (finally rightfully so). The first remap_file_pages is
probably irrelevant to optimize: the whole point of nonlinear is to
call remap_file_pages zillions of times on the same vma, overwriting
present ptes the whole time, so if the first time the mutex is not
readonly it probably doesn't make a difference.

get_user_pages invoked by the kvm spte-fault, can happen between
invalidate_range and populate_range. If it can't happen, for sure
nobody pointed out a good reason why it can't happen. The kvm page
faults as well rightfully only takes the mmap_sem in readonly mode, so
get_user_pages is only called internally to gfn_to_page with the
readonly semaphore.
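The fault path referred to is, in compressed form, roughly the following (simplified from the 2.6.24-era gfn_to_page(); locals, flags and error handling trimmed):

    /* hedged sketch of the kvm page-in path discussed above */
    down_read(&current->mm->mmap_sem);
    npages = get_user_pages(current, current->mm, addr, 1,
                            1 /* write */, 0 /* force */, &page, NULL);
    up_read(&current->mm->mmap_sem);
    /* the spte is then built from 'page'; nothing here prevents this from
     * running in the window between invalidate_range and populate_range,
     * which is exactly the race being discussed */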

With my approach ptep_clear_flush was not only invalidating sptes
after ptep_clear_flush, but it was also invalidating them inside the
PT lock, so it was totally obvious there could be no race vs
get_user_pages.

  I'm also asking myself if it's a smp race not to call
  mmu_notifier(invalidate_page) between ptep_clear_flush and set_pte_at
  in install_file_pte. Probably not because the guest VM running in a
  different thread would need to serialize outside the install_file_pte
  code with the task running install_file_pte, if it wants to be sure to
  write either all its data to the old or the new page. Certainly doing
  the invalidate_page inside the PT lock was obviously safe but I hope
  this is safe and this can accommodate your needs too.
 
 But that would be doing two invalidates on one pte. One range and one page 
 invalidate.

Yes, but it would have been micro-optimized later if you really cared,
by simply changing ptep_clear_flush to __ptep_clear_flush, no big
deal. Definitely all methods must be robust about them being called
multiple times, even if the rmap finds no spte mapping such host
virtual address.

 Hmmm... So we could only do an invalidate_page here? Drop the strange 
 invalidate_range()?

That's a question you should answer.

@@ -1676,6 +1674,8 @@ gotten:
page_cache_release(old_page);
 unlock:
pte_unmap_unlock(page_table, ptl);
+   mmu_notifier(invalidate_range, mm, address,
+   address + PAGE_SIZE - 1, 0);
if (dirty_page) {
if (vma->vm_file)
file_update_time(vma->vm_file);
   
   Now we invalidate the page after the transaction is complete. This means 
   external pte can persist while we change the pte? Possibly even dirty the 
   page?
  
  Yes, and the only reason this can be safe is for the reason explained
  at the top of the email, if the other cpu wants to serialize to be
  sure to write in the new page, it has to serialize with the
  page-fault but to serialize it has to wait the page fault to return
  (example: we're not going to call futex code until the page fault
  returns).
 
 Serialize how? mmap_sem?

No, that's a different angle.

But now I think there may be an issue with a third thread that may
show unsafe the removal of invalidate_page from ptep_clear_flush.

A third thread writing to a page through the linux-pte and the guest
VM writing to the same page through the sptes, will be writing on the
same physical page concurrently and using an userspace spinlock w/o
ever entering the kernel. With your patch that invalidate_range after
dropping the PT lock, the third thread may start writing on the new
page, when the guest is still writing to the old page through the
sptes. While this couldn't happen with my patch.

So really in light of the third thread, it seems your approach is
smp racey and ptep_clear_flush should invalidate_page as the last thing
before returning. My patch was enforcing that ptep_clear_flush would
stop the third thread in a linux page fault, and to drop the spte,
before the new mapping could be instantiated in both the linux pte and
in the sptes. The PT lock provided the needed serialization. This
ensured the third thread and the guest VM would always write on the
same physical page even if the first thread runs a flood of
remap_file_pages on that same page moving it around the pagecache. So
it seems I found an unfixable smp race in pretending to invalidate in 

Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-01-29 Thread Christoph Lameter
On Tue, 29 Jan 2008, Andrea Arcangeli wrote:

 We invalidate the range *after* populating it? Isn't it okay to establish 
  references while populate_range() runs?
 
 It's not ok because that function can very well overwrite existing and
 present ptes (it's actually the nonlinear common case fast path for
 db). With your code the sptes created between invalidate_range and
 populate_range, will keep pointing forever to the old physical page
 instead of the newly populated one.

Seems though that the mmap_sem is taken for regular vmas writably and will 
hold off new mappings.

 I'm also asking myself if it's a smp race not to call
 mmu_notifier(invalidate_page) between ptep_clear_flush and set_pte_at
 in install_file_pte. Probably not because the guest VM running in a
 different thread would need to serialize outside the install_file_pte
 code with the task running install_file_pte, if it wants to be sure to
 write either all its data to the old or the new page. Certainly doing
 the invalidate_page inside the PT lock was obviously safe but I hope
 this is safe and this can accommodate your needs too.

But that would be doing two invalidates on one pte. One range and one page 
invalidate.

   diff --git a/mm/memory.c b/mm/memory.c
   --- a/mm/memory.c
   +++ b/mm/memory.c
   @@ -1639,8 +1639,6 @@ gotten:
 /*
  * Re-check the pte - we dropped the lock
  */
   - mmu_notifier(invalidate_range, mm, address,
   - address + PAGE_SIZE - 1, 0);
 page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
 if (likely(pte_same(*page_table, orig_pte))) {
 if (old_page) {
  
  What we did is to invalidate the page (?!) before taking the pte lock. In 
  the lock we replace the pte to point to another page. This means that we 
  need to clear stale information. So we zap it before. If another reference 
  is established after taking the spinlock then the pte contents have 
  changed and the critical section fails.
  
  Before the critical section starts we have gotten an extra refcount on the 
  original page so the page cannot vanish from under us.
 
 The problem is the missing invalidate_page/range _after_
 ptep_clear_flush. If a spte is built between invalidate_range and
 pte_offset_map_lock, it will remain pointing to the old page
 forever. Nothing will be called to invalidate that stale spte built
 between invalidate_page/range and ptep_clear_flush. This is why for
 the last few days I kept saying the mmu notifiers have to be invoked
 _after_ ptep_clear_flush and never before (remember the export
 notifier?). No idea how you can deal with this in your code, certainly
 for KVM sptes that's backwards and unworkable ordering of operation
 (exactly as backwards are doing the tlb flush before pte_clear in
 ptep_clear_flush, think spte as a tlb, you can't flush the tlb before
 clearing/updating the pte or it's smp unsafe).

Hmmm... So we could only do an invalidate_page here? Drop the strange 
invalidate_range()?

 
   @@ -1676,6 +1674,8 @@ gotten:
 page_cache_release(old_page);
unlock:
 pte_unmap_unlock(page_table, ptl);
   + mmu_notifier(invalidate_range, mm, address,
   + address + PAGE_SIZE - 1, 0);
 if (dirty_page) {
 if (vma->vm_file)
 file_update_time(vma->vm_file);
  
  Now we invalidate the page after the transaction is complete. This means 
  external pte can persist while we change the pte? Possibly even dirty the 
  page?
 
 Yes, and the only reason this can be safe is for the reason explained
 at the top of the email, if the other cpu wants to serialize to be
 sure to write in the new page, it has to serialize with the
 page-fault but to serialize it has to wait the page fault to return
 (example: we're not going to call futex code until the page fault
 returns).

Serialize how? mmap_sem?
 



Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-01-29 Thread Christoph Lameter
On Tue, 29 Jan 2008, Andrea Arcangeli wrote:

 hmm, there where? When I said it was taken in readonly mode I meant
 for the quoted code (it would be at the top if it wasn't cut), so I
 quote below again:
 
   +   mmu_notifier(invalidate_range, mm, address,
   +   address + PAGE_SIZE - 1, 0);
   page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
   if (likely(pte_same(*page_table, orig_pte))) {
   if (old_page) {
 
 The there for me was do_wp_page.

Maybe we better focus on one call at a time?

 Even for the code you quoted in fremap.c, the has_write_lock is set
 to 1 _only_ for the very first time you call sys_remap_file_pages on a
 VMA. Only the transition of the VMA between linear to nonlinear
 requires the mmap in write mode. So you can be sure all fremap code
 99% of the time is populating (overwriting) already present ptes with
 only the mmap_sem in readonly mode like do_wp_page. It would be
 unnecessary to populate the nonlinear range with the mmap in write
 mode. Only the vma mangling requires the mmap_sem in write mode, the
 pte modifications only requires the PT_lock + mmap_sem in read mode.
 
 Effectively the first invocation of populate_range runs with the
 mmap_sem in write mode, I wonder why, there seem to be no good reason
 for that. I guess it's a bit that should be optimized, by calling
 downgrade_write before calling populate_range even for the first time
 the vma switches from linear to nonlinear (after the vma has been
 fully updated to the new status). But for sure all later invocations
 runs populate_range with the semaphore readonly like the rest of the
 VM does when instantiating ptes in the page faults.

If it does not run in write mode then concurrent faults are permissible 
while we remap pages. Weird. Maybe we better handle this like individual
page operations? Put the invalidate_page back into zap_pte. But then there 
would be no callback w/o lock as required by Robin. Doing the 
invalidate_range after populate allows access to memory for which ptes 
were zapped and the refcount was released.

 All pins are gone by the time invalidate_page/range returns. But there
 is no critical section between invalidate_page and the _later_
 ptep_clear_flush. So get_user_pages is free to run and take the PT
 lock before the ptep_clear_flush, find the linux pte still
 instantiated, and to create a new spte, before ptep_clear_flush runs.

Hmmm... Right. Did not consider get_user_pages. A write to the page that 
is not marked dirty would typically require a fault that will serialize.
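The window can be written out as a two-CPU interleaving; an illustration of the scenario in the quoted paragraph, not code from any of the patches:

    /*
     * CPU0: unmap/remap path                 CPU1: kvm fault via get_user_pages
     * ----------------------                 ----------------------------------
     * invalidate_page/range()
     *   -> sptes zapped, pins dropped
     *                                        get_user_pages() still finds the
     *                                        linux pte present, re-pins the
     *                                        page, a fresh spte is built
     * ptep_clear_flush()
     *   -> linux pte gone, but the fresh
     *      spte and its pin survive and
     *      nothing is left to invalidate them
     */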




Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-01-29 Thread Christoph Lameter
On Tue, 29 Jan 2008, Andrea Arcangeli wrote:

 But now I think there may be an issue with a third thread that may
 show unsafe the removal of invalidate_page from ptep_clear_flush.
 
 A third thread writing to a page through the linux-pte and the guest
 VM writing to the same page through the sptes, will be writing on the
 same physical page concurrently and using an userspace spinlock w/o
 ever entering the kernel. With your patch that invalidate_range after
 dropping the PT lock, the third thread may start writing on the new
 page, when the guest is still writing to the old page through the
 sptes. While this couldn't happen with my patch.

A user space spinlock plays into this??? That is irrelevant to the kernel. 
And we are discussing your placement of the invalidate_range not mine.

This is the scenario that I described before. You just need two threads.
One thread is in do_wp_page and the other is writing through the spte. 
We are in do_wp_page. Meaning the page is not writable. The writer will 
have to take a fault which will properly serialize access. It's a bug if the 
spte would allow a write.




Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-01-29 Thread Andrea Arcangeli
On Tue, Jan 29, 2008 at 02:55:56PM -0800, Christoph Lameter wrote:
 On Tue, 29 Jan 2008, Andrea Arcangeli wrote:
 
  But now I think there may be an issue with a third thread that may
  show unsafe the removal of invalidate_page from ptep_clear_flush.
  
  A third thread writing to a page through the linux-pte and the guest
  VM writing to the same page through the sptes, will be writing on the
  same physical page concurrently and using an userspace spinlock w/o
  ever entering the kernel. With your patch that invalidate_range after
  dropping the PT lock, the third thread may start writing on the new
  page, when the guest is still writing to the old page through the
  sptes. While this couldn't happen with my patch.
 
 A user space spinlock plays into this??? That is irrelevant to the kernel. 
 And we are discussing your placement of the invalidate_range not mine.

With my code, invalidate_range wasn't placed there at all, my
modification to ptep_clear_flush already covered it in an automatic
way, grep from the word fremap in my latest patch you won't find it,
like you won't find any change to do_wp_page. Not sure why you keep
thinking I added those invalidate_range when in fact you did.

The user space spinlock plays also in declaring rdtscp unworkable to
provide a monotone vgettimeofday w/o kernel locking.

My patch by calling invalidate_page inside ptep_clear_flush guaranteed
that both the thread writing through sptes and the thread writing
through linux ptes, couldn't possibly simultaneously write to two
different physical pages.

Your patch allows the thread writing through linux-pte to write to a
new populated page while the old thread writing through sptes still
writes to the old page. Is that safe? I don't know for sure. The fact
the physical page backing the virtual address could change back and
forth, perhaps invalidates the theory that somebody could possibly do
some useful locking out of it relying on all threads seeing the same
physical page at the same time.

Anyway as long as invalidate_page/range happens after ptep_clear_flush
things are mostly ok.

 This is the scenario that I described before. You just need two threads.
 One thread is in do_wp_page and the other is writing through the spte. 
 We are in do_wp_page. Meaning the page is not writable. The writer will 

Actually above I was describing remap_file_pages not do_wp_page.

 have to take fault which will properly serialize access. It a bug if the 
 spte would allow write.

In that scenario because write is forbidden (unlike remap_file_pages)
like you said things should be ok. The spte reader will eventually see
the updates happening in the new page, as long as the spte invalidate
happens after ptep_clear_flush (i.e. with my incremental fix applied
to your code, or with my latest patch).



Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-01-29 Thread Andrea Arcangeli
On Tue, Jan 29, 2008 at 02:39:00PM -0800, Christoph Lameter wrote:
 If it does not run in write mode then concurrent faults are permissible 
 while we remap pages. Weird. Maybe we better handle this like individual
 page operations? Put the invalidate_page back into zap_pte. But then there 
 would be no callback w/o lock as required by Robin. Doing the 

The Robin requirements and the need to schedule are the source of the
complications indeed.

I posted all the KVM patches using mmu notifiers, today I reposted the
ones to work with your V2 (which crashes my host unlike my last
simpler mmu notifier patch but I also changed a few other variables
besides your mmu notifier changes, so I can't yet be sure it's a bug
in your V2, and the SMP regressions I fixed so far sure can't explain
the crashes because my KVM setup could never run in do_wp_page nor
remap_file_pages so it's something else I need to find ASAP).

Robin, if you don't mind, could you please post or upload somewhere
your GPLv2 code that registers itself in Christoph's V2 notifiers? Or
is it top secret? I wouldn't mind to have a look so I can better
understand what's the exact reason you're sleeping besides attempting
GFP_KERNEL allocations. Thanks!

 invalidate_range after populate allows access to memory for which ptes 
 were zapped and the refcount was released.

The last refcount is released by the invalidate_range itself.
 
  All pins are gone by the time invalidate_page/range returns. But there
  is no critical section between invalidate_page and the _later_
  ptep_clear_flush. So get_user_pages is free to run and take the PT
  lock before the ptep_clear_flush, find the linux pte still
  instantiated, and to create a new spte, before ptep_clear_flush runs.
 
 Hmmm... Right. Did not consider get_user_pages. A write to the page that 
 is not marked dirty would typically require a fault that will serialize.

The pte is already marked dirty (and this is the case only for
get_user_pages, regular linux writes don't fault unless it's
explicitly writeprotected, which is mandatory in a few archs, but not x86).



Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-01-29 Thread Christoph Lameter
On Wed, 30 Jan 2008, Andrea Arcangeli wrote:

 On Wed, Jan 30, 2008 at 01:00:39AM +0100, Andrea Arcangeli wrote:
  get_user_pages, regular linux writes don't fault unless it's
  explicitly writeprotect, which is mandatory in a few archs, x86 not).
 
 actually get_user_pages doesn't fault either but it calls into
 set_page_dirty, however get_user_pages (unlike a userland-write) at
 least requires mmap_sem in read mode and the PT lock as serialization,
 userland writes don't, they just go ahead and mark the pte in hardware
 w/o faults. Anyway anonymous memory these days always mapped with
 dirty bit set regardless, even for read-faults, after Nick finally
 rightfully cleaned up the zero-page trick.

That is only partially true. These days ptes are created write-protected 
in order to track dirty state. The first write will lead to a fault that 
switches the pte to writable. When the page undergoes writeback the page 
again becomes write-protected. Thus our need to effectively deal with 
page_mkclean.
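A small illustration of that cycle using the standard pte helpers; an illustrative snippet, not a quote from any patch in this thread:

    /* hedged sketch of dirty tracking for a file-backed pte */
    entry = mk_pte(page, vma->vm_page_prot);
    entry = pte_wrprotect(entry);            /* created write-protected         */
    set_pte_at(mm, address, ptep, entry);
    /* the first write faults; the fault handler makes it writable and dirty:  */
    entry = pte_mkdirty(pte_mkwrite(entry));
    /* writeback later calls page_mkclean(), which write-protects the pte again
     * through the rmap so the next write is caught and the page re-dirtied    */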




Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-01-29 Thread Jack Steiner
On Tue, Jan 29, 2008 at 04:20:50PM -0800, Christoph Lameter wrote:
 On Wed, 30 Jan 2008, Andrea Arcangeli wrote:
 
   invalidate_range after populate allows access to memory for which ptes 
   were zapped and the refcount was released.
  
  The last refcount is released by the invalidate_range itself.
 
 That is true for your implementation and to address Robin's issues. Jack: 
 Is that true for the GRU?

I'm not sure I understand the question. The GRU never (currently) takes
a reference on a page. It has no mechanism for tracking pages that
were exported to the external TLBs.

--- jack



Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-01-29 Thread Andrea Arcangeli
On Wed, Jan 30, 2008 at 01:00:39AM +0100, Andrea Arcangeli wrote:
 get_user_pages, regular linux writes don't fault unless it's
 explicitly writeprotect, which is mandatory in a few archs, x86 not).

actually get_user_pages doesn't fault either but it calls into
set_page_dirty, however get_user_pages (unlike a userland-write) at
least requires mmap_sem in read mode and the PT lock as serialization,
userland writes don't, they just go ahead and mark the pte in hardware
w/o faults. Anyway anonymous memory these days always mapped with
dirty bit set regardless, even for read-faults, after Nick finally
rightfully cleaned up the zero-page trick.



Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-01-29 Thread Christoph Lameter
On Wed, 30 Jan 2008, Andrea Arcangeli wrote:

  A user space spinlock plays into this??? That is irrelevant to the kernel. 
  And we are discussing your placement of the invalidate_range not mine.
 
 With my code, invalidate_range wasn't placed there at all, my
 modification to ptep_clear_flush already covered it in an automatic
 way, grep from the word fremap in my latest patch you won't find it,
 like you won't find any change to do_wp_page. Not sure why you keep
 thinking I added those invalidate_range when in fact you did.

Well you moved the code at minimum. Hmmm... according to 
http://marc.info/?l=linux-kernel&m=120114755620891&w=2 it was Robin.

 The user space spinlock plays also in declaring rdtscp unworkable to
 provide a monotone vgettimeofday w/o kernel locking.

No idea what you are talking about.

 My patch by calling invalidate_page inside ptep_clear_flush guaranteed
 that both the thread writing through sptes and the thread writing
 through linux ptes, couldn't possibly simultaneously write to two
 different physical pages.

But then the ptep_clear_flush will issue invalidate_page() for ranges 
that were already covered by invalidate_range(). There are multiple calls 
to clear the same spte.

 Your patch allows the thread writing through linux-pte to write to a
 new populated page while the old thread writing through sptes still
 writes to the old page. Is that safe? I don't know for sure. The fact
 the physical page backing the virtual address could change back and
 forth, perhaps invalidates the theory that somebody could possibly do
 some useful locking out of it relying on all threads seeing the same
 physical page at the same time.

This is referring to the remap issue, not do_wp_page, right?

 Actually above I was describing remap_file_pages not do_wp_page.

Ok.

The serialization of remap_file_pages does not seem that critical since we 
only take a read lock on mmap_sem here. There may already be concurrent 
access to pages from other processors while the ptes are remapped. So 
there is already some overlap.

We could take mmap_sem there writably and keep it writably for the case 
that we have an mmu notifier in the mm.
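A hedged sketch of what that could look like in sys_remap_file_pages, mirroring the existing has_write_lock retry quoted earlier; mm_has_notifiers() is a hypothetical predicate, not a helper in this patchset:

    /* hedged sketch, hypothetical predicate */
    if (!has_write_lock && mm_has_notifiers(mm)) {
        up_read(&mm->mmap_sem);
        down_write(&mm->mmap_sem);
        has_write_lock = 1;
        goto retry;
    }
    /* holding mmap_sem for write keeps concurrent faults from rebuilding sptes
     * while the nonlinear ptes are overwritten and then invalidated */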





Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-01-29 Thread Christoph Lameter
On Tue, 29 Jan 2008, Jack Steiner wrote:

  That is true for your implementation and to address Robin's issues. Jack: 
  Is that true for the GRU?
 
 I'm not sure I understand the question. The GRU never (currently) takes
 a reference on a page. It has no mechanism for tracking pages that
 were exported to the external TLBs.

That's what I was looking for. Thanks. KVM takes a refcount and so does 
XPmem.




Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-01-29 Thread Christoph Lameter
On Wed, 30 Jan 2008, Andrea Arcangeli wrote:

  invalidate_range after populate allows access to memory for which ptes 
  were zapped and the refcount was released.
 
 The last refcount is released by the invalidate_range itself.

That is true for your implementation and to address Robin's issues. Jack: 
Is that true for the GRU?




[kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-01-29 Thread Christoph Lameter
The invalidation of address ranges in a mm_struct needs to be
performed when pages are removed or permissions etc change.
Most of the VM address space changes can use the range invalidate
callback.

invalidate_range() is generally called with mmap_sem held but
no spinlocks are active. If invalidate_range() is called with
locks held then we pass a flag into invalidate_range()

Comments state that mmap_sem must be held for
remap_pfn_range() but various drivers do not seem to do this.

Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]
Signed-off-by: Robin Holt [EMAIL PROTECTED]
Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 mm/fremap.c  |2 ++
 mm/hugetlb.c |2 ++
 mm/memory.c  |   11 +--
 mm/mmap.c|1 +
 4 files changed, 14 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/fremap.c
===
--- linux-2.6.orig/mm/fremap.c  2008-01-29 16:56:33.0 -0800
+++ linux-2.6/mm/fremap.c   2008-01-29 16:59:24.0 -0800
@@ -15,6 +15,7 @@
 #include <linux/rmap.h>
 #include <linux/module.h>
 #include <linux/syscalls.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/mmu_context.h>
 #include <asm/cacheflush.h>
@@ -212,6 +213,7 @@ asmlinkage long sys_remap_file_pages(uns
}
 
err = populate_range(mm, vma, start, size, pgoff);
+   mmu_notifier(invalidate_range, mm, start, start + size, 0);
if (!err  !(flags  MAP_NONBLOCK)) {
if (unlikely(has_write_lock)) {
downgrade_write(&mm->mmap_sem);
Index: linux-2.6/mm/memory.c
===
--- linux-2.6.orig/mm/memory.c  2008-01-29 16:56:33.0 -0800
+++ linux-2.6/mm/memory.c   2008-01-29 16:59:24.0 -0800
@@ -50,6 +50,7 @@
 #include <linux/delayacct.h>
 #include <linux/init.h>
 #include <linux/writeback.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
@@ -891,6 +892,8 @@ unsigned long zap_page_range(struct vm_a
end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
if (tlb)
tlb_finish_mmu(tlb, address, end);
+   mmu_notifier(invalidate_range, mm, address, end,
+   (details ? (details->i_mmap_lock != NULL) : 0));
return end;
 }
 
@@ -1319,7 +1322,7 @@ int remap_pfn_range(struct vm_area_struc
 {
pgd_t *pgd;
unsigned long next;
-   unsigned long end = addr + PAGE_ALIGN(size);
+   unsigned long start = addr, end = addr + PAGE_ALIGN(size);
struct mm_struct *mm = vma->vm_mm;
int err;
 
@@ -1360,6 +1363,7 @@ int remap_pfn_range(struct vm_area_struc
if (err)
break;
} while (pgd++, addr = next, addr != end);
+   mmu_notifier(invalidate_range, mm, start, end, 0);
return err;
 }
 EXPORT_SYMBOL(remap_pfn_range);
@@ -1443,7 +1447,7 @@ int apply_to_page_range(struct mm_struct
 {
pgd_t *pgd;
unsigned long next;
-   unsigned long end = addr + size;
+   unsigned long start = addr, end = addr + size;
int err;
 
BUG_ON(addr >= end);
@@ -1454,6 +1458,7 @@ int apply_to_page_range(struct mm_struct
if (err)
break;
} while (pgd++, addr = next, addr != end);
+   mmu_notifier(invalidate_range, mm, start, end, 0);
return err;
 }
 EXPORT_SYMBOL_GPL(apply_to_page_range);
@@ -1669,6 +1674,8 @@ gotten:
page_cache_release(old_page);
 unlock:
pte_unmap_unlock(page_table, ptl);
+   mmu_notifier(invalidate_range, mm, address,
+   address + PAGE_SIZE - 1, 0);
if (dirty_page) {
if (vma->vm_file)
file_update_time(vma->vm_file);
Index: linux-2.6/mm/mmap.c
===
--- linux-2.6.orig/mm/mmap.c2008-01-29 16:56:36.0 -0800
+++ linux-2.6/mm/mmap.c 2008-01-29 16:58:15.0 -0800
@@ -1748,6 +1748,7 @@ static void unmap_region(struct mm_struc
free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
 next? next->vm_start: 0);
tlb_finish_mmu(tlb, start, end);
+   mmu_notifier(invalidate_range, mm, start, end, 0);
 }
 
 /*
Index: linux-2.6/mm/hugetlb.c
===
--- linux-2.6.orig/mm/hugetlb.c 2008-01-29 16:56:33.0 -0800
+++ linux-2.6/mm/hugetlb.c  2008-01-29 16:58:15.0 -0800
@@ -14,6 +14,7 @@
 #include <linux/mempolicy.h>
 #include <linux/cpuset.h>
 #include <linux/mutex.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/page.h>
 #include <asm/pgtable.h>
@@ -763,6 +764,7 @@ void __unmap_hugepage_range(struct vm_ar
}
spin_unlock(&mm->page_table_lock);
flush_tlb_range(vma, start, end);
+   mmu_notifier(invalidate_range, mm, start, end, 1);
list_for_each_entry_safe(page, tmp, &page_list, 

[kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-01-28 Thread Christoph Lameter
The invalidation of address ranges in a mm_struct needs to be
performed when pages are removed or permissions etc change.
Most of the VM address space changes can use the range invalidate
callback.

invalidate_range() is generally called with mmap_sem held but
no spinlocks are active. If invalidate_range() is called with
locks held then we pass a flag into invalidate_range()

Comments state that mmap_sem must be held for
remap_pfn_range() but various drivers do not seem to do this.

Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]
Signed-off-by: Robin Holt [EMAIL PROTECTED]
Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 mm/fremap.c  |2 ++
 mm/hugetlb.c |2 ++
 mm/memory.c  |   11 +--
 mm/mmap.c|1 +
 4 files changed, 14 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/fremap.c
===
--- linux-2.6.orig/mm/fremap.c  2008-01-25 19:31:05.0 -0800
+++ linux-2.6/mm/fremap.c   2008-01-25 19:32:49.0 -0800
@@ -15,6 +15,7 @@
 #include <linux/rmap.h>
 #include <linux/module.h>
 #include <linux/syscalls.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/mmu_context.h>
 #include <asm/cacheflush.h>
@@ -211,6 +212,7 @@ asmlinkage long sys_remap_file_pages(uns
spin_unlock(&mapping->i_mmap_lock);
}
 
+   mmu_notifier(invalidate_range, mm, start, start + size, 0);
err = populate_range(mm, vma, start, size, pgoff);
if (!err && !(flags & MAP_NONBLOCK)) {
if (unlikely(has_write_lock)) {
Index: linux-2.6/mm/memory.c
===
--- linux-2.6.orig/mm/memory.c  2008-01-25 19:31:05.0 -0800
+++ linux-2.6/mm/memory.c   2008-01-25 19:32:49.0 -0800
@@ -50,6 +50,7 @@
 #include <linux/delayacct.h>
 #include <linux/init.h>
 #include <linux/writeback.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
@@ -891,6 +892,8 @@ unsigned long zap_page_range(struct vm_a
end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
if (tlb)
tlb_finish_mmu(tlb, address, end);
+   mmu_notifier(invalidate_range, mm, address, end,
+   (details ? (details->i_mmap_lock != NULL) : 0));
return end;
 }
 
@@ -1319,7 +1322,7 @@ int remap_pfn_range(struct vm_area_struc
 {
pgd_t *pgd;
unsigned long next;
-   unsigned long end = addr + PAGE_ALIGN(size);
+   unsigned long start = addr, end = addr + PAGE_ALIGN(size);
struct mm_struct *mm = vma->vm_mm;
int err;
 
@@ -1360,6 +1363,7 @@ int remap_pfn_range(struct vm_area_struc
if (err)
break;
} while (pgd++, addr = next, addr != end);
+   mmu_notifier(invalidate_range, mm, start, end, 0);
return err;
 }
 EXPORT_SYMBOL(remap_pfn_range);
@@ -1443,7 +1447,7 @@ int apply_to_page_range(struct mm_struct
 {
pgd_t *pgd;
unsigned long next;
-   unsigned long end = addr + size;
+   unsigned long start = addr, end = addr + size;
int err;
 
BUG_ON(addr >= end);
@@ -1454,6 +1458,7 @@ int apply_to_page_range(struct mm_struct
if (err)
break;
} while (pgd++, addr = next, addr != end);
+   mmu_notifier(invalidate_range, mm, start, end, 0);
return err;
 }
 EXPORT_SYMBOL_GPL(apply_to_page_range);
@@ -1634,6 +1639,8 @@ gotten:
/*
 * Re-check the pte - we dropped the lock
 */
+   mmu_notifier(invalidate_range, mm, address,
+   address + PAGE_SIZE - 1, 0);
page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
if (likely(pte_same(*page_table, orig_pte))) {
if (old_page) {
Index: linux-2.6/mm/mmap.c
===
--- linux-2.6.orig/mm/mmap.c2008-01-25 19:31:05.0 -0800
+++ linux-2.6/mm/mmap.c 2008-01-25 19:32:49.0 -0800
@@ -1748,6 +1748,7 @@ static void unmap_region(struct mm_struc
free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
 next? next->vm_start: 0);
tlb_finish_mmu(tlb, start, end);
+   mmu_notifier(invalidate_range, mm, start, end, 0);
 }
 
 /*
Index: linux-2.6/mm/hugetlb.c
===
--- linux-2.6.orig/mm/hugetlb.c 2008-01-25 19:33:58.0 -0800
+++ linux-2.6/mm/hugetlb.c  2008-01-25 19:34:13.0 -0800
@@ -14,6 +14,7 @@
 #include <linux/mempolicy.h>
 #include <linux/cpuset.h>
 #include <linux/mutex.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/page.h>
 #include <asm/pgtable.h>
@@ -763,6 +764,7 @@ void __unmap_hugepage_range(struct vm_ar
}
spin_unlock(&mm->page_table_lock);
flush_tlb_range(vma, start, end);
+   mmu_notifier(invalidate_range, mm, start, end, 1);
list_for_each_entry_safe(page, tmp,