[kvm-devel] [PATCH] mmu notifiers #v8

2008-03-02 Thread Andrea Arcangeli
Difference between #v7 and #v8:

1) s/age_page/clear_flush_young/ (Nick's suggestion)
2) macro fix (Andrew)
3) move release before final unmap_vmas (for GRU, Jack/Christoph)
4) microoptimize mmu_notifier_unregister (Christoph)
5) use mmap_sem for registration serialization (Christoph)

The (void)xxx in macros doesn't work with "args". Christoph's solution
look best in avoiding warnings, even if it forces to make the mmu
notifier operation structure visible even if MMU_NOTIFIER=n (that's
the only downside).

I didn't drop invalidate_page, because invalidate_range_begin/end
would be slower for usages like KVM/GRU (we don't need a begin/end
there because where invalidate_page is called, the VM holds a
reference on the page). do_wp_page should also use invalidate_page
since it can free the page after dropping the PT lock without losing
any performance (that's not true for the places where invalidate_range
is called).

It'd be nice if everyone involved can agree to converge on this API
for .25. KVM/GRU (and perhaps Quadrics) and similar usages will be
fully covered in .25. This is a kernel internal API so there's no
problem if all the methods will become sleep capable only starting
only in .26. The brainer part of the VM work to do to make it sleep
capable is pretty much orthogonal with this patch.

Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]>
Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -10,6 +10,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -228,6 +229,8 @@ struct mm_struct {
 #ifdef CONFIG_CGROUP_MEM_CONT
struct mem_cgroup *mem_cgroup;
 #endif
+
+   struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
 };
 
 #endif /* _LINUX_MM_TYPES_H */
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
new file mode 100644
--- /dev/null
+++ b/include/linux/mmu_notifier.h
@@ -0,0 +1,161 @@
+#ifndef _LINUX_MMU_NOTIFIER_H
+#define _LINUX_MMU_NOTIFIER_H
+
+#include 
+#include 
+
+struct mmu_notifier;
+
+struct mmu_notifier_ops {
+   /*
+* Called when nobody can register any more notifier in the mm
+* and after the "mn" notifier has been disarmed already.
+*/
+   void (*release)(struct mmu_notifier *mn,
+   struct mm_struct *mm);
+
+   /*
+* clear_flush_young is called after the VM is
+* test-and-clearing the young/accessed bitflag in the
+* pte. This way the VM will provide proper aging to the
+* accesses to the page through the secondary MMUs and not
+* only to the ones through the Linux pte.
+*/
+   int (*clear_flush_young)(struct mmu_notifier *mn,
+struct mm_struct *mm,
+unsigned long address);
+
+   /*
+* Before this is invoked any secondary MMU is still ok to
+* read/write to the page previously pointed by the Linux pte
+* because the old page hasn't been freed yet.  If required
+* set_page_dirty has to be called internally to this method.
+*/
+   void (*invalidate_page)(struct mmu_notifier *mn,
+   struct mm_struct *mm,
+   unsigned long address);
+
+   /*
+* invalidate_range_begin() and invalidate_range_end() must be
+* paired. Multiple invalidate_range_begin/ends may be nested
+* or called concurrently.
+*/
+   void (*invalidate_range_begin)(struct mmu_notifier *mn,
+  struct mm_struct *mm,
+  unsigned long start, unsigned long end); 
  
+   void (*invalidate_range_end)(struct mmu_notifier *mn,
+struct mm_struct *mm,
+unsigned long start, unsigned long end);
+};
+
+struct mmu_notifier {
+   struct hlist_node hlist;
+   const struct mmu_notifier_ops *ops;
+};
+
+#ifdef CONFIG_MMU_NOTIFIER
+
+struct mmu_notifier_head {
+   struct hlist_head head;
+};
+
+#include 
+
+/*
+ * Must hold the mmap_sem for write.
+ *
+ * RCU is used to traverse the list. A quiescent period needs to pass
+ * before the notifier is guaranteed to be visible to all threads.
+ */
+extern void mmu_notifier_register(struct mmu_notifier *mn,
+ struct mm_struct *mm);
+/*
+ * Must hold the mmap_sem for write.
+ *
+ * RCU is used to traverse the list. A quiescent period needs to pass
+ * before the "struct mmu_notifier" can be freed. Alternatively it
+ * can be synchronously freed inside ->release when the list can't
+ * change anymore and nobody could possibly walk it.
+ */
+extern void mmu_notifier_unregister(struct mmu_notifier *mn,
+   struct mm_struct *mm);
+extern void mmu_notifier_release(struct mm

Re: [kvm-devel] [PATCH] mmu notifiers #v8

2008-03-02 Thread Nick Piggin
On Sun, Mar 02, 2008 at 04:54:57PM +0100, Andrea Arcangeli wrote:
> Difference between #v7 and #v8:
> 
> 1) s/age_page/clear_flush_young/ (Nick's suggestion)
> 2) macro fix (Andrew)
> 3) move release before final unmap_vmas (for GRU, Jack/Christoph)
> 4) microoptimize mmu_notifier_unregister (Christoph)
> 5) use mmap_sem for registration serialization (Christoph)
> 
> The (void)xxx in macros doesn't work with "args". Christoph's solution
> look best in avoiding warnings, even if it forces to make the mmu
> notifier operation structure visible even if MMU_NOTIFIER=n (that's
> the only downside).

I have a couple of "cleanup" patches that change the structure of this
to something I prefer. Others may not, but I'll post them for debate
anyway.

 
> I didn't drop invalidate_page, because invalidate_range_begin/end
> would be slower for usages like KVM/GRU (we don't need a begin/end
> there because where invalidate_page is called, the VM holds a
> reference on the page). do_wp_page should also use invalidate_page
> since it can free the page after dropping the PT lock without losing
> any performance (that's not true for the places where invalidate_range
> is called).

I'm still not completely happy with this. I had a very quick look
at the GRU driver, but I don't see why it can't be implemented
more like the regular TLB model, and have TLB insertions depend on
the linux pte, and do invalidates _after_ restricting permissions
to the pte.

Ie. I'd still like to get rid of invalidate_range_begin, and get
rid of invalidate calls from places where permissions are relaxed.


> It'd be nice if everyone involved can agree to converge on this API
> for .25. KVM/GRU (and perhaps Quadrics) and similar usages will be
> fully covered in .25.

If we can agree on the API, then I don't see any reason why it can't
go into 2.6.25, unless someome wants more time to review it (but
2.6.25 release should be quite far away still so there should be quite
a bit of time).

-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] mmu notifiers #v8

2008-03-02 Thread Nick Piggin
On Sun, Mar 02, 2008 at 04:54:57PM +0100, Andrea Arcangeli wrote:
> Difference between #v7 and #v8:

[patch] mmu-v8: demacro


Remove the macros from mmu_notifier.h, in favour of functions.

This requires untangling the include order circular dependencies as well,
so just remove struct mmu_notifier_head in favour of just using the hlist
in mm_struct.

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>
---
Index: linux-2.6/include/linux/mmu_notifier.h
===
--- linux-2.6.orig/include/linux/mmu_notifier.h
+++ linux-2.6/include/linux/mmu_notifier.h
@@ -55,12 +55,13 @@ struct mmu_notifier {
 
 #ifdef CONFIG_MMU_NOTIFIER
 
-struct mmu_notifier_head {
-   struct hlist_head head;
-};
-
 #include 
 
+static inline int mm_has_notifiers(struct mm_struct *mm)
+{
+   return unlikely(!hlist_empty(&mm->mmu_notifier_list));
+}
+
 /*
  * Must hold the mmap_sem for write.
  *
@@ -79,33 +80,59 @@ extern void mmu_notifier_register(struct
  */
 extern void mmu_notifier_unregister(struct mmu_notifier *mn,
struct mm_struct *mm);
-extern void mmu_notifier_release(struct mm_struct *mm);
-extern int mmu_notifier_clear_flush_young(struct mm_struct *mm,
+
+extern void __mmu_notifier_release(struct mm_struct *mm);
+extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
  unsigned long address);
+extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
+ unsigned long address);
+extern void __mmu_notifier_invalidate_range_begin(struct mm_struct *mm,
+ unsigned long start, unsigned long end);
+extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+ unsigned long start, unsigned long end);
+
+
+static inline void mmu_notifier_release(struct mm_struct *mm)
+{
+   if (mm_has_notifiers(mm))
+   __mmu_notifier_release(mm);
+}
+
+static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm,
+ unsigned long address)
+{
+   if (mm_has_notifiers(mm))
+   return __mmu_notifier_clear_flush_young(mm, address);
+   return 0;
+}
+
+static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
+ unsigned long address)
+{
+   if (mm_has_notifiers(mm))
+   __mmu_notifier_invalidate_page(mm, address);
+}
+
+static inline void mmu_notifier_invalidate_range_begin(struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+   if (mm_has_notifiers(mm))
+   __mmu_notifier_invalidate_range_begin(mm, start, end);
+}
+
+static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+   if (mm_has_notifiers(mm))
+   __mmu_notifier_invalidate_range_end(mm, start, end);
+}
 
-static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh)
+static inline void mmu_notifier_mm_init(struct mm_struct *mm)
 {
-   INIT_HLIST_HEAD(&mnh->head);
+   INIT_HLIST_HEAD(&mm->mmu_notifier_list);
 }
 
-#define mmu_notifier(function, mm, args...)\
-   do {\
-   struct mmu_notifier *__mn;  \
-   struct hlist_node *__n; \
-   struct mm_struct * __mm = mm;   \
-   \
-   if (unlikely(!hlist_empty(&__mm->mmu_notifier.head))) { \
-   rcu_read_lock();\
-   hlist_for_each_entry_rcu(__mn, __n, \
-&__mm->mmu_notifier.head, \
-hlist) \
-   if (__mn->ops->function)\
-   __mn->ops->function(__mn,   \
-   __mm,   \
-   args);  \
-   rcu_read_unlock();  \
-   }   \
-   } while (0)
+
 
 #define ptep_clear_flush_notify(__vma, __address, __ptep)  \
 ({ \
@@ -113,7 +140,7 @@ static inline void mmu_notifier_head_ini
struct vm_area_struct * ___vma = __vma; \
unsigned long ___address = __address;   \
__pte = ptep_clear_flush(___vma, ___address, __ptep);   \

Re: [kvm-devel] [PATCH] mmu notifiers #v8

2008-03-02 Thread Nick Piggin
On Sun, Mar 02, 2008 at 04:54:57PM +0100, Andrea Arcangeli wrote:
> Difference between #v7 and #v8:

This one on top of the previous patch

[patch] mmu-v8: typesafe

Move definition of struct mmu_notifier and struct mmu_notifier_ops under
CONFIG_MMU_NOTIFIER to ensure they doesn't get dereferenced when they
don't make sense.

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>
---
Index: linux-2.6/include/linux/mmu_notifier.h
===
--- linux-2.6.orig/include/linux/mmu_notifier.h
+++ linux-2.6/include/linux/mmu_notifier.h
@@ -3,8 +3,12 @@
 
 #include 
 #include 
+#include 
 
 struct mmu_notifier;
+struct mmu_notifier_ops;
+
+#ifdef CONFIG_MMU_NOTIFIER
 
 struct mmu_notifier_ops {
/*
@@ -53,10 +57,6 @@ struct mmu_notifier {
const struct mmu_notifier_ops *ops;
 };
 
-#ifdef CONFIG_MMU_NOTIFIER
-
-#include 
-
 static inline int mm_has_notifiers(struct mm_struct *mm)
 {
return unlikely(!hlist_empty(&mm->mmu_notifier_list));

-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] mmu notifiers #v8

2008-03-02 Thread Nick Piggin
On Sun, Mar 02, 2008 at 04:54:57PM +0100, Andrea Arcangeli wrote:
> Difference between #v7 and #v8:

Here is just a couple of checkpatch fixes on top of the last patches.

Index: linux-2.6/include/linux/mmu_notifier.h
===
--- linux-2.6.orig/include/linux/mmu_notifier.h
+++ linux-2.6/include/linux/mmu_notifier.h
@@ -46,7 +46,7 @@ struct mmu_notifier_ops {
 */
void (*invalidate_range_begin)(struct mmu_notifier *mn,
   struct mm_struct *mm,
-  unsigned long start, unsigned long end); 
  
+  unsigned long start, unsigned long end);
void (*invalidate_range_end)(struct mmu_notifier *mn,
 struct mm_struct *mm,
 unsigned long start, unsigned long end);
@@ -137,7 +137,7 @@ static inline void mmu_notifier_mm_init(
 #define ptep_clear_flush_notify(__vma, __address, __ptep)  \
 ({ \
pte_t __pte;\
-   struct vm_area_struct * ___vma = __vma; \
+   struct vm_area_struct *___vma = __vma;  \
unsigned long ___address = __address;   \
__pte = ptep_clear_flush(___vma, ___address, __ptep);   \
mmu_notifier_invalidate_page(___vma->vm_mm, ___address);\
@@ -147,7 +147,7 @@ static inline void mmu_notifier_mm_init(
 #define ptep_clear_flush_young_notify(__vma, __address, __ptep)
\
 ({ \
int __young;\
-   struct vm_area_struct * ___vma = __vma; \
+   struct vm_area_struct *___vma = __vma;  \
unsigned long ___address = __address;   \
__young = ptep_clear_flush_young(___vma, ___address, __ptep);   \
__young |= mmu_notifier_clear_flush_young(___vma->vm_mm,\

-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] mmu notifiers #v8

2008-03-03 Thread Andrea Arcangeli
On Mon, Mar 03, 2008 at 04:29:34AM +0100, Nick Piggin wrote:
> to something I prefer. Others may not, but I'll post them for debate
> anyway.

Sure, thanks!

> > I didn't drop invalidate_page, because invalidate_range_begin/end
> > would be slower for usages like KVM/GRU (we don't need a begin/end
> > there because where invalidate_page is called, the VM holds a
> > reference on the page). do_wp_page should also use invalidate_page
> > since it can free the page after dropping the PT lock without losing
> > any performance (that's not true for the places where invalidate_range
> > is called).
> 
> I'm still not completely happy with this. I had a very quick look
> at the GRU driver, but I don't see why it can't be implemented
> more like the regular TLB model, and have TLB insertions depend on
> the linux pte, and do invalidates _after_ restricting permissions
> to the pte.
> 
> Ie. I'd still like to get rid of invalidate_range_begin, and get
> rid of invalidate calls from places where permissions are relaxed.

_begin exists because by the time _end is called, the VM already
dropped the reference on the page. This way we can do a single
invalidate no matter how large the range is. I don't see ways to
remove _begin while still invoking _end a single time for the whole
range.

> If we can agree on the API, then I don't see any reason why it can't
> go into 2.6.25, unless someome wants more time to review it (but
> 2.6.25 release should be quite far away still so there should be quite
> a bit of time).

Cool! ;)

-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] mmu notifiers #v8

2008-03-03 Thread Nick Piggin
On Mon, Mar 03, 2008 at 01:51:53PM +0100, Andrea Arcangeli wrote:
> On Mon, Mar 03, 2008 at 04:29:34AM +0100, Nick Piggin wrote:
> > to something I prefer. Others may not, but I'll post them for debate
> > anyway.
> 
> Sure, thanks!
> 
> > > I didn't drop invalidate_page, because invalidate_range_begin/end
> > > would be slower for usages like KVM/GRU (we don't need a begin/end
> > > there because where invalidate_page is called, the VM holds a
> > > reference on the page). do_wp_page should also use invalidate_page
> > > since it can free the page after dropping the PT lock without losing
> > > any performance (that's not true for the places where invalidate_range
> > > is called).
> > 
> > I'm still not completely happy with this. I had a very quick look
> > at the GRU driver, but I don't see why it can't be implemented
> > more like the regular TLB model, and have TLB insertions depend on
> > the linux pte, and do invalidates _after_ restricting permissions
> > to the pte.
> > 
> > Ie. I'd still like to get rid of invalidate_range_begin, and get
> > rid of invalidate calls from places where permissions are relaxed.
> 
> _begin exists because by the time _end is called, the VM already
> dropped the reference on the page. This way we can do a single
> invalidate no matter how large the range is. I don't see ways to
> remove _begin while still invoking _end a single time for the whole
> range.

Is this just a GRU problem? Can't we just require them to take a ref
on the page (IIRC Jack said GRU could be changed to more like a TLB
model).

-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] mmu notifiers #v8

2008-03-03 Thread Andrea Arcangeli
On Mon, Mar 03, 2008 at 02:10:17PM +0100, Nick Piggin wrote:
> Is this just a GRU problem? Can't we just require them to take a ref
> on the page (IIRC Jack said GRU could be changed to more like a TLB
> model).

Yes, it's just a GRU problem, it tries to optimize performance by
calling follow_page only in the fast path, and fallbacks to
get_user_pages; put_page in the slow path. xpmem could also send the
message in _begin and wait the message in _end, to reduce the wait
time. But if you forge GRU to call get_user_pages only (like KVM
does), the _begin can be removed. In theory we could also optimize KVM
to use follow_page only if the pte is already established. I'm not
sure how much that is a worthwhile optimization though.

However note that Quadrics also had a callback before and one after,
so they may be using the callback before for similar
optimizations. But functionality-wise _end is the only required bit if
everyone takes refcounts like KVM and XPMEM do.

-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] mmu notifiers #v8

2008-03-03 Thread Jack Steiner
On Mon, Mar 03, 2008 at 02:10:17PM +0100, Nick Piggin wrote:
> On Mon, Mar 03, 2008 at 01:51:53PM +0100, Andrea Arcangeli wrote:
> > On Mon, Mar 03, 2008 at 04:29:34AM +0100, Nick Piggin wrote:
> > > to something I prefer. Others may not, but I'll post them for debate
> > > anyway.
> > 
> > Sure, thanks!
> > 
> > > > I didn't drop invalidate_page, because invalidate_range_begin/end
> > > > would be slower for usages like KVM/GRU (we don't need a begin/end
> > > > there because where invalidate_page is called, the VM holds a
> > > > reference on the page). do_wp_page should also use invalidate_page
> > > > since it can free the page after dropping the PT lock without losing
> > > > any performance (that's not true for the places where invalidate_range
> > > > is called).
> > > 
> > > I'm still not completely happy with this. I had a very quick look
> > > at the GRU driver, but I don't see why it can't be implemented
> > > more like the regular TLB model, and have TLB insertions depend on
> > > the linux pte, and do invalidates _after_ restricting permissions
> > > to the pte.
> > > 
> > > Ie. I'd still like to get rid of invalidate_range_begin, and get
> > > rid of invalidate calls from places where permissions are relaxed.
> > 
> > _begin exists because by the time _end is called, the VM already
> > dropped the reference on the page. This way we can do a single
> > invalidate no matter how large the range is. I don't see ways to
> > remove _begin while still invoking _end a single time for the whole
> > range.
> 
> Is this just a GRU problem? Can't we just require them to take a ref
> on the page (IIRC Jack said GRU could be changed to more like a TLB
> model).

Maintaining a long-term reference on a page is a problem. The GRU does not
currently maintain tables to track the pages for which dropins have been done.

The GRU has a large internal TLB and is designed to reference up to 8PB of
memory. The size of the tables to track this many referenced pages would be
a problem (at best).

-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] mmu notifiers #v8

2008-03-03 Thread Nick Piggin
On Mon, Mar 03, 2008 at 09:18:59AM -0600, Jack Steiner wrote:
> On Mon, Mar 03, 2008 at 02:10:17PM +0100, Nick Piggin wrote:
> > On Mon, Mar 03, 2008 at 01:51:53PM +0100, Andrea Arcangeli wrote:
> > > On Mon, Mar 03, 2008 at 04:29:34AM +0100, Nick Piggin wrote:
> > > > to something I prefer. Others may not, but I'll post them for debate
> > > > anyway.
> > > 
> > > Sure, thanks!
> > > 
> > > > > I didn't drop invalidate_page, because invalidate_range_begin/end
> > > > > would be slower for usages like KVM/GRU (we don't need a begin/end
> > > > > there because where invalidate_page is called, the VM holds a
> > > > > reference on the page). do_wp_page should also use invalidate_page
> > > > > since it can free the page after dropping the PT lock without losing
> > > > > any performance (that's not true for the places where invalidate_range
> > > > > is called).
> > > > 
> > > > I'm still not completely happy with this. I had a very quick look
> > > > at the GRU driver, but I don't see why it can't be implemented
> > > > more like the regular TLB model, and have TLB insertions depend on
> > > > the linux pte, and do invalidates _after_ restricting permissions
> > > > to the pte.
> > > > 
> > > > Ie. I'd still like to get rid of invalidate_range_begin, and get
> > > > rid of invalidate calls from places where permissions are relaxed.
> > > 
> > > _begin exists because by the time _end is called, the VM already
> > > dropped the reference on the page. This way we can do a single
> > > invalidate no matter how large the range is. I don't see ways to
> > > remove _begin while still invoking _end a single time for the whole
> > > range.
> > 
> > Is this just a GRU problem? Can't we just require them to take a ref
> > on the page (IIRC Jack said GRU could be changed to more like a TLB
> > model).
> 
> Maintaining a long-term reference on a page is a problem. The GRU does not
> currently maintain tables to track the pages for which dropins have been done.
> 
> The GRU has a large internal TLB and is designed to reference up to 8PB of
> memory. The size of the tables to track this many referenced pages would be
> a problem (at best).

Is it any worse a problem than the pagetables of the processes which have
their virtual memory exported to GRU? AFAIKS, no; it is on the same
magnitude of difficulty. So you could do it without introducing any
fundamental problem (memory usage might be increased by some constant
factor, but I think we can cope with that in order to make the core patch
really nice and simple).

It is going to be really easy to add more weird and wonderful notifiers
later that deviate from our standard TLB model. It would be much harder to
remove them. So I really want to see everyone conform to this model first.
Numbers and comparisons can be brought out afterwards if people want to
attempt to make such changes.

-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] mmu notifiers #v8

2008-03-03 Thread Jack Steiner
On Mon, Mar 03, 2008 at 05:59:10PM +0100, Nick Piggin wrote:
> On Mon, Mar 03, 2008 at 09:18:59AM -0600, Jack Steiner wrote:
> > On Mon, Mar 03, 2008 at 02:10:17PM +0100, Nick Piggin wrote:
> > > On Mon, Mar 03, 2008 at 01:51:53PM +0100, Andrea Arcangeli wrote:
> > > > On Mon, Mar 03, 2008 at 04:29:34AM +0100, Nick Piggin wrote:
> > > > > to something I prefer. Others may not, but I'll post them for debate
> > > > > anyway.
> > > > 
> > > > Sure, thanks!
> > > > 
> > > > > > I didn't drop invalidate_page, because invalidate_range_begin/end
> > > > > > would be slower for usages like KVM/GRU (we don't need a begin/end
> > > > > > there because where invalidate_page is called, the VM holds a
> > > > > > reference on the page). do_wp_page should also use invalidate_page
> > > > > > since it can free the page after dropping the PT lock without losing
> > > > > > any performance (that's not true for the places where 
> > > > > > invalidate_range
> > > > > > is called).
> > > > > 
> > > > > I'm still not completely happy with this. I had a very quick look
> > > > > at the GRU driver, but I don't see why it can't be implemented
> > > > > more like the regular TLB model, and have TLB insertions depend on
> > > > > the linux pte, and do invalidates _after_ restricting permissions
> > > > > to the pte.
> > > > > 
> > > > > Ie. I'd still like to get rid of invalidate_range_begin, and get
> > > > > rid of invalidate calls from places where permissions are relaxed.
> > > > 
> > > > _begin exists because by the time _end is called, the VM already
> > > > dropped the reference on the page. This way we can do a single
> > > > invalidate no matter how large the range is. I don't see ways to
> > > > remove _begin while still invoking _end a single time for the whole
> > > > range.

The range invalidates have a performance advantage for the GRU. TLB invalidates
on the GRU are relatively slow (usec) and interfere somewhat with the 
performance
of other active GRU instructions. Invalidating a large chunk of addresses with
a single GRU TLBINVAL operation is must faster than issuing a stream of single
page TLBINVALs.

I expect this performance advantage will also apply to other users of mmuops.

> > > 
> > > Is this just a GRU problem? Can't we just require them to take a ref
> > > on the page (IIRC Jack said GRU could be changed to more like a TLB
> > > model).
> > 
> > Maintaining a long-term reference on a page is a problem. The GRU does not
> > currently maintain tables to track the pages for which dropins have been 
> > done.
> > 
> > The GRU has a large internal TLB and is designed to reference up to 8PB of
> > memory. The size of the tables to track this many referenced pages would be
> > a problem (at best).
> 
> Is it any worse a problem than the pagetables of the processes which have
> their virtual memory exported to GRU? AFAIKS, no; it is on the same
> magnitude of difficulty. So you could do it without introducing any
> fundamental problem (memory usage might be increased by some constant
> factor, but I think we can cope with that in order to make the core patch
> really nice and simple).

Functionally, the GRU is very close to what I would consider to be the
"standard TLB" model. Dropins and flushs map closely to processor dropins
and flushes for cpus.  The internal structure of the GRU TLB is identical to
the TLB of existing cpus.  Requiring the GRU driver to track dropins with
long term page references seems to me a deviation from having the basic
mmuops support a "standard TLB" model. AFAIK, no other processor requires
this.

Tracking TLB dropins (and long term page references) could be done but it
adds significant complexity and scaling issues. The size of the tables to
track many TB (to PB) of memory can get large. If the memory is being
referenced by highly threaded applications, then the problem becomes even
more complex. Either tables must be replicated per-thread (and require even
more memory), or the table structure becomes even more complex to deal with
node locality, cacheline bouncing, etc.

Try to avoid a requirement to track dropins with long term page references.


> It is going to be really easy to add more weird and wonderful notifiers
> later that deviate from our standard TLB model. It would be much harder to
> remove them. So I really want to see everyone conform to this model first.

Agree.

-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] mmu notifiers #v8

2008-03-03 Thread Avi Kivity
Jack Steiner wrote:
> The range invalidates have a performance advantage for the GRU. TLB 
> invalidates
> on the GRU are relatively slow (usec) and interfere somewhat with the 
> performance
> of other active GRU instructions. Invalidating a large chunk of addresses with
> a single GRU TLBINVAL operation is must faster than issuing a stream of single
> page TLBINVALs.
>
> I expect this performance advantage will also apply to other users of mmuops.
>   

In theory this would apply to kvm as well (coalesce tlb flush IPIs, 
lookup shadow page table once), but is it really a fast path?  What 
triggers range operations for your use cases?

-- 
error compiling committee.c: too many arguments to function


-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] mmu notifiers #v8

2008-03-03 Thread Jack Steiner
On Mon, Mar 03, 2008 at 08:09:49PM +0200, Avi Kivity wrote:
> Jack Steiner wrote:
> >The range invalidates have a performance advantage for the GRU. TLB 
> >invalidates
> >on the GRU are relatively slow (usec) and interfere somewhat with the 
> >performance
> >of other active GRU instructions. Invalidating a large chunk of addresses 
> >with
> >a single GRU TLBINVAL operation is must faster than issuing a stream of 
> >single
> >page TLBINVALs.
> >
> >I expect this performance advantage will also apply to other users of 
> >mmuops.
> >  
> 
> In theory this would apply to kvm as well (coalesce tlb flush IPIs, 
> lookup shadow page table once), but is it really a fast path?  What 
> triggers range operations for your use cases?
 

Although not frequent, an unmap of a multiple TB object could be quite painful
if each page was invalidated individually instead of 1 invalidate for the 
entire range.
This is even worse if the application is threaded and the object has been 
reference by
many GRUs (there are 16 GRU ports per node - each potentially has to be 
invalidated).

Forks (again, not frequent) would be another case.




-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] mmu notifiers #v8

2008-03-03 Thread Nick Piggin
On Mon, Mar 03, 2008 at 12:06:05PM -0600, Jack Steiner wrote:
> On Mon, Mar 03, 2008 at 05:59:10PM +0100, Nick Piggin wrote:
> > > Maintaining a long-term reference on a page is a problem. The GRU does not
> > > currently maintain tables to track the pages for which dropins have been 
> > > done.
> > > 
> > > The GRU has a large internal TLB and is designed to reference up to 8PB of
> > > memory. The size of the tables to track this many referenced pages would 
> > > be
> > > a problem (at best).
> > 
> > Is it any worse a problem than the pagetables of the processes which have
> > their virtual memory exported to GRU? AFAIKS, no; it is on the same
> > magnitude of difficulty. So you could do it without introducing any
> > fundamental problem (memory usage might be increased by some constant
> > factor, but I think we can cope with that in order to make the core patch
> > really nice and simple).
> 
> Functionally, the GRU is very close to what I would consider to be the
> "standard TLB" model. Dropins and flushs map closely to processor dropins
> and flushes for cpus.  The internal structure of the GRU TLB is identical to
> the TLB of existing cpus.  Requiring the GRU driver to track dropins with
> long term page references seems to me a deviation from having the basic
> mmuops support a "standard TLB" model. AFAIK, no other processor requires
> this.

That is because the CPU TLBs have the mmu_gather batching APIs which
avoid the problem. It would be possible to do something similar for
GRU which would involve taking a reference for each page-to-be-invalidated
in invalidate_page, and release them when you invalidate_range. Or else
do some other scheme which makes mmu notifiers work similarly to the
mmu gather API. But not just go an invent something completely different
in the form of this invalidate_begin,clear linux pte,invalidate_end API.


> Tracking TLB dropins (and long term page references) could be done but it
> adds significant complexity and scaling issues. The size of the tables to
> track many TB (to PB) of memory can get large. If the memory is being
> referenced by highly threaded applications, then the problem becomes even
> more complex. Either tables must be replicated per-thread (and require even
> more memory), or the table structure becomes even more complex to deal with
> node locality, cacheline bouncing, etc.

I don't think it would be that significant in terms of complexity or
scaling.

For a quick solution, you could stick a radix tree in each of your mmu
notifiers registered (ie. one per mm), which is indexed on virtual address
>> PAGE_SHIFT, and returns the struct page *. Size is no different than
page tables, and locking is pretty scalable.

After that, I would really like to see whether the numbers justify
larger changes.


-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] mmu notifiers #v8

2008-03-03 Thread Christoph Lameter
On Mon, 3 Mar 2008, Nick Piggin wrote:

> I'm still not completely happy with this. I had a very quick look
> at the GRU driver, but I don't see why it can't be implemented
> more like the regular TLB model, and have TLB insertions depend on
> the linux pte, and do invalidates _after_ restricting permissions
> to the pte.
> 
> Ie. I'd still like to get rid of invalidate_range_begin, and get
> rid of invalidate calls from places where permissions are relaxed.

Isnt this more a job for paravirt ops if it is so tightly bound to page 
tables? Are we not adding another similar API?

> If we can agree on the API, then I don't see any reason why it can't
> go into 2.6.25, unless someome wants more time to review it (but
> 2.6.25 release should be quite far away still so there should be quite
> a bit of time).

API still has rcu issues and the example given for making things sleepable 
is only working for the aging callback. The most important callback is for 
try_to_unmao and page_mkclean. This means the API is still not generic 
enough and likely not extendable as needed in its present form.





-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] mmu notifiers #v8

2008-03-03 Thread Christoph Lameter
On Mon, 3 Mar 2008, Nick Piggin wrote:

> It is going to be really easy to add more weird and wonderful notifiers
> later that deviate from our standard TLB model. It would be much harder to
> remove them. So I really want to see everyone conform to this model first.
> Numbers and comparisons can be brought out afterwards if people want to
> attempt to make such changes.

Still do not see how that could be done. The model here is tightly bound 
to ptes. AFAICT this could be implemented in arch code like the paravirt 
ops.

 

-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] mmu notifiers #v8

2008-03-03 Thread Christoph Lameter
I like the patch.



-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] mmu notifiers #v8

2008-03-03 Thread Christoph Lameter
On Mon, 3 Mar 2008, Nick Piggin wrote:

> Move definition of struct mmu_notifier and struct mmu_notifier_ops under
> CONFIG_MMU_NOTIFIER to ensure they doesn't get dereferenced when they
> don't make sense.

The callbacks take a mmu_notifier parameter. So how does this compile for 
!MMU_NOTIFIER?


-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] mmu notifiers #v8

2008-03-03 Thread Jack Steiner
On Mon, Mar 03, 2008 at 07:45:17PM +0100, Nick Piggin wrote:
> On Mon, Mar 03, 2008 at 12:06:05PM -0600, Jack Steiner wrote:
> > On Mon, Mar 03, 2008 at 05:59:10PM +0100, Nick Piggin wrote:
> > > > Maintaining a long-term reference on a page is a problem. The GRU does 
> > > > not
> > > > currently maintain tables to track the pages for which dropins have 
> > > > been done.
> > > > 
> > > > The GRU has a large internal TLB and is designed to reference up to 8PB 
> > > > of
> > > > memory. The size of the tables to track this many referenced pages 
> > > > would be
> > > > a problem (at best).
> > > 
> > > Is it any worse a problem than the pagetables of the processes which have
> > > their virtual memory exported to GRU? AFAIKS, no; it is on the same
> > > magnitude of difficulty. So you could do it without introducing any
> > > fundamental problem (memory usage might be increased by some constant
> > > factor, but I think we can cope with that in order to make the core patch
> > > really nice and simple).
> > 
> > Functionally, the GRU is very close to what I would consider to be the
> > "standard TLB" model. Dropins and flushs map closely to processor dropins
> > and flushes for cpus.  The internal structure of the GRU TLB is identical to
> > the TLB of existing cpus.  Requiring the GRU driver to track dropins with
> > long term page references seems to me a deviation from having the basic
> > mmuops support a "standard TLB" model. AFAIK, no other processor requires
> > this.
> 
> That is because the CPU TLBs have the mmu_gather batching APIs which
> avoid the problem. It would be possible to do something similar for
> GRU which would involve taking a reference for each page-to-be-invalidated
> in invalidate_page, and release them when you invalidate_range. Or else
> do some other scheme which makes mmu notifiers work similarly to the
> mmu gather API. But not just go an invent something completely different
> in the form of this invalidate_begin,clear linux pte,invalidate_end API.

Correct. If the mmu_gather were passed on the mmuops callout and the callout 
were
done at the same point as the tlb_finish_mmu(), the GRU could
efficiently work w/o the range invalidates. A range invalidate might still
be slightly more efficient but not measureable so. The net difference is
not worth the extra complexity of range callouts.


> 
> 
> > Tracking TLB dropins (and long term page references) could be done but it
> > adds significant complexity and scaling issues. The size of the tables to
> > track many TB (to PB) of memory can get large. If the memory is being
> > referenced by highly threaded applications, then the problem becomes even
> > more complex. Either tables must be replicated per-thread (and require even
> > more memory), or the table structure becomes even more complex to deal with
> > node locality, cacheline bouncing, etc.
> 
> I don't think it would be that significant in terms of complexity or
> scaling.
> 
> For a quick solution, you could stick a radix tree in each of your mmu
> notifiers registered (ie. one per mm), which is indexed on virtual address
> >> PAGE_SHIFT, and returns the struct page *. Size is no different than
> page tables, and locking is pretty scalable.
> 
> After that, I would really like to see whether the numbers justify
> larger changes.

I'm still concerned about performance. Each dropin would first have to access
an additional data structure that would most likely be non-node-local and
non-cache-resident. The net effect would be measurable but not a killer.

I haven't thought about locking requirements for the radix tree. Most accesses
would be read-only & updates infrequent. Any chance of an RCU-based radix
implementation?  Otherwise, don't we add the potential for hot locks/cachelines
for threaded applications ???

-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] mmu notifiers #v8

2008-03-03 Thread Andrea Arcangeli
On Mon, Mar 03, 2008 at 11:01:22AM -0800, Christoph Lameter wrote:
> API still has rcu issues and the example given for making things sleepable 
> is only working for the aging callback. The most important callback is for 
> try_to_unmao and page_mkclean. This means the API is still not generic 
> enough and likely not extendable as needed in its present form.

I converted only one of those _notify as an example of how it should
be done, because I assumed you volunteer to convert the other ones
yourself during .26. It's useless to convert all of them right now,
because the i_mmap_lock and anon_vma locks are still going to be
spinlocks in .25.

-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] mmu notifiers #v8

2008-03-04 Thread Jack Steiner
On Tue, Mar 04, 2008 at 11:35:32AM +0100, Peter Zijlstra wrote:
> 
> On Mon, 2008-03-03 at 13:15 -0600, Jack Steiner wrote:
> 
> > I haven't thought about locking requirements for the radix tree. Most 
> > accesses
> > would be read-only & updates infrequent. Any chance of an RCU-based radix
> > implementation?  Otherwise, don't we add the potential for hot 
> > locks/cachelines
> > for threaded applications ???
> 
> The current radix tree implementation in the kernel is RCU capable. We
> just don't have many RCU users yet.

Ahhh. You are right. I thought I looked but obviously missed it.

-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] mmu notifiers #v8

2008-03-04 Thread Nick Piggin
On Mon, Mar 03, 2008 at 11:01:22AM -0800, Christoph Lameter wrote:
> On Mon, 3 Mar 2008, Nick Piggin wrote:
> 
> > I'm still not completely happy with this. I had a very quick look
> > at the GRU driver, but I don't see why it can't be implemented
> > more like the regular TLB model, and have TLB insertions depend on
> > the linux pte, and do invalidates _after_ restricting permissions
> > to the pte.
> > 
> > Ie. I'd still like to get rid of invalidate_range_begin, and get
> > rid of invalidate calls from places where permissions are relaxed.
> 
> Isnt this more a job for paravirt ops if it is so tightly bound to page 
> tables? Are we not adding another similar API?

Um, it's bound to the *Linux page tables*, yes. And I have no idea why
you would use the paravirt ops for this.

 

-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] mmu notifiers #v8

2008-03-05 Thread Christoph Lameter
On Wed, 5 Mar 2008, Nick Piggin wrote:

> Um, it's bound to the *Linux page tables*, yes. And I have no idea why
> you would use the paravirt ops for this.

paravirt ops allows interception of page table operations?


-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] mmu notifiers #v8

2008-03-05 Thread Nick Piggin
On Wed, Mar 05, 2008 at 10:48:24AM -0800, Christoph Lameter wrote:
> On Wed, 5 Mar 2008, Nick Piggin wrote:
> 
> > Um, it's bound to the *Linux page tables*, yes. And I have no idea why
> > you would use the paravirt ops for this.
> 
> paravirt ops allows interception of page table operations?

Maybe possible but it's totally the wrong API for it.


-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] mmu notifiers #v8 + xpmem

2008-03-02 Thread Andrea Arcangeli
Here an example of the futher orthogonal work to do on top of #v8
during .26-rc to make the whole mmu notifier API sleep capable.

1) Every single ptep_clear_flush_young_notify and
ptep_clear_flush_notify must be converted like the below. The below is
the conversion of a single one. do_wp_page has been converted by
Christoph already but with invalidate_range (should be changed to
invalidate_page by releasing the refcount on the page after calling
invalidate_page). Hope it's clear why I'd rather not depend on these
changes to be merged in .25 in order to have the mmu notifier included
in .25.

2) Then after all this conversion work is finished, it's trivial to
delete ptep_clear_flush_young_notify and ptep_clear_flush_notify from
mmu_notifier.h (they will be unused macros once the conversion is
complete).

3) After that the VM has to be changed to convert anon_vma lock and
i_mmap_lock spinlocks to mutex/rwsemaphore.

4) Then finally the mmu_notifier_unregister must be dropped to make the
mmu notifier sleep capable with RCU in the mmu_notifier() fast path.

It's unclear at this point if 3/4 should be switchable and happening
under a CONFIG_XPMEM or similar or if everyone will benefit from those
spinlock becoming mutex (the only one that is certain to appreciate
such a change is preempt-rt, the rest of the userbase I don't know for
sure and I'd be more confortable with a TPC number comparison before
doing such a chance by default, but I leave the commentary on such a
change to linux-mm in a separate thread).

Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]>

diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -274,7 +274,7 @@ static int page_referenced_one(struct pa
unsigned long address;
pte_t *pte;
spinlock_t *ptl;
-   int referenced = 0;
+   int referenced = 0, clear_flush_young = 0;
 
address = vma_address(page, vma);
if (address == -EFAULT)
@@ -287,8 +287,11 @@ static int page_referenced_one(struct pa
if (vma->vm_flags & VM_LOCKED) {
referenced++;
*mapcount = 1;  /* break early from loop */
-   } else if (ptep_clear_flush_young_notify(vma, address, pte))
-   referenced++;
+   } else {
+   clear_flush_young = 1;
+   if (ptep_clear_flush_young(vma, address, pte))
+   referenced++;
+   }
 
/* Pretend the page is referenced if the task has the
   swap token and is in the middle of a page fault. */
@@ -298,6 +301,11 @@ static int page_referenced_one(struct pa
 
(*mapcount)--;
pte_unmap_unlock(pte, ptl);
+
+   if (clear_flush_young)
+   referenced += mmu_notifier_clear_flush_young(vma->vm_mm,
+address);
+
 out:
return referenced;
 }


-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel