Re: [patch 1/4] mmu_notifier: Core code

2008-02-01 Thread Christoph Lameter
On Fri, 1 Feb 2008, Robin Holt wrote:

> OK.  Now that release has been moved, I think I agree with you that the
> down_write(mmap_sem) can be used as our lock again and still work for
> Jack.  I would like a ruling from Jack as well.

Talked to Jack last night and he said it's okay.



Re: [patch 1/4] mmu_notifier: Core code

2008-02-01 Thread Robin Holt
On Fri, Feb 01, 2008 at 04:55:16AM -0600, Robin Holt wrote:
> OK.  Now that release has been moved, I think I agree with you that the
> down_write(mmap_sem) can be used as our lock again and still work for
> Jack.  I would like a ruling from Jack as well.

Ignore this, I was in the wrong work area.  I am sorry for adding to the
confusion.  This version has no locking requirement outside the driver
itself.

Sorry,
Robin


Re: [patch 1/4] mmu_notifier: Core code

2008-02-01 Thread Robin Holt
OK.  Now that release has been moved, I think I agree with you that the
down_write(mmap_sem) can be used as our lock again and still work for
Jack.  I would like a ruling from Jack as well.

Thanks,
Robin


[patch 1/4] mmu_notifier: Core code

2008-01-31 Thread Christoph Lameter
Notifier functions for hardware and software that establish external
references to pages of a Linux system. The notifier calls ensure that
external mappings are removed when the Linux VM removes memory ranges
or individual pages from a process.

This first portion is suited to external MMUs that do not have their own
rmap and do not need the ability to sleep before removing individual pages.

Two categories of external mmus are possible:

1. KVM style external mmus that have their own page table.
   These are capable of tracking pages in their page tables and
   can therefore increase the refcount on pages. An increased
   refcount guarantees page existence regardless of the VM's unmapping
   actions until the logic in the notifier call decides to drop a page.

2. GRU style external mmus that rely on the Linux page table for TLB lookups.
   These cannot track pages that are externally referenced.
   TLB entries can only be evicted as necessary.
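As a rough illustration of the difference (the my_* helpers are hypothetical
driver functions, and the invalidate_page() signature assumes the
(notifier, mm, address) form of the single-page callback described below):

	/* KVM style: a page reference was taken when the page was mapped
	 * externally, so the page cannot vanish; the callback decides when
	 * to drop that reference. */
	static void kvm_style_invalidate_page(struct mmu_notifier *mn,
			struct mm_struct *mm, unsigned long address)
	{
		struct page *page;

		page = my_ext_table_clear(mn, address);	/* driver-private */
		if (page)
			put_page(page);	/* drop reference taken at map time */
	}

	/* GRU style: no page references are held; the only possible action
	 * is to evict the external TLB entries for the address. */
	static void gru_style_invalidate_page(struct mmu_notifier *mn,
			struct mm_struct *mm, unsigned long address)
	{
		my_tlb_flush_address(mn, address);	/* driver-private */
	}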


Callbacks are registered with an mm_struct from a device driver using
mmu_notifier_register(). When the VM removes pages (or restricts
permissions on pages) then the callbacks are triggered.

The VM holds spinlocks in order to walk reverse maps in rmap.c. The single
page callback invalidate_page() is therefore always run with
spinlocks held (which limits what can be done in the callbacks).

The invalidate_range_begin/end callbacks can be run in atomic as well as
sleepable contexts. A flag is passed to indicate an atomic context.
The notifier may decide to defer actions if the context is atomic.

Pages must be marked dirty if dirty bits are found to be set in
the external ptes.
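For instance (a sketch; external_pte_dirty() and the epte variable are
driver-side assumptions, not part of this patch):

	/* When tearing down an external mapping, propagate its dirty state
	 * back to the struct page before letting go of it. */
	if (external_pte_dirty(epte))
		set_page_dirty(page);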

Requirements on synchronization within the driver:

 Multiple invalidate_range_begin/ends may be nested or called
 concurrently. That is legitimate. However, no new external references
 may be established as long as any invalidate_xxx is running or as long
 as any invalidate_range_begin() has not been completed through a
 corresponding call to invalidate_range_end().

 Locking within the notifier callbacks needs to serialize events
 correspondingly. One simple implementation would be the use of a spinlock
 that needs to be acquired for access to the page table or TLB managed by
 the driver. A rw lock could be used to allow multiple concurrent
 invalidates to run, but then the driver needs to have additional internal
 synchronization for access to hardware resources.

 If all invalidate_xx notifier calls take the driver lock then it is
 possible to run follow_page() under the same lock. The lock can then
 guarantee that no page is removed and provides an additional existence
 guarantee of the page independent of the page count.

 invalidate_range_begin() must clear all references in the range
 and stop the establishment of new references.

 invalidate_range_end() reenables the establishment of references.
 The atomic parameter passed to invalidate_range_xx indicates that the
 function is called in an atomic context. We can sleep if atomic == 0.
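A sketch of that serialization scheme, under the stated assumptions (one
driver-private spinlock; the my_* helpers are hypothetical; the begin/end
signatures follow the description above, not a verified header):

	static DEFINE_SPINLOCK(my_driver_lock);
	static int my_invalidates_running;	/* > 0 while a range invalidate is open */

	static void my_range_begin(struct mmu_notifier *mn, struct mm_struct *mm,
			unsigned long start, unsigned long end, int atomic)
	{
		spin_lock(&my_driver_lock);
		my_invalidates_running++;
		my_clear_external_refs(mn, start, end);	/* hypothetical teardown */
		spin_unlock(&my_driver_lock);
	}

	static void my_range_end(struct mmu_notifier *mn, struct mm_struct *mm,
			unsigned long start, unsigned long end, int atomic)
	{
		spin_lock(&my_driver_lock);
		my_invalidates_running--;	/* references may be established again at 0 */
		spin_unlock(&my_driver_lock);
	}

	/* New references take the same lock, so follow_page() runs with an
	 * existence guarantee independent of the page count. */
	static struct page *my_establish_ref(struct vm_area_struct *vma,
			unsigned long addr)
	{
		struct page *page = NULL;

		spin_lock(&my_driver_lock);
		if (!my_invalidates_running)
			page = follow_page(vma, addr, 0);
		spin_unlock(&my_driver_lock);
		return page;
	}

The counter makes nested or concurrent begin/end pairs safe: new references
can only be established once every outstanding begin has seen its matching
end.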

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]>

---
 include/linux/mm_types.h     |    8 +
 include/linux/mmu_notifier.h |  179 +++
 kernel/fork.c                |    2 
 mm/Kconfig                   |    4 
 mm/Makefile                  |    1 
 mm/mmap.c                    |    2 
 mm/mmu_notifier.c            |   76 ++
 7 files changed, 272 insertions(+)

Index: linux-2.6/include/linux/mm_types.h
===================================================================
--- linux-2.6.orig/include/linux/mm_types.h	2008-01-31 19:55:46.000000000 -0800
+++ linux-2.6/include/linux/mm_types.h	2008-01-31 19:59:51.000000000 -0800
@@ -153,6 +153,12 @@ struct vm_area_struct {
 #endif
 };
 
+struct mmu_notifier_head {
+#ifdef CONFIG_MMU_NOTIFIER
+   struct hlist_head head;
+#endif
+};
+
 struct mm_struct {
struct vm_area_struct * mmap;   /* list of VMAs */
struct rb_root mm_rb;
@@ -219,6 +225,8 @@ struct mm_struct {
/* aio bits */
	rwlock_t		ioctx_list_lock;
struct kioctx   *ioctx_list;
+
+   struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
 };
 
 #endif /* _LINUX_MM_TYPES_H */
Index: linux-2.6/include/linux/mmu_notifier.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/include/linux/mmu_notifier.h	2008-01-31 20:56:03.000000000 -0800
@@ -0,0 +1,179 @@
+#ifndef _LINUX_MMU_NOTIFIER_H
+#define _LINUX_MMU_NOTIFIER_H
+
+/*
+ * MMU notifier
+ *
+ * Notifier functions for hardware and software that establishes external
+ * references to pages of a Linux system. The notifier calls ensure that
+ * external mappings are removed when the Linux VM removes memory ranges
+ * or individual pages from a process.

Re: [patch 1/4] mmu_notifier: Core code

2008-01-28 Thread Christoph Lameter
On Sat, 26 Jan 2008, Robin Holt wrote:

> > No you cannot do that because there are still callbacks that come later. 
> > The invalidate_all may lead to invalidate_range() doing nothing for this 
> > mm. The ops notifier and the freeing of the structure has to wait until 
> > release().
> 
> Could you be a little more clear here?  If you are saying that the other
> callbacks will need to do work?  I can assure you we will clean up those
> pages and raise memory protections.  It will also be done in a much more
> efficient fashion than the individual callouts.

No, the other callbacks need to work in the sense that they can be called.
You could have them do nothing after an invalidate_all().
But you cannot release the allocated structs needed for list traversal etc.

> If, on the other hand, you are saying we can not because of the way
> we traverse the list, can we return a result indicating to the caller
> we would like to be unregistered and then the mmu_notifier code do the
> remove followed by a call to the release notifier?

You would need to release the resources when the release notifier is called.

> > That does not sync with the current scheme of the invalidate_range() 
> > hooks. We would have to do a global invalidate early and then place the 
> > other invalidate_range hooks in such a way that none is called later in 
> > process exit handling.
> 
> But if the notifier is removed from the list following the invalidate_all
> callout, there would be no additional callouts.

Hmmm. Okay, I did not think about that. Then you would need to do a
synchronize_rcu() in invalidate_all()?
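For illustration, the pattern that question is about (struct my_notifier is
a hypothetical driver structure embedding the struct mmu_notifier):

	/* Unlink with the RCU list primitive, wait out a grace period so
	 * that no CPU can still be traversing the hlist, then free.  Must
	 * run in a sleepable context, outside any rcu_read_lock() section. */
	static void my_notifier_teardown(struct my_notifier *my,
			struct mm_struct *mm)
	{
		mmu_notifier_unregister(&my->notifier, mm); /* hlist_del_rcu inside */
		synchronize_rcu();
		kfree(my);
	}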



Re: [patch 1/4] mmu_notifier: Core code

2008-01-28 Thread Christoph Lameter
On Sat, 26 Jan 2008, Robin Holt wrote:

> But what if the caller is already holding the mmap_sem?  Why force the
> acquire into this function?  Since we are dealing with a semaphore/mutex,

Then you need to call __register_mmu_notifier.
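Sketched usage of the two variants (my_notifier and the extra setup call are
illustrative only):

	/* Caller does not hold mmap_sem: the plain variant takes it. */
	mmu_notifier_register(&my_notifier, mm);

	/* Caller already holds mmap_sem for writing: use the __ variant. */
	down_write(&mm->mmap_sem);
	setup_more_mm_state(mm);	/* hypothetical additional work */
	__mmu_notifier_register(&my_notifier, mm);
	up_write(&mm->mmap_sem);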


Re: [patch 1/4] mmu_notifier: Core code

2008-01-26 Thread Robin Holt
>  void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
>  {
> -	spin_lock(&mmu_notifier_list_lock);
> -	hlist_add_head(&mn->hlist, &mm->mmu_notifier.head);
> -	spin_unlock(&mmu_notifier_list_lock);
> +	down_write(&mm->mmap_sem);
> +	__mmu_notifier_register(mn, mm);
> +	up_write(&mm->mmap_sem);
>  }
>  EXPORT_SYMBOL_GPL(mmu_notifier_register);

But what if the caller is already holding the mmap_sem?  Why force the
acquire into this function?  Since we are dealing with a semaphore/mutex,
it is reasonable that other structures are protected by it and that more
work will be done under it, so put the weight of acquiring the sema under
the control of the caller, who can decide whether more needs to be completed.

That was why I originally suggested creating a new rwsem_is_write_locked()
function and basing a BUG_ON upon that.
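The proposed assertion would slot into the registration path roughly like
this (rwsem_is_write_locked() is the hypothetical helper; the function body
matches the V1 diff later in this archive):

	void __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
	{
		/* Caller is required to hold mmap_sem for writing. */
		BUG_ON(!rwsem_is_write_locked(&mm->mmap_sem));
		hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head);
	}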

Thanks,
Robin


Re: [patch 1/4] mmu_notifier: Core code

2008-01-26 Thread Robin Holt
> > > 1. invalidate_all()
> > 
> > That will be fine as long as we can unregister the ops notifier and free
> > the structure.  Otherwise, we end up being called needlessly.
> 
> No you cannot do that because there are still callbacks that come later. 
> The invalidate_all may lead to invalidate_range() doing nothing for this 
> mm. The ops notifier and the freeing of the structure has to wait until 
> release().

Could you be a little more clear here?  Are you saying that the other
callbacks will need to do work?  I can assure you we will clean up those
pages and raise memory protections.  It will also be done in a much more
efficient fashion than the individual callouts.

If, on the other hand, you are saying we can not because of the way
we traverse the list, can we return a result indicating to the caller
we would like to be unregistered and then the mmu_notifier code do the
remove followed by a call to the release notifier?

> 
> > > 2. invalidate_range() for each vma
> > > 
> > > 3. release()
> > > 
> > > We cannot simply move the call up because there will be future range 
> > > callbacks on vma invalidation.
> > 
> > I am not sure what this means.  Right now, if you were to notify XPMEM
> > the process is exiting, we would take care of all the recalling of pages
> > exported by this process, clearing those pages cache lines from cache,
> > and raising memory protections.  I would assume that moving the callout
> > earlier would expect the same of every driver.
> 
> That does not sync with the current scheme of the invalidate_range() 
> hooks. We would have to do a global invalidate early and then place the 
> other invalidate_range hooks in such a way that none is called later in 
> process exit handling.

But if the notifier is removed from the list following the invalidate_all
callout, there would be no additional callouts.

Thanks,
Robin


Re: [patch 1/4] mmu_notifier: Core code

2008-01-25 Thread Christoph Lameter
Diff so far against V1

- Improve RCU support. (There is now a synchronize_rcu in mmu_release which
is bad.)

- Clean compile for !MMU_NOTIFIER

- Use mmap_sem for serializing additions to the mmu_notifier list in the
mm_struct (but still a global spinlock for mmu_rmap_notifier; the
registration function is called only a couple of times)

- 

---
 include/linux/list.h         |   14 ++
 include/linux/mm_types.h     |    2 --
 include/linux/mmu_notifier.h |   39 ---
 mm/mmu_notifier.c            |   28 +++-
 4 files changed, 69 insertions(+), 14 deletions(-)

Index: linux-2.6/mm/mmu_notifier.c
===================================================================
--- linux-2.6.orig/mm/mmu_notifier.c	2008-01-25 12:14:49.000000000 -0800
+++ linux-2.6/mm/mmu_notifier.c	2008-01-25 12:14:49.000000000 -0800
@@ -15,17 +15,18 @@
 void mmu_notifier_release(struct mm_struct *mm)
 {
 	struct mmu_notifier *mn;
-	struct hlist_node *n;
+	struct hlist_node *n, *t;
 
 	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
 		rcu_read_lock();
-		hlist_for_each_entry_rcu(mn, n,
+		hlist_for_each_entry_safe_rcu(mn, n, t,
 					  &mm->mmu_notifier.head, hlist) {
 			if (mn->ops->release)
 				mn->ops->release(mn, mm);
 			hlist_del(&mn->hlist);
 		}
 		rcu_read_unlock();
+		synchronize_rcu();
 	}
 }
 
@@ -53,24 +54,33 @@ int mmu_notifier_age_page(struct mm_stru
return young;
 }
 
-static DEFINE_SPINLOCK(mmu_notifier_list_lock);
+/*
+ * Note that all notifiers use RCU. The updates are only guaranteed to be
+ * visible to other processes after a RCU quiescent period!
+ */
+void __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head);
+}
+EXPORT_SYMBOL_GPL(__mmu_notifier_register);
 
 void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
 {
-	spin_lock(&mmu_notifier_list_lock);
-	hlist_add_head(&mn->hlist, &mm->mmu_notifier.head);
-	spin_unlock(&mmu_notifier_list_lock);
+	down_write(&mm->mmap_sem);
+	__mmu_notifier_register(mn, mm);
+	up_write(&mm->mmap_sem);
 }
 EXPORT_SYMBOL_GPL(mmu_notifier_register);
 
 void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
 {
-	spin_lock(&mmu_notifier_list_lock);
-	hlist_del(&mn->hlist);
-	spin_unlock(&mmu_notifier_list_lock);
+	down_write(&mm->mmap_sem);
+	hlist_del_rcu(&mn->hlist);
+	up_write(&mm->mmap_sem);
 }
 EXPORT_SYMBOL_GPL(mmu_notifier_unregister);
 
+static DEFINE_SPINLOCK(mmu_notifier_list_lock);
 HLIST_HEAD(mmu_rmap_notifier_list);
 
 void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn)
Index: linux-2.6/include/linux/list.h
===================================================================
--- linux-2.6.orig/include/linux/list.h	2008-01-25 12:14:47.000000000 -0800
+++ linux-2.6/include/linux/list.h	2008-01-25 12:14:49.000000000 -0800
@@ -991,6 +991,20 @@ static inline void hlist_add_after_rcu(s
({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \
 pos = pos->next)
 
+/**
+ * hlist_for_each_entry_safe_rcu   - iterate over list of given type
+ * @tpos:  the type * to use as a loop cursor.
+ * @pos:	the &struct hlist_node to use as a loop cursor.
+ * @n: temporary pointer
+ * @head:  the head for your list.
+ * @member:the name of the hlist_node within the struct.
+ */
+#define hlist_for_each_entry_safe_rcu(tpos, pos, n, head, member)	 \
+	for (pos = (head)->first;					 \
+	     rcu_dereference(pos) && ({ n = pos->next; 1;}) &&		 \
+		({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \
+	     pos = n)
+
 #else
 #warning "don't include kernel headers in userspace"
 #endif /* __KERNEL__ */
Index: linux-2.6/include/linux/mm_types.h
===================================================================
--- linux-2.6.orig/include/linux/mm_types.h	2008-01-25 12:14:49.000000000 -0800
+++ linux-2.6/include/linux/mm_types.h	2008-01-25 12:14:49.000000000 -0800
@@ -224,9 +224,7 @@ struct mm_struct {
	rwlock_t		ioctx_list_lock;
struct kioctx   *ioctx_list;
 
-#ifdef CONFIG_MMU_NOTIFIER
struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
-#endif
 };
 
 #endif /* _LINUX_MM_TYPES_H */
Index: linux-2.6/include/linux/mmu_notifier.h
===================================================================
--- linux-2.6.orig/include/linux/mmu_notifier.h	2008-01-25 12:14:49.000000000 -0800
+++ linux-2.6/include/linux/mmu_notifier.h	2008-01-25 13:07:54.000000000 -0800
@@ -46,6 +46,10 @@ struct mmu_notifier {
 };
 
 struct mmu_notifier_ops {
+   /*
+* Note the mmu_notifier 

Re: [patch 1/4] mmu_notifier: Core code

2008-01-25 Thread Christoph Lameter
On Fri, 25 Jan 2008, Robin Holt wrote:

> Keep in mind that on a 2048p SSI MPI job starting up, we have 2048 ranks
> doing this at the same time 6 times withing their address range.  That
> seems like a lock which could get hot fairly quickly.  It may be for a
> short period during startup and shutdown, but it is there.

Ok. I guess we need to have a __register_mmu_notifier that expects the 
mmap_sem to be held then?

> > 1. invalidate_all()
> 
> That will be fine as long as we can unregister the ops notifier and free
> the structure.  Otherwise, we end up being called needlessly.

No you cannot do that because there are still callbacks that come later. 
The invalidate_all may lead to invalidate_range() doing nothing for this 
mm. The ops notifier and the freeing of the structure has to wait until 
release().

> > 2. invalidate_range() for each vma
> > 
> > 3. release()
> > 
> > We cannot simply move the call up because there will be future range 
> > callbacks on vma invalidation.
> 
> I am not sure what this means.  Right now, if you were to notify XPMEM
> the process is exiting, we would take care of all the recalling of pages
> exported by this process, clearing those pages cache lines from cache,
> and raising memory protections.  I would assume that moving the callout
> earlier would expect the same of every driver.

That does not sync with the current scheme of the invalidate_range() 
hooks. We would have to do a global invalidate early and then place the 
other invalidate_range hooks in such a way that none is called in later in 
process exit handling.


Re: [patch 1/4] mmu_notifier: Core code

2008-01-25 Thread Robin Holt
On Fri, Jan 25, 2008 at 11:03:07AM -0800, Christoph Lameter wrote:
> > > > Shouldn't this really be protected by the down_write(mmap_sem)?  Maybe:
> > > 
> > > Ok. We could switch this to mmap_sem protection for the mm_struct but the 
> > > rmap notifier is not associated with an mm_struct. So we would need to 
> > > keep it there. Since we already have a spinlock: Just use it for both to 
> > > avoid further complications.
> > 
> > But now you are putting a global lock in where it is inappropriate.
> 
> The lock is only used during register and unregister. Very low level 
> usage.

Seems to me that is the same argument used for lock_kernel.  I am saying
we have a perfectly reasonable way to separate the protections down to
their smallest.  For the things hanging off the mm, mmap_sem; for the
other list, a list-specific lock.

Keep in mind that on a 2048p SSI MPI job starting up, we have 2048 ranks
doing this at the same time, 6 times within their address range.  That
seems like a lock which could get hot fairly quickly.  It may be for a
short period during startup and shutdown, but it is there.


> 
> > > > XPMEM would also benefit from a call early.  We could mark all the
> > > > segments as being torn down and start the recalls.  We already have
> > > > this code in and working (have since it was first written 6 years ago).
> > > > In this case, all segments are torn down with a single message to each
> > > > of the importing partitions.  In contrast, the teardown code which would
> > > > happen now would be one set of messages for each vma.
> > > 
> > > So we need an additional global teardown call? Then we'd need to switch 
> > > off the vma based invalidate_range()?
> > 
> > No, EXACTLY what I originally was asking for, either move this call site
> > up, introduce an additional mmu_notifier op, or place this one in two
> > locations with a flag indicating which call is being made.
> 
> Add a new invalidate_all() call? Then on exit we do
> 
> 1. invalidate_all()

That will be fine as long as we can unregister the ops notifier and free
the structure.  Otherwise, we end up being called needlessly.

> 
> 2. invalidate_range() for each vma
> 
> 3. release()
> 
> We cannot simply move the call up because there will be future range 
> callbacks on vma invalidation.

I am not sure what this means.  Right now, if you were to notify XPMEM
the process is exiting, we would take care of all the recalling of pages
exported by this process, clearing those pages cache lines from cache,
and raising memory protections.  I would assume that moving the callout
earlier would expect the same of every driver.


Thanks,
Robin


Re: [patch 1/4] mmu_notifier: Core code

2008-01-25 Thread Christoph Lameter
On Fri, 25 Jan 2008, Robin Holt wrote:

> > > > +void mmu_notifier_release(struct mm_struct *mm)
> > > > +{
> > > > +	struct mmu_notifier *mn;
> > > > +	struct hlist_node *n;
> > > > +
> > > > +	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
> > > > +		rcu_read_lock();
> > > > +		hlist_for_each_entry_rcu(mn, n,
> > > > +					  &mm->mmu_notifier.head, hlist) {
> > > > +			if (mn->ops->release)
> > > > +				mn->ops->release(mn, mm);
> > > > +			hlist_del(&mn->hlist);
> > > 
> > > I think the hlist_del needs to be before the function callout so we can 
> > > free
> > > the structure without a use-after-free issue.
> > 
> > The list head is in the mm_struct. This will be freed later.
> > 
> 
> I meant the structure pointed to by mn.  I assume it is intended that
> structure be kmalloc'd as part of a larger structure.  The driver is the
> entity which created that structure and should be the one to free it.

mn will be pointing to the listhead in the mm_struct one after the other. 
You mean the ops structure?

> > > > +void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
> > > > +{
> > > > +	spin_lock(&mmu_notifier_list_lock);
> > > 
> > > Shouldn't this really be protected by the down_write(mmap_sem)?  Maybe:
> > 
> > Ok. We could switch this to mmap_sem protection for the mm_struct but the 
> > rmap notifier is not associated with an mm_struct. So we would need to 
> > keep it there. Since we already have a spinlock: Just use it for both to 
> > avoid further complications.
> 
> But now you are putting a global lock in where it is inappropriate.

The lock is only used during register and unregister. Very low level 
usage.

> > > XPMEM would also benefit from a call early.  We could mark all the
> > > segments as being torn down and start the recalls.  We already have
> > > this code in and working (have since it was first written 6 years ago).
> > > In this case, all segments are torn down with a single message to each
> > > of the importing partitions.  In contrast, the teardown code which would
> > > happen now would be one set of messages for each vma.
> > 
> > So we need an additional global teardown call? Then we'd need to switch 
> > off the vma based invalidate_range()?
> 
> No, EXACTLY what I originally was asking for, either move this call site
> up, introduce an additional mmu_notifier op, or place this one in two
> locations with a flag indicating which call is being made.

Add a new invalidate_all() call? Then on exit we do

1. invalidate_all()

2. invalidate_range() for each vma

3. release()

We cannot simply move the call up because there will be future range 
callbacks on vma invalidation.
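As a sketch, that ordering (invalidate_all() is only a proposal at this
point; unmap_all_vmas() stands in for the real per-vma unmap path, and the
mmu_notifier() invocation macro is the one from this patch):

	static void notifier_exit_sequence(struct mm_struct *mm)
	{
		/* 1. proposed: one global teardown callout per mm */
		mmu_notifier(invalidate_all, mm);
		/* 2. the regular unmap path runs and emits the per-vma
		 *    invalidate_range() callouts */
		unmap_all_vmas(mm);
		/* 3. release(): drivers free their remaining state */
		mmu_notifier_release(mm);
	}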




Re: [patch 1/4] mmu_notifier: Core code

2008-01-25 Thread Robin Holt
On Fri, Jan 25, 2008 at 10:47:04AM -0800, Christoph Lameter wrote:
> On Fri, 25 Jan 2008, Robin Holt wrote:
> 
> > I realize it is a minor nit, but since we put the continuation in column
> > 81 in the next define, can we do the same here and make this more
> > readable?
> 
> We need to fix the next define to not use column 81.
> Found a couple of more 80 column infractions. Will be fixed in next 
> release.
> 
> > > +void mmu_notifier_release(struct mm_struct *mm)
> > > +{
> > > +	struct mmu_notifier *mn;
> > > +	struct hlist_node *n;
> > > +
> > > +	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
> > > +		rcu_read_lock();
> > > +		hlist_for_each_entry_rcu(mn, n,
> > > +					  &mm->mmu_notifier.head, hlist) {
> > > +			if (mn->ops->release)
> > > +				mn->ops->release(mn, mm);
> > > +			hlist_del(&mn->hlist);
> > 
> > I think the hlist_del needs to be before the function callout so we can free
> > the structure without a use-after-free issue.
> 
> The list head is in the mm_struct. This will be freed later.
> 

I meant the structure pointed to by mn.  I assume it is intended that
structure be kmalloc'd as part of a larger structure.  The driver is the
entity which created that structure and should be the one to free it.

> > > +void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
> > > +{
> > > +	spin_lock(&mmu_notifier_list_lock);
> > 
> > Shouldn't this really be protected by the down_write(mmap_sem)?  Maybe:
> 
> Ok. We could switch this to mmap_sem protection for the mm_struct but the 
> rmap notifier is not associated with an mm_struct. So we would need to 
> keep it there. Since we already have a spinlock: Just use it for both to 
> avoid further complications.

But now you are putting a global lock in where it is inappropriate.

> 
> > > +	spin_lock(&mmu_notifier_list_lock);
> > > +	hlist_del(&mn->hlist);
> > 
> > hlist_del_rcu?  Ditto on the lock.
> 
> Peter already mentioned that and I have posted patches that address this 
> issue.
> 
> > > @@ -2043,6 +2044,7 @@ void exit_mmap(struct mm_struct *mm)
> > >   vm_unacct_memory(nr_accounted);
> > >   free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
> > >   tlb_finish_mmu(tlb, 0, end);
> > > + mmu_notifier_release(mm);
> > 
> > Can we consider moving this notifier or introducing an additional notifier
> > in the release or a flag to this one indicating early/late.
> 
> There is only one call right now?
> 
> > The GRU that Jack is concerned with would benefit from the early callout in
> > that it could just invalidate the GRU context and immediately all GRU
> > TLB entries are invalid.  I believe Jack would like to also be able to
> > remove his entry from the mmu_notifier list in an effort to avoid the
> > page and range callouts.
> 
> The TLB entries are removed by earlier invalidate_range calls. I would 
> think that no TLBs are left at this point. It's simply a matter of 
> releasing any still allocated resources through this callback.

What I was asking for is a way to avoid those numerous callouts for
drivers that can do early cleanup.

>  
> > XPMEM would also benefit from a call early.  We could mark all the
> > segments as being torn down and start the recalls.  We already have
> > this code in and working (have since it was first written 6 years ago).
> > In this case, all segments are torn down with a single message to each
> > of the importing partitions.  In contrast, the teardown code which would
> > happen now would be one set of messages for each vma.
> 
> So we need an additional global teardown call? Then we'd need to switch 
> off the vma based invalidate_range()?

No, EXACTLY what I originally was asking for, either move this call site
up, introduce an additional mmu_notifier op, or place this one in two
locations with a flag indicating which call is being made.

Thanks,
Robin


Re: [patch 1/4] mmu_notifier: Core code

2008-01-25 Thread Christoph Lameter
On Fri, 25 Jan 2008, Robin Holt wrote:

> I realize it is a minor nit, but since we put the continuation in column
> 81 in the next define, can we do the same here and make this more
> readable?

We need to fix the next define to not use column 81.
Found a couple of more 80 column infractions. Will be fixed in next 
release.

> > +void mmu_notifier_release(struct mm_struct *mm)
> > +{
> > +	struct mmu_notifier *mn;
> > +	struct hlist_node *n;
> > +
> > +	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
> > +		rcu_read_lock();
> > +		hlist_for_each_entry_rcu(mn, n,
> > +					  &mm->mmu_notifier.head, hlist) {
> > +			if (mn->ops->release)
> > +				mn->ops->release(mn, mm);
> > +			hlist_del(&mn->hlist);
> 
> I think the hlist_del needs to be before the function callout so we can free
> the structure without a use-after-free issue.

The list head is in the mm_struct. This will be freed later.

> > +void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
> > +{
> > +	spin_lock(&mmu_notifier_list_lock);
> 
> Shouldn't this really be protected by the down_write(mmap_sem)?  Maybe:

Ok. We could switch this to mmap_sem protection for the mm_struct but the 
rmap notifier is not associated with an mm_struct. So we would need to 
keep it there. Since we already have a spinlock: Just use it for both to 
avoid further complications.

> > +	spin_lock(&mmu_notifier_list_lock);
> > +	hlist_del(&mn->hlist);
> 
> hlist_del_rcu?  Ditto on the lock.

Peter already mentioned that and I have posted patches that address this 
issue.

> > @@ -2043,6 +2044,7 @@ void exit_mmap(struct mm_struct *mm)
> > vm_unacct_memory(nr_accounted);
> > 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
> > tlb_finish_mmu(tlb, 0, end);
> > +   mmu_notifier_release(mm);
> 
> Can we consider moving this notifier or introducing an additional notifier
> in the release or a flag to this one indicating early/late.

There is only one call right now?

> The GRU that Jack is concerned with would benefit from the early callout in
> that it could just invalidate the GRU context and immediately all GRU
> TLB entries are invalid.  I believe Jack would like to also be able to
> remove his entry from the mmu_notifier list in an effort to avoid the
> page and range callouts.

The TLB entries are removed by earlier invalidate_range calls. I would
think that no TLBs are left at this point. It's simply a matter of
releasing any still allocated resources through this callback.
 
> XPMEM would also benefit from a call early.  We could mark all the
> segments as being torn down and start the recalls.  We already have
> this code in and working (have since it was first written 6 years ago).
> In this case, all segments are torn down with a single message to each
> of the importing partitions.  In contrast, the teardown code which would
> happen now would be one set of messages for each vma.

So we need an additional global teardown call? Then we'd need to switch 
off the vma based invalidate_range()?



Re: [patch 1/4] mmu_notifier: Core code

2008-01-25 Thread Robin Holt
> +#define mmu_notifier(function, mm, args...)  \
...
> + if (__mn->ops->function)\
> + __mn->ops->function(__mn,   \
> + mm, \
> + args);  \

	__mn->ops->function(__mn, mm, args);				\
I realize it is a minor nit, but since we put the continuation in column
81 in the next define, can we do the same here and make this more
readable?

> + rcu_read_unlock();  \
...
> +#define mmu_rmap_notifier(function, args...)				\
> +	do {								\
> +		struct mmu_rmap_notifier *__mrn;			\
> +		struct hlist_node *__n;					\
> +									\



> +void mmu_notifier_release(struct mm_struct *mm)
> +{
> +	struct mmu_notifier *mn;
> +	struct hlist_node *n;
> +
> +	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
> +		rcu_read_lock();
> +		hlist_for_each_entry_rcu(mn, n,
> +					  &mm->mmu_notifier.head, hlist) {
> +			if (mn->ops->release)
> +				mn->ops->release(mn, mm);
> +			hlist_del(&mn->hlist);

I think the hlist_del needs to be before the function callout so we can free
the structure without a use-after-free issue.

	hlist_for_each_entry_rcu(mn, n,
				  &mm->mmu_notifier.head, hlist) {
		hlist_del_rcu(&mn->hlist);
		if (mn->ops->release)
			mn->ops->release(mn, mm);



> +static DEFINE_SPINLOCK(mmu_notifier_list_lock);

Remove

> +
> +void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
> +{
> +	spin_lock(&mmu_notifier_list_lock);

Shouldn't this really be protected by the down_write(mmap_sem)?  Maybe:
	BUG_ON(!rwsem_is_write_locked(&mm->mmap_sem));

> +	hlist_add_head(&mn->hlist, &mm->mmu_notifier.head);
	hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head);

> +	spin_unlock(&mmu_notifier_list_lock);
> +}
> +EXPORT_SYMBOL_GPL(mmu_notifier_register);
> +
> +void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
> +{
> +	spin_lock(&mmu_notifier_list_lock);
> +	hlist_del(&mn->hlist);

hlist_del_rcu?  Ditto on the lock.

> +	spin_unlock(&mmu_notifier_list_lock);
> +}
> +EXPORT_SYMBOL_GPL(mmu_notifier_unregister);
> +
> +HLIST_HEAD(mmu_rmap_notifier_list);

static DEFINE_SPINLOCK(mmu_rmap_notifier_list_lock);

> +
> +void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn)
> +{
> +	spin_lock(&mmu_notifier_list_lock);
> +	hlist_add_head_rcu(&mrn->hlist, &mmu_rmap_notifier_list);
> +	spin_unlock(&mmu_notifier_list_lock);

	spin_lock(&mmu_rmap_notifier_list_lock);
	hlist_add_head_rcu(&mrn->hlist, &mmu_rmap_notifier_list);
	spin_unlock(&mmu_rmap_notifier_list_lock);

> +}
> +EXPORT_SYMBOL(mmu_rmap_notifier_register);
> +
> +void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn)
> +{
> +	spin_lock(&mmu_notifier_list_lock);
> +	hlist_del_rcu(&mrn->hlist);
> +	spin_unlock(&mmu_notifier_list_lock);

	spin_lock(&mmu_rmap_notifier_list_lock);
	hlist_del_rcu(&mrn->hlist);
	spin_unlock(&mmu_rmap_notifier_list_lock);


> @@ -2043,6 +2044,7 @@ void exit_mmap(struct mm_struct *mm)
>   vm_unacct_memory(nr_accounted);
>   free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
>   tlb_finish_mmu(tlb, 0, end);
> + mmu_notifier_release(mm);

Can we consider moving this notifier or introducing an additional notifier
in the release or a flag to this one indicating early/late.

The GRU that Jack is concerned with would benefit from the early callout in
that it could just invalidate the GRU context and immediately all GRU
TLB entries are invalid.  I believe Jack would like to also be able to
remove his entry from the mmu_notifier list in an effort to avoid the
page and range callouts.

XPMEM would also benefit from a call early.  We could mark all the
segments as being torn down and start the recalls.  We already have
this code in and working (have since it was first written 6 years ago).
In this case, all segments are torn down with a single message to each
of the importing partitions.  In contrast, the teardown code which would
happen now would be one set of messages for each vma.

Thanks,
Robin


Re: [patch 1/4] mmu_notifier: Core code

2008-01-25 Thread Robin Holt
On Fri, Jan 25, 2008 at 11:03:07AM -0800, Christoph Lameter wrote:
Shouldn't this really be protected by the down_write(mmap_sem)?  Maybe:
   
   Ok. We could switch this to mmap_sem protection for the mm_struct but the 
   rmap notifier is not associated with an mm_struct. So we would need to 
   keep it there. Since we already have a spinlock: Just use it for both to 
   avoid further complications.
  
  But now you are putting a global lock in where it is inappropriate.
 
 The lock is only used during register and unregister. Very low level 
 usage.

Seems to me that is the same argument used for lock_kernel.  I am saying
we have a perfectly reasonable way to seperate the protections down to
their smallest.  For the things hanging off the mm, mmap_sem, for the
other list, a list specific lock.

Keep in mind that on a 2048p SSI MPI job starting up, we have 2048 ranks
doing this at the same time 6 times withing their address range.  That
seems like a lock which could get hot fairly quickly.  It may be for a
short period during startup and shutdown, but it is there.


 
XPMEM, would also benefit from a call early.  We could make all the
segments as being torn down and start the recalls.  We already have
this code in and working (have since it was first written 6 years ago).
In this case, all segments are torn down with a single message to each
of the importing partitions.  In contrast, the teardown code which would
happen now would be one set of messages for each vma.
   
   So we need an additional global teardown call? Then we'd need to switch 
   off the vma based invalidate_range()?
  
  No, EXACTLY what I originally was asking for, either move this call site
  up, introduce an additional mmu_notifier op, or place this one in two
  locations with a flag indicating which call is being made.
 
 Add a new invalidate_all() call? Then on exit we do
 
 1. invalidate_all()

That will be fine as long as we can unregister the ops notifier and free
the structure.  Otherwise, we end up being called needlessly.

 
 2. invalidate_range() for each vma
 
 3. release()
 
 We cannot simply move the call up because there will be future range 
 callbacks on vma invalidation.

I am not sure what this means.  Right now, if you were to notify XPMEM
the process is exiting, we would take care of all the recalling of pages
exported by this process, clearing those pages cache lines from cache,
and raising memory protections.  I would assume that moving the callout
earlier would expect the same of every driver.


Thanks,
Robin
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 1/4] mmu_notifier: Core code

2008-01-25 Thread Christoph Lameter
On Fri, 25 Jan 2008, Robin Holt wrote:

+void mmu_notifier_release(struct mm_struct *mm)
+{
+   struct mmu_notifier *mn;
+   struct hlist_node *n;
+
+   if (unlikely(!hlist_empty(mm-mmu_notifier.head))) {
+   rcu_read_lock();
+   hlist_for_each_entry_rcu(mn, n,
+ mm-mmu_notifier.head, 
hlist) {
+   if (mn-ops-release)
+   mn-ops-release(mn, mm);
+   hlist_del(mn-hlist);
   
   I think the hlist_del needs to be before the function callout so we can 
   free
   the structure without a use-after-free issue.
  
  The list head is in the mm_struct. This will be freed later.
  
 
 I meant the structure pointed to by mn.  I assume it is intended that
 structure be kmalloc'd as part of a larger structure.  The driver is the
 entity which created that structure and should be the one to free it.

mn will be pointing to the listhead in the mm_struct one after the other. 
You mean the ops structure?

+void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct 
*mm)
+{
+   spin_lock(mmu_notifier_list_lock);
   
   Shouldn't this really be protected by the down_write(mmap_sem)?  Maybe:
  
  Ok. We could switch this to mmap_sem protection for the mm_struct but the 
  rmap notifier is not associated with an mm_struct. So we would need to 
  keep it there. Since we already have a spinlock: Just use it for both to 
  avoid further complications.
 
 But now you are putting a global lock in where it is inappropriate.

The lock is only used during register and unregister. Very low level 
usage.

   XPMEM, would also benefit from a call early.  We could make all the
   segments as being torn down and start the recalls.  We already have
   this code in and working (have since it was first written 6 years ago).
   In this case, all segments are torn down with a single message to each
   of the importing partitions.  In contrast, the teardown code which would
   happen now would be one set of messages for each vma.
  
  So we need an additional global teardown call? Then we'd need to switch 
  off the vma based invalidate_range()?
 
 No, EXACTLY what I originally was asking for, either move this call site
 up, introduce an additional mmu_notifier op, or place this one in two
 locations with a flag indicating which call is being made.

Add a new invalidate_all() call? Then on exit we do

1. invalidate_all()

2. invalidate_range() for each vma

3. release()

We cannot simply move the call up because there will be future range 
callbacks on vma invalidation.


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 1/4] mmu_notifier: Core code

2008-01-25 Thread Robin Holt
 +#define mmu_notifier(function, mm, args...)  \
...
 + if (__mn-ops-function)\
 + __mn-ops-function(__mn,   \
 + mm, \
 + args);  \

__mn-ops-function(__mn, mm, args);
\
I realize it is a minor nit, but since we put the continuation in column
81 in the next define, can we do the same here and make this more
readable?

 + rcu_read_unlock();  \
...
 +#define mmu_rmap_notifier(function, args...) 
 \
 + do {
 \
 + struct mmu_rmap_notifier *__mrn;
 \
 + struct hlist_node *__n; 
 \
 + 
 \



 +void mmu_notifier_release(struct mm_struct *mm)
 +{
 + struct mmu_notifier *mn;
 + struct hlist_node *n;
 +
 + if (unlikely(!hlist_empty(mm-mmu_notifier.head))) {
 + rcu_read_lock();
 + hlist_for_each_entry_rcu(mn, n,
 +   mm-mmu_notifier.head, hlist) {
 + if (mn-ops-release)
 + mn-ops-release(mn, mm);
 + hlist_del(mn-hlist);

I think the hlist_del needs to be before the function callout so we can free
the structure without a use-after-free issue.

hlist_for_each_entry_rcu(mn, n,
  mm-mmu_notifier.head, hlist) {
hlist_del_rcu(mn-hlist);
if (mn-ops-release)
mn-ops-release(mn, mm);



 +static DEFINE_SPINLOCK(mmu_notifier_list_lock);

Remove

 +
 +void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
 +{
 + spin_lock(mmu_notifier_list_lock);

Shouldn't this really be protected by the down_write(mmap_sem)?  Maybe:
BUG_ON(!rwsem_is_write_locked(mm-mmap_sem));

 + hlist_add_head(mn-hlist, mm-mmu_notifier.head);
hlist_add_head_rcu(mn-hlist, mm-mmu_notifier.head);

 + spin_unlock(mmu_notifier_list_lock);
 +}
 +EXPORT_SYMBOL_GPL(mmu_notifier_register);
 +
 +void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
 +{
 + spin_lock(mmu_notifier_list_lock);
 + hlist_del(mn-hlist);

hlist_del_rcu?  Ditto on the lock.

 + spin_unlock(mmu_notifier_list_lock);
 +}
 +EXPORT_SYMBOL_GPL(mmu_notifier_unregister);
 +
 +HLIST_HEAD(mmu_rmap_notifier_list);

static DEFINE_SPINLOCK(mmu_rmap_notifier_list_lock);

 +
 +void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn)
 +{
 + spin_lock(mmu_notifier_list_lock);
 + hlist_add_head_rcu(mrn-hlist, mmu_rmap_notifier_list);
 + spin_unlock(mmu_notifier_list_lock);

spin_lock(mmu_rmap_notifier_list_lock);
hlist_add_head_rcu(mrn-hlist, mmu_rmap_notifier_list);
spin_unlock(mmu_rmap_notifier_list_lock);

 +}
 +EXPORT_SYMBOL(mmu_rmap_notifier_register);
 +
 +void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn)
 +{
 + spin_lock(mmu_notifier_list_lock);
 + hlist_del_rcu(mrn-hlist);
 + spin_unlock(mmu_notifier_list_lock);

spin_lock(mmu_rmap_notifier_list_lock);
hlist_del_rcu(mrn-hlist);
spin_unlock(mmu_rmap_notifier_list_lock);


 @@ -2043,6 +2044,7 @@ void exit_mmap(struct mm_struct *mm)
   vm_unacct_memory(nr_accounted);
   free_pgtables(tlb, vma, FIRST_USER_ADDRESS, 0);
   tlb_finish_mmu(tlb, 0, end);
 + mmu_notifier_release(mm);

Can we consider moving this notifier or introducing an additional notifier
in the release or a flag to this one indicating early/late.

The GRU that Jack is concerned with would benefit from the early in
that it could just invalidate the GRU context and immediately all GRU
TLB entries are invalid.  I believe Jack would like to also be able to
remove his entry from the mmu_notifier list in an effort to avoid the
page and range callouts.

XPMEM would also benefit from an early call.  We could mark all the
segments as being torn down and start the recalls.  We already have
this code in and working (have since it was first written 6 years ago).
In this case, all segments are torn down with a single message to each
of the importing partitions.  In contrast, the teardown code which would
happen now would be one set of messages for each vma.
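
In diff form, what I am asking for is roughly this
(mmu_notifier_release_early is a name I just made up to illustrate;
any of the variants above would do):

 void exit_mmap(struct mm_struct *mm)
 {
 	...
+	/* hypothetical early callout: one global teardown per mm, made
+	 * before any vma is unmapped, so a driver can recall everything
+	 * with a single message and skip the per-vma range callouts */
+	mmu_notifier_release_early(mm);
 	...
 	vm_unacct_memory(nr_accounted);
 	free_pgtables(tlb, vma, FIRST_USER_ADDRESS, 0);
 	tlb_finish_mmu(tlb, 0, end);
 	mmu_notifier_release(mm);
 }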

Thanks,
Robin


Re: [patch 1/4] mmu_notifier: Core code

2008-01-25 Thread Christoph Lameter
On Fri, 25 Jan 2008, Robin Holt wrote:

 I realize it is a minor nit, but since we put the continuation in column
 81 in the next define, can we do the same here and make this more
 readable?

We need to fix the next define to not use column 81.
Found a couple more 80-column infractions. Will be fixed in the next
release.

  +void mmu_notifier_release(struct mm_struct *mm)
  +{
  +   struct mmu_notifier *mn;
  +   struct hlist_node *n;
  +
  +   if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
  +   rcu_read_lock();
  +   hlist_for_each_entry_rcu(mn, n,
  + &mm->mmu_notifier.head, hlist) {
  +   if (mn->ops->release)
  +   mn->ops->release(mn, mm);
  +   hlist_del(&mn->hlist);
 
 I think the hlist_del needs to be before the function callout so we can free
 the structure without a use-after-free issue.

The list head is in the mm_struct. This will be freed later.

  +void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
  +{
  +   spin_lock(&mmu_notifier_list_lock);
 
 Shouldn't this really be protected by the down_write(mmap_sem)?  Maybe:

Ok. We could switch this to mmap_sem protection for the mm_struct but the 
rmap notifier is not associated with an mm_struct. So we would need to 
keep it there. Since we already have a spinlock: Just use it for both to 
avoid further complications.

  +   spin_lock(&mmu_notifier_list_lock);
  +   hlist_del(&mn->hlist);
 
 hlist_del_rcu?  Ditto on the lock.

Peter already mentioned that and I have posted patches that address this 
issue.

  @@ -2043,6 +2044,7 @@ void exit_mmap(struct mm_struct *mm)
  vm_unacct_memory(nr_accounted);
  free_pgtables(tlb, vma, FIRST_USER_ADDRESS, 0);
  tlb_finish_mmu(tlb, 0, end);
  +   mmu_notifier_release(mm);
 
 Can we consider moving this notifier callout, introducing an additional
 notifier in the release path, or adding a flag to this one indicating
 early/late?

There is only one call right now?

 The GRU that Jack is concerned with would benefit from the early callout
 in that it could just invalidate the GRU context, at which point all GRU
 TLB entries are immediately invalid.  I believe Jack would also like to
 be able to remove his entry from the mmu_notifier list in an effort to
 avoid the page and range callouts.

The TLB entries are removed by earlier invalidate_range calls. I would 
think that no TLBs are left at this point. It's simply a matter of 
releasing any still-allocated resources through this callback.
 
 XPMEM would also benefit from an early call.  We could mark all the
 segments as being torn down and start the recalls.  We already have
 this code in and working (have since it was first written 6 years ago).
 In this case, all segments are torn down with a single message to each
 of the importing partitions.  In contrast, the teardown code which would
 happen now would be one set of messages for each vma.

So we need an additional global teardown call? Then we'd need to switch 
off the vma-based invalidate_range()?



Re: [patch 1/4] mmu_notifier: Core code

2008-01-25 Thread Robin Holt
On Fri, Jan 25, 2008 at 10:47:04AM -0800, Christoph Lameter wrote:
 On Fri, 25 Jan 2008, Robin Holt wrote:
 
  I realize it is a minor nit, but since we put the continuation in column
  81 in the next define, can we do the same here and make this more
  readable?
 
 We need to fix the next define to not use column 81.
 Found a couple more 80-column infractions. Will be fixed in the next
 release.
 
   +void mmu_notifier_release(struct mm_struct *mm)
   +{
   + struct mmu_notifier *mn;
   + struct hlist_node *n;
   +
   + if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
   + rcu_read_lock();
   + hlist_for_each_entry_rcu(mn, n,
   +   &mm->mmu_notifier.head, hlist) {
   + if (mn->ops->release)
   + mn->ops->release(mn, mm);
   + hlist_del(&mn->hlist);
  
  I think the hlist_del needs to be before the function callout so we can free
  the structure without a use-after-free issue.
 
 The list head is in the mm_struct. This will be freed later.
 

I meant the structure pointed to by mn.  I assume it is intended that
structure be kmalloc'd as part of a larger structure.  The driver is the
entity which created that structure and should be the one to free it.
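
To illustrate the pattern I have in mind (the mydrv names are invented,
just for illustration):

struct mydrv_mm_state {			/* driver-private, kmalloc'd */
	struct mmu_notifier mn;		/* mn passed to callouts points here */
	/* ... other driver state for this mm ... */
};

static void mydrv_release(struct mmu_notifier *mn, struct mm_struct *mm)
{
	struct mydrv_mm_state *state =
			container_of(mn, struct mydrv_mm_state, mn);

	/* The driver allocated state, so the driver frees it here.  If
	 * mmu_notifier_release() does hlist_del(&mn->hlist) only after
	 * this callout returns, it touches freed memory. */
	kfree(state);
}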

   +void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
   +{
   + spin_lock(&mmu_notifier_list_lock);
  
  Shouldn't this really be protected by the down_write(mmap_sem)?  Maybe:
 
 Ok. We could switch this to mmap_sem protection for the mm_struct but the 
 rmap notifier is not associated with an mm_struct. So we would need to 
 keep it there. Since we already have a spinlock: Just use it for both to 
 avoid further complications.

But now you are putting a global lock in where it is inappropriate.

 
   + spin_lock(&mmu_notifier_list_lock);
   + hlist_del(&mn->hlist);
  
  hlist_del_rcu?  Ditto on the lock.
 
 Peter already mentioned that and I have posted patches that address this 
 issue.
 
   @@ -2043,6 +2044,7 @@ void exit_mmap(struct mm_struct *mm)
 vm_unacct_memory(nr_accounted);
 free_pgtables(tlb, vma, FIRST_USER_ADDRESS, 0);
 tlb_finish_mmu(tlb, 0, end);
   + mmu_notifier_release(mm);
  
  Can we consider moving this notifier callout, introducing an additional
  notifier in the release path, or adding a flag to this one indicating
  early/late?
 
 There is only one call right now?
 
  The GRU that Jack is concerned with would benefit from the early callout
  in that it could just invalidate the GRU context, at which point all GRU
  TLB entries are immediately invalid.  I believe Jack would also like to
  be able to remove his entry from the mmu_notifier list in an effort to
  avoid the page and range callouts.
 
  The TLB entries are removed by earlier invalidate_range calls. I would 
  think that no TLBs are left at this point. It's simply a matter of 
  releasing any still-allocated resources through this callback.

What I was asking for is a way to avoid those numerous callouts for
drivers that can do early cleanup.

  
  XPMEM would also benefit from an early call.  We could mark all the
  segments as being torn down and start the recalls.  We already have
  this code in and working (have since it was first written 6 years ago).
  In this case, all segments are torn down with a single message to each
  of the importing partitions.  In contrast, the teardown code which would
  happen now would be one set of messages for each vma.
 
  So we need an additional global teardown call? Then we'd need to switch 
  off the vma-based invalidate_range()?

No, EXACTLY what I originally was asking for: either move this call site
up, introduce an additional mmu_notifier op, or place this one in two
locations with a flag indicating which call is being made.

Thanks,
Robin


Re: [patch 1/4] mmu_notifier: Core code

2008-01-25 Thread Christoph Lameter
On Fri, 25 Jan 2008, Robin Holt wrote:

 Keep in mind that on a 2048p SSI MPI job starting up, we have 2048 ranks
 doing this at the same time 6 times within their address range.  That
 seems like a lock which could get hot fairly quickly.  It may be for a
 short period during startup and shutdown, but it is there.

Ok. I guess we need to have a __register_mmu_notifier that expects the 
mmap_sem to be held then?

  1. invalidate_all()
 
 That will be fine as long as we can unregister the ops notifier and free
 the structure.  Otherwise, we end up being called needlessly.

No, you cannot do that because there are still callbacks that come later. 
The invalidate_all may lead to invalidate_range() doing nothing for this 
mm. The ops notifier and the freeing of the structure have to wait until 
release().
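
In other words, the most a hypothetical invalidate_all() op could safely
do is mark the driver state dead and make the later callouts cheap,
something like this sketch (the mydrv names are invented):

struct mydrv_mm_state {		/* invented driver structure */
	struct mmu_notifier mn;
	int dead;
};

static void mydrv_invalidate_all(struct mmu_notifier *mn,
				 struct mm_struct *mm)
{
	struct mydrv_mm_state *state =
			container_of(mn, struct mydrv_mm_state, mn);

	/* shoot down every external TLB entry for this mm here, then: */
	state->dead = 1;	/* later invalidate_range()/invalidate_page()
				 * callouts check this flag and return at
				 * once; state is freed only in ->release() */
}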

  2. invalidate_range() for each vma
  
  3. release()
  
  We cannot simply move the call up because there will be future range 
  callbacks on vma invalidation.
 
 I am not sure what this means.  Right now, if you were to notify XPMEM
 the process is exiting, we would take care of all the recalling of pages
 exported by this process, clearing those pages' cache lines from cache,
 and raising memory protections.  I would assume that moving the callout
 earlier would expect the same of every driver.

That does not sync with the current scheme of the invalidate_range() 
hooks. We would have to do a global invalidate early and then place the 
other invalidate_range hooks in such a way that none is called later in 
process exit handling.


Re: [patch 1/4] mmu_notifier: Core code

2008-01-25 Thread Christoph Lameter
Diff so far against V1

- Improve RCU support. (There is now a synchronize_rcu in
mmu_notifier_release, which is bad.)

- Clean compile for !MMU_NOTIFIER

- Use mmap_sem for serializing additions to the mmu_notifier list in the 
mm_struct (but still a global spinlock for the mmu_rmap_notifier; the 
registration function is called only a couple of times).

- 

---
 include/linux/list.h |   14 ++
 include/linux/mm_types.h |2 --
 include/linux/mmu_notifier.h |   39 ---
 mm/mmu_notifier.c|   28 +++-
 4 files changed, 69 insertions(+), 14 deletions(-)

Index: linux-2.6/mm/mmu_notifier.c
===
--- linux-2.6.orig/mm/mmu_notifier.c2008-01-25 12:14:49.0 -0800
+++ linux-2.6/mm/mmu_notifier.c 2008-01-25 12:14:49.0 -0800
@@ -15,17 +15,18 @@
 void mmu_notifier_release(struct mm_struct *mm)
 {
struct mmu_notifier *mn;
-   struct hlist_node *n;
+   struct hlist_node *n, *t;
 
if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
rcu_read_lock();
-   hlist_for_each_entry_rcu(mn, n,
+   hlist_for_each_entry_safe_rcu(mn, n, t,
  &mm->mmu_notifier.head, hlist) {
if (mn->ops->release)
mn->ops->release(mn, mm);
hlist_del(&mn->hlist);
}
rcu_read_unlock();
+   synchronize_rcu();
}
 }
 
@@ -53,24 +54,33 @@ int mmu_notifier_age_page(struct mm_stru
return young;
 }
 
-static DEFINE_SPINLOCK(mmu_notifier_list_lock);
+/*
+ * Note that all notifiers use RCU. The updates are only guaranteed to be
+ * visible to other processes after a RCU quiescent period!
+ */
+void __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+   hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head);
+}
+EXPORT_SYMBOL_GPL(__mmu_notifier_register);
 
 void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
 {
-   spin_lock(&mmu_notifier_list_lock);
-   hlist_add_head(&mn->hlist, &mm->mmu_notifier.head);
-   spin_unlock(&mmu_notifier_list_lock);
+   down_write(&mm->mmap_sem);
+   __mmu_notifier_register(mn, mm);
+   up_write(&mm->mmap_sem);
 }
 EXPORT_SYMBOL_GPL(mmu_notifier_register);
 
 void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
 {
-   spin_lock(&mmu_notifier_list_lock);
-   hlist_del(&mn->hlist);
-   spin_unlock(&mmu_notifier_list_lock);
+   down_write(&mm->mmap_sem);
+   hlist_del_rcu(&mn->hlist);
+   up_write(&mm->mmap_sem);
 }
 EXPORT_SYMBOL_GPL(mmu_notifier_unregister);
 
+static DEFINE_SPINLOCK(mmu_notifier_list_lock);
 HLIST_HEAD(mmu_rmap_notifier_list);
 
 void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn)
Index: linux-2.6/include/linux/list.h
===
--- linux-2.6.orig/include/linux/list.h 2008-01-25 12:14:47.0 -0800
+++ linux-2.6/include/linux/list.h  2008-01-25 12:14:49.0 -0800
@@ -991,6 +991,20 @@ static inline void hlist_add_after_rcu(s
({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \
pos = pos->next)
 
+/**
+ * hlist_for_each_entry_safe_rcu   - iterate over list of given type
+ * @tpos:  the type * to use as a loop cursor.
+ * @pos:   the struct hlist_node to use as a loop cursor.
+ * @n: temporary pointer
+ * @head:  the head for your list.
+ * @member:the name of the hlist_node within the struct.
+ */
+#define hlist_for_each_entry_safe_rcu(tpos, pos, n, head, member)   \
+   for (pos = (head)->first;\
+rcu_dereference(pos) && ({ n = pos->next; 1;}) &&\
+   ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \
+pos = n)
+
 #else
#warning "don't include kernel headers in userspace"
 #endif /* __KERNEL__ */
Index: linux-2.6/include/linux/mm_types.h
===
--- linux-2.6.orig/include/linux/mm_types.h 2008-01-25 12:14:49.0 
-0800
+++ linux-2.6/include/linux/mm_types.h  2008-01-25 12:14:49.0 -0800
@@ -224,9 +224,7 @@ struct mm_struct {
rwlock_t   ioctx_list_lock;
struct kioctx   *ioctx_list;
 
-#ifdef CONFIG_MMU_NOTIFIER
struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
-#endif
 };
 
 #endif /* _LINUX_MM_TYPES_H */
Index: linux-2.6/include/linux/mmu_notifier.h
===
--- linux-2.6.orig/include/linux/mmu_notifier.h 2008-01-25 12:14:49.0 
-0800
+++ linux-2.6/include/linux/mmu_notifier.h  2008-01-25 13:07:54.0 
-0800
@@ -46,6 +46,10 @@ struct mmu_notifier {
 };
 
 struct mmu_notifier_ops {
+   /*
+ 

[patch 1/4] mmu_notifier: Core code

2008-01-24 Thread Christoph Lameter
Core code for mmu notifiers.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]>

---
 include/linux/mm_types.h |8 ++
 include/linux/mmu_notifier.h |  152 +++
 include/linux/page-flags.h   |9 ++
 kernel/fork.c|2 
 mm/Kconfig   |4 +
 mm/Makefile  |1 
 mm/mmap.c|2 
 mm/mmu_notifier.c|   91 +
 8 files changed, 269 insertions(+)

Index: linux-2.6/include/linux/mm_types.h
===
--- linux-2.6.orig/include/linux/mm_types.h 2008-01-24 20:59:17.0 
-0800
+++ linux-2.6/include/linux/mm_types.h  2008-01-24 20:59:19.0 -0800
@@ -153,6 +153,10 @@ struct vm_area_struct {
 #endif
 };
 
+struct mmu_notifier_head {
+   struct hlist_head head;
+};
+
 struct mm_struct {
struct vm_area_struct * mmap;   /* list of VMAs */
struct rb_root mm_rb;
@@ -219,6 +223,10 @@ struct mm_struct {
/* aio bits */
rwlock_t   ioctx_list_lock;
struct kioctx   *ioctx_list;
+
+#ifdef CONFIG_MMU_NOTIFIER
+   struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
+#endif
 };
 
 #endif /* _LINUX_MM_TYPES_H */
Index: linux-2.6/include/linux/mmu_notifier.h
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6/include/linux/mmu_notifier.h  2008-01-24 20:59:19.0 
-0800
@@ -0,0 +1,152 @@
+#ifndef _LINUX_MMU_NOTIFIER_H
+#define _LINUX_MMU_NOTIFIER_H
+
+/*
+ * MMU notifier
+ *
+ * Notifier functions for hardware and software that establish external
+ * references to pages of a Linux system. The notifier calls ensure that
+ * the external mappings are removed when the Linux VM removes memory ranges
+ * or individual pages from a process.
+ *
+ * These fall into two classes
+ *
+ * 1. mmu_notifier
+ *
+ * These are callbacks registered with an mm_struct. If mappings are
+ * removed from an address space then callbacks are performed.
+ * Spinlocks must be held in order to walk the reverse maps, and the
+ * notifications are performed while the spinlock is held.
+ *
+ *
+ * 2. mmu_rmap_notifier
+ *
+ * Callbacks for subsystems that provide their own rmaps. These
+ * need to walk their own rmaps for a page. The invalidate_page
+ * callback is outside of locks so that we are not in a strictly
+ * atomic context (but we may be in a PF_MEMALLOC context if the
+ * notifier is called from reclaim code) and are able to sleep.
+ * Rmap notifiers need an extra page bit and are only available
+ * on 64 bit platforms. It is up to the subsystem to mark pages
+ * as PageExternalRmap as needed to trigger the callbacks. Pages
+ * must be marked dirty if dirty bits are set in the external
+ * pte.
+ */
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/rcupdate.h>
+#include <linux/mm_types.h>
+
+struct mmu_notifier_ops;
+
+struct mmu_notifier {
+   struct hlist_node hlist;
+   const struct mmu_notifier_ops *ops;
+};
+
+struct mmu_notifier_ops {
+   void (*release)(struct mmu_notifier *mn,
+   struct mm_struct *mm);
+   int (*age_page)(struct mmu_notifier *mn,
+   struct mm_struct *mm,
+   unsigned long address);
+   void (*invalidate_page)(struct mmu_notifier *mn,
+   struct mm_struct *mm,
+   unsigned long address);
+   void (*invalidate_range)(struct mmu_notifier *mn,
+struct mm_struct *mm,
+unsigned long start, unsigned long end);
+};
+
+struct mmu_rmap_notifier_ops;
+
+struct mmu_rmap_notifier {
+   struct hlist_node hlist;
+   const struct mmu_rmap_notifier_ops *ops;
+};
+
+struct mmu_rmap_notifier_ops {
+   /*
+* Called with the page lock held after ptes are modified or removed
+* so that a subsystem with its own rmaps can remove remote ptes
+* mapping a page.
+*/
+   void (*invalidate_page)(struct mmu_rmap_notifier *mrn, struct page *page);
+};
+
+#ifdef CONFIG_MMU_NOTIFIER
+
+extern void mmu_notifier_register(struct mmu_notifier *mn,
+ struct mm_struct *mm);
+extern void mmu_notifier_unregister(struct mmu_notifier *mn,
+   struct mm_struct *mm);
+extern void mmu_notifier_release(struct mm_struct *mm);
+extern int mmu_notifier_age_page(struct mm_struct *mm,
+unsigned long address);
+
+static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh)
+{
+   INIT_HLIST_HEAD(&mnh->head);
+}
+
+#define mmu_notifier(function, mm, args...)\
+   do {\
+ 
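
For reference, a minimal sketch of how a driver would hook into the
interface above (all mydrv names are invented for illustration):

static void mydrv_invalidate_range(struct mmu_notifier *mn,
				   struct mm_struct *mm,
				   unsigned long start, unsigned long end)
{
	/* shoot down the device's TLB entries covering [start, end) */
}

static void mydrv_release(struct mmu_notifier *mn, struct mm_struct *mm)
{
	/* mm is going away: drop all state tied to it */
}

static const struct mmu_notifier_ops mydrv_ops = {
	.release	  = mydrv_release,
	.invalidate_range = mydrv_invalidate_range,
};

static struct mmu_notifier mydrv_notifier = { .ops = &mydrv_ops };

/* call once the device is bound to a process address space */
static void mydrv_attach(struct mm_struct *mm)
{
	mmu_notifier_register(&mydrv_notifier, mm);
}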
