[kvm-devel] [PATCH] mmu notifiers #v8
Difference between #v7 and #v8: 1) s/age_page/clear_flush_young/ (Nick's suggestion) 2) macro fix (Andrew) 3) move release before final unmap_vmas (for GRU, Jack/Christoph) 4) microoptimize mmu_notifier_unregister (Christoph) 5) use mmap_sem for registration serialization (Christoph) The (void)xxx in macros doesn't work with "args". Christoph's solution look best in avoiding warnings, even if it forces to make the mmu notifier operation structure visible even if MMU_NOTIFIER=n (that's the only downside). I didn't drop invalidate_page, because invalidate_range_begin/end would be slower for usages like KVM/GRU (we don't need a begin/end there because where invalidate_page is called, the VM holds a reference on the page). do_wp_page should also use invalidate_page since it can free the page after dropping the PT lock without losing any performance (that's not true for the places where invalidate_range is called). It'd be nice if everyone involved can agree to converge on this API for .25. KVM/GRU (and perhaps Quadrics) and similar usages will be fully covered in .25. This is a kernel internal API so there's no problem if all the methods will become sleep capable only starting only in .26. The brainer part of the VM work to do to make it sleep capable is pretty much orthogonal with this patch. Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]> Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -10,6 +10,7 @@ #include #include #include +#include #include #include @@ -228,6 +229,8 @@ struct mm_struct { #ifdef CONFIG_CGROUP_MEM_CONT struct mem_cgroup *mem_cgroup; #endif + + struct mmu_notifier_head mmu_notifier; /* MMU notifier list */ }; #endif /* _LINUX_MM_TYPES_H */ diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h new file mode 100644 --- /dev/null +++ b/include/linux/mmu_notifier.h @@ -0,0 +1,161 @@ +#ifndef _LINUX_MMU_NOTIFIER_H +#define _LINUX_MMU_NOTIFIER_H + +#include +#include + +struct mmu_notifier; + +struct mmu_notifier_ops { + /* +* Called when nobody can register any more notifier in the mm +* and after the "mn" notifier has been disarmed already. +*/ + void (*release)(struct mmu_notifier *mn, + struct mm_struct *mm); + + /* +* clear_flush_young is called after the VM is +* test-and-clearing the young/accessed bitflag in the +* pte. This way the VM will provide proper aging to the +* accesses to the page through the secondary MMUs and not +* only to the ones through the Linux pte. +*/ + int (*clear_flush_young)(struct mmu_notifier *mn, +struct mm_struct *mm, +unsigned long address); + + /* +* Before this is invoked any secondary MMU is still ok to +* read/write to the page previously pointed by the Linux pte +* because the old page hasn't been freed yet. If required +* set_page_dirty has to be called internally to this method. +*/ + void (*invalidate_page)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long address); + + /* +* invalidate_range_begin() and invalidate_range_end() must be +* paired. Multiple invalidate_range_begin/ends may be nested +* or called concurrently. +*/ + void (*invalidate_range_begin)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long start, unsigned long end); + void (*invalidate_range_end)(struct mmu_notifier *mn, +struct mm_struct *mm, +unsigned long start, unsigned long end); +}; + +struct mmu_notifier { + struct hlist_node hlist; + const struct mmu_notifier_ops *ops; +}; + +#ifdef CONFIG_MMU_NOTIFIER + +struct mmu_notifier_head { + struct hlist_head head; +}; + +#include + +/* + * Must hold the mmap_sem for write. + * + * RCU is used to traverse the list. A quiescent period needs to pass + * before the notifier is guaranteed to be visible to all threads. + */ +extern void mmu_notifier_register(struct mmu_notifier *mn, + struct mm_struct *mm); +/* + * Must hold the mmap_sem for write. + * + * RCU is used to traverse the list. A quiescent period needs to pass + * before the "struct mmu_notifier" can be freed. Alternatively it + * can be synchronously freed inside ->release when the list can't + * change anymore and nobody could possibly walk it. + */ +extern void mmu_notifier_unregister(struct mmu_notifier *mn, + struct mm_struct *mm); +extern void mmu_notifier_release(struct mm
Re: [kvm-devel] [PATCH] mmu notifiers #v8
On Sun, Mar 02, 2008 at 04:54:57PM +0100, Andrea Arcangeli wrote: > Difference between #v7 and #v8: > > 1) s/age_page/clear_flush_young/ (Nick's suggestion) > 2) macro fix (Andrew) > 3) move release before final unmap_vmas (for GRU, Jack/Christoph) > 4) microoptimize mmu_notifier_unregister (Christoph) > 5) use mmap_sem for registration serialization (Christoph) > > The (void)xxx in macros doesn't work with "args". Christoph's solution > look best in avoiding warnings, even if it forces to make the mmu > notifier operation structure visible even if MMU_NOTIFIER=n (that's > the only downside). I have a couple of "cleanup" patches that change the structure of this to something I prefer. Others may not, but I'll post them for debate anyway. > I didn't drop invalidate_page, because invalidate_range_begin/end > would be slower for usages like KVM/GRU (we don't need a begin/end > there because where invalidate_page is called, the VM holds a > reference on the page). do_wp_page should also use invalidate_page > since it can free the page after dropping the PT lock without losing > any performance (that's not true for the places where invalidate_range > is called). I'm still not completely happy with this. I had a very quick look at the GRU driver, but I don't see why it can't be implemented more like the regular TLB model, and have TLB insertions depend on the linux pte, and do invalidates _after_ restricting permissions to the pte. Ie. I'd still like to get rid of invalidate_range_begin, and get rid of invalidate calls from places where permissions are relaxed. > It'd be nice if everyone involved can agree to converge on this API > for .25. KVM/GRU (and perhaps Quadrics) and similar usages will be > fully covered in .25. If we can agree on the API, then I don't see any reason why it can't go into 2.6.25, unless someome wants more time to review it (but 2.6.25 release should be quite far away still so there should be quite a bit of time). - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
Re: [kvm-devel] [PATCH] mmu notifiers #v8
On Sun, Mar 02, 2008 at 04:54:57PM +0100, Andrea Arcangeli wrote: > Difference between #v7 and #v8: [patch] mmu-v8: demacro Remove the macros from mmu_notifier.h, in favour of functions. This requires untangling the include order circular dependencies as well, so just remove struct mmu_notifier_head in favour of just using the hlist in mm_struct. Signed-off-by: Nick Piggin <[EMAIL PROTECTED]> --- Index: linux-2.6/include/linux/mmu_notifier.h === --- linux-2.6.orig/include/linux/mmu_notifier.h +++ linux-2.6/include/linux/mmu_notifier.h @@ -55,12 +55,13 @@ struct mmu_notifier { #ifdef CONFIG_MMU_NOTIFIER -struct mmu_notifier_head { - struct hlist_head head; -}; - #include +static inline int mm_has_notifiers(struct mm_struct *mm) +{ + return unlikely(!hlist_empty(&mm->mmu_notifier_list)); +} + /* * Must hold the mmap_sem for write. * @@ -79,33 +80,59 @@ extern void mmu_notifier_register(struct */ extern void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm); -extern void mmu_notifier_release(struct mm_struct *mm); -extern int mmu_notifier_clear_flush_young(struct mm_struct *mm, + +extern void __mmu_notifier_release(struct mm_struct *mm); +extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm, unsigned long address); +extern void __mmu_notifier_invalidate_page(struct mm_struct *mm, + unsigned long address); +extern void __mmu_notifier_invalidate_range_begin(struct mm_struct *mm, + unsigned long start, unsigned long end); +extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, + unsigned long start, unsigned long end); + + +static inline void mmu_notifier_release(struct mm_struct *mm) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_release(mm); +} + +static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm, + unsigned long address) +{ + if (mm_has_notifiers(mm)) + return __mmu_notifier_clear_flush_young(mm, address); + return 0; +} + +static inline void mmu_notifier_invalidate_page(struct mm_struct *mm, + unsigned long address) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_invalidate_page(mm, address); +} + +static inline void mmu_notifier_invalidate_range_begin(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_invalidate_range_begin(mm, start, end); +} + +static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_invalidate_range_end(mm, start, end); +} -static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh) +static inline void mmu_notifier_mm_init(struct mm_struct *mm) { - INIT_HLIST_HEAD(&mnh->head); + INIT_HLIST_HEAD(&mm->mmu_notifier_list); } -#define mmu_notifier(function, mm, args...)\ - do {\ - struct mmu_notifier *__mn; \ - struct hlist_node *__n; \ - struct mm_struct * __mm = mm; \ - \ - if (unlikely(!hlist_empty(&__mm->mmu_notifier.head))) { \ - rcu_read_lock();\ - hlist_for_each_entry_rcu(__mn, __n, \ -&__mm->mmu_notifier.head, \ -hlist) \ - if (__mn->ops->function)\ - __mn->ops->function(__mn, \ - __mm, \ - args); \ - rcu_read_unlock(); \ - } \ - } while (0) + #define ptep_clear_flush_notify(__vma, __address, __ptep) \ ({ \ @@ -113,7 +140,7 @@ static inline void mmu_notifier_head_ini struct vm_area_struct * ___vma = __vma; \ unsigned long ___address = __address; \ __pte = ptep_clear_flush(___vma, ___address, __ptep); \
Re: [kvm-devel] [PATCH] mmu notifiers #v8
On Sun, Mar 02, 2008 at 04:54:57PM +0100, Andrea Arcangeli wrote: > Difference between #v7 and #v8: This one on top of the previous patch [patch] mmu-v8: typesafe Move definition of struct mmu_notifier and struct mmu_notifier_ops under CONFIG_MMU_NOTIFIER to ensure they doesn't get dereferenced when they don't make sense. Signed-off-by: Nick Piggin <[EMAIL PROTECTED]> --- Index: linux-2.6/include/linux/mmu_notifier.h === --- linux-2.6.orig/include/linux/mmu_notifier.h +++ linux-2.6/include/linux/mmu_notifier.h @@ -3,8 +3,12 @@ #include #include +#include struct mmu_notifier; +struct mmu_notifier_ops; + +#ifdef CONFIG_MMU_NOTIFIER struct mmu_notifier_ops { /* @@ -53,10 +57,6 @@ struct mmu_notifier { const struct mmu_notifier_ops *ops; }; -#ifdef CONFIG_MMU_NOTIFIER - -#include - static inline int mm_has_notifiers(struct mm_struct *mm) { return unlikely(!hlist_empty(&mm->mmu_notifier_list)); - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
Re: [kvm-devel] [PATCH] mmu notifiers #v8
On Sun, Mar 02, 2008 at 04:54:57PM +0100, Andrea Arcangeli wrote: > Difference between #v7 and #v8: Here is just a couple of checkpatch fixes on top of the last patches. Index: linux-2.6/include/linux/mmu_notifier.h === --- linux-2.6.orig/include/linux/mmu_notifier.h +++ linux-2.6/include/linux/mmu_notifier.h @@ -46,7 +46,7 @@ struct mmu_notifier_ops { */ void (*invalidate_range_begin)(struct mmu_notifier *mn, struct mm_struct *mm, - unsigned long start, unsigned long end); + unsigned long start, unsigned long end); void (*invalidate_range_end)(struct mmu_notifier *mn, struct mm_struct *mm, unsigned long start, unsigned long end); @@ -137,7 +137,7 @@ static inline void mmu_notifier_mm_init( #define ptep_clear_flush_notify(__vma, __address, __ptep) \ ({ \ pte_t __pte;\ - struct vm_area_struct * ___vma = __vma; \ + struct vm_area_struct *___vma = __vma; \ unsigned long ___address = __address; \ __pte = ptep_clear_flush(___vma, ___address, __ptep); \ mmu_notifier_invalidate_page(___vma->vm_mm, ___address);\ @@ -147,7 +147,7 @@ static inline void mmu_notifier_mm_init( #define ptep_clear_flush_young_notify(__vma, __address, __ptep) \ ({ \ int __young;\ - struct vm_area_struct * ___vma = __vma; \ + struct vm_area_struct *___vma = __vma; \ unsigned long ___address = __address; \ __young = ptep_clear_flush_young(___vma, ___address, __ptep); \ __young |= mmu_notifier_clear_flush_young(___vma->vm_mm,\ - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
Re: [kvm-devel] [PATCH] mmu notifiers #v8
On Mon, Mar 03, 2008 at 04:29:34AM +0100, Nick Piggin wrote: > to something I prefer. Others may not, but I'll post them for debate > anyway. Sure, thanks! > > I didn't drop invalidate_page, because invalidate_range_begin/end > > would be slower for usages like KVM/GRU (we don't need a begin/end > > there because where invalidate_page is called, the VM holds a > > reference on the page). do_wp_page should also use invalidate_page > > since it can free the page after dropping the PT lock without losing > > any performance (that's not true for the places where invalidate_range > > is called). > > I'm still not completely happy with this. I had a very quick look > at the GRU driver, but I don't see why it can't be implemented > more like the regular TLB model, and have TLB insertions depend on > the linux pte, and do invalidates _after_ restricting permissions > to the pte. > > Ie. I'd still like to get rid of invalidate_range_begin, and get > rid of invalidate calls from places where permissions are relaxed. _begin exists because by the time _end is called, the VM already dropped the reference on the page. This way we can do a single invalidate no matter how large the range is. I don't see ways to remove _begin while still invoking _end a single time for the whole range. > If we can agree on the API, then I don't see any reason why it can't > go into 2.6.25, unless someome wants more time to review it (but > 2.6.25 release should be quite far away still so there should be quite > a bit of time). Cool! ;) - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
Re: [kvm-devel] [PATCH] mmu notifiers #v8
On Mon, Mar 03, 2008 at 01:51:53PM +0100, Andrea Arcangeli wrote: > On Mon, Mar 03, 2008 at 04:29:34AM +0100, Nick Piggin wrote: > > to something I prefer. Others may not, but I'll post them for debate > > anyway. > > Sure, thanks! > > > > I didn't drop invalidate_page, because invalidate_range_begin/end > > > would be slower for usages like KVM/GRU (we don't need a begin/end > > > there because where invalidate_page is called, the VM holds a > > > reference on the page). do_wp_page should also use invalidate_page > > > since it can free the page after dropping the PT lock without losing > > > any performance (that's not true for the places where invalidate_range > > > is called). > > > > I'm still not completely happy with this. I had a very quick look > > at the GRU driver, but I don't see why it can't be implemented > > more like the regular TLB model, and have TLB insertions depend on > > the linux pte, and do invalidates _after_ restricting permissions > > to the pte. > > > > Ie. I'd still like to get rid of invalidate_range_begin, and get > > rid of invalidate calls from places where permissions are relaxed. > > _begin exists because by the time _end is called, the VM already > dropped the reference on the page. This way we can do a single > invalidate no matter how large the range is. I don't see ways to > remove _begin while still invoking _end a single time for the whole > range. Is this just a GRU problem? Can't we just require them to take a ref on the page (IIRC Jack said GRU could be changed to more like a TLB model). - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
Re: [kvm-devel] [PATCH] mmu notifiers #v8
On Mon, Mar 03, 2008 at 02:10:17PM +0100, Nick Piggin wrote: > Is this just a GRU problem? Can't we just require them to take a ref > on the page (IIRC Jack said GRU could be changed to more like a TLB > model). Yes, it's just a GRU problem, it tries to optimize performance by calling follow_page only in the fast path, and fallbacks to get_user_pages; put_page in the slow path. xpmem could also send the message in _begin and wait the message in _end, to reduce the wait time. But if you forge GRU to call get_user_pages only (like KVM does), the _begin can be removed. In theory we could also optimize KVM to use follow_page only if the pte is already established. I'm not sure how much that is a worthwhile optimization though. However note that Quadrics also had a callback before and one after, so they may be using the callback before for similar optimizations. But functionality-wise _end is the only required bit if everyone takes refcounts like KVM and XPMEM do. - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
Re: [kvm-devel] [PATCH] mmu notifiers #v8
On Mon, Mar 03, 2008 at 02:10:17PM +0100, Nick Piggin wrote: > On Mon, Mar 03, 2008 at 01:51:53PM +0100, Andrea Arcangeli wrote: > > On Mon, Mar 03, 2008 at 04:29:34AM +0100, Nick Piggin wrote: > > > to something I prefer. Others may not, but I'll post them for debate > > > anyway. > > > > Sure, thanks! > > > > > > I didn't drop invalidate_page, because invalidate_range_begin/end > > > > would be slower for usages like KVM/GRU (we don't need a begin/end > > > > there because where invalidate_page is called, the VM holds a > > > > reference on the page). do_wp_page should also use invalidate_page > > > > since it can free the page after dropping the PT lock without losing > > > > any performance (that's not true for the places where invalidate_range > > > > is called). > > > > > > I'm still not completely happy with this. I had a very quick look > > > at the GRU driver, but I don't see why it can't be implemented > > > more like the regular TLB model, and have TLB insertions depend on > > > the linux pte, and do invalidates _after_ restricting permissions > > > to the pte. > > > > > > Ie. I'd still like to get rid of invalidate_range_begin, and get > > > rid of invalidate calls from places where permissions are relaxed. > > > > _begin exists because by the time _end is called, the VM already > > dropped the reference on the page. This way we can do a single > > invalidate no matter how large the range is. I don't see ways to > > remove _begin while still invoking _end a single time for the whole > > range. > > Is this just a GRU problem? Can't we just require them to take a ref > on the page (IIRC Jack said GRU could be changed to more like a TLB > model). Maintaining a long-term reference on a page is a problem. The GRU does not currently maintain tables to track the pages for which dropins have been done. The GRU has a large internal TLB and is designed to reference up to 8PB of memory. The size of the tables to track this many referenced pages would be a problem (at best). - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
Re: [kvm-devel] [PATCH] mmu notifiers #v8
On Mon, Mar 03, 2008 at 09:18:59AM -0600, Jack Steiner wrote: > On Mon, Mar 03, 2008 at 02:10:17PM +0100, Nick Piggin wrote: > > On Mon, Mar 03, 2008 at 01:51:53PM +0100, Andrea Arcangeli wrote: > > > On Mon, Mar 03, 2008 at 04:29:34AM +0100, Nick Piggin wrote: > > > > to something I prefer. Others may not, but I'll post them for debate > > > > anyway. > > > > > > Sure, thanks! > > > > > > > > I didn't drop invalidate_page, because invalidate_range_begin/end > > > > > would be slower for usages like KVM/GRU (we don't need a begin/end > > > > > there because where invalidate_page is called, the VM holds a > > > > > reference on the page). do_wp_page should also use invalidate_page > > > > > since it can free the page after dropping the PT lock without losing > > > > > any performance (that's not true for the places where invalidate_range > > > > > is called). > > > > > > > > I'm still not completely happy with this. I had a very quick look > > > > at the GRU driver, but I don't see why it can't be implemented > > > > more like the regular TLB model, and have TLB insertions depend on > > > > the linux pte, and do invalidates _after_ restricting permissions > > > > to the pte. > > > > > > > > Ie. I'd still like to get rid of invalidate_range_begin, and get > > > > rid of invalidate calls from places where permissions are relaxed. > > > > > > _begin exists because by the time _end is called, the VM already > > > dropped the reference on the page. This way we can do a single > > > invalidate no matter how large the range is. I don't see ways to > > > remove _begin while still invoking _end a single time for the whole > > > range. > > > > Is this just a GRU problem? Can't we just require them to take a ref > > on the page (IIRC Jack said GRU could be changed to more like a TLB > > model). > > Maintaining a long-term reference on a page is a problem. The GRU does not > currently maintain tables to track the pages for which dropins have been done. > > The GRU has a large internal TLB and is designed to reference up to 8PB of > memory. The size of the tables to track this many referenced pages would be > a problem (at best). Is it any worse a problem than the pagetables of the processes which have their virtual memory exported to GRU? AFAIKS, no; it is on the same magnitude of difficulty. So you could do it without introducing any fundamental problem (memory usage might be increased by some constant factor, but I think we can cope with that in order to make the core patch really nice and simple). It is going to be really easy to add more weird and wonderful notifiers later that deviate from our standard TLB model. It would be much harder to remove them. So I really want to see everyone conform to this model first. Numbers and comparisons can be brought out afterwards if people want to attempt to make such changes. - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
Re: [kvm-devel] [PATCH] mmu notifiers #v8
On Mon, Mar 03, 2008 at 05:59:10PM +0100, Nick Piggin wrote: > On Mon, Mar 03, 2008 at 09:18:59AM -0600, Jack Steiner wrote: > > On Mon, Mar 03, 2008 at 02:10:17PM +0100, Nick Piggin wrote: > > > On Mon, Mar 03, 2008 at 01:51:53PM +0100, Andrea Arcangeli wrote: > > > > On Mon, Mar 03, 2008 at 04:29:34AM +0100, Nick Piggin wrote: > > > > > to something I prefer. Others may not, but I'll post them for debate > > > > > anyway. > > > > > > > > Sure, thanks! > > > > > > > > > > I didn't drop invalidate_page, because invalidate_range_begin/end > > > > > > would be slower for usages like KVM/GRU (we don't need a begin/end > > > > > > there because where invalidate_page is called, the VM holds a > > > > > > reference on the page). do_wp_page should also use invalidate_page > > > > > > since it can free the page after dropping the PT lock without losing > > > > > > any performance (that's not true for the places where > > > > > > invalidate_range > > > > > > is called). > > > > > > > > > > I'm still not completely happy with this. I had a very quick look > > > > > at the GRU driver, but I don't see why it can't be implemented > > > > > more like the regular TLB model, and have TLB insertions depend on > > > > > the linux pte, and do invalidates _after_ restricting permissions > > > > > to the pte. > > > > > > > > > > Ie. I'd still like to get rid of invalidate_range_begin, and get > > > > > rid of invalidate calls from places where permissions are relaxed. > > > > > > > > _begin exists because by the time _end is called, the VM already > > > > dropped the reference on the page. This way we can do a single > > > > invalidate no matter how large the range is. I don't see ways to > > > > remove _begin while still invoking _end a single time for the whole > > > > range. The range invalidates have a performance advantage for the GRU. TLB invalidates on the GRU are relatively slow (usec) and interfere somewhat with the performance of other active GRU instructions. Invalidating a large chunk of addresses with a single GRU TLBINVAL operation is must faster than issuing a stream of single page TLBINVALs. I expect this performance advantage will also apply to other users of mmuops. > > > > > > Is this just a GRU problem? Can't we just require them to take a ref > > > on the page (IIRC Jack said GRU could be changed to more like a TLB > > > model). > > > > Maintaining a long-term reference on a page is a problem. The GRU does not > > currently maintain tables to track the pages for which dropins have been > > done. > > > > The GRU has a large internal TLB and is designed to reference up to 8PB of > > memory. The size of the tables to track this many referenced pages would be > > a problem (at best). > > Is it any worse a problem than the pagetables of the processes which have > their virtual memory exported to GRU? AFAIKS, no; it is on the same > magnitude of difficulty. So you could do it without introducing any > fundamental problem (memory usage might be increased by some constant > factor, but I think we can cope with that in order to make the core patch > really nice and simple). Functionally, the GRU is very close to what I would consider to be the "standard TLB" model. Dropins and flushs map closely to processor dropins and flushes for cpus. The internal structure of the GRU TLB is identical to the TLB of existing cpus. Requiring the GRU driver to track dropins with long term page references seems to me a deviation from having the basic mmuops support a "standard TLB" model. AFAIK, no other processor requires this. Tracking TLB dropins (and long term page references) could be done but it adds significant complexity and scaling issues. The size of the tables to track many TB (to PB) of memory can get large. If the memory is being referenced by highly threaded applications, then the problem becomes even more complex. Either tables must be replicated per-thread (and require even more memory), or the table structure becomes even more complex to deal with node locality, cacheline bouncing, etc. Try to avoid a requirement to track dropins with long term page references. > It is going to be really easy to add more weird and wonderful notifiers > later that deviate from our standard TLB model. It would be much harder to > remove them. So I really want to see everyone conform to this model first. Agree. - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
Re: [kvm-devel] [PATCH] mmu notifiers #v8
Jack Steiner wrote: > The range invalidates have a performance advantage for the GRU. TLB > invalidates > on the GRU are relatively slow (usec) and interfere somewhat with the > performance > of other active GRU instructions. Invalidating a large chunk of addresses with > a single GRU TLBINVAL operation is must faster than issuing a stream of single > page TLBINVALs. > > I expect this performance advantage will also apply to other users of mmuops. > In theory this would apply to kvm as well (coalesce tlb flush IPIs, lookup shadow page table once), but is it really a fast path? What triggers range operations for your use cases? -- error compiling committee.c: too many arguments to function - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
Re: [kvm-devel] [PATCH] mmu notifiers #v8
On Mon, Mar 03, 2008 at 08:09:49PM +0200, Avi Kivity wrote: > Jack Steiner wrote: > >The range invalidates have a performance advantage for the GRU. TLB > >invalidates > >on the GRU are relatively slow (usec) and interfere somewhat with the > >performance > >of other active GRU instructions. Invalidating a large chunk of addresses > >with > >a single GRU TLBINVAL operation is must faster than issuing a stream of > >single > >page TLBINVALs. > > > >I expect this performance advantage will also apply to other users of > >mmuops. > > > > In theory this would apply to kvm as well (coalesce tlb flush IPIs, > lookup shadow page table once), but is it really a fast path? What > triggers range operations for your use cases? Although not frequent, an unmap of a multiple TB object could be quite painful if each page was invalidated individually instead of 1 invalidate for the entire range. This is even worse if the application is threaded and the object has been reference by many GRUs (there are 16 GRU ports per node - each potentially has to be invalidated). Forks (again, not frequent) would be another case. - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
Re: [kvm-devel] [PATCH] mmu notifiers #v8
On Mon, Mar 03, 2008 at 12:06:05PM -0600, Jack Steiner wrote: > On Mon, Mar 03, 2008 at 05:59:10PM +0100, Nick Piggin wrote: > > > Maintaining a long-term reference on a page is a problem. The GRU does not > > > currently maintain tables to track the pages for which dropins have been > > > done. > > > > > > The GRU has a large internal TLB and is designed to reference up to 8PB of > > > memory. The size of the tables to track this many referenced pages would > > > be > > > a problem (at best). > > > > Is it any worse a problem than the pagetables of the processes which have > > their virtual memory exported to GRU? AFAIKS, no; it is on the same > > magnitude of difficulty. So you could do it without introducing any > > fundamental problem (memory usage might be increased by some constant > > factor, but I think we can cope with that in order to make the core patch > > really nice and simple). > > Functionally, the GRU is very close to what I would consider to be the > "standard TLB" model. Dropins and flushs map closely to processor dropins > and flushes for cpus. The internal structure of the GRU TLB is identical to > the TLB of existing cpus. Requiring the GRU driver to track dropins with > long term page references seems to me a deviation from having the basic > mmuops support a "standard TLB" model. AFAIK, no other processor requires > this. That is because the CPU TLBs have the mmu_gather batching APIs which avoid the problem. It would be possible to do something similar for GRU which would involve taking a reference for each page-to-be-invalidated in invalidate_page, and release them when you invalidate_range. Or else do some other scheme which makes mmu notifiers work similarly to the mmu gather API. But not just go an invent something completely different in the form of this invalidate_begin,clear linux pte,invalidate_end API. > Tracking TLB dropins (and long term page references) could be done but it > adds significant complexity and scaling issues. The size of the tables to > track many TB (to PB) of memory can get large. If the memory is being > referenced by highly threaded applications, then the problem becomes even > more complex. Either tables must be replicated per-thread (and require even > more memory), or the table structure becomes even more complex to deal with > node locality, cacheline bouncing, etc. I don't think it would be that significant in terms of complexity or scaling. For a quick solution, you could stick a radix tree in each of your mmu notifiers registered (ie. one per mm), which is indexed on virtual address >> PAGE_SHIFT, and returns the struct page *. Size is no different than page tables, and locking is pretty scalable. After that, I would really like to see whether the numbers justify larger changes. - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
Re: [kvm-devel] [PATCH] mmu notifiers #v8
On Mon, 3 Mar 2008, Nick Piggin wrote: > I'm still not completely happy with this. I had a very quick look > at the GRU driver, but I don't see why it can't be implemented > more like the regular TLB model, and have TLB insertions depend on > the linux pte, and do invalidates _after_ restricting permissions > to the pte. > > Ie. I'd still like to get rid of invalidate_range_begin, and get > rid of invalidate calls from places where permissions are relaxed. Isnt this more a job for paravirt ops if it is so tightly bound to page tables? Are we not adding another similar API? > If we can agree on the API, then I don't see any reason why it can't > go into 2.6.25, unless someome wants more time to review it (but > 2.6.25 release should be quite far away still so there should be quite > a bit of time). API still has rcu issues and the example given for making things sleepable is only working for the aging callback. The most important callback is for try_to_unmao and page_mkclean. This means the API is still not generic enough and likely not extendable as needed in its present form. - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
Re: [kvm-devel] [PATCH] mmu notifiers #v8
On Mon, 3 Mar 2008, Nick Piggin wrote: > It is going to be really easy to add more weird and wonderful notifiers > later that deviate from our standard TLB model. It would be much harder to > remove them. So I really want to see everyone conform to this model first. > Numbers and comparisons can be brought out afterwards if people want to > attempt to make such changes. Still do not see how that could be done. The model here is tightly bound to ptes. AFAICT this could be implemented in arch code like the paravirt ops. - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
Re: [kvm-devel] [PATCH] mmu notifiers #v8
I like the patch. - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
Re: [kvm-devel] [PATCH] mmu notifiers #v8
On Mon, 3 Mar 2008, Nick Piggin wrote: > Move definition of struct mmu_notifier and struct mmu_notifier_ops under > CONFIG_MMU_NOTIFIER to ensure they doesn't get dereferenced when they > don't make sense. The callbacks take a mmu_notifier parameter. So how does this compile for !MMU_NOTIFIER? - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
Re: [kvm-devel] [PATCH] mmu notifiers #v8
On Mon, Mar 03, 2008 at 07:45:17PM +0100, Nick Piggin wrote: > On Mon, Mar 03, 2008 at 12:06:05PM -0600, Jack Steiner wrote: > > On Mon, Mar 03, 2008 at 05:59:10PM +0100, Nick Piggin wrote: > > > > Maintaining a long-term reference on a page is a problem. The GRU does > > > > not > > > > currently maintain tables to track the pages for which dropins have > > > > been done. > > > > > > > > The GRU has a large internal TLB and is designed to reference up to 8PB > > > > of > > > > memory. The size of the tables to track this many referenced pages > > > > would be > > > > a problem (at best). > > > > > > Is it any worse a problem than the pagetables of the processes which have > > > their virtual memory exported to GRU? AFAIKS, no; it is on the same > > > magnitude of difficulty. So you could do it without introducing any > > > fundamental problem (memory usage might be increased by some constant > > > factor, but I think we can cope with that in order to make the core patch > > > really nice and simple). > > > > Functionally, the GRU is very close to what I would consider to be the > > "standard TLB" model. Dropins and flushs map closely to processor dropins > > and flushes for cpus. The internal structure of the GRU TLB is identical to > > the TLB of existing cpus. Requiring the GRU driver to track dropins with > > long term page references seems to me a deviation from having the basic > > mmuops support a "standard TLB" model. AFAIK, no other processor requires > > this. > > That is because the CPU TLBs have the mmu_gather batching APIs which > avoid the problem. It would be possible to do something similar for > GRU which would involve taking a reference for each page-to-be-invalidated > in invalidate_page, and release them when you invalidate_range. Or else > do some other scheme which makes mmu notifiers work similarly to the > mmu gather API. But not just go an invent something completely different > in the form of this invalidate_begin,clear linux pte,invalidate_end API. Correct. If the mmu_gather were passed on the mmuops callout and the callout were done at the same point as the tlb_finish_mmu(), the GRU could efficiently work w/o the range invalidates. A range invalidate might still be slightly more efficient but not measureable so. The net difference is not worth the extra complexity of range callouts. > > > > Tracking TLB dropins (and long term page references) could be done but it > > adds significant complexity and scaling issues. The size of the tables to > > track many TB (to PB) of memory can get large. If the memory is being > > referenced by highly threaded applications, then the problem becomes even > > more complex. Either tables must be replicated per-thread (and require even > > more memory), or the table structure becomes even more complex to deal with > > node locality, cacheline bouncing, etc. > > I don't think it would be that significant in terms of complexity or > scaling. > > For a quick solution, you could stick a radix tree in each of your mmu > notifiers registered (ie. one per mm), which is indexed on virtual address > >> PAGE_SHIFT, and returns the struct page *. Size is no different than > page tables, and locking is pretty scalable. > > After that, I would really like to see whether the numbers justify > larger changes. I'm still concerned about performance. Each dropin would first have to access an additional data structure that would most likely be non-node-local and non-cache-resident. The net effect would be measurable but not a killer. I haven't thought about locking requirements for the radix tree. Most accesses would be read-only & updates infrequent. Any chance of an RCU-based radix implementation? Otherwise, don't we add the potential for hot locks/cachelines for threaded applications ??? - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
Re: [kvm-devel] [PATCH] mmu notifiers #v8
On Mon, Mar 03, 2008 at 11:01:22AM -0800, Christoph Lameter wrote: > API still has rcu issues and the example given for making things sleepable > is only working for the aging callback. The most important callback is for > try_to_unmao and page_mkclean. This means the API is still not generic > enough and likely not extendable as needed in its present form. I converted only one of those _notify as an example of how it should be done, because I assumed you volunteer to convert the other ones yourself during .26. It's useless to convert all of them right now, because the i_mmap_lock and anon_vma locks are still going to be spinlocks in .25. - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
Re: [kvm-devel] [PATCH] mmu notifiers #v8
On Tue, Mar 04, 2008 at 11:35:32AM +0100, Peter Zijlstra wrote: > > On Mon, 2008-03-03 at 13:15 -0600, Jack Steiner wrote: > > > I haven't thought about locking requirements for the radix tree. Most > > accesses > > would be read-only & updates infrequent. Any chance of an RCU-based radix > > implementation? Otherwise, don't we add the potential for hot > > locks/cachelines > > for threaded applications ??? > > The current radix tree implementation in the kernel is RCU capable. We > just don't have many RCU users yet. Ahhh. You are right. I thought I looked but obviously missed it. - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
Re: [kvm-devel] [PATCH] mmu notifiers #v8
On Mon, Mar 03, 2008 at 11:01:22AM -0800, Christoph Lameter wrote: > On Mon, 3 Mar 2008, Nick Piggin wrote: > > > I'm still not completely happy with this. I had a very quick look > > at the GRU driver, but I don't see why it can't be implemented > > more like the regular TLB model, and have TLB insertions depend on > > the linux pte, and do invalidates _after_ restricting permissions > > to the pte. > > > > Ie. I'd still like to get rid of invalidate_range_begin, and get > > rid of invalidate calls from places where permissions are relaxed. > > Isnt this more a job for paravirt ops if it is so tightly bound to page > tables? Are we not adding another similar API? Um, it's bound to the *Linux page tables*, yes. And I have no idea why you would use the paravirt ops for this. - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
Re: [kvm-devel] [PATCH] mmu notifiers #v8
On Wed, 5 Mar 2008, Nick Piggin wrote: > Um, it's bound to the *Linux page tables*, yes. And I have no idea why > you would use the paravirt ops for this. paravirt ops allows interception of page table operations? - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
Re: [kvm-devel] [PATCH] mmu notifiers #v8
On Wed, Mar 05, 2008 at 10:48:24AM -0800, Christoph Lameter wrote: > On Wed, 5 Mar 2008, Nick Piggin wrote: > > > Um, it's bound to the *Linux page tables*, yes. And I have no idea why > > you would use the paravirt ops for this. > > paravirt ops allows interception of page table operations? Maybe possible but it's totally the wrong API for it. - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
Re: [kvm-devel] [PATCH] mmu notifiers #v8 + xpmem
Here an example of the futher orthogonal work to do on top of #v8 during .26-rc to make the whole mmu notifier API sleep capable. 1) Every single ptep_clear_flush_young_notify and ptep_clear_flush_notify must be converted like the below. The below is the conversion of a single one. do_wp_page has been converted by Christoph already but with invalidate_range (should be changed to invalidate_page by releasing the refcount on the page after calling invalidate_page). Hope it's clear why I'd rather not depend on these changes to be merged in .25 in order to have the mmu notifier included in .25. 2) Then after all this conversion work is finished, it's trivial to delete ptep_clear_flush_young_notify and ptep_clear_flush_notify from mmu_notifier.h (they will be unused macros once the conversion is complete). 3) After that the VM has to be changed to convert anon_vma lock and i_mmap_lock spinlocks to mutex/rwsemaphore. 4) Then finally the mmu_notifier_unregister must be dropped to make the mmu notifier sleep capable with RCU in the mmu_notifier() fast path. It's unclear at this point if 3/4 should be switchable and happening under a CONFIG_XPMEM or similar or if everyone will benefit from those spinlock becoming mutex (the only one that is certain to appreciate such a change is preempt-rt, the rest of the userbase I don't know for sure and I'd be more confortable with a TPC number comparison before doing such a chance by default, but I leave the commentary on such a change to linux-mm in a separate thread). Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]> diff --git a/mm/rmap.c b/mm/rmap.c --- a/mm/rmap.c +++ b/mm/rmap.c @@ -274,7 +274,7 @@ static int page_referenced_one(struct pa unsigned long address; pte_t *pte; spinlock_t *ptl; - int referenced = 0; + int referenced = 0, clear_flush_young = 0; address = vma_address(page, vma); if (address == -EFAULT) @@ -287,8 +287,11 @@ static int page_referenced_one(struct pa if (vma->vm_flags & VM_LOCKED) { referenced++; *mapcount = 1; /* break early from loop */ - } else if (ptep_clear_flush_young_notify(vma, address, pte)) - referenced++; + } else { + clear_flush_young = 1; + if (ptep_clear_flush_young(vma, address, pte)) + referenced++; + } /* Pretend the page is referenced if the task has the swap token and is in the middle of a page fault. */ @@ -298,6 +301,11 @@ static int page_referenced_one(struct pa (*mapcount)--; pte_unmap_unlock(pte, ptl); + + if (clear_flush_young) + referenced += mmu_notifier_clear_flush_young(vma->vm_mm, +address); + out: return referenced; } - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel