Re: [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v5
On Thu, May 22, 2014 at 10:47:22AM -0700, Andrew Morton wrote:
> On Thu, 22 May 2014 09:46:43 +0100 Mel Gorman wrote:
> > > > If I'm still on track here, what happens if we switch to wake-all so we can avoid the dangling flag? I doubt if there are many collisions on that hash table?
> > >
> > > Wake-all will be ugly and lose a herd of waiters, all racing to acquire, all but one of whom will lose the race. It also loses the fairness; it's currently a FIFO queue. Wake-all will allow starvation.
> >
> > And the cost of the thundering herd of waiters may offset any benefit of reducing the number of calls to page_waitqueue and waker functions.
>
> Well, none of this has been demonstrated.

True, but it's also the type of thing that would deserve a patch of its own with some separation, in case bisection fingerpoints to a patch that is doing too much on its own.

> As I speculated earlier, hash chain collisions will probably be rare,

They are meant to be (well, they're documented to be). It's the primary reason why I'm not concerned about "dangling waiters" being that common a case.

> except for the case where a bunch of processes are waiting on the same page. And in this case, perhaps wake-all is the desired behavior.
>
> Take a look at do_read_cache_page(). It does lock_page(), but it doesn't actually *need* to. It checks ->mapping and PG_uptodate and then... unlocks the page! We could have used wait_on_page_locked() there and permitted concurrent threads to run concurrently.

It does that later when it calls wait_on_page_read, but the flow is weird. It looks like the first lock_page was to serialise against any IO and double-check it was not racing against a parallel reclaim, although the elevated reference count should have prevented that. Historical artifact maybe? It looks like there could be some improvement there, but it also would deserve a patch of its own.

-- 
Mel Gorman
SUSE Labs
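For context, the do_read_cache_page() flow being discussed looked roughly like this in the 3.15 era (a paraphrased sketch, not verbatim kernel code):

	page = __read_cache_page(mapping, index, filler, data, gfp);
	if (PageUptodate(page))
		goto out;

	lock_page(page);			/* the lock Andrew questions */
	if (!page->mapping) {			/* truncated while unlocked */
		unlock_page(page);
		page_cache_release(page);
		goto retry;
	}
	if (PageUptodate(page)) {
		unlock_page(page);		/* taken only to be dropped again */
		goto out;
	}
	err = filler(data, page);		/* re-read the page, then... */
	page = wait_on_page_read(page);		/* ...the later wait Mel mentions */
out:
	mark_page_accessed(page);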
Re: [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v5
On Thu, 22 May 2014 09:46:43 +0100 Mel Gorman wrote:
> > > If I'm still on track here, what happens if we switch to wake-all so we can avoid the dangling flag? I doubt if there are many collisions on that hash table?
> >
> > Wake-all will be ugly and lose a herd of waiters, all racing to acquire, all but one of whom will lose the race. It also loses the fairness; it's currently a FIFO queue. Wake-all will allow starvation.
>
> And the cost of the thundering herd of waiters may offset any benefit of reducing the number of calls to page_waitqueue and waker functions.

Well, none of this has been demonstrated.

As I speculated earlier, hash chain collisions will probably be rare, except for the case where a bunch of processes are waiting on the same page. And in this case, perhaps wake-all is the desired behavior.

Take a look at do_read_cache_page(). It does lock_page(), but it doesn't actually *need* to. It checks ->mapping and PG_uptodate and then... unlocks the page! We could have used wait_on_page_locked() there and permitted concurrent threads to run concurrently.

btw, I'm struggling a bit to understand why we bother checking ->mapping there as we're about to unlock the page anyway...
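Concretely, the suggestion amounts to something like the following in place of the lock/check/unlock sequence (illustrative only, not a tested patch):

	/* Wait for in-flight read IO without excluding other threads */
	wait_on_page_locked(page);
	if (PageUptodate(page))
		goto out;
	/* Fall back to lock_page() only to pick the one thread that
	 * re-runs filler(); concurrent readers never serialise here. */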
Re: [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v5
On Thu, May 22, 2014 at 08:45:29AM +0200, Peter Zijlstra wrote:
> On Wed, May 21, 2014 at 02:50:00PM -0700, Andrew Morton wrote:
> > On Wed, 21 May 2014 23:33:54 +0200 Peter Zijlstra wrote:
> > > Alternative solution is not to merge the patch ;)
>
> There is always that.. :-)
>
> > > Yeah, so we only clear that bit when at 'unlock' we find there are no more pending waiters, so if the last unlock still had a waiter, we'll leave the bit set.
> >
> > Confused. If the last unlock had a waiter, that waiter will get woken up so there are no waiters any more, so the last unlock clears the flag.
> >
> > um, how do we determine that there are no more waiters? By looking at the waitqueue. But that waitqueue is hashed, so it may contain waiters for other pages so we're screwed? But we could just go and wake up the other-page waiters anyway and still clear PG_waiters?
> >
> > um2, we're using exclusive waitqueues so we can't (or don't) wake all waiters, so we're screwed again?
>
> Ah, so leave it set. Then when we do an uncontended wakeup, that is a wakeup where there are _no_ waiters left, we'll iterate the entire hashed queue, looking for a matching page. We'll find none, and only then clear the bit.

Yes, sorry that was not clear.

> > (This process is proving to be a hard way of writing Mel's changelog btw).
>
> Agreed :/

I've lost sight of what is obvious and what is not. The introduction now reads:

    This patch introduces a new page flag for 64-bit capable machines,
    PG_waiters, to signal there are *potentially* processes waiting on
    PG_lock or PG_writeback. If there are no possible waiters then we
    avoid barriers, a waitqueue hash lookup and a failed wake_up in the
    unlock_page and end_page_writeback paths.

    There is no guarantee that waiters exist if PG_waiters is set as
    multiple pages can hash to the same waitqueue and we cannot
    accurately detect if a waking process is the last waiter without a
    reference count. When this happens, the bit is left set and the next
    unlock or writeback completion will look up the waitqueue and clear
    the bit when there are no collisions. This adds a few branches to
    the fast path but avoids bouncing a dirty cache line between CPUs.
    32-bit machines always take the slow path but the primary motivation
    for this patch is large machines so I do not think that is a concern.

> > If I'm still on track here, what happens if we switch to wake-all so we can avoid the dangling flag? I doubt if there are many collisions on that hash table?
>
> Wake-all will be ugly and lose a herd of waiters, all racing to acquire, all but one of whom will lose the race. It also loses the fairness; it's currently a FIFO queue. Wake-all will allow starvation.

And the cost of the thundering herd of waiters may offset any benefit of reducing the number of calls to page_waitqueue and waker functions.

> > If there *are* a lot of collisions, I bet it's because a great pile of threads are all waiting on the same page. If they're trying to lock that page then wake-all is bad. But if they're just waiting for IO completion (probable) then it's OK.
>
> Yeah, I'm not entirely sure on the rationale for adding PG_waiters to writeback completion, and yes PG_writeback is a wake-all.

tmpfs was the most obvious one. We were doing a useless lookup almost every time writeback completed for async streaming writers. I suspected it would also apply to normal filesystems if backed by fast storage.

There is not much to gain by continuing to use __wake_up_bit in the writeback paths when PG_waiters is available. Only the first waiter incurs the SetPageWaiters penalty. In the uncontended case, neither takes locks (one approach checks waitqueue_active outside the lock, the other checks PageWaiters). Both approaches end up taking q->lock, either in __wake_up_bit->__wake_up or in __wake_up_page_bit, but in the latter case __wake_up_page_bit can also clear PG_waiters when the queue turns out to be empty, so future completions skip the lookup entirely.

-- 
Mel Gorman
SUSE Labs
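To make the comparison concrete, the two uncontended wakeup paths look roughly like this; the first is the stock __wake_up_bit(), the second a sketch of the PG_waiters scheme rather than the exact patch:

	/* Old: writeback completion always looks up the hashed waitqueue
	 * and does an unlocked emptiness check. */
	void __wake_up_bit(wait_queue_head_t *wqh, void *word, int bit)
	{
		struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);

		if (waitqueue_active(wqh))
			__wake_up(wqh, TASK_NORMAL, 1, &key);
	}

	/* New: the caller tests the page flag first, so the hash lookup and
	 * the q->lock acquisition happen only when a waiter may exist. */
	if (PageWaiters(page))
		__wake_up_page_bit(page_waitqueue(page), page,
				   &page->flags, PG_writeback);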
Re: [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v5
On Thu, May 22, 2014 at 01:07:15AM +0100, Mel Gorman wrote:
> +PAGEFLAG(Waiters, waiters) __CLEARPAGEFLAG(Waiters, waiters)
> +	TESTCLEARFLAG(Waiters, waiters)
> +#define __PG_WAITERS (1 << PG_waiters)
> +#else
> +/* Always fallback to slow path on 32-bit */
> +static inline bool PageWaiters(struct page *page)
> +{
> +	return true;
> +}
> +static inline void __ClearPageWaiters(struct page *page) {}
> +static inline void ClearPageWaiters(struct page *page) {}
> +static inline void SetPageWaiters(struct page *page) {}
> +#define __PG_WAITERS 0

> +void __wake_up_page_bit(wait_queue_head_t *wqh, struct page *page, void *word, int bit)
> +{
> +	struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
> +	unsigned long flags;
> +
> +	/*
> +	 * Unlike __wake_up_bit it is necessary for waitqueue_active to be
> +	 * checked under the wqh->lock to avoid races with parallel additions
> +	 * to the waitqueue. Otherwise races could result in lost wakeups
> +	 */

Well, you could do something like:

	if (!__PG_WAITERS && !waitqueue_active(wqh))
		return;

Which at least for 32bit restores some of the performance loss of this patch (did you have 32bit numbers in that massive changelog? I totally tl;dr it).

> +	spin_lock_irqsave(&wqh->lock, flags);
> +	if (waitqueue_active(wqh))
> +		__wake_up_common(wqh, TASK_NORMAL, 1, 0, &key);
> +	else
> +		ClearPageWaiters(page);
> +	spin_unlock_irqrestore(&wqh->lock, flags);
> +}
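Folding that suggestion into the quoted function would give something like this (untested sketch):

	void __wake_up_page_bit(wait_queue_head_t *wqh, struct page *page,
				void *word, int bit)
	{
		struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
		unsigned long flags;

		/* On 32-bit, __PG_WAITERS is 0 and PageWaiters() is always
		 * true, so this restores the old unlocked emptiness check. */
		if (!__PG_WAITERS && !waitqueue_active(wqh))
			return;

		spin_lock_irqsave(&wqh->lock, flags);
		if (waitqueue_active(wqh))
			__wake_up_common(wqh, TASK_NORMAL, 1, 0, &key);
		else
			ClearPageWaiters(page);	/* no-op on 32-bit */
		spin_unlock_irqrestore(&wqh->lock, flags);
	}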
Re: [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v5
On Wed, May 21, 2014 at 02:50:00PM -0700, Andrew Morton wrote:
> On Wed, 21 May 2014 23:33:54 +0200 Peter Zijlstra wrote:
> > Alternative solution is not to merge the patch ;)

There is always that.. :-)

> > Yeah, so we only clear that bit when at 'unlock' we find there are no more pending waiters, so if the last unlock still had a waiter, we'll leave the bit set.
>
> Confused. If the last unlock had a waiter, that waiter will get woken up so there are no waiters any more, so the last unlock clears the flag.
>
> um, how do we determine that there are no more waiters? By looking at the waitqueue. But that waitqueue is hashed, so it may contain waiters for other pages so we're screwed? But we could just go and wake up the other-page waiters anyway and still clear PG_waiters?
>
> um2, we're using exclusive waitqueues so we can't (or don't) wake all waiters, so we're screwed again?

Ah, so leave it set. Then when we do an uncontended wakeup, that is a wakeup where there are _no_ waiters left, we'll iterate the entire hashed queue, looking for a matching page. We'll find none, and only then clear the bit.

> (This process is proving to be a hard way of writing Mel's changelog btw).

Agreed :/

> If I'm still on track here, what happens if we switch to wake-all so we can avoid the dangling flag? I doubt if there are many collisions on that hash table?

Wake-all will be ugly and lose a herd of waiters, all racing to acquire, all but one of whom will lose the race. It also loses the fairness; it's currently a FIFO queue. Wake-all will allow starvation.

> If there *are* a lot of collisions, I bet it's because a great pile of threads are all waiting on the same page. If they're trying to lock that page then wake-all is bad. But if they're just waiting for IO completion (probable) then it's OK.

Yeah, I'm not entirely sure on the rationale for adding PG_waiters to writeback completion, and yes PG_writeback is a wake-all.
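The "looking for a matching page" step relies on wake_bit_function(), which filters out colliding waiters by comparing keys; in 3.15 it read roughly:

	int wake_bit_function(wait_queue_t *wait, unsigned mode, int sync, void *arg)
	{
		struct wait_bit_key *key = arg;
		struct wait_bit_queue *wait_bit
			= container_of(wait, struct wait_bit_queue, wait);

		/* Waiters for other pages hash to the same queue head; only
		 * wake a waiter whose word and bit match the wakeup key. */
		if (wait_bit->key.flags != key->flags ||
				wait_bit->key.bit_nr != key->bit_nr ||
				test_bit(key->bit_nr, key->flags))
			return 0;
		else
			return autoremove_wake_function(wait, mode, sync, key);
	}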
Re: [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v5
On Wed, May 21, 2014 at 02:50:00PM -0700, Andrew Morton wrote:
> On Wed, 21 May 2014 23:33:54 +0200 Peter Zijlstra wrote:
> > On Wed, May 21, 2014 at 02:26:22PM -0700, Andrew Morton wrote:
> > > > +static inline void
> > > > +__prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait,
> > > > +			struct page *page, int state, bool exclusive)
> > >
> > > Putting MM stuff into core waitqueue code is rather bad. I really don't know how I'm going to explain this to my family.
> >
> > Right, so we could avoid all that and make the functions in mm/filemap.c rather large and opencode a bunch of wait.c stuff.
>
> The world won't end if we do it Mel's way and it's probably the most efficient. But ugh. This stuff does raise the "it had better be a useful patch" bar.
>
> > Which is pretty much what I initially pseudo proposed.
>
> Alternative solution is not to merge the patch ;)

While true, the overhead of the page_waitqueue lookups and unnecessary wakeups sucks even on small machines. Not only does it hit us during simple operations like dd to a file, but we would hit it during page reclaim as well, which is trylock_page/unlock_page intensive.

> > > > +	__ClearPageWaiters(page);
> > >
> > > We're freeing the page - if someone is still waiting on it then we have a huge bug? It's the mysterious collision thing again I hope?
> >
> > Yeah, so we only clear that bit when at 'unlock' we find there are no more pending waiters, so if the last unlock still had a waiter, we'll leave the bit set.
>
> Confused. If the last unlock had a waiter, that waiter will get woken up so there are no waiters any more, so the last unlock clears the flag.
>
> um, how do we determine that there are no more waiters? By looking at the waitqueue. But that waitqueue is hashed, so it may contain waiters for other pages so we're screwed? But we could just go and wake up the other-page waiters anyway and still clear PG_waiters?
>
> um2, we're using exclusive waitqueues so we can't (or don't) wake all waiters, so we're screwed again?
>
> (This process is proving to be a hard way of writing Mel's changelog btw).
>
> If I'm still on track here, what happens if we switch to wake-all so we can avoid the dangling flag? I doubt if there are many collisions on that hash table?
>
> If there *are* a lot of collisions, I bet it's because a great pile of threads are all waiting on the same page. If they're trying to lock that page then wake-all is bad. But if they're just waiting for IO completion (probable) then it's OK.
>
> I'll stop now.

Rather than putting details in the changelog, here is an updated version that hopefully improves the commentary to the point where it's actually clear.

---8<---
From: Nick Piggin
Subject: [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v6

Changelog since v5
o __always_inline where appropriate		(peterz)
o Documentation					(akpm)

Changelog since v4
o Remove dependency on io_schedule_timeout
o Push waiting logic down into waitqueue

This patch introduces a new page flag for 64-bit capable machines, PG_waiters, to signal there are processes waiting on PG_lock or PG_writeback and uses it to avoid memory barriers and waitqueue hash lookup in the unlock_page fastpath. This adds a few branches to the fast path but avoids bouncing a dirty cache line between CPUs. 32-bit machines always take the slow path but the primary motivation for this patch is large machines so I do not think that is a concern.

The test case used to evaluate this is a simple dd of a large file done multiple times with the file deleted on each iteration. The size of the file is 1/10th physical memory to avoid dirty page balancing. In the async case it will be possible that the workload completes without even hitting the disk and will have variable results, but it highlights the impact of mark_page_accessed for async IO. The sync results are expected to be more stable. The exception is tmpfs where the normal case is for the "IO" to not hit the disk.

The test machine was single socket and UMA to avoid any scheduling or NUMA artifacts. Throughput and wall times are presented for sync IO; only wall times are shown for async as the granularity reported by dd and the variability is unsuitable for comparison. As async results were variable due to writeback timings, I'm only reporting the maximum figures. The sync results were stable enough to make the mean and stddev uninteresting.

The performance results are reported based on a run with no profiling. Profile data is based on a separate run with oprofile running. The kernels being compared are "accessed-v2", which is the patch series up to this patch, whereas lockpage-v2 includes this patch.

async dd				3.15.0-rc5	3.15.0-rc5	mmotm
Re: [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v5
On Wed, May 21, 2014 at 02:26:22PM -0700, Andrew Morton wrote:
> On Wed, 21 May 2014 13:15:01 +0100 Mel Gorman wrote:
> > Andrew had suggested dropping v4 of the patch entirely as the numbers were marginal and the complexity was high. However, even on a relatively small machine running simple workloads the overhead of page_waitqueue and wakeup functions is around 5% of system CPU time. That's quite high for basic operations so I felt it was worth another shot. The performance figures are better with this version than they were for v4 and overall the patch should be more comprehensible.
> >
> > Changelog since v4
> > o Remove dependency on io_schedule_timeout
> > o Push waiting logic down into waitqueue
> >
> > This patch introduces a new page flag for 64-bit capable machines, PG_waiters, to signal there are processes waiting on PG_lock and uses it to avoid memory barriers and waitqueue hash lookup in the unlock_page fastpath.
> >
> > This adds a few branches to the fast path but avoids bouncing a dirty cache line between CPUs. 32-bit machines always take the slow path but the primary motivation for this patch is large machines so I do not think that is a concern.
> >
> > The test case used to evaluate this is a simple dd of a large file done multiple times with the file deleted on each iteration. The size of the file is 1/10th physical memory to avoid dirty page balancing. In the async case it will be possible that the workload completes without even hitting the disk and will have variable results, but it highlights the impact of mark_page_accessed for async IO. The sync results are expected to be more stable. The exception is tmpfs where the normal case is for the "IO" to not hit the disk.
> >
> > The test machine was single socket and UMA to avoid any scheduling or NUMA artifacts. Throughput and wall times are presented for sync IO; only wall times are shown for async as the granularity reported by dd and the variability is unsuitable for comparison. As async results were variable due to writeback timings, I'm only reporting the maximum figures. The sync results were stable enough to make the mean and stddev uninteresting.
> >
> > The performance results are reported based on a run with no profiling. Profile data is based on a separate run with oprofile running. The kernels being compared are "accessed-v2", which is the patch series up to this patch, whereas lockpage-v2 includes this patch.
> >
> > ...
> >
> > --- a/include/linux/wait.h
> > +++ b/include/linux/wait.h
> > @@ -147,8 +147,13 @@ void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode, int nr, void *k
> >  void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr);
> >  void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr);
> >  void __wake_up_bit(wait_queue_head_t *, void *, int);
> > +void __wake_up_page_bit(wait_queue_head_t *, struct page *page, void *, int);
>
> You're going to need to forward-declare struct page in wait.h. The good thing about this is that less people will notice that we've gone and mentioned struct page in wait.h :(

Will add the forward-declare.

> >  int __wait_on_bit(wait_queue_head_t *, struct wait_bit_queue *, int (*)(void *), unsigned);
> > +int __wait_on_page_bit(wait_queue_head_t *, struct wait_bit_queue *,
> >
> > ...
> >
> > --- a/kernel/sched/wait.c
> > +++ b/kernel/sched/wait.c
> > @@ -167,31 +167,39 @@ EXPORT_SYMBOL_GPL(__wake_up_sync);	/* For internal use only */
> >   * stops them from bleeding out - it would still allow subsequent
> >   * loads to move into the critical region).
> >   */
> > -void
> > -prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
> > +static inline void
> > +__prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait,
> > +			struct page *page, int state, bool exclusive)
>
> Putting MM stuff into core waitqueue code is rather bad. I really don't know how I'm going to explain this to my family.

The alternative to updating wait.h and wait.c was open-coding the waitqueue modifications in filemap.c, but that is just as ugly. The wait queue stuff is complex and there was motivation to keep it in one place even if we are special casing struct page handling.

FWIW, I cannot explain anything I do in work to my family. It gets blank looks no matter what.

> > {
> >  	unsigned long flags;
> >
> > -	wait->flags &= ~WQ_FLAG_EXCLUSIVE;
> >  	spin_lock_irqsave(&q->lock, flags);
> > -	if (list_empty(&wait->task_list))
> > -		__add_wait_queue(q, wait);
> > +	if (page && !PageWaiters(page))
> > +		SetPageWaiters(page);
>
> And this isn't racy because we're assuming that all users of `page' are using the same waitqueue. ie, assuming all callers use page_waitqueue()? Subtle, unobvious, worth documenting.

All users of the page will get the same waitqueue because they look it up with page_waitqueue(), so the PageWaiters check and update are always serialised by the same q->lock.
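For reference, page_waitqueue() at the time was the per-zone hash below, which is what guarantees that every waiter on a given page resolves to the same waitqueue head and is therefore serialised by the same q->lock:

	static wait_queue_head_t *page_waitqueue(struct page *page)
	{
		const struct zone *zone = page_zone(page);

		/* Same page, same head: waiters and wakers always agree */
		return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
	}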
Re: [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v5
On Wed, 21 May 2014 23:33:54 +0200 Peter Zijlstra wrote:
> On Wed, May 21, 2014 at 02:26:22PM -0700, Andrew Morton wrote:
> > > +static inline void
> > > +__prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait,
> > > +			struct page *page, int state, bool exclusive)
> >
> > Putting MM stuff into core waitqueue code is rather bad. I really don't know how I'm going to explain this to my family.
>
> Right, so we could avoid all that and make the functions in mm/filemap.c rather large and opencode a bunch of wait.c stuff.

The world won't end if we do it Mel's way and it's probably the most efficient. But ugh. This stuff does raise the "it had better be a useful patch" bar.

> Which is pretty much what I initially pseudo proposed.

Alternative solution is not to merge the patch ;)

> > > +	__ClearPageWaiters(page);
> >
> > We're freeing the page - if someone is still waiting on it then we have a huge bug? It's the mysterious collision thing again I hope?
>
> Yeah, so we only clear that bit when at 'unlock' we find there are no more pending waiters, so if the last unlock still had a waiter, we'll leave the bit set.

Confused. If the last unlock had a waiter, that waiter will get woken up so there are no waiters any more, so the last unlock clears the flag.

um, how do we determine that there are no more waiters? By looking at the waitqueue. But that waitqueue is hashed, so it may contain waiters for other pages so we're screwed? But we could just go and wake up the other-page waiters anyway and still clear PG_waiters?

um2, we're using exclusive waitqueues so we can't (or don't) wake all waiters, so we're screwed again?

(This process is proving to be a hard way of writing Mel's changelog btw).

If I'm still on track here, what happens if we switch to wake-all so we can avoid the dangling flag? I doubt if there are many collisions on that hash table?

If there *are* a lot of collisions, I bet it's because a great pile of threads are all waiting on the same page. If they're trying to lock that page then wake-all is bad. But if they're just waiting for IO completion (probable) then it's OK.

I'll stop now.
Re: [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v5
On Wed, May 21, 2014 at 02:26:22PM -0700, Andrew Morton wrote:
> > +static inline void
> > +__prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait,
> > +			struct page *page, int state, bool exclusive)
>
> Putting MM stuff into core waitqueue code is rather bad. I really don't know how I'm going to explain this to my family.

Right, so we could avoid all that and make the functions in mm/filemap.c rather large and opencode a bunch of wait.c stuff.

Which is pretty much what I initially pseudo proposed.

> > +	__ClearPageWaiters(page);
>
> We're freeing the page - if someone is still waiting on it then we have a huge bug? It's the mysterious collision thing again I hope?

Yeah, so we only clear that bit when at 'unlock' we find there are no more pending waiters, so if the last unlock still had a waiter, we'll leave the bit set. So it's entirely reasonable to still have it set when we free a page etc.
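Spelled out as a comment, the lifecycle being described is (assuming the v5 semantics):

	/*
	 * prepare_to_wait:  SetPageWaiters(page) under q->lock, then sleep
	 * unlock_page:      if PageWaiters(page), take q->lock and wake one
	 *                   waiter; only if the queue turns out to be empty
	 *                   (collision already drained, or a stale bit) is
	 *                   ClearPageWaiters(page) called
	 * page free:        __ClearPageWaiters(page) - the bit can
	 *                   legitimately still be set here, so it is not a
	 *                   bug indicator
	 */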
Re: [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v5
On Wed, 21 May 2014 13:15:01 +0100 Mel Gorman wrote:
> Andrew had suggested dropping v4 of the patch entirely as the numbers were marginal and the complexity was high. However, even on a relatively small machine running simple workloads the overhead of page_waitqueue and wakeup functions is around 5% of system CPU time. That's quite high for basic operations so I felt it was worth another shot. The performance figures are better with this version than they were for v4 and overall the patch should be more comprehensible.
>
> Changelog since v4
> o Remove dependency on io_schedule_timeout
> o Push waiting logic down into waitqueue
>
> This patch introduces a new page flag for 64-bit capable machines, PG_waiters, to signal there are processes waiting on PG_lock and uses it to avoid memory barriers and waitqueue hash lookup in the unlock_page fastpath.
>
> This adds a few branches to the fast path but avoids bouncing a dirty cache line between CPUs. 32-bit machines always take the slow path but the primary motivation for this patch is large machines so I do not think that is a concern.
>
> The test case used to evaluate this is a simple dd of a large file done multiple times with the file deleted on each iteration. The size of the file is 1/10th physical memory to avoid dirty page balancing. In the async case it will be possible that the workload completes without even hitting the disk and will have variable results, but it highlights the impact of mark_page_accessed for async IO. The sync results are expected to be more stable. The exception is tmpfs where the normal case is for the "IO" to not hit the disk.
>
> The test machine was single socket and UMA to avoid any scheduling or NUMA artifacts. Throughput and wall times are presented for sync IO; only wall times are shown for async as the granularity reported by dd and the variability is unsuitable for comparison. As async results were variable due to writeback timings, I'm only reporting the maximum figures. The sync results were stable enough to make the mean and stddev uninteresting.
>
> The performance results are reported based on a run with no profiling. Profile data is based on a separate run with oprofile running. The kernels being compared are "accessed-v2", which is the patch series up to this patch, whereas lockpage-v2 includes this patch.
>
> ...
>
> --- a/include/linux/wait.h
> +++ b/include/linux/wait.h
> @@ -147,8 +147,13 @@ void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode, int nr, void *k
>  void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr);
>  void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr);
>  void __wake_up_bit(wait_queue_head_t *, void *, int);
> +void __wake_up_page_bit(wait_queue_head_t *, struct page *page, void *, int);

You're going to need to forward-declare struct page in wait.h. The good thing about this is that less people will notice that we've gone and mentioned struct page in wait.h :(

>  int __wait_on_bit(wait_queue_head_t *, struct wait_bit_queue *, int (*)(void *), unsigned);
> +int __wait_on_page_bit(wait_queue_head_t *, struct wait_bit_queue *,
>
> ...
>
> --- a/kernel/sched/wait.c
> +++ b/kernel/sched/wait.c
> @@ -167,31 +167,39 @@ EXPORT_SYMBOL_GPL(__wake_up_sync);	/* For internal use only */
>   * stops them from bleeding out - it would still allow subsequent
>   * loads to move into the critical region).
>   */
> -void
> -prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
> +static inline void
> +__prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait,
> +			struct page *page, int state, bool exclusive)

Putting MM stuff into core waitqueue code is rather bad. I really don't know how I'm going to explain this to my family.

> {
>  	unsigned long flags;
>
> -	wait->flags &= ~WQ_FLAG_EXCLUSIVE;
>  	spin_lock_irqsave(&q->lock, flags);
> -	if (list_empty(&wait->task_list))
> -		__add_wait_queue(q, wait);
> +	if (page && !PageWaiters(page))
> +		SetPageWaiters(page);

And this isn't racy because we're assuming that all users of `page' are using the same waitqueue. ie, assuming all callers use page_waitqueue()? Subtle, unobvious, worth documenting.

> +	if (list_empty(&wait->task_list)) {
> +		if (exclusive) {
> +			wait->flags |= WQ_FLAG_EXCLUSIVE;
> +			__add_wait_queue_tail(q, wait);
> +		} else {
> +			wait->flags &= ~WQ_FLAG_EXCLUSIVE;
> +			__add_wait_queue(q, wait);
> +		}
> +	}
>  	set_current_state(state);
>  	spin_unlock_irqrestore(&q->lock, flags);
> }
>
> ...
>
> @@ -228,7 +236,8 @@ EXPORT_SYMBOL(prepare_to_wait_event);
>   * the wait descriptor from the given waitqueue if still
>   * queued.
>   */
> -void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
> +static inline void __finish_wait(wait_queue_head_t *q, wait_queue_t *wait,
> +			struct page *page)
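The forward declaration Andrew asks for is the usual one-liner near the top of wait.h, something like:

	struct page;	/* avoid dragging mm headers into wait.h */

	void __wake_up_page_bit(wait_queue_head_t *, struct page *page, void *, int);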
Re: [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v5
On Wed, May 21, 2014 at 04:33:57PM +0100, Mel Gorman wrote:
> > > +__prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait,
> > > +			struct page *page, int state, bool exclusive)
> > > {
> > > 	unsigned long flags;
> > >
> > > +	if (page && !PageWaiters(page))
> > > +		SetPageWaiters(page);
> > > +	if (list_empty(&wait->task_list)) {
> > > +		if (exclusive) {
> > > +			wait->flags |= WQ_FLAG_EXCLUSIVE;
> > > +			__add_wait_queue_tail(q, wait);
> > > +		} else {
> >
> > I'm fairly sure we've just initialized the wait thing to 0, so clearing the bit would be superfluous.
>
> I assume you mean the clearing of WQ_FLAG_EXCLUSIVE. It may or may not be superfluous. If it's an on-stack wait_queue_t initialised with DEFINE_WAIT() then it's redundant. If it's a wait_queue_t that is being reused and sometimes used for exclusive waits and other times for non-exclusive waits then it's required. The API allows this to happen so I see no harm in clearing the flag like the old code did. Am I missing your point?

Yeah, I'm not aware of any other users except the on-stack kind, but you're right. Maybe we should stick an object_is_on_stack() test in there to see if anything falls out, something for a rainy afternoon perhaps..

> > > +void __wake_up_page_bit(wait_queue_head_t *wqh, struct page *page, void *word, int bit)
> > > +{
> > > +	struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
> > > +	unsigned long flags;
> > > +
> > > +	spin_lock_irqsave(&wqh->lock, flags);
> > > +	if (waitqueue_active(wqh))
> > > +		__wake_up_common(wqh, TASK_NORMAL, 1, 0, &key);
> > > +	else
> > > +		ClearPageWaiters(page);
> > > +	spin_unlock_irqrestore(&wqh->lock, flags);
> > > +}
> >
> > Seeing how word is always going to be &page->flags, might it make sense to remove that argument?
>
> The wait_queue was defined on-stack with DEFINE_WAIT_BIT which uses wake_bit_function() as a wakeup function and that thing consumes both the page->flags and the bit number it's interested in. This is used for both PG_writeback and PG_locked so assumptions cannot really be made about the value.

Well, both PG_ flags come from the same &page->flags word, right? But yeah, if we ever decide to grow the page frame with another flags word we'd be in trouble :-)

In any case I don't feel too strongly about either of these points.
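Peter's initialisation point refers to the on-stack definition; in 3.15 DEFINE_WAIT() expanded to roughly the following, where the unnamed .flags member is zeroed by the designated initialiser:

	#define DEFINE_WAIT_FUNC(name, function)				\
		wait_queue_t name = {						\
			.private	= current,				\
			.func		= function,				\
			.task_list	= LIST_HEAD_INIT((name).task_list),	\
		}	/* .flags is not named, so it starts out as 0 */

	#define DEFINE_WAIT(name) DEFINE_WAIT_FUNC(name, autoremove_wake_function)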
Re: [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v5
On Wed, May 21, 2014 at 03:02:23PM +0200, Peter Zijlstra wrote:
> On Wed, May 21, 2014 at 01:15:01PM +0100, Mel Gorman wrote:
> > Andrew had suggested dropping v4 of the patch entirely as the numbers were marginal and the complexity was high. However, even on a relatively small machine running simple workloads the overhead of page_waitqueue and wakeup functions is around 5% of system CPU time. That's quite high for basic operations so I felt it was worth another shot. The performance figures are better with this version than they were for v4 and overall the patch should be more comprehensible.
>
> Simpler patch and better performance, yay!
>
> > This patch introduces a new page flag for 64-bit capable machines, PG_waiters, to signal there are processes waiting on PG_lock and uses it to avoid memory barriers and waitqueue hash lookup in the unlock_page fastpath.
>
> The patch seems to also explicitly use it for PG_writeback, yet no mention of that here.

I'll add a note.

> > diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
> > index 0ffa20a..f829e73 100644
> > --- a/kernel/sched/wait.c
> > +++ b/kernel/sched/wait.c
> > @@ -167,31 +167,39 @@ EXPORT_SYMBOL_GPL(__wake_up_sync);	/* For internal use only */
> >   * stops them from bleeding out - it would still allow subsequent
> >   * loads to move into the critical region).
> >   */
> > +static inline void
>
> Make that __always_inline, that way we're guaranteed to optimize the build time constant .page=NULL cases.

Done.

> > +__prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait,
> > +			struct page *page, int state, bool exclusive)
> > {
> >  	unsigned long flags;
> >
> > +	if (page && !PageWaiters(page))
> > +		SetPageWaiters(page);
> > +	if (list_empty(&wait->task_list)) {
> > +		if (exclusive) {
> > +			wait->flags |= WQ_FLAG_EXCLUSIVE;
> > +			__add_wait_queue_tail(q, wait);
> > +		} else {
>
> I'm fairly sure we've just initialized the wait thing to 0, so clearing the bit would be superfluous.

I assume you mean the clearing of WQ_FLAG_EXCLUSIVE. It may or may not be superfluous. If it's an on-stack wait_queue_t initialised with DEFINE_WAIT() then it's redundant. If it's a wait_queue_t that is being reused and sometimes used for exclusive waits and other times for non-exclusive waits then it's required. The API allows this to happen so I see no harm in clearing the flag like the old code did. Am I missing your point?

> > +		wait->flags &= ~WQ_FLAG_EXCLUSIVE;
> > +		__add_wait_queue(q, wait);
> > +	}
> > +	set_current_state(state);
> > +	spin_unlock_irqrestore(&q->lock, flags);
> > }
> > +
> > +void
> > +prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
> > +{
> > +	return __prepare_to_wait(q, wait, NULL, state, false);
> > +}
> > EXPORT_SYMBOL(prepare_to_wait);
> >
> > void
> > prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int state)
> > {
> > +	return __prepare_to_wait(q, wait, NULL, state, true);
> > }
> > EXPORT_SYMBOL(prepare_to_wait_exclusive);
> >
> > @@ -228,7 +236,8 @@ EXPORT_SYMBOL(prepare_to_wait_event);
> >   * the wait descriptor from the given waitqueue if still
> >   * queued.
> >   */
> > +static inline void __finish_wait(wait_queue_head_t *q, wait_queue_t *wait,
> > +			struct page *page)
> > {
>
> Same thing, make that __always_inline.

Done.

> > 	unsigned long flags;
> >
> > @@ -249,9 +258,16 @@ void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
> > 	if (!list_empty_careful(&wait->task_list)) {
> > 		spin_lock_irqsave(&q->lock, flags);
> > 		list_del_init(&wait->task_list);
> > +		if (page && !waitqueue_active(q))
> > +			ClearPageWaiters(page);
> > 		spin_unlock_irqrestore(&q->lock, flags);
> > 	}
> > }
> > +
> > +void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
> > +{
> > +	return __finish_wait(q, wait, NULL);
> > +}
> > EXPORT_SYMBOL(finish_wait);
> >
> > /**
> > @@ -374,6 +427,19 @@ int __sched out_of_line_wait_on_bit_lock(void *word, int bit,
> > }
> > EXPORT_SYMBOL(out_of_line_wait_on_bit_lock);
> >
> > +void __wake_up_page_bit(wait_queue_head_t *wqh, struct page *page, void *word, int bit)
> > +{
> > +	struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&wqh->lock, flags);
> > +	if (waitqueue_active(wqh))
> > +		__wake_up_common(wqh, TASK_NORMAL, 1, 0, &key);
> > +	else
> > +		ClearPageWaiters(page);
> > +	spin_unlock_irqrestore(&wqh->lock, flags);
> > +}
>
> Seeing how word is always going to be &page->flags, might it make sense to remove that argument?

The wait_queue was defined on-stack with DEFINE_WAIT_BIT which uses wake_bit_function() as a wakeup function, and that thing consumes both the page->flags and the bit number it's interested in. This is used for both PG_writeback and PG_locked so assumptions cannot really be made about the value.

-- 
Mel Gorman
SUSE Labs
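The key Mel refers to carries both the word and the bit number, which is why the word argument cannot simply be dropped; the 3.15 definitions were roughly:

	struct wait_bit_key {
		void	*flags;		/* here always &page->flags ... */
		int	bit_nr;		/* ... but PG_locked or PG_writeback */
	};

	#define __WAIT_BIT_KEY_INITIALIZER(word, bit)				\
		{ .flags = word, .bit_nr = bit, }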
Re: [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v5
On Wed, May 21, 2014 at 01:15:01PM +0100, Mel Gorman wrote:
> Andrew had suggested dropping v4 of the patch entirely as the numbers were marginal and the complexity was high. However, even on a relatively small machine running simple workloads the overhead of page_waitqueue and wakeup functions is around 5% of system CPU time. That's quite high for basic operations so I felt it was worth another shot. The performance figures are better with this version than they were for v4 and overall the patch should be more comprehensible.

Simpler patch and better performance, yay!

> This patch introduces a new page flag for 64-bit capable machines, PG_waiters, to signal there are processes waiting on PG_lock and uses it to avoid memory barriers and waitqueue hash lookup in the unlock_page fastpath.

The patch seems to also explicitly use it for PG_writeback, yet no mention of that here.

> diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
> index 0ffa20a..f829e73 100644
> --- a/kernel/sched/wait.c
> +++ b/kernel/sched/wait.c
> @@ -167,31 +167,39 @@ EXPORT_SYMBOL_GPL(__wake_up_sync);	/* For internal use only */
>   * stops them from bleeding out - it would still allow subsequent
>   * loads to move into the critical region).
>   */
> +static inline void

Make that __always_inline, that way we're guaranteed to optimize the build time constant .page=NULL cases.

> +__prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait,
> +			struct page *page, int state, bool exclusive)
> {
> 	unsigned long flags;
>
> +	if (page && !PageWaiters(page))
> +		SetPageWaiters(page);
> +	if (list_empty(&wait->task_list)) {
> +		if (exclusive) {
> +			wait->flags |= WQ_FLAG_EXCLUSIVE;
> +			__add_wait_queue_tail(q, wait);
> +		} else {

I'm fairly sure we've just initialized the wait thing to 0, so clearing the bit would be superfluous.

> +			wait->flags &= ~WQ_FLAG_EXCLUSIVE;
> +			__add_wait_queue(q, wait);
> +		}
> +	}
> 	set_current_state(state);
> 	spin_unlock_irqrestore(&q->lock, flags);
> }
> +
> +void
> +prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
> +{
> +	return __prepare_to_wait(q, wait, NULL, state, false);
> +}
> EXPORT_SYMBOL(prepare_to_wait);
>
> void
> prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int state)
> {
> +	return __prepare_to_wait(q, wait, NULL, state, true);
> }
> EXPORT_SYMBOL(prepare_to_wait_exclusive);
>
> @@ -228,7 +236,8 @@ EXPORT_SYMBOL(prepare_to_wait_event);
>   * the wait descriptor from the given waitqueue if still
>   * queued.
>   */
> +static inline void __finish_wait(wait_queue_head_t *q, wait_queue_t *wait,
> +			struct page *page)
> {

Same thing, make that __always_inline.

> 	unsigned long flags;
>
> @@ -249,9 +258,16 @@ void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
> 	if (!list_empty_careful(&wait->task_list)) {
> 		spin_lock_irqsave(&q->lock, flags);
> 		list_del_init(&wait->task_list);
> +		if (page && !waitqueue_active(q))
> +			ClearPageWaiters(page);
> 		spin_unlock_irqrestore(&q->lock, flags);
> 	}
> }
> +
> +void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
> +{
> +	return __finish_wait(q, wait, NULL);
> +}
> EXPORT_SYMBOL(finish_wait);
>
> /**
> @@ -374,6 +427,19 @@ int __sched out_of_line_wait_on_bit_lock(void *word, int bit,
> }
> EXPORT_SYMBOL(out_of_line_wait_on_bit_lock);
>
> +void __wake_up_page_bit(wait_queue_head_t *wqh, struct page *page, void *word, int bit)
> +{
> +	struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&wqh->lock, flags);
> +	if (waitqueue_active(wqh))
> +		__wake_up_common(wqh, TASK_NORMAL, 1, 0, &key);
> +	else
> +		ClearPageWaiters(page);
> +	spin_unlock_irqrestore(&wqh->lock, flags);
> +}

Seeing how word is always going to be &page->flags, might it make sense to remove that argument?

Anyway, looks good in principle. Oleg?
Re: [PATCH] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath v5
On Wed, May 21, 2014 at 01:15:01PM +0100, Mel Gorman wrote: Andrew had suggested dropping v4 of the patch entirely as the numbers were marginal and the complexity was high. However, even on a relatively small machine running simple workloads the overhead of page_waitqueue and wakeup functions is around 5% of system CPU time. That's quite high for basic operations so I felt it was worth another shot. The performance figures are better with this version than they were for v4 and overall the patch should be more comprehensible. Simpler patch and better performance, yay! This patch introduces a new page flag for 64-bit capable machines, PG_waiters, to signal there are processes waiting on PG_lock and uses it to avoid memory barriers and waitqueue hash lookup in the unlock_page fastpath. The patch seems to also explicitly use it for PG_writeback, yet no mention of that here. diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c index 0ffa20a..f829e73 100644 --- a/kernel/sched/wait.c +++ b/kernel/sched/wait.c @@ -167,31 +167,39 @@ EXPORT_SYMBOL_GPL(__wake_up_sync); /* For internal use only */ * stops them from bleeding out - it would still allow subsequent * loads to move into the critical region). */ +static inline void Make that __always_inline, that way we're guaranteed to optimize the build time constant .page=NULL cases. +__prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, + struct page *page, int state, bool exclusive) { unsigned long flags; + if (page !PageWaiters(page)) + SetPageWaiters(page); + if (list_empty(wait-task_list)) { + if (exclusive) { + wait-flags |= WQ_FLAG_EXCLUSIVE; + __add_wait_queue_tail(q, wait); + } else { I'm fairly sure we've just initialized the wait thing to 0, so clearing the bit would be superfluous. + wait-flags = ~WQ_FLAG_EXCLUSIVE; + __add_wait_queue(q, wait); + } + } set_current_state(state); spin_unlock_irqrestore(q-lock, flags); } + +void +prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state) +{ + return __prepare_to_wait(q, wait, NULL, state, false); +} EXPORT_SYMBOL(prepare_to_wait); void prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int state) { + return __prepare_to_wait(q, wait, NULL, state, true); } EXPORT_SYMBOL(prepare_to_wait_exclusive); @@ -228,7 +236,8 @@ EXPORT_SYMBOL(prepare_to_wait_event); * the wait descriptor from the given waitqueue if still * queued. */ +static inline void __finish_wait(wait_queue_head_t *q, wait_queue_t *wait, + struct page *page) { Same thing, make that __always_inline. 
unsigned long flags; @@ -249,9 +258,16 @@ void finish_wait(wait_queue_head_t *q, wait_queue_t *wait) if (!list_empty_careful(wait-task_list)) { spin_lock_irqsave(q-lock, flags); list_del_init(wait-task_list); + if (page !waitqueue_active(q)) + ClearPageWaiters(page); spin_unlock_irqrestore(q-lock, flags); } } + +void finish_wait(wait_queue_head_t *q, wait_queue_t *wait) +{ + return __finish_wait(q, wait, NULL); +} EXPORT_SYMBOL(finish_wait); /** @@ -374,6 +427,19 @@ int __sched out_of_line_wait_on_bit_lock(void *word, int bit, } EXPORT_SYMBOL(out_of_line_wait_on_bit_lock); +void __wake_up_page_bit(wait_queue_head_t *wqh, struct page *page, void *word, int bit) +{ + struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit); + unsigned long flags; + + spin_lock_irqsave(wqh-lock, flags); + if (waitqueue_active(wqh)) + __wake_up_common(wqh, TASK_NORMAL, 1, 0, key); + else + ClearPageWaiters(page); + spin_unlock_irqrestore(wqh-lock, flags); +} Seeing how word is always going to be page-flags, might it make sense to remove that argument? Anyway, looks good in principle. Oleg? -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath v5
On Wed, May 21, 2014 at 03:02:23PM +0200, Peter Zijlstra wrote:
> On Wed, May 21, 2014 at 01:15:01PM +0100, Mel Gorman wrote:
> > Andrew had suggested dropping v4 of the patch entirely as the numbers were
> > marginal and the complexity was high. However, even on a relatively small
> > machine running simple workloads the overhead of page_waitqueue and wakeup
> > functions is around 5% of system CPU time. That's quite high for basic
> > operations so I felt it was worth another shot. The performance figures
> > are better with this version than they were for v4 and overall the patch
> > should be more comprehensible.
>
> Simpler patch and better performance, yay!
>
> > This patch introduces a new page flag for 64-bit capable machines,
> > PG_waiters, to signal there are processes waiting on PG_lock and uses it to
> > avoid memory barriers and waitqueue hash lookup in the unlock_page fastpath.
>
> The patch seems to also explicitly use it for PG_writeback, yet no
> mention of that here.

I'll add a note.

> > diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
> > index 0ffa20a..f829e73 100644
> > --- a/kernel/sched/wait.c
> > +++ b/kernel/sched/wait.c
> > @@ -167,31 +167,39 @@ EXPORT_SYMBOL_GPL(__wake_up_sync);	/* For internal use only */
> >   * stops them from bleeding out - it would still allow subsequent
> >   * loads to move into the critical region).
> >   */
> > +static inline void
>
> Make that __always_inline, that way we're guaranteed to optimize the
> build time constant .page=NULL cases.

Done.

> > +__prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait,
> > +			struct page *page, int state, bool exclusive)
> >  {
> >  	unsigned long flags;
> >
> > +	if (page && !PageWaiters(page))
> > +		SetPageWaiters(page);
> > +	if (list_empty(&wait->task_list)) {
> > +		if (exclusive) {
> > +			wait->flags |= WQ_FLAG_EXCLUSIVE;
> > +			__add_wait_queue_tail(q, wait);
> > +		} else {
>
> I'm fairly sure we've just initialized the wait thing to 0, so clearing
> the bit would be superfluous.

I assume you mean the clearing of WQ_FLAG_EXCLUSIVE. It may or may not
be superfluous. If it's an on-stack wait_queue_t initialised with
DEFINE_WAIT() then it's redundant. If it's a wait_queue_t that is being
reused and sometimes used for exclusive waits and other times for
non-exclusive waits then it's required. The API allows this to happen
so I see no harm in clearing the flag like the old code did. Am I
missing your point?

> > +			wait->flags &= ~WQ_FLAG_EXCLUSIVE;
> > +			__add_wait_queue(q, wait);
> > +		}
> > +	}
> >  	set_current_state(state);
> >  	spin_unlock_irqrestore(&q->lock, flags);
> >  }
> > +
> > +void
> > +prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
> > +{
> > +	return __prepare_to_wait(q, wait, NULL, state, false);
> > +}
> >  EXPORT_SYMBOL(prepare_to_wait);
> >
> >  void
> >  prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int state)
> >  {
> > +	return __prepare_to_wait(q, wait, NULL, state, true);
> >  }
> >  EXPORT_SYMBOL(prepare_to_wait_exclusive);
> >
> > @@ -228,7 +236,8 @@ EXPORT_SYMBOL(prepare_to_wait_event);
> >   * the wait descriptor from the given waitqueue if still
> >   * queued.
> >   */
> > +static inline void __finish_wait(wait_queue_head_t *q, wait_queue_t *wait,
> > +			struct page *page)
> >  {
>
> Same thing, make that __always_inline.

Done.
> >  	unsigned long flags;
> >
> > @@ -249,9 +258,16 @@ void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
> >  	if (!list_empty_careful(&wait->task_list)) {
> >  		spin_lock_irqsave(&q->lock, flags);
> >  		list_del_init(&wait->task_list);
> > +		if (page && !waitqueue_active(q))
> > +			ClearPageWaiters(page);
> >  		spin_unlock_irqrestore(&q->lock, flags);
> >  	}
> >  }
> > +
> > +void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
> > +{
> > +	return __finish_wait(q, wait, NULL);
> > +}
> >  EXPORT_SYMBOL(finish_wait);
> >
> >  /**
> > @@ -374,6 +427,19 @@ int __sched out_of_line_wait_on_bit_lock(void *word, int bit,
> >  }
> >  EXPORT_SYMBOL(out_of_line_wait_on_bit_lock);
> >
> > +void __wake_up_page_bit(wait_queue_head_t *wqh, struct page *page, void *word, int bit)
> > +{
> > +	struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&wqh->lock, flags);
> > +	if (waitqueue_active(wqh))
> > +		__wake_up_common(wqh, TASK_NORMAL, 1, 0, &key);
> > +	else
> > +		ClearPageWaiters(page);
> > +	spin_unlock_irqrestore(&wqh->lock, flags);
> > +}
>
> Seeing how word is always going to be page->flags, might it make sense
> to remove that argument?

The wait_queue was defined on-stack with DEFINE_WAIT_BIT which uses
wake_bit_function() as a wakeup function and that thing consumes both
the page->flags and the bit number it's interested in. This is used for
both PG_writeback and PG_locked so assumptions cannot really be made
about the value.

-- 
Mel Gorman
SUSE
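To make the reuse case Mel describes concrete, here is a hypothetical
caller that recycles one wait_queue_t for both wait styles (illustrative
only; no in-tree caller is claimed to do this):

	DEFINE_WAIT(wait);		/* .flags starts at 0 */

	prepare_to_wait_exclusive(q, &wait, TASK_UNINTERRUPTIBLE);
	/* ... woken; WQ_FLAG_EXCLUSIVE is now set in wait.flags ... */
	finish_wait(q, &wait);		/* does not touch .flags */

	prepare_to_wait(q, &wait, TASK_UNINTERRUPTIBLE);
	/*
	 * Without the "wait->flags &= ~WQ_FLAG_EXCLUSIVE" in the
	 * non-exclusive path, this second wait would still carry the
	 * exclusive flag and __wake_up_common() could terminate its
	 * wakeup sweep at this waiter.
	 */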
Re: [PATCH] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath v5
On Wed, May 21, 2014 at 04:33:57PM +0100, Mel Gorman wrote:
> > > +__prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait,
> > > +			struct page *page, int state, bool exclusive)
> > >  {
> > >  	unsigned long flags;
> > >
> > > +	if (page && !PageWaiters(page))
> > > +		SetPageWaiters(page);
> > > +	if (list_empty(&wait->task_list)) {
> > > +		if (exclusive) {
> > > +			wait->flags |= WQ_FLAG_EXCLUSIVE;
> > > +			__add_wait_queue_tail(q, wait);
> > > +		} else {
> >
> > I'm fairly sure we've just initialized the wait thing to 0, so clearing
> > the bit would be superfluous.
>
> I assume you mean the clearing of WQ_FLAG_EXCLUSIVE. It may or may not
> be superfluous. If it's an on-stack wait_queue_t initialised with
> DEFINE_WAIT() then it's redundant. If it's a wait_queue_t that is being
> reused and sometimes used for exclusive waits and other times for
> non-exclusive waits then it's required. The API allows this to happen
> so I see no harm in clearing the flag like the old code did. Am I
> missing your point?

Yeah, I'm not aware of any other users except the on-stack kind, but
you're right. Maybe we should stick an object_is_on_stack() test in
there to see if anything falls out; something for a rainy afternoon
perhaps..

> > > +void __wake_up_page_bit(wait_queue_head_t *wqh, struct page *page, void *word, int bit)
> > > +{
> > > +	struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
> > > +	unsigned long flags;
> > > +
> > > +	spin_lock_irqsave(&wqh->lock, flags);
> > > +	if (waitqueue_active(wqh))
> > > +		__wake_up_common(wqh, TASK_NORMAL, 1, 0, &key);
> > > +	else
> > > +		ClearPageWaiters(page);
> > > +	spin_unlock_irqrestore(&wqh->lock, flags);
> > > +}
> >
> > Seeing how word is always going to be page->flags, might it make sense
> > to remove that argument?
>
> The wait_queue was defined on-stack with DEFINE_WAIT_BIT which uses
> wake_bit_function() as a wakeup function and that thing consumes both
> the page->flags and the bit number it's interested in. This is used for
> both PG_writeback and PG_locked so assumptions cannot really be made
> about the value.

Well, both PG_ flags come from the same page->flags word, right? But
yeah, if we ever decide to grow the page frame with another flags word
we'd be in trouble :-)

In any case I don't feel too strongly about either of these points.
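The rainy-afternoon check Peter has in mind would presumably be a
one-liner in __prepare_to_wait(), along these lines (hypothetical, not
part of any posted patch):

	/* Flush out any long-lived, reused wait descriptors */
	WARN_ON_ONCE(!object_is_on_stack(wait));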
Re: [PATCH] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath v5
On Wed, 21 May 2014 13:15:01 +0100 Mel Gorman mgor...@suse.de wrote:

> Andrew had suggested dropping v4 of the patch entirely as the numbers were
> marginal and the complexity was high. However, even on a relatively small
> machine running simple workloads the overhead of page_waitqueue and wakeup
> functions is around 5% of system CPU time. That's quite high for basic
> operations so I felt it was worth another shot. The performance figures
> are better with this version than they were for v4 and overall the patch
> should be more comprehensible.
>
> Changelog since v4
> o Remove dependency on io_schedule_timeout
> o Push waiting logic down into waitqueue
>
> This patch introduces a new page flag for 64-bit capable machines,
> PG_waiters, to signal there are processes waiting on PG_lock and uses it to
> avoid memory barriers and waitqueue hash lookup in the unlock_page
> fastpath. This adds a few branches to the fast path but avoids bouncing a
> dirty cache line between CPUs. 32-bit machines always take the slow path
> but the primary motivation for this patch is large machines so I do not
> think that is a concern.
>
> The test case used to evaluate this is a simple dd of a large file done
> multiple times with the file deleted on each iteration. The size of the
> file is 1/10th physical memory to avoid dirty page balancing. In the async
> case it will be possible that the workload completes without even hitting
> the disk and will have variable results but highlight the impact of
> mark_page_accessed for async IO. The sync results are expected to be more
> stable. The exception is tmpfs where the normal case is for the IO to not
> hit the disk.
>
> The test machine was single socket and UMA to avoid any scheduling or NUMA
> artifacts. Throughput and wall times are presented for sync IO, only wall
> times are shown for async as the granularity reported by dd and the
> variability is unsuitable for comparison. As async results were variable
> due to writeback timings, I'm only reporting the maximum figures. The
> sync results were stable enough to make the mean and stddev uninteresting.
>
> The performance results are reported based on a run with no profiling.
> Profile data is based on a separate run with oprofile running. The kernels
> being compared are accessed-v2 which is the patch series up to this patch
> whereas lockpage-v2 includes this patch.
>
> ...
>
> --- a/include/linux/wait.h
> +++ b/include/linux/wait.h
> @@ -147,8 +147,13 @@ void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode, int nr, void *k
>  void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr);
>  void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr);
>  void __wake_up_bit(wait_queue_head_t *, void *, int);
> +void __wake_up_page_bit(wait_queue_head_t *, struct page *page, void *, int);

You're going to need to forward-declare struct page in wait.h. The good
thing about this is that fewer people will notice that we've gone and
mentioned struct page in wait.h :(

>  int __wait_on_bit(wait_queue_head_t *, struct wait_bit_queue *, int (*)(void *), unsigned);
> +int __wait_on_page_bit(wait_queue_head_t *, struct wait_bit_queue *,
>
> ...
>
> --- a/kernel/sched/wait.c
> +++ b/kernel/sched/wait.c
> @@ -167,31 +167,39 @@ EXPORT_SYMBOL_GPL(__wake_up_sync);	/* For internal use only */
>   * stops them from bleeding out - it would still allow subsequent
>   * loads to move into the critical region).
>   */
> -void
> -prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
> +static inline void
> +__prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait,
> +			struct page *page, int state, bool exclusive)

Putting MM stuff into core waitqueue code is rather bad. I really don't
know how I'm going to explain this to my family.

>  {
>  	unsigned long flags;
>
> -	wait->flags &= ~WQ_FLAG_EXCLUSIVE;
>  	spin_lock_irqsave(&q->lock, flags);
> -	if (list_empty(&wait->task_list))
> -		__add_wait_queue(q, wait);
> +	if (page && !PageWaiters(page))
> +		SetPageWaiters(page);

And this isn't racy because we're assuming that all users of `page' are
using the same waitqueue. ie, assuming all callers use page_waitqueue()?
Subtle, unobvious, worth documenting.

> +	if (list_empty(&wait->task_list)) {
> +		if (exclusive) {
> +			wait->flags |= WQ_FLAG_EXCLUSIVE;
> +			__add_wait_queue_tail(q, wait);
> +		} else {
> +			wait->flags &= ~WQ_FLAG_EXCLUSIVE;
> +			__add_wait_queue(q, wait);
> +		}
> +	}
>  	set_current_state(state);
>  	spin_unlock_irqrestore(&q->lock, flags);
>  }
>
> ...
>
> @@ -228,7 +236,8 @@ EXPORT_SYMBOL(prepare_to_wait_event);
>   * the wait descriptor from the given waitqueue if still
>   * queued.
>   */
> -void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
> +static inline void __finish_wait(wait_queue_head_t *q, wait_queue_t *wait,
> +			struct page *page)
Re: [PATCH] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath v5
On Wed, May 21, 2014 at 02:26:22PM -0700, Andrew Morton wrote:
> > +static inline void
> > +__prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait,
> > +			struct page *page, int state, bool exclusive)
>
> Putting MM stuff into core waitqueue code is rather bad. I really don't
> know how I'm going to explain this to my family.

Right, so we could avoid all that and make the functions in mm/filemap.c
rather large and opencode a bunch of wait.c stuff. Which is pretty much
what I initially pseudo proposed.

> > +	__ClearPageWaiters(page);
>
> We're freeing the page - if someone is still waiting on it then we have
> a huge bug? It's the mysterious collision thing again I hope?

Yeah, so we only clear that bit when at 'unlock' we find there are no
more pending waiters, so if the last unlock still had a waiter, we'll
leave the bit set. So it's entirely reasonable to still have it set
when we free a page etc..
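A worked example of the dangling case being described, assuming pages A
and B hash to the same waitqueue head (illustrative timeline, not code
from the patch):

	/*
	 * waiter W1 sleeps on A:  SetPageWaiters(A), W1 queued on wqh
	 * waiter W2 sleeps on B:  SetPageWaiters(B), W2 queued on wqh
	 * unlock_page(A):         wakes W1; wqh still holds W2, so
	 *                         waitqueue_active(wqh) is true and
	 *                         PG_waiters(A) is left set ("dangles")
	 * free A:                 __ClearPageWaiters(A) tidies the bit up;
	 *                         nobody was actually waiting on A
	 */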
Re: [PATCH] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath v5
On Wed, 21 May 2014 23:33:54 +0200 Peter Zijlstra pet...@infradead.org wrote:

> On Wed, May 21, 2014 at 02:26:22PM -0700, Andrew Morton wrote:
> > > +static inline void
> > > +__prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait,
> > > +			struct page *page, int state, bool exclusive)
> >
> > Putting MM stuff into core waitqueue code is rather bad. I really don't
> > know how I'm going to explain this to my family.
>
> Right, so we could avoid all that and make the functions in mm/filemap.c
> rather large and opencode a bunch of wait.c stuff.

The world won't end if we do it Mel's way and it's probably the most
efficient. But ugh. This stuff does raise the "it had better be a
useful patch" bar.

> Which is pretty much what I initially pseudo proposed.

Alternative solution is not to merge the patch ;)

> > > +	__ClearPageWaiters(page);
> >
> > We're freeing the page - if someone is still waiting on it then we have
> > a huge bug? It's the mysterious collision thing again I hope?
>
> Yeah, so we only clear that bit when at 'unlock' we find there are no
> more pending waiters, so if the last unlock still had a waiter, we'll
> leave the bit set.

Confused. If the last unlock had a waiter, that waiter will get woken
up so there are no waiters any more, so the last unlock clears the flag.

um, how do we determine that there are no more waiters? By looking at
the waitqueue. But that waitqueue is hashed, so it may contain waiters
for other pages, so we're screwed? But we could just go and wake up the
other-page waiters anyway and still clear PG_waiters?

um2, we're using exclusive waitqueues so we can't (or don't) wake all
waiters, so we're screwed again?

(This process is proving to be a hard way of writing Mel's changelog btw).

If I'm still on track here, what happens if we switch to wake-all so we
can avoid the dangling flag? I doubt if there are many collisions on
that hash table?

If there *are* a lot of collisions, I bet it's because a great pile of
threads are all waiting on the same page. If they're trying to lock
that page then wake-all is bad. But if they're just waiting for IO
completion (probable) then it's OK.

I'll stop now.
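In __wake_up_common() terms, the wake-all experiment Andrew is floating
is just the nr_exclusive argument. A sketch, assuming the call site in
__wake_up_page_bit() from this patch:

	/* current behaviour: FIFO, wake a single exclusive waiter */
	__wake_up_common(wqh, TASK_NORMAL, 1, 0, &key);

	/*
	 * wake-all alternative: nr_exclusive=0 never terminates the
	 * sweep, so every waiter on the hashed queue is woken,
	 * other-page waiters included; the thundering herd then races
	 * for the page lock.
	 */
	__wake_up_common(wqh, TASK_NORMAL, 0, 0, &key);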
Re: [PATCH] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath v5
On Wed, May 21, 2014 at 02:26:22PM -0700, Andrew Morton wrote:
> On Wed, 21 May 2014 13:15:01 +0100 Mel Gorman mgor...@suse.de wrote:
>
> > Andrew had suggested dropping v4 of the patch entirely as the numbers were
> > marginal and the complexity was high. However, even on a relatively small
> > machine running simple workloads the overhead of page_waitqueue and wakeup
> > functions is around 5% of system CPU time. That's quite high for basic
> > operations so I felt it was worth another shot. The performance figures
> > are better with this version than they were for v4 and overall the patch
> > should be more comprehensible.
> >
> > Changelog since v4
> > o Remove dependency on io_schedule_timeout
> > o Push waiting logic down into waitqueue
> >
> > This patch introduces a new page flag for 64-bit capable machines,
> > PG_waiters, to signal there are processes waiting on PG_lock and uses it to
> > avoid memory barriers and waitqueue hash lookup in the unlock_page
> > fastpath. This adds a few branches to the fast path but avoids bouncing a
> > dirty cache line between CPUs. 32-bit machines always take the slow path
> > but the primary motivation for this patch is large machines so I do not
> > think that is a concern.
> >
> > The test case used to evaluate this is a simple dd of a large file done
> > multiple times with the file deleted on each iteration. The size of the
> > file is 1/10th physical memory to avoid dirty page balancing. In the async
> > case it will be possible that the workload completes without even hitting
> > the disk and will have variable results but highlight the impact of
> > mark_page_accessed for async IO. The sync results are expected to be more
> > stable. The exception is tmpfs where the normal case is for the IO to not
> > hit the disk.
> >
> > The test machine was single socket and UMA to avoid any scheduling or NUMA
> > artifacts. Throughput and wall times are presented for sync IO, only wall
> > times are shown for async as the granularity reported by dd and the
> > variability is unsuitable for comparison. As async results were variable
> > due to writeback timings, I'm only reporting the maximum figures. The
> > sync results were stable enough to make the mean and stddev uninteresting.
> >
> > The performance results are reported based on a run with no profiling.
> > Profile data is based on a separate run with oprofile running. The kernels
> > being compared are accessed-v2 which is the patch series up to this patch
> > whereas lockpage-v2 includes this patch.
> >
> > ...
> >
> > --- a/include/linux/wait.h
> > +++ b/include/linux/wait.h
> > @@ -147,8 +147,13 @@ void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode, int nr, void *k
> >  void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr);
> >  void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr);
> >  void __wake_up_bit(wait_queue_head_t *, void *, int);
> > +void __wake_up_page_bit(wait_queue_head_t *, struct page *page, void *, int);
>
> You're going to need to forward-declare struct page in wait.h. The good
> thing about this is that fewer people will notice that we've gone and
> mentioned struct page in wait.h :(

Will add the forward-declare.

> >  int __wait_on_bit(wait_queue_head_t *, struct wait_bit_queue *, int (*)(void *), unsigned);
> > +int __wait_on_page_bit(wait_queue_head_t *, struct wait_bit_queue *,
> >
> > ...
> >
> > --- a/kernel/sched/wait.c
> > +++ b/kernel/sched/wait.c
> > @@ -167,31 +167,39 @@ EXPORT_SYMBOL_GPL(__wake_up_sync);	/* For internal use only */
> >   * stops them from bleeding out - it would still allow subsequent
> >   * loads to move into the critical region).
> >   */
> > -void
> > -prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
> > +static inline void
> > +__prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait,
> > +			struct page *page, int state, bool exclusive)
>
> Putting MM stuff into core waitqueue code is rather bad. I really don't
> know how I'm going to explain this to my family.

The alternative to updating wait.h and wait.c was open-coding the
waitqueue modifications in filemap.c but that is just as ugly. The wait
queue stuff is complex and there was motivation to keep it in one place
even if we are special casing struct page handling.

FWIW, I cannot explain anything I do in work to my family. It gets
blank looks no matter what.

> >  {
> >  	unsigned long flags;
> >
> > -	wait->flags &= ~WQ_FLAG_EXCLUSIVE;
> >  	spin_lock_irqsave(&q->lock, flags);
> > -	if (list_empty(&wait->task_list))
> > -		__add_wait_queue(q, wait);
> > +	if (page && !PageWaiters(page))
> > +		SetPageWaiters(page);
>
> And this isn't racy because we're assuming that all users of `page' are
> using the same waitqueue. ie, assuming all callers use page_waitqueue()?
> Subtle, unobvious, worth documenting.

All users of the page will get the same waitqueue. page_waitqueue is
hashed on the pointer value.

> > +	if (list_empty(&wait->task_list)) {
> > +		if (exclusive) {
> > +			wait->flags |=
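For reference, the hashing being described is the existing
page_waitqueue() in mm/filemap.c, which in kernels of this vintage looks
like the following (quoted from memory, so treat it as a sketch):

	static wait_queue_head_t *page_waitqueue(struct page *page)
	{
		const struct zone *zone = page_zone(page);

		return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
	}

Every caller that passes the same struct page pointer hashes to the same
wait_queue_head_t, which is why the PG_waiters update in
__prepare_to_wait() is not racy against the waker.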
Re: [PATCH] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath v5
On Wed, May 21, 2014 at 02:50:00PM -0700, Andrew Morton wrote:
> On Wed, 21 May 2014 23:33:54 +0200 Peter Zijlstra pet...@infradead.org wrote:
> > On Wed, May 21, 2014 at 02:26:22PM -0700, Andrew Morton wrote:
> > > > +static inline void
> > > > +__prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait,
> > > > +			struct page *page, int state, bool exclusive)
> > >
> > > Putting MM stuff into core waitqueue code is rather bad. I really don't
> > > know how I'm going to explain this to my family.
> >
> > Right, so we could avoid all that and make the functions in mm/filemap.c
> > rather large and opencode a bunch of wait.c stuff.
>
> The world won't end if we do it Mel's way and it's probably the most
> efficient. But ugh. This stuff does raise the "it had better be a
> useful patch" bar.
>
> > Which is pretty much what I initially pseudo proposed.
>
> Alternative solution is not to merge the patch ;)

While true, the overhead of the page_waitqueue lookups and unnecessary
wakeups sucks even on small machines. Not only does it hit us during
simple operations like dd to a file, but we would hit it during page
reclaim as well, which is trylock_page/unlock_page intensive.

> > > > +	__ClearPageWaiters(page);
> > >
> > > We're freeing the page - if someone is still waiting on it then we have
> > > a huge bug? It's the mysterious collision thing again I hope?
> >
> > Yeah, so we only clear that bit when at 'unlock' we find there are no
> > more pending waiters, so if the last unlock still had a waiter, we'll
> > leave the bit set.
>
> Confused. If the last unlock had a waiter, that waiter will get woken
> up so there are no waiters any more, so the last unlock clears the flag.
>
> um, how do we determine that there are no more waiters? By looking at
> the waitqueue. But that waitqueue is hashed, so it may contain waiters
> for other pages, so we're screwed? But we could just go and wake up the
> other-page waiters anyway and still clear PG_waiters?
>
> um2, we're using exclusive waitqueues so we can't (or don't) wake all
> waiters, so we're screwed again?
>
> (This process is proving to be a hard way of writing Mel's changelog btw).
>
> If I'm still on track here, what happens if we switch to wake-all so we
> can avoid the dangling flag? I doubt if there are many collisions on
> that hash table?
>
> If there *are* a lot of collisions, I bet it's because a great pile of
> threads are all waiting on the same page. If they're trying to lock
> that page then wake-all is bad. But if they're just waiting for IO
> completion (probable) then it's OK.
>
> I'll stop now.

Rather than putting details in the changelog, here is an updated version
that hopefully improves the commentary to the point where it's actually
clear.

---8<---
From: Nick Piggin npig...@suse.de
Subject: [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v6

Changelog since v5
o __always_inline where appropriate			(peterz)
o Documentation						(akpm)

Changelog since v4
o Remove dependency on io_schedule_timeout
o Push waiting logic down into waitqueue

This patch introduces a new page flag for 64-bit capable machines,
PG_waiters, to signal there are processes waiting on PG_lock or
PG_writeback and uses it to avoid memory barriers and waitqueue hash
lookup in the unlock_page fastpath. This adds a few branches to the fast
path but avoids bouncing a dirty cache line between CPUs. 32-bit machines
always take the slow path but the primary motivation for this patch is
large machines so I do not think that is a concern.

The test case used to evaluate this is a simple dd of a large file done
multiple times with the file deleted on each iteration.
The size of the file is 1/10th physical memory to avoid dirty page
balancing. In the async case it will be possible that the workload
completes without even hitting the disk and will have variable results
but highlight the impact of mark_page_accessed for async IO. The sync
results are expected to be more stable. The exception is tmpfs where the
normal case is for the IO to not hit the disk.

The test machine was single socket and UMA to avoid any scheduling or
NUMA artifacts. Throughput and wall times are presented for sync IO, only
wall times are shown for async as the granularity reported by dd and the
variability is unsuitable for comparison. As async results were variable
due to writeback timings, I'm only reporting the maximum figures. The
sync results were stable enough to make the mean and stddev
uninteresting.

The performance results are reported based on a run with no profiling.
Profile data is based on a separate run with oprofile running. The
kernels being compared are accessed-v2 which is the patch series up to
this patch whereas lockpage-v2 includes this patch.

async dd
                                 3.15.0-rc5            3.15.0-rc5
                                      mmotm           lockpage-v5
btrfs Max      ddtime        0.5863 (  0.00%)        0.5621