Re: [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v7

On Thu, May 22, 2014 at 05:04:51PM +0200, Peter Zijlstra wrote:
> On Thu, May 22, 2014 at 03:40:45PM +0100, Mel Gorman wrote:
> > > +static bool __wake_up_common(wait_queue_head_t *q, unsigned int mode,
> > >                         int nr_exclusive, int wake_flags, void *key)
> > >  {
> > >         wait_queue_t *curr, *next;
> > > +       bool woke = false;
> > >
> > >         list_for_each_entry_safe(curr, next, &q->task_list, task_list) {
> > >                 unsigned flags = curr->flags;
> > >
> > > +               if (curr->func(curr, mode, wake_flags, key)) {
> > > +                       woke = true;
> > > +                       if ((flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
> > > +                               break;
> > > +               }
> > >         }
> > > +
> > > +       return woke;
> >
> > Ok, thinking about this more I'm less sure.
> >
> > There are cases where the curr->func returns false even though there is a
> > task that needs to run -- task was already running or preparing to run. We
> > potentially end up clearing PG_waiters while there are still tasks on the
> > waitqueue. As __finish_wait checks if the waitqueue is empty and the last
> > waiter clears the bit I think there is nothing to gain by trying to do the
> > same job in __wake_up_page_bit.
>
> Hmm, I think you're right, we need the test result from
> wake_bit_function(), unpolluted by the ttwu return value.

Which would be a bit too special cased and not a clear win. I at least
added a comment to explain what is going on here.

        /*
         * Unlike __wake_up_bit it is necessary to check waitqueue_active
         * under the wqh->lock to avoid races with parallel additions that
         * could result in lost wakeups.
         */
        spin_lock_irqsave(&wqh->lock, flags);
        if (waitqueue_active(wqh)) {
                /*
                 * Try waking a task on the queue. Responsibility for clearing
                 * the PG_waiters bit is left to the last waiter on the
                 * waitqueue as PageWaiters is called outside wqh->lock and
                 * we cannot miss wakeups. Due to hashqueue collisions, there
                 * may be colliding pages that still have PG_waiters set but
                 * the impact means there will be at least one unnecessary
                 * lookup of the page waitqueue on the next unlock_page or
                 * end of writeback.
                 */
                __wake_up_common(wqh, TASK_NORMAL, 1, 0, &key);
        } else {
                /* No potential waiters, safe to clear PG_waiters */
                ClearPageWaiters(page);
        }
        spin_unlock_irqrestore(&wqh->lock, flags);

-- 
Mel Gorman
SUSE Labs
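The interleaving that comment guards against is easier to see in a
standalone program. Below is a minimal userspace analogue, not kernel
code: a pthread mutex stands in for wqh->lock, a condition variable for
the waitqueue, a plain flag for PG_waiters, and every name is invented
for illustration. Build with cc -pthread.

/* lost_wakeup.c: userspace analogue of testing for waiters under the
 * queue lock. "waiters" stands in for PG_waiters, the mutex for
 * wqh->lock. All names are illustrative, none of this is the patch. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static bool waiters;            /* analogue of PG_waiters */
static bool page_locked = true; /* analogue of PG_locked */

static void *waiter(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&lock);
        waiters = true;                 /* queue ourselves */
        while (page_locked)
                pthread_cond_wait(&cond, &lock);
        waiters = false;                /* last waiter clears the flag */
        pthread_mutex_unlock(&lock);
        return NULL;
}

static void unlock_page_analogue(void)
{
        /*
         * Testing "waiters" under the same lock the waiter takes means
         * we either see the flag set, or the waiter has not queued yet
         * and will observe page_locked == false before sleeping.
         * Testing it without the lock opens a window where the waiter
         * queues after the test and sleeps forever: the lost wakeup.
         */
        pthread_mutex_lock(&lock);
        page_locked = false;
        if (waiters)
                pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
}

int main(void)
{
        pthread_t t;

        pthread_create(&t, NULL, waiter, NULL);
        unlock_page_analogue();
        pthread_join(t, NULL);
        puts("waiter ran; no wakeup lost");
        return 0;
}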
Re: [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v7

On Thu, May 22, 2014 at 03:40:45PM +0100, Mel Gorman wrote:
> > +static bool __wake_up_common(wait_queue_head_t *q, unsigned int mode,
> >                         int nr_exclusive, int wake_flags, void *key)
> >  {
> >         wait_queue_t *curr, *next;
> > +       bool woke = false;
> >
> >         list_for_each_entry_safe(curr, next, &q->task_list, task_list) {
> >                 unsigned flags = curr->flags;
> >
> > +               if (curr->func(curr, mode, wake_flags, key)) {
> > +                       woke = true;
> > +                       if ((flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
> > +                               break;
> > +               }
> >         }
> > +
> > +       return woke;
>
> Ok, thinking about this more I'm less sure.
>
> There are cases where the curr->func returns false even though there is a
> task that needs to run -- task was already running or preparing to run. We
> potentially end up clearing PG_waiters while there are still tasks on the
> waitqueue. As __finish_wait checks if the waitqueue is empty and the last
> waiter clears the bit I think there is nothing to gain by trying to do the
> same job in __wake_up_page_bit.

Hmm, I think you're right, we need the test result from
wake_bit_function(), unpolluted by the ttwu return value.
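The point is that a single return value currently answers two different
questions: did this waiter match the key, and did try_to_wake_up()
actually change a task's state. A small userspace model of the
distinction follows; every type and name in it is invented for
illustration and nothing here is kernel code.

/* matched_vs_woken.c: model of the two facts a wake callback can report. */
#include <stdbool.h>
#include <stdio.h>

struct waiter {
        int key;               /* which page/bit this waiter sleeps on */
        bool already_running;  /* models try_to_wake_up() returning 0 */
};

struct wake_result {
        bool matched;          /* key comparison: what PG_waiters cares about */
        bool woken;            /* ttwu result: what nr_exclusive cares about */
};

static struct wake_result wake_one(const struct waiter *w, int key)
{
        struct wake_result r = { .matched = false, .woken = false };

        if (w->key != key)
                return r;
        r.matched = true;
        r.woken = !w->already_running;  /* ttwu is 0 if already running */
        return r;
}

int main(void)
{
        /* A waiter for our key that ttwu would report as "not woken". */
        struct waiter w = { .key = 42, .already_running = true };
        struct wake_result r = wake_one(&w, 42);

        printf("matched=%d woken=%d\n", r.matched, r.woken);
        /*
         * Keying the ClearPageWaiters decision off "woken" would clear
         * the flag while a matching waiter still exists; "matched" is
         * the signal that must stay authoritative.
         */
        return 0;
}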
Re: [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v7

On Thu, May 22, 2014 at 12:56:38PM +0200, Peter Zijlstra wrote:
> On Thu, May 22, 2014 at 11:40:51AM +0100, Mel Gorman wrote:
> > +void __wake_up_page_bit(wait_queue_head_t *wqh, struct page *page, void *word, int bit)
> > +{
> > +        struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
> > +        unsigned long flags;
> > +
> > +        /* If there is no PG_waiters bit, always take the slow path */
>
> That comment is misleading, this is actually a fast path for
> !PG_waiters.
>
> > +        if (!__PG_WAITERS && waitqueue_active(wq)) {
> > +                __wake_up(wq, TASK_NORMAL, 1, &key);
> > +                return;
> > +        }
> > +
> > +        /*
> > +         * Unlike __wake_up_bit it is necessary to check waitqueue_active to be
> > +         * checked under the wqh->lock to avoid races with parallel additions
> > +         * to the waitqueue. Otherwise races could result in lost wakeups
> > +         */
> > +        spin_lock_irqsave(&wqh->lock, flags);
> > +        if (waitqueue_active(wqh))
> > +                __wake_up_common(wqh, TASK_NORMAL, 1, 0, &key);
> > +        else
> > +                ClearPageWaiters(page);
> > +        spin_unlock_irqrestore(&wqh->lock, flags);
> > +}
>
> So I think you missed one Clear opportunity here that was in my original
> proposal, possibly because you also frobbed PG_writeback in.
>
> If you do:
>
>         spin_lock_irqsave(&wqh->lock, flags);
>         if (!waitqueue_active(wqh) ||
>             !__wake_up_common(wqh, TASK_NORMAL, 1, 0, &key))
>                 ClearPageWaiters(page);
>         spin_unlock_irqrestore(&wqh->lock, flags);
>
> With the below change to __wake_up_common(), we'll also clear the bit
> when there's no waiters of @page, even if there's waiters for another
> page.
>
> I suppose the one thing to say for the big open coded loop is that it's
> much easier to read than this scattered stuff.
>
> ---
> diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
> index 0ffa20ae657b..213c5bfe6b56 100644
> --- a/kernel/sched/wait.c
> +++ b/kernel/sched/wait.c
> @@ -61,18 +61,23 @@ EXPORT_SYMBOL(remove_wait_queue);
>   * started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
>   * zero in this (rare) case, and we handle it by continuing to scan the queue.
>   */
> -static void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
> +static bool __wake_up_common(wait_queue_head_t *q, unsigned int mode,
>                         int nr_exclusive, int wake_flags, void *key)
>  {
>         wait_queue_t *curr, *next;
> +       bool woke = false;
>
>         list_for_each_entry_safe(curr, next, &q->task_list, task_list) {
>                 unsigned flags = curr->flags;
>
> -               if (curr->func(curr, mode, wake_flags, key) &&
> -                               (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
> -                       break;
> +               if (curr->func(curr, mode, wake_flags, key)) {
> +                       woke = true;
> +                       if ((flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
> +                               break;
> +               }
>         }
> +
> +       return woke;

Ok, thinking about this more I'm less sure.

There are cases where the curr->func returns false even though there is a
task that needs to run -- task was already running or preparing to run. We
potentially end up clearing PG_waiters while there are still tasks on the
waitqueue. As __finish_wait checks if the waitqueue is empty and the last
waiter clears the bit I think there is nothing to gain by trying to do the
same job in __wake_up_page_bit.

-- 
Mel Gorman
SUSE Labs
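The __finish_wait behaviour Mel leans on is not itself shown in this
thread, so the following is only a userspace sketch of the
last-waiter-clears pattern under stated assumptions: a mutex models
wqh->lock, a counter models the waitqueue length, and all names are
invented. Build with cc -pthread.

/* finish_wait_analogue.c: only the waiter that empties the queue clears
 * the flag, so the waker never has to guess from a ttwu return value. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

#define NTHREADS 4

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int nwaiters;          /* models the waitqueue list length */
static bool waiters_flag;     /* analogue of PG_waiters */
static bool page_locked = true;

static void *waiter(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&lock);
        nwaiters++;
        waiters_flag = true;
        while (page_locked)
                pthread_cond_wait(&cond, &lock);
        /*
         * finish_wait analogue: dequeue under the lock, and only the
         * waiter that empties the queue clears the flag.
         */
        if (--nwaiters == 0)
                waiters_flag = false;
        pthread_mutex_unlock(&lock);
        return NULL;
}

int main(void)
{
        pthread_t t[NTHREADS];
        int i;

        for (i = 0; i < NTHREADS; i++)
                pthread_create(&t[i], NULL, waiter, NULL);

        pthread_mutex_lock(&lock);
        page_locked = false;
        pthread_cond_broadcast(&cond);
        pthread_mutex_unlock(&lock);

        for (i = 0; i < NTHREADS; i++)
                pthread_join(t[i], NULL);

        printf("waiters_flag after all waiters finished: %d\n", waiters_flag);
        return 0;
}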
Re: [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v7

On Thu, May 22, 2014 at 12:56:38PM +0200, Peter Zijlstra wrote:
> On Thu, May 22, 2014 at 11:40:51AM +0100, Mel Gorman wrote:
> > +void __wake_up_page_bit(wait_queue_head_t *wqh, struct page *page, void *word, int bit)
> > +{
> > +        struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
> > +        unsigned long flags;
> > +
> > +        /* If there is no PG_waiters bit, always take the slow path */
>
> That comment is misleading, this is actually a fast path for
> !PG_waiters.
>

And could have been far better anyway now that you called me on it.

> > +        if (!__PG_WAITERS && waitqueue_active(wq)) {
> > +                __wake_up(wq, TASK_NORMAL, 1, &key);
> > +                return;
> > +        }
> > +
> > +        /*
> > +         * Unlike __wake_up_bit it is necessary to check waitqueue_active to be
> > +         * checked under the wqh->lock to avoid races with parallel additions
> > +         * to the waitqueue. Otherwise races could result in lost wakeups
> > +         */
> > +        spin_lock_irqsave(&wqh->lock, flags);
> > +        if (waitqueue_active(wqh))
> > +                __wake_up_common(wqh, TASK_NORMAL, 1, 0, &key);
> > +        else
> > +                ClearPageWaiters(page);
> > +        spin_unlock_irqrestore(&wqh->lock, flags);
> > +}
>
> So I think you missed one Clear opportunity here that was in my original
> proposal, possibly because you also frobbed PG_writeback in.
>

It got lost in the midst of all the other modifications to make this as
"obvious" as possible.

> <SNIP>
>
> I suppose the one thing to say for the big open coded loop is that it's
> much easier to read than this scattered stuff.
>

Sure, but the end result of open coding this is duplicated code that will
be harder to maintain overall. I could split __wake_up_bit and use that
in both but I do not think it would make the code any clearer for the
sake of two lines.

Untested but this on top?

diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 73cb8c6..d3a8c34 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -60,19 +60,26 @@ EXPORT_SYMBOL(remove_wait_queue);
  * There are circumstances in which we can try to wake a task which has already
  * started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
  * zero in this (rare) case, and we handle it by continuing to scan the queue.
+ *
+ * Returns true if a process was woken up
  */
-static void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
+static bool __wake_up_common(wait_queue_head_t *q, unsigned int mode,
                        int nr_exclusive, int wake_flags, void *key)
 {
        wait_queue_t *curr, *next;
+       bool woke = false;
 
        list_for_each_entry_safe(curr, next, &q->task_list, task_list) {
                unsigned flags = curr->flags;
 
-               if (curr->func(curr, mode, wake_flags, key) &&
-                               (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
-                       break;
+               if (curr->func(curr, mode, wake_flags, key)) {
+                       woke = true;
+                       if ((flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
+                               break;
+               }
        }
+
+       return woke;
 }
 
 /**
@@ -441,9 +448,13 @@ void __wake_up_page_bit(wait_queue_head_t *wqh, struct page *page, void *word, i
        struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
        unsigned long flags;
 
-       /* If there is no PG_waiters bit, always take the slow path */
-       if (!__PG_WAITERS && waitqueue_active(wq)) {
-               __wake_up(wq, TASK_NORMAL, 1, &key);
+       /*
+        * If there is no PG_waiters bit (32-bit), then waitqueue_active can be
+        * checked without wqh->lock as there is no PG_waiters race to protect.
+        */
+       if (!__PG_WAITERS) {
+               if (waitqueue_active(wqh))
+                       __wake_up(wqh, TASK_NORMAL, 1, &key);
                return;
        }
 
@@ -453,9 +464,8 @@ void __wake_up_page_bit(wait_queue_head_t *wqh, struct page *page, void *word, i
         * to the waitqueue. Otherwise races could result in lost wakeups
         */
        spin_lock_irqsave(&wqh->lock, flags);
-       if (waitqueue_active(wqh))
-               __wake_up_common(wqh, TASK_NORMAL, 1, 0, &key);
-       else
+       if (!waitqueue_active(wqh) ||
+           !__wake_up_common(wqh, TASK_NORMAL, 1, 0, &key))
                ClearPageWaiters(page);
        spin_unlock_irqrestore(&wqh->lock, flags);
 }
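A side note on the reshaped !__PG_WAITERS branch: __PG_WAITERS is a
compile-time constant, so on 32-bit the locked slow path below the early
return is dead code the compiler can drop entirely. A standalone sketch
of that property, using a made-up bit value rather than the patch's
actual definition:

/* pg_waiters_elision.c: build plain for the 32-bit shape, or with
 * -DCONFIG_64BIT for the 64-bit shape.  The macro and bit number here
 * are stand-ins, not the kernel's definitions. */
#include <stdio.h>

#ifdef CONFIG_64BIT
#define __PG_WAITERS (1u << 9)   /* illustrative bit number only */
#else
#define __PG_WAITERS 0           /* no spare page flag on 32-bit */
#endif

int main(void)
{
        if (!__PG_WAITERS) {
                /* Constant-true on 32-bit: the branch below vanishes. */
                puts("32-bit build: plain waitqueue_active fast path");
                return 0;
        }
        puts("64-bit build: PG_waiters protocol, wqh->lock taken");
        return 0;
}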
Re: [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v7

On Thu, May 22, 2014 at 11:40:51AM +0100, Mel Gorman wrote:
> +void __wake_up_page_bit(wait_queue_head_t *wqh, struct page *page, void *word, int bit)
> +{
> +        struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
> +        unsigned long flags;
> +
> +        /* If there is no PG_waiters bit, always take the slow path */

That comment is misleading, this is actually a fast path for
!PG_waiters.

> +        if (!__PG_WAITERS && waitqueue_active(wq)) {
> +                __wake_up(wq, TASK_NORMAL, 1, &key);
> +                return;
> +        }
> +
> +        /*
> +         * Unlike __wake_up_bit it is necessary to check waitqueue_active to be
> +         * checked under the wqh->lock to avoid races with parallel additions
> +         * to the waitqueue. Otherwise races could result in lost wakeups
> +         */
> +        spin_lock_irqsave(&wqh->lock, flags);
> +        if (waitqueue_active(wqh))
> +                __wake_up_common(wqh, TASK_NORMAL, 1, 0, &key);
> +        else
> +                ClearPageWaiters(page);
> +        spin_unlock_irqrestore(&wqh->lock, flags);
> +}

So I think you missed one Clear opportunity here that was in my original
proposal, possibly because you also frobbed PG_writeback in.

If you do:

        spin_lock_irqsave(&wqh->lock, flags);
        if (!waitqueue_active(wqh) ||
            !__wake_up_common(wqh, TASK_NORMAL, 1, 0, &key))
                ClearPageWaiters(page);
        spin_unlock_irqrestore(&wqh->lock, flags);

With the below change to __wake_up_common(), we'll also clear the bit
when there's no waiters of @page, even if there's waiters for another
page.

I suppose the one thing to say for the big open coded loop is that it's
much easier to read than this scattered stuff.

---
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 0ffa20ae657b..213c5bfe6b56 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -61,18 +61,23 @@ EXPORT_SYMBOL(remove_wait_queue);
  * started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
  * zero in this (rare) case, and we handle it by continuing to scan the queue.
  */
-static void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
+static bool __wake_up_common(wait_queue_head_t *q, unsigned int mode,
                        int nr_exclusive, int wake_flags, void *key)
 {
        wait_queue_t *curr, *next;
+       bool woke = false;
 
        list_for_each_entry_safe(curr, next, &q->task_list, task_list) {
                unsigned flags = curr->flags;
 
-               if (curr->func(curr, mode, wake_flags, key) &&
-                               (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
-                       break;
+               if (curr->func(curr, mode, wake_flags, key)) {
+                       woke = true;
+                       if ((flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
+                               break;
+               }
        }
+
+       return woke;
 }
 
 /**
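The modified loop can be exercised outside the kernel with an array of
fake waiters. Everything in the model below is invented for
illustration; the callbacks only mimic wake_bit_function() returning 0
on a key mismatch.

/* wake_common_model.c: standalone model of the modified
 * __wake_up_common(): walk the waiters, record whether any callback
 * reported a wakeup, stop after nr_exclusive exclusive wakeups. */
#include <stdbool.h>
#include <stdio.h>

#define WQ_FLAG_EXCLUSIVE 0x01

struct waiter {
        unsigned flags;
        bool (*func)(struct waiter *w);   /* stand-in for curr->func */
};

static bool wake_common_model(struct waiter *q, int n, int nr_exclusive)
{
        bool woke = false;
        int i;

        for (i = 0; i < n; i++) {
                if (q[i].func(&q[i])) {
                        woke = true;
                        if ((q[i].flags & WQ_FLAG_EXCLUSIVE) &&
                            !--nr_exclusive)
                                break;
                }
        }
        return woke;
}

/* Hashqueue neighbour for a different page: key mismatch, returns 0. */
static bool wake_other_page(struct waiter *w) { (void)w; return false; }
static bool wake_this_page(struct waiter *w)  { (void)w; return true;  }

int main(void)
{
        struct waiter q[] = {
                { WQ_FLAG_EXCLUSIVE, wake_other_page },
                { WQ_FLAG_EXCLUSIVE, wake_this_page },
        };

        /* nr_exclusive == 1: the mismatching waiter does not consume it */
        printf("woke=%d\n", wake_common_model(q, 2, 1));
        return 0;
}

With the accounting keyed to the callback's return value, a hashqueue
neighbour that fails the key test neither consumes the exclusive slot
nor sets woke, which is what would let the caller attempt the
ClearPageWaiters.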
[PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v7

Changelog since v6
o Optimisation when PG_waiters is not available (peterz)
o Documentation

Changelog since v5
o __always_inline where appropriate (peterz)
o Documentation (akpm)

Changelog since v4
o Remove dependency on io_schedule_timeout
o Push waiting logic down into waitqueue

From: Nick Piggin <npig...@suse.de>

This patch introduces a new page flag for 64-bit capable machines,
PG_waiters, to signal there are *potentially* processes waiting on
PG_lock or PG_writeback. If there are no possible waiters then we avoid
barriers, a waitqueue hash lookup and a failed wake_up in the unlock_page
and end_page_writeback paths.

There is no guarantee that waiters exist if PG_waiters is set as multiple
pages can hash to the same waitqueue and we cannot accurately detect if a
waking process is the last waiter without a reference count. When this
happens, the bit is left set and a future unlock or writeback completion
will lookup the waitqueue and clear the bit when there are no collisions.
This adds a few branches to the fast path but avoids bouncing a dirty
cache line between CPUs. 32-bit machines always take the slow path but
the primary motivation for this patch is large machines so I do not think
that is a concern.

The test case used to evaluate this is a simple dd of a large file done
multiple times with the file deleted on each iteration. The size of the
file is 1/10th physical memory to avoid dirty page balancing. After each
dd there is a sync so the reported times do not vary much. By measuring
the time it takes to do the async dd, the impact of page_waitqueue
overhead for async IO is highlighted. The test machine was single socket
and UMA to avoid any scheduling or NUMA artifacts. The performance
results are reported based on a run with no profiling. Profile data is
based on a separate run with oprofile running.

async dd                        3.15.0-rc5            3.15.0-rc5
                                     mmotm              lockpage
btrfs Max    ddtime      0.5863 (  0.00%)      0.5621 (  4.14%)
ext3  Max    ddtime      1.4870 (  0.00%)      1.4609 (  1.76%)
ext4  Max    ddtime      1.0440 (  0.00%)      1.0376 (  0.61%)
tmpfs Max    ddtime      0.3541 (  0.00%)      0.3486 (  1.54%)
xfs   Max    ddtime      0.4995 (  0.00%)      0.4834 (  3.21%)

A separate run with profiles showed this

      samples  percentage
ext3   225851      2.3180  vmlinux-3.15.0-rc5-mmotm     test_clear_page_writeback
ext3   106848      1.0966  vmlinux-3.15.0-rc5-mmotm     __wake_up_bit
ext3    71849      0.7374  vmlinux-3.15.0-rc5-mmotm     page_waitqueue
ext3    40319      0.4138  vmlinux-3.15.0-rc5-mmotm     unlock_page
ext3    26243      0.2693  vmlinux-3.15.0-rc5-mmotm     end_page_writeback
ext3   178777      1.7774  vmlinux-3.15.0-rc5-lockpage  test_clear_page_writeback
ext3    67702      0.6731  vmlinux-3.15.0-rc5-lockpage  unlock_page
ext3    22357      0.2223  vmlinux-3.15.0-rc5-lockpage  end_page_writeback
ext3    11131      0.1107  vmlinux-3.15.0-rc5-lockpage  __wake_up_bit
ext3     6360      0.0632  vmlinux-3.15.0-rc5-lockpage  __wake_up_page_bit
ext3     1660      0.0165  vmlinux-3.15.0-rc5-lockpage  page_waitqueue

The profiles show a clear reduction in waitqueue and wakeup functions.
Note that end_page_writeback costs about the same: the savings there are
due to reduced calls to __wake_up_bit and page_waitqueue, so there is no
obvious direct saving. The cost of unlock_page is higher as it is
checking PageWaiters, but it is offset by reduced numbers of calls to
page_waitqueue and __wake_up_bit. There is a similar story told for each
of the filesystems.

Note that for workloads that contend heavily on the page lock,
unlock_page may increase in cost as it has to clear PG_waiters, so while
the typical case should be much faster, the worst case costs are now
higher.

This is also reflected in the time taken to mmap a range of pages.
These are the results for xfs only but the other filesystems tell a
similar story.

                      3.15.0-rc5            3.15.0-rc5
                           mmotm              lockpage
Procs 107M      423. (  0.00%)      409. (  3.31%)
Procs 214M      847. (  0.00%)      823. (  2.83%)
Procs 322M     1296. (  0.00%)     1232. (  4.94%)
Procs 429M     1692. (  0.00%)     1644. (  2.84%)
Procs 536M     2137. (  0.00%)     2057. (  3.74%)
Procs 644M     2542. (  0.00%)     2472. (  2.75%)
Procs 751M     2953. (  0.00%)     2872. (  2.74%)
Procs 859M     3360. (  0.00%)     3310. (  1.49%)
Procs 966M     3770. (  0.00%)     3724. (  1.22%)
Procs 1073M    4220. (  0.00%)     4114. (  2.51%)
Procs 1181M    4638. (  0.00%)     4546. (  1.98%)
Procs 1288M    5038. (  0.00%)     4940. (  1.95%)
Procs 1395M    5481. (  0.00%)     5431. (  0.91%)
Procs 1503M    5940. (  0.00%)     5832. (  1.82%)
Procs 1610M    6316. (  0.00%)     6204. (  1.77%)
Procs 1717M    6749. (  0.00%)     6799. ( -0.74%)
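The mechanism described above reduces to one extra bit test in the
unlock path. Below is a userspace analogue using C11 atomics; the flag
names and helper only loosely mirror the patch and are not its actual
code.

/* pg_waiters_fastpath.c: only pay for the waitqueue hash lookup and
 * wakeup when a waiter has flagged interest.  All names illustrative. */
#include <stdatomic.h>
#include <stdio.h>

#define PG_LOCKED  (1u << 0)
#define PG_WAITERS (1u << 1)

static _Atomic unsigned int page_flags = PG_LOCKED;
static int wakeups_attempted;

/* Stand-in for the page_waitqueue() lookup plus the wake_up call. */
static void wake_up_page(void)
{
        wakeups_attempted++;
}

static void unlock_page_analogue(void)
{
        /* Clear the lock bit and look at the flags as they were. */
        unsigned int old = atomic_fetch_and(&page_flags, ~PG_LOCKED);

        /* The fastpath: no PG_WAITERS, no waitqueue work at all. */
        if (old & PG_WAITERS)
                wake_up_page();
}

int main(void)
{
        unlock_page_analogue();
        printf("wakeup attempts with no waiters: %d\n", wakeups_attempted);

        atomic_fetch_or(&page_flags, PG_LOCKED | PG_WAITERS);
        unlock_page_analogue();
        printf("wakeup attempts after a waiter flagged interest: %d\n",
               wakeups_attempted);
        return 0;
}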