Re: [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v7

2014-05-22 Thread Mel Gorman
On Thu, May 22, 2014 at 05:04:51PM +0200, Peter Zijlstra wrote:
> On Thu, May 22, 2014 at 03:40:45PM +0100, Mel Gorman wrote:
> 
> > > +static bool __wake_up_common(wait_queue_head_t *q, unsigned int mode,
> > >   int nr_exclusive, int wake_flags, void *key)
> > >  {
> > >   wait_queue_t *curr, *next;
> > > + bool woke = false;
> > >  
> > >   list_for_each_entry_safe(curr, next, &q->task_list, task_list) {
> > >   unsigned flags = curr->flags;
> > >  
> > > + if (curr->func(curr, mode, wake_flags, key)) {
> > > + woke = true;
> > > + if ((flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
> > > + break;
> > > + }
> > >   }
> > > +
> > > + return woke;
> > 
> > Ok, thinking about this more I'm less sure.
> > 
> > There are cases where the curr->func returns false even though there is a
> > task that needs to run -- task was already running or preparing to run. We
> > potentially end up clearing PG_waiters while there are still tasks on the
> > waitqueue. As __finish_wait checks if the waitqueue is empty and the last
> > waiter clears the bit I think there is nothing to gain by trying to do the
> > same job in __wake_up_page_bit.
> 
> Hmm, I think you're right, we need the test result from
> wake_bit_function(), unpolluted by the ttwu return value.

Which would be a bit too special cased and not a clear win. I at least
added a comment to explain what is going on here.

/*
 * Unlike __wake_up_bit it is necessary to check waitqueue_active
 * under the wqh->lock to avoid races with parallel additions that
 * could result in lost wakeups.
 */
spin_lock_irqsave(&wqh->lock, flags);
if (waitqueue_active(wqh)) {
/*
 * Try waking a task on the queue. Responsibility for clearing
 * the PG_waiters bit is left to the last waiter on the
 * waitqueue as PageWaiters is called outside wqh->lock and
 * we cannot miss wakeups. Due to hashqueue collisions, there
 * may be colliding pages that still have PG_waiters set but
 * the impact means there will be at least one unnecessary
 * lookup of the page waitqueue on the next unlock_page or
 * end of writeback.
 */
__wake_up_common(wqh, TASK_NORMAL, 1, 0, &key);
} else {
/* No potential waiters, safe to clear PG_waiters */
ClearPageWaiters(page);
}
spin_unlock_irqrestore(&wqh->lock, flags);
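
As a rough illustration of the waiter-side rule mentioned in the comment
above, here is a minimal sketch (the helper name finish_wait_page() is
invented for illustration; the patch's real waiter path is not quoted in
this thread): the last waiter to leave the hashed waitqueue clears
PG_waiters under wqh->lock, so the wakeup side never has to infer it from
wakeup-function return values.

static void finish_wait_page(wait_queue_head_t *wqh, wait_queue_t *wait,
                             struct page *page)
{
        unsigned long flags;

        spin_lock_irqsave(&wqh->lock, flags);
        /* Dequeue ourselves if the waker did not already do it */
        if (!list_empty(&wait->task_list))
                list_del_init(&wait->task_list);
        /* Nobody left on this hashed queue, so the hint bit can go */
        if (!waitqueue_active(wqh))
                ClearPageWaiters(page);
        spin_unlock_irqrestore(&wqh->lock, flags);
}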

-- 
Mel Gorman
SUSE Labs


Re: [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v7

2014-05-22 Thread Peter Zijlstra
On Thu, May 22, 2014 at 03:40:45PM +0100, Mel Gorman wrote:

> > +static bool __wake_up_common(wait_queue_head_t *q, unsigned int mode,
> > int nr_exclusive, int wake_flags, void *key)
> >  {
> > wait_queue_t *curr, *next;
> > +   bool woke = false;
> >  
> > list_for_each_entry_safe(curr, next, &q->task_list, task_list) {
> > unsigned flags = curr->flags;
> >  
> > +   if (curr->func(curr, mode, wake_flags, key)) {
> > +   woke = true;
> > +   if ((flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
> > +   break;
> > +   }
> > }
> > +
> > +   return woke;
> 
> Ok, thinking about this more I'm less sure.
> 
> There are cases where the curr->func returns false even though there is a
> task that needs to run -- task was already running or preparing to run. We
> potentially end up clearing PG_waiters while there are still tasks on the
> waitqueue. As __finish_wait checks if the waitqueue is empty and the last
> waiter clears the bit I think there is nothing to gain by trying to do the
> same job in __wake_up_page_bit.

Hmm, I think you're right, we need the test result from
wake_bit_function(), unpolluted by the ttwu return value.
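
For reference, wake_bit_function() of that vintage was roughly the
following; its non-zero return means "the key matched this waiter and
try_to_wake_up() claimed the task", which is exactly the conflation being
discussed here:

int wake_bit_function(wait_queue_t *wait, unsigned mode, int sync, void *arg)
{
        struct wait_bit_key *key = arg;
        struct wait_bit_queue *wait_bit
                = container_of(wait, struct wait_bit_queue, wait);

        if (wait_bit->key.flags != key->flags ||
                        wait_bit->key.bit_nr != key->bit_nr ||
                        test_bit(key->bit_nr, key->flags))
                return 0;       /* different page/bit, or bit still set */
        else
                return autoremove_wake_function(wait, mode, sync, arg);
}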




Re: [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v7

2014-05-22 Thread Mel Gorman
On Thu, May 22, 2014 at 12:56:38PM +0200, Peter Zijlstra wrote:
> On Thu, May 22, 2014 at 11:40:51AM +0100, Mel Gorman wrote:
> > +void __wake_up_page_bit(wait_queue_head_t *wqh, struct page *page, void 
> > *word, int bit)
> > +{
> > +   struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
> > +   unsigned long flags;
> > +
> > +   /* If there is no PG_waiters bit, always take the slow path */
> 
> That comment is misleading, this is actually a fast path for
> !PG_waiters.
> 
> > +   if (!__PG_WAITERS && waitqueue_active(wq)) {
> > +   __wake_up(wq, TASK_NORMAL, 1, &key);
> > +   return;
> > +   }
> > +
> > +   /*
> > +* Unlike __wake_up_bit it is necessary to check waitqueue_active to be
> > +* checked under the wqh->lock to avoid races with parallel additions
> > +* to the waitqueue. Otherwise races could result in lost wakeups
> > +*/
> > +   spin_lock_irqsave(&wqh->lock, flags);
> > +   if (waitqueue_active(wqh))
> > +   __wake_up_common(wqh, TASK_NORMAL, 1, 0, &key);
> > +   else
> > +   ClearPageWaiters(page);
> > +   spin_unlock_irqrestore(&wqh->lock, flags);
> > +}
> 
> So I think you missed one Clear opportunity here that was in my original
> proposal, possibly because you also frobbed PG_writeback in.
> 
> If you do:
> 
>   spin_lock_irqsave(&wqh->lock, flags);
>   if (!waitqueue_active(wqh) || !__wake_up_common(wqh, TASK_NORMAL, 1, 0, &key))
>   ClearPageWaiters(page);
>   spin_unlock_irqrestore(&wqh->lock, flags);
> 
> With the below change to __wake_up_common(), we'll also clear the bit
> when there's no waiters of @page, even if there's waiters for another
> page.
> 
> I suppose the one thing to say for the big open coded loop is that its
> much easier to read than this scattered stuff.
> 
> ---
> diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
> index 0ffa20ae657b..213c5bfe6b56 100644
> --- a/kernel/sched/wait.c
> +++ b/kernel/sched/wait.c
> @@ -61,18 +61,23 @@ EXPORT_SYMBOL(remove_wait_queue);
>   * started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
>   * zero in this (rare) case, and we handle it by continuing to scan the 
> queue.
>   */
> -static void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
> +static bool __wake_up_common(wait_queue_head_t *q, unsigned int mode,
>   int nr_exclusive, int wake_flags, void *key)
>  {
>   wait_queue_t *curr, *next;
> + bool woke = false;
>  
>   list_for_each_entry_safe(curr, next, &q->task_list, task_list) {
>   unsigned flags = curr->flags;
>  
> - if (curr->func(curr, mode, wake_flags, key) &&
> - (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
> - break;
> + if (curr->func(curr, mode, wake_flags, key)) {
> + woke = true;
> + if ((flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
> + break;
> + }
>   }
> +
> + return woke;

Ok, thinking about this more I'm less sure.

There are cases where the curr->func returns false even though there is a
task that needs to run -- task was already running or preparing to run. We
potentially end up clearing PG_waiters while there are still tasks on the
waitqueue. As __finish_wait checks if the waitqueue is empty and the last
waiter clears the bit I think there is nothing to gain by trying to do the
same job in __wake_up_page_bit.
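
For reference, the wakeup function here bottoms out in
autoremove_wake_function(), roughly the following in kernels of that
vintage; its return value is simply what try_to_wake_up() returned, so a
task that was already runnable reports 0 and stays on the waitqueue:

int autoremove_wake_function(wait_queue_t *wait, unsigned mode, int sync,
                             void *key)
{
        int ret = default_wake_function(wait, mode, sync, key);

        /* Only dequeue the waiter if the wakeup actually "took" */
        if (ret)
                list_del_init(&wait->task_list);
        return ret;
}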

-- 
Mel Gorman
SUSE Labs


Re: [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v7

2014-05-22 Thread Mel Gorman
On Thu, May 22, 2014 at 12:56:38PM +0200, Peter Zijlstra wrote:
> On Thu, May 22, 2014 at 11:40:51AM +0100, Mel Gorman wrote:
> > +void __wake_up_page_bit(wait_queue_head_t *wqh, struct page *page, void 
> > *word, int bit)
> > +{
> > +   struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
> > +   unsigned long flags;
> > +
> > +   /* If there is no PG_waiters bit, always take the slow path */
> 
> That comment is misleading, this is actually a fast path for
> !PG_waiters.
> 

And could have been far better anyway now that you called me on it.

> > +   if (!__PG_WAITERS && waitqueue_active(wq)) {
> > +   __wake_up(wq, TASK_NORMAL, 1, &key);
> > +   return;
> > +   }
> > +
> > +   /*
> > +* Unlike __wake_up_bit it is necessary to check waitqueue_active to be
> > +* checked under the wqh->lock to avoid races with parallel additions
> > +* to the waitqueue. Otherwise races could result in lost wakeups
> > +*/
> > +   spin_lock_irqsave(&wqh->lock, flags);
> > +   if (waitqueue_active(wqh))
> > +   __wake_up_common(wqh, TASK_NORMAL, 1, 0, &key);
> > +   else
> > +   ClearPageWaiters(page);
> > +   spin_unlock_irqrestore(&wqh->lock, flags);
> > +}
> 
> So I think you missed one Clear opportunity here that was in my original
> proposal, possibly because you also frobbed PG_writeback in.
> 

It got lost in the midst of all the other modifications to make this as
"obvious" as possible.

> <SNIP>
> I suppose the one thing to say for the big open coded loop is that its
> much easier to read than this scattered stuff.
> 

Sure, but the end result of open coding this is duplicated code that will
be harder to maintain overall. I could split __wake_up_bit and use that
in both but I do not think it would make the code any clearer for the sake
of two lines.  Untested but this on top?

diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 73cb8c6..d3a8c34 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -60,19 +60,26 @@ EXPORT_SYMBOL(remove_wait_queue);
  * There are circumstances in which we can try to wake a task which has already
  * started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
  * zero in this (rare) case, and we handle it by continuing to scan the queue.
+ *
+ * Returns true if a process was woken up
  */
-static void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
+static bool __wake_up_common(wait_queue_head_t *q, unsigned int mode,
int nr_exclusive, int wake_flags, void *key)
 {
wait_queue_t *curr, *next;
+   bool woke = false;
 
list_for_each_entry_safe(curr, next, &q->task_list, task_list) {
unsigned flags = curr->flags;
 
-   if (curr->func(curr, mode, wake_flags, key) &&
-   (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
-   break;
+   if (curr->func(curr, mode, wake_flags, key)) {
+   woke = true;
+   if ((flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
+   break;
+   }
}
+
+   return woke;
 }
 
 /**
@@ -441,9 +448,13 @@ void __wake_up_page_bit(wait_queue_head_t *wqh, struct 
page *page, void *word, i
struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
unsigned long flags;
 
-   /* If there is no PG_waiters bit, always take the slow path */
-   if (!__PG_WAITERS && waitqueue_active(wq)) {
-   __wake_up(wq, TASK_NORMAL, 1, &key);
+   /*
+* If there is no PG_waiters bit (32-bit), then waitqueue_active can be
+* checked without wqh->lock as there is no PG_waiters race to protect.
+*/
+   if (!__PG_WAITERS) {
+   if (waitqueue_active(wqh))
+   __wake_up(wqh, TASK_NORMAL, 1, &key);
return;
}
 
@@ -453,9 +464,8 @@ void __wake_up_page_bit(wait_queue_head_t *wqh, struct page 
*page, void *word, i
 * to the waitqueue. Otherwise races could result in lost wakeups
 */
spin_lock_irqsave(&wqh->lock, flags);
-   if (waitqueue_active(wqh))
-   __wake_up_common(wqh, TASK_NORMAL, 1, 0, &key);
-   else
+   if (!waitqueue_active(wqh) ||
+   !__wake_up_common(wqh, TASK_NORMAL, 1, 0, &key))
ClearPageWaiters(page);
spin_unlock_irqrestore(&wqh->lock, flags);
 }



Re: [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v7

2014-05-22 Thread Peter Zijlstra
On Thu, May 22, 2014 at 11:40:51AM +0100, Mel Gorman wrote:
> +void __wake_up_page_bit(wait_queue_head_t *wqh, struct page *page, void 
> *word, int bit)
> +{
> + struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
> + unsigned long flags;
> +
> + /* If there is no PG_waiters bit, always take the slow path */

That comment is misleading, this is actually a fast path for
!PG_waiters.

> + if (!__PG_WAITERS && waitqueue_active(wq)) {
> + __wake_up(wq, TASK_NORMAL, 1, &key);
> + return;
> + }
> +
> + /*
> +  * Unlike __wake_up_bit it is necessary to check waitqueue_active to be
> +  * checked under the wqh->lock to avoid races with parallel additions
> +  * to the waitqueue. Otherwise races could result in lost wakeups
> +  */
> + spin_lock_irqsave(&wqh->lock, flags);
> + if (waitqueue_active(wqh))
> + __wake_up_common(wqh, TASK_NORMAL, 1, 0, &key);
> + else
> + ClearPageWaiters(page);
> + spin_unlock_irqrestore(&wqh->lock, flags);
> +}

So I think you missed one Clear opportunity here that was in my original
proposal, possibly because you also frobbed PG_writeback in.

If you do:

spin_lock_irqsave(&wqh->lock, flags);
if (!waitqueue_active(wqh) || !__wake_up_common(wqh, TASK_NORMAL, 1, 0, &key))
ClearPageWaiters(page);
spin_unlock_irqrestore(&wqh->lock, flags);

With the below change to __wake_up_common(), we'll also clear the bit
when there's no waiters of @page, even if there's waiters for another
page.
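
For context, the cross-page waiters come from the hashed per-zone wait
tables; page_waitqueue() at the time was roughly the following, so
unrelated pages can share one wait_queue_head:

static wait_queue_head_t *page_waitqueue(struct page *page)
{
        const struct zone *zone = page_zone(page);

        /* Many pages hash onto a small per-zone table of waitqueues */
        return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
}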

I suppose the one thing to say for the big open coded loop is that its
much easier to read than this scattered stuff.

---
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 0ffa20ae657b..213c5bfe6b56 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -61,18 +61,23 @@ EXPORT_SYMBOL(remove_wait_queue);
  * started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
  * zero in this (rare) case, and we handle it by continuing to scan the queue.
  */
-static void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
+static bool __wake_up_common(wait_queue_head_t *q, unsigned int mode,
int nr_exclusive, int wake_flags, void *key)
 {
wait_queue_t *curr, *next;
+   bool woke = false;
 
list_for_each_entry_safe(curr, next, &q->task_list, task_list) {
unsigned flags = curr->flags;
 
-   if (curr->func(curr, mode, wake_flags, key) &&
-   (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
-   break;
+   if (curr->func(curr, mode, wake_flags, key)) {
+   woke = true;
+   if ((flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
+   break;
+   }
}
+
+   return woke;
 }
 
 /**






[PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v7

2014-05-22 Thread Mel Gorman
Changelog since v6
o Optimisation when PG_waiters is not available (peterz)
o Documentation

Changelog since v5
o __always_inline where appropriate (peterz)
o Documentation (akpm)

Changelog since v4
o Remove dependency on io_schedule_timeout
o Push waiting logic down into waitqueue

From: Nick Piggin 

This patch introduces a new page flag for 64-bit capable machines,
PG_waiters, to signal there are *potentially* processes waiting on
PG_lock or PG_writeback.  If there are no possible waiters then we avoid
barriers, a waitqueue hash lookup and a failed wake_up in the unlock_page
and end_page_writeback paths. There is no guarantee that waiters exist if
PG_waiters is set as multiple pages can hash to the same waitqueue and we
cannot accurately detect if a waking process is the last waiter without
a reference count. When this happens, the bit is left set and a future
unlock or writeback completion will lookup the waitqueue and clear the
bit when there are no collisions. This adds a few branches to the fast
path but avoids bouncing a dirty cache line between CPUs. 32-bit machines
always take the slow path but the primary motivation for this patch is
large machines so I do not think that is a concern.
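
As a rough sketch of the shape this gives the unlock fastpath (illustrative
only; the full patch is not reproduced in this excerpt and it takes more
care with the ordering between clearing PG_locked and testing PG_waiters):

void unlock_page(struct page *page)
{
        VM_BUG_ON_PAGE(!PageLocked(page), page);
        clear_bit_unlock(PG_locked, &page->flags);

        /*
         * Fast path: no possible waiters, so skip the barrier, the
         * page_waitqueue() hash lookup and the failed wake_up entirely.
         */
        if (!PageWaiters(page))
                return;

        smp_mb__after_clear_bit();
        __wake_up_page_bit(page_waitqueue(page), page, &page->flags, PG_locked);
}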

The test case used to evaluate this is a simple dd of a large file done
multiple times with the file deleted on each iteration. The size of the
file is 1/10th of physical memory to avoid dirty page balancing. After each
dd there is a sync so the reported times do not vary much. By measuring
the time the async dd itself takes, the impact of page_waitqueue overhead
on async IO is highlighted.

The test machine was single socket and UMA to avoid any scheduling or
NUMA artifacts. The performance results are reported based on a run with
no profiling.  Profile data is based on a separate run with oprofile running.

async dd
                   3.15.0-rc5        3.15.0-rc5
                        mmotm          lockpage
btrfs Max  ddtime  0.5863 (  0.00%)  0.5621 (  4.14%)
ext3  Max  ddtime  1.4870 (  0.00%)  1.4609 (  1.76%)
ext4  Max  ddtime  1.0440 (  0.00%)  1.0376 (  0.61%)
tmpfs Max  ddtime  0.3541 (  0.00%)  0.3486 (  1.54%)
xfs   Max  ddtime  0.4995 (  0.00%)  0.4834 (  3.21%)

A separate run with profiles showed this

 samples percentage
ext3  2258512.3180  vmlinux-3.15.0-rc5-mmotm   test_clear_page_writeback
ext3  1068481.0966  vmlinux-3.15.0-rc5-mmotm   __wake_up_bit
ext3   718490.7374  vmlinux-3.15.0-rc5-mmotm   page_waitqueue
ext3   403190.4138  vmlinux-3.15.0-rc5-mmotm   unlock_page
ext3   262430.2693  vmlinux-3.15.0-rc5-mmotm   end_page_writeback
ext3  1787771.7774  vmlinux-3.15.0-rc5-lockpage test_clear_page_writeback
ext3   677020.6731  vmlinux-3.15.0-rc5-lockpage unlock_page
ext3   223570.2223  vmlinux-3.15.0-rc5-lockpage end_page_writeback
ext3   111310.1107  vmlinux-3.15.0-rc5-lockpage __wake_up_bit
ext363600.0632  vmlinux-3.15.0-rc5-lockpage __wake_up_page_bit
ext316600.0165  vmlinux-3.15.0-rc5-lockpage page_waitqueue

The profiles show a clear reduction in waitqueue and wakeup functions. Note
that end_page_writeback costs about the same; the savings there are due
to reduced calls to __wake_up_bit and page_waitqueue, so there is no
obvious direct saving in that function itself. The cost of unlock_page is
higher as it now checks PageWaiters, but that is offset by reduced numbers
of calls to page_waitqueue and __wake_up_bit. A similar story is told for
each of the filesystems. Note that for workloads that contend heavily on
the page lock, unlock_page may increase in cost as it has to clear
PG_waiters, so while the typical case should be much faster, the worst-case
costs are now higher.

This is also reflected in the time taken to mmap a range of pages.
These are the results for xfs only but the other filesystems tell a
similar story.

                   3.15.0-rc5        3.15.0-rc5
                        mmotm          lockpage
Procs 107M      423. (  0.00%)      409. (  3.31%)
Procs 214M      847. (  0.00%)      823. (  2.83%)
Procs 322M     1296. (  0.00%)     1232. (  4.94%)
Procs 429M     1692. (  0.00%)     1644. (  2.84%)
Procs 536M     2137. (  0.00%)     2057. (  3.74%)
Procs 644M     2542. (  0.00%)     2472. (  2.75%)
Procs 751M     2953. (  0.00%)     2872. (  2.74%)
Procs 859M     3360. (  0.00%)     3310. (  1.49%)
Procs 966M     3770. (  0.00%)     3724. (  1.22%)
Procs 1073M    4220. (  0.00%)     4114. (  2.51%)
Procs 1181M    4638. (  0.00%)     4546. (  1.98%)
Procs 1288M    5038. (  0.00%)     4940. (  1.95%)
Procs 1395M    5481. (  0.00%)     5431. (  0.91%)
Procs 1503M    5940. (  0.00%)     5832. (  1.82%)
Procs 1610M    6316. (  0.00%)     6204. (  1.77%)
Procs 1717M    6749. (  0.00%)     6799. ( -0.74%)
