Re: [PATCH] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath v5

2014-05-22 Thread Mel Gorman
On Thu, May 22, 2014 at 10:47:22AM -0700, Andrew Morton wrote:
> On Thu, 22 May 2014 09:46:43 +0100 Mel Gorman  wrote:
> 
> > > > If I'm still on track here, what happens if we switch to wake-all so we
> > > > can avoid the dangling flag?  I doubt if there are many collisions on
> > > > that hash table?
> > > 
> > > Wake-all will be ugly and loose a herd of waiters, all racing to
> > > acquire, all but one of whom will lose the race. It also loses the
> > > fairness; it's currently a FIFO queue. Wake-all will allow starvation.
> > > 
> > 
> > And the cost of the thundering herd of waiters may offset any benefit of
> > reducing the number of calls to page_waitqueue and waker functions.
> 
> Well, none of this has been demonstrated.
> 

True, but it's also the type of thing that deserves a patch of its own,
with some separation, in case bisection points the finger at a patch that
is doing too much on its own.

> As I speculated earlier, hash chain collisions will probably be rare,

They are meant to be (well, they're documented to be). It's the primary
reason why I'm not concerned about "dangling waiters" being that common
a case.
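
For reference, the hashed lookup in question is page_waitqueue(); around this
time it mapped a page to a per-zone wait table, roughly (a sketch of that
era's mm/filemap.c, not verified against any particular tree):

    static wait_queue_head_t *page_waitqueue(struct page *page)
    {
            const struct zone *zone = page_zone(page);

            /* Many pages can hash to one queue head, but the table is sized
             * so that collisions are expected to be rare in practice. */
            return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
    }

A collision only matters here when two pages that hash together have waiters
at the same time, which is why a stale PG_waiters bit is expected to be
uncommon.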

> except for the case where a bunch of processes are waiting on the same
> page.  And in this case, perhaps wake-all is the desired behavior.
> 
> Take a look at do_read_cache_page().  It does lock_page(), but it
> doesn't actually *need* to.  It checks ->mapping and PG_uptodate and
> then...  unlocks the page!  We could have used wait_on_page_locked()
> there and permitted concurrent threads to run concurrently.
> 

It does that later when it calls wait_on_page_read, but the flow is weird. It
looks like the first lock_page was there to serialise against any IO and to
double-check it was not racing against a parallel reclaim, although the
elevated reference count should have prevented that. A historical artifact,
maybe? It looks like there could be some improvement there, but that too
would deserve a patch of its own.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath v5

2014-05-22 Thread Andrew Morton
On Thu, 22 May 2014 09:46:43 +0100 Mel Gorman  wrote:

> > > If I'm still on track here, what happens if we switch to wake-all so we
> > > can avoid the dangling flag?  I doubt if there are many collisions on
> > > that hash table?
> > 
> > Wake-all will be ugly and loose a herd of waiters, all racing to
> > acquire, all but one of whom will lose the race. It also loses the
> > fairness; it's currently a FIFO queue. Wake-all will allow starvation.
> > 
> 
> And the cost of the thundering herd of waiters may offset any benefit of
> reducing the number of calls to page_waitqueue and waker functions.

Well, none of this has been demonstrated.

As I speculated earlier, hash chain collisions will probably be rare,
except for the case where a bunch of processes are waiting on the same
page.  And in this case, perhaps wake-all is the desired behavior.

Take a look at do_read_cache_page().  It does lock_page(), but it
doesn't actually *need* to.  It checks ->mapping and PG_uptodate and
then...  unlocks the page!  We could have used wait_on_page_locked()
there and permitted concurrent threads to run concurrently.

btw, I'm struggling a bit to understand why we bother checking
->mapping there as we're about to unlock the page anyway...
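
A sketch of the suggested pattern, purely for illustration (the helper name
is hypothetical and this is not the existing do_read_cache_page()):

    /* Waiters that only need the read to finish can sleep on PG_locked
     * without serialising on the page lock the way lock_page() does. */
    static int wait_for_read_completion(struct page *page)
    {
            wait_on_page_locked(page);      /* returns once the filler unlocks */
            if (!PageUptodate(page))
                    return -EIO;            /* the read failed */
            return 0;
    }

Concurrent readers would then share a single wakeup instead of taking and
releasing the page lock in turn.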



Re: [PATCH] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath v5

2014-05-22 Thread Mel Gorman
On Thu, May 22, 2014 at 08:45:29AM +0200, Peter Zijlstra wrote:
> On Wed, May 21, 2014 at 02:50:00PM -0700, Andrew Morton wrote:
> > On Wed, 21 May 2014 23:33:54 +0200 Peter Zijlstra  
> > wrote:
> 
> > Alternative solution is not to merge the patch ;)
> 
> There is always that.. :-)
> 
> > > Yeah, so we only clear that bit when at 'unlock' we find there are no
> > > more pending waiters, so if the last unlock still had a waiter, we'll
> > > leave the bit set.
> > 
> > Confused.  If the last unlock had a waiter, that waiter will get woken
> > up so there are no waiters any more, so the last unlock clears the flag.
> > 
> > um, how do we determine that there are no more waiters?  By looking at
> > the waitqueue.  But that waitqueue is hashed, so it may contain waiters
> > for other pages so we're screwed?  But we could just go and wake up the
> > other-page waiters anyway and still clear PG_waiters?
> > 
> > um2, we're using exclusive waitqueues so we can't (or don't) wake all
> > waiters, so we're screwed again?
> 
> Ah, so leave it set. Then when we do an uncontended wakeup, that is a
> wakeup where there are _no_ waiters left, we'll iterate the entire
> hashed queue, looking for a matching page.
> 
> We'll find none, and only then clear the bit.
> 

Yes, sorry that was not clear.

> 
> > (This process is proving to be a hard way of writing Mel's changelog btw).
> 
> Agreed :/
> 

I've lost sight of what is obvious and what is not. The introduction
now reads

This patch introduces a new page flag for 64-bit capable machines,
PG_waiters, to signal there are *potentially* processes waiting on
PG_lock or PG_writeback.  If there are no possible waiters then we
avoid barriers, a waitqueue hash lookup and a failed wake_up in the
unlock_page and end_page_writeback paths. There is no guarantee
that waiters exist if PG_waiters is set as multiple pages can
hash to the same waitqueue and we cannot accurately detect if a
waking process is the last waiter without a reference count. When
this happens, the bit is left set and the next unlock or writeback
completion will lookup the waitqueue and clear the bit when there
are no collisions. This adds a few branches to the fast path but
avoids bouncing a dirty cache line between CPUs. 32-bit machines
always take the slow path but the primary motivation for this
patch is large machines so I do not think that is a concern.
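
A sketch of the fast path that paragraph describes (not the submitted patch;
the real version also has to get the memory ordering between clearing
PG_locked and testing PG_waiters right):

    void unlock_page(struct page *page)
    {
            VM_BUG_ON_PAGE(!PageLocked(page), page);
            clear_bit_unlock(PG_locked, &page->flags);

            /* Only pay for the waitqueue hash lookup and wakeup if a waiter
             * announced itself. A stale bit left behind by a hash collision
             * is cleared on the wakeup side once the queue is found empty. */
            if (PageWaiters(page))
                    __wake_up_page_bit(page_waitqueue(page), page,
                                       &page->flags, PG_locked);
    }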

> > If I'm still on track here, what happens if we switch to wake-all so we
> > can avoid the dangling flag?  I doubt if there are many collisions on
> > that hash table?
> 
> Wake-all will be ugly and loose a herd of waiters, all racing to
> acquire, all but one of whom will lose the race. It also loses the
> fairness; it's currently a FIFO queue. Wake-all will allow starvation.
> 

And the cost of the thundering herd of waiters may offset any benefit of
reducing the number of calls to page_waitqueue and waker functions.

> > If there *are* a lot of collisions, I bet it's because a great pile of
> > threads are all waiting on the same page.  If they're trying to lock
> > that page then wake-all is bad.  But if they're just waiting for IO
> > completion (probable) then it's OK.
> 
> Yeah, I'm not entirely sure on the rationale for adding PG_waiters to
> writeback completion, and yes PG_writeback is a wake-all.

tmpfs was the most obvious one. We were doing a useless lookup almost
every time writeback completed for async streaming writers. I suspected
it would also apply to normal filesystems if backed by fast storage.

There is not much to gain by continuing to use __wake_up_bit in the
writeback paths when PG_waiters is available. Only the first waiter
incurs the SetPageWaiters penalty. In the uncontended case, neither
approach takes locks (one checks waitqueue_active outside the lock, the
other checks PageWaiters). In the contended case, both end up taking
q->lock, either in __wake_up_bit->__wake_up or in __wake_up_page_bit.
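
For comparison, the stock __wake_up_bit() used on the existing writeback
completion path does its waitqueue_active() check without taking the lock,
roughly:

    void __wake_up_bit(wait_queue_head_t *wq, void *word, int bit)
    {
            struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);

            /* Unlocked check: cheap when nobody waits, but the caller still
             * had to do the page_waitqueue() hash lookup to find wq at all. */
            if (waitqueue_active(wq))
                    __wake_up(wq, TASK_NORMAL, 1, &key);
    }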

-- 
Mel Gorman
SUSE Labs


Re: [PATCH] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath v5

2014-05-22 Thread Peter Zijlstra
On Thu, May 22, 2014 at 01:07:15AM +0100, Mel Gorman wrote:

> +PAGEFLAG(Waiters, waiters) __CLEARPAGEFLAG(Waiters, waiters)
> + TESTCLEARFLAG(Waiters, waiters)
> +#define __PG_WAITERS (1 << PG_waiters)
> +#else
> +/* Always fallback to slow path on 32-bit */
> +static inline bool PageWaiters(struct page *page)
> +{
> + return true;
> +}
> +static inline void __ClearPageWaiters(struct page *page) {}
> +static inline void ClearPageWaiters(struct page *page) {}
> +static inline void SetPageWaiters(struct page *page) {}
> +#define __PG_WAITERS 0


> +void __wake_up_page_bit(wait_queue_head_t *wqh, struct page *page, void 
> *word, int bit)
> +{
> + struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
> + unsigned long flags;
> +
> + /*
> +  * Unlike __wake_up_bit, waitqueue_active must be checked under
> +  * wqh->lock to avoid races with parallel additions to the
> +  * waitqueue. Otherwise races could result in lost wakeups.
> +  */

Well, you could do something like:

if (!__PG_WAITERS && !waitqueue_active(wqh))
return;

Which at least for 32-bit restores some of the performance loss of this
patch (did you have 32-bit numbers in that massive changelog? I totally
tl;dr'd it).

> + spin_lock_irqsave(&wqh->lock, flags);
> + if (waitqueue_active(wqh))
> + __wake_up_common(wqh, TASK_NORMAL, 1, 0, &key);
> + else
> + ClearPageWaiters(page);
> + spin_unlock_irqrestore(&wqh->lock, flags);
> +}
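
Folding the suggested early return into the quoted function would give
roughly this shape (a sketch, not a tested revision of the patch):

    void __wake_up_page_bit(wait_queue_head_t *wqh, struct page *page,
                            void *word, int bit)
    {
            struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
            unsigned long flags;

            /* __PG_WAITERS is the compile-time constant 0 on 32-bit, so the
             * unlocked fast exit is kept there; on 64-bit the test compiles
             * away and PG_waiters gates the call to this function instead. */
            if (!__PG_WAITERS && !waitqueue_active(wqh))
                    return;

            spin_lock_irqsave(&wqh->lock, flags);
            if (waitqueue_active(wqh))
                    __wake_up_common(wqh, TASK_NORMAL, 1, 0, &key);
            else
                    ClearPageWaiters(page);
            spin_unlock_irqrestore(&wqh->lock, flags);
    }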




Re: [PATCH] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath v5

2014-05-22 Thread Peter Zijlstra
On Wed, May 21, 2014 at 02:50:00PM -0700, Andrew Morton wrote:
> On Wed, 21 May 2014 23:33:54 +0200 Peter Zijlstra  
> wrote:

> Alternative solution is not to merge the patch ;)

There is always that.. :-)

> > Yeah, so we only clear that bit when at 'unlock' we find there are no
> > more pending waiters, so if the last unlock still had a waiter, we'll
> > leave the bit set.
> 
> Confused.  If the last unlock had a waiter, that waiter will get woken
> up so there are no waiters any more, so the last unlock clears the flag.
> 
> um, how do we determine that there are no more waiters?  By looking at
> the waitqueue.  But that waitqueue is hashed, so it may contain waiters
> for other pages so we're screwed?  But we could just go and wake up the
> other-page waiters anyway and still clear PG_waiters?
> 
> um2, we're using exclusive waitqueues so we can't (or don't) wake all
> waiters, so we're screwed again?

Ah, so leave it set. Then when we do an uncontended wakeup, that is a
wakeup where there are _no_ waiters left, we'll iterate the entire
hashed queue, looking for a matching page.

We'll find none, and only then clear the bit.


> (This process is proving to be a hard way of writing Mel's changelog btw).

Agreed :/

> If I'm still on track here, what happens if we switch to wake-all so we
> can avoid the dangling flag?  I doubt if there are many collisions on
> that hash table?

Wake-all will be ugly and loose a herd of waiters, all racing to
acquire, all but one of whom will lose the race. It also loses the
fairness; it's currently a FIFO queue. Wake-all will allow starvation.

> If there *are* a lot of collisions, I bet it's because a great pile of
> threads are all waiting on the same page.  If they're trying to lock
> that page then wake-all is bad.  But if they're just waiting for IO
> completion (probable) then it's OK.

Yeah, I'm not entirely sure on the rationale for adding PG_waiters to
writeback completion, and yes PG_writeback is a wake-all.
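
For reference, the fairness and wake-all behaviour being discussed falls out
of the generic wakeup loop of that era (kernel/sched/wait.c, roughly):

    static void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
                            int nr_exclusive, int wake_flags, void *key)
    {
            wait_queue_t *curr, *next;

            list_for_each_entry_safe(curr, next, &q->task_list, task_list) {
                    unsigned flags = curr->flags;

                    /* Stop after nr_exclusive exclusive waiters have been
                     * woken; waiters without WQ_FLAG_EXCLUSIVE never end
                     * the loop. */
                    if (curr->func(curr, mode, wake_flags, key) &&
                        (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
                            break;
            }
    }

Lock waiters queue exclusively at the tail, which is what gives the FIFO
wake-one behaviour; writeback waiters are non-exclusive, so one call wakes
them all.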


Re: [PATCH] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath v5

2014-05-21 Thread Mel Gorman
On Wed, May 21, 2014 at 02:50:00PM -0700, Andrew Morton wrote:
> On Wed, 21 May 2014 23:33:54 +0200 Peter Zijlstra  
> wrote:
> 
> > On Wed, May 21, 2014 at 02:26:22PM -0700, Andrew Morton wrote:
> > > > +static inline void
> > > > +__prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait,
> > > > +   struct page *page, int state, bool exclusive)
> > > 
> > > Putting MM stuff into core waitqueue code is rather bad.  I really
> > > don't know how I'm going to explain this to my family.
> > 
> > Right, so we could avoid all that and make the functions in mm/filemap.c
> > rather large and opencode a bunch of wait.c stuff.
> > 
> 
> The world won't end if we do it Mel's way and it's probably the most
> efficient.  But ugh.  This stuff does raise the "it had better be a
> useful patch" bar.
> 
> > Which is pretty much what I initially pseudo proposed.
> 
> Alternative solution is not to merge the patch ;)
> 

While true, the overhead of the page_waitqueue lookups and unnecessary
wakeups sucks even on small machines. Not only does it hit us during simple
operations like dd to a file, but we would also hit it during page reclaim,
which is trylock_page/unlock_page intensive.
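
For context, the unpatched path does the hashed lookup and wakeup
unconditionally, even when nothing is waiting (mm/filemap.c of that era,
roughly):

    static void wake_up_page(struct page *page, int bit)
    {
            __wake_up_bit(page_waitqueue(page), &page->flags, bit);
    }

    void unlock_page(struct page *page)
    {
            VM_BUG_ON_PAGE(!PageLocked(page), page);
            clear_bit_unlock(PG_locked, &page->flags);
            smp_mb__after_clear_bit();
            wake_up_page(page, PG_locked); /* hash lookup on every unlock */
    }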

> > > > +   __ClearPageWaiters(page);
> > > 
> > > We're freeing the page - if someone is still waiting on it then we have
> > > a huge bug?  It's the mysterious collision thing again I hope?
> > 
> > Yeah, so we only clear that bit when at 'unlock' we find there are no
> > more pending waiters, so if the last unlock still had a waiter, we'll
> > leave the bit set.
> 
> Confused.  If the last unlock had a waiter, that waiter will get woken
> up so there are no waiters any more, so the last unlock clears the flag.
> 
> um, how do we determine that there are no more waiters?  By looking at
> the waitqueue.  But that waitqueue is hashed, so it may contain waiters
> for other pages so we're screwed?  But we could just go and wake up the
> other-page waiters anyway and still clear PG_waiters?
> 
> um2, we're using exclusive waitqueues so we can't (or don't) wake all
> waiters, so we're screwed again?
> 
> (This process is proving to be a hard way of writing Mel's changelog btw).
> 
> If I'm still on track here, what happens if we switch to wake-all so we
> can avoid the dangling flag?  I doubt if there are many collisions on
> that hash table?
> 
> If there *are* a lot of collisions, I bet it's because a great pile of
> threads are all waiting on the same page.  If they're trying to lock
> that page then wake-all is bad.  But if they're just waiting for IO
> completion (probable) then it's OK.
> 
> I'll stop now.

Rather than putting details in the changelog, here is an updated version
that hopefully improves the commentary to the point where it's actually
clear. 

---8<---
From: Nick Piggin 
Subject: [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups 
in unlock_page fastpath v6

Changelog since v5
o __always_inline where appropriate (peterz)
o Documentation (akpm)

Changelog since v4
o Remove dependency on io_schedule_timeout
o Push waiting logic down into waitqueue

This patch introduces a new page flag for 64-bit capable machines,
PG_waiters, to signal there are processes waiting on PG_lock or PG_writeback
and uses it to avoid memory barriers and waitqueue hash lookup in the
unlock_page fastpath.

This adds a few branches to the fast path but avoids bouncing a dirty
cache line between CPUs. 32-bit machines always take the slow path but the
primary motivation for this patch is large machines so I do not think that
is a concern.

The test case used to evaluate this is a simple dd of a large file done
multiple times with the file deleted on each iteration. The size of
the file is 1/10th physical memory to avoid dirty page balancing. In the
async case it will be possible that the workload completes without even
hitting the disk and will have variable results but highlight the impact
of mark_page_accessed for async IO. The sync results are expected to be
more stable. The exception is tmpfs where the normal case is for the "IO"
to not hit the disk.

The test machine was single socket and UMA to avoid any scheduling or
NUMA artifacts. Throughput and wall times are presented for sync IO, only
wall times are shown for async as the granularity reported by dd and the
variability is unsuitable for comparison. As async results were variable
due to writeback timings, I'm only reporting the maximum figures. The sync
results were stable enough to make the mean and stddev uninteresting.

The performance results are reported based on a run with no profiling.
Profile data is based on a separate run with oprofile running. The
kernels being compared are "accessed-v2" which is the patch series up
to this patch whereas lockpage-v2 includes this patch.

async dd comparison of the 3.15.0-rc5 kernels described above (results table truncated)

Re: [PATCH] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath v5

2014-05-21 Thread Mel Gorman
On Wed, May 21, 2014 at 02:26:22PM -0700, Andrew Morton wrote:
> On Wed, 21 May 2014 13:15:01 +0100 Mel Gorman  wrote:
> 
> > Andrew had suggested dropping v4 of the patch entirely as the numbers were
> > marginal and the complexity was high. However, even on a relatively small
> > machine running simple workloads the overhead of page_waitqueue and wakeup
> > functions is around 5% of system CPU time. That's quite high for basic
> > operations so I felt it was worth another shot. The performance figures
> > are better with this version than they were for v4 and overall the patch
> > should be more comprehensible.
> > 
> > Changelog since v4
> > o Remove dependency on io_schedule_timeout
> > o Push waiting logic down into waitqueue
> > 
> > This patch introduces a new page flag for 64-bit capable machines,
> > PG_waiters, to signal there are processes waiting on PG_lock and uses it to
> > avoid memory barriers and waitqueue hash lookup in the unlock_page fastpath.
> > 
> > This adds a few branches to the fast path but avoids bouncing a dirty
> > cache line between CPUs. 32-bit machines always take the slow path but the
> > primary motivation for this patch is large machines so I do not think that
> > is a concern.
> > 
> > The test case used to evaluate this is a simple dd of a large file done
> > multiple times with the file deleted on each iteration. The size of
> > the file is 1/10th physical memory to avoid dirty page balancing. In the
> > async case it will be possible that the workload completes without even
> > hitting the disk and will have variable results but highlight the impact
> > of mark_page_accessed for async IO. The sync results are expected to be
> > more stable. The exception is tmpfs where the normal case is for the "IO"
> > to not hit the disk.
> > 
> > The test machine was single socket and UMA to avoid any scheduling or
> > NUMA artifacts. Throughput and wall times are presented for sync IO, only
> > wall times are shown for async as the granularity reported by dd and the
> > variability is unsuitable for comparison. As async results were variable
> > due to writeback timings, I'm only reporting the maximum figures. The sync
> > results were stable enough to make the mean and stddev uninteresting.
> > 
> > The performance results are reported based on a run with no profiling.
> > Profile data is based on a separate run with oprofile running. The
> > kernels being compared are "accessed-v2" which is the patch series up
> > to this patch whereas lockpage-v2 includes this patch.
> > 
> > ...
> >
> > --- a/include/linux/wait.h
> > +++ b/include/linux/wait.h
> > @@ -147,8 +147,13 @@ void __wake_up_sync_key(wait_queue_head_t *q, unsigned 
> > int mode, int nr, void *k
> >  void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr);
> >  void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr);
> >  void __wake_up_bit(wait_queue_head_t *, void *, int);
> > +void __wake_up_page_bit(wait_queue_head_t *, struct page *page, void *, 
> > int);
> 
> You're going to need to forward-declare struct page in wait.h.  The
> good thing about this is that less people will notice that we've gone
> and mentioned struct page in wait.h :(
> 

Will add the forward-declare.

> >  int __wait_on_bit(wait_queue_head_t *, struct wait_bit_queue *, int 
> > (*)(void *), unsigned);
> > +int __wait_on_page_bit(wait_queue_head_t *, struct wait_bit_queue *,
> > 
> > ...
> >
> > --- a/kernel/sched/wait.c
> > +++ b/kernel/sched/wait.c
> > @@ -167,31 +167,39 @@ EXPORT_SYMBOL_GPL(__wake_up_sync);/* For internal 
> > use only */
> >   * stops them from bleeding out - it would still allow subsequent
> >   * loads to move into the critical region).
> >   */
> > -void
> > -prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
> > +static inline void
> > +__prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait,
> > +   struct page *page, int state, bool exclusive)
> 
> Putting MM stuff into core waitqueue code is rather bad.  I really
> don't know how I'm going to explain this to my family.
> 

The alternative to updating wait.h and wait.c was open-coding the waitqueue
modifications in filemap.c, but that is just as ugly. The wait queue code
is complex and there was motivation to keep it in one place even if we
are special-casing struct page handling.

FWIW, I cannot explain anything I do in work to my family. It gets blank
looks no matter what.


> >  {
> > unsigned long flags;
> >  
> > -   wait->flags &= ~WQ_FLAG_EXCLUSIVE;
> > spin_lock_irqsave(&q->lock, flags);
> > -   if (list_empty(&wait->task_list))
> > -   __add_wait_queue(q, wait);
> > +   if (page && !PageWaiters(page))
> > +   SetPageWaiters(page);
> 
> And this isn't racy because we're assuming that all users of `page' are
> using the same waitqueue.  ie, assuming all callers use
> page_waitqueue()?   Subtle, unobvious, worth documenting.
> 

All users of the page will get the same waitqueue from page_waitqueue() so
it should not be racy.

Re: [PATCH] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath v5

2014-05-21 Thread Andrew Morton
On Wed, 21 May 2014 23:33:54 +0200 Peter Zijlstra  wrote:

> On Wed, May 21, 2014 at 02:26:22PM -0700, Andrew Morton wrote:
> > > +static inline void
> > > +__prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait,
> > > + struct page *page, int state, bool exclusive)
> > 
> > Putting MM stuff into core waitqueue code is rather bad.  I really
> > don't know how I'm going to explain this to my family.
> 
> Right, so we could avoid all that and make the functions in mm/filemap.c
> rather large and opencode a bunch of wait.c stuff.
> 

The world won't end if we do it Mel's way and it's probably the most
efficient.  But ugh.  This stuff does raise the "it had better be a
useful patch" bar.

> Which is pretty much what I initially pseudo proposed.

Alternative solution is not to merge the patch ;)

> > > + __ClearPageWaiters(page);
> > 
> > We're freeing the page - if someone is still waiting on it then we have
> > a huge bug?  It's the mysterious collision thing again I hope?
> 
> Yeah, so we only clear that bit when at 'unlock' we find there are no
> more pending waiters, so if the last unlock still had a waiter, we'll
> leave the bit set.

Confused.  If the last unlock had a waiter, that waiter will get woken
up so there are no waiters any more, so the last unlock clears the flag.

um, how do we determine that there are no more waiters?  By looking at
the waitqueue.  But that waitqueue is hashed, so it may contain waiters
for other pages so we're screwed?  But we could just go and wake up the
other-page waiters anyway and still clear PG_waiters?

um2, we're using exclusive waitqueues so we can't (or don't) wake all
waiters, so we're screwed again?

(This process is proving to be a hard way of writing Mel's changelog btw).

If I'm still on track here, what happens if we switch to wake-all so we
can avoid the dangling flag?  I doubt if there are many collisions on
that hash table?

If there *are* a lot of collisions, I bet it's because a great pile of
threads are all waiting on the same page.  If they're trying to lock
that page then wake-all is bad.  But if they're just waiting for IO
completion (probable) then it's OK.

I'll stop now.


Re: [PATCH] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath v5

2014-05-21 Thread Peter Zijlstra
On Wed, May 21, 2014 at 02:26:22PM -0700, Andrew Morton wrote:
> > +static inline void
> > +__prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait,
> > +   struct page *page, int state, bool exclusive)
> 
> Putting MM stuff into core waitqueue code is rather bad.  I really
> don't know how I'm going to explain this to my family.

Right, so we could avoid all that and make the functions in mm/filemap.c
rather large and opencode a bunch of wait.c stuff.

Which is pretty much what I initially pseudo proposed.

> > +   __ClearPageWaiters(page);
> 
> We're freeing the page - if someone is still waiting on it then we have
> a huge bug?  It's the mysterious collision thing again I hope?

Yeah, so we only clear that bit when at 'unlock' we find there are no
more pending waiters, so if the last unlock still had a waiter, we'll
leave the bit set. So it's entirely reasonable to still have it set when
we free a page etc.


Re: [PATCH] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath v5

2014-05-21 Thread Andrew Morton
On Wed, 21 May 2014 13:15:01 +0100 Mel Gorman  wrote:

> Andrew had suggested dropping v4 of the patch entirely as the numbers were
> marginal and the complexity was high. However, even on a relatively small
> machine running simple workloads the overhead of page_waitqueue and wakeup
> functions is around 5% of system CPU time. That's quite high for basic
> operations so I felt it was worth another shot. The performance figures
> are better with this version than they were for v4 and overall the patch
> should be more comprehensible.
> 
> Changelog since v4
> o Remove dependency on io_schedule_timeout
> o Push waiting logic down into waitqueue
> 
> This patch introduces a new page flag for 64-bit capable machines,
> PG_waiters, to signal there are processes waiting on PG_lock and uses it to
> avoid memory barriers and waitqueue hash lookup in the unlock_page fastpath.
> 
> This adds a few branches to the fast path but avoids bouncing a dirty
> cache line between CPUs. 32-bit machines always take the slow path but the
> primary motivation for this patch is large machines so I do not think that
> is a concern.
> 
> The test case used to evaluate this is a simple dd of a large file done
> multiple times with the file deleted on each iteration. The size of
> the file is 1/10th physical memory to avoid dirty page balancing. In the
> async case it will be possible that the workload completes without even
> hitting the disk and will have variable results but highlight the impact
> of mark_page_accessed for async IO. The sync results are expected to be
> more stable. The exception is tmpfs where the normal case is for the "IO"
> to not hit the disk.
> 
> The test machine was single socket and UMA to avoid any scheduling or
> NUMA artifacts. Throughput and wall times are presented for sync IO, only
> wall times are shown for async as the granularity reported by dd and the
> variability is unsuitable for comparison. As async results were variable
> due to writeback timings, I'm only reporting the maximum figures. The sync
> results were stable enough to make the mean and stddev uninteresting.
> 
> The performance results are reported based on a run with no profiling.
> Profile data is based on a separate run with oprofile running. The
> kernels being compared are "accessed-v2" which is the patch series up
> to this patch whereas lockpage-v2 includes this patch.
> 
> ...
>
> --- a/include/linux/wait.h
> +++ b/include/linux/wait.h
> @@ -147,8 +147,13 @@ void __wake_up_sync_key(wait_queue_head_t *q, unsigned 
> int mode, int nr, void *k
>  void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr);
>  void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr);
>  void __wake_up_bit(wait_queue_head_t *, void *, int);
> +void __wake_up_page_bit(wait_queue_head_t *, struct page *page, void *, int);

You're going to need to forward-declare struct page in wait.h.  The
good thing about this is that less people will notice that we've gone
and mentioned struct page in wait.h :(

>  int __wait_on_bit(wait_queue_head_t *, struct wait_bit_queue *, int (*)(void 
> *), unsigned);
> +int __wait_on_page_bit(wait_queue_head_t *, struct wait_bit_queue *,
> 
> ...
>
> --- a/kernel/sched/wait.c
> +++ b/kernel/sched/wait.c
> @@ -167,31 +167,39 @@ EXPORT_SYMBOL_GPL(__wake_up_sync);  /* For internal 
> use only */
>   * stops them from bleeding out - it would still allow subsequent
>   * loads to move into the critical region).
>   */
> -void
> -prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
> +static inline void
> +__prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait,
> + struct page *page, int state, bool exclusive)

Putting MM stuff into core waitqueue code is rather bad.  I really
don't know how I'm going to explain this to my family.

>  {
>   unsigned long flags;
>  
> - wait->flags &= ~WQ_FLAG_EXCLUSIVE;
>   spin_lock_irqsave(&q->lock, flags);
> - if (list_empty(&wait->task_list))
> - __add_wait_queue(q, wait);
> + if (page && !PageWaiters(page))
> + SetPageWaiters(page);

And this isn't racy because we're assuming that all users of `page' are
using the same waitqueue.  ie, assuming all callers use
page_waitqueue()?   Subtle, unobvious, worth documenting.

> + if (list_empty(&wait->task_list)) {
> + if (exclusive) {
> + wait->flags |= WQ_FLAG_EXCLUSIVE;
> + __add_wait_queue_tail(q, wait);
> + } else {
> + wait->flags &= ~WQ_FLAG_EXCLUSIVE;
> + __add_wait_queue(q, wait);
> + }
> + }
>   set_current_state(state);
>   spin_unlock_irqrestore(&q->lock, flags);
>  }
> 
> ...
>
> @@ -228,7 +236,8 @@ EXPORT_SYMBOL(prepare_to_wait_event);
>   * the wait descriptor from the given waitqueue if still
>   * queued.
>   */
> -void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
> +static inline void 

Re: [PATCH] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath v5

2014-05-21 Thread Peter Zijlstra
On Wed, May 21, 2014 at 04:33:57PM +0100, Mel Gorman wrote:
> > > +__prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait,
> > > + struct page *page, int state, bool exclusive)
> > >  {
> > >   unsigned long flags;
> > >  
> > > + if (page && !PageWaiters(page))
> > > + SetPageWaiters(page);
> > > + if (list_empty(&wait->task_list)) {
> > > + if (exclusive) {
> > > + wait->flags |= WQ_FLAG_EXCLUSIVE;
> > > + __add_wait_queue_tail(q, wait);
> > > + } else {
> > 
> > I'm fairly sure we've just initialized the wait thing to 0, so clearing
> > the bit would be superfluous.
> > 
> 
> I assume you mean the clearing of WQ_FLAG_EXCLUSIVE. It may or may not be
> superfluous. If it's an on-stack wait_queue_t initialised with DEFINE_WAIT()
> then it's redundant. If it's a wait_queue_t that is being reused and
> sometimes used for exclusive waits and other times for non-exclusive
> waits then it's required. The API allows this to happen so I see no harm
> in clearing the flag like the old code did. Am I missing your point?

Yeah, I'm not aware of any other users except the on-stack kind, but
you're right.

Maybe we should stick an object_is_on_stack() test in there to see if
anything falls out, something for a rainy afternoon perhaps..

> > > +void __wake_up_page_bit(wait_queue_head_t *wqh, struct page *page, void 
> > > *word, int bit)
> > > +{
> > > + struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
> > > + unsigned long flags;
> > > +
> > > + spin_lock_irqsave(&wqh->lock, flags);
> > > + if (waitqueue_active(wqh))
> > > + __wake_up_common(wqh, TASK_NORMAL, 1, 0, &key);
> > > + else
> > > + ClearPageWaiters(page);
> > > + spin_unlock_irqrestore(&wqh->lock, flags);
> > > +}
> > 
> > Seeing how word is always going to be &page->flags, might it make sense
> > to remove that argument?
> > 
> 
> The wait_queue was defined on-stack with DEFINE_WAIT_BIT which uses
> wake_bit_function() as a wakeup function and that thing consumes both the
> page->flags and the bit number it's interested in. This is used for both
> PG_writeback and PG_locked so assumptions cannot really be made about
> the value.

Well, both PG_flags come from the same &page->flags word, right? But
yeah, if we ever decide to grow the page frame with another flags word
we'd be in trouble :-)

In any case I don't feel too strongly about either of these points.
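
For reference, the key handed to the wakeup function identifies which bit of
which word, which is how PG_locked and PG_writeback waiters can share a
single hashed queue head (approximately, from include/linux/wait.h of that
era):

    struct wait_bit_key {
            void    *flags;         /* typically &page->flags here */
            int     bit_nr;         /* e.g. PG_locked or PG_writeback */
    };

    struct wait_bit_queue {
            struct wait_bit_key     key;
            wait_queue_t            wait;
    };

    #define __WAIT_BIT_KEY_INITIALIZER(word, bit)                       \
            { .flags = word, .bit_nr = bit, }

    #define DEFINE_WAIT_BIT(name, word, bit)                            \
            struct wait_bit_queue name = {                              \
                    .key    = __WAIT_BIT_KEY_INITIALIZER(word, bit),    \
                    .wait   = {                                         \
                            .private        = current,                  \
                            .func           = wake_bit_function,        \
                            .task_list      =                           \
                                    LIST_HEAD_INIT((name).wait.task_list), \
                    },                                                  \
            }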


Re: [PATCH] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath v5

2014-05-21 Thread Mel Gorman
On Wed, May 21, 2014 at 03:02:23PM +0200, Peter Zijlstra wrote:
> On Wed, May 21, 2014 at 01:15:01PM +0100, Mel Gorman wrote:
> > Andrew had suggested dropping v4 of the patch entirely as the numbers were
> > marginal and the complexity was high. However, even on a relatively small
> > machine running simple workloads the overhead of page_waitqueue and wakeup
> > functions is around 5% of system CPU time. That's quite high for basic
> > operations so I felt it was worth another shot. The performance figures
> > are better with this version than they were for v4 and overall the patch
> > should be more comprehensible.
> 
> Simpler patch and better performance, yay!
> 
> > This patch introduces a new page flag for 64-bit capable machines,
> > PG_waiters, to signal there are processes waiting on PG_lock and uses it to
> > avoid memory barriers and waitqueue hash lookup in the unlock_page fastpath.
> 
> The patch seems to also explicitly use it for PG_writeback, yet no
> mention of that here.
> 

I'll add a note.

> > diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
> > index 0ffa20a..f829e73 100644
> > --- a/kernel/sched/wait.c
> > +++ b/kernel/sched/wait.c
> > @@ -167,31 +167,39 @@ EXPORT_SYMBOL_GPL(__wake_up_sync);/* For internal 
> > use only */
> >   * stops them from bleeding out - it would still allow subsequent
> >   * loads to move into the critical region).
> >   */
> > +static inline void
> 
> Make that __always_inline, that way we're guaranteed to optimize the
> build time constant .page=NULL cases.
> 

Done.

> > +__prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait,
> > +   struct page *page, int state, bool exclusive)
> >  {
> > unsigned long flags;
> >  
> > +   if (page && !PageWaiters(page))
> > +   SetPageWaiters(page);
> > +   if (list_empty(&wait->task_list)) {
> > +   if (exclusive) {
> > +   wait->flags |= WQ_FLAG_EXCLUSIVE;
> > +   __add_wait_queue_tail(q, wait);
> > +   } else {
> 
> I'm fairly sure we've just initialized the wait thing to 0, so clearing
> the bit would be superfluous.
> 

I assume you mean the clearing of WQ_FLAG_EXCLUSIVE. It may or may not be
superfluous. If it's an on-stack wait_queue_t initialised with DEFINE_WAIT()
then it's redundant. If it's a wait_queue_t that is being reused and
sometimes used for exclusive waits and other times for non-exclusive
waits then it's required. The API allows this to happen so I see no harm
in clearing the flag like the old code did. Am I missing your point?
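
For reference, DEFINE_WAIT() of that era leaves .flags zero-initialised,
which is why the explicit clear only matters for a wait_queue_t that is
reused rather than freshly defined on the stack (approximately, from
include/linux/wait.h):

    /* .flags is not named, so it is implicitly zero, i.e. not
     * WQ_FLAG_EXCLUSIVE; a reused entry keeps whatever the last wait set. */
    #define DEFINE_WAIT_FUNC(name, function)                            \
            wait_queue_t name = {                                       \
                    .private        = current,                          \
                    .func           = function,                         \
                    .task_list      = LIST_HEAD_INIT((name).task_list), \
            }

    #define DEFINE_WAIT(name) DEFINE_WAIT_FUNC(name, autoremove_wake_function)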

> > +   wait->flags &= ~WQ_FLAG_EXCLUSIVE;
> > +   __add_wait_queue(q, wait);
> > +   }
> > +   }
> > set_current_state(state);
> > spin_unlock_irqrestore(&q->lock, flags);
> >  }
> > +
> > +void
> > +prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
> > +{
> > +   return __prepare_to_wait(q, wait, NULL, state, false);
> > +}
> >  EXPORT_SYMBOL(prepare_to_wait);
> >  
> >  void
> >  prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int 
> > state)
> >  {
> > +   return __prepare_to_wait(q, wait, NULL, state, true);
> >  }
> >  EXPORT_SYMBOL(prepare_to_wait_exclusive);
> >  
> > @@ -228,7 +236,8 @@ EXPORT_SYMBOL(prepare_to_wait_event);
> >   * the wait descriptor from the given waitqueue if still
> >   * queued.
> >   */
> > +static inline void __finish_wait(wait_queue_head_t *q, wait_queue_t *wait,
> > +   struct page *page)
> >  {
> 
> Same thing, make that __always_inline.
> 

Done.

> > unsigned long flags;
> >  
> > @@ -249,9 +258,16 @@ void finish_wait(wait_queue_head_t *q, wait_queue_t 
> > *wait)
> > if (!list_empty_careful(&wait->task_list)) {
> > spin_lock_irqsave(&q->lock, flags);
> > list_del_init(&wait->task_list);
> > +   if (page && !waitqueue_active(q))
> > +   ClearPageWaiters(page);
> > spin_unlock_irqrestore(&q->lock, flags);
> > }
> >  }
> > +
> > +void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
> > +{
> > +   return __finish_wait(q, wait, NULL);
> > +}
> >  EXPORT_SYMBOL(finish_wait);
> >  
> >  /**
> 
> > @@ -374,6 +427,19 @@ int __sched out_of_line_wait_on_bit_lock(void *word, 
> > int bit,
> >  }
> >  EXPORT_SYMBOL(out_of_line_wait_on_bit_lock);
> >  
> > +void __wake_up_page_bit(wait_queue_head_t *wqh, struct page *page, void 
> > *word, int bit)
> > +{
> > +   struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
> > +   unsigned long flags;
> > +
> > +   spin_lock_irqsave(&wqh->lock, flags);
> > +   if (waitqueue_active(wqh))
> > +   __wake_up_common(wqh, TASK_NORMAL, 1, 0, &key);
> > +   else
> > +   ClearPageWaiters(page);
> > +   spin_unlock_irqrestore(&wqh->lock, flags);
> > +}
> 
> Seeing how word is always going to be &page->flags, might it make sense
> to remove that argument?
> 

The wait_queue was defined on-stack with DEFINE_WAIT_BIT which uses
wake_bit_function() as a wakeup function and that thing consumes both the
page->flags and the bit number it's interested in. This is used for both
PG_writeback and PG_locked so assumptions cannot really be made about
the value.

Re: [PATCH] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath v5

2014-05-21 Thread Peter Zijlstra
On Wed, May 21, 2014 at 01:15:01PM +0100, Mel Gorman wrote:
> Andrew had suggested dropping v4 of the patch entirely as the numbers were
> marginal and the complexity was high. However, even on a relatively small
> machine running simple workloads the overhead of page_waitqueue and wakeup
> functions is around 5% of system CPU time. That's quite high for basic
> operations so I felt it was worth another shot. The performance figures
> are better with this version than they were for v4 and overall the patch
> should be more comprehensible.

Simpler patch and better performance, yay!

> This patch introduces a new page flag for 64-bit capable machines,
> PG_waiters, to signal there are processes waiting on PG_lock and uses it to
> avoid memory barriers and waitqueue hash lookup in the unlock_page fastpath.

The patch seems to also explicitly use it for PG_writeback, yet no
mention of that here.

> diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
> index 0ffa20a..f829e73 100644
> --- a/kernel/sched/wait.c
> +++ b/kernel/sched/wait.c
> @@ -167,31 +167,39 @@ EXPORT_SYMBOL_GPL(__wake_up_sync);  /* For internal 
> use only */
>   * stops them from bleeding out - it would still allow subsequent
>   * loads to move into the critical region).
>   */
> +static inline void

Make that __always_inline, that way we're guaranteed to optimize the
build time constant .page=NULL cases.

> +__prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait,
> + struct page *page, int state, bool exclusive)
>  {
>   unsigned long flags;
>  
> + if (page && !PageWaiters(page))
> + SetPageWaiters(page);
> + if (list_empty(&wait->task_list)) {
> + if (exclusive) {
> + wait->flags |= WQ_FLAG_EXCLUSIVE;
> + __add_wait_queue_tail(q, wait);
> + } else {

I'm fairly sure we've just initialized the wait thing to 0, so clearing
the bit would be superfluous.

> + wait->flags &= ~WQ_FLAG_EXCLUSIVE;
> + __add_wait_queue(q, wait);
> + }
> + }
>   set_current_state(state);
>   spin_unlock_irqrestore(&q->lock, flags);
>  }
> +
> +void
> +prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
> +{
> + return __prepare_to_wait(q, wait, NULL, state, false);
> +}
>  EXPORT_SYMBOL(prepare_to_wait);
>  
>  void
>  prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int 
> state)
>  {
> + return __prepare_to_wait(q, wait, NULL, state, true);
>  }
>  EXPORT_SYMBOL(prepare_to_wait_exclusive);
>  
> @@ -228,7 +236,8 @@ EXPORT_SYMBOL(prepare_to_wait_event);
>   * the wait descriptor from the given waitqueue if still
>   * queued.
>   */
> +static inline void __finish_wait(wait_queue_head_t *q, wait_queue_t *wait,
> + struct page *page)
>  {

Same thing, make that __always_inline.

>   unsigned long flags;
>  
> @@ -249,9 +258,16 @@ void finish_wait(wait_queue_head_t *q, wait_queue_t 
> *wait)
>   if (!list_empty_careful(&wait->task_list)) {
>   spin_lock_irqsave(&q->lock, flags);
>   list_del_init(&wait->task_list);
> + if (page && !waitqueue_active(q))
> + ClearPageWaiters(page);
>   spin_unlock_irqrestore(&q->lock, flags);
>   }
>  }
> +
> +void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
> +{
> + return __finish_wait(q, wait, NULL);
> +}
>  EXPORT_SYMBOL(finish_wait);
>  
>  /**

> @@ -374,6 +427,19 @@ int __sched out_of_line_wait_on_bit_lock(void *word, int 
> bit,
>  }
>  EXPORT_SYMBOL(out_of_line_wait_on_bit_lock);
>  
> +void __wake_up_page_bit(wait_queue_head_t *wqh, struct page *page, void 
> *word, int bit)
> +{
> + struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
> + unsigned long flags;
> +
> + spin_lock_irqsave(&wqh->lock, flags);
> + if (waitqueue_active(wqh))
> + __wake_up_common(wqh, TASK_NORMAL, 1, 0, &key);
> + else
> + ClearPageWaiters(page);
> + spin_unlock_irqrestore(&wqh->lock, flags);
> +}

Seeing how word is always going to be &page->flags, might it make sense
to remove that argument?


Anyway, looks good in principle. Oleg?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath v5

2014-05-21 Thread Peter Zijlstra
On Wed, May 21, 2014 at 01:15:01PM +0100, Mel Gorman wrote:
 Andrew had suggested dropping v4 of the patch entirely as the numbers were
 marginal and the complexity was high. However, even on a relatively small
 machine running simple workloads the overhead of page_waitqueue and wakeup
 functions is around 5% of system CPU time. That's quite high for basic
 operations so I felt it was worth another shot. The performance figures
 are better with this version than they were for v4 and overall the patch
 should be more comprehensible.

Simpler patch and better performance, yay!

 This patch introduces a new page flag for 64-bit capable machines,
 PG_waiters, to signal there are processes waiting on PG_lock and uses it to
 avoid memory barriers and waitqueue hash lookup in the unlock_page fastpath.

The patch seems to also explicitly use it for PG_writeback, yet no
mention of that here.

 diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
 index 0ffa20a..f829e73 100644
 --- a/kernel/sched/wait.c
 +++ b/kernel/sched/wait.c
 @@ -167,31 +167,39 @@ EXPORT_SYMBOL_GPL(__wake_up_sync);  /* For internal 
 use only */
   * stops them from bleeding out - it would still allow subsequent
   * loads to move into the critical region).
   */
 +static inline void

Make that __always_inline, that way we're guaranteed to optimize the
build time constant .page=NULL cases.

 +__prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait,
 + struct page *page, int state, bool exclusive)
  {
   unsigned long flags;
  
 + if (page  !PageWaiters(page))
 + SetPageWaiters(page);
 + if (list_empty(wait-task_list)) {
 + if (exclusive) {
 + wait-flags |= WQ_FLAG_EXCLUSIVE;
 + __add_wait_queue_tail(q, wait);
 + } else {

I'm fairly sure we've just initialized the wait thing to 0, so clearing
the bit would be superfluous.

 + wait-flags = ~WQ_FLAG_EXCLUSIVE;
 + __add_wait_queue(q, wait);
 + }
 + }
   set_current_state(state);
   spin_unlock_irqrestore(&q->lock, flags);
  }
 +
 +void
 +prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
 +{
 + return __prepare_to_wait(q, wait, NULL, state, false);
 +}
  EXPORT_SYMBOL(prepare_to_wait);
  
  void
  prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int 
 state)
  {
 + return __prepare_to_wait(q, wait, NULL, state, true);
  }
  EXPORT_SYMBOL(prepare_to_wait_exclusive);
  
 @@ -228,7 +236,8 @@ EXPORT_SYMBOL(prepare_to_wait_event);
   * the wait descriptor from the given waitqueue if still
   * queued.
   */
 +static inline void __finish_wait(wait_queue_head_t *q, wait_queue_t *wait,
 + struct page *page)
  {

Same thing, make that __always_inline.

   unsigned long flags;
  
 @@ -249,9 +258,16 @@ void finish_wait(wait_queue_head_t *q, wait_queue_t 
 *wait)
   if (!list_empty_careful(&wait->task_list)) {
   spin_lock_irqsave(&q->lock, flags);
   list_del_init(&wait->task_list);
 + if (page && !waitqueue_active(q))
 + ClearPageWaiters(page);
   spin_unlock_irqrestore(&q->lock, flags);
   }
  }
 +
 +void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
 +{
 + return __finish_wait(q, wait, NULL);
 +}
  EXPORT_SYMBOL(finish_wait);
  
  /**

 @@ -374,6 +427,19 @@ int __sched out_of_line_wait_on_bit_lock(void *word, int 
 bit,
  }
  EXPORT_SYMBOL(out_of_line_wait_on_bit_lock);
  
 +void __wake_up_page_bit(wait_queue_head_t *wqh, struct page *page, void 
 *word, int bit)
 +{
 + struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
 + unsigned long flags;
 +
 + spin_lock_irqsave(&wqh->lock, flags);
 + if (waitqueue_active(wqh))
 + __wake_up_common(wqh, TASK_NORMAL, 1, 0, &key);
 + else
 + ClearPageWaiters(page);
 + spin_unlock_irqrestore(&wqh->lock, flags);
 +}

Seeing how word is always going to be &page->flags, might it make sense
to remove that argument?


Anyway, looks good in principle. Oleg?


Re: [PATCH] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath v5

2014-05-21 Thread Mel Gorman
On Wed, May 21, 2014 at 03:02:23PM +0200, Peter Zijlstra wrote:
 On Wed, May 21, 2014 at 01:15:01PM +0100, Mel Gorman wrote:
  Andrew had suggested dropping v4 of the patch entirely as the numbers were
  marginal and the complexity was high. However, even on a relatively small
  machine running simple workloads the overhead of page_waitqueue and wakeup
  functions is around 5% of system CPU time. That's quite high for basic
  operations so I felt it was worth another shot. The performance figures
  are better with this version than they were for v4 and overall the patch
  should be more comprehensible.
 
 Simpler patch and better performance, yay!
 
  This patch introduces a new page flag for 64-bit capable machines,
  PG_waiters, to signal there are processes waiting on PG_lock and uses it to
  avoid memory barriers and waitqueue hash lookup in the unlock_page fastpath.
 
 The patch seems to also explicitly use it for PG_writeback, yet no
 mention of that here.
 

I'll add a note.

  diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
  index 0ffa20a..f829e73 100644
  --- a/kernel/sched/wait.c
  +++ b/kernel/sched/wait.c
  @@ -167,31 +167,39 @@ EXPORT_SYMBOL_GPL(__wake_up_sync);/* For internal 
  use only */
* stops them from bleeding out - it would still allow subsequent
* loads to move into the critical region).
*/
  +static inline void
 
 Make that __always_inline, that way we're guaranteed to optimize the
 build time constant .page=NULL cases.
 

Done.

  +__prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait,
  +   struct page *page, int state, bool exclusive)
   {
  unsigned long flags;
   
  +   if (page && !PageWaiters(page))
  +   SetPageWaiters(page);
  +   if (list_empty(&wait->task_list)) {
  +   if (exclusive) {
  +   wait->flags |= WQ_FLAG_EXCLUSIVE;
  +   __add_wait_queue_tail(q, wait);
  +   } else {
 
 I'm fairly sure we've just initialized the wait thing to 0, so clearing
 the bit would be superfluous.
 

I assume you mean the clearing of WQ_FLAG_EXCLUSIVE. It may or may not be
superfluous. If it's an on-stack wait_queue_t initialised with DEFINE_WAIT()
then it's redundant. If it's a wait_queue_t that is being reused and
sometimes used for exclusive waits and other times for non-exclusive
waits then it's required. The API allows this to happen so I see no harm
in clearing the flag like the old code did. Am I missing your point?
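For reference, DEFINE_WAIT() expands to roughly the following (paraphrased
from memory, so treat it as a sketch rather than the exact wait.h text);
.flags is never mentioned in the initializer, so an on-stack descriptor
starts with WQ_FLAG_EXCLUSIVE clear, while a reused descriptor keeps
whatever the previous wait left behind:

    #define DEFINE_WAIT_FUNC(name, function)                            \
            wait_queue_t name = {                                       \
                    .private   = current,                               \
                    .func      = function,                              \
                    .task_list = LIST_HEAD_INIT((name).task_list),      \
            }

    #define DEFINE_WAIT(name) DEFINE_WAIT_FUNC(name, autoremove_wake_function)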

  +   wait->flags &= ~WQ_FLAG_EXCLUSIVE;
  +   __add_wait_queue(q, wait);
  +   }
  +   }
  set_current_state(state);
  spin_unlock_irqrestore(&q->lock, flags);
   }
  +
  +void
  +prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
  +{
  +   return __prepare_to_wait(q, wait, NULL, state, false);
  +}
   EXPORT_SYMBOL(prepare_to_wait);
   
   void
   prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int 
  state)
   {
  +   return __prepare_to_wait(q, wait, NULL, state, true);
   }
   EXPORT_SYMBOL(prepare_to_wait_exclusive);
   
  @@ -228,7 +236,8 @@ EXPORT_SYMBOL(prepare_to_wait_event);
* the wait descriptor from the given waitqueue if still
* queued.
*/
  +static inline void __finish_wait(wait_queue_head_t *q, wait_queue_t *wait,
  +   struct page *page)
   {
 
 Same thing, make that __always_inline.
 

Done.

  unsigned long flags;
   
  @@ -249,9 +258,16 @@ void finish_wait(wait_queue_head_t *q, wait_queue_t 
  *wait)
  if (!list_empty_careful(&wait->task_list)) {
  spin_lock_irqsave(&q->lock, flags);
  list_del_init(&wait->task_list);
  +   if (page && !waitqueue_active(q))
  +   ClearPageWaiters(page);
  spin_unlock_irqrestore(&q->lock, flags);
  }
   }
  +
  +void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
  +{
  +   return __finish_wait(q, wait, NULL);
  +}
   EXPORT_SYMBOL(finish_wait);
   
   /**
 
  @@ -374,6 +427,19 @@ int __sched out_of_line_wait_on_bit_lock(void *word, 
  int bit,
   }
   EXPORT_SYMBOL(out_of_line_wait_on_bit_lock);
   
  +void __wake_up_page_bit(wait_queue_head_t *wqh, struct page *page, void 
  *word, int bit)
  +{
  +   struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
  +   unsigned long flags;
  +
  +   spin_lock_irqsave(&wqh->lock, flags);
  +   if (waitqueue_active(wqh))
  +   __wake_up_common(wqh, TASK_NORMAL, 1, 0, &key);
  +   else
  +   ClearPageWaiters(page);
  +   spin_unlock_irqrestore(&wqh->lock, flags);
  +}
 
 Seeing how word is always going to be &page->flags, might it make sense
 to remove that argument?
 

The wait_queue was defined on-stack with DEFINE_WAIT_BIT, which uses
wake_bit_function() as the wakeup function, and that function consumes both
the &page->flags word and the bit number it's interested in. This is used
for both PG_writeback and PG_locked so assumptions cannot really be made
about the value.
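To make that concrete, this is roughly what the key and its wakeup function
look like in this era's wait code (reconstructed from memory, so the details
are approximate). The waker's key must carry both the word and the bit
because wake_bit_function() filters on both before waking anyone:

    struct wait_bit_key {
            void    *flags;         /* here: &page->flags */
            int     bit_nr;         /* here: PG_locked or PG_writeback */
    };

    int wake_bit_function(wait_queue_t *wait, unsigned mode, int sync, void *arg)
    {
            struct wait_bit_key *key = arg;
            struct wait_bit_queue *wait_bit =
                    container_of(wait, struct wait_bit_queue, wait);

            if (wait_bit->key.flags != key->flags ||
                wait_bit->key.bit_nr != key->bit_nr ||
                test_bit(key->bit_nr, key->flags))
                    return 0;       /* different word/bit, or bit set again */

            return autoremove_wake_function(wait, mode, sync, key);
    }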

-- 
Mel Gorman
SUSE 

Re: [PATCH] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath v5

2014-05-21 Thread Peter Zijlstra
On Wed, May 21, 2014 at 04:33:57PM +0100, Mel Gorman wrote:
   +__prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait,
   + struct page *page, int state, bool exclusive)
{
 unsigned long flags;

   + if (page && !PageWaiters(page))
   + SetPageWaiters(page);
   + if (list_empty(&wait->task_list)) {
   + if (exclusive) {
   + wait->flags |= WQ_FLAG_EXCLUSIVE;
   + __add_wait_queue_tail(q, wait);
   + } else {
  
  I'm fairly sure we've just initialized the wait thing to 0, so clearing
  the bit would be superfluous.
  
 
 I assume you mean the clearing of WQ_FLAG_EXCLUSIVE. It may or may not be
 superfluous. If it's an on-stack wait_queue_t initialised with DEFINE_WAIT()
 then it's redundant. If it's a wait_queue_t that is being reused and
 sometimes used for exclusive waits and other times for non-exclusive
 waits then it's required. The API allows this to happen so I see no harm
 in clearing the flag like the old code did. Am I missing your point?

Yeah, I'm not aware of any other users except the on-stack kind, but
you're right.

Maybe we should stick an object_is_on_stack() test in there to see if
anything falls out, something for a rainy afternoon perhaps..
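Something like the following dropped into __prepare_to_wait() would do it
(untested sketch, purely a debug probe; object_is_on_stack() is the helper
from linux/sched.h):

            /*
             * Debug probe only: warn once if any caller passes a wait
             * descriptor that does not live on the current task's stack,
             * i.e. a potentially reused descriptor.
             */
            WARN_ON_ONCE(!object_is_on_stack(wait));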

   +void __wake_up_page_bit(wait_queue_head_t *wqh, struct page *page, void 
   *word, int bit)
   +{
   + struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
   + unsigned long flags;
   +
   + spin_lock_irqsave(&wqh->lock, flags);
   + if (waitqueue_active(wqh))
   + __wake_up_common(wqh, TASK_NORMAL, 1, 0, &key);
   + else
   + ClearPageWaiters(page);
   + spin_unlock_irqrestore(&wqh->lock, flags);
   +}
  
  Seeing how word is always going to be &page->flags, might it make sense
  to remove that argument?
  
 
 The wait_queue was defined on-stack with DEFINE_WAIT_BIT which uses
 wake_bit_function() as a wakeup function and that thing consumes both the
  &page->flags and the bit number it's interested in. This is used for both
 PG_writeback and PG_locked so assumptions cannot really be made about
 the value.

Well, both PG_ flags come from the same page->flags word, right? But
yeah, if we ever decide to grow the page frame with another flags word
we'd be in trouble :-)

In any case I don't feel too strongly about either of these points.


Re: [PATCH] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath v5

2014-05-21 Thread Andrew Morton
On Wed, 21 May 2014 13:15:01 +0100 Mel Gorman mgor...@suse.de wrote:

 Andrew had suggested dropping v4 of the patch entirely as the numbers were
 marginal and the complexity was high. However, even on a relatively small
 machine running simple workloads the overhead of page_waitqueue and wakeup
 functions is around 5% of system CPU time. That's quite high for basic
 operations so I felt it was worth another shot. The performance figures
 are better with this version than they were for v4 and overall the patch
 should be more comprehensible.
 
 Changelog since v4
 o Remove dependency on io_schedule_timeout
 o Push waiting logic down into waitqueue
 
 This patch introduces a new page flag for 64-bit capable machines,
 PG_waiters, to signal there are processes waiting on PG_lock and uses it to
 avoid memory barriers and waitqueue hash lookup in the unlock_page fastpath.
 
 This adds a few branches to the fast path but avoids bouncing a dirty
 cache line between CPUs. 32-bit machines always take the slow path but the
 primary motivation for this patch is large machines so I do not think that
 is a concern.
 
 The test case used to evaluate this is a simple dd of a large file done
 multiple times with the file deleted on each iteration. The size of
 the file is 1/10th physical memory to avoid dirty page balancing. In the
 async case it will be possible that the workload completes without even
 hitting the disk and will have variable results but highlight the impact
 of mark_page_accessed for async IO. The sync results are expected to be
 more stable. The exception is tmpfs where the normal case is for the IO
 to not hit the disk.
 
 The test machine was single socket and UMA to avoid any scheduling or
 NUMA artifacts. Throughput and wall times are presented for sync IO, only
 wall times are shown for async as the granularity reported by dd and the
 variability is unsuitable for comparison. As async results were variable
 due to writeback timings, I'm only reporting the maximum figures. The sync
 results were stable enough to make the mean and stddev uninteresting.
 
 The performance results are reported based on a run with no profiling.
 Profile data is based on a separate run with oprofile running. The
 kernels being compared are accessed-v2 which is the patch series up
 to this patch where as lockpage-v2 includes this patch.
 
 ...

 --- a/include/linux/wait.h
 +++ b/include/linux/wait.h
 @@ -147,8 +147,13 @@ void __wake_up_sync_key(wait_queue_head_t *q, unsigned 
 int mode, int nr, void *k
  void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr);
  void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr);
  void __wake_up_bit(wait_queue_head_t *, void *, int);
 +void __wake_up_page_bit(wait_queue_head_t *, struct page *page, void *, int);

You're going to need to forward-declare struct page in wait.h.  The
good thing about this is that less people will notice that we've gone
and mentioned struct page in wait.h :(
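Concretely, something like this near the top of wait.h would do (illustrative
sketch only, the exact placement is a guess), since the prototype only needs
the pointer type:

    struct page;    /* forward declaration keeps mm headers out of wait.h */

    void __wake_up_page_bit(wait_queue_head_t *, struct page *page, void *, int);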

  int __wait_on_bit(wait_queue_head_t *, struct wait_bit_queue *, int (*)(void 
 *), unsigned);
 +int __wait_on_page_bit(wait_queue_head_t *, struct wait_bit_queue *,
 
 ...

 --- a/kernel/sched/wait.c
 +++ b/kernel/sched/wait.c
 @@ -167,31 +167,39 @@ EXPORT_SYMBOL_GPL(__wake_up_sync);  /* For internal 
 use only */
   * stops them from bleeding out - it would still allow subsequent
   * loads to move into the critical region).
   */
 -void
 -prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
 +static inline void
 +__prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait,
 + struct page *page, int state, bool exclusive)

Putting MM stuff into core waitqueue code is rather bad.  I really
don't know how I'm going to explain this to my family.

  {
   unsigned long flags;
  
 - wait->flags &= ~WQ_FLAG_EXCLUSIVE;
   spin_lock_irqsave(&q->lock, flags);
 - if (list_empty(&wait->task_list))
 - __add_wait_queue(q, wait);
 + if (page && !PageWaiters(page))
 + SetPageWaiters(page);

And this isn't racy because we're assuming that all users of `page' are
using the same waitqueue.  ie, assuming all callers use
page_waitqueue()?   Subtle, unobvious, worth documenting.

 + if (list_empty(&wait->task_list)) {
 + if (exclusive) {
 + wait->flags |= WQ_FLAG_EXCLUSIVE;
 + __add_wait_queue_tail(q, wait);
 + } else {
 + wait->flags &= ~WQ_FLAG_EXCLUSIVE;
 + __add_wait_queue(q, wait);
 + }
 + }
   set_current_state(state);
   spin_unlock_irqrestore(&q->lock, flags);
  }
 
 ...

 @@ -228,7 +236,8 @@ EXPORT_SYMBOL(prepare_to_wait_event);
   * the wait descriptor from the given waitqueue if still
   * queued.
   */
 -void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
 +static inline void __finish_wait(wait_queue_head_t *q, wait_queue_t *wait,
 + struct page *page)


Re: [PATCH] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath v5

2014-05-21 Thread Peter Zijlstra
On Wed, May 21, 2014 at 02:26:22PM -0700, Andrew Morton wrote:
  +static inline void
  +__prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait,
  +   struct page *page, int state, bool exclusive)
 
 Putting MM stuff into core waitqueue code is rather bad.  I really
 don't know how I'm going to explain this to my family.

Right, so we could avoid all that and make the functions in mm/filemap.c
rather large and opencode a bunch of wait.c stuff.

Which is pretty much what I initially pseudo proposed.

  +   __ClearPageWaiters(page);
 
 We're freeing the page - if someone is still waiting on it then we have
 a huge bug?  It's the mysterious collision thing again I hope?

Yeah, so we only clear that bit when at 'unlock' we find there are no
more pending waiters, so if the last unlock still had a waiter, we'll
leave the bit set. So it's entirely reasonable to still have it set when
we free a page etc..


Re: [PATCH] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath v5

2014-05-21 Thread Andrew Morton
On Wed, 21 May 2014 23:33:54 +0200 Peter Zijlstra pet...@infradead.org wrote:

 On Wed, May 21, 2014 at 02:26:22PM -0700, Andrew Morton wrote:
   +static inline void
   +__prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait,
   + struct page *page, int state, bool exclusive)
  
  Putting MM stuff into core waitqueue code is rather bad.  I really
  don't know how I'm going to explain this to my family.
 
 Right, so we could avoid all that and make the functions in mm/filemap.c
 rather large and opencode a bunch of wait.c stuff.
 

The world won't end if we do it Mel's way and it's probably the most
efficient.  But ugh.  This stuff does raise the "it had better be a
useful patch" bar.

 Which is pretty much what I initially pseudo proposed.

Alternative solution is not to merge the patch ;)

   + __ClearPageWaiters(page);
  
  We're freeing the page - if someone is still waiting on it then we have
  a huge bug?  It's the mysterious collision thing again I hope?
 
 Yeah, so we only clear that bit when at 'unlock' we find there are no
 more pending waiters, so if the last unlock still had a waiter, we'll
 leave the bit set.

Confused.  If the last unlock had a waiter, that waiter will get woken
up so there are no waiters any more, so the last unlock clears the flag.

um, how do we determine that there are no more waiters?  By looking at
the waitqueue.  But that waitqueue is hashed, so it may contain waiters
for other pages so we're screwed?  But we could just go and wake up the
other-page waiters anyway and still clear PG_waiters?

um2, we're using exclusive waitqueues so we can't (or don't) wake all
waiters, so we're screwed again?

(This process is proving to be a hard way of writing Mel's changelog btw).

If I'm still on track here, what happens if we switch to wake-all so we
can avoid the dangling flag?  I doubt if there are many collisions on
that hash table?

If there *are* a lot of collisions, I bet it's because a great pile of
threads are all waiting on the same page.  If they're trying to lock
that page then wake-all is bad.  But if they're just waiting for IO
completion (probable) then it's OK.

I'll stop now.


Re: [PATCH] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath v5

2014-05-21 Thread Mel Gorman
On Wed, May 21, 2014 at 02:26:22PM -0700, Andrew Morton wrote:
 On Wed, 21 May 2014 13:15:01 +0100 Mel Gorman mgor...@suse.de wrote:
 
  Andrew had suggested dropping v4 of the patch entirely as the numbers were
  marginal and the complexity was high. However, even on a relatively small
  machine running simple workloads the overhead of page_waitqueue and wakeup
  functions is around 5% of system CPU time. That's quite high for basic
  operations so I felt it was worth another shot. The performance figures
  are better with this version than they were for v4 and overall the patch
  should be more comprehensible.
  
  Changelog since v4
  o Remove dependency on io_schedule_timeout
  o Push waiting logic down into waitqueue
  
  This patch introduces a new page flag for 64-bit capable machines,
  PG_waiters, to signal there are processes waiting on PG_lock and uses it to
  avoid memory barriers and waitqueue hash lookup in the unlock_page fastpath.
  
  This adds a few branches to the fast path but avoids bouncing a dirty
  cache line between CPUs. 32-bit machines always take the slow path but the
  primary motivation for this patch is large machines so I do not think that
  is a concern.
  
  The test case used to evaluate this is a simple dd of a large file done
  multiple times with the file deleted on each iteration. The size of
  the file is 1/10th physical memory to avoid dirty page balancing. In the
  async case it will be possible that the workload completes without even
  hitting the disk and will have variable results but highlight the impact
  of mark_page_accessed for async IO. The sync results are expected to be
  more stable. The exception is tmpfs where the normal case is for the IO
  to not hit the disk.
  
  The test machine was single socket and UMA to avoid any scheduling or
  NUMA artifacts. Throughput and wall times are presented for sync IO, only
  wall times are shown for async as the granularity reported by dd and the
  variability is unsuitable for comparison. As async results were variable
  due to writeback timings, I'm only reporting the maximum figures. The sync
  results were stable enough to make the mean and stddev uninteresting.
  
  The performance results are reported based on a run with no profiling.
  Profile data is based on a separate run with oprofile running. The
  kernels being compared are accessed-v2 which is the patch series up
  to this patch where as lockpage-v2 includes this patch.
  
  ...
 
  --- a/include/linux/wait.h
  +++ b/include/linux/wait.h
  @@ -147,8 +147,13 @@ void __wake_up_sync_key(wait_queue_head_t *q, unsigned 
  int mode, int nr, void *k
   void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr);
   void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr);
   void __wake_up_bit(wait_queue_head_t *, void *, int);
  +void __wake_up_page_bit(wait_queue_head_t *, struct page *page, void *, 
  int);
 
 You're going to need to forward-declare struct page in wait.h.  The
 good thing about this is that less people will notice that we've gone
 and mentioned struct page in wait.h :(
 

Will add the forward-declare.

   int __wait_on_bit(wait_queue_head_t *, struct wait_bit_queue *, int 
  (*)(void *), unsigned);
  +int __wait_on_page_bit(wait_queue_head_t *, struct wait_bit_queue *,
  
  ...
 
  --- a/kernel/sched/wait.c
  +++ b/kernel/sched/wait.c
  @@ -167,31 +167,39 @@ EXPORT_SYMBOL_GPL(__wake_up_sync);/* For internal 
  use only */
* stops them from bleeding out - it would still allow subsequent
* loads to move into the critical region).
*/
  -void
  -prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
  +static inline void
  +__prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait,
  +   struct page *page, int state, bool exclusive)
 
 Putting MM stuff into core waitqueue code is rather bad.  I really
 don't know how I'm going to explain this to my family.
 

The alternative to updating wait.h and wait.c was open-coding the waitqueue
modifications in filemap.c, but that is just as ugly. The waitqueue stuff
is complex and there was motivation to keep it in one place even if we
are special-casing struct page handling.

FWIW, I cannot explain anything I do in work to my family. It gets blank
looks no matter what.


   {
  unsigned long flags;
   
  -   wait->flags &= ~WQ_FLAG_EXCLUSIVE;
  spin_lock_irqsave(&q->lock, flags);
  -   if (list_empty(&wait->task_list))
  -   __add_wait_queue(q, wait);
  +   if (page && !PageWaiters(page))
  +   SetPageWaiters(page);
 
 And this isn't racy because we're assuming that all users of `page' are
 using the same waitqueue.  ie, assuming all callers use
 page_waitqueue()?   Subtle, unobvious, worth documenting.
 

All users of the page will get the same waitqueue. page_waitqueue is hashed
on the pointer value.
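For reference, the lookup in mm/filemap.c is roughly the following (per-zone
hashed wait tables in kernels of this vintage, quoted from memory), which is
why every caller that passes the same page ends up on the same
wait_queue_head_t:

    static wait_queue_head_t *page_waitqueue(struct page *page)
    {
            const struct zone *zone = page_zone(page);

            return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
    }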

  +   if (list_empty(&wait->task_list)) {
  +   if (exclusive) {
  +   wait->flags |= 

Re: [PATCH] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath v5

2014-05-21 Thread Mel Gorman
On Wed, May 21, 2014 at 02:50:00PM -0700, Andrew Morton wrote:
 On Wed, 21 May 2014 23:33:54 +0200 Peter Zijlstra pet...@infradead.org 
 wrote:
 
  On Wed, May 21, 2014 at 02:26:22PM -0700, Andrew Morton wrote:
+static inline void
+__prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait,
+   struct page *page, int state, bool exclusive)
   
   Putting MM stuff into core waitqueue code is rather bad.  I really
   don't know how I'm going to explain this to my family.
  
  Right, so we could avoid all that and make the functions in mm/filemap.c
  rather large and opencode a bunch of wait.c stuff.
  
 
 The world won't end if we do it Mel's way and it's probably the most
 efficient.  But ugh.  This stuff does raise the "it had better be a
 useful patch" bar.
 
  Which is pretty much what I initially pseudo proposed.
 
 Alternative solution is not to merge the patch ;)
 

While true, the overhead of the page_waitqueue lookups and unnecessary
wakeups sucks even on small machines. Not only does it hit us during simple
operations like dd to a file but we would hit it during page reclaim as
well, which is trylock_page/unlock_page intensive.
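To spell out the shape of the fastpath in question (a sketch only, not the
actual hunk from the patch, and the barrier details in the real patch may
differ), the win is that the common no-waiter unlock never does the
page_waitqueue() hash lookup or takes the waitqueue lock at all:

    void unlock_page(struct page *page)
    {
            VM_BUG_ON_PAGE(!PageLocked(page), page);
            clear_bit_unlock(PG_locked, &page->flags);
            smp_mb__after_clear_bit();

            /* Only pay for the hash lookup and wakeup if a sleeper flagged itself */
            if (PageWaiters(page))
                    __wake_up_page_bit(page_waitqueue(page), page,
                                       &page->flags, PG_locked);
    }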

+   __ClearPageWaiters(page);
   
   We're freeing the page - if someone is still waiting on it then we have
   a huge bug?  It's the mysterious collision thing again I hope?
  
  Yeah, so we only clear that bit when at 'unlock' we find there are no
  more pending waiters, so if the last unlock still had a waiter, we'll
  leave the bit set.
 
 Confused.  If the last unlock had a waiter, that waiter will get woken
 up so there are no waiters any more, so the last unlock clears the flag.
 
 um, how do we determine that there are no more waiters?  By looking at
 the waitqueue.  But that waitqueue is hashed, so it may contain waiters
 for other pages so we're screwed?  But we could just go and wake up the
 other-page waiters anyway and still clear PG_waiters?
 
 um2, we're using exclusive waitqueues so we can't (or don't) wake all
 waiters, so we're screwed again?
 
 (This process is proving to be a hard way of writing Mel's changelog btw).
 
 If I'm still on track here, what happens if we switch to wake-all so we
 can avoid the dangling flag?  I doubt if there are many collisions on
 that hash table?
 
 If there *are* a lot of collisions, I bet it's because a great pile of
 threads are all waiting on the same page.  If they're trying to lock
 that page then wake-all is bad.  But if they're just waiting for IO
 completion (probable) then it's OK.
 
 I'll stop now.

Rather than putting details in the changelog, here is an updated version
that hopefully improves the commentary to the point where it's actually
clear. 

---8<---
From: Nick Piggin npig...@suse.de
Subject: [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups 
in unlock_page fastpath v6

Changelog since v5
o __always_inline where appropriate (peterz)
o Documentation (akpm)

Changelog since v4
o Remove dependency on io_schedule_timeout
o Push waiting logic down into waitqueue

This patch introduces a new page flag for 64-bit capable machines,
PG_waiters, to signal there are processes waiting on PG_lock or PG_writeback
and uses it to avoid memory barriers and waitqueue hash lookup in the
unlock_page fastpath.

This adds a few branches to the fast path but avoids bouncing a dirty
cache line between CPUs. 32-bit machines always take the slow path but the
primary motivation for this patch is large machines so I do not think that
is a concern.

The test case used to evaluate this is a simple dd of a large file done
multiple times with the file deleted on each iteration. The size of
the file is 1/10th physical memory to avoid dirty page balancing. In the
async case it will be possible that the workload completes without even
hitting the disk and will have variable results but highlight the impact
of mark_page_accessed for async IO. The sync results are expected to be
more stable. The exception is tmpfs where the normal case is for the IO
to not hit the disk.

The test machine was single socket and UMA to avoid any scheduling or
NUMA artifacts. Throughput and wall times are presented for sync IO, only
wall times are shown for async as the granularity reported by dd and the
variability is unsuitable for comparison. As async results were variable
due to writeback timings, I'm only reporting the maximum figures. The sync
results were stable enough to make the mean and stddev uninteresting.

The performance results are reported based on a run with no profiling.
Profile data is based on a separate run with oprofile running. The
kernels being compared are accessed-v2 which is the patch series up
to this patch where as lockpage-v2 includes this patch.

async dd
                            3.15.0-rc5            3.15.0-rc5
                                 mmotm           lockpage-v5
btrfs Max  ddtime  0.5863 (  0.00%)  0.5621