Re: [PATCH-tip v2 02/12] locking/rwsem: Implement lock handoff to prevent lock starvation

2019-04-11 Thread Peter Zijlstra
On Wed, Apr 10, 2019 at 10:25:16PM -0400, Waiman Long wrote:
> On 04/10/2019 02:44 PM, Peter Zijlstra wrote:

> > However there is another site that fiddles with the HANDOFF bit, namely
> > __rwsem_down_write_failed_common(), and that does:
> >
> > +	atomic_long_or(RWSEM_FLAG_HANDOFF, &sem->count);
> >
> > _OUTSIDE_ of ->wait_lock, which would yield:
> >
> >	CPU0					CPU1
> >
> >	oldcount = atomic_long_fetch_add(adjustment, &sem->count)
> >
> >						atomic_long_or(HANDOFF)
> >
> >	if (!(oldcount & HANDOFF))
> >		adjustment -= HANDOFF;
> >
> >	atomic_long_sub(adjustment)
> >
> > *whoops*, incremented HANDOFF on HANDOFF.
> >
> >
> > And there's not a comment in sight that would elucidate if this is
> > possible or not.
> >
> 
> A writer can only set the handoff bit if it is the first waiter in the
> queue. If it is the first waiter, a racing __rwsem_mark_wake() will see
> that the first waiter is a writer and so won't go into the reader path.
> I know I sometimes don't spell out all the conditions that may look
> obvious to me but not to others. I will elaborate more in comments.

Aah, indeed.




Re: [PATCH-tip v2 02/12] locking/rwsem: Implement lock handoff to prevent lock starvation

2019-04-10 Thread Waiman Long
On 04/10/2019 02:44 PM, Peter Zijlstra wrote:
> On Fri, Apr 05, 2019 at 03:21:05PM -0400, Waiman Long wrote:
>> Because of writer lock stealing, it is possible that a constant
>> stream of incoming writers will cause a waiting writer or reader to
>> wait indefinitely leading to lock starvation.
>>
>> The mutex code has a lock handoff mechanism to prevent lock starvation.
>> This patch implements a similar lock handoff mechanism to disable
>> lock stealing and force lock handoff to the first waiter in the queue
>> after at least a 5ms waiting period. The waiting period is used to
>> avoid discouraging lock stealing too much to affect performance.
> I would say the handoff is not at all similar to the mutex code. It is
> in fact radically different.
>

I mean they are similar in concept. Of course, the implementations are
quite different.

>> @@ -131,6 +138,15 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
>>  		adjustment = RWSEM_READER_BIAS;
>>  		oldcount = atomic_long_fetch_add(adjustment, &sem->count);
>>  		if (unlikely(oldcount & RWSEM_WRITER_MASK)) {
>> +			/*
>> +			 * Initiate handoff to reader, if applicable.
>> +			 */
>> +			if (!(oldcount & RWSEM_FLAG_HANDOFF) &&
>> +			    time_after(jiffies, waiter->timeout)) {
>> +				adjustment -= RWSEM_FLAG_HANDOFF;
>> +				lockevent_inc(rwsem_rlock_handoff);
>> +			}
>> +
>>  			atomic_long_sub(adjustment, &sem->count);
>>  			return;
>>  		}
> That confuses the heck out of me...
>
> The above seems to rely on __rwsem_mark_wake() to be fully serialized
> (and it is, by ->wait_lock, but that isn't spelled out anywhere) such
> that we don't get double increment of FLAG_HANDOFF.
>
> So there is NO __rwsem_mark_wake() vs __rwsem_mark_wake() race like:
>
>	CPU0					CPU1
>
>	oldcount = atomic_long_fetch_add(adjustment, &sem->count)
>
>						oldcount = atomic_long_fetch_add(adjustment, &sem->count)
>
>	if (!(oldcount & HANDOFF))
>		adjustment -= HANDOFF;
>
>						if (!(oldcount & HANDOFF))
>							adjustment -= HANDOFF;
>	atomic_long_sub(adjustment)
>						atomic_long_sub(adjustment)
>
>
> *whoops* double negative decrement of HANDOFF (aka double increment).

Yes, __rwsem_mark_wake() is always called with wait_lock held. I can add
a lockdep_assert() statement to clarify this point.
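
Something like the following at the top of __rwsem_mark_wake() should do
(just a sketch of the intent, using the stock lockdep helper):

	/* All the handoff accounting below relies on ->wait_lock. */
	lockdep_assert_held(&sem->wait_lock);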

>
> However there is another site that fiddles with the HANDOFF bit, namely
> __rwsem_down_write_failed_common(), and that does:
>
> +	atomic_long_or(RWSEM_FLAG_HANDOFF, &sem->count);
>
> _OUTSIDE_ of ->wait_lock, which would yield:
>
>	CPU0					CPU1
>
>	oldcount = atomic_long_fetch_add(adjustment, &sem->count)
>
>						atomic_long_or(HANDOFF)
>
>	if (!(oldcount & HANDOFF))
>		adjustment -= HANDOFF;
>
>	atomic_long_sub(adjustment)
>
> *whoops*, incremented HANDOFF on HANDOFF.
>
>
> And there's not a comment in sight that would elucidate if this is
> possible or not.
>

A writer can only set the handoff bit if it is the first waiter in the
queue. If it is the first waiter, a racing __rwsem_mark_wake() will see
that the first waiter is a writer and so won't go into the reader path.
I know I sometimes don't spell out all the conditions that may look
obvious to me but not to others. I will elaborate more in comments.
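
To make that concrete, the write side only attempts the bit when it is at
the head of the queue, roughly like this (a sketch of the intent rather
than the exact patch code; "first" here stands for "this waiter is at the
head of the wait list"):

	/*
	 * Only the first waiter may set the handoff bit.  When that
	 * waiter is a writer, a concurrent __rwsem_mark_wake() sees a
	 * writer at the head of the queue and never enters the reader
	 * handoff path above, so the bit cannot be set twice.
	 */
	if (first && !(atomic_long_read(&sem->count) & RWSEM_FLAG_HANDOFF))
		atomic_long_or(RWSEM_FLAG_HANDOFF, &sem->count);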

> Also:
>
> +			atomic_long_or(RWSEM_FLAG_HANDOFF, &sem->count);
> +			first++;
> +
> +			/*
> +			 * Make sure the handoff bit is seen by
> +			 * others before proceeding.
> +			 */
> +			smp_mb__after_atomic();
>
> That comment is utter nonsense. smp_mb() doesn't (and cannot) 'make
> visible'. There needs to be order between two memops on both sides.
>
I kind of added that for safety. I will take some time to rethink whether
it is really necessary.

Cheers,
Longman




Re: [PATCH-tip v2 02/12] locking/rwsem: Implement lock handoff to prevent lock starvation

2019-04-10 Thread Peter Zijlstra
On Fri, Apr 05, 2019 at 03:21:05PM -0400, Waiman Long wrote:
> Because of writer lock stealing, it is possible that a constant
> stream of incoming writers will cause a waiting writer or reader to
> wait indefinitely leading to lock starvation.
> 
> The mutex code has a lock handoff mechanism to prevent lock starvation.
> This patch implements a similar lock handoff mechanism to disable
> lock stealing and force lock handoff to the first waiter in the queue
> after at least a 5ms waiting period. The waiting period is used to
> avoid discouraging lock stealing too much to affect performance.

I would say the handoff is not at all similar to the mutex code. It is
in fact radically different.

> @@ -131,6 +138,15 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
>  		adjustment = RWSEM_READER_BIAS;
>  		oldcount = atomic_long_fetch_add(adjustment, &sem->count);
>  		if (unlikely(oldcount & RWSEM_WRITER_MASK)) {
> +			/*
> +			 * Initiate handoff to reader, if applicable.
> +			 */
> +			if (!(oldcount & RWSEM_FLAG_HANDOFF) &&
> +			    time_after(jiffies, waiter->timeout)) {
> +				adjustment -= RWSEM_FLAG_HANDOFF;
> +				lockevent_inc(rwsem_rlock_handoff);
> +			}
> +
>  			atomic_long_sub(adjustment, &sem->count);
>  			return;
>  		}

That confuses the heck out of me...

The above seems to rely on __rwsem_mark_wake() to be fully serialized
(and it is, by ->wait_lock, but that isn't spelled out anywhere) such
that we don't get double increment of FLAG_HANDOFF.

So there is NO __rwsem_mark_wake() vs __rwsem_mark_wake() race like:

	CPU0					CPU1

	oldcount = atomic_long_fetch_add(adjustment, &sem->count)

						oldcount = atomic_long_fetch_add(adjustment, &sem->count)

	if (!(oldcount & HANDOFF))
		adjustment -= HANDOFF;

						if (!(oldcount & HANDOFF))
							adjustment -= HANDOFF;
	atomic_long_sub(adjustment)
						atomic_long_sub(adjustment)


*whoops* double negative decrement of HANDOFF (aka double increment).


However there is another site that fiddles with the HANDOFF bit, namely
__rwsem_down_write_failed_common(), and that does:

+			atomic_long_or(RWSEM_FLAG_HANDOFF, &sem->count);

_OUTSIDE_ of ->wait_lock, which would yield:

	CPU0					CPU1

	oldcount = atomic_long_fetch_add(adjustment, &sem->count)

						atomic_long_or(HANDOFF)

	if (!(oldcount & HANDOFF))
		adjustment -= HANDOFF;

	atomic_long_sub(adjustment)

*whoops*, incremented HANDOFF on HANDOFF.


And there's not a comment in sight that would elucidate if this is
possible or not.


Also:

+			atomic_long_or(RWSEM_FLAG_HANDOFF, &sem->count);
+			first++;
+
+			/*
+			 * Make sure the handoff bit is seen by
+			 * others before proceeding.
+			 */
+			smp_mb__after_atomic();

That comment is utter nonsense. smp_mb() doesn't (and cannot) 'make
visible'. There needs to be order between two memops on both sides.
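
That is, a full barrier only means something as one half of a pairing,
e.g. the usual message-passing pattern (illustrative only, not code from
this patch):

	/* CPU0 (producer) */
	WRITE_ONCE(data, 1);
	smp_mb();		/* order the data store before the flag store */
	WRITE_ONCE(flag, 1);

	/* CPU1 (consumer) */
	while (!READ_ONCE(flag))
		cpu_relax();
	smp_mb();		/* order the flag load before the data load */
	BUG_ON(READ_ONCE(data) != 1);	/* cannot fire */

Without a memory access before the barrier on one side and after it on
the other, there is nothing for smp_mb__after_atomic() to order.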



Re: [PATCH-tip v2 02/12] locking/rwsem: Implement lock handoff to prevent lock starvation

2019-04-10 Thread Waiman Long
On 04/10/2019 11:10 AM, Peter Zijlstra wrote:
> On Fri, Apr 05, 2019 at 03:21:05PM -0400, Waiman Long wrote:
>> +#define RWSEM_WAIT_TIMEOUT  ((HZ - 1)/200 + 1)
> Given the choices in HZ, the above seems fairly 'optimistic'.

I can tighten it up and make it less "optimistic" :-)

Cheers,
Longman



Re: [PATCH-tip v2 02/12] locking/rwsem: Implement lock handoff to prevent lock starvation

2019-04-10 Thread Waiman Long
On 04/10/2019 11:07 AM, Peter Zijlstra wrote:
> On Fri, Apr 05, 2019 at 03:21:05PM -0400, Waiman Long wrote:
>> Because of writer lock stealing, it is possible that a constant
>> stream of incoming writers will cause a waiting writer or reader to
>> wait indefinitely leading to lock starvation.
>>
>> The mutex code has a lock handoff mechanism to prevent lock starvation.
>> This patch implements a similar lock handoff mechanism to disable
>> lock stealing and force lock handoff to the first waiter in the queue
>> after at least a 5ms waiting period. The waiting period is used to
>> avoid discouraging lock stealing too much to affect performance.
> So the mutex code doesn't have that timeout, it forces the handoff if the
> top waiter fails to acquire.
>
> I don't find the above sufficiently justifies the additional complexity.
> What were the numbers with the simple scheme vs this etc..

When the handoff bit is set, it stops the lock from being acquired by
anyone else until the front waiter is woken up, scheduled, and takes the
lock. Doing that too frequently will impede throughput when the
rwsem is highly contended. I can run some performance tests to show the
difference it can make.

I have also been thinking about having the front waiter set the handoff
bit and then spin on the owner so that it can acquire the lock immediately
after it is freed. That will be a follow-up patch.
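
Roughly (only a sketch of the idea, assuming a helper such as
rwsem_owner() to peek at the current owner; the real follow-up patch may
end up looking different):

	/* First waiter: claim the handoff, then wait for the holder to go away. */
	atomic_long_or(RWSEM_FLAG_HANDOFF, &sem->count);
	while (rwsem_owner(sem))	/* lock still held by the current owner */
		cpu_relax();
	/* The lock was released with the handoff bit set: take it directly. */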

Cheers,
Longman




Re: [PATCH-tip v2 02/12] locking/rwsem: Implement lock handoff to prevent lock starvation

2019-04-10 Thread Peter Zijlstra
On Fri, Apr 05, 2019 at 03:21:05PM -0400, Waiman Long wrote:
> +#define RWSEM_WAIT_TIMEOUT   ((HZ - 1)/200 + 1)

Given the choices in HZ, the above seems fairly 'optimistic'.
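
For reference, ((HZ - 1)/200 + 1) works out as follows for the common HZ
choices (my arithmetic, not from the patch):

	HZ = 100:   99/200 + 1 = 1 jiffy   = 10 ms
	HZ = 250:  249/200 + 1 = 2 jiffies =  8 ms
	HZ = 300:  299/200 + 1 = 2 jiffies ~= 6.7 ms
	HZ = 1000: 999/200 + 1 = 5 jiffies =  5 ms

So only HZ=1000 actually gives the advertised 5ms; coarser HZ values give
up to double that.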


Re: [PATCH-tip v2 02/12] locking/rwsem: Implement lock handoff to prevent lock starvation

2019-04-10 Thread Peter Zijlstra
On Fri, Apr 05, 2019 at 03:21:05PM -0400, Waiman Long wrote:
> Because of writer lock stealing, it is possible that a constant
> stream of incoming writers will cause a waiting writer or reader to
> wait indefinitely leading to lock starvation.
> 
> The mutex code has a lock handoff mechanism to prevent lock starvation.
> This patch implements a similar lock handoff mechanism to disable
> lock stealing and force lock handoff to the first waiter in the queue
> after at least a 5ms waiting period. The waiting period is used to
> avoid discouraging lock stealing too much to affect performance.

So the mutex code doesn't have that timeout, it forces the handoff if the
top waiter fails to acquire.

I don't find the above sufficiently justifies the additional complexity.
What were the numbers with the simple scheme vs this etc..
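
For comparison, the mutex scheme is roughly (a simplified, from-memory
sketch of kernel/locking/mutex.c, not an exact quote):

	/* Waiter side: the woken top waiter failed its trylock, demand handoff. */
	if (first)
		__mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);

	/*
	 * Unlock slowpath: if handoff was requested, pass ownership
	 * straight to the top waiter instead of letting it be stolen.
	 */
	if (owner & MUTEX_FLAG_HANDOFF)
		__mutex_handoff(lock, next);

No timeout anywhere; the handoff is driven purely by the top waiter
failing to acquire.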


[PATCH-tip v2 02/12] locking/rwsem: Implement lock handoff to prevent lock starvation

2019-04-05 Thread Waiman Long
Because of writer lock stealing, it is possible that a constant
stream of incoming writers will cause a waiting writer or reader to
wait indefinitely leading to lock starvation.

The mutex code has a lock handoff mechanism to prevent lock starvation.
This patch implements a similar lock handoff mechanism to disable
lock stealing and force lock handoff to the first waiter in the queue
after at least a 5ms waiting period. The waiting period keeps handoffs
infrequent enough that lock stealing is not curtailed to the point of
hurting performance.

A rwsem microbenchmark was run for 5 seconds on an 8-socket 120-core
240-thread IvyBridge-EX system with a v5.1 based kernel and 240
write_lock threads with a 5us sleep critical section.

Before the patch, the min/mean/max numbers of locking operations for the
locking threads were 1/2,433/320,955. After the patch, the figures became
891/1,807/3,174. The rwsem clearly became much more fair, though the mean
number of locking operations done dropped by about 26%, the tradeoff for
the better fairness.

Signed-off-by: Waiman Long 
---
 kernel/locking/lock_events_list.h |   2 +
 kernel/locking/rwsem-xadd.c   | 154 ++
 kernel/locking/rwsem.h|  23 +++--
 3 files changed, 134 insertions(+), 45 deletions(-)

diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
index ad7668cfc9da..29e5c52197fa 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -61,7 +61,9 @@ LOCK_EVENT(rwsem_opt_fail)	/* # of failed opt-spinnings		*/
 LOCK_EVENT(rwsem_rlock)	/* # of read locks acquired		*/
 LOCK_EVENT(rwsem_rlock_fast)	/* # of fast read locks acquired	*/
 LOCK_EVENT(rwsem_rlock_fail)	/* # of failed read lock acquisitions	*/
+LOCK_EVENT(rwsem_rlock_handoff)	/* # of read lock handoffs		*/
 LOCK_EVENT(rwsem_rtrylock)	/* # of read trylock calls		*/
 LOCK_EVENT(rwsem_wlock)	/* # of write locks acquired		*/
 LOCK_EVENT(rwsem_wlock_fail)	/* # of failed write lock acquisitions	*/
+LOCK_EVENT(rwsem_wlock_handoff)	/* # of write lock handoffs		*/
 LOCK_EVENT(rwsem_wtrylock)	/* # of write trylock calls		*/
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index adab6477be51..58b3a64e6f2c 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -73,6 +73,7 @@ struct rwsem_waiter {
 	struct list_head list;
 	struct task_struct *task;
 	enum rwsem_waiter_type type;
+	unsigned long timeout;
 };
 
 enum rwsem_wake_type {
@@ -81,6 +82,12 @@ enum rwsem_wake_type {
 	RWSEM_WAKE_READ_OWNED	/* Waker thread holds the read lock */
 };
 
+/*
+ * The minimum waiting time (5ms) in the wait queue before initiating the
+ * handoff protocol.
+ */
+#define RWSEM_WAIT_TIMEOUT ((HZ - 1)/200 + 1)
+
 /*
  * handle the lock release when processes blocked on it that can now run
  * - if we come here from up_(), then the RWSEM_FLAG_WAITERS bit must
@@ -131,6 +138,15 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 		adjustment = RWSEM_READER_BIAS;
 		oldcount = atomic_long_fetch_add(adjustment, &sem->count);
 		if (unlikely(oldcount & RWSEM_WRITER_MASK)) {
+			/*
+			 * Initiate handoff to reader, if applicable.
+			 */
+			if (!(oldcount & RWSEM_FLAG_HANDOFF) &&
+			    time_after(jiffies, waiter->timeout)) {
+				adjustment -= RWSEM_FLAG_HANDOFF;
+				lockevent_inc(rwsem_rlock_handoff);
+			}
+
 			atomic_long_sub(adjustment, &sem->count);
 			return;
 		}
@@ -179,6 +195,12 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 		adjustment -= RWSEM_FLAG_WAITERS;
 	}
 
+	/*
+	 * Clear the handoff flag
+	 */
+	if (woken && RWSEM_COUNT_HANDOFF(atomic_long_read(&sem->count)))
+		adjustment -= RWSEM_FLAG_HANDOFF;
+
 	if (adjustment)
 		atomic_long_add(adjustment, &sem->count);
 }
@@ -188,15 +210,19 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
  * race conditions between checking the rwsem wait list and setting the
  * sem->count accordingly.
  */
-static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem)
+static inline bool
+rwsem_try_write_lock(long count, struct rw_semaphore *sem, bool first)
 {
 	long new;
 
 	if (RWSEM_COUNT_LOCKED(count))
 		return false;
 
-	new = count + RWSEM_WRITER_LOCKED -
-	      (list_is_singular(&sem->wait_list) ? RWSEM_FLAG_WAITERS : 0);
+	if (!first && RWSEM_COUNT_HANDOFF(count))
+		return false;
+
+	new = (count & ~RWSEM_FLAG_HANDOFF) +