Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count

2020-09-29 Thread Ard Biesheuvel
On Tue, 29 Sep 2020 at 19:50, Will Deacon  wrote:
>
> On Thu, Sep 24, 2020 at 07:55:19PM +0800, Hou Tao wrote:
> > The following is the newest performance data:
> >
> > aarch64 host (4 sockets, 24 cores per socket)
> >
> > * v4.19.111
> >
> > no writer, reader cn                                |        24 |        48 |        72 |        96
> > rate of percpu_down_read/percpu_up_read per second  |
> > default: use __this_cpu_inc|dec()                   | 166129572 | 166064100 | 165963448 | 165203565
> > patched: use this_cpu_inc|dec()                     |  87727515 |  87698669 |  87675397 |  87337435
> > modified: local_irq_save + __this_cpu_inc|dec()     |  15470357 |  15460642 |  15439423 |  15377199
> >
> > * v4.19.111+ [1]
> >
> > modified: use this_cpu_inc|dec() + LSE atomic       |   8224023 |   8079416 |   7883046 |   7820350
> >
> > * 5.9-rc6
> >
> > no writer, reader cn                                |        24 |        48 |        72 |        96
> > rate of percpu_down_read/percpu_up_read per second  |
> > reverted: use __this_cpu_inc|dec() + revert 91fc957c| 169664061 | 169481176 | 168493488 | 168844423
> > reverted: use __this_cpu_inc|dec()                  |  78355071 |  78294285 |  78026030 |  77860492
> > modified: use this_cpu_inc|dec() + no LSE atomic    |  64291101 |  64259867 |  64223206 |  63992316
> > default: use this_cpu_inc|dec() + LSE atomic        |  16231421 |  16215618 |  16188581 |  15959290
> >
> > It seems that enabling LSE atomics has a negative impact on performance
> > under this test scenario.
> >
> > And it is astonishing to me that for my test scenario the performance of
> > v5.9-rc6 is just one half of v4.19. The bisect finds the culprit is
> > 91fc957c9b1d6 ("arm64/bpf: don't allocate BPF JIT programs in module
> > memory"). If I revert the patch brute-forcibly under 5.9-rc6 [2], the
> > performance is the same as v4.19.111 (169664061 vs 166129572). I have
> > attached the simplified test module [3] and .config [4], so could you
> > please help check what the problem is?
>
> I have no idea how that patch can be responsible for this :/ Have you
> confirmed that the bisection is not bogus?
>
> Ard, do you have any ideas?
>

Unless the benchmark could be affected by the fact that BPF programs
are now loaded out of direct branching range of the core kernel, I
don't see how that patch could affect performance in this way.

What you could do is revert the patch partially - drop the new alloc
and free routines, but retain the new placement of the module and
kernel areas, which all got shifted around as a result. That would at
least narrow it down, although it looks very unlikely to me that this
change itself is the root cause of the regression.
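Concretely, the first half of that experiment might look like the sketch
below (a hypothetical diff against v5.9-rc6; take the exact function bodies
from the tree). Dropping the arm64 overrides makes the JIT fall back to the
__weak, module_alloc()-based bpf_jit_alloc_exec()/bpf_jit_free_exec() in
kernel/bpf/core.c, while the relocated module and kernel areas stay where
91fc957c put them:

--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
-void *bpf_jit_alloc_exec(unsigned long size)
-{
-	return __vmalloc_node_range(size, PAGE_SIZE, BPF_JIT_REGION_START,
-				    BPF_JIT_REGION_END, GFP_KERNEL, PAGE_KERNEL,
-				    0, NUMA_NO_NODE, __builtin_return_address(0));
-}
-
-void bpf_jit_free_exec(void *addr)
-{
-	return vfree(addr);
-}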


Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count

2020-09-29 Thread Will Deacon
On Thu, Sep 24, 2020 at 07:55:19PM +0800, Hou Tao wrote:
> The following is the newest performance data:
> 
> aarch64 host (4 sockets, 24 cores per socket)
> 
> * v4.19.111
> 
> no writer, reader cn                                |        24 |        48 |        72 |        96
> rate of percpu_down_read/percpu_up_read per second  |
> default: use __this_cpu_inc|dec()                   | 166129572 | 166064100 | 165963448 | 165203565
> patched: use this_cpu_inc|dec()                     |  87727515 |  87698669 |  87675397 |  87337435
> modified: local_irq_save + __this_cpu_inc|dec()     |  15470357 |  15460642 |  15439423 |  15377199
> 
> * v4.19.111+ [1]
> 
> modified: use this_cpu_inc|dec() + LSE atomic       |   8224023 |   8079416 |   7883046 |   7820350
> 
> * 5.9-rc6
> 
> no writer, reader cn                                |        24 |        48 |        72 |        96
> rate of percpu_down_read/percpu_up_read per second  |
> reverted: use __this_cpu_inc|dec() + revert 91fc957c| 169664061 | 169481176 | 168493488 | 168844423
> reverted: use __this_cpu_inc|dec()                  |  78355071 |  78294285 |  78026030 |  77860492
> modified: use this_cpu_inc|dec() + no LSE atomic    |  64291101 |  64259867 |  64223206 |  63992316
> default: use this_cpu_inc|dec() + LSE atomic        |  16231421 |  16215618 |  16188581 |  15959290
> 
> It seems that enabling LSE atomics has a negative impact on performance
> under this test scenario.
> 
> And it is astonishing to me that for my test scenario the performance of
> v5.9-rc6 is just one half of v4.19. The bisect finds the culprit is
> 91fc957c9b1d6 ("arm64/bpf: don't allocate BPF JIT programs in module
> memory"). If I revert the patch brute-forcibly under 5.9-rc6 [2], the
> performance is the same as v4.19.111 (169664061 vs 166129572). I have
> attached the simplified test module [3] and .config [4], so could you
> please help check what the problem is?

I have no idea how that patch can be responsible for this :/ Have you
confirmed that the bisection is not bogus?

Ard, do you have any ideas?

Will


Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count

2020-09-24 Thread Hou Tao
Hi Will & Ard,

+to Ard Biesheuvel for the "regression" caused by 91fc957c9b1d6
("arm64/bpf: don't allocate BPF JIT programs in module memory")

On 2020/9/17 16:48, Will Deacon wrote:
> On Wed, Sep 16, 2020 at 08:32:20PM +0800, Hou Tao wrote:
>>> Subject: locking/percpu-rwsem: Use this_cpu_{inc,dec}() for read_count
>>> From: Hou Tao 
>>> Date: Tue, 15 Sep 2020 22:07:50 +0800
>>>
>>> From: Hou Tao 
>>>
>>> The __this_cpu*() accessors are (in general) IRQ-unsafe which, given
>>> that percpu-rwsem is a blocking primitive, should be just fine.
>>>
>>> However, file_end_write() is used from IRQ context and will cause
>>> load-store issues.
>>>
>>> Fixing it by using the IRQ-safe this_cpu_*() for operations on
>>> read_count. This will generate more expensive code on a number of
>>> platforms, which might cause a performance regression for some of the
>>> other percpu-rwsem users.
>>>
>>> If any such is reported, we can consider alternative solutions.
>>>
>> I have simply tested the performance impact on both x86 and aarch64.
>>
>> There is no degradation under x86 (2 sockets, 18 cores per socket, 2
>> threads per core):
>>
>> v5.8.9
>> no writer, reader cn                               |        18 |        36 |        72
>> the rate of down_read/up_read per second           | 231423957 | 230737381 | 109943028
>> the rate of down_read/up_read per second (patched) | 232864799 | 233555210 | 109768011
>>
>> However the performance degradation is huge under aarch64 (4 sockets, 24
>> cores per socket): nearly 60% lost.
>>
>> v4.19.111
>> no writer, reader cn                               |        24 |        48 |        72 |        96
>> the rate of down_read/up_read per second           | 166129572 | 166064100 | 165963448 | 165203565
>> the rate of down_read/up_read per second (patched) |  63863506 |  63842132 |  63757267 |  63514920
>>
>> I will test the aarch64 host by using v5.8 tomorrow.
>
> Thanks. We did improve the preempt_count() munging a bit since 4.19 (I
> think), so maybe 5.8 will be a bit better. Please report back!

Sorry for the long delay.

> the rate of down_read/up_read per second (patched) |  63863506 |  63842132 |  63757267 |  63514920

The line above is actually the performance of v5.9-rc5 under the same
aarch64 host without any patch, but the performance after applying the
patch under v4.19.111 is still bad (see below, ~50% lost).

The following is the newest performance data:

aarch64 host (4 sockets, 24 cores per socket)

* v4.19.111

no writer, reader cn                                |        24 |        48 |        72 |        96
rate of percpu_down_read/percpu_up_read per second  |
default: use __this_cpu_inc|dec()                   | 166129572 | 166064100 | 165963448 | 165203565
patched: use this_cpu_inc|dec()                     |  87727515 |  87698669 |  87675397 |  87337435
modified: local_irq_save + __this_cpu_inc|dec()     |  15470357 |  15460642 |  15439423 |  15377199

* v4.19.111+ [1]

modified: use this_cpu_inc|dec() + LSE atomic       |   8224023 |   8079416 |   7883046 |   7820350

* 5.9-rc6

no writer, reader cn                                |        24 |        48 |        72 |        96
rate of percpu_down_read/percpu_up_read per second  |
reverted: use __this_cpu_inc|dec() + revert 91fc957c| 169664061 | 169481176 | 168493488 | 168844423
reverted: use __this_cpu_inc|dec()                  |  78355071 |  78294285 |  78026030 |  77860492
modified: use this_cpu_inc|dec() + no LSE atomic    |  64291101 |  64259867 |  64223206 |  63992316
default: use this_cpu_inc|dec() + LSE atomic        |  16231421 |  16215618 |  16188581 |  15959290

It seems that enabling LSE atomics has a negative impact on performance
under this test scenario.

And it is astonishing to me that for my test scenario the performance of
v5.9-rc6 is just one half of v4.19. The bisect finds the culprit is
91fc957c9b1d6 ("arm64/bpf: don't allocate BPF JIT programs in module
memory"). If I revert the patch brute-forcibly under 5.9-rc6 [2], the
performance is the same as v4.19.111 (169664061 vs 166129572). I have
attached the simplified test module [3] and .config [4], so could you
please help check what the problem is?

Regards,
Tao

[1]: apply 959bf2fd03b5 ("arm64: percpu: Rewrite per-cpu ops to allow use of
     LSE atomics") and its fix
[2]: redefine MODULES_VADDR as KASAN_SHADOW_END, and remove
     bpf_jit_alloc_exec() & bpf_jit_free_exec() in
     arch/arm64/net/bpf_jit_comp.c
[3]: simple.c
[3]: simple.c

/* header names below are reconstructed; the archive stripped the <...> */
#include <linux/module.h>
#include <linux/init.h>
#include <linux/percpu-rwsem.h>
#include <linux/jiffies.h>
#include <linux/printk.h>

static unsigned int duration = 1;
module_param(duration, uint, 0);

static struct percpu_rw_semaphore rwsem;

int __init simple_init(void)
{
	unsigned long end;
	unsigned long cnt;

	if (duration <= 0)
		return -EINVAL;

	percpu_init_rwsem(&rwsem);	/* "&rwsem" arguments restored; eaten by the archive */

	cnt = 0;
	end = jiffies + HZ * duration;
	while (time_before_eq(jiffies, end)) {
		percpu_down_read(&rwsem);
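The archive truncates simple.c at this point. A minimal completion
consistent with the benchmark being described (a reconstruction, not
recovered text) would be:

		percpu_up_read(&rwsem);
		cnt++;
	}

	pr_info("rate of percpu_down_read/percpu_up_read per second: %lu\n",
		cnt / duration);
	return 0;
}
module_init(simple_init);

MODULE_LICENSE("GPL");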

Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count

2020-09-20 Thread Dave Chinner
On Fri, Sep 18, 2020 at 03:26:35PM +0200, Jan Kara wrote:
> On Fri 18-09-20 15:09:14, Oleg Nesterov wrote:
> > On 09/18, Peter Zijlstra wrote:
> > > > But again, do we really want this?
> > >
> > > I like the two counters better, avoids atomics entirely, some archs
> > > have horridly expensive atomics (*cough* power *cough*).
> > 
> > I meant... do we really want to introduce percpu_up_read_irqsafe() ?
> > 
> > Perhaps we can live with the fix from Hou? At least until we find a
> > "real" performance regression.
> 
> I can say that for users of percpu rwsem in filesystems the cost of atomic
> inc/dec is unlikely to matter. The lock hold times there are long enough
> that it would be just lost in the noise.

I'm not sure that is correct. We do an inc/dec pair per AIO, so if
we are running millions of IOPS through the AIO subsystem, then the
cost of doing millions of extra atomic ops every second is going to
be noticeable...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count

2020-09-18 Thread Jan Kara
On Fri 18-09-20 15:09:14, Oleg Nesterov wrote:
> On 09/18, Peter Zijlstra wrote:
> > > But again, do we really want this?
> >
> > I like the two counters better, avoids atomics entirely, some archs
> > have horridly expensive atomics (*cough* power *cough*).
> 
> I meant... do we really want to introduce percpu_up_read_irqsafe() ?
> 
> Perhaps we can live with the fix from Hou? At least until we find a
> "real" performance regression.

I can say that for users of percpu rwsem in filesystems the cost of atomic
inc/dec is unlikely to matter. The lock hold times there are long enough
that it would be just lost in the noise.

For other stuff using them like get_online_cpus() or get_online_mems() I'm
not so sure...

Honza

-- 
Jan Kara 
SUSE Labs, CR


Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count

2020-09-18 Thread Oleg Nesterov
On 09/18, Peter Zijlstra wrote:
>
> On Fri, Sep 18, 2020 at 12:48:24PM +0200, Oleg Nesterov wrote:
>
> > Of course, this assumes that atomic_t->counter underflows "correctly", just
> > like "unsigned int".
>
> We've documented that we do. Lots of code relies on that.
>
> See Documentation/atomic_t.txt TYPES

Aha, thanks!

> > But again, do we really want this?
>
> I like the two counters better, avoids atomics entirely, some archs
> have horridly expensive atomics (*cough* power *cough*).

I meant... do we really want to introduce percpu_up_read_irqsafe() ?

Perhaps we can live with the fix from Hou? At least until we find a
"real" performance regression.

Oleg.



Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count

2020-09-18 Thread peterz
On Fri, Sep 18, 2020 at 12:48:24PM +0200, Oleg Nesterov wrote:

> Of course, this assumes that atomic_t->counter underflows "correctly", just
> like "unsigned int".

We've documented that we do. Lots of code relies on that.

See Documentation/atomic_t.txt TYPES

> But again, do we really want this?

I like the two counters better, avoids atomics entirely, some archs
have horridly expensive atomics (*cough* power *cough*).

I just tried to be clever and use a single u64 load (where possible)
instead of two 32bit loads and got the sum vs split order wrong.


Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count

2020-09-18 Thread Oleg Nesterov
On 09/18, Peter Zijlstra wrote:
>
> On Fri, Sep 18, 2020 at 12:01:12PM +0200, pet...@infradead.org wrote:
> > +   u64 sum = per_cpu_sum(*(u64 *)sem->read_count);
>
> Moo, that doesn't work, we have to do two separate sums.

Or we can re-introduce "atomic_t slow_read_ctr".

percpu_up_read_irqsafe(sem)
{
	preempt_disable();
	atomic_dec_release(&sem->slow_read_ctr);
	if (!rcu_sync_is_idle(&sem->rss))
		rcuwait_wake_up(&sem->writer);
	preempt_enable();
}

readers_active_check(sem)
{
	unsigned int sum = per_cpu_sum(*sem->read_count) +
			   (unsigned int)atomic_read(&sem->slow_read_ctr);
	if (sum)
		return false;
	...
}

Of course, this assumes that atomic_t->counter underflows "correctly", just
like "unsigned int".

But again, do we really want this?

Oleg.



Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count

2020-09-18 Thread peterz
On Fri, Sep 18, 2020 at 12:01:12PM +0200, pet...@infradead.org wrote:
> + u64 sum = per_cpu_sum(*(u64 *)sem->read_count);

Moo, that doesn't work, we have to do two separate sums. I shouldn't try
to be clever on a Friday I suppose :-(


Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count

2020-09-18 Thread peterz
On Fri, Sep 18, 2020 at 12:04:32PM +0200, pet...@infradead.org wrote:
> On Fri, Sep 18, 2020 at 12:01:12PM +0200, pet...@infradead.org wrote:
> > @@ -198,7 +198,9 @@ EXPORT_SYMBOL_GPL(__percpu_down_read);
> >   */
> >  static bool readers_active_check(struct percpu_rw_semaphore *sem)
> >  {
> > -   if (per_cpu_sum(*sem->read_count) != 0)
> > +   u64 sum = per_cpu_sum(*(u64 *)sem->read_count);
> > +
> > +   if (sum + (sum >> 32))
> 
> That obviously wants to be:
> 
>   if ((u32)(sum + (sum >> 32)))
> 
> > return false;
> >  
> > /*

I suppose an alternative way of writing that would be something like:

	union {
		u64 sum;
		struct {
			u32 a, b;
		};
	} var;

	var.sum = per_cpu_sum(*(u64 *)sem->read_count);

	if (var.a + var.b)
		return false;

which is more verbose, but perhaps easier to read.
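A userspace sketch (not kernel code) of why the truncated sum of the two
halves is the right emptiness check even after one half has wrapped: a
reader that incremented one counter and decremented the other contributes
exactly zero modulo 2^32.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* One reader: inc of read_count[0] in task context, dec of
	 * read_count[1] from IRQ context (wraps to 0xffffffff). */
	uint32_t a = 1;
	uint32_t b = (uint32_t)-1;

	/* Pack like the single u64 load of both halves... */
	uint64_t sum = ((uint64_t)b << 32) | a;

	/* ...then fold high into low and truncate: prints 0. */
	printf("%u\n", (uint32_t)(sum + (sum >> 32)));
	return 0;
}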


Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count

2020-09-18 Thread peterz
On Fri, Sep 18, 2020 at 12:01:12PM +0200, pet...@infradead.org wrote:
> @@ -198,7 +198,9 @@ EXPORT_SYMBOL_GPL(__percpu_down_read);
>   */
>  static bool readers_active_check(struct percpu_rw_semaphore *sem)
>  {
> - if (per_cpu_sum(*sem->read_count) != 0)
> + u64 sum = per_cpu_sum(*(u64 *)sem->read_count);
> +
> + if (sum + (sum >> 32))

That obviously wants to be:

if ((u32)(sum + (sum >> 32)))

>   return false;
>  
>   /*
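For reference, per_cpu_sum() is approximately the following macro
(paraphrased from kernel/locking/percpu-rwsem.c around v5.9; check the
exact tree), which is what makes the single-u64-load trick above possible:

#define per_cpu_sum(var)						\
({									\
	typeof(var) __sum = 0;						\
	int cpu;							\
	compiletime_assert_atomic_type(__sum);				\
	for_each_possible_cpu(cpu)					\
		__sum += per_cpu(var, cpu);				\
	__sum;								\
})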


Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count

2020-09-18 Thread peterz
On Fri, Sep 18, 2020 at 11:07:02AM +0200, Jan Kara wrote:
> If people really wanted to avoid irq-safe inc/dec for archs where it is
> more expensive, one idea I had was that we could add 'read_count_in_irq' to
> percpu_rw_semaphore. So callers in normal context would use read_count and
> callers in irq context would use read_count_in_irq. And the writer side
> would sum over both but we don't care about performance of that one much.

That's not a bad idea... something like so I suppose.

(completely untested)

---
diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index 5fda40f97fe9..9c847490a86a 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -11,7 +11,7 @@
 
 struct percpu_rw_semaphore {
 	struct rcu_sync		rss;
-	unsigned int __percpu	*read_count;
+	u32 __percpu		*read_count;
 	struct rcuwait		writer;
 	wait_queue_head_t	waiters;
 	atomic_t		block;
@@ -60,7 +60,7 @@ static inline void percpu_down_read(struct percpu_rw_semaphore *sem)
 	 * anything we did within this RCU-sched read-size critical section.
 	 */
 	if (likely(rcu_sync_is_idle(&sem->rss)))
-		this_cpu_inc(*sem->read_count);
+		__this_cpu_inc(sem->read_count[0]);
 	else
 		__percpu_down_read(sem, false); /* Unconditional memory barrier */
 	/*
@@ -74,12 +74,16 @@ static inline bool percpu_down_read_trylock(struct percpu_rw_semaphore *sem)
 {
 	bool ret = true;
 
+#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
+	WARN_ON_ONCE(!in_task());
+#endif
+
 	preempt_disable();
 	/*
 	 * Same as in percpu_down_read().
 	 */
 	if (likely(rcu_sync_is_idle(&sem->rss)))
-		this_cpu_inc(*sem->read_count);
+		__this_cpu_inc(sem->read_count[0]);
 	else
 		ret = __percpu_down_read(sem, true); /* Unconditional memory barrier */
 	preempt_enable();
@@ -98,12 +102,16 @@ static inline void percpu_up_read(struct percpu_rw_semaphore *sem)
 {
 	rwsem_release(&sem->dep_map, _RET_IP_);
 
+#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
+	WARN_ON_ONCE(!in_task());
+#endif
+
 	preempt_disable();
 	/*
 	 * Same as in percpu_down_read().
 	 */
 	if (likely(rcu_sync_is_idle(&sem->rss))) {
-		this_cpu_dec(*sem->read_count);
+		__this_cpu_dec(sem->read_count[0]);
 	} else {
 		/*
 		 * slowpath; reader will only ever wake a single blocked
@@ -115,12 +123,39 @@ static inline void percpu_up_read(struct percpu_rw_semaphore *sem)
 		 * aggregate zero, as that is the only time it matters) they
 		 * will also see our critical section.
 		 */
-		this_cpu_dec(*sem->read_count);
+		__this_cpu_dec(sem->read_count[0]);
 		rcuwait_wake_up(&sem->writer);
 	}
 	preempt_enable();
 }
 
+static inline void __percpu_up_read_irqsafe(struct percpu_rw_semaphore *sem)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	/*
+	 * Same as in percpu_down_read().
+	 */
+	if (likely(rcu_sync_is_idle(&sem->rss))) {
+		__this_cpu_dec(sem->read_count[1]);
+	} else {
+		/*
+		 * slowpath; reader will only ever wake a single blocked
+		 * writer.
+		 */
+		smp_mb(); /* B matches C */
+		/*
+		 * In other words, if they see our decrement (presumably to
+		 * aggregate zero, as that is the only time it matters) they
+		 * will also see our critical section.
+		 */
+		__this_cpu_dec(sem->read_count[1]);
+		rcuwait_wake_up(&sem->writer);
+	}
+	local_irq_restore(flags);
+}
+
 extern void percpu_down_write(struct percpu_rw_semaphore *);
 extern void percpu_up_write(struct percpu_rw_semaphore *);
 
diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index 70a32a576f3f..00741216a7f6 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -12,7 +12,7 @@
 int __percpu_init_rwsem(struct percpu_rw_semaphore *sem,
 			const char *name, struct lock_class_key *key)
 {
-	sem->read_count = alloc_percpu(int);
+	sem->read_count = (u32 *)alloc_percpu(u64);
 	if (unlikely(!sem->read_count))
 		return -ENOMEM;
 
@@ -45,7 +45,7 @@ EXPORT_SYMBOL_GPL(percpu_free_rwsem);
 
 static bool __percpu_down_read_trylock(struct percpu_rw_semaphore *sem)
 {
-	this_cpu_inc(*sem->read_count);
+	__this_cpu_inc(sem->read_count[0]);
 
 	/*
 	 * Due to having preemption disabled the decrement happens on
@@ -71,7 +71,7 @@ static bool __percpu_down_read_trylock(struct percpu_rw_semaphore *sem)
 	if (likely(!atomic_read_acquire(&sem->block)))
 		return true;
 
-

Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count

2020-09-18 Thread Jan Kara
On Thu 17-09-20 14:01:33, Oleg Nesterov wrote:
> On 09/17, Boaz Harrosh wrote:
> >
> > On 16/09/2020 15:32, Hou Tao wrote:
> > <>
> > > However the performance degradation is huge under aarch64 (4 sockets,
> > > 24 cores per socket): nearly 60% lost.
> > >
> > > v4.19.111
> > > no writer, reader cn                               |        24 |        48 |        72 |        96
> > > the rate of down_read/up_read per second           | 166129572 | 166064100 | 165963448 | 165203565
> > > the rate of down_read/up_read per second (patched) |  63863506 |  63842132 |  63757267 |  63514920
> > >
> >
> > I believe perhaps Peter Z's suggestion of an additional
> > percpu_down_read_irqsafe() API would let only the in-IRQ users pay the
> > penalty.
> >
> > Peter Z wrote:
> > >My leading alternative was adding: percpu_down_read_irqsafe() /
> > >percpu_up_read_irqsafe(), which use local_irq_save() instead of
> > >preempt_disable().
> 
> This means that __sb_start/end_write() and probably more users in fs/super.c
> will have to use this API, not good.
> 
> IIUC, file_end_write() was never IRQ safe (at least if !CONFIG_SMP), even
> before 8129ed2964 ("change sb_writers to use percpu_rw_semaphore"), but this
> doesn't matter...
> 
> Perhaps we can change aio.c, io_uring.c and fs/overlayfs/file.c to avoid
> file_end_write() in IRQ context, but I am not sure it's worth the trouble.

Well, that would be IMO rather difficult. We need to do file_end_write()
after the IO has completed so if we don't want to do it in IRQ context,
we'd have to queue a work to a workqueue or something like that. And that's
going to be expensive compared to pure per-cpu inc/dec...

If people really wanted to avoid irq-safe inc/dec for archs where it is
more expensive, one idea I had was that we could add 'read_count_in_irq' to
percpu_rw_semaphore. So callers in normal context would use read_count and
callers in irq context would use read_count_in_irq. And the writer side
would sum over both but we don't care about performance of that one much.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count

2020-09-17 Thread Christoph Hellwig
On Thu, Sep 17, 2020 at 04:46:38PM +0300, Boaz Harrosh wrote:
> From my totally subjective experience on the filesystem side (user of
> bio_endio) all HW block drivers I used including NVMe, isci, sata, etc.
> end up calling bio_endio in softirq. The big exception to that is the vdX
> drivers under KVM. Which is very ironic to me.

NVMe normally calls it from hardirq or IPI context.  The only time it
would use softirq context is if you have a single I/O queue, which is
very unusual.


Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count

2020-09-17 Thread Boaz Harrosh

On 17/09/2020 15:48, Matthew Wilcox wrote:
> On Thu, Sep 17, 2020 at 02:01:33PM +0200, Oleg Nesterov wrote:
> <>
>
> If we change bio_endio to invoke the ->bi_end_io callbacks in softirq
> context instead of hardirq context, we can change the pagecache to take
> BH-safe locks instead of IRQ-safe locks.  I believe the only reason the
> lock needs to be IRQ-safe is for the benefit of paths like:



From my totally subjective experience on the filesystem side (user of
bio_endio) all HW block drivers I used including NVMe, isci, sata, etc.
end up calling bio_endio in softirq. The big exception to that is the
vdX drivers under KVM. Which is very ironic to me.

I wish we could make all drivers be uniform in this regard.

But maybe I'm just speaking crap. It's only from my limited debugging
experience.


Thanks
Boaz


Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count

2020-09-17 Thread Oleg Nesterov
On 09/17, Matthew Wilcox wrote:
>
> On Thu, Sep 17, 2020 at 02:01:33PM +0200, Oleg Nesterov wrote:
> > IIUC, file_end_write() was never IRQ safe (at least if !CONFIG_SMP), even
> > before 8129ed2964 ("change sb_writers to use percpu_rw_semaphore"), but this
> > doesn't matter...
> >
> > Perhaps we can change aio.c, io_uring.c and fs/overlayfs/file.c to avoid
> > file_end_write() in IRQ context, but I am not sure it's worth the trouble.
>
> If we change bio_endio to invoke the ->bi_end_io callbacks in softirq

Not sure I understand...

How can this help? irq_exit() can invoke_softirq() -> __do_softirq() ?

Oleg.



Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count

2020-09-17 Thread peterz
On Thu, Sep 17, 2020 at 01:48:38PM +0100, Matthew Wilcox wrote:
> On Thu, Sep 17, 2020 at 02:01:33PM +0200, Oleg Nesterov wrote:
> > IIUC, file_end_write() was never IRQ safe (at least if !CONFIG_SMP), even
> > before 8129ed2964 ("change sb_writers to use percpu_rw_semaphore"), but this
> > doesn't matter...
> > 
> > Perhaps we can change aio.c, io_uring.c and fs/overlayfs/file.c to avoid
> > file_end_write() in IRQ context, but I am not sure it's worth the trouble.
> 
> If we change bio_endio to invoke the ->bi_end_io callbacks in softirq
> context instead of hardirq context, we can change the pagecache to take

SoftIRQ context has exactly the same problem vs __this_cpu*().


Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count

2020-09-17 Thread Matthew Wilcox
On Thu, Sep 17, 2020 at 02:01:33PM +0200, Oleg Nesterov wrote:
> IIUC, file_end_write() was never IRQ safe (at least if !CONFIG_SMP), even
> before 8129ed2964 ("change sb_writers to use percpu_rw_semaphore"), but this
> doesn't matter...
> 
> Perhaps we can change aio.c, io_uring.c and fs/overlayfs/file.c to avoid
> file_end_write() in IRQ context, but I am not sure it's worth the trouble.

If we change bio_endio to invoke the ->bi_end_io callbacks in softirq
context instead of hardirq context, we can change the pagecache to take
BH-safe locks instead of IRQ-safe locks.  I believe the only reason the
lock needs to be IRQ-safe is for the benefit of paths like:

mpage_end_io
page_endio
end_page_writeback
test_clear_page_writeback

Admittedly, I haven't audited all the places that call end_page_writeback;
there might be others called from non-BIO contexts (network filesystems?).
That was the point where I gave up my investigation of why we use an
IRQ-safe spinlock when basically all page cache operations are done
from user context.


Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count

2020-09-17 Thread Oleg Nesterov
On 09/17, Boaz Harrosh wrote:
>
> On 16/09/2020 15:32, Hou Tao wrote:
> <>
> > However the performance degradation is huge under aarch64 (4 sockets,
> > 24 cores per socket): nearly 60% lost.
> >
> > v4.19.111
> > no writer, reader cn                               |        24 |        48 |        72 |        96
> > the rate of down_read/up_read per second           | 166129572 | 166064100 | 165963448 | 165203565
> > the rate of down_read/up_read per second (patched) |  63863506 |  63842132 |  63757267 |  63514920
> >
>
> I believe perhaps Peter Z's suggestion of an additional
> percpu_down_read_irqsafe() API would let only the in-IRQ users pay the
> penalty.
>
> Peter Z wrote:
> >My leading alternative was adding: percpu_down_read_irqsafe() /
> >percpu_up_read_irqsafe(), which use local_irq_save() instead of
> >preempt_disable().

This means that __sb_start/end_write() and probably more users in fs/super.c
will have to use this API, not good.

IIUC, file_end_write() was never IRQ safe (at least if !CONFIG_SMP), even
before 8129ed2964 ("change sb_writers to use percpu_rw_semaphore"), but this
doesn't matter...

Perhaps we can change aio.c, io_uring.c and fs/overlayfs/file.c to avoid
file_end_write() in IRQ context, but I am not sure it's worth the trouble.

Oleg.
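For readers following along, the path by which percpu_up_read() ends up in
IRQ context is roughly the following (frame names assumed from mainline
around v5.9; the exact frames vary by driver):

	hardirq/softirq completion
	  bio_endio()
	    ->bi_end_io                /* aio's completion callback */
	      aio_complete_rw()
	        file_end_write(file)
	          __sb_end_write(sb, SB_FREEZE_WRITE)
	            percpu_up_read(sb->s_writers.rw_sem + SB_FREEZE_WRITE - 1)
	              __this_cpu_dec(*sem->read_count)  /* the IRQ-unsafe RMW */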



Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count

2020-09-17 Thread Boaz Harrosh

On 16/09/2020 15:32, Hou Tao wrote:
<>

However the performance degradation is huge under aarch64 (4 sockets, 24
cores per socket): nearly 60% lost.

v4.19.111
no writer, reader cn                               |        24 |        48 |        72 |        96
the rate of down_read/up_read per second           | 166129572 | 166064100 | 165963448 | 165203565
the rate of down_read/up_read per second (patched) |  63863506 |  63842132 |  63757267 |  63514920



I believe perhaps Peter Z's suggestion of an additional
percpu_down_read_irqsafe() API would let only the in-IRQ users pay the
penalty.


Peter Z wrote:

My leading alternative was adding: percpu_down_read_irqsafe() /
percpu_up_read_irqsafe(), which use local_irq_save() instead of
preempt_disable().


Thanks
Boaz


Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count

2020-09-17 Thread Will Deacon
On Wed, Sep 16, 2020 at 08:32:20PM +0800, Hou Tao wrote:
> > Subject: locking/percpu-rwsem: Use this_cpu_{inc,dec}() for read_count
> > From: Hou Tao 
> > Date: Tue, 15 Sep 2020 22:07:50 +0800
> > 
> > From: Hou Tao 
> > 
> > The __this_cpu*() accessors are (in general) IRQ-unsafe which, given
> > that percpu-rwsem is a blocking primitive, should be just fine.
> > 
> > However, file_end_write() is used from IRQ context and will cause
> > load-store issues.
> > 
> > Fixing it by using the IRQ-safe this_cpu_*() for operations on
> > read_count. This will generate more expensive code on a number of
> > platforms, which might cause a performance regression for some of the
> > other percpu-rwsem users.
> > 
> > If any such is reported, we can consider alternative solutions.
> > 
> I have simply tested the performance impact on both x86 and aarch64.
> 
> There is no degradation under x86 (2 sockets, 18 cores per socket, 2
> threads per core):
> 
> v5.8.9
> no writer, reader cn                               |        18 |        36 |        72
> the rate of down_read/up_read per second           | 231423957 | 230737381 | 109943028
> the rate of down_read/up_read per second (patched) | 232864799 | 233555210 | 109768011
> 
> However the performance degradation is huge under aarch64 (4 sockets, 24
> cores per socket): nearly 60% lost.
> 
> v4.19.111
> no writer, reader cn                               |        24 |        48 |        72 |        96
> the rate of down_read/up_read per second           | 166129572 | 166064100 | 165963448 | 165203565
> the rate of down_read/up_read per second (patched) |  63863506 |  63842132 |  63757267 |  63514920
> 
> I will test the aarch64 host by using v5.8 tomorrow.

Thanks. We did improve the preempt_count() munging a bit since 4.19 (I
think), so maybe 5.8 will be a bit better. Please report back!

Will


Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count

2020-09-16 Thread Hou Tao
Hi,

On 2020/9/16 0:03, pet...@infradead.org wrote:
> On Tue, Sep 15, 2020 at 05:51:50PM +0200, pet...@infradead.org wrote:
> 
>> Anyway, I'll rewrite the Changelog and stuff it in locking/urgent.
> 
> How's this?
> 
Thanks for that.

> ---
> Subject: locking/percpu-rwsem: Use this_cpu_{inc,dec}() for read_count
> From: Hou Tao 
> Date: Tue, 15 Sep 2020 22:07:50 +0800
> 
> From: Hou Tao 
> 
> The __this_cpu*() accessors are (in general) IRQ-unsafe which, given
> that percpu-rwsem is a blocking primitive, should be just fine.
> 
> However, file_end_write() is used from IRQ context and will cause
> load-store issues.
> 
> Fixing it by using the IRQ-safe this_cpu_*() for operations on
> read_count. This will generate more expensive code on a number of
> platforms, which might cause a performance regression for some of the
> other percpu-rwsem users.
> 
> If any such is reported, we can consider alternative solutions.
> 
I have simply tested the performance impact on both x86 and aarch64.

There is no degradation under x86 (2 sockets, 18 cores per socket, 2
threads per core):

v5.8.9
no writer, reader cn                               |        18 |        36 |        72
the rate of down_read/up_read per second           | 231423957 | 230737381 | 109943028
the rate of down_read/up_read per second (patched) | 232864799 | 233555210 | 109768011

However the performance degradation is huge under aarch64 (4 sockets, 24
cores per socket): nearly 60% lost.

v4.19.111
no writer, reader cn                               |        24 |        48 |        72 |        96
the rate of down_read/up_read per second           | 166129572 | 166064100 | 165963448 | 165203565
the rate of down_read/up_read per second (patched) |  63863506 |  63842132 |  63757267 |  63514920

I will test the aarch64 host by using v5.8 tomorrow.

Regards,
Tao


> Fixes: 70fe2f48152e ("aio: fix freeze protection of aio writes")
> Signed-off-by: Hou Tao 
> Signed-off-by: Peter Zijlstra (Intel) 
> Link: https://lkml.kernel.org/r/20200915140750.137881-1-hout...@huawei.com
> ---
>  include/linux/percpu-rwsem.h  |    8 ++++----
>  kernel/locking/percpu-rwsem.c |    4 ++--
>  2 files changed, 6 insertions(+), 6 deletions(-)
> 
> --- a/include/linux/percpu-rwsem.h
> +++ b/include/linux/percpu-rwsem.h
> @@ -60,7 +60,7 @@ static inline void percpu_down_read(stru
>  	 * anything we did within this RCU-sched read-size critical section.
>  	 */
>  	if (likely(rcu_sync_is_idle(&sem->rss)))
> -		__this_cpu_inc(*sem->read_count);
> +		this_cpu_inc(*sem->read_count);
>  	else
>  		__percpu_down_read(sem, false); /* Unconditional memory barrier */
>  	/*
> @@ -79,7 +79,7 @@ static inline bool percpu_down_read_tryl
>  	 * Same as in percpu_down_read().
>  	 */
>  	if (likely(rcu_sync_is_idle(&sem->rss)))
> -		__this_cpu_inc(*sem->read_count);
> +		this_cpu_inc(*sem->read_count);
>  	else
>  		ret = __percpu_down_read(sem, true); /* Unconditional memory barrier */
>  	preempt_enable();
> @@ -103,7 +103,7 @@ static inline void percpu_up_read(struct
>  	 * Same as in percpu_down_read().
>  	 */
>  	if (likely(rcu_sync_is_idle(&sem->rss))) {
> -		__this_cpu_dec(*sem->read_count);
> +		this_cpu_dec(*sem->read_count);
>  	} else {
>  		/*
>  		 * slowpath; reader will only ever wake a single blocked
> @@ -115,7 +115,7 @@ static inline void percpu_up_read(struct
>  		 * aggregate zero, as that is the only time it matters) they
>  		 * will also see our critical section.
>  		 */
> -		__this_cpu_dec(*sem->read_count);
> +		this_cpu_dec(*sem->read_count);
>  		rcuwait_wake_up(&sem->writer);
>  	}
>  	preempt_enable();
> --- a/kernel/locking/percpu-rwsem.c
> +++ b/kernel/locking/percpu-rwsem.c
> @@ -45,7 +45,7 @@ EXPORT_SYMBOL_GPL(percpu_free_rwsem);
>  
>  static bool __percpu_down_read_trylock(struct percpu_rw_semaphore *sem)
>  {
> -	__this_cpu_inc(*sem->read_count);
> +	this_cpu_inc(*sem->read_count);
>  
>  	/*
>  	 * Due to having preemption disabled the decrement happens on
> @@ -71,7 +71,7 @@ static bool __percpu_down_read_trylock(s
>  	if (likely(!atomic_read_acquire(&sem->block)))
>  		return true;
>  
> -	__this_cpu_dec(*sem->read_count);
> +	this_cpu_dec(*sem->read_count);
>  
>  	/* Prod writer to re-evaluate readers_active_check() */
>  	rcuwait_wake_up(&sem->writer);
> .
> 


Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count

2020-09-16 Thread peterz
On Wed, Sep 16, 2020 at 08:32:20PM +0800, Hou Tao wrote:

> I have simply tested the performance impact on both x86 and aarch64.
> 
> There is no degradation under x86 (2 sockets, 18 cores per socket, 2
> threads per core)

Yeah, x86 is magical here, it's the same single instruction for both ;-)
But it is, afaik, unique in this position, no other arch can pull that
off.
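Illustratively (assumed typical x86-64 code generation, not quoted from the
thread), both accessors compile to one %gs-relative read-modify-write:

	__this_cpu_inc(*sem->read_count);	/* incl %gs:(%rax) */
	this_cpu_inc(*sem->read_count);		/* incl %gs:(%rax), identical */

	/*
	 * A single instruction cannot be split by an interrupt, so the
	 * RMW is IRQ-safe on x86 even without a lock prefix.
	 */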

> However the performance degradation is huge under aarch64 (4 sockets, 24
> cores per socket): nearly 60% lost.
> 
> v4.19.111
> no writer, reader cn                               |        24 |        48 |        72 |        96
> the rate of down_read/up_read per second           | 166129572 | 166064100 | 165963448 | 165203565
> the rate of down_read/up_read per second (patched) |  63863506 |  63842132 |  63757267 |  63514920

Teh hurt :/


Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count

2020-09-16 Thread Will Deacon
On Tue, Sep 15, 2020 at 08:11:12PM +0200, pet...@infradead.org wrote:
> On Tue, Sep 15, 2020 at 05:11:23PM +0100, Will Deacon wrote:
> > On Tue, Sep 15, 2020 at 06:03:44PM +0200, pet...@infradead.org wrote:
> > > On Tue, Sep 15, 2020 at 05:51:50PM +0200, pet...@infradead.org wrote:
> > > 
> > > > Anyway, I'll rewrite the Changelog and stuff it in locking/urgent.
> > > 
> > > How's this?
> > > 
> > > ---
> > > Subject: locking/percpu-rwsem: Use this_cpu_{inc,dec}() for read_count
> > > From: Hou Tao 
> > > Date: Tue, 15 Sep 2020 22:07:50 +0800
> > > 
> > > From: Hou Tao 
> > > 
> > > The __this_cpu*() accessors are (in general) IRQ-unsafe which, given
> > > that percpu-rwsem is a blocking primitive, should be just fine.
> > > 
> > > However, file_end_write() is used from IRQ context and will cause
> > > load-store issues.
> > 
> > ... on architectures where the per-cpu accessors are not atomic.
> 
> That's not entirely accurate, on x86 for example the per-cpu ops are not
> atomic, but they are not susceptible to this problem due to them being a
> single instruction from the point of interrupts -- either they wholly
> happen or they don't.

Hey, the implication is still correct though ;)

> So I'd reformulate it like: "... on architectures where the per-cpu
> accessors are not natively irq-safe" ?

But yeah, that's better. Thanks.

Will


Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count

2020-09-15 Thread Oleg Nesterov
On 09/15, Peter Zijlstra wrote:
>
> On Tue, Sep 15, 2020 at 10:07:50PM +0800, Hou Tao wrote:
> > Under aarch64, __this_cpu_inc() is neither IRQ-safe nor atomic, so
> > when percpu_up_read() is invoked under IRQ context (e.g. aio completion),
> > and it interrupts the process on the same CPU which is invoking
> > percpu_down_read(), the decrement of read_count may be lost and
> > the final value of read_count on the CPU will be unexpected
> > as shown below:
>
> > Fixing it by using the IRQ-safe helper this_cpu_inc|dec() for
> > operations on read_count.
> >
> > Another plausible fix is to state that percpu-rwsem can NOT be
> > used under IRQ context and convert all users which may
> > use it under IRQ context.
>
> *groan*...
>
> So yeah, fs/super totally abuses percpu_rwsem, and yes, using it from
> IRQ context is totally out of spec. That said, we've (grudgingly)
> accomodated them before.

Yes, I didn't expect percpu_up_ can be called from IRQ :/

> This seems to be a fairly long standing issue, and certainly not unique
> to ARM64 either (Power, and anyone else using asm-generic/percpu.h,
> should be similarly affected I think). The issue seems to stem from
> Oleg's original rewrite:
>
>   a1fd3e24d8a4 ("percpu_rw_semaphore: reimplement to not block the readers unnecessarily")

Not really... I think it was 70fe2f48152e ("aio: fix freeze protection of aio writes").
And iiuc io_uring does the same.

> and is certainly an understandable mistake.
>
> I'm torn on what to do, using this_cpu over __this_cpu is going to
> adversely affect code-gen (and possibly performance) for all the
> percpu-rwsem users that are not quite so 'creative'.

Yes, but what else can we do?

Oleg.



Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count

2020-09-15 Thread Will Deacon
On Tue, Sep 15, 2020 at 06:03:44PM +0200, pet...@infradead.org wrote:
> On Tue, Sep 15, 2020 at 05:51:50PM +0200, pet...@infradead.org wrote:
> 
> > Anyway, I'll rewrite the Changelog and stuff it in locking/urgent.
> 
> How's this?
> 
> ---
> Subject: locking/percpu-rwsem: Use this_cpu_{inc,dec}() for read_count
> From: Hou Tao 
> Date: Tue, 15 Sep 2020 22:07:50 +0800
> 
> From: Hou Tao 
> 
> The __this_cpu*() accessors are (in general) IRQ-unsafe which, given
> that percpu-rwsem is a blocking primitive, should be just fine.
> 
> However, file_end_write() is used from IRQ context and will cause
> load-store issues.

... on architectures where the per-cpu accessors are not atomic.

> 
> Fixing it by using the IRQ-safe this_cpu_*() for operations on

Fixing => Fix ?

> read_count. This will generate more expensive code on a number of
> platforms, which might cause a performance regression for some of the
> other percpu-rwsem users.
> 
> If any such is reported, we can consider alternative solutions.
> 
> Fixes: 70fe2f48152e ("aio: fix freeze protection of aio writes")
> Signed-off-by: Hou Tao 
> Signed-off-by: Peter Zijlstra (Intel) 
> Link: https://lkml.kernel.org/r/20200915140750.137881-1-hout...@huawei.com
> ---
>  include/linux/percpu-rwsem.h  |    8 ++++----
>  kernel/locking/percpu-rwsem.c |    4 ++--
>  2 files changed, 6 insertions(+), 6 deletions(-)

For the patch:

Acked-by: Will Deacon 

Will


Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count

2020-09-15 Thread peterz
On Tue, Sep 15, 2020 at 05:51:50PM +0200, pet...@infradead.org wrote:

> Anyway, I'll rewrite the Changelog and stuff it in locking/urgent.

How's this?

---
Subject: locking/percpu-rwsem: Use this_cpu_{inc,dec}() for read_count
From: Hou Tao 
Date: Tue, 15 Sep 2020 22:07:50 +0800

From: Hou Tao 

The __this_cpu*() accessors are (in general) IRQ-unsafe which, given
that percpu-rwsem is a blocking primitive, should be just fine.

However, file_end_write() is used from IRQ context and will cause
load-store issues.

Fixing it by using the IRQ-safe this_cpu_*() for operations on
read_count. This will generate more expensive code on a number of
platforms, which might cause a performance regression for some of the
other percpu-rwsem users.

If any such is reported, we can consider alternative solutions.

Fixes: 70fe2f48152e ("aio: fix freeze protection of aio writes")
Signed-off-by: Hou Tao 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20200915140750.137881-1-hout...@huawei.com
---
 include/linux/percpu-rwsem.h  |    8 ++++----
 kernel/locking/percpu-rwsem.c |    4 ++--
 2 files changed, 6 insertions(+), 6 deletions(-)

--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -60,7 +60,7 @@ static inline void percpu_down_read(stru
 	 * anything we did within this RCU-sched read-size critical section.
 	 */
 	if (likely(rcu_sync_is_idle(&sem->rss)))
-		__this_cpu_inc(*sem->read_count);
+		this_cpu_inc(*sem->read_count);
 	else
 		__percpu_down_read(sem, false); /* Unconditional memory barrier */
 	/*
@@ -79,7 +79,7 @@ static inline bool percpu_down_read_tryl
 	 * Same as in percpu_down_read().
 	 */
 	if (likely(rcu_sync_is_idle(&sem->rss)))
-		__this_cpu_inc(*sem->read_count);
+		this_cpu_inc(*sem->read_count);
 	else
 		ret = __percpu_down_read(sem, true); /* Unconditional memory barrier */
 	preempt_enable();
@@ -103,7 +103,7 @@ static inline void percpu_up_read(struct
 	 * Same as in percpu_down_read().
 	 */
 	if (likely(rcu_sync_is_idle(&sem->rss))) {
-		__this_cpu_dec(*sem->read_count);
+		this_cpu_dec(*sem->read_count);
 	} else {
 		/*
 		 * slowpath; reader will only ever wake a single blocked
@@ -115,7 +115,7 @@ static inline void percpu_up_read(struct
 		 * aggregate zero, as that is the only time it matters) they
 		 * will also see our critical section.
 		 */
-		__this_cpu_dec(*sem->read_count);
+		this_cpu_dec(*sem->read_count);
 		rcuwait_wake_up(&sem->writer);
 	}
 	preempt_enable();
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -45,7 +45,7 @@ EXPORT_SYMBOL_GPL(percpu_free_rwsem);
 
 static bool __percpu_down_read_trylock(struct percpu_rw_semaphore *sem)
 {
-	__this_cpu_inc(*sem->read_count);
+	this_cpu_inc(*sem->read_count);
 
 	/*
 	 * Due to having preemption disabled the decrement happens on
@@ -71,7 +71,7 @@ static bool __percpu_down_read_trylock(s
 	if (likely(!atomic_read_acquire(&sem->block)))
 		return true;
 
-	__this_cpu_dec(*sem->read_count);
+	this_cpu_dec(*sem->read_count);
 
 	/* Prod writer to re-evaluate readers_active_check() */
 	rcuwait_wake_up(&sem->writer);


Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count

2020-09-15 Thread Oleg Nesterov
On 09/15, Peter Zijlstra wrote:
>
> On Tue, Sep 15, 2020 at 05:51:50PM +0200, pet...@infradead.org wrote:
>
> > Anyway, I'll rewrite the Changelog and stuff it in locking/urgent.
>
> How's this?

Thanks Peter,

Acked-by: Oleg Nesterov 



Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count

2020-09-15 Thread peterz
On Tue, Sep 15, 2020 at 05:11:23PM +0100, Will Deacon wrote:
> On Tue, Sep 15, 2020 at 06:03:44PM +0200, pet...@infradead.org wrote:
> > On Tue, Sep 15, 2020 at 05:51:50PM +0200, pet...@infradead.org wrote:
> > 
> > > Anyway, I'll rewrite the Changelog and stuff it in locking/urgent.
> > 
> > How's this?
> > 
> > ---
> > Subject: locking/percpu-rwsem: Use this_cpu_{inc,dec}() for read_count
> > From: Hou Tao 
> > Date: Tue, 15 Sep 2020 22:07:50 +0800
> > 
> > From: Hou Tao 
> > 
> > The __this_cpu*() accessors are (in general) IRQ-unsafe which, given
> > that percpu-rwsem is a blocking primitive, should be just fine.
> > 
> > However, file_end_write() is used from IRQ context and will cause
> > load-store issues.
> 
> ... on architectures where the per-cpu accessors are not atomic.

That's not entirely accurate, on x86 for example the per-cpu ops are not
atomic, but they are not susceptible to this problem due to them being a
single instruction from the point of interrupts -- either they wholly
happen or they don't.

So I'd reformulate it like: "... on architectures where the per-cpu
accessors are not natively irq-safe" ?





Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count

2020-09-15 Thread peterz
On Tue, Sep 15, 2020 at 05:31:14PM +0200, Oleg Nesterov wrote:

> > So yeah, fs/super totally abuses percpu_rwsem, and yes, using it from
> > IRQ context is totally out of spec. That said, we've (grudgingly)
> > accomodated them before.
> 
> Yes, I didn't expect percpu_up_ can be called from IRQ :/

Yeah, me neither. That's well out of spec for a blocking primitive in
general.

> > This seems to be a fairly long standing issue, and certainly not unique
> > to ARM64 either (Power, and anyone else using asm-generic/percpu.h,
> > should be similarly affected I think). The issue seems to stem from
> > Oleg's original rewrite:
> >
> >   a1fd3e24d8a4 ("percpu_rw_semaphore: reimplement to not block the readers unnecessarily")
> 
> Not really... I think it was 70fe2f48152e ("aio: fix freeze protection of aio writes").

Ah, that came later? Fair enough, I'll change the Fixes line.

> And iiuc io_uring does the same.

Indeed, I just went through a bunch of the file_end_write() callers.

> > and is certainly an understandable mistake.
> >
> > I'm torn on what to do, using this_cpu over __this_cpu is going to
> > adversely affect code-gen (and possibly performance) for all the
> > percpu-rwsem users that are not quite so 'creative'.
> 
> Yes, but what else can we do?

Well, I just talked about it with Will, there's a bunch of things we
could do, but they're all quite ugly.

My leading alternative was adding: percpu_down_read_irqsafe() /
percpu_up_read_irqsafe(), which use local_irq_save() instead of
preempt_disable().

But blergh.. Will also argued that by going with this patch, we'll get
an affected workload when someone reports a performance regression,
which I suppose is a bonus.

Anyway, I'll rewrite the Changelog and stuff it in locking/urgent.


Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count

2020-09-15 Thread peterz
On Tue, Sep 15, 2020 at 10:07:50PM +0800, Hou Tao wrote:
> Under aarch64, __this_cpu_inc() is neither IRQ-safe nor atomic, so
> when percpu_up_read() is invoked under IRQ context (e.g. aio completion),
> and it interrupts the process on the same CPU which is invoking
> percpu_down_read(), the decrement of read_count may be lost and
> the final value of read_count on the CPU will be unexpected
> as shown below:
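The interleaving that "as shown below" refers to did not survive the
archive; the following reconstructs the lost-update scenario described
above, with hypothetical register names:

	CPU 0, task context                CPU 0, IRQ context
	-------------------                ------------------
	percpu_down_read():
	  /* __this_cpu_inc() is a plain load/add/store */
	  load  r0 = read_count   /* r0 == n */
	                                   aio completion:
	                                     percpu_up_read():
	                                       load  r1 = read_count  /* n */
	                                       sub   r1 = r1 - 1
	                                       store read_count = r1  /* n - 1 */
	  add   r0 = r0 + 1
	  store read_count = r0   /* n + 1: the IRQ's decrement is lost */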

> Fixing it by using the IRQ-safe helper this_cpu_inc|dec() for
> operations on read_count.
> 
> Another plausible fix is to state that percpu-rwsem can NOT be
> used under IRQ context and convert all users which may
> use it under IRQ context.

*groan*...

So yeah, fs/super totally abuses percpu_rwsem, and yes, using it from
IRQ context is totally out of spec. That said, we've (grudgingly)
accomodated them before.

This seems to be a fairly long standing issue, and certainly not unique
to ARM64 either (Power, and anyone else using asm-generic/percpu.h,
should be similarly affected I think). The issue seems to stem from
Oleg's original rewrite:

  a1fd3e24d8a4 ("percpu_rw_semaphore: reimplement to not block the readers unnecessarily")

and is certainly an understandable mistake.

I'm torn on what to do, using this_cpu over __this_cpu is going to
adversely affect code-gen (and possibly performance) for all the
percpu-rwsem users that are not quite so 'creative'.
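For context on the code-gen trade-off being weighed here, the arm64
expansions are roughly as follows (sketched from arch/arm64/include/asm/percpu.h
after 959bf2fd03b5; illustrative only, not quoted from the thread):

	/* __this_cpu_inc(): plain, IRQ-unsafe read-modify-write */
	mrs	x1, tpidr_el1		// per-CPU offset
	ldr	w2, [x0, x1]
	add	w2, w2, #1
	str	w2, [x0, x1]

	/* this_cpu_inc() without LSE: ll/sc retry loop */
1:	ldxr	w2, [x0]		// x0 = per-CPU address
	add	w2, w2, #1
	stxr	w3, w2, [x0]
	cbnz	w3, 1b

	/* this_cpu_inc() with LSE: a single atomic memory operation */
	mov	w2, #1
	stadd	w2, [x0]

This lines up with the benchmark rows above: the plain RMW is fastest, the
ll/sc loop slower, and on this particular machine the LSE atomic slower
still.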